JP5577546B2

JP5577546B2 - Computer system

Info

Publication number: JP5577546B2
Application number: JP2010110912A
Authority: JP
Inventors: 浩彦佐川; 康嗣森本; 義行小林
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-05-13
Filing date: 2010-05-13
Publication date: 2014-08-27
Anticipated expiration: 2030-05-13
Also published as: JP2011238159A

Description

The present invention relates to a computer system, and more particularly to a computer system that extracts information from documents.

Techniques that determine whether content in a document is important and extract the important parts are one kind of technique for judging document content according to preset determination rules, and have been proposed in the past (see, for example, Patent Document 1 and Patent Document 2).

Patent Document 1 converts presentation conditions and collation conditions entered by a user into knowledge rules expressed as a directed graph, and uses these knowledge rules as determination rules. A collation condition, which is the condition for the determination, is described by the relationship between a word corresponding to a topic and a word corresponding to an element. When collating a document to be judged against the knowledge rules, the apparatus of Patent Document 1 searches the target document for a combination of words whose positions of appearance match the direction and hierarchical structure of the directed graph and whose link length is shorter.

Patent Document 2 extracts feature vectors from training data labeled in advance as important sentences (positive examples) or non-important sentences (negative examples), and uses a classifier based on a support vector machine, a statistical method, to generate a separating plane that classifies the feature vectors into positive and negative examples. In the technique of Patent Document 2, the separating plane corresponds to the determination rule. A feature vector is also extracted from a new document and compared with the generated separating plane to determine whether each sentence is important or non-important. The feature quantities extracted from the training data and from new documents, that is, the elements of the feature vector, include the length of a sentence, the position at which the sentence appears, and the presence or absence of keywords in the sentence.

JP 2007-193500 A
JP 2003-036262 A

Techniques for judging document content such as those of Patent Document 1 and Patent Document 2 can extract important sentences accurately, provided that the training data, or the presentation conditions and collation conditions, on which the determination criteria are based can be prepared sufficiently in advance. In Patent Document 1, however, the presentation conditions and collation conditions are assumed to be entered manually.

Patent Document 2 does not specifically disclose how the training data is created, but training data is usually assumed to be created manually. It can therefore be said that Patent Document 2 likewise presupposes manually created training data.

A method in which the user explicitly registers determination criteria in the system in advance, as in the technique disclosed in Patent Document 1, can be expected to judge accurately when a small number of determination rules suffices. However, when the documents to be judged grow in scale and many complex determination criteria are required, entering the presentation conditions and collation conditions manually becomes difficult. In addition, when many complex criteria are required, it also becomes difficult to create and maintain determination rules that are consistent and free of contradictions.

The technique disclosed in Patent Document 2, on the other hand, obtains the separating plane that serves as the determination rule by statistically processing training data labeled as important or non-important sentences. Compared with Patent Document 1, this is expected to make it easier to construct consistent determination rules.

However, constructing highly accurate determination rules by a statistical method requires collecting a large amount of data and attaching to the collected data a label indicating whether each item is an important or a non-important sentence. For small-scale data the labels can easily be attached by hand, but for large amounts of data the manual work takes so much time that using the method at all becomes difficult.

When collating a document to be judged against the determination rules, the system of Patent Document 1 checks whether words in the document match words in the determination rules. Because the system of Patent Document 1 uses a synonym dictionary, it can also handle spelling variations in the words being collated.

In Patent Document 2, on the other hand, information on the presence or absence of each word in the document to be judged is used as an element of the feature vector. Documents, however, usually also contain many descriptions involving numerical values. For numerical values, collation must be based not on whether the values are identical but on their magnitude or range, as in "X or more", "X or less", or "from X to Y", and determination rules for numerical values that support such collation must be constructed.

The technique of Patent Document 2 supports the construction of determination rules based on word notation, that is, on whether the character strings representing words match, but it does not support the construction of determination rules that take the magnitude or range of numerical values into account. Consequently, determination accuracy falls when the magnitude or range of numerical values has to be considered.

Making determination rules handle numerical values, and constructing such rules by hand, is technically easy. When large-scale documents are targeted, however, it becomes difficult, as described above, to efficiently construct determination rules that are consistent and highly accurate.

An object of the present invention is to provide a system that efficiently generates, from a large number of documents, the determination rules used by a system that extracts parts judged to be important from a document, and that easily generates highly accurate determination rules that take the magnitude and range of numerical values into account.

A representative example of the present invention is as follows. It is a computer system comprising a processor that performs arithmetic processing and a storage device connected to the processor, in which the processor analyzes documents. Each document includes a plurality of elements, each of which contains a plurality of words and constitutes text; the elements include sentences or paragraphs. The processor receives a plurality of first documents, a reference document containing references to the first documents, and a second document in which important parts are to be determined. The processor extracts the elements from each first document and extracts, from the reference document, the locations that refer to the first documents as reference information. Based on a similarity calculated between the elements extracted from each first document and the reference information, the processor divides the elements extracted from each first document into first elements, which are similar to the reference information and are regarded as important parts, and second elements, which are not similar to the reference information and are regarded as non-important parts. The processor acquires a first feature quantity of each document based on the words included in the first elements and the second elements, and generates, based on the acquired first feature quantity, a determination rule for determining whether an important part is included. The processor then extracts the elements from the second document, acquires a second feature quantity based on the words included in those elements, and classifies the elements extracted from the second document into important parts and non-important parts by comparing the generated determination rule with the acquired second feature quantity.

According to an embodiment of the present invention, determination rules for extracting parts judged to be important from a document are generated efficiently.

Block diagram showing the configuration of the important part determination system according to the first embodiment of the present invention.
Block diagram showing the detailed configuration of the document classification unit of the first embodiment of the present invention.
Explanatory diagram showing an example of the document division rule of the first embodiment of the present invention.
Block diagram showing the detailed configuration of the teacher data generation unit of the first embodiment of the present invention.
Explanatory diagram showing an example of the attribute names and units contained in the attribute extraction rule of the first embodiment of the present invention.
Explanatory diagram showing an example of the patterns contained in the attribute extraction rule of the first embodiment of the present invention.
Block diagram showing another configuration of the teacher data generation unit of the first embodiment of the present invention.
Block diagram showing details of the determination data generation unit of the first embodiment of the present invention.
Configuration diagram of the important part determination system of the first embodiment of the present invention when implemented on a computer.
Block diagram showing the configuration of the important part determination system of the second embodiment of the present invention.
Block diagram showing the detailed configuration of the teacher data generation unit of the second embodiment of the present invention.
Block diagram showing details of the determination rule generation unit of the second embodiment of the present invention.
Block diagram showing details of the determination data generation unit of the second embodiment of the present invention.
Block diagram showing details of the determination processing unit of the second embodiment of the present invention.
Block diagram showing the configuration of the important part determination system of the third embodiment of the present invention.
Explanatory diagram showing a display example of the determination result of the third embodiment of the present invention.
Explanatory diagram showing another display example of the determination result of the third embodiment of the present invention.

(First Embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 to 8.

FIG. 1 is a block diagram showing the configuration of the important part determination system according to the first embodiment of the present invention.

The important part determination system of the present invention includes a case document 101, a reference information description document 102, a document classification unit 103, a teacher data generation unit 104, a determination rule generation unit 105, a determination rule 106, a determination target document 107, a document division unit 108, a document division rule 109, a determination data generation unit 110, and a determination processing unit 111. The case document 101, the reference information description document 102, the determination rule 106, the determination target document 107, and the document division rule 109 are sets of data. The document classification unit 103, the teacher data generation unit 104, the determination rule generation unit 105, the document division unit 108, the determination data generation unit 110, and the determination processing unit 111 are programs.

The case document 101 holds past documents as cases, such as documents similar to the document in which important parts are to be determined (the determination target document 107), documents in the same field as the determination target document 107, and documents containing content related to the determination target document 107. For example, among documents similar to the determination target document 107, those accumulated in the past are stored in the case document 101.

The reference information description document 102 holds documents that contain descriptions referring to specific parts of the case document 101. For example, a document in which a user has indicated that a specific attribute in the case document 101 is important is stored in the reference information description document 102.

A description in the reference information description document 102 that refers to the case document 101 contains an appropriate combination of information such as the document name, chapter number, heading, page number, paragraph number, column number, and line number of the relevant case document 101, and all or part of the text of the relevant part.

The document classification unit 103 divides the case document 101 into the units in which important parts are determined (case document elements). It then classifies each case document element either as an important element, that is, a part referenced from the reference information description document 102 and therefore an important part, or as a non-important element otherwise.

The teacher data generation unit 104 converts the important elements and non-important elements classified by the document classification unit 103 into teacher data. The teacher data is used to generate the determination rule 106.

The determination rule generation unit 105 generates the determination rule 106 from the teacher data generated by the teacher data generation unit 104, using a commonly used machine learning technique.

The determination target document 107 holds the documents in which important parts are to be determined using the determination rule 106. The document division unit 108 uses the document division rule 109 to divide the determination target document 107 into the units to be judged (determination document elements). The document division rule 109 holds information on the structure of the text, such as chapters, paragraphs, or table formats.

The determination data generation unit 110 generates determination data by converting each determination document element into a data format that the determination processing unit 111 can use. In general, the machine learning technique used to generate the determination rule 106 and the technique used in the determination processing form a pair, so the data format output by the determination data generation unit 110 is the same as the data format generated by the teacher data generation unit 104.

The determination processing unit 111 applies the determination rule 106 to the determination data, determines whether each item of determination data is an important element or a non-important element, and outputs the result as a determination result 112.
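The flow from teacher data to determination result can be sketched as follows. This is only an illustrative sketch under stated assumptions: the description says a commonly used machine learning technique is applied but does not name one here, so a simple perceptron stands in for it, and the word-presence feature vector, the function names, and the sample tokens are all hypothetical. The point it illustrates is that teacher data and determination data go through the same conversion (`to_vector`) so that the learned rule can be applied to new elements.

```python
from typing import Dict, List, Sequence, Tuple


def build_vocabulary(token_lists: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Assign one vector index to each distinct word seen in the teacher data."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_vector(tokens: Sequence[str], vocab: Dict[str, int]) -> List[float]:
    """Word-presence vector; used for both teacher data and determination data."""
    vec = [0.0] * len(vocab)
    for token in tokens:
        index = vocab.get(token)
        if index is not None:  # words unseen during training are ignored
            vec[index] = 1.0
    return vec


def train_perceptron(vectors: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     epochs: int = 20) -> Tuple[List[float], float]:
    """Stand-in learner (assumption): labels are +1 (important) or -1 (non-important)."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: move the boundary toward x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def is_important(vector: Sequence[float], weights: Sequence[float], bias: float) -> bool:
    """Apply the learned rule to one item of determination data."""
    return sum(w * xi for w, xi in zip(weights, vector)) + bias > 0


# Hypothetical, already tokenized case document elements.
teacher_tokens = [
    ["memory", "capacity", "5MB"],     # important element
    ["hard", "disk", "1TB"],           # important element
    ["chapter", "overview"],           # non-important element
    ["this", "document", "describes"]  # non-important element
]
teacher_labels = [1, 1, -1, -1]

vocab = build_vocabulary(teacher_tokens)
rule = train_perceptron([to_vector(t, vocab) for t in teacher_tokens], teacher_labels)
print(is_important(to_vector(["memory", "slot"], vocab), *rule))
print(is_important(to_vector(["chapter", "summary"], vocab), *rule))
```

Any learner that consumes fixed-length vectors could replace `train_perceptron` without changing the two conversion functions.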

FIG. 2 is a block diagram showing the detailed configuration of the document classification unit 103 of the first embodiment of the present invention.

The document classification unit 103 includes a document division unit 201, a reference information extraction unit 203, a matching unit 205, a classification processing unit 206, a document division rule 202, and a reference information extraction rule 204. The document division unit 201, the reference information extraction unit 203, the matching unit 205, and the classification processing unit 206 are programs, and the document division rule 202 and the reference information extraction rule 204 are storage areas such as databases.

The document division unit 201 shown in FIG. 2 uses the document division rule 202 to divide the case document into the units on which the determination processing is performed (case document elements). The document division unit 201 may be the same as the document division unit 108 shown in FIG. 1, and the document division rule 202 may be the same as the document division rule 109 shown in FIG. 1. The document division rule 202 holds information on the structure of the text, such as chapter headings, paragraphs, or table formats.

An example of the document division rule 202 is shown below.

FIG. 3 is an explanatory diagram showing an example of the document division rule 202 of the first embodiment of the present invention.

The document division rule 202 shown in FIG. 3 expresses information on the structure of the text by patterns that include chapter headings or table titles. The document division rule 202 includes a pattern 301, a hierarchy level 302, and a content 303. The document division rule 202 shown in FIG. 3 also contains entries 304 to 307, one per row.

The pattern 301 indicates the pattern of a chapter heading or table title written in a document. When the description indicated by the pattern 301 belongs to a hierarchical structure, such as Chapter 1, Section 2, and so on in chapter headings, the hierarchy level 302 indicates the depth within that structure. A larger value of the hierarchy level 302 indicates a deeper level.

The content 303 describes the item indicated by the pattern 301. The content 303 is information that makes the patterns 301 easier to manage and is not otherwise used in the processing by the document division unit 201.

In the pattern 301 column shown in FIG. 3, "N" indicates that an arbitrary number is written, "A" indicates that an arbitrary character such as a letter of the alphabet is written, and "*" indicates that an arbitrary character string is written. The document classification unit 103 compares the pattern 301 with the content of each case document 101 and extracts the parts that match the pattern 301.

For example, using the pattern 301 of the entry 304, the document classification unit 103 extracts character strings such as "Chapter 1 General Remarks" and "Chapter 2 Configuration" from the case document 101, and using the pattern 301 of the entry 305 it extracts character strings such as "1.1 Overview" and "2.1 Specifications". Likewise, the pattern 301 of the entry 306 extracts character strings such as "(a) Memory capacity is 5 MB" and "(b) Hard disk is 1 TB", and the pattern 301 of the entry 307 extracts character strings such as "・Two memory slots".
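The pattern notation described above ("N" for a number, "A" for a letter, "*" for any string) can be turned into regular expressions along the following lines. This is a minimal sketch, not the patented implementation: the concrete pattern strings of FIG. 3 are not reproduced in the text, so the English patterns below are hypothetical stand-ins, and the function name `pattern_to_regex` is an assumption.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate the N / A / * notation into a compiled regular expression.

    Assumption: every 'N', 'A', and '*' in the pattern is a placeholder, so
    literal text in a pattern must not contain those three characters.
    """
    parts = []
    for ch in pattern:
        if ch == "N":
            parts.append(r"\d+")        # an arbitrary number
        elif ch == "A":
            parts.append(r"[A-Za-z]")   # an arbitrary alphabetic character
        elif ch == "*":
            parts.append(r".*")         # an arbitrary character string
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("^" + "".join(parts) + "$")


# Hypothetical stand-ins for three of the entries of FIG. 3.
chapter = pattern_to_regex("Chapter N *")
section = pattern_to_regex("N.N *")
item = pattern_to_regex("(A) *")

for line in ["Chapter 2 Configuration", "2.1 Specifications",
             "(a) Memory capacity is 5 MB", "Plain body text."]:
    matched = [name for name, rx in
               [("chapter", chapter), ("section", section), ("item", item)]
               if rx.match(line)]
    print(line, "->", matched)
```

Anchoring each expression with `^` and `$` means a pattern has to describe the whole line, which keeps a section pattern from matching a number buried in body text.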

After the parts of the case document 101 that match the document division rule 202 have been extracted, if body text exists immediately after an extracted part, the document division unit 201 divides the case document 101 so that the extracted part and the body text immediately following it form a single case document element. If no body text follows the extracted part, the document division unit 201 divides the case document 101 so that the extracted part alone is a case document element.

When the case document 101 is divided in this way and a case document element lies at a lower level of the hierarchical structure indicated by chapter headings and the like, the document division unit 201 acquires the higher-level chapter headings and the like, that is, those that matched a pattern 301 for which a hierarchy level 302 is stored, as context information, and acquires the lower-level item as the case document element. It then attaches the acquired context information to the acquired case document element.

For example, consider a case in which the case document 101 contains the following hierarchical structure.

"Chapter 2 Configuration
…
2.2 Specifications
(a) Memory capacity is 5 MB
(b) Hard disk is 1 TB
…" ... Case document 101

When the document division rule 202 shown in FIG. 3 is applied to the above example of the case document 101, the item on every line is determined to match a pattern 301, and for every line it is determined that no body text follows immediately after the matched part, so the following two case document elements are generated.

"Chapter 2 Configuration ... context information
2.2 Specifications ... context information
(a) Memory capacity is 5 MB" ... case document element
"Chapter 2 Configuration ... context information
2.2 Specifications ... context information
(b) Hard disk is 1 TB" ... case document element

The two items above are extracted as case document elements. Here, "Chapter 2 Configuration" and "2.2 Specifications" are attached as context information because a value is stored in the hierarchy level 302 of the pattern 301 that each of them matches. Besides chapter headings and the like, the context information may include the document name, page number, paragraph number, line number, and so on.
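The division with context information described above can be sketched as follows. This is an illustration under assumptions rather than the actual implementation: the regular expressions and hierarchy levels in `RULES` are hypothetical stand-ins for the entries of FIG. 3, items without a hierarchy level are treated as the lowest-level parts, and body text that appears before any matched part is simply skipped.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Element(NamedTuple):
    context: Tuple[str, ...]  # higher-level headings attached as context information
    text: str                 # matched part, plus any body text that follows it


# (regex, hierarchy level or None): hypothetical stand-ins for FIG. 3.
RULES = [
    (re.compile(r"^Chapter \d+ .*$"), 1),
    (re.compile(r"^\d+\.\d+ .*$"), 2),
    (re.compile(r"^\([A-Za-z]\) .*$"), None),
]


def match_rule(line: str) -> Tuple[bool, Optional[int]]:
    """Return (matched, hierarchy level) for the first rule that matches."""
    for regex, level in RULES:
        if regex.match(line):
            return True, level
    return False, None


def split_into_elements(lines: List[str]) -> List[Element]:
    headings: Dict[int, str] = {}  # hierarchy level -> current heading
    elements: List[Element] = []
    last: Optional[Tuple[str, int]] = None  # ("heading", level) or ("element", index)
    for line in lines:
        matched, level = match_rule(line)
        if matched and level is not None:
            # A heading replaces any heading at the same or a deeper level.
            headings = {lv: h for lv, h in headings.items() if lv < level}
            headings[level] = line
            last = ("heading", level)
        elif matched:
            # An item with no hierarchy level becomes an element of its own.
            context = tuple(headings[lv] for lv in sorted(headings))
            elements.append(Element(context, line))
            last = ("element", len(elements) - 1)
        elif last is not None and last[0] == "heading":
            # Body text right after a heading: heading and body form one element.
            lvl = last[1]
            context = tuple(headings[lv] for lv in sorted(headings) if lv < lvl)
            elements.append(Element(context, headings[lvl] + "\n" + line))
            last = ("element", len(elements) - 1)
        elif last is not None:
            # Further body text is appended to the element it follows.
            index = last[1]
            elements[index] = Element(elements[index].context,
                                      elements[index].text + "\n" + line)
    return elements


document = [
    "Chapter 2 Configuration",
    "2.2 Specifications",
    "(a) Memory capacity is 5 MB",
    "(b) Hard disk is 1 TB",
]
for element in split_into_elements(document):
    print(element)
```

Keeping the headings in a dictionary keyed by level makes it straightforward to discard stale deeper headings when a new chapter begins.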

When the body text immediately following an extracted part is divided into a plurality of paragraphs, a case document element may be generated for each paragraph as a set consisting of the context information, the part extracted immediately before, and the paragraph. A case document element may also be generated for each sentence in the body text.

Besides character strings, the pattern 301 of the document division rule 202 may store any rule concerning the format in which the case document 101 is written, such as the type of font used, the character size, the presence or absence of underlining, or the presence or absence of indentation.

The reference information extraction unit 203 shown in FIG. 2 extracts, from the reference information description document 102, descriptions that refer to specific parts of the case document 101, based on the content of the reference information extraction rule 204. Like the document division rule 202, the reference information extraction rule 204 stores patterns specific to descriptions of referenced locations, in the same manner as the pattern 301 shown in FIG. 3.

The reference information extraction unit 203 extracts reference information from the reference information description document 102 by searching it for parts that match those patterns. In other words, the reference information extracted by the reference information extraction unit 203 is information obtained by converting the documents held in the reference information description document 102 into a form that can be compared with the case document elements generated by the document division unit 201.

The reference information extracted by the reference information extraction unit 203 includes a combination of items such as the document name, chapter number, chapter heading, page number, paragraph number, column number, and line number of the relevant case document 101, and all or part of the text of the relevant part. Which items the reference information contains depends on the content of the reference information description document 102.

The matching unit 205 matches each case document element extracted by the document division unit 201 against each item of reference information extracted by the reference information extraction unit 203. The matching unit 205 performs the matching based on whether a corresponding item exists and whether the contents of the corresponding items agree.

When the reference information contains only part of the text of the relevant part, the matching unit 205 can flexibly match the content of the reference information against the corresponding text in the case document element by using a code matching technique based on dynamic programming (see, for example, JP 2002-221984 A).

次に、マッチング部２０５は、マッチングの結果に基づいて、参照情報と事例文書要素との一致度を算出する。一致度は、マッチングした参照情報と事例文書要素とのうち、一致した数によって算出される。 Next, the matching unit 205 calculates the matching degree between the reference information and the case document element based on the matching result. The matching degree is calculated from the number of items that matched between the reference information and the case document element.

また、動的計画法による技術を用いた場合、マッチング部２０５は、参照情報と事例文書要素との距離を算出できる。このため、算出された距離の逆数を求めることによって、距離が小さいほど大きい値が得られる関数を求め、この関数によって算出された値を一致度としてもよい。 When the dynamic programming technique is used, the matching unit 205 can also calculate the distance between the reference information and the case document element. A function that yields a larger value as the distance becomes smaller, such as the reciprocal of the calculated distance, may therefore be defined, and the value of that function may be used as the matching degree.
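
The distance-based matching degree can be illustrated with a short Python sketch. It is not the technique of JP-A-2002-221984: plain Levenshtein distance is used here as a stand-in for the dynamic programming distance, and the function names and the `1 / (1 + distance)` form (chosen so that a distance of zero does not divide by zero) are assumptions made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def matching_degree(reference: str, element: str) -> float:
    """Larger when the distance is smaller (assumed form: 1 / (1 + distance))."""
    return 1.0 / (1.0 + edit_distance(reference, element))
```

Identical strings give a matching degree of 1.0, and the value decreases toward 0 as the two texts diverge.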

分類処理部２０６は、マッチングの結果、各参照情報に対して最も一致度が高い事例文書要素を重要箇所である事例文書要素（重要要素）として分類し、どの参照情報にも対応せず、一致度が低い事例文書要素を、重要箇所ではない事例文書要素（非重要要素）として分類する。そして、分類された重要要素及び非重要要素（分類結果）を、教師データ生成部１０４に送る。 Based on the matching result, the classification processing unit 206 classifies, for each piece of reference information, the case document element with the highest matching degree as a case document element that is an important part (important element). It classifies case document elements that correspond to no reference information and have a low matching degree as case document elements that are not important parts (non-important elements). It then sends the classified important and non-important elements (the classification result) to the teacher data generation unit 104.
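
The classification step above might be sketched in Python as follows. The nested-dictionary layout of the matching degrees, the function name `classify`, and the `low_threshold` value that decides what counts as a "low" matching degree are all assumptions for illustration; the description does not specify them.

```python
def classify(degrees, low_threshold=0.2):
    """Split case document elements into important and non-important sets.

    degrees maps each reference-information id to a dict of
    {element id: matching degree} (hypothetical layout).
    """
    important = set()
    for scores in degrees.values():
        if scores:
            # The best-matching element for this reference information.
            important.add(max(scores, key=scores.get))

    all_elements = {e for scores in degrees.values() for e in scores}
    non_important = {
        e for e in all_elements - important
        # Low matching degree against every piece of reference information.
        if all(scores.get(e, 0.0) <= low_threshold for scores in degrees.values())
    }
    return important, non_important
```

Elements that are neither the best match for any reference information nor below the threshold are left unclassified in this sketch.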

図４は、本発明の第１の実施形態の教師データ生成部１０４の詳細な構成を示すブロック図である。 FIG. 4 is a block diagram illustrating a detailed configuration of the teacher data generation unit 104 according to the first embodiment of this invention.

教師データ生成部１０４は、単語分割部４０１、属性情報抽出部４０２、属性抽出ルール４０３、単語集計部４０４、単語リスト４０５及びデータ変換部４０６を備える。単語分割部４０１、属性情報抽出部４０２、単語集計部４０４及びデータ変換部４０６は、プログラムである。属性抽出ルール４０３及び単語リスト４０５は、データベースなどの記憶領域である。 The teacher data generation unit 104 includes a word division unit 401, an attribute information extraction unit 402, an attribute extraction rule 403, a word totaling unit 404, a word list 405, and a data conversion unit 406. The word division unit 401, the attribute information extraction unit 402, the word totaling unit 404, and the data conversion unit 406 are programs. The attribute extraction rule 403 and the word list 405 are storage areas such as a database.

なお、本実施形態において、判定ルール生成部１０５における処理及び判定処理には、サポートベクターマシン（例えば、特開２００３−３６２６２号参照）を用いるが、数値データを教師データとして利用する機械学習技術であれば、特に制限無く、いずれの機械学習技術でも用いてよい。 In the present embodiment, a support vector machine (see, for example, Japanese Patent Application Laid-Open No. 2003-36262) is used for the processing in the determination rule generation unit 105 and for the determination processing. However, any machine learning technique that uses numerical data as teacher data may be used without particular limitation.

図４における単語分割部４０１は、文書分類部１０３から入力される重要要素及び非重要要素に含まれる文章を、単語に分割する。文章を単語に分割する技術には、自然言語処理又は機械翻訳の分野において一般的に用いられる形態素解析技術（例えば、「岩波講座ソフトウェア科学（１５）自然言語処理」、岩波書店、１９９６年）を用いてもよい。 The word dividing unit 401 in FIG. 4 divides the sentences included in the important and non-important elements input from the document classification unit 103 into words. To divide a sentence into words, a morphological analysis technique generally used in the field of natural language processing or machine translation (for example, “Iwanami Course Software Science (15) Natural Language Processing”, Iwanami Shoten, 1996) may be used.

また、対象文書が英語などの言語によって記述され、あらかじめ付された空白によって対象文書の中の単語を区切ることができる場合、単語分割部４０１は、文章中から空白を抽出することによって、文章を単語に分割してもよい。 When the target document is written in a language such as English, in which the words are already separated by spaces, the word dividing unit 401 may divide a sentence into words by detecting the spaces in it.

単語に分割された重要要素及び非重要要素中を含む文書は、単語集計部４０４に送られると共に、属性情報抽出部４０２に送られる。属性情報抽出部４０２は、属性抽出ルール４０３に格納されるルールに基づいて、属性名及び属性値を含む属性情報を、重要要素及び非重要要素から抽出する。 A document including important elements and non-important elements divided into words is sent to the word totaling unit 404 and also sent to the attribute information extracting unit 402. Based on the rules stored in the attribute extraction rule 403, the attribute information extraction unit 402 extracts attribute information including attribute names and attribute values from important elements and non-important elements.

属性抽出ルール４０３の内容を図５Ａ及び図５Ｂを用いて説明する。図５Ａ及び図５Ｂは、属性抽出ルール４０３に含まれるルールを示す。 The contents of the attribute extraction rule 403 will be described with reference to FIGS. 5A and 5B. FIGS. 5A and 5B show the rules included in the attribute extraction rule 403.

図５Ａは、本発明の第１の実施形態の属性抽出ルール４０３に含まれる属性名５０１及び単位５０２の例を示す説明図である。 FIG. 5A is an explanatory diagram illustrating an example of attribute names 501 and units 502 included in the attribute extraction rule 403 according to the first embodiment of this invention.

属性抽出ルール４０３は、重要要素及び非重要要素から属性情報を抽出するために、属性情報抽出部４０２によって用いられるルールである。属性抽出ルール４０３は、属性名５０１、単位５０２及びパターン５０４を含む。属性名５０１は、各文書における属性名の表記を示し、単位５０２は、属性値と共に記述される属性値の単位の表記を示す。 The attribute extraction rule 403 is a rule used by the attribute information extraction unit 402 to extract attribute information from important elements and non-important elements. The attribute extraction rule 403 includes an attribute name 501, a unit 502, and a pattern 504. The attribute name 501 indicates the notation of the attribute name in each document, and the unit 502 indicates the notation of the attribute value unit described together with the attribute value.

なお、属性抽出ルール４０３は、事例文書１０１等に含まれる属性名、単位及びパターンに基づいて、あらかじめ生成されている。 Note that the attribute extraction rule 403 is generated in advance based on the attribute name, unit, and pattern included in the case document 101 and the like.

図５Ａにおいて、属性名５０１の欄には、重要要素又は非重要要素（以下、文書と記載）に含まれると想定される属性名の表記が格納され、単位５０２には各属性名５０１に対応する属性値と共に記述される単位の表記が格納される。エントリー５０３−１は、属性名５０１が「メモリサイズ」であり、単位５０２が「Ｇバイト」又は「ＧＢ」などである表記が、文書に記述されることを示す。エントリー５０３−２及びエントリー５０３−３も同様に、どのような表記が文書に記述されるかを示す。 In FIG. 5A, the attribute name 501 column stores the notations of attribute names expected to appear in important elements or non-important elements (hereinafter referred to as documents), and the unit 502 column stores the notations of the units written together with the attribute value corresponding to each attribute name 501. The entry 503-1 indicates that documents contain notations in which the attribute name 501 is “memory size” and the unit 502 is “Gbyte”, “GB”, or the like. Similarly, the entries 503-2 and 503-3 indicate what notations appear in documents.

図５Ｂは、本発明の第１の実施形態の属性抽出ルール４０３に含まれるパターン５０４の例を示す説明図である。図５Ｂに示すパターン５０４は、文書中において属性情報が表記されるパターンを示す。 FIG. 5B is an explanatory diagram illustrating an example of the pattern 504 included in the attribute extraction rule 403 according to the first embodiment of this invention. A pattern 504 illustrated in FIG. 5B indicates a pattern in which attribute information is written in a document.

図５Ｂに示すパターン５０４によれば、＜属性名＞に図５Ａに示す属性名５０１が記述され、＜単位＞に図５Ａに示す単位５０２が記述されるパターンを、文書が含んだ場合、属性情報抽出部４０２は、そのパターンを含んだ文書が属性抽出ルール４０３と一致すると判定する。また、図５Ｂに示す「Ｎ」は、任意の数値が記述されるパターンを示す。 According to the pattern 504 shown in FIG. 5B, when a document contains a pattern in which an attribute name 501 shown in FIG. 5A appears at <attribute name> and a unit 502 shown in FIG. 5A appears at <unit>, the attribute information extraction unit 402 determines that the document containing that pattern matches the attribute extraction rule 403. “N” in FIG. 5B stands for an arbitrary numerical value.

例えば、文書中に「メモリサイズは１ＧＢ」という記述がある場合、その記述は属性抽出ルール４０３のパターン５０４−１と一致するため、属性情報抽出部４０２は、文書中から「メモリサイズは１ＧＢ」を属性情報として抽出する。 For example, if a document contains the description “memory size is 1 GB”, the description matches the pattern 504-1 of the attribute extraction rule 403, so the attribute information extraction unit 402 extracts “memory size is 1 GB” from the document as attribute information.

また例えば、文書中に「動作電圧１００Ｖの」という記述がある場合、その記述は属性抽出ルール４０３のパターン５０４−２と一致するため、属性情報抽出部４０２は、文書中から「動作電圧１００Ｖの」を属性情報として抽出する。 Likewise, if a document contains the description “with an operating voltage of 100 V”, the description matches the pattern 504-2 of the attribute extraction rule 403, so the attribute information extraction unit 402 extracts “with an operating voltage of 100 V” from the document as attribute information.
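
The pattern matching described above can be sketched with regular expressions in Python. This is a minimal illustration and not the implementation of the attribute information extraction unit 402: the English phrasings of the patterns, the unit “V” assumed for the operating voltage, and the names `ATTRIBUTE_UNITS`, `build_patterns`, and `extract_attributes` are all assumptions made for the example.

```python
import re

# Hypothetical English rendering of the attribute name / unit table of FIG. 5A.
ATTRIBUTE_UNITS = {
    "memory size": ["Gbyte", "GB"],
    "operating voltage": ["V"],
}


def build_patterns(attribute_units):
    """Compile one regex per attribute name, covering
    "<attribute name> is N <unit>" and "<attribute name> N <unit>"."""
    patterns = []
    for name, units in attribute_units.items():
        unit_alt = "|".join(re.escape(u) for u in units)
        regex = re.compile(
            rf"{re.escape(name)}\s+(?:is\s+)?(\d+(?:\.\d+)?)\s*({unit_alt})\b"
        )
        patterns.append((name, regex))
    return patterns


def extract_attributes(text, patterns):
    """Return (attribute name, numeric value, unit) for every match in text."""
    found = []
    for name, regex in patterns:
        for m in regex.finditer(text):
            found.append((name, float(m.group(1)), m.group(2)))
    return found
```

For instance, `extract_attributes("The memory size is 1 GB", build_patterns(ATTRIBUTE_UNITS))` yields one tuple holding the attribute name, the value converted to a number, and the unit.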

なお、図５Ｂに示す属性抽出ルール４０３において、「Ｎ」などの符号によって示された数値の範囲を示すルールを追加してもよい。また属性情報抽出部４０２は、属性名５０１と単位５０２が示す属性値とのみでなく、属性名５０１が示す属性を有する事物名（品物名）も合わせて属性情報として抽出してもよい。 In the attribute extraction rule 403 shown in FIG. 5B, a rule specifying the range of the numerical value denoted by a symbol such as “N” may be added. The attribute information extraction unit 402 may also extract, as attribute information, not only the attribute name 501 and the attribute value indicated with the unit 502 but also the thing name (item name) that has the attribute indicated by the attribute name 501.

属性名５０１が示す属性を有する事物名も合わせて抽出する場合、図５Ｂにおける属性抽出ルール４０３に、「＜事物名＞の＜属性名＞はＮ＜単位＞」を追加し、＜事物名＞の箇所に該当する単語を検索することによって、容易に事物名を抽出し、抽出された事物名を属性情報に含めてもよい。 When the thing name having the attribute indicated by the attribute name 501 is also to be extracted, the pattern “<attribute name> of <thing name> is N <unit>” may be added to the attribute extraction rule 403 in FIG. 5B. The thing name can then be extracted easily by searching for the word at the <thing name> position, and the extracted thing name may be included in the attribute information.

さらに、属性抽出ルール４０３は、図５Ａに示す属性名５０１及び単位５０２のように、あらかじめ想定される事物名の一覧を含んでもよい。また、前述のような属性抽出ルール４０３に合致する任意の単語又はフレーズ（連続した単語の集合）を重要要素又は非重要要素から、事物名として抽出してもよい。 Furthermore, the attribute extraction rule 403 may include a list of thing names expected in advance, in the same way as the attribute names 501 and units 502 shown in FIG. 5A. Alternatively, any word or phrase (a sequence of consecutive words) that matches the attribute extraction rule 403 described above may be extracted from an important or non-important element as a thing name.

図４に示す単語集計部４０４は、単語分割部４０１から送られた重要要素及び非重要要素に含まれる全ての単語と、属性情報抽出部４０２によって抽出された全ての属性名５０１とを集計し、重複する単語及び属性名５０１をマージする。そして、マージされた単語及び属性名５０１を単語リスト４０５に格納する。 The word totaling unit 404 shown in FIG. 4 totals all the words included in the important elements and non-important elements sent from the word dividing unit 401 and all the attribute names 501 extracted by the attribute information extracting unit 402. , Merge overlapping words and attribute names 501. The merged word and attribute name 501 are stored in the word list 405.

さらにデータ変換部４０６は、重要要素及び非重要要素の内容、属性情報抽出部４０２によって抽出された属性情報、及び、単語リスト４０５の内容に基づいて、判定ルール１０６を生成するために必要となる教師データを生成する。本実施形態では、判定ルール生成の処理及び判定処理に、サポートベクターマシンを使用するため、データ変換部４０６は、重要要素及び非重要要素の内容を多次元ベクトルに変換する。教師データは、多次元ベクトルによって表現される。 Further, the data conversion unit 406 is necessary to generate the determination rule 106 based on the contents of the important element and the non-important element, the attribute information extracted by the attribute information extraction unit 402, and the contents of the word list 405. Generate teacher data. In this embodiment, since a support vector machine is used for determination rule generation processing and determination processing, the data conversion unit 406 converts the contents of important elements and non-important elements into multidimensional vectors. The teacher data is represented by a multidimensional vector.

具体的には、まずデータ変換部４０６は、単語リスト４０５中の各単語及び属性名５０１を多次元ベクトルの各要素に割り当てる。またデータ変換部４０６は、属性情報中に事物名が含まれる場合、事物名と属性名５０１との組に、対応する多次元ベクトルの一つの要素を割り当てる。 Specifically, the data conversion unit 406 first assigns each word and each attribute name 501 in the word list 405 to an element of the multidimensional vector. When the attribute information includes a thing name, the data conversion unit 406 assigns one element of the multidimensional vector to each pair of a thing name and an attribute name 501.

次に、データ変換部４０６は、各重要要素及び非重要要素に含まれる単語に対応する要素に「１」を、それ以外の要素には「０」を格納する。ただし、該当する単語が属性情報として抽出されている場合、単語に対応する要素には「０」を割り当てると共に、数字文字列で表記されている属性値を数値データに変換し、属性名５０１に対応する要素の値として格納する。 Next, the data conversion unit 406 stores “1” in the elements corresponding to the words included in each important or non-important element, and “0” in all other elements. However, when a word has been extracted as attribute information, “0” is assigned to the element corresponding to that word, and the attribute value written as a numeric character string is converted into numerical data and stored as the value of the element corresponding to the attribute name 501.

以下に、各重要要素及び非重要要素を、多次元ベクトルに変換する例を示す。 An example of converting each important element and non-important element into a multidimensional vector is shown below.

単語リスト４０５に、以下の単語及び属性名５０１が格納されているものとする。 It is assumed that the following words and attribute names 501 are stored in the word list 405.

単語：ＰＣ、ＣＰＵ、メモリ、メモリサイズ、動作電圧 Words: PC, CPU, memory, memory size, operating voltage
属性名５０１：メモリサイズ、動作電圧 Attribute names 501: memory size, operating voltage
単語リスト４０５に、上記の単語及び属性名５０１が格納されている場合、求める多次元ベクトルは以下の７次元のベクトルである。 When the above words and attribute names 501 are stored in the word list 405, the multidimensional vector to be obtained is the following seven-dimensional vector.

多次元ベクトル：（＜ＰＣ＞、＜ＣＰＵ＞、＜メモリ＞、＜メモリサイズ＞、＜動作電圧＞、［メモリサイズ］、［動作電圧］） Multidimensional vector: (<PC>, <CPU>, <memory>, <memory size>, <operating voltage>, [memory size], [operating voltage])
Ｘに該当する単語が存在する場合、＜Ｘ＞は「１」を示し、Ｘに該当する単語が存在しない場合、＜Ｘ＞は「０」を示す。また、Ｙに該当する属性名５０１に対応する属性値が設定される場合、［Ｙ］は「１」を示し、Ｙに該当する属性名５０１に対応する属性値が設定されていない場合、［Ｙ］は「０」を示す。 When a word corresponding to X exists, <X> indicates “1”, and when no word corresponding to X exists, <X> indicates “0”. When an attribute value corresponding to the attribute name 501 corresponding to Y is set, [Y] indicates “1”, and when no such attribute value is set, [Y] indicates “0”.

ここで、単語分割部４０１から単語集計部４０４を経由して送られた重要要素又は非重要要素に、以下の文字列が含まれるものとする。 Here, it is assumed that the following character strings are included in the important elements or the unimportant elements sent from the word dividing unit 401 via the word totaling unit 404.

「ＰＣはＣＰＵ及びメモリを有する」 “PC has CPU and memory”
重要要素又は非重要要素に上記のような文字列が含まれる場合、求める多次元ベクトルは、該当する単語に対応する要素に「１」をセットすることによって、以下のように示される。 When the important or non-important element contains the above character string, the multidimensional vector to be obtained is expressed as follows, by setting “1” in the elements corresponding to the words that appear.

多次元ベクトル：（１、１、１、０、０、０、０） Multidimensional vector: (1, 1, 1, 0, 0, 0, 0)
すなわち、多次元ベクトルは、対象となる重要要素又は非重要要素に、「ＰＣ」、「ＣＰＵ」及び「メモリ」の単語が存在し、「メモリサイズ」及び「動作電圧」の単語が存在しないことを示す。 That is, the multidimensional vector indicates that the words “PC”, “CPU”, and “memory” exist in the target important or non-important element, and that the words “memory size” and “operating voltage” do not.

また、単語分割部４０１から単語集計部４０４を経由して送られた重要要素又は非重要要素に、以下の文字列が含まれるものとする。 Further, it is assumed that the following character strings are included in the important elements or the unimportant elements sent from the word dividing unit 401 via the word totaling unit 404.

「ＰＣの動作電圧は１００Ｖ」 “PC operating voltage is 100 V”
重要要素又は非重要要素が上記のような文字列であった場合、図５Ａ及び図５Ｂに示す属性抽出ルール４０３を適用することによって、以下の文字列が抽出される。 When the important or non-important element is the above character string, the following character strings are extracted by applying the attribute extraction rule 403 shown in FIGS. 5A and 5B.

属性名：動作電圧 Attribute name: operating voltage
属性値：１００Ｖ Attribute value: 100 V
このため、得られる多次元ベクトルは、以下のとおりである。 The resulting multidimensional vector is therefore as follows.

（１、０、０、０、０、０、１００） (1, 0, 0, 0, 0, 0, 100)
すなわち、対象となる重要要素又は非重要要素には、「ＰＣ」の単語が存在し、他の単語は存在しないことを示す。また、対象となる重要要素又は非重要要素には、動作電圧が「１００（Ｖ）」である記述が存在することを示す。「動作電圧」という単語は、属性情報として抽出されるため、その有無を示す要素（左から５番目の要素）には０が格納され、属性値に関する要素（一番右の要素）のみに値が格納される。 That is, the vector indicates that the word “PC” exists in the target important or non-important element and that no other word does. It also indicates that the element contains a description stating that the operating voltage is “100 (V)”. Because the word “operating voltage” is extracted as attribute information, 0 is stored in the element indicating its presence (the fifth element from the left), and a value is stored only in the element for the attribute value (the rightmost element).

なお、「動作電圧」という単語が重要要素又は非重要要素に単独で記述されている場合、ベクトルの左から５番目の要素に１が格納され、属性値に関する要素には０が格納される。また、前述の例において、単語リスト４０５中に含まれない単語を、データ変換部４０６はすべて無視する。 When the word “operating voltage” appears on its own in an important or non-important element, without an attribute value, 1 is stored in the fifth element from the left of the vector and 0 is stored in the element for the attribute value. In the above examples, the data conversion unit 406 ignores every word that is not included in the word list 405.
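
The conversion walked through above can be reproduced with a short Python sketch. It assumes that word division and attribute extraction have already been done, so the function receives a list of words and a dictionary of extracted attribute values; the function name `to_vector` and the English word list are assumptions for illustration.

```python
WORD_LIST = ["PC", "CPU", "memory", "memory size", "operating voltage"]
ATTRIBUTE_NAMES = ["memory size", "operating voltage"]


def to_vector(words, attributes,
              word_list=WORD_LIST, attribute_names=ATTRIBUTE_NAMES):
    """Convert one element into a (len(word_list) + len(attribute_names))-dim vector.

    words: words found in the element (words outside word_list are ignored).
    attributes: {attribute name: numeric value} extracted from the element.
    """
    present = set(words)
    # 1 if the word occurs, but 0 when it was extracted as attribute information.
    word_part = [
        1 if w in present and w not in attributes else 0
        for w in word_list
    ]
    # The numeric attribute value, or 0 when no value was extracted.
    attr_part = [attributes.get(name, 0) for name in attribute_names]
    return word_part + attr_part


print(to_vector(["PC", "CPU", "memory"], {}))
print(to_vector(["PC", "operating voltage"], {"operating voltage": 100}))
```

The two calls correspond to the two example strings above and produce (1, 1, 1, 0, 0, 0, 0) and (1, 0, 0, 0, 0, 0, 100) respectively.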

さらに、データ変換部４０６は、前述の例において、属性値を示す数値データをそのまま対応する多次元ベクトルの要素に格納したが、属性名毎にあらかじめ定められた数値を乗じることによって大きさを変更した値を多次元ベクトルの要素に格納してもよい。また、属性値を示す数値データを０〜１の間に正規化し、得られた値を、多次元ベクトルの要素に格納してもよい。 Further, in the above example, the data conversion unit 406 stores the numerical data indicating the attribute value as it is in the corresponding multidimensional vector element, but the size is changed by multiplying a predetermined numerical value for each attribute name. The obtained value may be stored in an element of a multidimensional vector. Further, numerical data indicating the attribute value may be normalized between 0 and 1, and the obtained value may be stored in an element of a multidimensional vector.
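
The two scaling options mentioned above could look like the following in Python. The description does not state how the per-attribute factor or the 0 to 1 normalization is chosen, so the min-max form, the clipping of out-of-range values, and the function names are assumptions made for this sketch.

```python
def scale_by_factor(value: float, factor: float) -> float:
    """Multiply an attribute value by a factor fixed in advance per attribute name."""
    return value * factor


def min_max_normalize(value: float, lo: float, hi: float) -> float:
    """Map value into [0, 1] given an assumed expected range [lo, hi]."""
    if hi == lo:
        return 0.0  # degenerate range: no spread to normalize against
    clipped = min(max(value, lo), hi)  # keep out-of-range values inside [lo, hi]
    return (clipped - lo) / (hi - lo)
```

With an assumed range of 0 to 200 V, for example, an operating voltage of 100 V would be stored as 0.5 instead of 100.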

さらに前述の例において、全ての単語一つ一つにベクトルの要素に割り当てられていたが、データ変換部４０６は、類似する意味を示す単語は一つのベクトルの要素に割り当ててもよい。 Furthermore, although each word was assigned its own vector element in the above examples, the data conversion unit 406 may assign words with similar meanings to a single vector element.

図６は、本発明の第１の実施形態の教師データ生成部１０４の別の例を示すブロック図である。 FIG. 6 is a block diagram illustrating another example of the teacher data generation unit 104 according to the first embodiment of this invention.

データ変換部４０６によって類似する意味を示す単語が一つのベクトルの要素に割り当てられる場合、データ変換部４０６は、類似した単語の一覧を格納した同義語辞書４０７に接続される。そして、同義語辞書４０７の内容を検索することによって、単語リスト４０５の中で同義語と判断される単語を、多次元ベクトルにおける同一のベクトルの要素に割り当てる。 When words having similar meanings are assigned to one vector element by the data conversion unit 406, the data conversion unit 406 is connected to a synonym dictionary 407 storing a list of similar words. Then, by searching the contents of the synonym dictionary 407, the word determined to be a synonym in the word list 405 is assigned to the element of the same vector in the multidimensional vector.

例えば、前述の例において、「メモリ」と「メモリサイズ」とが類似である場合、同義語辞書４０７には「メモリ」と「メモリサイズ」との組が保持される。そして、データ変換部４０６は、同義語辞書４０７に保持される組を参照し、「メモリ」と「メモリサイズ」との属性名５０１を、一つのベクトルの要素に割り当てる。 For example, if “memory” and “memory size” are treated as similar in the above example, the synonym dictionary 407 holds the pair “memory” and “memory size”. The data conversion unit 406 then refers to the pair held in the synonym dictionary 407 and assigns the attribute names 501 “memory” and “memory size” to a single vector element.
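
Assigning synonyms to one vector element amounts to mapping each word to an element index through a canonical form. The Python sketch below illustrates this; representing the synonym dictionary 407 as a list of word groups and the function name `build_index` are assumptions for the example.

```python
def build_index(word_list, synonym_groups):
    """Map each word to a vector element index, sharing one index per synonym group.

    synonym_groups: list of lists of words treated as synonyms
    (hypothetical representation of the synonym dictionary).
    """
    # The first word of each group serves as the canonical form.
    canonical = {}
    for group in synonym_groups:
        for w in group:
            canonical[w] = group[0]

    index = {}
    for w in word_list:
        key = canonical.get(w, w)
        if key not in index:
            index[key] = len(index)  # next free element position
    return {w: index[canonical.get(w, w)] for w in word_list}
```

With the word list (PC, CPU, memory, memory size) and the single group (memory, memory size), both “memory” and “memory size” map to the same element, so the vector has three word elements instead of four.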

以上のように生成された多次元ベクトルは、教師データを示す。データ変換部４０６は、生成された教師データを、判定ルール生成部１０５に送る。 The multidimensional vector generated as described above indicates teacher data. The data conversion unit 406 sends the generated teacher data to the determination rule generation unit 105.

判定ルール生成部１０５は、送られた教師データを用いて、判定ルール１０６を生成する。前述したように、本実施形態における判定ルール生成部１０５は、サポートベクターマシンを使用することを想定しており、重要要素及び非重要要素ごとに分類された多次元ベクトルにサポートベクターマシンを適用し、判定ルール１０６を生成する。 The determination rule generation unit 105 generates a determination rule 106 using the transmitted teacher data. As described above, the determination rule generation unit 105 in the present embodiment is assumed to use a support vector machine, and applies the support vector machine to multidimensional vectors classified for each important element and non-important element. The determination rule 106 is generated.

サポートベクターマシンによって、判定ルール生成部１０５は、多次元ベクトルにおける分離面を生成する。これによって、判定ルール１０６は、重要要素及び非重要要素に含まれる属性名５０１と、その属性名５０１に対応する属性値とがとりうる値の範囲（または分布）を示す情報を保持する。 With the support vector machine, the determination rule generation unit 105 generates a separation plane in the multidimensional vector. Accordingly, the determination rule 106 holds information indicating a range (or distribution) of values that can be taken by the attribute name 501 included in the important element and the non-important element and the attribute value corresponding to the attribute name 501.

判定処理部１１１は、判定ルール１０６によって、重要な属性名５０１とその属性名５０１に対応する属性値がとりうる値とを取得することができる。 The determination processing unit 111 can acquire an important attribute name 501 and values that can be taken by the attribute value corresponding to the attribute name 501 according to the determination rule 106.

一方、図１に示す判定対象文書１０７の内容に、重要要素であるか否かを判定する処理を以下に示す。判定対象文書１０７は、前述のとおり、重要要素を抽出される対象の文書である。 On the other hand, processing for determining whether or not the content of the determination target document 107 shown in FIG. 1 is an important element will be described below. As described above, the determination target document 107 is a target document from which important elements are extracted.

まず、文書分割部１０８は、判定する単位である判定文書要素に判定対象文書１０７を分割する。文書分割部１０８によって行われる処理は、文書分類部１０３における文書分割部２０１と同じであり、また、その際使用される文書分割ルール１０９も、文書分割部２０１において用いられた文書分割ルール２０２と同じものを使用することができる。 First, the document dividing unit 108 divides the determination target document 107 into determination document elements that are determination units. The processing performed by the document dividing unit 108 is the same as that of the document dividing unit 201 in the document classification unit 103, and the document dividing rule 109 used at this time is the same as the document dividing rule 202 used in the document dividing unit 201. The same can be used.

すなわち判定対象文書１０７は、文書分割部１０８によって、章ごと、又は、段落ごとなどに分割され、判定データ生成部１１０に送られる。 That is, the determination target document 107 is divided into chapters or paragraphs by the document dividing unit 108 and sent to the determination data generating unit 110.

次に、判定データ生成部１１０は、文書分割部１０８によって分割された各判定文書要素を、判定処理部１１１において利用できるデータ形式に変換する。判定処理部１１１に判定ルール生成部１０５と同様にサポートベクターマシンを使用する場合、教師データと判定データの形式とは同一であるため、判定データ生成部１１０の処理は、教師データ生成部１０４と同じ処理である。 Next, the determination data generation unit 110 converts each determination document element divided by the document division unit 108 into a data format that can be used by the determination processing unit 111. When the support vector machine is used for the determination processing unit 111 as in the case of the determination rule generation unit 105, the format of the teacher data and the determination data is the same. Therefore, the processing of the determination data generation unit 110 is the same as that of the teacher data generation unit 104. The same process.

すなわち判定データ生成部１１０は、文書分割部１０８によって送られた各判定文書要素を、単語に分割し、さらに多次元ベクトルに変換する。これによって、判定処理部１１１は、判定ルール１０６と、判定文書要素から変換された多次元ベクトルとを比較することができる。 That is, the determination data generation unit 110 divides each determination document element sent by the document division unit 108 into words and further converts them into multidimensional vectors. Accordingly, the determination processing unit 111 can compare the determination rule 106 with the multidimensional vector converted from the determination document element.

図７は、本発明の第１の実施形態の判定データ生成部１１０の詳細を示すブロック図である。 FIG. 7 is a block diagram illustrating details of the determination data generation unit 110 according to the first embodiment of this invention.

判定データ生成部１１０は、単語分割部７０１、属性情報抽出部７０２、属性抽出ルール７０３、単語リスト７０４、及び、データ変換部７０５を備える。単語分割部７０１、属性情報抽出部７０２、及び、データ変換部７０５は、プログラムである。属性抽出ルール７０３、及び、単語リスト７０４は、データベースなどの記憶領域である。 The determination data generation unit 110 includes a word division unit 701, an attribute information extraction unit 702, an attribute extraction rule 703, a word list 704, and a data conversion unit 705. The word division unit 701, the attribute information extraction unit 702, and the data conversion unit 705 are programs. The attribute extraction rule 703 and the word list 704 are storage areas such as a database.

単語分割部７０１、属性情報抽出部７０２、属性抽出ルール７０３及びデータ変換部７０５は、教師データ生成部１０４における単語分割部４０１、属性情報抽出部４０２、属性抽出ルール４０３及びデータ変換部４０６とそれぞれ同じである。 The word division unit 701, the attribute information extraction unit 702, the attribute extraction rule 703, and the data conversion unit 705 are respectively the word division unit 401, the attribute information extraction unit 402, the attribute extraction rule 403, and the data conversion unit 406 in the teacher data generation unit 104. The same.

教師データ生成部１０４と判定データ生成部１１０との違いは、教師データ生成部１０４における単語集計部４０４が判定データ生成部１１０にはなく、データ変換部７０５において使用される単語リスト７０４は、教師データ生成部１０４において単語集計部４０４によって作成された単語リストを利用することである。これによって判定文書要素は、事例文書中に含まれる単語及び属性名５０１に基づいて、多次元ベクトルに変換される。 The teacher data generation unit 104 and the determination data generation unit 110 differ in two respects: the determination data generation unit 110 has no counterpart to the word totaling unit 404, and the word list 704 used by the data conversion unit 705 is the word list created by the word totaling unit 404 in the teacher data generation unit 104. As a result, each determination document element is converted into a multidimensional vector based on the words and attribute names 501 contained in the case documents.

なお、教師データ生成部１０４が同義語辞書４０７を備える場合、判定データ生成部１１０も同義語辞書４０７を備える。そしてデータ変換部７０５は、同義語辞書４０７に接続され、属性名５０１のうち類似する属性名５０１を、多次元ベクトルにおける同一のベクトルの要素に割り当てる。 When the teacher data generation unit 104 includes the synonym dictionary 407, the determination data generation unit 110 also includes the synonym dictionary 407. The data conversion unit 705 is connected to the synonym dictionary 407 and assigns similar attribute names 501 among the attribute names 501 to elements of the same vector in the multidimensional vector.

判定データ生成部１１０は、変換された多次元ベクトル（判定データ）を、判定処理部１１１に送る。 The determination data generation unit 110 sends the converted multidimensional vector (determination data) to the determination processing unit 111.

判定処理部１１１は、判定データ生成部１１０から送られた判定データに、サポートベクターマシンを使用して判定ルール１０６と比較することによって、各判定データが重要要素であるか非重要要素であるかを判定し、結果を判定結果１１２として出力する。 The determination processing unit 111 compares the determination data sent from the determination data generation unit 110 with the determination rule 106 using a support vector machine, thereby determining whether each determination data is an important element or an unimportant element. And the result is output as the determination result 112.
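
For a support vector machine with a linear kernel, comparing a determination vector with the separating plane reduces to checking the sign of a weighted sum. The Python sketch below shows only this decision step; the weights and bias are hypothetical values chosen for illustration (a trained model would supply them), and the linear kernel itself is an assumption, since the description does not state which kernel is used.

```python
def decision_value(weights, bias, vector):
    """Signed score of a vector relative to the plane w . x + b = 0."""
    return sum(w * x for w, x in zip(weights, vector)) + bias


def judge(weights, bias, vector):
    """Label a determination vector by the side of the plane it falls on."""
    if decision_value(weights, bias, vector) >= 0:
        return "important"
    return "non-important"


# Hypothetical plane over the seven-dimensional example vectors.
WEIGHTS = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01]
BIAS = -1.0

print(judge(WEIGHTS, BIAS, [1, 0, 0, 0, 0, 0, 100]))
print(judge(WEIGHTS, BIAS, [1, 1, 1, 0, 0, 0, 0]))
```

Because the attribute value enters the sum as a number, a plane of this kind can express conditions such as "operating voltage above a certain level", which a purely binary word vector cannot.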

具体的には判定処理部１１１は、判定データに含まれる多次元ベクトルによって、判定ルール１０６を検索し、判定データに含まれる属性名５０１に対応する属性が、重要要素であるか、非重要要素であるかを取得する。 Specifically, the determination processing unit 111 searches the determination rule 106 based on the multidimensional vector included in the determination data, and determines whether the attribute corresponding to the attribute name 501 included in the determination data is an important element or a non-important element. Get what it is.

さらに判定処理部１１１は、判定データの多次元ベクトルが示す属性名５０１と属性値とを、判定ルール１０６において検索することによって、判定データに含まれる属性値と、判定ルール１０６に含まれる属性値との距離が離れていることを取得することができる。すなわち判定処理部１１１は、判定データを判定ルール１０６において検索することによって、事例文書１０１に含まれる属性値が、判定対象文書１０７に含まれる属性値と、どの程度離れているかを取得することができる。 Furthermore, by searching the determination rule 106 for the attribute name 501 and attribute value indicated by the multidimensional vector of the determination data, the determination processing unit 111 can obtain the distance between the attribute value included in the determination data and the attribute value included in the determination rule 106. That is, by searching the determination rule 106 with the determination data, the determination processing unit 111 can obtain how far the attribute values included in the case document 101 are from the attribute values included in the determination target document 107.

図８は、本発明の第１の実施形態の重要箇所判定システムを、計算機１００に実装した場合の構成図を示す。 FIG. 8 is a configuration diagram when the important point determination system according to the first embodiment of this invention is mounted on the computer 100.

計算機１００は、情報処理装置８０１、入力装置８０２、表示装置８０３、記憶装置８０４、事例文書８１４、参照情報記載文書８１５、判定対象文書８１６、文書分割ルール８１７、参照情報抽出ルール８１８、属性抽出ルール８１９、単語リスト８２０、及び、判定ルール８２１を備える。 The computer 100 includes an information processing device 801, an input device 802, a display device 803, a storage device 804, a case document 814, a reference information description document 815, a determination target document 816, a document division rule 817, a reference information extraction rule 818, and an attribute extraction rule. 819, a word list 820, and a determination rule 821.

情報処理装置８０１は、重要箇所判定処理に必要な各種のプログラムを実行するためのＣＰＵなどの演算装置である。入力装置８０２は、システム利用者がシステムを操作するための装置であり、一般的に用いられるキーボード又はマウスなどの装置である。表示装置８０３は、判定結果１１２を出力するための装置であり、一般的に用いられるモニタ又はスピーカなどの装置である。 The information processing device 801 is an arithmetic device such as a CPU for executing various programs necessary for important point determination processing. The input device 802 is a device for a system user to operate the system, and is a device such as a keyboard or a mouse that is generally used. The display device 803 is a device for outputting the determination result 112, and is a generally used device such as a monitor or a speaker.

記憶装置８０４には、重要箇所判定処理に必要な各種のプログラムや処理の途中経過に関する情報が格納される。記憶装置８０４には、文書分割プログラム８０５、参照情報抽出プログラム８０６、マッチングプログラム８０７、分類処理プログラム８０８、単語分割プログラム８０９、属性情報抽出プログラム８１０、単語集計プログラム８１１、データ変換プログラム８１２、及び、判定処理プログラム８１３が格納される。 The storage device 804 stores the various programs necessary for the important part determination processing and information on the intermediate state of the processing. Specifically, it stores a document division program 805, a reference information extraction program 806, a matching program 807, a classification processing program 808, a word division program 809, an attribute information extraction program 810, a word totaling program 811, a data conversion program 812, and a determination processing program 813.

文書分割プログラム８０５は、文書分類部１０３に含まれる文書分割部２０１及び文書分割部１０８に対応する処理を行う。参照情報抽出プログラム８０６は、文書分類部１０３に含まれる参照情報抽出部２０３に対応する処理を行う。マッチングプログラム８０７は、文書分類部１０３に含まれるマッチング部２０５に対応する処理を行う。 The document division program 805 performs processing corresponding to the document dividing unit 201 included in the document classification unit 103 and to the document dividing unit 108. The reference information extraction program 806 performs processing corresponding to the reference information extraction unit 203 included in the document classification unit 103. The matching program 807 performs processing corresponding to the matching unit 205 included in the document classification unit 103.

分類処理プログラム８０８は、文書分類部１０３に含まれる分類処理部２０６に対応する処理を行う。単語分割プログラム８０９は、教師データ生成部１０４に含まれる単語分割部４０１及び判定データ生成部１１０に含まれる単語分割部７０１に対応する処理を行う。 The classification processing program 808 performs processing corresponding to the classification processing unit 206 included in the document classification unit 103. The word division program 809 performs processing corresponding to the word division unit 401 included in the teacher data generation unit 104 and the word division unit 701 included in the determination data generation unit 110.

属性情報抽出プログラム８１０は、教師データ生成部１０４に含まれる属性情報抽出部４０２及び判定データ生成部１１０に含まれる属性情報抽出部７０２に対応する処理を行う。単語集計プログラム８１１は、教師データ生成部１０４に含まれる単語集計部４０４に対応する処理を行う。 The attribute information extraction program 810 performs processing corresponding to the attribute information extraction unit 402 included in the teacher data generation unit 104 and the attribute information extraction unit 702 included in the determination data generation unit 110. The word totaling program 811 performs processing corresponding to the word totaling unit 404 included in the teacher data generation unit 104.

データ変換プログラム８１２は、教師データ生成部１０４に含まれるデータ変換部４０６及び判定データ生成部１１０に含まれるデータ変換部７０５に対応する処理を行う。判定処理プログラム８１３は、判定処理部１１１に対応する処理を行う。 The data conversion program 812 performs processing corresponding to the data conversion unit 406 included in the teacher data generation unit 104 and the data conversion unit 705 included in the determination data generation unit 110. The determination processing program 813 performs processing corresponding to the determination processing unit 111.

また事例文書８１４には、図１に示す事例文書１０１が格納され、参照情報記載文書８１５には、図１に示す参照情報記載文書１０２が格納される。判定対象文書８１６には、図１に示す判定対象文書１０７が格納され、文書分割ルール８１７には、図２に示す文書分割ルール２０２及び図１に示す文書分割ルール１０９が格納される。 The case document 814 stores the case document 101 shown in FIG. 1, and the reference information description document 815 stores the reference information description document 102 shown in FIG. The determination target document 816 stores the determination target document 107 shown in FIG. 1, and the document division rule 817 stores the document division rule 202 shown in FIG. 2 and the document division rule 109 shown in FIG.

参照情報抽出ルール８１８には、図２に示す参照情報抽出ルール２０４が格納され、属性抽出ルール８１９には、図４に示す属性抽出ルール４０３及び図７に示す属性抽出ルール７０３が格納される。単語リスト８２０には、図４に示す単語リスト４０５及び図７に示す単語リスト７０４が格納され、判定ルール８２１には、図１に示す判定ルール１０６が格納される。 The reference information extraction rule 818 stores the reference information extraction rule 204 shown in FIG. 2, and the attribute extraction rule 819 stores the attribute extraction rule 403 shown in FIG. 4 and the attribute extraction rule 703 shown in FIG. In the word list 820, the word list 405 shown in FIG. 4 and the word list 704 shown in FIG. 7 are stored, and in the determination rule 821, the determination rule 106 shown in FIG. 1 is stored.

本発明の第１の実施形態によれば、事例文書１０１と参照情報記載文書１０２との対応関係を求めることによって、文書中の重要要素を判定するための判定ルール１０６を生成するために必要となる教師データを効率的に構築することができるようになると共に、事例文書１０１から属性情報を抽出し、数値データに変換した属性値を教師データに埋め込むことによって、「〜以上」又は「〜から〜まで」のような数値の大きさや範囲に基づく判定ルールを容易に構築し、精度の良い重要箇所判定システムを構成することが可能となる。 According to the first embodiment of the present invention, obtaining the correspondence between the case document 101 and the reference information description document 102 makes it possible to efficiently construct the teacher data needed to generate the determination rule 106 for determining important elements in a document. In addition, by extracting attribute information from the case document 101 and embedding the attribute values, converted into numerical data, in the teacher data, determination rules based on the magnitude or range of numerical values, such as "... or more" or "from ... to ...", can be constructed easily, and an accurate important point determination system can be configured.

例えば、計算機システムを設計する際に提示される膨大な要求仕様書の中から、特に重要なシステムの情報（ＣＰＵ性能、又は、メモリサイズ等）を抽出したい時、第１の実施形態によれば、過去に蓄積された設計書などの事例文書と、ユーザからの要望などが記述された参照情報とに基づいて、判定ルール１０６を生成する。これによって、要求仕様書から効率的にシステムの情報を抽出することができる。また、判定ルール１０６に含まれる属性値によって、要求仕様書に示された数値が、過去の設計書に示された数値から離れている場合も、抽出することができる。 For example, suppose that particularly important system information (CPU performance, memory size, and so on) is to be extracted from the enormous requirement specifications presented when a computer system is designed. According to the first embodiment, the determination rule 106 is generated based on case documents, such as design documents accumulated in the past, and reference information describing user requests and the like. This allows system information to be extracted efficiently from the requirement specifications. In addition, because the determination rule 106 includes attribute values, a passage can also be extracted when a numerical value given in the requirement specifications deviates from the numerical values given in past design documents.

（第２の実施形態） (Second Embodiment)
本発明の第２の実施形態を図９から図１３を用いて説明する。 A second embodiment of the present invention will be described with reference to FIGS. 9 to 13.

第１の実施形態では、事例文書から抽出した属性情報から得られた数値データを、教師データ中に埋め込むことによって、一種類の判定ルール１０６を生成し、さらに生成された一種類の判定ルール１０６によって、精度のよい重要箇所判定を行うことができる。一方、第２の実施形態では、属性情報を除いた事例文書要素を判定するルールと、属性情報に関する判定を行うルールとを分離することに特徴を有する。 In the first embodiment, numerical data obtained from the attribute information extracted from the case documents is embedded in the teacher data to generate a single type of determination rule 106, and important points are determined accurately using that single rule. The second embodiment, by contrast, is characterized by separating the rule that determines case document elements with the attribute information removed from the rule that performs determination on the attribute information.

図９は、本発明の第２の実施形態の重要箇所判定システムの構成を示すブロック図である。 FIG. 9 is a block diagram showing the configuration of the important point determination system according to the second embodiment of the present invention.

図１に示す構成と図９に示す構成との違いは、図１では一種類であった判定ルール１０６が、図９において、属性情報を含まない事例文書要素を判定するための文書要素判定ルール９０３と、属性情報に関する判定のみを行う属性値判定ルール９０４の二種類になっている点である。 The configuration shown in FIG. 9 differs from that shown in FIG. 1 in that the single determination rule 106 of FIG. 1 is replaced in FIG. 9 by two types of rule: a document element determination rule 903 for determining case document elements that do not include attribute information, and an attribute value determination rule 904 that performs determination only on attribute information.

また、教師データ生成部９０１、判定ルール生成部９０２、判定データ生成部９０５及び判定処理部９０６の構成及び機能は、図１に示す教師データ生成部１０４、判定ルール生成部１０５、判定データ生成部１１０、及び、判定処理部１１１と同様であるが、その処理の詳細は異なる。 The configurations and functions of the teacher data generation unit 901, the determination rule generation unit 902, the determination data generation unit 905, and the determination processing unit 906 are the same as those of the teacher data generation unit 104, the determination rule generation unit 105, the determination data generation unit 110, and the determination processing unit 111 shown in FIG. 1, but the details of their processing differ.

また、第１の実施形態の事例文書１０１、参照情報記載文書１０２、文書分類部１０３、判定対象文書１０７、文書分割部２０１、及び、文書分割ルール２０２を、第２の実施形態の重要箇所判定システムも備える。 The important point determination system of the second embodiment also includes the case document 101, the reference information description document 102, the document classification unit 103, the determination target document 107, the document division unit 201, and the document division rule 202 of the first embodiment.

図１０は、本発明の第２の実施形態の教師データ生成部９０１の詳細な構成を示すブロック図である。 FIG. 10 is a block diagram illustrating a detailed configuration of the teacher data generation unit 901 according to the second embodiment of this invention.

図１０に示す教師データ生成部９０１と図４に示す教師データ生成部１０４とは、データ変換部における処理及び出力される教師データに違いがある。第２の実施形態において、データ変換部１００１が出力する教師データは、文書要素教師データ１００２及び属性値教師データ１００３である。 The teacher data generation unit 901 illustrated in FIG. 10 and the teacher data generation unit 104 illustrated in FIG. 4 are different in processing in the data conversion unit and output teacher data. In the second embodiment, the teacher data output from the data converter 1001 is document element teacher data 1002 and attribute value teacher data 1003.

文書要素教師データ１００２は、第１の実施形態における教師データから属性情報を含まない事例文書要素に、判定ルールを生成するための教師データである。属性値教師データ１００３は、属性情報を含む事例文書要素に、判定ルールを生成するための教師データである。 The document element teacher data 1002 is teacher data for generating a determination rule for case document elements with the attribute information removed from the teacher data of the first embodiment. The attribute value teacher data 1003 is teacher data for generating a determination rule for case document elements that include attribute information.

第１の実施形態と同様に、第２の実施形態においても、事例文書要素からの判定ルール生成及び判定対象文書要素の判定にサポートベクターマシンを利用する。第２の実施形態のデータ変換部１００１は、単語リスト１００４中の各単語に多次元ベクトルの要素を割り当て、各事例文書要素に、事例文書要素中の単語に対応する多次元ベクトルの要素に１を、それ以外に０を設定することによって、教師データを生成する。 As in the first embodiment, the second embodiment uses a support vector machine both to generate determination rules from case document elements and to determine the determination target document elements. The data conversion unit 1001 of the second embodiment assigns one element of a multidimensional vector to each word in the word list 1004 and, for each case document element, generates teacher data by setting the vector elements corresponding to the words in that case document element to 1 and all other elements to 0.
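The 0/1 encoding described above can be illustrated with a minimal Python sketch. The function name `to_binary_vector` and the sample words are hypothetical and do not appear in the specification; the sketch only shows one way such a vector could be built from a word list.

```python
from typing import Iterable, List


def to_binary_vector(element_words: Iterable[str], word_list: List[str]) -> List[int]:
    """Return one component per entry of word_list: 1 if the word occurs
    in the case document element, 0 otherwise."""
    present = set(element_words)
    return [1 if word in present else 0 for word in word_list]


# Hypothetical word list (standing in for word list 1004) and one
# case document element already split into words.
word_list = ["server", "memory", "CPU", "backup"]
element = ["the", "server", "needs", "more", "memory"]

vector = to_binary_vector(element, word_list)
print(vector)  # [1, 1, 0, 0]
```

Words that are not in the word list (here "the", "needs", "more") are simply ignored, so every element is mapped to a vector of the same length.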

この際、データ変換部１００１は、あらかじめ、属性情報抽出部１００５において属性情報として抽出された箇所に該当する事例文書要素中の単語を除いておく。または、属性名の単語を除くと共に属性値に該当する単語を、単語の内容に依存せず、かつ、他の意味を持つ単語として現れることが無い特定の文字列に置き換える。例えば、属性値に該当する単語を、「ＮＮＮ」などに置き換えるようにしてもよい。 At this time, the data conversion unit 1001 removes the word in the case document element corresponding to the location extracted as the attribute information by the attribute information extraction unit 1005 in advance. Alternatively, the word corresponding to the attribute value is replaced with a specific character string that does not depend on the content of the word and does not appear as a word having another meaning while excluding the attribute name word. For example, the word corresponding to the attribute value may be replaced with “NNN” or the like.
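The two alternatives above (dropping the attribute words entirely, or dropping the attribute name and replacing the attribute value with a fixed string) might look as follows. This is a sketch under assumptions: the positions of attribute names and values are taken as given, as if supplied by the attribute information extraction unit, and the function name and sample sentence are hypothetical.

```python
from typing import List, Set

PLACEHOLDER = "NNN"  # fixed string from the text; assumed never to occur as a normal word


def mask_attributes(words: List[str],
                    name_positions: Set[int],
                    value_positions: Set[int],
                    drop_values: bool = False) -> List[str]:
    """Remove attribute-name words; either drop attribute-value words
    (drop_values=True) or replace each of them with PLACEHOLDER."""
    masked = []
    for index, word in enumerate(words):
        if index in name_positions:
            continue  # attribute names are removed in both variants
        if index in value_positions:
            if not drop_values:
                masked.append(PLACEHOLDER)
            continue
        masked.append(word)
    return masked


# Hypothetical element: "memory size" is the attribute name, "8 GB" the value.
words = ["memory", "size", "is", "8", "GB", "or", "more"]
names, values = {0, 1}, {3, 4}

print(mask_attributes(words, names, values))                    # ['is', 'NNN', 'NNN', 'or', 'more']
print(mask_attributes(words, names, values, drop_values=True))  # ['is', 'or', 'more']
```

Keeping a placeholder preserves the fact that a numeric attribute was present in the element without letting the particular value influence the document element rule.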

このようにして生成された多次元ベクトルを、データ変換部１００１は、文書要素教師データ１００２として出力する。文書要素教師データ１００２から文書要素判定ルール９０３を生成するための技術には、サポートベクターマシン以外にも、一般的に用いられている機械学習の技術を用いることができるため、文書要素教師データ１００２には、使用する機械学習に適した形式を用いて、文書要素判定ルール９０３を生成すればよい。 The data conversion unit 1001 outputs the multidimensional vector generated in this way as document element teacher data 1002. As a technique for generating the document element determination rule 903 from the document element teacher data 1002, in addition to the support vector machine, a generally used machine learning technique can be used. In this case, the document element determination rule 903 may be generated using a format suitable for machine learning to be used.

また、属性情報抽出部１００５は、属性名５０２と数値データに変換された属性値とを抽出することによって取得された属性情報の一覧を、属性値教師データ１００３として出力する。属性情報抽出部１００５は、第１の実施形態と同じく、該当する属性を有する事物名も属性情報と合わせて抽出し、属性値教師データ１００３に含めて出力してもよい。 Further, the attribute information extraction unit 1005 outputs a list of attribute information acquired by extracting the attribute name 502 and the attribute value converted into numerical data as attribute value teacher data 1003. As in the first embodiment, the attribute information extraction unit 1005 may extract the thing name having the corresponding attribute together with the attribute information, and output it by including it in the attribute value teacher data 1003.

なお、文書要素教師データ１００２及び属性値教師データ１００３における各教師データには、重要要素から生成された教師データであるか、非重要要素から生成された教師データであるかに関する識別子が、各々付加される。 Each item of teacher data in the document element teacher data 1002 and the attribute value teacher data 1003 is given an identifier indicating whether it was generated from an important element or from a non-important element.

図１１は、本発明の第２の実施形態の判定ルール生成部９０２の詳細を示すブロック図である。 FIG. 11 is a block diagram illustrating details of the determination rule generation unit 902 according to the second embodiment of this invention.

第１の実施形態の判定ルール生成部１０５は、一種類の機械学習によって判定ルール１０６を生成することを前提としたが、第２の実施形態の判定ルール生成部９０２は、文書要素判定ルール９０３及び属性値判定ルール９０４の二種類の判定ルールを生成するため、それぞれに文書要素判定ルール生成部１１０１及び属性値判定ルール生成部１１０２を備える。 The determination rule generation unit 105 of the first embodiment assumes that the determination rule 106 is generated by a single type of machine learning. The determination rule generation unit 902 of the second embodiment, in contrast, generates two types of determination rule, the document element determination rule 903 and the attribute value determination rule 904, and therefore includes a document element determination rule generation unit 1101 and an attribute value determination rule generation unit 1102, one for each.

判定ルール生成部９０２は、教師データの形式に合わせた、それぞれに適した技術を用いて判定ルールを生成してよい。例えば、文書要素判定ルール９０３を生成する方法には、前述のように教師データが多次元ベクトルによって与えられる場合、サポートベクターマシンを用いることが可能である。これによって、第１の実施形態と同じく、判定ルールを容易に生成することができる。 The determination rule generation unit 902 may generate a determination rule using a technique suitable for each in accordance with the format of the teacher data. For example, the method for generating the document element determination rule 903 can use a support vector machine when the teacher data is given by a multidimensional vector as described above. As a result, as in the first embodiment, the determination rule can be easily generated.

一方、属性値判定ルール９０４の場合、教師データは属性名と数値データとに変換された属性値の組であることから、判定ルール生成部９０２は、例えば、属性名毎に重要要素の属性値の平均と分散とを求め、それによって決定される範囲に属性値が存在しているか否かによって判定するルールを生成する。また、非重要要素の平均と分散とを加えることによって、重要要素である確率を算出する判定ルールを生成する。当然ながら、属性情報に関する判定ルールの生成においても、一般的に使用される機械学習の技術を使用してもよい。 For the attribute value determination rule 904, on the other hand, the teacher data consists of pairs of an attribute name and an attribute value converted into numerical data. The determination rule generation unit 902 therefore, for example, obtains the mean and variance of the attribute values of the important elements for each attribute name and generates a rule that makes the determination according to whether an attribute value lies within the range determined by them. It may also generate a determination rule that calculates the probability of being an important element by additionally using the mean and variance of the non-important elements. Naturally, commonly used machine learning techniques may also be used to generate the determination rule for attribute information.
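As an illustration of the range-based rule, the following Python sketch computes the mean and variance per attribute name and accepts values within a band around the mean. The text only says the range is "determined by" the mean and variance; the choice of mean ± k standard deviations, the parameter `k`, and all names and sample values are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import Dict, Iterable, Tuple


def build_range_rules(samples: Iterable[Tuple[str, float]],
                      k: float = 2.0) -> Dict[str, Tuple[float, float]]:
    """Build one (low, high) range per attribute name from the attribute
    values of important elements: mean +/- k standard deviations (assumed form)."""
    by_name = defaultdict(list)
    for name, value in samples:
        by_name[name].append(value)
    rules = {}
    for name, values in by_name.items():
        mu = mean(values)
        sigma = pvariance(values, mu) ** 0.5  # population variance -> std. deviation
        rules[name] = (mu - k * sigma, mu + k * sigma)
    return rules


def in_range(rules: Dict[str, Tuple[float, float]], name: str, value: float) -> bool:
    """True if the value lies inside the learned range for that attribute name."""
    if name not in rules:
        return False  # no rule learned for this attribute (sketch-level choice)
    low, high = rules[name]
    return low <= value <= high


# Hypothetical attribute value teacher data taken from important elements.
important_samples = [("memory_gb", 4.0), ("memory_gb", 8.0), ("memory_gb", 12.0)]
rules = build_range_rules(important_samples)

print(in_range(rules, "memory_gb", 8.0))   # True
print(in_range(rules, "memory_gb", 64.0))  # False
```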

図１２は、本発明の第２の実施形態の判定データ生成部９０５の詳細を示すブロック図である。 FIG. 12 is a block diagram illustrating details of the determination data generation unit 905 according to the second embodiment of this invention.

図７に示す判定データ生成部１１０と図１２に示す判定データ生成部９０５との違いは、データ変換部１２０１における処理と、出力する判定データとである。図１２に示す判定データ生成部９０５は、属性情報を除いた文書要素判定データ１２０２及び属性情報に関する属性値判定データ１２０３の二種類を出力する。 The determination data generation unit 110 shown in FIG. 7 and the determination data generation unit 905 shown in FIG. 12 differ in the processing performed by the data conversion unit 1201 and in the determination data that is output. The determination data generation unit 905 shown in FIG. 12 outputs two types of data: document element determination data 1202, from which attribute information has been removed, and attribute value determination data 1203, which relates to the attribute information.

第２の実施形態における単語分割部７０１、属性抽出ルール７０３、及び、単語リスト７０４は、第１の実施形態と同じである。 The word division unit 701, the attribute extraction rule 703, and the word list 704 in the second embodiment are the same as those in the first embodiment.

判定ルール生成部９０２及び判定処理部９０６にサポートベクターマシンを用いる場合、データ変換部１２０１は、前述した文書要素教師データと同じく、属性情報に該当する箇所を除いた判定文書要素の内容に基づいた多次元ベクトルを、文書要素判定データ１２０２として生成する。又は、属性名の単語を除くと共に属性値に該当する単語を、その内容に依存せず、且つ、単語として現れることが無い特定の文字列に置き換えてもよい。例えば、「ＮＮＮ」などに置き換えてもよい。 When a support vector machine is used in the determination rule generation unit 902 and the determination processing unit 906, the data conversion unit 1201 generates, as the document element determination data 1202, a multidimensional vector based on the content of the determination document element with the portions corresponding to attribute information removed, in the same way as for the document element teacher data described above. Alternatively, it may remove the attribute name words and replace the words corresponding to attribute values with a specific character string that does not depend on their content and never appears as an ordinary word, for example "NNN".

データ変換部１２０１によって生成される文書要素判定データの形式には、文書要素判定ルール生成及び後述の文書要素判定処理部１３０１において使用される技術に適した形式を選択すればよい。さらに、判定データ生成部９０５は、教師データ生成部９０１と同様の処理をする。すなわち、属性情報抽出部１２０４は、属性名及び数値データに変換された属性情報を、属性値から抽出し、抽出された属性情報を属性値判定データ１２０３として出力する。抽出された属性情報は、該当する属性を有する事物名も合わせて付加され、属性値判定データ１２０３として出力されてもよい。 For the format of the document element determination data generated by the data conversion unit 1201, a format suited to the technique used for document element determination rule generation and in the document element determination processing unit 1301, described later, may be selected. In other respects, the determination data generation unit 905 performs the same processing as the teacher data generation unit 901. That is, the attribute information extraction unit 1204 extracts the attribute name and the attribute value converted into numerical data as attribute information, and outputs the extracted attribute information as the attribute value determination data 1203. The name of the object having the attribute may be added to the extracted attribute information before it is output as the attribute value determination data 1203.

図１３は、本発明の第２の実施形態の判定処理部９０６の詳細を示すブロック図である。 FIG. 13 is a block diagram illustrating details of the determination processing unit 906 according to the second embodiment of this invention.

判定処理部９０６は、文書要素判定処理部１３０１と、属性値判定処理部１３０２とを備える。文書要素判定処理部１３０１は、文書要素判定ルール９０３及び文書要素判定データ１２０２に基づく判定処理を行う。属性値判定処理部１３０２は、属性値判定ルール９０４及び属性値判定データ１２０３に基づく判定処理を行う。それぞれの判定処理に用いる技術は、判定ルールを生成するために用いた技術に、適した方法を選択することができる。 The determination processing unit 906 includes a document element determination processing unit 1301 and an attribute value determination processing unit 1302. The document element determination processing unit 1301 performs a determination process based on the document element determination rule 903 and the document element determination data 1202. The attribute value determination processing unit 1302 performs a determination process based on the attribute value determination rule 904 and the attribute value determination data 1203. As a technique used for each determination process, a method suitable for the technique used for generating the determination rule can be selected.

例えば、文書要素判定ルール９０３が、前述の例のようにサポートベクターマシンを使用して生成される場合、同様にサポートベクターマシンを文書要素判定処理部１３０１に使用すればよい。また、属性値判定ルール９０４が、例えば、事例文書から得られた属性値の平均値と分散値とによって与えられるような場合、広く知られた正規分布の計算式を用いて確率的に判定する方法などを使用すればよい。 For example, when the document element determination rule 903 is generated using a support vector machine as in the above-described example, the support vector machine may be similarly used for the document element determination processing unit 1301. Further, when the attribute value determination rule 904 is given by, for example, the average value and the variance value of the attribute values obtained from the case document, it is determined probabilistically using a well-known normal distribution formula. A method or the like may be used.
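One way to read "probabilistic determination using the normal distribution" is to compare the normal densities fitted to important and non-important elements. The sketch below does this with Bayes' rule; the use of a prior, its default of 0.5, the fallback for numerical underflow, and all names and numbers are assumptions for illustration rather than the method stated in the specification.

```python
import math
from typing import Tuple


def normal_pdf(x: float, mu: float, var: float) -> float:
    """Density of a normal distribution with mean mu and variance var (var > 0)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def probability_important(x: float,
                          important: Tuple[float, float],
                          unimportant: Tuple[float, float],
                          prior: float = 0.5) -> float:
    """Probability that attribute value x belongs to an important element.
    `important` and `unimportant` are (mean, variance) pairs; `prior` is an
    assumed prior probability of importance."""
    p_imp = normal_pdf(x, *important) * prior
    p_unimp = normal_pdf(x, *unimportant) * (1.0 - prior)
    total = p_imp + p_unimp
    # If both densities underflow to zero, fall back to the prior (sketch-level choice).
    return p_imp / total if total > 0.0 else prior


# Hypothetical (mean, variance) pairs for one attribute name.
important_stats = (8.0, 4.0)
unimportant_stats = (2.0, 4.0)

print(probability_important(5.0, important_stats, unimportant_stats))  # 0.5 (equidistant from both means)
print(probability_important(8.0, important_stats, unimportant_stats) > 0.9)  # True
```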

結果統合部１３０３では、文書要素判定処理部１３０１における判定結果と属性値判定処理部１３０２における判定結果を統合し、最終的な判定結果を出力する。統合の方法は、例えば、判定対象となっている判定文書要素から属性情報が抽出されている場合は、属性値判定処理部１３０２の結果のみを使用して判定結果とする方法でもよい。 The result integration unit 1303 integrates the determination result in the document element determination processing unit 1301 and the determination result in the attribute value determination processing unit 1302 and outputs the final determination result. For example, when the attribute information is extracted from the determination document element that is the determination target, the integration method may be a method in which only the result of the attribute value determination processing unit 1302 is used as the determination result.

また結果統合部１３０３は、例えば、文書要素判定処理部１３０１からの判定結果及び属性値判定処理部１３０２からの判定結果が、重要要素である確率値など、重要要素又は非重要要素として判定される程度を示す連続値によって出力される場合、両者の判定結果の加重平均又は相加平均など、両者の値から一つの連続値を算出する関数を用いることによって、両者の結果を統合してもよい。 In addition, when the determination results from the document element determination processing unit 1301 and the attribute value determination processing unit 1302 are each output as a continuous value indicating the degree to which the element is judged important or non-important, such as the probability of being an important element, the result integration unit 1303 may integrate the two results using a function that calculates a single continuous value from both, such as a weighted average or an arithmetic mean.

文書要素判定処理部１３０１及び属性値判定処理部１３０２それぞれからの結果が、重要要素であるか非重要要素であるかという判定のみが行われた結果である場合、結果統合部１３０３は、重要要素の場合を１、非重要要素の場合を０と定義することによって、加重平均又は相加平均などの関数を用いて両者の結果を統合してもよい。 When the results from the document element determination processing unit 1301 and the attribute value determination processing unit 1302 indicate only whether the element is an important element or a non-important element, the result integration unit 1303 may define an important element as 1 and a non-important element as 0 and integrate the two results using a function such as a weighted average or an arithmetic mean.

結果統合部１３０３によって出力される判定結果は、前述の方法によって求めた値を出力するほか、それらの値とあらかじめ定めた閾値とに基づいて、重要要素であるか非重要要素であるかを判定した結果を出力するようにしてもよい。 The determination result output by the result integrating unit 1303 outputs the values obtained by the above-described method, and determines whether the determination result is an important element or a non-important element based on those values and a predetermined threshold value. The result may be output.
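The integration options described for the result integration unit 1303 can be sketched in Python as follows. The function names, the weight parameter, the threshold default of 0.5, and the handling of elements without attribute information (falling back to the document element score) are assumptions for this sketch.

```python
from typing import Optional


def integrate(doc_score: float,
              attr_score: Optional[float],
              attr_weight: float = 0.5,
              attr_only: bool = False) -> float:
    """Combine the document element score and the attribute value score.
    Scores are continuous values in [0, 1]; binary decisions can be passed
    as 1.0 (important) or 0.0 (non-important)."""
    if attr_score is None:
        return doc_score      # no attribute information extracted (assumed fallback)
    if attr_only:
        return attr_score     # use only the attribute value result
    # Weighted average; attr_weight = 0.5 gives the arithmetic mean.
    return (1.0 - attr_weight) * doc_score + attr_weight * attr_score


def is_important(score: float, threshold: float = 0.5) -> bool:
    """Final decision against a predetermined threshold (value assumed)."""
    return score >= threshold


print(integrate(1.0, 0.0))                  # 0.5
print(integrate(0.2, 0.8, attr_only=True))  # 0.8
print(is_important(integrate(0.9, None)))   # True
```

Setting `attr_weight` above 0.5 lets numeric evidence dominate for elements where the attribute value is the main reason a passage matters.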

以上のように、本発明の第２の実施形態によれば、事例文書と参照情報記載文書の対応関係を求めることによって、文書中の重要要素を判定するための判定ルールを生成するために必要となる教師データを効率的に構築することができるようになると共に、文書中で文字列が重要となる箇所と属性値のような数値が重要となる箇所を分離し、それぞれに適した判定ルール生成技術及び判定処理技術を適用した柔軟で精度の高い重要箇所判定システムを構成することが可能となる。 As described above, according to the second embodiment of the present invention, it is necessary to generate a determination rule for determining an important element in a document by obtaining a correspondence relationship between a case document and a reference information description document. It is possible to efficiently construct the teacher data to be used, and to separate the part where the character string is important in the document from the part where the numerical value such as the attribute value is important, and the decision rule suitable for each. It is possible to configure a flexible and highly accurate important point determination system to which the generation technique and the determination processing technique are applied.

（第３の実施形態） (Third Embodiment)
本発明の第３の実施形態を図１４から図１６を用いて説明する。 A third embodiment of the present invention will be described with reference to FIGS. 14 to 16.

図１４は、本発明の第３の実施形態の重要箇所判定システムの構成を示すブロック図である。 FIG. 14 is a block diagram showing a configuration of an important point determination system according to the third embodiment of the present invention.

図１４における構成要素において、判定結果訂正部１４０１及び分類済事例文書１４０２は、第１の実施形態の重要箇所判定システムと異なる。他の構成要素は、第１の実施形態の構成要素と同じである。 In the components in FIG. 14, the determination result correction unit 1401 and the classified case document 1402 are different from the important part determination system of the first embodiment. Other components are the same as those of the first embodiment.

第３の実施形態では、文書分類部１０３からの出力を、分類済事例文書１４０２に格納する。教師データ生成部１０４は、分類済事例文書１４０２の内容を参照し、教師データを生成する。また、第３の実施形態では、判定対象文書に判定結果の一覧をシステム利用者に提示し、システム利用者が判定結果を確認すると共に、誤った判定結果を訂正する手段を提供することが特徴である。さらに、システム利用者が訂正を行った後、判定対象文書と判定結果とを新たな事例として、判定ルールの生成に利用する手段を設けたことが特徴である。 In the third embodiment, the output from the document classification unit 103 is stored in the classified case document 1402. The teacher data generation unit 104 refers to the contents of the classified case document 1402 and generates teacher data. The third embodiment is also characterized by presenting the system user with a list of the determination results for the determination target document and providing a means by which the system user can check the results and correct erroneous ones. It is further characterized by providing a means for using the determination target document and its determination results, after correction by the system user, as new cases for generating the determination rule.

図１４に示す判定結果訂正部１４０１は、判定処理部１１１から受信した判定結果の一覧をシステム利用者に提示すると共に、システム利用者が判定結果を訂正する手段をシステム利用者に提供する。具体的には、判定する単位に判定対象文章を分割した結果である判定文書要素とそれに対する判定結果を対にしてモニタ画面上に表示し、キーボード又はマウスを操作することによって、画面上で、判定結果を訂正できるようにする。 The determination result correction unit 1401 shown in FIG. 14 presents the system user with a list of the determination results received from the determination processing unit 111 and provides the system user with a means of correcting them. Specifically, each determination document element, obtained by dividing the determination target text into the units to be determined, is displayed on the monitor screen paired with its determination result, and the determination result can be corrected on the screen by operating a keyboard or mouse.

図１５は、本発明の第３の実施形態の判定結果の表示例を示す説明図である。 FIG. 15 is an explanatory diagram illustrating a display example of the determination result according to the third embodiment of this invention.

図１５に示す判定結果一覧には、判定対象文書１５０１、判定結果１５０２、訂正ボタン１５０３、「事例に追加」ボタン１５０６、及び、終了ボタン１５０７が含まれる。 The determination result list illustrated in FIG. 15 includes a determination target document 1501, a determination result 1502, a correction button 1503, an “add to case” button 1506, and an end button 1507.

判定対象文書１５０１には、各判定文書要素の内容が表示される。判定対象文書１５０１には、該当する判定文書要素の内容のみを表示してもよいし、文脈情報と合わせて表示してもよい。また、表示される内容を切り替えるための操作手段としてボタンなどが画面上に配置され、このボタンへの操作によって、表示される内容が切り替えられてもよい。 The determination target document 1501 column displays the contents of each determination document element. It may show only the contents of the relevant determination document element, or show them together with context information. An operation means such as a button for switching the displayed content may also be placed on the screen, and the displayed content may be switched by operating that button.

判定結果１５０２には、各判定対象文書１５０１が示す判定文書要素に対する判定結果が表示される。図１５の判定結果１５０２にフラグを示す値が格納されている場合、該当する行は重要要素に判定されたことを示し、空白は、非重要要素に判定されたことを示す。すなわち、図１５に示す表示例は、行１５０４及び行１５０５の判定対象文書１５０１に記述された判定文書要素のみが重要要素であると、判定されたことを示す。 In the determination result 1502, the determination result for the determination document element indicated by each determination target document 1501 is displayed. When a value indicating a flag is stored in the determination result 1502 of FIG. 15, the corresponding row indicates that it is determined as an important element, and a blank indicates that it is determined as a non-important element. That is, the display example illustrated in FIG. 15 indicates that it is determined that only the determination document element described in the determination target document 1501 in the lines 1504 and 1505 is an important element.

判定文書要素が重要要素と判定されたか、非重要要素と判定されたかを表示する方法には、両者を区別する方法であれば、いずれの方法でもよい。例えば、異なる色によって表示してもよく、または、異なる文字によって表示してもよい。 As a method for displaying whether the determination document element is determined to be an important element or a non-important element, any method may be used as long as the two are distinguished from each other. For example, you may display by a different color or you may display by a different character.

訂正ボタン１５０３のうち、「訂正」と記述されている箇所は、各判定対象文書１５０１の判定文書要素の判定結果を訂正するためのボタンである。システム利用者がこのボタンをキーボード又はマウスなどによって操作すると、操作する毎に、判定結果に表示される内容が、重要要素の表示と非重要要素の表示が切り替わって表示される。 In the correction button 1503, a portion described as “correction” is a button for correcting the determination result of the determination document element of each determination target document 1501. When the system user operates this button with a keyboard or a mouse, the contents displayed in the determination result are displayed by switching between the display of the important element and the display of the non-important element each time the button is operated.

例えば、行１５０４の判定結果１５０２には、重要要素であるとの判定結果が表示されている。この状態において、システム利用者が訂正ボタン１５０３を一回操作すると、判定結果１５０２が非重要要素の表示に切り替わり、行１５０４における判定結果１５０２の欄は空白となる。再度、システム利用者が訂正ボタン１５０３を操作すると、行１５０４における判定結果１５０２は、再び重要要素の表示に切り替わり、フラグを示す値が表示される。 For example, the determination result 1502 in the row 1504 displays the determination result that it is an important element. In this state, when the system user operates the correction button 1503 once, the determination result 1502 is switched to display of an unimportant element, and the column of the determination result 1502 in the row 1504 is blank. When the system user operates the correction button 1503 again, the determination result 1502 in the row 1504 switches to the display of the important element again, and the value indicating the flag is displayed.

訂正ボタン１５０３には、判定結果１５０２を訂正するために、前述のようなボタンによる方法の他、システム利用者が判定結果の欄に重要要素か非重要要素かを示す内容を直接入力する表示方法を用いてもよい。図１４に示す判定結果訂正部１４０１は、前述のように、システムによる判定結果１５０２をシステム利用者が訂正した判定対象文書１５０１の箇所を全て記録する。 To correct the determination result 1502, instead of the button-based method described above, a display method may be used in which the system user directly enters, in the determination result column, content indicating whether the element is important or non-important. The determination result correction unit 1401 shown in FIG. 14 records every location in the determination target document 1501 at which the system user has corrected the system's determination result 1502 in this way.
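The toggle behaviour of the correction button and the recording of corrected locations could be modelled as below. This is a hypothetical sketch: the class names are invented, and treating a row toggled back to its original state as "not corrected" is an assumption, since the text does not say how such a case is recorded.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ResultRow:
    text: str        # content of one determination document element
    important: bool  # current determination result (flag shown or blank)


@dataclass
class ResultList:
    rows: List[ResultRow]
    corrected: Set[int] = field(default_factory=set)  # indices changed by the user

    def toggle(self, index: int) -> None:
        """Simulate one press of the correction button for a row."""
        row = self.rows[index]
        row.important = not row.important
        # Symmetric difference: a second press restores the original result,
        # so the row no longer counts as corrected (assumed behaviour).
        self.corrected ^= {index}


results = ResultList([ResultRow("CPU must be 3 GHz or faster", True),
                      ResultRow("See the appendix for contacts", False)])
results.toggle(0)
print(results.rows[0].important, results.corrected)  # False {0}
results.toggle(0)
print(results.rows[0].important, results.corrected)  # True set()
```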

システム利用者が、「事例に追加」ボタン１５０６をキーボード又はマウスによって操作すると、判定結果訂正部１４０１は、判定結果を、重要要素又は／及び非重要要素に訂正された判定対象要素を分類済事例文書１４０２に追加する。判定結果訂正部１４０１から送られる内容は、文書分類部１０３から送られる内容（重要要素に分類された事例文書要素及び非重要要素に分類された事例文書要素）と同じである。このため、分類済事例文書１４０２に単純に追加されることによって、判定結果訂正部１４０１から送られた事例文書要素は、教師データ生成部１０４において利用することができる。 When the system user operates the "add to case" button 1506 with the keyboard or mouse, the determination result correction unit 1401 adds the determination target elements whose determination results were corrected to important and/or non-important elements to the classified case document 1402. The contents sent from the determination result correction unit 1401 have the same form as the contents sent from the document classification unit 103 (case document elements classified as important elements and case document elements classified as non-important elements). The case document elements sent from the determination result correction unit 1401 can therefore be used by the teacher data generation unit 104 simply by being added to the classified case document 1402.

また、判定処理部１１１から、判定ルール生成部１０５において処理が可能な形式の内容を取得することも可能であり、その場合、教師データ生成部１０４から出力される内容を格納する記憶領域を設け、その記憶領域に判定結果訂正部１４０１から出力される内容を追加してもよい。なお、この場合は教師データ生成部１０４における単語リストの生成処理が省略されるため、判定対象文書内に含まれる単語は全て、事例文書中に含まれていることが望ましい。 It is also possible to acquire content in a format that can be processed by the determination rule generation unit 105 from the determination processing unit 111. In this case, a storage area for storing the content output from the teacher data generation unit 104 is provided. The contents output from the determination result correction unit 1401 may be added to the storage area. In this case, since the word list generation process in the teacher data generation unit 104 is omitted, it is desirable that all words included in the determination target document are included in the case document.

さらに、判定結果訂正部１４０１から分類済事例文書１４０２等に送られる判定対象要素は、全ての判定対象要素と判定結果の組に関する情報でもよく、システム利用者が訂正を行った判定対象要素に関する情報のみを送ってもよい。または、システムが重要要素と判定した判定対象要素とシステム利用者が訂正を行った判定対象要素のみ、といった組み合わせを、分類済事例文書１４０２に送ってもよい。さらには、それらをシステム利用者が選択することができるようにしてもよい。 Furthermore, the determination target elements sent from the determination result correction unit 1401 to the classified case document 1402 and the like may be information on all pairs of determination target elements and determination results, or only information on the determination target elements that the system user corrected. Alternatively, a combination such as only the determination target elements that the system determined to be important elements and those that the system user corrected may be sent to the classified case document 1402. The system user may also be allowed to choose among these options.
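The three selection options for what is added to the classified case document can be expressed as a small filter. The mode names, the tuple layout, and the sample rows are hypothetical; the flags here are the final labels after any user correction.

```python
from typing import List, Set, Tuple

Row = Tuple[str, bool]  # (element text, final important flag after correction)


def select_for_case(rows: List[Row], corrected: Set[int], mode: str) -> List[Row]:
    """Choose which determination target elements to append to the
    classified case document, according to a user-selected mode."""
    if mode == "all":
        return list(rows)
    if mode == "corrected":
        return [row for i, row in enumerate(rows) if i in corrected]
    if mode == "important_or_corrected":
        # Uncorrected rows with a set flag were judged important by the system;
        # corrected rows are included regardless of their final flag.
        return [row for i, row in enumerate(rows) if row[1] or i in corrected]
    raise ValueError(f"unknown mode: {mode}")


rows = [("CPU must be 3 GHz or faster", True),     # judged important, left as is
        ("See the appendix for contacts", False),  # judged non-important, left as is
        ("Memory of 64 GB is required", True)]     # corrected by the user to important
corrected = {2}

print(len(select_for_case(rows, corrected, "all")))                     # 3
print(select_for_case(rows, corrected, "corrected"))                    # [('Memory of 64 GB is required', True)]
print(len(select_for_case(rows, corrected, "important_or_corrected")))  # 2
```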

図１５に示す「終了」ボタン１５０７をシステム利用者が操作すると、判定対象要素を分類済事例文書１４０２に追加することなく、判定結果の表示を終了する。さらに終了する際、別途、判定結果及びシステム利用者が行った訂正に関する情報を格納するようにしてもよい。 When the system user operates the “end” button 1507 shown in FIG. 15, the display of the determination result is ended without adding the determination target element to the classified case document 1402. Further, when the process is finished, information regarding the determination result and correction made by the system user may be separately stored.

さらに、判定結果を表示する方法には、図１５に示すように、判定対象となる単位である判定文書要素毎に表示する方法の他、判定対象文書の内容をそのまま表示し、重要要素と判定された箇所を判定対象文書の上で識別可能な形で表示する、という方法を採ってもよい。この場合の表示方法の一例を図１６に示す。 Furthermore, as a method of displaying the determination result, as shown in FIG. 15, in addition to the method of displaying for each determination document element that is a unit to be determined, the content of the determination target document is displayed as it is and it is determined as an important element. A method may be employed in which the determined location is displayed in a form that can be identified on the determination target document. An example of the display method in this case is shown in FIG.

図１６は、本発明の第３の実施形態の判定結果の別の表示例を示す説明図である。 FIG. 16 is an explanatory diagram illustrating another display example of the determination result according to the third embodiment of this invention.

The determination result shown in FIG. 16 includes a result display screen 1601, an "Add to cases" button 1602, and an "End" button 1603.

The result display screen 1601 displays the content of the determination target document. Because the processing result of the document division unit 108 makes it possible to obtain the context information in the determination document that corresponds to each determination document element, the determination result for each determination document element is reflected in that context information. The result display screen 1601 shown in FIG. 16 then displays lines 1602 and 1603, which correspond to determination document elements determined to be important elements, with a background color different from the rest.
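As an illustration of reflecting per-element determinations back onto the original document, the following sketch assumes that document division yields character offsets for each determination document element. The function name `render_with_highlights` and the bracket markers, used here in place of a background color, are assumptions made for this example.

```python
def render_with_highlights(document, elements, open_mark="[[", close_mark="]]"):
    """Rebuild the document text, marking the elements judged important.

    elements: list of (start, end, is_important) tuples, where start/end are
    character offsets into `document` and the spans do not overlap.
    """
    out = []
    pos = 0
    for start, end, important in sorted(elements):
        out.append(document[pos:start])  # text between elements is kept as is
        segment = document[start:end]
        out.append(f"{open_mark}{segment}{close_mark}" if important else segment)
        pos = end
    out.append(document[pos:])           # trailing text after the last element
    return "".join(out)
```

A graphical display would replace the markers with a background color, underline, or frame, while the offset bookkeeping stays the same.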

The important elements in FIG. 16 may be displayed by any method, not only a change of background color, as long as they can be easily distinguished from non-important elements. For example, the character color may be changed, or an underline or a frame may be added.

Further, to correct a determination result in the display method shown in FIG. 16, each double-click of the mouse on the line corresponding to the determination document element to be corrected may switch the display between important element and non-important element. Alternatively, a means such as the correction button 1503 shown in FIG. 15 may be provided, and a correction made by selecting the location to be corrected with the keyboard or mouse and then operating the button. The "Add to cases" button 1604 and the "End" button 1605 shown in FIG. 16 have the same functions as the "Add to cases" button 1506 and the "End" button 1507 shown in FIG. 15, respectively.
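The double-click correction can be modeled as a toggle on a per-row flag. The sketch below is independent of any GUI toolkit; the class `ResultView` and its method names are hypothetical, and the event wiring of a real display is left out.

```python
class ResultView:
    """Holds one important/non-important flag per displayed row (hypothetical model)."""

    def __init__(self, system_flags):
        self.original = list(system_flags)  # determinations made by the system
        self.flags = list(system_flags)     # determinations currently displayed

    def on_double_click(self, row):
        # Each double-click switches the row between important and non-important.
        self.flags[row] = not self.flags[row]

    def corrected_rows(self):
        # Rows whose displayed determination differs from the system's.
        return [i for i, (a, b) in enumerate(zip(self.original, self.flags)) if a != b]
```

Double-clicking the same row twice restores the system's determination, so that row is no longer reported as corrected.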

As described above, the third embodiment of the present invention provides a means by which the system user can check the determination results of the system and easily correct them, and adds the results as new case documents, so that a determination rule with high determination accuracy can be built up as the system is used.
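The feedback loop summarized above can be sketched as follows. To keep the example self-contained, a toy word-weight rule stands in for the determination rule generation unit 105; it is not the rule generation method of the embodiments, and all function names are assumptions made for this illustration.

```python
from collections import Counter


def generate_rule(cases):
    """Toy stand-in for rule generation: cases is a list of (text, is_important)."""
    weights = Counter()
    for text, important in cases:
        for word in set(text.lower().split()):
            weights[word] += 1 if important else -1
    return dict(weights)


def judge(rule, text):
    """Judge a text important when the summed word weights are positive."""
    return sum(rule.get(w, 0) for w in set(text.lower().split())) > 0


def add_corrected_and_regenerate(cases, judged, corrections):
    """Add judged elements, with user corrections applied, and rebuild the rule.

    judged: list of (text, system_result); corrections: indices the user flipped.
    """
    updated = list(cases)
    for i, (text, result) in enumerate(judged):
        final = (not result) if i in corrections else result
        updated.append((text, final))
    return updated, generate_rule(updated)
```

Each round of correction enlarges the case set, so the regenerated rule reflects the system user's judgments on documents the earlier rule misjudged.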

101 Case document
102 Reference information description document
103 Document classification unit
104 Teacher data generation unit
105 Determination rule generation unit
106 Determination rule
107 Determination target document
108 Document division unit
109 Document division rule
110 Determination data generation unit
111 Determination processing unit
112 Determination result
203 Reference information extraction unit
205 Matching unit
206 Classification processing unit
401 Word division unit
402 Attribute information extraction unit
404 Word aggregation unit
406 Data conversion unit
1401 Determination result correction unit

Claims

A computer system comprising a processor that performs arithmetic processing and a storage device connected to the processor, the processor analyzing a document, wherein
the document includes a plurality of elements, each of which includes a plurality of words and constitutes text,
the plurality of elements includes a sentence or a paragraph, and
the processor:
receives as input a plurality of first documents, a reference document including references to the first documents, and a second document in which important parts are to be determined;
extracts the elements from each of the first documents, and extracts, as reference information, the locations in the reference document that refer to the first documents;
based on a similarity calculated from the elements extracted from each of the first documents and the reference information, divides the elements extracted from each of the first documents into a first element, which is a location similar to the reference information and is treated as an important part, and a second element, which is a location not similar to the reference information and is treated as a non-important part;
acquires a first feature amount of each document based on the plurality of words included in the divided first element and second element;
generates, based on the acquired first feature amount, a determination rule for determining whether the important part is included;
extracts the elements from the second document;
acquires a second feature amount based on the plurality of words included in the elements extracted from the second document; and
classifies the elements extracted from the second document into important parts and non-important parts by comparing the generated determination rule with the acquired second feature amount.

The computer system according to claim 1, wherein
the plurality of words includes, as words, an attribute name representing an attribute of an object indicated by a word included in the document, and an attribute value that is the value of the attribute corresponding to the attribute name,
the computer system holds extraction information for extracting the attribute name and the attribute value, and
the processor:
extracts all the words indicating the attribute name and the attribute value from each of the first element and the second element by extracting locations that match the extraction information;
acquires the first feature amount including information indicating whether all the words indicating the extracted attribute name are included in each of the first element and the second element, and a value that corresponds to all the words indicating the extracted attribute value and is calculated from the attribute value included in each of the first element and the second element; and
acquires the second feature amount including information indicating whether all the words indicating the extracted attribute name are included in the second document, and a value that corresponds to all the words indicating the extracted attribute value and is calculated from the attribute value included in the second document.

The computer system according to claim 1, wherein
the plurality of words includes, as words, an attribute name representing an attribute of an object indicated by a word included in the document, an attribute value that is the value of the attribute corresponding to the attribute name, and an article name of an article having the attribute corresponding to the attribute name,
the computer system holds extraction information for extracting the attribute name, the attribute value, and the article name, and
the processor:
extracts all the words indicating the attribute name, the attribute value, and the article name from the first element and the second element by extracting locations that match the extraction information;
acquires the first feature amount including information indicating whether all the words indicating the extracted attribute name and article name are included in each of the first element and the second element, and a value that corresponds to all the words indicating the extracted attribute value and is calculated from the attribute value included in each of the first element and the second element; and
acquires the second feature amount including information indicating whether all the words indicating the extracted attribute name and article name are included in the second document, and a value that corresponds to all the words indicating the extracted attribute value and is calculated from the attribute value included in the second document.

The computer system according to claim 2 or 3, wherein
the processor:
acquires the first feature amount and the second feature amount as teacher data expressed in vector format; and
generates the determination rule from the first feature amount using a support vector machine.

The computer system according to claim 2, wherein
the processor acquires words whose meanings are similar to the plurality of words, and
the attribute name includes the similar words.

The computer system according to claim 1, wherein
the computer system holds information indicating a plurality of other elements whose meanings are related to the plurality of elements,
the processor identifies, based on the information indicating the plurality of other elements, locations in the document that include the plurality of other elements,
the processor divides the document into a plurality of elements whose meanings are related by extracting, from the identified locations, the plurality of other elements, or the plurality of other elements together with the plurality of elements whose meanings are related to the plurality of other elements,
the first element includes a group of the plurality of elements that includes the important part, and
the second element includes a group of the plurality of elements that does not include the important part.

The computer system according to claim 1, wherein
the plurality of words includes, as words, an attribute name representing an attribute of an object indicated by a word included in the document, and an attribute value that is the value of the attribute corresponding to the attribute name,
the computer system holds extraction information for extracting the attribute name and the attribute value, and
the processor:
divides the first element, by extracting locations that match the extraction information, into a third element that does not include the attribute name and the attribute value and a fourth element that includes the attribute name and the attribute value;
divides the second element, by extracting locations that match the extraction information, into a fifth element that does not include the attribute name and the attribute value and a sixth element that includes the attribute name and the attribute value;
acquires, based on the third element and the fifth element, a third feature amount including the feature amount of the third element and the feature amount of the fifth element;
generates, based on the acquired third feature amount, a first determination rule for determining whether a document that does not include the attribute name and the attribute value includes the important part;
acquires, based on the fourth element and the sixth element, a fourth feature amount including the feature amount of the fourth element and the feature amount of the sixth element;
generates, based on the acquired fourth feature amount, a second determination rule for determining whether a document that includes the attribute name and the attribute value includes the important part;
extracts the elements from the second document;
divides the elements extracted from the second document into a plurality of seventh elements that do not include the attribute name and the attribute value and a plurality of eighth elements that include the attribute name and the attribute value;
acquires a fifth feature amount based on the seventh elements;
acquires a sixth feature amount based on the eighth elements;
compares the generated first determination rule with the acquired fifth feature amount;
compares the generated second determination rule with the acquired sixth feature amount; and
classifies the elements extracted from the second document into the important part and the non-important part by using both the result of comparing the first determination rule with the fifth feature amount and the result of comparing the second determination rule with the sixth feature amount.

The computer system according to claim 1, wherein
the processor:
is connected to an output device;
outputs to the output device the result of classifying the elements extracted from the second document into important parts and non-important parts;
acquires a correction to the classification result for the elements extracted from the second document; and
adds the elements extracted from the second document to the first element or the second element in accordance with the classification result for the second document and the correction, and generates the determination rule.