JP2012068844A

JP2012068844A - Document comparison processor and document comparison processing program

Info

Publication number: JP2012068844A
Application number: JP2010212470A
Authority: JP
Inventors: Yasushi Ito; 泰伊藤
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2010-09-22
Filing date: 2010-09-22
Publication date: 2012-04-05

Abstract

PROBLEM TO BE SOLVED: To provide a document comparison processor and a document comparison processing program detecting consistency of the content of document information based on comparison results of a plurality of pieces of document information.

SOLUTION: A document comparison processor 1 includes: document format classification means 100 for classifying document information for every document format; format template extraction means 102 for extracting character strings whose positions and contents are in common among the pieces of document information of the same format, as format template information 111; case common-value extraction means 103 for extracting along with the positions, character strings that are in common among the pieces of document information contained in the same case, as common value information 112; case template extraction means 105 for extracting case template information 113 forming an input column from the document information of the case by excluding the character strings contained in the common value information 112; and error detecting means 106 for applying the content corresponding to the input column of the case template information 113 to the common value information 112 in the document information that is an object of the error detection, to detect character strings that have different contents from each other.

Description

The present invention relates to a document comparison processing apparatus and a document comparison processing program.

Techniques have been proposed for determining whether the contents of document information are consistent.

As a related technique, Patent Document 1 discloses a document data processing apparatus that collates the description information in input document information against a document structure table, which associates multiple types of item names appearing in the document information with their appearance positions and expression patterns, to identify the description information for each item name; collates the identified description information against an analysis rule table, which associates item names with rules, to determine whether the description information for each item name conforms to the rules; and, when it does not, displays a message to that effect on a display unit.

Patent Document 1: JP 2007-122661 A

An object of the present invention is to provide a document comparison processing apparatus and a document comparison processing program that detect the consistency of the contents of document information from the results of comparing a plurality of pieces of document information.

[1] A document comparison processing program for causing a computer to function as:
format classification means for, when there are a plurality of cases each having a plurality of pieces of document information with related contents, classifying the plurality of pieces of document information included in the plurality of cases by document format;
first template extraction means for extracting, as a first template, character strings whose positions and contents are common among the plurality of pieces of document information of a document format classified by the format classification means;
common value information extraction means for extracting character strings whose contents are common among the plurality of pieces of document information included in the same case, together with the positions at which the character strings are written, and for removing, from the extracted common character strings, those whose positions are the same as character strings of the first template, to extract the result as common value information;
second template extraction means for extracting a second template in which the character strings included in the common value information are removed from each piece of document information included in the case and turned into input fields; and
detection means for, in the document information of a target case, applying the character strings corresponding to the input fields of the second template to the common value information and detecting character strings whose contents differ from each other.

[2] The document comparison processing program according to [1], further causing the computer to function as third template extraction means for classifying the second templates of a plurality of cases by format, extracting, from among the input fields of the second templates, the input fields whose positions are common at a predetermined ratio among the plurality of second templates of the same format, and extracting a third template made up of the extracted input fields,
wherein the detection means applies the contents entered in the input fields of the third template to the common value information and detects character strings whose contents differ from each other.

[3] A document comparison processing apparatus comprising:
format classification means for, when there are a plurality of cases each having a plurality of pieces of document information with related contents, classifying the plurality of pieces of document information included in the plurality of cases by document format;
first template extraction means for extracting, as a first template, character strings whose positions and contents are common among the plurality of pieces of document information of a document format classified by the format classification means;
common value information extraction means for extracting character strings whose contents are common among the plurality of pieces of document information included in the same case, together with the positions at which the character strings are written, and for removing, from the extracted common character strings, those whose positions are the same as character strings of the first template, to extract the result as common value information;
second template extraction means for extracting a second template in which the character strings included in the common value information are removed from each piece of document information included in the case and turned into input fields; and
detection means for, in the document information of a target case, applying the character strings corresponding to the input fields of the second template to the common value information and detecting character strings whose contents differ from each other.

According to the invention of claim 1 or 3, the consistency of the contents of document information can be detected from the results of comparing a plurality of pieces of document information.

According to the invention of claim 2, the contents of document information that should be common across a plurality of cases can be detected from the results of comparing a plurality of pieces of document information.

FIG. 1 is a schematic diagram illustrating a configuration example of a document comparison processing system.
FIG. 2 is a block diagram illustrating a configuration example of the document comparison processing apparatus.
FIG. 3 is a schematic diagram illustrating an example of a storage path structure of document information.
FIGS. 4(a) and 4(b) are schematic diagrams each illustrating an example of document information classified by format.
FIGS. 5(a) and 5(b) are schematic diagrams each illustrating an example of format template information extracted for each format.
FIG. 6 is a schematic diagram illustrating an example of a storage path structure of document information.
FIGS. 7(a) and 7(b) are schematic diagrams each illustrating an example of document information classified by case.
FIG. 8 is a schematic diagram illustrating an example of common values extracted from the document information included in a case.
FIG. 9 is a schematic diagram illustrating an example of the contents of common value information composed of the extracted common values.
FIG. 10 is a schematic diagram illustrating an example of the contents of common value information from which character strings matching the character strings of the format template information have been removed.
FIG. 11 is a schematic diagram illustrating an example of the contents of common value information from which character strings matching the character strings of the format template information have been removed.
FIG. 12 is a schematic diagram illustrating an example of the contents of case template information.
FIG. 13(a) is a schematic diagram illustrating an example of character strings corresponding to the contents of common value information in a plurality of cases, and FIG. 13(b) is a schematic diagram illustrating an example of the common values of those character strings.
FIGS. 14(a) to 14(d) are schematic diagrams for explaining an operation example of the error detection means.

(Configuration of the document comparison processing system)
FIG. 1 is a schematic diagram illustrating a configuration example of the document comparison processing system.

The document comparison processing system 4 is configured by connecting a document comparison processing apparatus 1 and a document database (DB) 2 via a network 3 so that they can communicate with each other.

The document comparison processing apparatus 1 is an information processing apparatus that includes electronic components such as a CPU (Central Processing Unit) with information processing functions and a storage unit, analyzes the contents of a plurality of pieces of document information 200 stored in the document DB 2, and detects character strings in the document information that are highly likely to be erroneous.

The document comparison processing apparatus 1 also includes a display unit 12, such as a liquid crystal display, that displays images, and an operation unit 13, such as a keyboard, mouse, or touch pad, that issues operation signals in response to operations. The document comparison processing apparatus 1 is, for example, a personal computer; a PDA (Personal Digital Assistant), a mobile phone, or the like can also be used.

The document DB 2 stores document information 200 and the like composed of information such as text and images. In the present embodiment, the document information 200 consists of, for example, documents needed as business proceeds, such as estimate request forms, estimate reply forms, delivery request forms, and delivery reply forms.

The network 3 is a communication network such as a LAN (Local Area Network) or the Internet, and may be wired or wireless.

(Configuration of the document comparison processing apparatus)
FIG. 2 is a block diagram illustrating a configuration example of the document comparison processing apparatus 1.

The document comparison processing apparatus 1 includes a control unit 10 that is composed of a CPU and the like, controls each unit, and executes various programs; a storage unit 11 that is a storage medium such as an HDD (Hard Disk Drive) or flash memory and stores information; the display unit 12, such as a liquid crystal display, that displays images as described above; the operation unit 13, such as a keyboard, mouse, or touch pad, that issues operation signals in response to operations; and a communication unit 14 that communicates with the outside via the network 3.

By executing a document comparison processing program 110 described later, the control unit 10 functions as document format classification means 100, document case classification means 101, format template extraction means 102, case common value extraction means 103, format common value removal means 104, case template extraction means 105, error detection means 106, common value range extraction means 107, and the like.

The document format classification means 100 classifies the document information 200 in the document DB 2 by document format, for example, estimate request form, estimate reply form, delivery request form, or delivery reply form, according to the file name or storage location of the document information 200. The document information 200 may instead be classified by, for example, tagging it.

The document case classification means 101 uses the file name or storage location (storage path) of the document information 200 to classify a series of document information in the document DB 2 that is sent to or received from a given counterparty, for example, an estimate request form, estimate reply form, delivery request form, delivery reply form, and so on, as a case for that counterparty.

The format template extraction means 102 extracts, as format template information 111, the character strings that are common among pieces of document information that share a format across different cases.

The case common value extraction means 103 extracts, as common value information 112, the character strings that are common among pieces of document information of different formats within the same case.

The format common value removal means 104 removes, from the common value information 112 extracted by the case common value extraction means 103, the character strings that match the format template information 111 extracted by the format template extraction means 102.

The case template extraction means 105 extracts, as case template information 113, the character strings in the document information of the same case other than those in the common value information 112 as processed by the format common value removal means 104.

The error detection means 106 compares document information with the case template information 113 and, when the character strings entered at the common-value positions of the case template information 113 differ from each other, detects the differing character strings as errors in the contents of the document information.

The common value range extraction means 107 compares the common values of the case template information 113 across different cases and extracts the common value information that is shared among them.

The storage unit 11 stores the document comparison processing program 110 that causes the control unit 10 to operate as the means 100 to 106 described above, the format template information 111 extracted and output by the format template extraction means 102, the common value information 112 output by the case common value extraction means 103 and the format common value removal means 104, the case template information 113 output by the case template extraction means 105, and the like.

FIG. 3 is a schematic diagram illustrating an example of a storage path structure, which is the storage location of the document information 200.

The storage path structure 20 shows the structure of the storage paths of the document information 200a, 200b…, 201a, 201b…. For example, the "estimate work" folder contains folders named after cases, such as "estimate 0001" and "estimate 0002", and each of the "estimate 0001" and "estimate 0002" folders contains folders such as "estimate", "approval", "delivery", and "order".

The "estimate" folder of "estimate 0001" stores document information 200a for the "estimate request" and document information 200b for the "estimate reply". Here, the document information 200a included in the "estimate request" folder is referred to as format A, and the document information 200b included in "estimate reply" is referred to as format B.

Similarly, "estimate" under "estimate 0002" stores document information 201a for the "estimate request" (format A) and document information 201b for the "estimate reply" (format B).
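The path-based classification by format and by case described above might be sketched as follows. This is a minimal illustration only, assuming a five-level path of the form business/case/phase/format/file; the English folder names, the file names, and the function name `classify_by_path` are hypothetical and do not appear in the patent.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def classify_by_path(paths):
    """Group document paths by format and by case.

    Assumed (hypothetical) layout: <business>/<case>/<phase>/<format>/<file>.
    """
    by_format = defaultdict(list)
    by_case = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        case, fmt = parts[1], parts[3]
        by_format[fmt].append(path)
        by_case[case].append(path)
    return dict(by_format), dict(by_case)


# Hypothetical paths mirroring the folder structure of FIG. 3.
paths = [
    "estimate_work/estimate0001/estimate/estimate_request/200a.txt",
    "estimate_work/estimate0001/estimate/estimate_reply/200b.txt",
    "estimate_work/estimate0002/estimate/estimate_request/201a.txt",
    "estimate_work/estimate0002/estimate/estimate_reply/201b.txt",
]
by_format, by_case = classify_by_path(paths)
```

Here `by_format` plays the role of the grouping into formats A and B, and `by_case` the grouping into individual cases; the same path is deliberately placed in both groupings because later steps compare documents along both axes.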

(Operation of the document comparison processing apparatus)
An operation example of the document comparison processing apparatus 1 is described below with reference to FIGS. 1 to 15, divided into (1) basic operation, (2) format template extraction operation, (3) case template extraction operation, and (4) error detection operation.

(1) Basic operation
First, a user operates a terminal device or the like (not shown) to create the document information 200a, 200b…, 201a, 201b… shown in FIG. 3. At the user's request, the terminal device stores the created document information 200a, 200b…, 201a, 201b… in the document DB 2. The document information 200a, 200b…, 201a, 201b… may be created by a plurality of users.

Next, an administrator operates the document comparison processing apparatus 1 to manage the status of the creation of the document information 200a, 200b…, 201a, 201b… by the plurality of users. Specifically, the administrator uses the document comparison processing apparatus 1 to monitor whether the created document information 200a, 200b…, 201a, 201b… contains errors caused by input mistakes or the like.

First, the administrator operates the operation unit 13 of the document comparison processing apparatus 1 to select, from the document DB 2, the document information to be monitored. In response to the operation signal output from the operation unit 13, the document comparison processing apparatus 1 reads, for example, the document information 200a, 200b…, 201a, 201b… from the document DB 2.

(2) Format template extraction operation
Next, the document format classification means 100 classifies the read document information 200a, 200b…, 201a, 201b… by format.

FIGS. 4(a) and 4(b) are schematic diagrams each illustrating an example of document information classified by format.

The document format classification means 100 classifies the document information 200a, 200b…, 201a, 201b… into format A, the "estimate request form" shown in FIG. 4(a), and format B, the "estimate reply form" shown in FIG. 4(b). Format A contains the document information 200a and 201a, and format B contains the document information 200b and 201b.

Next, from the document information included in each format classified by the document format classification means 100, the format template extraction means 102 extracts the character strings that have the same content at the same position and outputs them as format template information 111.

FIGS. 5(a) and 5(b) are schematic diagrams each illustrating an example of format template information extracted for each format. The format template information shown in FIGS. 5(a) and 5(b) includes the contents and position information of the character strings outside the shaded portions.

The format template extraction means 102 deletes the character strings 102a that are not common to the document information 200a and 201a shown in FIG. 4(a), extracts the common character strings, and outputs the format template information 111a shown in FIG. 5(a).

Likewise, the format template extraction means 102 deletes the character strings 102b that are not common to the document information 200b and 201b shown in FIG. 4(b), extracts the common character strings, and outputs the format template information 111b shown in FIG. 5(b).
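As a rough sketch of the format template extraction just described, a document can be modeled as a mapping from a position to the character string written there, and the template as the entries that are identical in every document of the format. The position labels, the sample contents, and the function name `extract_format_template` are assumptions made for illustration; the patent does not specify a data representation.

```python
def extract_format_template(documents):
    """Return the {position: text} entries shared by every document of one format.

    Expects at least one document; each document is a {position: text} dict.
    """
    first, *rest = documents
    return {
        pos: text
        for pos, text in first.items()
        if all(other.get(pos) == text for other in rest)
    }


# Hypothetical format-A documents from two different cases.
doc_200a = {"A1": "Estimate Request", "A2": "To:", "B2": "ABC Corp.",
            "A3": "Item:", "B3": "Printer"}
doc_201a = {"A1": "Estimate Request", "A2": "To:", "B2": "XYZ Inc.",
            "A3": "Item:", "B3": "Scanner"}

template_a = extract_format_template([doc_200a, doc_201a])
# template_a keeps only the fixed labels: positions A1, A2, and A3.
```

The entries that differ between the documents (B2 and B3 in this example) correspond to the non-common character strings that the text says are deleted.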

(3) Case template extraction operation
Next, the document case classification means 101 classifies the document information 200a, 200b…, 201a, 201b… read from the document DB 2 by case.

FIG. 6 is a schematic diagram illustrating an example of the storage path structure of the document information 200.

In the storage path structure 20, the document information 200a, 200b… included in "estimate 0001" and the document information 201a, 201b… included in "estimate 0002", each of which denotes a case, are classified as "case 01" and "case 02", respectively.

FIGS. 7(a) and 7(b) are schematic diagrams each illustrating an example of document information classified by case.

The document case classification means 101 classifies the document information 200a, 200b…, 201a, 201b… into case 01, the "estimate 0001" shown in FIG. 7(a), and case 02, the "estimate 0002" shown in FIG. 7(b). Case 01 contains the document information 200a and 200b, and case 02 contains the document information 201a and 201b.

Next, the case common value extraction means 103 extracts, as common values, the character strings whose contents are common to the pieces of document information belonging to the same case. A common value only needs to have the same entered content; the position at which it is written may differ.

Because the case common value extraction means 103 operates in the same way for cases 01 and 02, case 01 is described below as a representative.

FIG. 8 is a schematic diagram illustrating an example of the common values extracted from the document information included in case 01.

The case common value extraction means 103 extracts the character strings whose contents are common to the document information 200a, 200b… as common values 112a to 112g. In the present embodiment, only exactly matching character strings are treated as common values, but character strings may also be extracted as common values when they are not an exact match, for example, when a predetermined number of characters match or when a formal name and its abbreviation correspond.

FIG. 9 is a schematic diagram illustrating an example of the contents of common value information composed of the common values extracted from the document information of case 01.

From the common values extracted in FIG. 8, the case common value extraction means 103 outputs, as common value information 112, the common value ranges 112A, 112B, 112C, 112D… corresponding to the common values 112a, 112b, 112c, 112d….

Each of the common value ranges 112A, 112B, 112C, 112D… has a document format column indicating the format of the document from which the common value was extracted, a range column indicating the range of positions in the document where the common value is written, and a value column indicating the extracted content. In the range column, the ranges of the common values 112a, 112b, 112c, and 112d in the document information 200a are written as A_1, B_1, C_1, and D_1, respectively, and their ranges in the document information 200b are written as A_2, B_2, C_2, and D_2, respectively.
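The extraction of case common values and their format, range, and value columns could be sketched as below. This is a simplified illustration under stated assumptions: documents are modeled as position-to-string mappings, only exact matches are considered (the looser matching the text permits is not covered), and the record keys, sample contents, and function name `extract_common_values` are hypothetical.

```python
def extract_common_values(case_docs):
    """List strings present in every document of one case, with their positions.

    case_docs maps a format name to a {position: text} dict.
    Each record mirrors the format / range / value columns described above.
    """
    shared = set.intersection(*(set(doc.values()) for doc in case_docs.values()))
    records = []
    for value in sorted(shared):
        for fmt, doc in case_docs.items():
            for pos, text in doc.items():
                if text == value:
                    records.append({"format": fmt, "range": pos, "value": value})
    return records


# Hypothetical case 01: the same values appear at different positions per format.
case_01 = {
    "A": {"A1": "Estimate Request", "B2": "ABC Corp.", "B3": "Printer"},
    "B": {"A1": "Estimate Reply", "C1": "ABC Corp.", "C4": "Printer", "C5": "1,000 yen"},
}
common_values = extract_common_values(case_01)
```

Matching is done on content alone, so a value found at B2 in format A and at C1 in format B still counts as common, in line with the remark that the written positions may differ.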

Next, the format common value removal means 104 removes, from the common value information 112, the character strings that match the character strings of the format template information 111.

FIG. 10 is a schematic diagram illustrating an example of the contents of the common value information 112 from which the character strings matching the character strings of the format template information 111 have been removed.

The format common value removal means 104 removes, from the common values 112a to 112g of the document information 200a and 200b shown in FIG. 8, the character strings that match the character strings of the format template information 111, that is, the character strings 104a and 104b corresponding to 112c, 112e, 112f, 112g, and 112h.

FIG. 11 is a schematic diagram illustrating an example of the contents of the common value information from which the character strings matching the character strings of the format template information 111 have been removed.

By removing the common values 112c, 112e, 112f, 112g, and 112h, the format common value removal means 104 outputs the common value ranges 112A, 112B, and 112D as the common value information 112.
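One way to picture this removal step is as a filter over common value records that drops every record whose value sits at the same position in the format template of its format. The sketch below assumes that reading (matching on both position and content, as in claim [1]); the record layout, sample data, and function name `remove_format_common` are hypothetical.

```python
def remove_format_common(records, format_templates):
    """Drop common-value records that coincide with the format template.

    records: list of {"format", "range", "value"} dicts.
    format_templates: {format: {position: text}} of fixed template strings.
    """
    return [
        rec
        for rec in records
        if format_templates.get(rec["format"], {}).get(rec["range"]) != rec["value"]
    ]


# Hypothetical data: "Item:" is a fixed label in both formats, "Printer" is not.
records = [
    {"format": "A", "range": "A3", "value": "Item:"},
    {"format": "B", "range": "A3", "value": "Item:"},
    {"format": "A", "range": "B3", "value": "Printer"},
    {"format": "B", "range": "C4", "value": "Printer"},
]
format_templates = {"A": {"A3": "Item:"}, "B": {"A3": "Item:"}}

filtered = remove_format_common(records, format_templates)
```

After filtering, only the records for values that genuinely vary from case to case remain, which is what makes them useful as candidates for consistency checks.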

Next, the case template extraction means 105 deletes the contents of the common value information 112 shown in FIG. 11 from the document information 200a and 200b and extracts the case template information 113_1a and 113_1b of case 01 described below.

FIG. 12 is a schematic diagram illustrating an example of the contents of the case template information.

The case template information 113_1a and 113_1b are templates obtained by deleting the character strings 105_1a and 105_1b, which correspond to the contents of the common value information 112, from the document information 200a and 200b of case 01.

The case template information 113_2a and 113_2b is extracted for case 02 in the same way.
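A minimal sketch of the case template extraction, assuming the same position-to-string document model, is to blank out the positions listed in the common value information so that they become input fields while every other string is kept. The `INPUT_FIELD` marker, the sample data, and the function name `extract_case_template` are illustrative assumptions.

```python
INPUT_FIELD = None  # hypothetical marker for a blanked-out input field


def extract_case_template(doc, common_ranges):
    """Blank out the positions in common_ranges, keeping all other strings.

    doc: {position: text}; common_ranges: positions holding common values.
    """
    return {
        pos: (INPUT_FIELD if pos in common_ranges else text)
        for pos, text in doc.items()
    }


# Hypothetical document 200a; B2 and B3 hold common values of case 01.
doc_200a = {"A1": "Estimate Request", "A2": "To:", "B2": "ABC Corp.",
            "A3": "Item:", "B3": "Printer"}
template_1a = extract_case_template(doc_200a, {"B2", "B3"})
```

The blanked positions are the ones later refilled with the values of a target document during error detection.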

次に、共通値範囲抽出手段１０７は、共通値情報１１２の内容に該当する文字列の共通値範囲を抽出する。 Next, the common value range extraction unit 107 extracts a common value range of character strings corresponding to the contents of the common value information 112.

図１３（ａ）は、複数案件における共通値情報１１２の内容に該当する文字列の一例を示す概略図であり、図１３（ｂ）は、複数案件における共通値情報１１２の内容に該当する文字列の共通値の一例を示す概略図である。 FIG. 13A is a schematic diagram illustrating an example of a character string corresponding to the content of the common value information 112 in a plurality of cases, and FIG. 13B is a character corresponding to the content of the common value information 112 in a plurality of cases. It is the schematic which shows an example of the common value of a column.

As shown in FIG. 13A, the common value range extraction means 107 classifies the deleted character strings 105_1a and 105_1b, 105_2a and 105_2b, ... of the plurality of cases by format, and extracts the character strings whose description ranges are common within each format as the character strings 105a and 105b. As shown in FIG. 13B, it then extracts these as the common value template information 113a and 113b shared across the cases. The common description ranges across the plurality of cases may be ranges that match exactly in all cases, or ranges that match in a predetermined proportion of the cases.
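The "exact match in all cases" and "predetermined proportion" variants can be sketched with a single threshold. The sketch assumes that, for one format, each case contributes the set of positions from which common values were deleted; the function name `common_input_positions`, the parameter `min_ratio`, and the position sets are hypothetical.

```python
from collections import Counter


def common_input_positions(removed_by_case, min_ratio=1.0):
    """Return positions present in at least `min_ratio` of the cases (one format)."""
    n_cases = len(removed_by_case)
    if n_cases == 0:
        return set()
    # Count in how many cases each position was a deleted common-value range.
    counts = Counter(pos for positions in removed_by_case for pos in set(positions))
    return {pos for pos, count in counts.items() if count / n_cases >= min_ratio}


# Invented deleted-range positions for three cases of the same format.
cases = [{1, 2, 3}, {1, 2}, {1, 4}]
strict = common_input_positions(cases)                 # must appear in every case
relaxed = common_input_positions(cases, min_ratio=0.6)  # must appear in 60% of cases
```

With `min_ratio=1.0` only ranges shared by every case survive; lowering the threshold tolerates cases in which a range happens to be absent.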

(4) Error Detection Operation
Next, the error detection means 106 detects errors in document information on the basis of the common value template information 113a and 113b extracted by the common value range extraction means 107.

FIGS. 14A to 14D are schematic diagrams for explaining an example of the operation of the error detection means 106. FIG. 14A shows the storage path structure of the document information, FIG. 14B shows the content of the document information, FIG. 14C shows the common value templates, and FIG. 14D shows the content of the common value information.

First, as shown in FIG. 14A, the error detection means 106 selects, for example, case 03 as the target of error detection, and acquires the document information 202a and 202b included in case 03 from the document DB 2.

Next, the error detection means 106 compares, between matching formats, the content of the acquired document information 202a and 202b shown in FIG. 14B with the character string 105a of the common value template information 113a and the character string 105b of the common value template information 113b shown in FIG. 14C. It extracts, from the character strings of the document information 202a and 202b, the values that fall within the ranges of the character strings 105a and 105b, and generates the common value information 112A, 112B, and 112D shown in FIG. 14D.

Next, the error detection means 106 detects, among the common value information 112A, 112B, and 112D, the inconsistent portions 1120B and 1120D whose value fields do not match each other, and determines that the description ranges corresponding to the inconsistent portions 1120B and 1120D contain errors.
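The extraction and comparison can be sketched as follows. The sketch assumes each document is a mapping from position to character string and each common value template maps a field label to the position of that field in its format; the function name `detect_inconsistencies`, the use of the labels 112A, 112B, and 112D as dictionary keys, and all sample values are invented for illustration.

```python
def detect_inconsistencies(documents, templates):
    """Collect each field's value per format and return the fields that disagree.

    documents: {format: {position: text}}
    templates: {format: {field_label: position}}
    """
    values = {}
    for fmt, fields in templates.items():
        for field, pos in fields.items():
            values.setdefault(field, {})[fmt] = documents.get(fmt, {}).get(pos)
    # A field is inconsistent when its values differ between formats.
    return {
        field: per_format
        for field, per_format in values.items()
        if len(set(per_format.values())) > 1
    }


# Invented stand-ins for document information 202a and 202b of case 03.
documents = {
    "a": {0: "ACME Corp", 1: "2010-09-22", 2: "1,000"},
    "b": {5: "ACME Corp", 6: "2010-09-23", 7: "10,000"},
}
templates = {
    "a": {"112A": 0, "112B": 1, "112D": 2},
    "b": {"112A": 5, "112B": 6, "112D": 7},
}
errors = detect_inconsistencies(documents, templates)
```

In this invented example the fields labeled 112B and 112D disagree between the two formats, mirroring the inconsistent portions 1120B and 1120D, while 112A is consistent and is not reported.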

The error detection means 106 displays the locations of the detected errors on the display unit 12 together with the document information, and presents them to the user.

[Other Embodiments]
The present invention is not limited to the above embodiment, and various modifications can be made without departing from the gist of the present invention.

The document comparison processing program 110 may also be provided stored on a storage medium such as a CD-ROM, or may be downloaded to a storage unit in the apparatus from a server device or the like connected to a network such as the Internet. Some or all of the document format classification means 100, the document case classification means 101, the format template extraction means 102, the case common value extraction means 103, the format common value removal means 104, the error detection means 106, and the common value range extraction means 107 may be implemented by hardware such as an ASIC. The steps shown in the description of operation in the above embodiment may be reordered, omitted, or supplemented with additional steps.

DESCRIPTION OF SYMBOLS
1 ... document comparison processing apparatus, 2 ... document database (DB), 3 ... network, 4 ... document comparison processing system, 10 ... control unit, 11 ... storage unit, 12 ... display unit, 13 ... operation unit, 14 ... communication unit, 20 ... storage path structure, 100 ... document format classification means, 101 ... document case classification means, 102 ... format template extraction means, 103 ... case common value extraction means, 104 ... format common value removal means, 105 ... case template extraction means, 106 ... error detection means, 107 ... common value range extraction means, 110 ... detection program, 111 ... format template information, 112 ... common value information, 113 ... case template information, 200 ... document information

Claims

A document comparison processing program for causing a computer to function as:
format classification means for classifying, when there are a plurality of cases each having a plurality of pieces of document information with related content, the plurality of pieces of document information included in the plurality of cases by document format;
first template extraction means for extracting, as a first template, character strings whose positions and contents are common among the plurality of pieces of document information of a document format classified by the format classification means;
common value information extraction means for extracting character strings whose contents are common among the plurality of pieces of document information included in the same case, together with the positions at which the character strings are described, removing from the extracted common character strings those whose positions are the same as the character strings of the first template, and extracting the result as common value information;
second template extraction means for extracting a second template by removing the character strings included in the common value information from each piece of document information included in the case and treating the removed portions as input fields; and
detection means for applying, in the document information of a target case, the character strings corresponding to the input fields of the second template to the common value information, and detecting character strings whose contents differ.

The document comparison processing program according to claim 1, further causing the computer to function as third template extraction means for classifying the second templates of a plurality of cases by format, extracting input fields whose positions are common, at a predetermined ratio, among the plurality of second templates of the same format, and extracting a third template including the extracted input fields,
wherein the detection means applies the content entered in the input fields of the third template to the common value information to detect character strings whose contents differ.

A document comparison processing apparatus comprising:
format classification means for classifying, when there are a plurality of cases each having a plurality of pieces of document information with related content, the plurality of pieces of document information included in the plurality of cases by document format;
first template extraction means for extracting, as a first template, character strings whose positions and contents are common among the plurality of pieces of document information of a document format classified by the format classification means;
common value information extraction means for extracting character strings whose contents are common among the plurality of pieces of document information included in the same case, together with the positions at which the character strings are described, removing from the extracted common character strings those whose positions are the same as the character strings of the first template, and extracting the result as common value information;
second template extraction means for extracting a second template by removing the character strings included in the common value information from each piece of document information included in the case and treating the removed portions as input fields; and
detection means for applying, in the document information of a target case, the character strings corresponding to the input fields of the second template to the common value information, and detecting character strings whose contents differ.