JP2005141295A

JP2005141295A - Device, method and program for retrieving document

Info

Publication number: JP2005141295A
Application number: JP2003374275A
Authority: JP
Inventors: Hiroshi Matsuda; 寛松田; Hiroki Tanioka; 広樹谷岡; Hitoshi Uno; 仁宇野
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2003-11-04
Filing date: 2003-11-04
Publication date: 2005-06-02
Anticipated expiration: 2023-11-04
Also published as: JP4446714B2

Abstract

PROBLEM TO BE SOLVED: To enable a searcher, when retrieving a desired character string by designating its position in a document, to designate the position and the character string accurately and comprehensively with a simple operation.

SOLUTION: The group of documents to be searched is analyzed in advance, and a list of the attribute names and a list of the values that appear in the tables in those documents are created. The attribute names extracted at that time are classified into groups of names that are semantically identical or similar (more concretely, groups of names whose notation is identical or similar). On instruction from the searcher, all attribute names appearing in the tables of the target documents are listed group by group. When a given attribute is designated, the values appearing under that attribute are listed. Even when there are many tables with subtle notational variations in attribute names, values, and so on, search omissions caused by those variations are avoided and an accurate search can be performed.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a document search apparatus, a document search method, and a document search program for retrieving electronic documents that contain a specified character string at a specified position.

Conventional document search techniques usually allow a character string in a document to be specified as a search condition, but not the position of that string within the document. Consequently, for a document with a table structure such as a spreadsheet, a document that contains the specified string as a value of attribute (column item) X and a document that contains it as a value of attribute (column item) Y are retrieved alike.

One conceivable remedy is to let the searcher specify, together with an arbitrary character string, the position at which it must appear. This makes it possible, for example, to retrieve only the documents that contain the specified string in the specified attribute.

In that case, however, the searcher must specify the desired attribute accurately, which is not necessarily easy when there are many documents to search. Suppose, for example, that there are customer lists A, B, ..., N and the searcher wants the documents that include customers living in Tokushima City (徳島市). If the customer's address is stored under the attribute "address" (住所) in document A, "current address" (現住所) in document B, ..., and "address / location" (住所／所在地) in document N, the searcher must specify all of "address", "current address", ..., "address / location" as the attributes in which the string "Tokushima City" should appear.

Moreover, values that are substantially the same are often formatted differently from table to table. If, for example, a telephone number is written as "088-666-****" in document A, "088(666)****" in document B, ..., and "088666****" in document N, the searcher must specify all of the strings "088-666-****", "088(666)****", ..., "088666****".

Thus, in the prior art, semantically identical information may be stored under different attribute names or in different formats from table to table. A search is exhaustive only if the searcher knows exactly under which attribute names and in which formats the desired string appears across all the documents to be searched. At present, collecting those attribute names and formats depends entirely on the searcher's experience and manual work, which is a problem.

To solve the above problems of the prior art, an object of the present invention is to provide a document search apparatus, a document search method, and a document search program that classify the attributes appearing in search target documents having a table structure into groups of semantically identical or similar attributes and display them as a list, and that can also display a list of the actual values of each attribute.

To solve the above problems and achieve this object, the document search apparatus according to the invention of claim 1 is a document search apparatus for retrieving an electronic document that contains a specified character string at a specified position, comprising: attribute name extraction means for extracting the name of each attribute in a table included in the electronic document; attribute name classification means for classifying the names extracted by the attribute name extraction means into groups of semantically identical or similar names; and attribute name display means for displaying the names classified by the attribute name classification means group by group.

According to the invention of claim 1, the names of the table attributes appearing in the search target documents can be listed group by group.

The document search apparatus according to the invention of claim 2 is the apparatus of claim 1, wherein the attribute name classification means classifies the names extracted by the attribute name extraction means into groups of names whose notation is identical or similar.

According to the invention of claim 2, the attributes are classified into groups on the basis of the similarity of their names.

The document search apparatus according to the invention of claim 3 is the apparatus of claim 1 or claim 2, further comprising: value extraction means for extracting the values of each attribute whose name has been extracted by the attribute name extraction means; and value display means for displaying the values extracted by the value extraction means.

According to the invention of claim 3, the values in the tables appearing in the search target documents can be displayed as a list.

The document search method according to the invention of claim 4 is a document search method for retrieving an electronic document that contains a specified character string at a specified position, comprising: an attribute name extraction step of extracting the name of each attribute in a table included in the electronic document; an attribute name classification step of classifying the names extracted in the attribute name extraction step into groups of semantically identical or similar names; and an attribute name display step of displaying the names classified in the attribute name classification step group by group.

According to the invention of claim 4, the names of the table attributes appearing in the search target documents can be listed group by group.

The document search method according to the invention of claim 5 is the method of claim 4, wherein the attribute name classification step classifies the names extracted in the attribute name extraction step into groups of names whose notation is identical or similar.

According to the invention of claim 5, the attributes are classified into groups on the basis of the similarity of their names.

The document search method according to the invention of claim 6 is the method of claim 4 or claim 5, further comprising: a value extraction step of extracting the values of each attribute whose name has been extracted in the attribute name extraction step; and a value display step of displaying the values extracted in the value extraction step.

According to the invention of claim 6, the values in the tables appearing in the search target documents can be displayed as a list.

The document search program according to the invention of claim 7 causes a computer to execute the method according to any one of claims 4 to 6.

The document search apparatus, document search method, and document search program according to the present invention can search for a character string at a designated position in a document, and have the effect that the searcher can specify that position and character string accurately and comprehensively with a simple operation.

Preferred embodiments of the document search apparatus, document search method, and document search program according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is an explanatory diagram showing an example of the hardware configuration of a document search apparatus according to an embodiment of the present invention. In the figure, 101 denotes a CPU that controls the entire apparatus, 102 denotes a ROM that stores a basic input/output program and the like, and 103 denotes a RAM used as the work area of the CPU 101.

104 denotes an HDD (hard disk drive) that controls reading and writing of data on the HD (hard disk) 105 under the control of the CPU 101, and 105 denotes the HD that stores the data written under the control of the HDD 104.

106 denotes an FDD (flexible disk drive) that controls reading and writing of data on the FD (flexible disk) 107 under the control of the CPU 101, and 107 denotes the removable FD that stores the data written under the control of the FDD 106.

108 denotes a CD-RW drive that controls reading and writing of data on the CD-RW 109 under the control of the CPU 101, and 109 denotes the removable CD-RW that stores the data written under the control of the CD-RW drive 108.

110 denotes a display that shows a cursor, menus, windows, and various data such as characters and images; 111 denotes a keyboard with keys for entering characters, numerical values, and various instructions; and 112 denotes a mouse used to select and execute instructions, select processing targets, move the mouse pointer, and so on.

113 denotes a network I/F that is connected to a network such as a LAN or WAN via the communication cable 114 and serves as the interface between that network and the CPU 101, and 100 denotes a bus that connects the above components.

FIG. 2 is an explanatory diagram showing the functional configuration of the document search apparatus according to the embodiment of the present invention. In the figure, 200 is a document storage unit, which holds the documents to be searched by the document search unit 204 described later. Specifically, these documents are documents created with so-called spreadsheet software, HTML documents containing <table> tags, more generally XML documents containing predetermined tags that denote a table, or image data (GIF files, PDF files, and the like) containing table images delimited by ruled lines. Each document is assumed to contain at least one table.

201 is a document analysis unit. It analyzes each document held in the document storage unit 200, first determining the document's orientation (vertical or horizontal), and then, from the positional relationship of the ruled lines in the document and similar cues, identifies the table regions, the heading region within each table region (for a table whose attributes are arranged along the column direction, this is usually the first row or the first few rows of the table), and the regions of the individual cells.
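For the HTML case only, the analysis step might be sketched with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions: the markup is well formed, tables are not nested, and the first row of each table is taken as the heading region. `TableExtractor` and `split_header_and_cells` are hypothetical names introduced for illustration, and ruled-line analysis of image data is not covered.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def split_header_and_cells(html):
    """Return (heading row, data rows) for each table; assumes row 0 is the heading."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [(rows[0], rows[1:]) for rows in parser.tables if rows]


# 氏名 = name, 住所 = address, 徳島市 = Tokushima City
html = (
    "<table><tr><th>氏名</th><th>住所</th></tr>"
    "<tr><td>山田</td><td>徳島市</td></tr></table>"
)
result = split_header_and_cells(html)
```

Here `result` holds one pair per table: the heading row, from which attribute names would be taken, and the remaining rows, from which values would be taken.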

202 is an attribute name extraction unit. From the heading region identified by the document analysis unit 201, it extracts the individual heading strings, that is, the names of the individual attributes that make up the table (attribute names), and counts their occurrences. Specifically, it registers each extracted attribute name in an attribute name list together with the name of the document from which it was extracted and its occurrence count (initial value 1). If an entry with the same attribute name is already in the list, the new source document name is added to that entry and its occurrence count is incremented by 1. For example, if there are three tables with an attribute named "address" across all documents in the document storage unit 200, the occurrence count finally associated with the attribute name "address" is 3.
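The attribute name list described here can be sketched as a dictionary keyed by attribute name. This is an illustrative sketch only: `AttributeEntry` and `build_attribute_name_list` are hypothetical names, and the input is assumed to be one (document name, heading strings) pair per table, already produced by the analysis step.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeEntry:
    """One entry of the attribute name list (hypothetical structure)."""
    count: int = 0                                  # number of tables using the name
    documents: list = field(default_factory=list)   # source document names


def build_attribute_name_list(tables):
    """tables: iterable of (document_name, heading_strings) pairs, one per table."""
    entries = {}
    for document_name, headings in tables:
        for name in headings:
            entry = entries.setdefault(name, AttributeEntry())
            entry.count += 1
            if document_name not in entry.documents:
                entry.documents.append(document_name)
    return entries


# 氏名 = name, 住所 = address, 現住所 = current address, 電話番号 = telephone number
tables = [
    ("A.xls", ["氏名", "住所"]),
    ("B.xls", ["氏名", "現住所"]),
    ("C.html", ["住所", "電話番号"]),
    ("C.html", ["住所"]),  # a second table in the same document
]
attribute_list = build_attribute_name_list(tables)
```

With this input, "住所" appears in three tables across two documents, so its entry has a count of 3 and lists "A.xls" and "C.html" as sources.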

The attribute name extraction unit 202 further includes an attribute name classification unit 202a, which classifies the extracted attribute names into groups of semantically identical or similar names.

Some attribute names, such as "telephone number" (電話番号) and "TEL", are semantically identical even though their notation is completely different. Conversely, the same notation may be used with entirely different meanings in different tables. Apart from such exceptions, however, attribute names that are semantically identical or similar can generally be expected to have identical or similar notation as well.

Accordingly, the attribute name classification unit 202a classifies the attribute names extracted by the attribute name extraction unit 202 into groups (categories) of names whose notation is identical or similar. For example, suppose the attribute name extraction unit 202 has extracted the following attribute names, shown in their original Japanese notation with English glosses in parentheses:
・ご要望事項 (requests)
・さ厚 (thickness)
・その他 (other)
・なし (none)
・はんだ (solder)
・はんだの仕様 (solder specifications)
・はんだの径 (solder diameter)
・はんだ付け (soldering)
・ガス圧 (gas pressure)
・ガラスベース (glass base)
・ガラス中央 (glass center)
・ガラス接着 (glass bonding)
・レンズ (lens)
・レンズの種類 (lens type)
・レンズ角度 (lens angle)
・使用 (use)
・使用レンズ (lens used)
・使用機器 (equipment used)
・使用機種 (model used)
・備考 (remarks)
These names are then classified as follows, one group per line:
・ご要望事項
・さ厚
・その他
・なし
・はんだ, はんだの仕様, はんだの径, はんだ付け
・ガス圧, ガラスベース, ガラス中央, ガラス接着
・レンズ, レンズの種類, レンズ角度
・使用, 使用レンズ, 使用機器, 使用機種
・備考

The simplest grouping policy is to put character strings that share a fixed-length prefix (for example, the first two characters) into one group. Instead of a fixed length, a variable length that depends on the string length n (for example, n/2 or n*2/5+1) may be used. A more sophisticated approach uses a technique such as sequential pattern mining and decides whether names can be grouped on the basis of contiguous substrings taken from the end of the string as well as from its beginning, or of strings made up of several non-adjacent characters selected from within it. Alternatively, similarity may be measured solely from the proportions of the constituent character types, regardless of character order. Beyond these, the similarity between attribute names can be calculated with any of the various methods already proposed for computing the similarity between character strings.
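The simplest of these policies, grouping by a fixed-length prefix, might look like the sketch below. `group_by_prefix` and its `prefix_len` parameter are hypothetical names, and the variable-length and pattern-mining variants are not covered.

```python
from collections import defaultdict


def group_by_prefix(names, prefix_len=2):
    """Group names that share their first prefix_len characters.

    A name shorter than prefix_len is keyed by the whole name.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:prefix_len]].append(name)
    return dict(groups)


# The twenty attribute names from the example earlier in this description.
names = [
    "ご要望事項", "さ厚", "その他", "なし",
    "はんだ", "はんだの仕様", "はんだの径", "はんだ付け",
    "ガス圧", "ガラスベース", "ガラス中央", "ガラス接着",
    "レンズ", "レンズの種類", "レンズ角度",
    "使用", "使用レンズ", "使用機器", "使用機種",
    "備考",
]
by_two = group_by_prefix(names, prefix_len=2)
by_one = group_by_prefix(names, prefix_len=1)
```

With a two-character prefix, "ガス圧" lands in a different group from the three "ガラス..." names, giving ten groups. With a one-character prefix the result happens to coincide with the nine groups of the example above, which shows how sensitive this policy is to the chosen length.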

Each resulting group is given a unique group name, for example the character string common to all attribute names in the group. In this way the attribute names "address", "current address", and "address / location", for instance, are merged into the group "address". The group to which each extracted attribute name belongs is registered in the attribute name list described above by the attribute name classification unit 202a.
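One way to obtain "the character string common to all attribute names" is to take the longest substring shared by every member of the group. The sketch below assumes that interpretation, which the text does not spell out. `common_group_name` is a hypothetical name, and the function does not by itself guarantee that names are unique across groups.

```python
def common_group_name(names):
    """Return the longest substring contained in every name, or "" if none.

    Candidates are taken from the shortest name, longest first, so the first
    match found is a longest common substring.
    """
    if not names:
        return ""
    shortest = min(names, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in name for name in names):
                return candidate
    return ""


# 住所 = address, 現住所 = current address, 住所／所在地 = address / location
address_group = common_group_name(["住所", "現住所", "住所／所在地"])
solder_group = common_group_name(["はんだ", "はんだの仕様", "はんだ付け"])
no_overlap = common_group_name(["電話番号", "TEL"])
```

For the address example this yields "住所", matching the group name given in the text. Names with no shared characters, such as "電話番号" and "TEL", yield an empty string and would need some other naming rule.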

203 is a value extraction unit. It extracts the concrete values of each attribute from the individual cells identified by the document analysis unit 201 and counts their occurrences per attribute. Specifically, it registers each extracted value in a value list together with the document name and attribute name from which it was extracted and its occurrence count (initial value 1). If an entry for the same value extracted from the same attribute name is already in the list, the new source document name is added to that entry and its occurrence count is incremented by 1. For example, if there are three records containing "Tokushima City" in the "address" attribute across all documents in the document storage unit 200, the occurrence count of "Tokushima City" in the "address" attribute is 3. Even if there are also, say, five records containing "Tokushima City" in the "current address" attribute, that number is not added to the occurrence count of "Tokushima City" in the "address" attribute.
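The per-attribute counting can be sketched by keying the value list on the (attribute name, value) pair, so that the same value under a different attribute is counted separately. `build_value_list` is a hypothetical name, and the input is assumed to be one (document name, attribute name, value) triple per cell, with values compared as exact cell strings.

```python
from collections import Counter


def build_value_list(cells):
    """cells: iterable of (document_name, attribute_name, value) triples.

    Returns (counts, sources): occurrence counts and source document names,
    both keyed by the (attribute_name, value) pair.
    """
    counts = Counter()
    sources = {}
    for document_name, attribute_name, value in cells:
        key = (attribute_name, value)
        counts[key] += 1
        sources.setdefault(key, set()).add(document_name)
    return counts, sources


# 住所 = address, 現住所 = current address, 徳島市 = Tokushima City
cells = (
    [("A.xls", "住所", "徳島市")] * 2
    + [("C.html", "住所", "徳島市")]
    + [("B.xls", "現住所", "徳島市")] * 5
)
counts, sources = build_value_list(cells)
```

As in the example in the text, "徳島市" is counted 3 times under "住所" and, independently, 5 times under "現住所".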

204 is a document search unit, which searches the documents in the document storage unit 200 for those that satisfy the search condition entered by the searcher. The document search unit 204 can retrieve only those documents that contain a specified character string in a specified attribute.
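A position-restricted search of this kind can be sketched as follows, assuming each document has already been parsed into tables whose rows are dictionaries mapping attribute names to cell strings. `search_documents` and the data layout are assumptions made for illustration, and matching is by simple substring containment.

```python
def search_documents(documents, attribute_names, values):
    """Return names of documents with any of `values` inside any of `attribute_names`.

    documents: {document_name: [table, ...]}, where each table is a list of
    row dicts mapping attribute name to cell string.
    """
    hits = []
    for document_name, tables in documents.items():
        matched = any(
            value in row.get(attribute, "")
            for table in tables
            for row in table
            for attribute in attribute_names
            for value in values
        )
        if matched:
            hits.append(document_name)
    return hits


# 住所 = address, 現住所 = current address, 住所／所在地 = address / location,
# 備考 = remarks, 徳島市 = Tokushima City, 高松市 = Takamatsu City
documents = {
    "A": [[{"氏名": "山田", "住所": "徳島市川内町"}]],
    "B": [[{"氏名": "佐藤", "現住所": "高松市", "備考": "徳島市に勤務"}]],
    "N": [[{"氏名": "鈴木", "住所／所在地": "徳島市沖浜"}]],
}
hits = search_documents(documents, ["住所", "現住所", "住所／所在地"], ["徳島市"])
```

Document B mentions "徳島市" only in its remarks column, so it is not returned, whereas a search on the string alone would have matched it.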

205 is an input/output unit, which accepts input such as instructions from the searcher via the keyboard 111 or the mouse 112, and creates the various screens described in the flowcharts below and displays them on the display 110.

FIG. 3 is a flowchart showing the procedure for creating the attribute name list and the value list in the document search apparatus according to the embodiment of the present invention. The illustrated processing is assumed to have been executed in advance, before the document search processing described later.

First, the document analysis unit 201 analyzes the documents in the document storage unit 200 one after another and identifies the table regions and other regions in each document (step S301). Next, based on the result of this analysis, the attribute name extraction unit 202 extracts the attribute names from each table in the documents and counts their occurrences (step S302). The attribute name classification unit 202a then classifies the extracted attribute names into groups (step S303).

Meanwhile, the value extraction unit 203 extracts the values from each table in the documents, based on the analysis result of the document analysis unit 201, and counts their occurrences (step S304). Through these steps the attribute name list and the value list described above are created, and they are held by the attribute name extraction unit 202 and the value extraction unit 203, respectively.

FIG. 4 is a flowchart showing the procedure of the document search processing in the document search apparatus according to the embodiment of the present invention.

First, the input/output unit 205 displays a search screen such as that shown in FIG. 5 (step S401). As illustrated, the search screen has an attribute name input area 500 and a value input area 501. A searcher who knows exactly which attribute name and value to specify simply enters the desired strings in these fields and presses the search button 502. When the input/output unit 205 detects that the search button 502 has been pressed (step S402: Yes), the document search unit 204, on instruction from the input/output unit 205, searches for documents that contain the currently specified value in the currently specified attribute (step S403), and the input/output unit 205 displays a list of the search results (step S404).

As noted above, however, when the attribute names differ subtly among the tables in the document storage unit 200, as with "address", "current address", and "address / location", it is difficult to specify them all without omission. The searcher therefore presses the attribute button 503 before running the search, to list the attribute names that appear in the documents in the document storage unit 200, and selects the desired attribute names from that list.

That is, when the input/output unit 205 detects that the attribute button 503 on the search screen of FIG. 5 has been pressed (step S402: No, step S405: Yes), it creates and displays an attribute name list such as that shown in FIG. 6, based on the attribute name list held by the attribute name extraction unit 202 (step S406).

As illustrated, the attribute name list shows the attribute names extracted from all documents in the document storage unit 200, group by group (in the figure, the underlined strings are the group names described above), with the occurrence count of each attribute name shown in parentheses beside it. The attribute name currently selected is shown in the search candidate area 600, and pressing the select button 601 or the add button 602 in this state enters that attribute name into the attribute name input area 500 of the search screen of FIG. 5. That is, on detecting this operation (step S407: Yes), the input/output unit 205 erases the attribute name list of FIG. 6 (step S408), displays the search screen of FIG. 5, and shows the attribute name selected in FIG. 6 in the attribute name input area 500 (step S401).

If some attribute name has already been entered in the attribute name input area 500 of FIG. 5, selecting another attribute name in FIG. 6 and pressing the select button 601 leaves only the newly selected attribute name in the attribute name input area 500. Pressing the add button 602, by contrast, appends the attribute name selected in FIG. 6 to the attribute names already entered.

Likewise, when the input/output unit 205 detects that the value button 504 on the search screen of FIG. 5 has been pressed (step S402: No, step S405: No, step S409: Yes), it creates and displays a value list such as that shown in FIG. 7, based on the value list held by the value extraction unit 203 (step S410).

As illustrated, the value list shows the values of the attributes that were entered in the attribute name input area 500 when the value button 504 was pressed ("product" and "product name" in the illustrated example), with the occurrence count of each value shown both as a number in parentheses and as a graph. The value currently selected is shown in the search candidate area 700, and pressing the select button 701 or the add button 702 in this state enters that value into the value input area 501 of the search screen of FIG. 5. That is, on detecting this operation (step S411: Yes), the input/output unit 205 erases the value list of FIG. 7 (step S412), displays the search screen of FIG. 5, and shows the value selected in FIG. 7 in the value input area 501 (step S401).

If some value has already been entered in the value input area 501 of FIG. 5, selecting another value in FIG. 7 and pressing the select button 701 leaves only the newly selected value in the value input area 501. Pressing the add button 702, by contrast, appends the value selected in FIG. 7 to the values already entered.
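The replace-versus-append behavior of the select and add buttons, which applies to both the attribute name input area and the value input area, can be sketched as a small pure function. `apply_choice` and the mode strings are hypothetical, and skipping an item that is already present is an added assumption that the text does not state.

```python
def apply_choice(current, chosen, mode):
    """Return the new contents of an input area after a button press.

    mode "select" replaces the contents with the chosen item.
    mode "add" appends the chosen item to what was already entered
    (an item already present is not repeated, which is an assumption).
    """
    if mode == "select":
        return [chosen]
    if mode == "add":
        return list(current) if chosen in current else list(current) + [chosen]
    raise ValueError(f"unknown mode: {mode!r}")


# 製品 = product, 製品名 = product name
after_add = apply_choice(["製品"], "製品名", "add")
after_select = apply_choice(["製品"], "製品名", "select")
```

Starting from an input area that holds "製品", the add button yields both names, while the select button leaves only "製品名".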

After the attribute names and values have been entered through the procedure from step S405 onward, pressing the search button 502 (step S402: Yes), just as when the strings are typed in directly, retrieves the documents that contain the specified values in the specified attributes (steps S403 and S404).

According to the embodiment described above, the desired attribute names and values can be selected from lists of those actually used in the tables of the search target documents. Even a searcher who does not know exactly which tables the search target documents contain, or how each table is structured, can therefore avoid search omissions caused by subtle notational variations and perform an accurate search.

Moreover, attribute names are displayed aggregated into groups of names that are semantically identical or similar (strictly speaking, names that are highly likely to be semantically identical or similar), so similar attribute names can easily be specified together.
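One way to approximate such grouping by notation is sketched below. The similarity measure is an assumption: the sketch uses the ratio from Python's standard `difflib.SequenceMatcher` with a hypothetical threshold of 0.7 and a greedy assignment to the first matching group, none of which is specified by the embodiment.

```python
from difflib import SequenceMatcher


def group_attribute_names(names, threshold=0.7):
    """Greedily group names whose notation is similar to a group's first member."""
    groups = []
    for name in names:
        for group in groups:
            # Compare against the first member, used as the group's representative.
            if SequenceMatcher(None, group[0], name).ratio() >= threshold:
                group.append(name)
                break
        else:
            # No sufficiently similar group exists, so start a new one.
            groups.append([name])
    return groups


names = ["product", "product name", "products", "price", "prices"]
for group in group_attribute_names(names):
    print(group)
```

Because the comparison is purely notational, names that look alike but mean different things may be grouped together, which is consistent with the remark above that the groups contain names that are only highly likely to be semantically identical or similar.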

The attribute name list shown in FIG. 6 simply lists the groups in the order of the Japanese syllabary. Alternatively, for example, the number of times each attribute name has actually been used in searches may be kept as a history, and groups containing frequently used attribute names, or attribute names frequently used in combination with attribute names that have already been entered, may be displayed higher in the list. A separate ranking screen may also be provided that displays the top N most frequently used attribute names or the groups to which they belong. For a table in which attributes are arranged in the row direction as well as the column direction, the column-direction attribute names and the row-direction attribute names may be displayed separately.
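The history-based ordering suggested above can be sketched as follows. The sketch assumes that the usage history is a mapping from attribute name to the number of searches in which it was used, and that a group is ranked by its most frequently used member; both choices, and all names, are hypothetical.

```python
def rank_groups(groups, usage_counts, top_n=None):
    """Order groups so that those containing frequently used names come first."""
    ranked = sorted(
        groups,
        # Assumption: a group's score is the usage count of its most used member.
        key=lambda group: max((usage_counts.get(name, 0) for name in group), default=0),
        reverse=True,
    )
    return ranked if top_n is None else ranked[:top_n]


groups = [["price", "prices"], ["product", "product name"], ["remarks"]]
usage_counts = {"product name": 5, "price": 2}

print(rank_groups(groups, usage_counts))
print(rank_groups(groups, usage_counts, top_n=1))
```

Python's sort is stable, so groups with equal scores keep their original relative order, for example the syllabary order of FIG. 6.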

The value list shown in FIG. 7 is a list of the values of specific attribute names, but a button such as "list of all values" may, for example, be provided on the search screen of FIG. 5 so that the values of all attributes can be checked on a single screen. Furthermore, while the embodiment described above lists the attribute names whose values contain the specified value in its entirety, the specified value may instead be divided into semantic blocks by NLP (natural language processing), for example, and the attribute names whose values contain any of the blocks may be listed.
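The block-based variation can be illustrated with a rough sketch. The splitting step here is only a whitespace split standing in for real NLP segmentation (Japanese text would need morphological analysis instead), and the index structure mapping each attribute name to its values is an assumption.

```python
def split_into_blocks(value):
    """Stand-in for NLP segmentation: split on whitespace only."""
    return value.split()


def attributes_matching_any_block(value_index, specified_value):
    """List attribute names that have a value containing any block of the input."""
    blocks = split_into_blocks(specified_value)
    return [
        attribute
        for attribute, values in value_index.items()
        if any(block in value for value in values for block in blocks)
    ]


# Hypothetical index: attribute name -> values that appear under it.
value_index = {
    "color": ["red", "blue"],
    "product": ["widget", "gadget"],
    "price": ["100"],
}

print(attributes_matching_any_block(value_index, "red widget"))
```

No attribute in this sample has a value containing the whole string "red widget", so matching on the entire value would list nothing, whereas matching on blocks lists both "color" and "product".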

As described above, the document search apparatus, document search method, and document search program according to the present invention, which can search for a character string by specifying its position in a document, allow the searcher to specify the position and the character string accurately and comprehensively with an easy operation.

The document search method described in this embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The program is recorded on a computer-readable recording medium such as the hard disk 105, the flexible disk 107, a CD-ROM, the CD-RW 109, an MO, or a DVD, and is executed by being read from the recording medium by the computer. The program may also be a transmission medium that can be distributed via a network such as the Internet.

As described above, the document search apparatus, document search method, and document search program according to the present invention are useful for searching documents having a table structure, and are particularly suited to cases where there are a large number of tables with subtle notational variations in their attribute names, values, and the like.

FIG. 1 is an explanatory diagram showing an example of the hardware configuration of the document search apparatus according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram functionally showing the configuration of the document search apparatus according to the embodiment of the present invention.

FIG. 3 is a flowchart showing the procedure for creating the attribute name list and the value list in the document search apparatus according to the embodiment of the present invention.

FIG. 4 is a flowchart showing the procedure of the document search process in the document search apparatus according to the embodiment of the present invention.

FIG. 5 is an explanatory diagram showing an example of the search screen created and displayed by the input/output unit 205.

FIG. 6 is an explanatory diagram showing an example of the attribute name list screen created and displayed by the input/output unit 205.

FIG. 7 is an explanatory diagram showing an example of the value list screen created and displayed by the input/output unit 205.

Explanation of symbols

100 Bus
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 CD-RW drive
109 CD-RW
110 Display
111 Keyboard
112 Mouse
113 Network I/F
114 Communication cable
200 Document storage unit
201 Document analysis unit
202 Attribute name extraction unit
202a Attribute name classification unit
203 Value extraction unit
204 Document search unit
205 Input/output unit

Claims

1. A document search apparatus that searches for an electronic document including a specified character string at a specified position, comprising:
attribute name extraction means for extracting the name of each attribute in a table included in the electronic document;
attribute name classification means for classifying the names extracted by the attribute name extraction means into groups of semantically identical or similar names; and
attribute name display means for displaying the names classified by the attribute name classification means group by group.

2. The document search apparatus according to claim 1, wherein the attribute name classification means classifies the names extracted by the attribute name extraction means into groups of names having the same or similar notation.

3. The document search apparatus according to claim 1 or 2, further comprising:
value extraction means for extracting the values of each attribute whose name has been extracted by the attribute name extraction means; and
value display means for displaying the values extracted by the value extraction means.

4. A document search method for searching for an electronic document including a specified character string at a specified position, comprising:
an attribute name extraction step of extracting the name of each attribute in a table included in the electronic document;
an attribute name classification step of classifying the names extracted in the attribute name extraction step into groups of semantically identical or similar names; and
an attribute name display step of displaying the names classified in the attribute name classification step group by group.

5. The document search method according to claim 4, wherein, in the attribute name classification step, the names extracted in the attribute name extraction step are classified into groups of names having the same or similar notation.

6. The document search method according to claim 4, further comprising:
a value extraction step of extracting the values of each attribute whose name has been extracted in the attribute name extraction step; and
a value display step of displaying the values extracted in the value extraction step.

7. A document search program for causing a computer to execute the method according to any one of claims 4 to 6.