JP5644244B2

JP5644244B2 - Document processing apparatus, document processing method, and program

Info

Publication number: JP5644244B2
Application number: JP2010178334A
Authority: JP
Inventors: 貢三浦
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-08-09
Filing date: 2010-08-09
Publication date: 2014-12-24
Anticipated expiration: 2030-08-09
Also published as: JP2012038124A

Description

The present invention relates to a document processing apparatus that processes document data representing a document.

A full-text search apparatus is known that performs search processing to identify, from among a plurality of documents, the documents that contain a specific search keyword (search target word). This type of full-text search apparatus mainly uses one of two methods. The first is the grep (sequential scan search) method, and the second is the index method.

In the grep method, the documents represented by the document data stored in a storage device are scanned sequentially to find character strings identical to the search keyword. The name of the grep method derives from "grep", a UNIX (registered trademark) command for searching for a character string across a plurality of files.
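The grep method can be illustrated with a minimal Python sketch. The function name `grep_search` and the sample documents are hypothetical and are not taken from the patent; the sketch only shows that every document is scanned for every query.

```python
def grep_search(documents, keyword):
    """Return the IDs of the documents whose text contains the keyword."""
    hits = []
    for doc_id, text in documents.items():
        # str.find scans the text from its beginning; -1 means "not found".
        if text.find(keyword) != -1:
            hits.append(doc_id)
    return hits


# Hypothetical sample data, for illustration only.
docs = {
    "doc1": "the president visited new orleans",
    "doc2": "grep scans every file",
}
print(grep_search(docs, "grep"))
```

No auxiliary data is stored, but the cost of each search grows with the total size of the stored documents.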

Because the grep method requires no index information, it stores less data in the storage device than the index method. Furthermore, because the grep method generates no index information even when the stored document data is changed, search processing can be executed promptly after a change. On the other hand, the grep method takes longer to execute search processing than the index method (that is, its search speed is lower).

In the index method, index information is generated in advance that includes the words contained in the documents represented by the document data stored in the storage device, together with document identification information for identifying the document in which each word appears. Search processing is then performed by matching the search keyword against the index information.

The index method executes search processing faster than the grep method. On the other hand, because the index information must be stored in the storage device together with the document data, the index method stores more data than the grep method. Furthermore, when the stored document data is changed, time is needed to generate the index information, so search processing cannot be executed promptly after a change. Full-text search apparatuses using the index method are described in, for example, Patent Document 1 or Patent Document 2.
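The conventional index method described above can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace splitting stands in for real word extraction, and the names `build_simple_index` and `index_search` are hypothetical.

```python
from collections import defaultdict


def build_simple_index(documents):
    """Map each word to the set of IDs of the documents in which it appears."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: words are separated by whitespace.
        for word in text.split():
            index[word].add(doc_id)
    return index


def index_search(index, keyword):
    """Look the keyword up in the index instead of rescanning the documents."""
    return sorted(index.get(keyword, set()))


# Hypothetical sample data, for illustration only.
documents = {
    "doc1": "index methods search quickly",
    "doc2": "grep methods scan slowly",
}
index = build_simple_index(documents)
print(index_search(index, "methods"))
```

The lookup is fast, but the index must be stored alongside the documents and rebuilt whenever they change, which is the trade-off the text describes.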

Techniques for compressing data (data compression techniques) are also known. For example, a data compression technique may use a dictionary method, which is one of the universal coding methods (for example, a method that uses a compression format such as LZH or ZIP). The dictionary method compresses data by registering character strings found in the data to be compressed in a dictionary and replacing each character string identical to a registered one with another character string of smaller data size.
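The idea behind the dictionary method can be shown with a toy substitution scheme. This sketch is not the LZH or ZIP algorithm; the dictionary is fixed by hand, the one-character codes are arbitrary, and the scheme assumes the codes never occur in the input text.

```python
def compress(text, dictionary):
    """Replace each registered phrase with its shorter code."""
    for phrase, code in dictionary.items():
        text = text.replace(phrase, code)
    return text


def decompress(data, dictionary):
    """Replace each code with the phrase it stands for."""
    for phrase, code in dictionary.items():
        data = data.replace(code, phrase)
    return data


# Hypothetical dictionary: control characters serve as short codes.
dictionary = {"full-text search": "\x01", "index": "\x02"}
text = "full-text search uses an index; the index speeds up full-text search"
packed = compress(text, dictionary)
print(len(text), len(packed))
```

Note that recovering any part of the text requires running the decompression step first, which is the overhead the following paragraphs are concerned with.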

Patent Document 1: JP-A-8-287105
Patent Document 2: JP-A-9-54777

Accordingly, to reduce the amount of data stored in the storage device, it would seem suitable to configure the above full-text search apparatus to compress the document data. Such a full-text search apparatus restores the compressed document data by decompressing it and generates index information based on the restored document data.

In this case, however, the full-text search apparatus decompresses the compressed document data when generating the index information, so generating the index information takes a relatively long time. That is, the index information cannot be generated quickly. Furthermore, in order to output a document, the compressed document data must be stored in the storage device in addition to the index information, so the amount of data stored in the storage device is relatively large.

An object of the present invention is therefore to provide a document processing apparatus capable of solving the problems described above, namely that generating index information takes a relatively long time and that the amount of data stored in the storage device is relatively large.

To achieve this object, a document processing apparatus according to one aspect of the present invention comprises:
document data receiving means for receiving a plurality of document data each representing a document;
index information storage processing means for storing in a storage device, for each of all the words contained in the documents represented by the received document data, index information that associates the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in that document; and
document restoration means for restoring at least a part of the document based on the stored index information.

A document processing method according to another aspect of the present invention comprises:
receiving a plurality of document data each representing a document;
storing in a storage device, for each of all the words contained in the documents represented by the received document data, index information that associates the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in that document; and
restoring at least a part of the document based on the stored index information.

A program according to another aspect of the present invention causes an information processing apparatus to realize:
document data receiving means for receiving a plurality of document data each representing a document;
index information storage processing means for storing in a storage device, for each of all the words contained in the documents represented by the received document data, index information that associates the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in that document; and
document restoration means for restoring at least a part of the document based on the stored index information.

Configured as described above, the present invention can generate index information quickly while reducing the amount of data stored in the storage device.

FIG. 1 is a block diagram outlining the configuration and functions of the document processing apparatus according to the first embodiment of the present invention.
FIG. 2 is a table containing a plurality of pieces of word information, each associating a word with word identification information, stored by the index information generation device according to the first embodiment of the present invention.
FIG. 3 is an explanatory diagram conceptually outlining the process, executed by the index information generation device according to the first embodiment of the present invention, of acquiring words and position information from document data.
FIG. 4 is an explanatory diagram conceptually outlining the process, executed by the index information generation device according to the first embodiment of the present invention, of acquiring word identification information and position information.
FIG. 5 is an explanatory diagram conceptually outlining the process, executed by the index information generation device according to the first embodiment of the present invention, of generating index information.
FIG. 6 is a table containing a plurality of pieces of index information generated by the index information generation device according to the first embodiment of the present invention.
FIG. 7 is a flowchart showing, among the operations of the document processing apparatus according to the first embodiment of the present invention, the operation of storing index information in the storage device.
FIG. 8 is a flowchart showing, among the operations of the document processing apparatus according to the first embodiment of the present invention, the operation of executing search processing based on the index information.
FIG. 9 is a block diagram outlining the configuration and functions of a document processing apparatus according to a modification of the first embodiment of the present invention.
FIG. 10 is a flowchart showing the program executed by the document processing apparatus according to the second embodiment of the present invention to acquire the base information for generating index information.
FIG. 11 is a block diagram outlining the functions of the document processing apparatus according to the third embodiment of the present invention.

Embodiments of a document processing apparatus, a document processing method, and a program according to the present invention are described below with reference to FIGS. 1 to 11.

<First Embodiment>
(Configuration)
As shown in FIG. 1, the document processing apparatus 1 according to the first embodiment includes an index information generation device 10 and a search processing execution device 20 that are communicably connected to each other.

The index information generation device 10 is an information processing apparatus. It includes a central processing unit (CPU) and a storage device (a memory and a hard disk drive (HDD)), neither of which is shown. The index information generation device 10 is configured to realize the functions described later by having the CPU execute a program stored in the storage device.

The search processing execution device 20 is also an information processing apparatus. It includes a central processing unit (CPU), a storage device (a memory and an HDD), an input device (a keyboard, a mouse, and the like), and an output device (a display and the like), none of which is shown. The search processing execution device 20 is configured to realize the functions described later by having the CPU execute a program stored in the storage device.

(Function)
FIG. 1 is a block diagram showing the functions of the document processing apparatus 1 configured as described above.
The functions of the index information generation device 10 include a document data reception unit (document data receiving means) 11, a word information storage unit (word information storage means) 12, an index information generation unit (part of the index information storage processing means) 13, and an index information transmission unit 14.

The document data reception unit 11 receives a plurality of document data. Document data is information representing a document that contains character strings. The document data reception unit 11 may be configured to accept document data that the index information generation device 10 receives from another device. It may also be configured to accept document data that a user inputs to the index information generation device 10.

The word information storage unit 12 stores word information that associates a word with word identification information for identifying that word. In this example, the word information storage unit 12 stores a plurality of pieces of word information in advance, as shown in FIG. 2.

The index information generation unit 13 generates index information based on the plurality of document data received by the document data reception unit 11 and the word information stored in the word information storage unit 12. In doing so, the index information generation unit 13 generates index information for each of all the words contained in the documents represented by the plurality of document data received by the document data reception unit 11.

Index information is information that associates a word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in that document. As described later, the index information can be used by the search processing execution device 20 when it executes search processing.

The function of the index information generation unit 13 is now described using a specific example.
Assume that, as shown in FIG. 3, document data representing a document D1 has been received by the document data reception unit 11.

In this case, the index information generation unit 13 sequentially scans the character string contained in the document D1 to identify the positions at which the words stored in the word information storage unit 12 appear. Then, for each word appearing in the document D1, the index information generation unit 13 acquires the word D2 (in this example, "President of the United States" (合衆国大統領), the particle "wa" (は), "New Orleans" (ニューオリンズ), and so on) in association with position information D3 representing the position at which the word appears in the document D1 (in this example, the number of characters from the beginning of the document to the first character of the word, such as "1st character" and "6th character").
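The scan described above might look like the following sketch. It assumes a longest-match rule when several registered words start at the same offset, which the patent does not specify, and it uses a short hypothetical text built from the example words rather than the actual document D1 of FIG. 3. Positions are 1-based character counts, as in the text.

```python
def extract_words(text, vocabulary):
    """Return (word, position) pairs found by scanning the text once.

    position is the 1-based character offset of the word's first character.
    """
    # Assumption: prefer the longest registered word at each offset.
    candidates = sorted((w for w in vocabulary if w), key=len, reverse=True)
    found = []
    i = 0
    while i < len(text):
        for word in candidates:
            if text.startswith(word, i):
                found.append((word, i + 1))
                i += len(word)
                break
        else:
            # No registered word starts here; move on by one character.
            i += 1
    return found


# Hypothetical vocabulary and text, for illustration only.
vocabulary = {"合衆国大統領", "は", "ニューオリンズ"}
pairs = extract_words("合衆国大統領はニューオリンズ", vocabulary)
print(pairs)
```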

Then, as shown in FIG. 4, the index information generation unit 13 replaces each acquired word D2 with the word identification information D4 stored in the word information storage unit 12 in association with that word.
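As a minimal sketch of this replacement step, the word information can be modeled as a dictionary from words to identifiers. The identifier values (101 and so on) and the name `to_word_ids` are hypothetical and do not come from FIG. 2 or FIG. 4.

```python
def to_word_ids(word_positions, word_to_id):
    """Replace each word with its identifier, keeping the position unchanged."""
    return [(word_to_id[word], position) for word, position in word_positions]


# Hypothetical word information (word -> word identification information).
word_to_id = {"合衆国大統領": 101, "は": 102, "ニューオリンズ": 103}
word_positions = [("合衆国大統領", 1), ("は", 7), ("ニューオリンズ", 8)]
id_positions = to_word_ids(word_positions, word_to_id)
print(id_positions)
```

Storing a compact identifier instead of the word itself is what keeps each index entry small even when the same long word occurs many times.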

The index information generation unit 13 executes this processing for each of the plurality of documents represented by the plurality of document data received by the document data reception unit 11. Thereafter, as shown in FIG. 5, the index information generation unit 13 executes processing that integrates the word identification information and position information acquired for each document.

Specifically, as shown in FIG. 6, the index information generation unit 13 generates index information that associates word identification information, the position information acquired in association with that word identification information, and document identification information for identifying the document from which the position information was acquired. That is, in this example, the index information contains word identification information, document identification information, and position information.
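The integration step can be sketched as grouping the per-document (word ID, position) pairs by word ID. The layout chosen here, a dictionary from word ID to a list of (document ID, position) pairs, is one possible representation and is an assumption; the patent only requires that the three items be associated. All identifiers are hypothetical.

```python
from collections import defaultdict


def build_index(per_document):
    """Merge per-document (word_id, position) pairs into one index.

    Returns a dict: word_id -> list of (doc_id, position).
    """
    index = defaultdict(list)
    for doc_id, pairs in per_document.items():
        for word_id, position in pairs:
            index[word_id].append((doc_id, position))
    return dict(index)


# Hypothetical per-document results of the scan and replacement steps.
per_document = {
    "doc1": [(101, 1), (102, 7), (103, 8)],
    "doc2": [(103, 1), (102, 8)],
}
index = build_index(per_document)
print(index[102])
```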

The index information transmission unit 14 transmits the index information generated by the index information generation unit 13 to the search processing execution device 20.

The functions of the search processing execution device 20, on the other hand, include an index information reception unit 21, an index information storage processing unit (part of the index information storage processing means) 22, an index information storage unit 23, a search target word reception unit (search target word receiving means) 24, a search processing execution unit (search processing execution means) 25, and a search result output unit (document restoration means) 26.

The index information reception unit 21 receives index information from the index information generation device 10.
The index information storage processing unit 22 stores the index information received by the index information reception unit 21 in the index information storage unit 23.
As shown in FIG. 6, the index information storage unit 23 stores index information in which word identification information, document identification information, and position information are associated with one another.

The search target word reception unit 24 receives a word as a search target word (search keyword, search key). The search target word reception unit 24 may be configured to accept a search target word that the search processing execution device 20 receives from another device. It may also be configured to accept a search target word that a user inputs to the search processing execution device 20.

The search processing execution unit 25 executes search processing based on the search target word received by the search target word reception unit 24 and the index information stored in the index information storage unit 23. The search processing identifies, as a document containing the search target word, each document identified by document identification information that is associated with the search target word in the stored index information.

In this example, the search processing execution unit 25 acquires the word identification information stored in the word information storage unit 12 in association with the search target word received by the search target word reception unit 24 (the search target word identification information) by receiving it from the index information generation device 10.

The search processing execution unit 25 acquires, from the index information stored in the index information storage unit 23, the index information that contains word identification information identical to the search target word identification information. The search processing execution unit 25 outputs the document identification information and position information contained in the acquired index information as the search result.
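A hedged sketch of this lookup follows. In the patent the word identifier is obtained from the index information generation device 10; here a local dictionary stands in for that exchange, and the index layout (word ID mapped to a list of (document ID, position) pairs) and all identifiers are assumptions made for illustration.

```python
def search(index, word_to_id, query):
    """Return the (doc_id, position) pairs recorded for the query word."""
    word_id = word_to_id.get(query)
    if word_id is None:
        # The query is not a registered word, so nothing can match.
        return []
    return index.get(word_id, [])


# Hypothetical word information and index.
word_to_id = {"合衆国大統領": 101, "は": 102, "ニューオリンズ": 103}
index = {
    101: [("doc1", 1)],
    102: [("doc1", 7), ("doc2", 8)],
    103: [("doc1", 8), ("doc2", 1)],
}
print(search(index, word_to_id, "ニューオリンズ"))
```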

The search result output unit 26 receives the search result output by the search processing execution unit 25. The search result output unit 26 generates search result output information based on the received search result and the index information stored in the index information storage unit 23.

The search result output information contains document identification information for identifying a document that contains the search target word, and a snippet, which is the portion of that document in the vicinity of the search target word. The search result output information may also contain information for emphasizing the search target word within the snippet relative to the other parts.

In this example, the search result output unit 26 extracts the index information that contains the document identification information included in the search result and position information representing a position in the vicinity of the position represented by the position information included in the search result (that is, a position within a preset range (for example, 30 characters) centered on that position).

The search result output unit 26 then acquires the words stored in association with the word identification information contained in the extracted index information by receiving them from the index information generation device 10. It then generates a snippet by arranging the acquired words in order from the beginning of the document, based on the position information, and concatenating them.
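The snippet generation described above can be sketched as follows. The index is modeled as a flat list of (word ID, document ID, position) records, and the preset range is interpreted as a radius around the hit position; both choices, along with all names and identifiers, are assumptions made for illustration.

```python
def make_snippet(entries, id_to_word, doc_id, hit_position, radius=30):
    """Rebuild the text near hit_position in doc_id from index entries alone."""
    nearby = [
        (position, word_id)
        for word_id, entry_doc, position in entries
        if entry_doc == doc_id and abs(position - hit_position) <= radius
    ]
    # Arrange the words from the beginning of the document, then join them.
    nearby.sort()
    return "".join(id_to_word[word_id] for _, word_id in nearby)


# Hypothetical index entries and word information.
entries = [
    (101, "doc1", 1),
    (102, "doc1", 7),
    (103, "doc1", 8),
    (103, "doc2", 1),
]
id_to_word = {101: "合衆国大統領", 102: "は", 103: "ニューオリンズ"}
print(make_snippet(entries, id_to_word, "doc1", 7, radius=1))
```

Only the entries near the hit are touched, so no document has to be decompressed or even stored to produce the snippet.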

In other words, the search result output unit 26 restores at least a part of a document based on the index information stored in the index information storage unit 23. More specifically, it restores the portion of a document containing the search target word that lies in the vicinity of that word.

When the portion of a document in the vicinity of a search target word is to be output and the document data representing the document is compressed, the entire document data must be decompressed before that portion can be extracted. Outputting the portion therefore takes a relatively long time. In contrast, the document processing apparatus 1 according to the first embodiment can restore only the portion of the document in the vicinity of the search target word, based on the index information. The portion can therefore be output quickly.

The search result output unit 26 outputs the generated search result output information. The search result output unit 26 may be configured to output the search result output information via the output device. It may also be configured to output the search result output information by transmitting it to another device.

(Operation)
Next, the operation of the document processing apparatus 1 described above is explained.
The document processing apparatus 1 executes the index information registration program shown in the flowchart of FIG. 7.

First, the document processing apparatus 1 receives a plurality of document data (step S101). The document processing apparatus 1 then generates index information based on the received document data and the word information stored in the storage device (step S102). Next, the document processing apparatus 1 stores the generated index information in the storage device (step S103).

Thereafter, the document processing apparatus 1 executes the search processing execution program shown in the flowchart of FIG. 8.

First, the document processing apparatus 1 receives a search target word (step S201). The document processing apparatus 1 then executes search processing based on the received search target word and the index information stored in the storage device (step S202). Next, the document processing apparatus 1 generates search result output information based on the search result of the search processing and the index information stored in the storage device (step S203). Thereafter, the document processing apparatus 1 outputs the generated search result output information (step S204).

As described above, the document processing apparatus 1 according to the first embodiment of the present invention can restore a document merely by storing the index information in the storage device, without storing the document data in the storage device. Furthermore, there is no need to decompress compressed document data when generating the index information. That is, the document processing apparatus 1 can generate index information quickly while reducing the amount of data stored in the storage device.
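The claim that a document can be restored from the index information alone can be illustrated with the sketch below. It assumes that every character of the document belongs to some indexed word (otherwise the gaps would be lost), and it models the index as a flat list of (word ID, document ID, position) records with hypothetical identifiers.

```python
def restore_document(entries, id_to_word, doc_id):
    """Rebuild the full text of doc_id from index entries, without the original."""
    placed = sorted(
        (position, word_id)
        for word_id, entry_doc, position in entries
        if entry_doc == doc_id
    )
    return "".join(id_to_word[word_id] for _, word_id in placed)


# Hypothetical index entries and word information.
entries = [
    (103, "doc1", 8),
    (101, "doc1", 1),
    (102, "doc1", 7),
    (103, "doc2", 1),
]
id_to_word = {101: "合衆国大統領", 102: "は", 103: "ニューオリンズ"}
print(restore_document(entries, id_to_word, "doc1"))
```

Because the position information fixes the order of the words, the entries may be stored in any order.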

In addition, the document processing apparatus 1 according to the first embodiment of the present invention executes search processing that identifies, as a document containing the search target word, each document identified by document identification information associated with the search target word in the stored index information. Search processing can thus be executed quickly based on the index information.

A document processing apparatus 1 according to a modification of the first embodiment of the present invention may be configured so that, when a word contained in a document represented by the received document data is not stored in the storage device in association with word identification information, the word is stored in the storage device in association with new word identification information that differs from all of the word identification information already stored.
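This modification can be sketched as a get-or-assign lookup. Choosing the new identifier as the current maximum plus one is an assumption; the patent only requires that the new identifier differ from every identifier already stored. The name `get_or_assign_id` and the identifier values are hypothetical.

```python
def get_or_assign_id(word_to_id, word):
    """Return the word's identifier, registering a fresh one if it is unknown."""
    if word not in word_to_id:
        # max(...) + 1 cannot collide with any identifier already stored.
        word_to_id[word] = max(word_to_id.values(), default=0) + 1
    return word_to_id[word]


# Hypothetical word information.
word_to_id = {"合衆国大統領": 101, "は": 102}
new_id = get_or_assign_id(word_to_id, "ニューオリンズ")
print(new_id, word_to_id)
```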

In a document processing apparatus 1 according to a modification of the first embodiment of the present invention, the index information generation device 10 and the search processing execution device 20 may be implemented as a single information processing apparatus.

In the document processing apparatus 1 according to a modification of the first embodiment of the present invention, the same information as that stored in the word information storage unit 12 may be stored not only in the index information generation device 10 but also in the storage device of the search processing execution device 20.

In the document processing apparatus 1 according to a modification of the first embodiment of the present invention, the word information may be stored in the storage device of the search processing execution device 20 instead of in the index information generation device 10.

In this case, as shown in FIG. 9, the functions of the index information generation device 10 are those of the index information generation device 10 according to the first embodiment with the word information storage unit 12 removed, and the functions of the search processing execution device 20 are those of the search processing execution device 20 according to the first embodiment with a word information storage unit 27, identical to the word information storage unit 12, added.

Accordingly, in this case, the search processing execution unit 25 and the search result output unit 26 are each configured to use the word information stored in the word information storage unit 27, and the index information generation unit 13 is configured to receive the word information from the search processing execution device 20 and to use the received word information.

<Second Embodiment>
Next, a document processing apparatus according to a second embodiment of the present invention will be described. The document processing apparatus according to the second embodiment differs from the document processing apparatus according to the first embodiment in that the index information includes information associating each word that appears in a document with the word immediately preceding it and the word immediately following it. The following description therefore focuses on this difference.

The document processing apparatus 1 according to the second embodiment executes, for each of a plurality of pieces of document data, the basic information acquisition program shown in the flowchart of FIG. 10 in order to acquire the basic information from which the index information is generated.

First, the document processing apparatus 1 receives one piece of document data (step S301). Next, the document processing apparatus 1 acquires words one at a time, in order from the beginning of the document represented by the received document data (step S302).

The document processing apparatus 1 then determines whether the acquired word is already stored as word information, that is, whether it is stored in the storage device in association with word identification information (step S303).

If the acquired word is stored as word information, the document processing apparatus 1 determines "Yes" and proceeds to step S305. If the acquired word is not stored as word information, the document processing apparatus 1 determines "No" and proceeds to step S304, where it stores the acquired word in the storage device in association with new word identification information that differs from every piece of word identification information already stored (step S304). The document processing apparatus 1 then proceeds to step S305.

Next, the document processing apparatus 1 acquires the word identification information stored in the storage device in association with the acquired word (step S305).
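Steps S303 to S305 amount to a look-up-or-assign operation on the word information. The following is a minimal sketch under stated assumptions: the word information is modeled as an in-memory dictionary, word identification information is a positive integer, and the function name `get_or_assign_word_id` is hypothetical.

```python
def get_or_assign_word_id(word_info, word):
    """Return the word ID for `word`, assigning a new one if none is stored.

    Assumption: IDs are only ever assigned by this function and never
    deleted, so len(word_info) + 1 differs from every stored ID.
    """
    if word not in word_info:                  # step S303 determined "No"
        word_info[word] = len(word_info) + 1   # step S304: store a new, unused ID
    return word_info[word]                     # step S305: acquire the stored ID


word_info = {}
print(get_or_assign_word_id(word_info, "document"))  # 1
print(get_or_assign_word_id(word_info, "index"))     # 2
print(get_or_assign_word_id(word_info, "document"))  # 1
```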

The document processing apparatus 1 then acquires word identification information for identifying the immediately preceding word, that is, the word appearing immediately before the position at which the acquired word appears in the document (immediately preceding word identification information), by performing the same processing as in steps S302 to S305. Similarly, the document processing apparatus 1 acquires word identification information for identifying the immediately following word, that is, the word appearing immediately after the position at which the acquired word appears in the document (immediately following word identification information), by performing the same processing as in steps S302 to S305 (step S306).

In this way, the document processing apparatus 1 acquires, in association with one another, a word included in the document represented by the received document data (in this example, its word identification information), position information representing the position at which the word appears in the document, the immediately preceding word (in this example, the immediately preceding word identification information), and the immediately following word (in this example, the immediately following word identification information).

Next, the document processing apparatus 1 determines whether the end of the word acquired in step S302 coincides with the end of the document (step S307). If it does not, the document processing apparatus 1 sets the word following the word acquired in step S302 as the target of the processing of steps S302 to S307. The document processing apparatus 1 then returns to step S302 and repeats the processing of steps S302 to S308.

Thereafter, when the end of the word acquired in step S302 coincides with the end of the document, the document processing apparatus 1 determines "Yes" in step S307 and ends the processing of the basic information acquisition program.
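The loop of steps S301 to S307 can be sketched as follows. This is an illustrative sketch rather than the program of FIG. 10: whitespace tokenization, word-ordinal positions, the use of `None` where a word has no predecessor or successor, and the names `BasicInfo` and `acquire_basic_info` are all assumptions not found in the specification.

```python
from typing import NamedTuple, Optional


class BasicInfo(NamedTuple):
    word_id: int            # word identification information
    position: int           # position of the word in the document
    prev_id: Optional[int]  # immediately preceding word identification information
    next_id: Optional[int]  # immediately following word identification information


def acquire_basic_info(text, word_info):
    """Walk one document from its beginning and collect one record per word."""
    ids = []
    for word in text.split():                     # S302: take words in order
        if word not in word_info:                 # S303: already stored?
            word_info[word] = len(word_info) + 1  # S304: assign a new ID
        ids.append(word_info[word])               # S305: acquire the ID
    records = []
    for pos, word_id in enumerate(ids):
        # S306: IDs of the neighboring words (None at the document edges).
        prev_id = ids[pos - 1] if pos > 0 else None
        next_id = ids[pos + 1] if pos + 1 < len(ids) else None
        records.append(BasicInfo(word_id, pos, prev_id, next_id))
    return records                                # S307: end of document reached


word_info = {}
for record in acquire_basic_info("a b a", word_info):
    print(record)
```

For the input `"a b a"`, the word `a` receives ID 1 and `b` receives ID 2, so the middle record carries ID 1 as both its preceding and its following word.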

The document processing apparatus 1 then generates index information based on the information acquired by executing the basic information acquisition program for each of the plurality of pieces of document data. That is, as in the first embodiment, for each of all the words included in the plurality of documents represented by the received pieces of document data, the document processing apparatus 1 generates index information that associates the word, the document identification information, the position information, the immediately preceding word that appears immediately before the position at which the word appears in the document, and the immediately following word that appears immediately after that position.

Furthermore, when restoring at least a part of a document based on the index information, the document processing apparatus 1 acquires the word that follows a given word and the word that precedes it based on the immediately preceding word identification information and the immediately following word identification information. The document processing apparatus 1 can therefore quickly execute the process of restoring at least a part of the document from the index information on the basis of the immediately preceding word and the immediately following word.
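One way to see why the neighboring-word information speeds up restoration is the sketch below. It assumes an index keyed by word ID whose entries map a (document ID, position) pair to the IDs of the preceding and following words, with word-ordinal positions; this layout and the names `build_index` and `restore_forward` are hypothetical. Knowing the ID of the following word lets the restoration go straight to that word's entries instead of scanning every word for the next position.

```python
def build_index(docs):
    """Build word information and an index with neighbor links (assumed layout)."""
    word_info, index = {}, {}
    for doc_id, text in docs.items():
        ids = [word_info.setdefault(w, len(word_info) + 1) for w in text.split()]
        for pos, word_id in enumerate(ids):
            prev_id = ids[pos - 1] if pos > 0 else None
            next_id = ids[pos + 1] if pos + 1 < len(ids) else None
            index.setdefault(word_id, {})[(doc_id, pos)] = (prev_id, next_id)
    return word_info, index


def restore_forward(index, id_to_word, word_id, doc_id, pos, count):
    """Restore up to `count` words starting at the given occurrence."""
    words = []
    while word_id is not None and len(words) < count:
        words.append(id_to_word[word_id])
        # The stored link names the next word, so its entry is found directly.
        _, next_id = index[word_id][(doc_id, pos)]
        word_id, pos = next_id, pos + 1
    return " ".join(words)


docs = {"d1": "the index restores the document"}
word_info, index = build_index(docs)
id_to_word = {word_id: word for word, word_id in word_info.items()}
print(restore_forward(index, id_to_word, word_info["index"], "d1", 1, 3))
# index restores the
```

Restoring backward works symmetrically by following the preceding-word link and decrementing the position.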

As described above, the document processing apparatus 1 according to the second embodiment of the present invention provides the same operations and effects as the document processing apparatus 1 according to the first embodiment.
Furthermore, the document processing apparatus 1 according to the second embodiment of the present invention can quickly execute the process of restoring at least a part of a document from the index information on the basis of the immediately preceding word and the immediately following word.

Note that although the document processing apparatus 1 according to the second embodiment uses both the immediately preceding word identification information and the immediately following word identification information, it may be configured to use only one of them.

<Third Embodiment>
Next, a document processing apparatus according to a third embodiment of the present invention will be described with reference to FIG. 11.
The document processing apparatus 100 according to the third embodiment includes:
a document data receiving unit (document data receiving means) 101 that receives a plurality of pieces of document data each representing a document;
an index information storage processing unit (index information storage processing means) 102 that stores, in a storage device ST, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document; and
a document restoration unit (document restoration means) 103 that restores at least a part of the document based on the stored index information.

With this configuration, a document can be restored merely by storing the index information in the storage device ST, without storing the document data in the storage device ST. Furthermore, there is no need to decompress compressed document data when generating the index information. That is, the document processing apparatus 100 can generate index information quickly while reducing the amount of data stored in the storage device ST.
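The claim that the index information alone suffices to restore a document can be illustrated with a short sketch. The functions `store_index` and `restore_document` are hypothetical, and the sketch assumes whitespace-separated words with word-ordinal positions, so the restored text normalizes whitespace; the specification does not state how separators are handled.

```python
from collections import defaultdict


def store_index(docs):
    """Associate each word with (document ID, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return index


def restore_document(index, doc_id):
    """Rebuild a document by ordering its words by their stored positions."""
    placed = [(position, word)
              for word, postings in index.items()
              for d, position in postings if d == doc_id]
    return " ".join(word for _, word in sorted(placed))


docs = {"d1": "to be or not to be", "d2": "index only storage"}
index = store_index(docs)
del docs  # the original document data is no longer kept
print(restore_document(index, "d1"))  # to be or not to be
```

Every position in a document is occupied by exactly one word, so sorting by position reproduces the original word order.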

Although the present invention has been described with reference to the above embodiments, the present invention is not limited to those embodiments. Various changes that can be understood by those skilled in the art may be made to the configuration and details of the present invention within its scope.

In each of the above embodiments, each function of the document processing apparatus is realized by a CPU executing a program (software), but it may instead be realized by hardware such as a circuit.

In each of the above embodiments, the program is stored in the storage device, but it may instead be stored in a computer-readable recording medium. The recording medium is, for example, a portable medium such as a flexible disk, an optical disk, a magneto-optical disk, or a semiconductor memory.

As a further modification of the above embodiments, any combination of the above-described embodiments and modifications may be adopted.

<Appendix>
Some or all of the above embodiments can be described as in the following appendices, but are not limited to them.

(Appendix 1)
A document processing apparatus comprising:
document data receiving means for receiving a plurality of pieces of document data each representing a document;
index information storage processing means for storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document; and
document restoration means for restoring at least a part of the document based on the stored index information.

With this configuration, a document can be restored merely by storing the index information in the storage device, without storing the document data in the storage device. Furthermore, there is no need to decompress compressed document data when generating the index information. That is, the document processing apparatus can generate index information quickly while reducing the amount of data stored in the storage device.

(Appendix 2)
The document processing apparatus according to Appendix 1, further comprising:
search target word receiving means for receiving a word as a search target word; and
search processing execution means for executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information.

With this configuration, the search process can be executed quickly on the basis of the index information.

(Appendix 3)
The document processing apparatus according to Appendix 2, wherein
the document restoration means is configured to restore, within a document including the received search target word, the portion in the vicinity of the search target word.

When the portion of a document in the vicinity of a search target word is to be output and the document data representing the document is compressed, the entire document data must be decompressed before that portion can be extracted. Outputting the portion therefore takes a relatively long time. With the above configuration, by contrast, only the portion of the document in the vicinity of the search target word can be restored on the basis of the index information, so that portion can be output quickly.
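Restoring only the neighborhood of each hit can be sketched as follows. The index layout, the `radius` parameter (the number of words restored on each side of the hit), and the names `build_index` and `restore_near` are assumptions for illustration. For brevity the sketch derives a position-to-word map from the whole index; an actual implementation would presumably use a more direct access path, such as neighboring-word links.

```python
from collections import defaultdict


def build_index(docs):
    """Associate each word with (document ID, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return index


def restore_near(index, target_word, radius=2):
    """Restore, for each hit, the words within `radius` positions of it."""
    # Reverse view of the index: (document ID, position) -> word.
    slot_to_word = {(d, p): w for w, postings in index.items() for d, p in postings}
    snippets = []
    for doc_id, pos in index.get(target_word, []):
        window = [slot_to_word[(doc_id, q)]
                  for q in range(pos - radius, pos + radius + 1)
                  if (doc_id, q) in slot_to_word]
        snippets.append((doc_id, " ".join(window)))
    return snippets


docs = {"d1": "the apparatus restores only the nearby portion quickly"}
index = build_index(docs)
print(restore_near(index, "nearby", radius=1))  # [('d1', 'the nearby portion')]
```

Only the positions inside the window are looked up per hit, so the cost of producing a snippet does not depend on the length of the document.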

(Appendix 4)
The document processing apparatus according to any one of Appendices 1 to 3, further comprising
word information storage means for storing word information in which a word is associated with word identification information for identifying the word, wherein
the index information storage processing means is configured to use, as the index information, information including the word identification information, the document identification information, and the position information.

(Appendix 5)
The document processing apparatus according to Appendix 4, wherein
the index information storage processing means is configured to generate the index information based on the stored word information.

(Appendix 6)
The document processing apparatus according to any one of Appendices 1 to 5, wherein
the index information storage processing means is configured to
store, in the storage device, the index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, position information representing the position at which the word appears in the document, and at least one of the immediately preceding word that appears immediately before that position and the immediately following word that appears immediately after that position.

With this configuration, the process of restoring at least a part of a document from the index information can be executed quickly on the basis of the immediately preceding word and/or the immediately following word.

(Appendix 7)
A document processing method comprising:
receiving a plurality of pieces of document data each representing a document;
storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document; and
restoring at least a part of the document based on the stored index information.

(Appendix 8)
The document processing method according to Appendix 7, further comprising:
receiving a word as a search target word; and
executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information.

(Appendix 9)
The document processing method according to Appendix 8, wherein
within a document including the received search target word, the portion in the vicinity of the search target word is restored.

(Appendix 10)
A program for causing an information processing device to realize:
document data receiving means for receiving a plurality of pieces of document data each representing a document;
index information storage processing means for storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document; and
document restoration means for restoring at least a part of the document based on the stored index information.

(Appendix 11)
The program according to Appendix 10, further causing the information processing device to realize:
search target word receiving means for receiving a word as a search target word; and
search processing execution means for executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information.

(Appendix 12)
The program according to Appendix 11, wherein
the document restoration means is configured to restore, within a document including the received search target word, the portion in the vicinity of the search target word.

The present invention is applicable to, for example, a document processing apparatus that processes document data representing documents, and a document search apparatus that searches documents collected from the web for documents including a specific word.

DESCRIPTION OF SYMBOLS
1 Document processing apparatus
10 Index information generation device
11 Document data receiving unit
12 Word information storage unit
13 Index information generation unit
14 Index information transmission unit
20 Search processing execution device
21 Index information receiving unit
22 Index information storage processing unit
23 Index information storage unit
24 Search target word receiving unit
25 Search processing execution unit
26 Search result output unit
27 Word information storage unit
100 Document processing apparatus
101 Document data receiving unit
102 Index information storage processing unit
103 Document restoration unit
ST Storage device

Claims

A document processing apparatus comprising:
document data receiving means for receiving a plurality of pieces of document data each representing a document;
index information storage processing means for storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document;
document restoration means for restoring at least a part of the document based on the stored index information;
search target word receiving means for receiving a word as a search target word; and
search processing execution means for executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information,
wherein
the document restoration means is configured to restore, within a document including the received search target word, the portion in the vicinity of the position represented by the position information stored in association with the document identification information associated with the search target word.

The document processing apparatus according to claim 1, further comprising
word information storage means for storing word information in which a word is associated with word identification information for identifying the word, wherein
the index information storage processing means is configured to use, as the index information, information including the word identification information, the document identification information, and the position information.

The document processing apparatus according to claim 2, wherein
the index information storage processing means is configured to generate the index information based on the stored word information.

The document processing apparatus according to any one of claims 1 to 3, wherein
the index information storage processing means is configured to
store, in the storage device, the index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, position information representing the position at which the word appears in the document, and at least one of the immediately preceding word that appears immediately before that position and the immediately following word that appears immediately after that position.

A document processing method comprising:
receiving a plurality of pieces of document data each representing a document;
storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document;
restoring at least a part of the document based on the stored index information;
receiving a word as a search target word; and
executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information,
wherein the restoring of at least a part of the document restores, within a document including the received search target word, the portion in the vicinity of the position represented by the position information stored in association with the document identification information associated with the search target word.

A program for causing an information processing device to realize:
document data receiving means for receiving a plurality of pieces of document data each representing a document;
index information storage processing means for storing, in a storage device, index information that associates, for each of all the words included in the documents represented by the received document data, the word, document identification information for identifying the document in which the word appears, and position information representing the position at which the word appears in the document;
document restoration means for restoring at least a part of the document based on the stored index information;
search target word receiving means for receiving a word as a search target word; and
search processing execution means for executing a search process that specifies, as a document including the search target word, the document identified by the document identification information associated with the received search target word in the stored index information,
wherein
the document restoration means is configured to restore, within a document including the received search target word, the portion in the vicinity of the position represented by the position information stored in association with the document identification information associated with the search target word.