JP2014238863A

JP2014238863A - Retrieval apparatus, retrieval method, and retrieval program

Info

Publication number: JP2014238863A
Application number: JP2014155577A
Authority: JP
Inventors: Tatsuya Asai; Shinichiro Tako; Seishi Okamoto
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-07-30
Filing date: 2014-07-30
Publication date: 2014-12-18
Anticipated expiration: 2029-06-11
Also published as: JP6055441B2

Abstract

PROBLEM TO BE SOLVED: To accurately extract data corresponding to a retrieval condition from compressed data while keeping the data compressed. SOLUTION: When compressing document data, a compression processing part 10a converts the character strings included in the document data and the delimiter characters included in the document data into codes separately. Because the character strings and the delimiter characters are compressed separately, a retrieval processing part 10c can obtain a correct retrieval result even when it extracts the compressed data as it is on the basis of a retrieval expression.

Description

The present invention relates to a search device and the like.

In recent years, techniques have been devised for extracting the records and items to be searched from compressed data while the data remains compressed. Hereinafter, a technique that extracts records and items in the compressed state is referred to as compressed data extraction. Data that has been compressed is referred to as compressed data, and the data before compression is referred to as original data.

With compressed data extraction, the compressed data does not need to be decompressed, which saves storage space. In addition, with a suitable compression method, scan-type algorithms such as an AC machine (Aho-Corasick machine) can search the compressed data faster than they can search the original data.

FIG. 49 is a diagram for explaining the basic idea of conventional compressed data extraction. As a precondition, the original data 10a in FIG. 49 is assumed to have been compressed into compressed data 10b by BPE (Byte Pair Encoding). In conventional compressed data extraction, when an extraction condition is specified, the positions at which a keyword corresponding to the extraction condition appears are identified. The records, items, elements, and the like that contain those positions are then cut out of the compressed data as they are and output.

For example, suppose that the extraction condition "ENTRY/DDB/update-data[./=19990205]" is specified. This condition means: extract the update-data elements that contain "19990205". When conventional compressed data extraction identifies the keyword position using an existing compressed search technique, the position is "<up・宛・0205益p・飲" in the compressed data 10b. Conventional compressed data extraction then outputs "<up・宛・0205益p・飲" unchanged as the search result. The original data 10a is never accessed during the search.

Meanwhile, as a data compression technique, a technique that inserts delimiter symbols at arbitrary positions while text data is being compressed has been disclosed. With delimiter symbols inserted at arbitrary positions, only the substring between two delimiter symbols needs to be decompressed instead of the whole character string, which reduces the processing load.

In the field of image compression, a technique is also known that divides an image into predetermined regions and compresses the region portions and the boundary portions of the divided image independently. Compressing the region portions and the boundary portions independently makes it possible to extract the image data of a desired region without decompressing the entire image.

JP 2000-22552 A
JP 2008-113267 A

T. Kida, T. Matsumoto, M. Takeda, A. Shinohara, and S. Arikawa. Multiple pattern matching algorithms on collage system. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching (CPM2001).
M. Takeda, S. Miyamoto, T. Kida, A. Shinohara, S. Fukamachi, T. Shinohara, and S. Arikawa. Processing Text Files as Is: Pattern Matching over Compressed Texts, Multi-Byte Character Texts, and Semi-Structured Texts. In Proc. 9th International Symposium on String Processing and Information Retrieval (SPIRE2002).

If the original data is compressed as is, delimiters such as XML (Extensible Markup Language) tags may be merged with records, items, or elements and replaced together by a single code. When data is compressed without regard to its delimiters in this way, the data corresponding to an extraction condition cannot be extracted in its compressed form.

For example, suppose that in the compressed data 10b of FIG. 49 the code "飲" corresponds to the pre-compression data "><". When the extraction condition "ENTRY/DDB/update-data[./=19990205]" is specified, "<up・宛・0205益p・飲" is output as the search result. Decompressing this result yields "<update-date>19990202</update-date><", which is not a correct search result because of the trailing "<".

When the extraction condition "ENTRY/DDB/update-data[./=19990205]" is specified, the correct search result after decompression is "<update-date>19990202</update-date>". Feeding an incorrect search result into an XML data processing system causes an error.

A correct result could be obtained by decompressing the search result "<up・宛・0205益p・飲" and then correcting it to "<update-date>19990202</update-date>", but this incurs extra computational cost.

If delimiter symbols are inserted when data is compressed with the conventional technique, records, items, and elements can be distinguished from delimiter characters, so the data corresponding to an extraction condition can be extracted in compressed form. However, inserting a delimiter symbol for every delimiter character increases the size of the compressed data, which is not practical.

Alternatively, if the boundary portions are compressed independently, records, items, and elements can be distinguished from delimiter characters, so the data corresponding to an extraction condition can be extracted in compressed form without increasing the data size. However, the technique of compressing boundary portions independently is applicable only when the boundary portions and the other regions are fixed. It is difficult to apply to original data in which records, items, elements, and delimiter characters occur at arbitrary positions.

An object of the present invention is to provide a search device, a search method, and a search program that can accurately extract the data corresponding to a search condition from compressed data while the data remains compressed.

In a first aspect, a search device searches for specified data in replacement data. The replacement data is obtained from data that includes first data and second data indicating an attribute of the first data, by replacing at least the second data with encoded data smaller in size than the second data. The search device includes a replacement processing unit and a search processing unit. On receiving a search condition that includes data indicating an attribute of the specified data, the replacement processing unit refers to a storage unit that stores the relationship between the second data before and after replacement, and replaces the data indicating the attribute of the specified data with encoded data. The search processing unit uses the search condition in which the data indicating the attribute has been replaced with encoded data to acquire, from the replacement data, the data included in the region of the replacement data identified by the search condition.

According to the present invention, the data corresponding to a search condition can be accurately extracted from compressed data while the data remains compressed.

FIG. 1 is a diagram illustrating the configuration of the search system according to the first embodiment.
FIG. 2 is a diagram illustrating the configuration of the search device according to the second embodiment.
FIG. 3 is a diagram illustrating an example of the data structure of XML data.
FIG. 4 is a diagram illustrating an example of the data structure of encoded data.
FIG. 5 is a diagram illustrating an example of the data structure of the correspondence table.
FIG. 6 is a diagram illustrating an example of the data structure of the AC machine.
FIG. 7 is a diagram illustrating an example of the data structure of the state structure.
FIG. 8 is a diagram illustrating an example of the data structure of compression dictionary data.
FIG. 9 is a diagram illustrating an example of the data structure of the compressed data AC machine.
FIG. 10 is a diagram illustrating an operation example of character string compression.
FIG. 11 is a diagram illustrating an example of the data structure of string-compressed data.
FIG. 12 is a diagram (1) for explaining the trie T construction process.
FIG. 13 is a diagram (2) for explaining the trie T construction process.
FIG. 14 is a diagram (3) for explaining the trie T construction process.
FIG. 15 is a diagram (1) for explaining the failure transition addition process.
FIG. 16 is a diagram (2) for explaining the failure transition addition process.
FIG. 17 is a diagram (3) for explaining the failure transition addition process.
FIG. 18 is a diagram (4) for explaining the failure transition addition process.
FIG. 19 is a diagram (5) for explaining the failure transition addition process.
FIG. 20 is a diagram (6) for explaining the failure transition addition process.
FIG. 21 is a diagram (7) for explaining the failure transition addition process.
FIG. 22 is a diagram (8) for explaining the failure transition addition process.
FIG. 23 is a diagram (9) for explaining the failure transition addition process.
FIG. 24 is a diagram (1) for explaining the compressed data AC machine construction process.
FIG. 25 is a diagram (2) for explaining the compressed data AC machine construction process.
FIG. 26 is a diagram (3) for explaining the compressed data AC machine construction process.
FIG. 27 is a diagram (1) for explaining the collation process.
FIG. 28 is a diagram (2) for explaining the collation process.
FIG. 29 is a diagram (3) for explaining the collation process.
FIG. 30 is a diagram (4) for explaining the collation process.
FIG. 31 is a diagram (5) for explaining the collation process.
FIG. 32 is a diagram (6) for explaining the collation process.
FIG. 33 is a flowchart illustrating the procedure of the compression process.
FIG. 34 is a flowchart illustrating the procedure of the character string counting process.
FIG. 35 is a flowchart illustrating the procedure of the dictionary construction and character string replacement process.
FIG. 36 is a flowchart illustrating the procedure of the tag name replacement process.
FIG. 37 is a flowchart illustrating the procedure of the construction process.
FIG. 38 is a flowchart illustrating the procedure of the extraction condition reception process.
FIG. 39 is a flowchart illustrating the procedure of the AC machine construction process.
FIG. 40 is a flowchart illustrating the procedure of the process of constructing the trie T of Π.
FIG. 41 is a flowchart illustrating the procedure of the pattern registration process.
FIG. 42 is a flowchart (1) illustrating the procedure of the process of adding failure transitions to the trie (Π).
FIG. 43 is a flowchart (2) illustrating the procedure of the process of adding failure transitions to the trie (Π).
FIG. 44 is a flowchart illustrating the procedure of the process of adding skip transitions to the AC machine AC(Π).
FIG. 45 is a flowchart illustrating the procedure of the first skip transition creation process.
FIG. 46 is a flowchart illustrating the procedure of the second skip transition creation process.
FIG. 47 is a flowchart illustrating the procedure of the cut-out position calculation process.
FIG. 48 is a diagram illustrating the hardware configuration of a computer constituting the search device according to the embodiments.
FIG. 49 is a diagram for explaining the basic idea of conventional compressed data extraction.

Embodiments of the search device, search method, and search program disclosed in the present application are described below in detail with reference to the drawings. The present invention is not limited by these embodiments.

FIG. 1 is a diagram illustrating the configuration of the search system according to the first embodiment. As shown in FIG. 1, the search device 10 includes a compression processing unit 10a, a replacement processing unit 10b, and a search processing unit 10c. The compression processing unit 10a acquires a data set in which individual data items are delimited by delimiter information indicating the attributes of the data. It replaces the data contained in the data set with encoded data on the basis of that data, and separately replaces the delimiter information with encoded data on the basis of the delimiter information alone, thereby compressing the data set.

When the replacement processing unit 10b acquires a search condition that specifies an attribute and data to be searched for in the data set compressed by the compression processing unit 10a, it replaces the attribute in the search condition with encoded data on the basis of the relationship between the delimiter information of the compressed data set and the delimiter information before replacement.

The search processing unit 10c uses the search condition whose attribute has been replaced with encoded data, together with the compressed data set, to identify the region of the compressed data set specified by the search condition, and outputs the data contained in the identified region.

In the search system 10 described above, the compression processing unit 10a compresses the data set on the basis of the delimiter information contained in the data set. Because the data set is compressed on the basis of the delimiter information, the search processing unit 10c obtains an accurate search result even when it cuts out the compressed data as is according to the search condition.
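As a rough illustration of why separate handling keeps cut-outs well-formed, the following Python sketch splits a small XML string into tag and text segments and applies a substitution to the text segments only, so that no code can span a tag boundary. The function names, the regular expression, and the single-pair toy substitution (BA to X) are assumptions made for this example and are not taken from the embodiment.

```python
import re

# Hypothetical tokenizer: a tag is anything from "<" to the next ">".
TAG = re.compile(r"(<[^>]+>)")

def split_segments(xml):
    """Return (kind, text) pairs, where kind is "tag" or "text"."""
    segments = []
    for piece in TAG.split(xml):
        if piece:  # re.split yields empty strings between adjacent tags
            kind = "tag" if piece.startswith("<") else "text"
            segments.append((kind, piece))
    return segments

def compress_separately(xml, compress_text):
    """Apply compress_text to text segments only; tags pass through unchanged here."""
    return "".join(
        piece if kind == "tag" else compress_text(piece)
        for kind, piece in split_segments(xml)
    )

def toy(text):
    # Toy stand-in for the text compressor: one fixed pair substitution.
    return text.replace("BA", "X")

result = compress_separately("<AAA><BBB>BACD</BBB></AAA>", toy)
print(result)
```

Because the substitution never sees a tag, any region cut out between two tags decodes to complete text content.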

FIG. 2 is a diagram illustrating the configuration of the search device according to the second embodiment. As shown in FIG. 2, the search device 100 includes an input unit 110, an output unit 120, a communication control IF unit 130, an input/output control IF unit 140, a storage unit 150, and a control unit 160. The search device 100 is connected to a terminal device (not shown) via a network.

The input unit 110 is used to input various kinds of information and includes a keyboard, a mouse, a microphone, and the like. The monitor described later (output unit 120) also provides a pointing-device function in cooperation with the mouse.

The output unit 120 outputs various kinds of information and includes a monitor (or display or touch panel), a speaker, and the like. The communication control IF unit 130 is a processing unit that controls communication with the terminal device (not shown). The input/output control IF unit 140 controls the input and output of data by the input unit 110, the output unit 120, the communication control IF unit 130, the storage unit 150, and the control unit 160.

The storage unit 150 stores the data and programs required for the various processes performed by the control unit 160. The storage unit 150 holds XML data 150a, encoded data 150b, a correspondence table 150c, an AC machine 150d, a compression dictionary 150e, a compressed data AC machine 150f, an extraction condition 150g, and a post-conversion extraction condition 150h.

The XML data 150a is document data with a hierarchical structure in which elements are delimited by element identifiers such as "<" and "</". FIG. 3 is a diagram illustrating an example of the data structure of the XML data. The encoded data 150b is the compressed XML data. FIG. 4 is a diagram illustrating an example of the data structure of the encoded data 150b.

The correspondence table 150c is a table that associates each tag name in the XML data 150a with the code corresponding to that tag name. FIG. 5 is a diagram illustrating an example of the data structure of the correspondence table 150c.
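One way to picture such a table is the Python sketch below, which assigns a code to each tag name and rewrites the tags accordingly. The numbering rule (integers in order of first appearance), the regular expression, and the function names are assumptions for illustration. The actual layouts of FIG. 4 and FIG. 5 are not reproduced here.

```python
import re

# Hypothetical tag pattern: optional "/", a tag name, then any remainder up to ">".
TAG_NAME = re.compile(r"<(/?)([^<>/\s]+)([^<>]*)>")

def build_table(xml):
    """Map each tag name to a short code, numbered by first appearance (assumed rule)."""
    table = {}
    for match in TAG_NAME.finditer(xml):
        name = match.group(2)
        if name not in table:
            table[name] = str(len(table) + 1)
    return table

def replace_tag_names(xml, table):
    """Rewrite every tag so that its name is replaced by the code from the table."""
    def substitute(match):
        slash, name, rest = match.groups()
        return f"<{slash}{table[name]}{rest}>"
    return TAG_NAME.sub(substitute, xml)

xml = "<AAA><BBB>BACD</BBB></AAA>"
table = build_table(xml)
encoded = replace_tag_names(xml, table)
print(table, encoded)
```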

The AC machine 150d is data consisting of a finite number of states and the transitions between them. Feeding text data (a character string) to the AC machine 150d makes it possible to check whether the text data contains specific keywords. FIG. 6 is a diagram illustrating an example of the data structure of the AC machine 150d. The AC machine 150d shown there checks whether the text data contains the keywords AC, BA, BB, BAA, and BACD.

As shown in FIG. 6, the AC machine 150d has states 0 to 8. Each state is classified as an initial state, a normal state, or a collation state. The initial state is the state that is matched against the text data first. A normal state is a state that is matched against the text data second or later. A collation state is a state that is reached when the text data hits a specific keyword. In the example shown in FIG. 6, the initial state is state 0, the normal states are states 1 and 3, and the collation states are states 2 and 4 to 8.

The AC machine matches the characters of the text data one by one starting from the initial state 0 and, by repeating normal transitions and failure transitions, determines which keywords the text data contains. A normal transition is the transition taken when, in the current state, a destination exists for the character of the text data being compared. For example, in state 1, if the character being compared is C, a normal transition is made from state 1 to state 2.

When the condition for a normal transition is not met, a failure transition occurs. For example, in state 1, if the character being compared is anything other than C, a failure transition occurs and the machine moves to state 0. In the AC machine, each state has exactly one failure transition. All failure transitions omitted from FIG. 6 are assumed to lead to the initial state 0.
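The behavior described above can be sketched with a textbook Aho-Corasick construction in Python. This is a generic sketch rather than the construction procedure of the embodiment: the function names are hypothetical, and states are numbered in the order in which keywords are inserted, which is only assumed to resemble the numbering of FIG. 6.

```python
from collections import deque

def build_ac(keywords):
    """Build goto (normal transitions), fail (failure transitions) and output lists."""
    goto, fail, out = [{}], [0], [[]]
    for kw in keywords:
        s = 0
        for ch in kw:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(kw)
    # Breadth-first pass: depth-1 states fail to state 0, deeper states are derived.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit keywords that end here as suffixes
            queue.append(t)
    return goto, fail, out

def search(text, goto, fail, out):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]            # failure transition
        s = goto[s].get(ch, 0)     # normal transition, or stay in state 0
        for kw in out[s]:
            hits.append((i - len(kw) + 1, kw))
    return hits

goto, fail, out = build_ac(["AC", "BA", "BB", "BAA", "BACD"])
hits = search("ABABACBACD", goto, fail, out)
print(hits)
```

With this numbering, state 1 is reached by A and state 2 by AC, so reading C in state 1 is a normal transition to state 2 and any other character falls back to state 0.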

The data structure of a state (state structure) of the AC machine shown in FIG. 6 is described next. FIG. 7 is a diagram illustrating an example of the data structure of the state structure. As shown in FIG. 7, the state structure has a state ID that identifies the state, a pattern list, pointers to the normal transition destinations, and a pointer to the failure transition destination.

The pattern list stores the character string spelled out from the initial state 0 to the destination state. Each entry of the pattern list is associated with the pointer to the corresponding normal transition destination. For example, in the state structure of state 1, if the pointer for the normal transition from state 1 to state 2 is stored in g[1], then plist[1] stores the character string "AC" that leads from the initial state 0 to state 2.
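A state structure of this shape might be modeled as follows. The field names g and plist follow the text, while state_id, f, and the helper add_transition are hypothetical names introduced for this sketch. Python lists are zero-based, so the first entry is index 0 rather than the index 1 used in the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class State:
    state_id: int
    plist: List[str] = field(default_factory=list)   # strings from state 0 to each destination
    g: List["State"] = field(default_factory=list)   # normal transition destinations
    f: Optional["State"] = None                       # failure transition destination

    def add_transition(self, target: "State", path: str) -> None:
        """Append a normal transition and its pattern so that g[i] pairs with plist[i]."""
        self.g.append(target)
        self.plist.append(path)

s0 = State(0)
s1 = State(1, f=s0)
s2 = State(2, f=s0)
s1.add_transition(s2, "AC")   # state 1 to state 2, spelled "AC" from state 0
print(s1.g[0].state_id, s1.plist[0])
```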

The compression dictionary data 150e is a table that associates each character pair with the code assigned to that pair. FIG. 8 is a diagram illustrating an example of the data structure of the compression dictionary data 150e. For example, as shown in FIG. 8, the pair of characters B, A corresponds to the code X. The compression dictionary data 150e is used when the compressed data AC machine 150f is generated from the AC machine 150d.

The compressed data AC machine 150f is an AC machine obtained by adding skip transitions to the AC machine 150d so that it can check whether compressed data contains specific keywords. FIG. 9 is a diagram illustrating an example of the data structure of the compressed data AC machine 150f.

A skip transition is a normal transition that skips over states. For example, when the current state is state 0 and the character of the compressed data being compared is X, the machine skips state 3 and moves to state 4.
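The destination of a skip transition can be pictured as the state reached by feeding the expansion of a code through the uncompressed machine one character at a time. The Python sketch below does only that. It is a simplification that ignores keyword hits occurring partway through an expansion, and the hard-coded GOTO and FAIL tables use a state numbering that is assumed, not copied from FIG. 6 or FIG. 9.

```python
# Assumed tables for the keywords AC, BA, BB, BAA, BACD
# (state 1 = "A", 2 = "AC", 3 = "B", 4 = "BA", 5 = "BB", 6 = "BAA", 7 = "BAC", 8 = "BACD").
GOTO = {0: {"A": 1, "B": 3}, 1: {"C": 2}, 3: {"A": 4, "B": 5},
        4: {"A": 6, "C": 7}, 7: {"D": 8}}
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 3, 6: 1, 7: 2, 8: 0}

def step(state, ch):
    """One ordinary AC step: follow failure transitions until ch can be consumed."""
    while state and ch not in GOTO.get(state, {}):
        state = FAIL[state]
    return GOTO.get(state, {}).get(ch, 0)

def expand(code, dictionary):
    """Recursively expand a code into the original characters."""
    if code not in dictionary:
        return code
    return "".join(expand(c, dictionary) for c in dictionary[code])

def skip_target(state, code, dictionary):
    """State reached after consuming the whole expansion of one compressed symbol."""
    for ch in expand(code, dictionary):
        state = step(state, ch)
    return state

dictionary = {"X": "BA"}
print(skip_target(0, "X", dictionary))
```

Here the code X expands to BA, so the transition from state 0 passes over state 3 and lands on state 4, as in the example above.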

The extraction condition 150g is the condition for the data to be extracted from the XML data 150a. For example, the extraction condition 150g is "/AAA/BBB="BACD"". Here, "/AAA/BBB" is an item name that specifies the search location. The condition "/AAA/BBB="BACD"" searches, among the character strings enclosed by the delimiter "BBB" located in the level below the delimiter "AAA", for those that contain the character string "BACD".

The post-conversion extraction condition 150h is the extraction condition whose tag names have been converted on the basis of the correspondence table 150c. For example, the post-conversion extraction condition 150h is "/1/2="BACD"". The AC machine 150d is generated from the keywords contained in the post-conversion extraction condition 150h.
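The conversion in this example can be sketched in a few lines of Python. The function name convert_condition and the table contents (AAA to 1, BBB to 2, inferred from the example strings) are assumptions, and only the simple path="value" form shown above is handled.

```python
def convert_condition(condition, table):
    """Replace each tag name in the path part of the condition with its code."""
    path, sep, value = condition.partition("=")   # split at the first "="
    steps = [table.get(step, step) for step in path.split("/")]
    return "/".join(steps) + sep + value

table = {"AAA": "1", "BBB": "2"}   # assumed correspondence table
converted = convert_condition('/AAA/BBB="BACD"', table)
print(converted)
```

Only the path is rewritten. The keyword "BACD" is left untouched at this stage.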

The control unit 160 has an internal memory for storing programs that define various processing procedures and control data, and executes various processes using them. As shown in FIG. 2, the control unit 160 includes a data management unit 160a, a data compression processing unit 160b, an AC machine construction unit 160c, a collation processing unit 160d, and a cut-out processing unit 160e.

The data management unit 160a is a processing unit that, when it acquires XML data or an extraction condition from the input unit 110 or the communication control IF unit 130, stores the acquired XML data or extraction condition in the storage unit 150.

The data compression processing unit 160b is a processing unit that compresses the XML data 150a to generate the encoded data 150b, and converts the extraction condition 150g to generate the post-conversion extraction condition 150h.

First, the process of compressing the XML data 150a to generate the encoded data 150b is described. This process consists of a process that compresses the character strings enclosed by the tags of the XML data 150a and a process that replaces the tag names.

In the character string compression process, the data compression processing unit 160b scans the character string and identifies adjacent substrings of length 2. In the following description, an adjacent substring of length 2 is called a character pair. The data compression processing unit 160b counts the occurrences of each character pair in the character string and compresses the string by assigning an unused character (code) to the character pair with the highest count. The data compression processing unit 160b repeats this process until the occurrence count of the character pairs in the string falls below the count threshold.

FIG. 10 is a diagram illustrating an operation example of character string compression. In the description of FIG. 10, the character string to be compressed is “ABABACBACD”, the count threshold α is 2, and the character set is “A, B, C, D, X, Y, Z”. The data compression processing unit 160b refers to the character string “ABABACBACD” and identifies its character pairs. The character pairs included in the character string “ABABACBACD” are “A, B”, “B, A”, “A, C”, “C, B”, and “C, D”.

The data compression processing unit 160b compares the character string with each character pair, counts the number of occurrences of each character pair, and registers the counts in a count table, which is held by the data compression processing unit 160b. The character pair “A, B” occurs 2 times, “B, A” occurs 3 times, “A, C” occurs 2 times, “C, B” occurs 1 time, and “C, D” occurs 1 time. See step S1.

Subsequently, the data compression processing unit 160b identifies, among the character pairs registered in the count table, the character pair with the largest number of occurrences, and assigns an unused character to the identified character pair. The relationship between the character pair and the unused character is registered in the compression dictionary data 150e.

Referring to the count table created in step S1, the character pair “B, A” has the largest number of occurrences, so the data compression processing unit 160b assigns the unused character “X” to the character pair “B, A”. As a result, the character string “ABABACBACD” becomes the character string “AXXCXCD”. The data compression processing unit 160b registers the relationship between the character pair “B, A” and the unused character (code) “X” in the compression dictionary data 150e.

The data compression processing unit 160b refers to the character string “AXXCXCD” and identifies its character pairs. The character pairs included in the character string “AXXCXCD” are “A, X”, “X, X”, “X, C”, “C, X”, and “C, D”.

The data compression processing unit 160b compares the character string with each character pair, counts the number of occurrences of each character pair, and registers the counts in the count table. The character pair “A, X” occurs 1 time, “X, X” occurs 1 time, “X, C” occurs 2 times, “C, X” occurs 1 time, and “C, D” occurs 1 time. See step S2.

Referring to the count table created in step S2, the character pair “X, C” has the largest number of occurrences, so the data compression processing unit 160b assigns the unused character “Y” to the character pair “X, C”. As a result, the character string “AXXCXCD” becomes the character string “AXYYD”. The data compression processing unit 160b registers the relationship between the character pair “X, C” and the unused character (code) “Y” in the compression dictionary data 150e.

The data compression processing unit 160b refers to the character string “AXYYD” and identifies its character pairs. The character pairs included in the character string “AXYYD” are “A, X”, “X, Y”, “Y, Y”, and “Y, D”.

The data compression processing unit 160b compares the character string with each character pair, counts the number of occurrences of each character pair, and registers the counts in the count table. The character pairs “A, X”, “X, Y”, “Y, Y”, and “Y, D” each occur 1 time. See step S3.

The data compression processing unit 160b ends the compression of the character string when the number of occurrences of every character pair is less than the count threshold α, or when no unused character remains in the character set. In step S3 every character pair occurs only once, which is less than α = 2, so the compression ends. When the compression ends, the data compression processing unit 160b outputs the compressed character string and the compression dictionary data 150e. See step S4.
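The pair-replacement loop walked through in steps S1 to S4 can be sketched as follows. This is a minimal illustration, not the implementation of the data compression processing unit 160b: the function name `compress`, the tie-breaking rule (the first pair encountered wins) and the use of left-to-right, non-overlapping replacement are assumptions that the description does not specify.

```python
from collections import Counter


def compress(text, unused, alpha=2):
    """Repeatedly replace the most frequent character pair with an unused code.

    `unused` lists the unused characters in the order they are handed out
    (an assumption); `alpha` is the count threshold.
    """
    dictionary = []          # (code, pair) entries, in registration order
    unused = list(unused)
    while unused:            # stop when no unused character remains
        # Count every adjacent pair with a sliding window (the count table).
        counts = Counter(zip(text, text[1:]))
        if not counts:
            break
        pair, n = max(counts.items(), key=lambda kv: kv[1])
        if n < alpha:        # every pair is below the threshold: stop
            break
        code = unused.pop(0)
        dictionary.append((code, "".join(pair)))
        # str.replace scans left to right and never overlaps matches.
        text = text.replace("".join(pair), code)
    return text, dictionary


compressed, dictionary = compress("ABABACBACD", ["X", "Y", "Z"], alpha=2)
print(compressed)   # AXYYD
print(dictionary)   # [('X', 'BA'), ('Y', 'XC')]
```

With the example string, the sketch registers “B, A” as X and then “X, C” as Y, matching the contents described for the compression dictionary data 150e.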

The data compression processing unit 160b performs the compression process described above on all character strings included in the XML data 150a. In the following description, XML data whose character strings have been compressed is referred to as character string compressed data. FIG. 11 is a diagram illustrating an example of the data structure of character string compressed data. Comparing the XML data 150a in FIG. 3 with the character string compressed data in FIG. 11 shows that the character string “ABABACBACD” has been compressed to the character string “AXYYD”.

Next, the process of replacing tag names will be described. The data compression processing unit 160b scans the character string compressed data and registers all tag names appearing in it in the correspondence table 150c. The data compression processing unit 160b assigns a unique code to each tag name registered in the correspondence table 150c. The data compression processing unit 160b then compares the character string compressed data with the correspondence table 150c and replaces each tag name in the character string compressed data with the corresponding code, thereby generating the encoded data 150b.

For example, assume that the character string compressed data is the data shown in FIG. 11 and that the correspondence table 150c is the correspondence table shown in FIG. 5. In this case, the data compression processing unit 160b replaces the tag names “AAA”, “BBB”, and “CCC” in the character string compressed data with the codes “1”, “2”, and “3”, respectively, and generates the encoded data shown in FIG. 4.
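As a rough sketch of the two passes described above (register every tag name, then substitute codes), the following uses a regular expression over a small XML-like string. The sample input, the function names, the assignment of codes in order of first appearance, and the way codes are written back inside angle brackets are all assumptions; the actual layouts of FIG. 4, FIG. 5, and FIG. 11 are not reproduced here, and attributes are not handled.

```python
import re

# Matches an opening or closing tag without attributes (a simplification).
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def build_table(xml_text):
    """First pass: register each tag name with a unique code."""
    table = {}
    for _slash, name in TAG.findall(xml_text):
        table.setdefault(name, str(len(table) + 1))
    return table


def encode_tags(xml_text, table):
    """Second pass: replace every tag name with its code from the table."""
    return TAG.sub(lambda m: f"<{m.group(1)}{table[m.group(2)]}>", xml_text)


# Hypothetical character string compressed data, for illustration only.
sample = "<AAA><BBB>AXYYD</BBB><CCC>D</CCC></AAA>"
table = build_table(sample)
print(table)                       # {'AAA': '1', 'BBB': '2', 'CCC': '3'}
print(encode_tags(sample, table))  # <1><2>AXYYD</2><3>D</3></1>
```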

Next, the process of converting the extraction condition 150g to generate the post-conversion extraction condition 150h will be described. The data compression processing unit 160b compares the extraction condition 150g with the correspondence table 150c and replaces the tag names included in the extraction condition 150g with codes, thereby generating the post-conversion extraction condition 150h.

For example, assume that the correspondence table 150c is the correspondence table shown in FIG. 5 and that the extraction condition 150g is “/AAA/BBB="BACD"”. In this case, the data compression processing unit 160b replaces the tag names “AAA” and “BBB” in the extraction condition 150g with “1” and “2”, respectively, and generates the post-conversion extraction condition 150h “/1/2="BACD"”.
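The conversion of the extraction condition can be illustrated with a few lines of string handling. This is only a sketch: it assumes the condition always has the shape path, equals sign, quoted keyword, as in the single example given, and the function name `convert_condition` is hypothetical.

```python
def convert_condition(condition, table):
    """Replace the tag names in the path part of a condition with their codes."""
    # Split once at the first '='; the keyword part is left untouched.
    path, sep, keyword = condition.partition("=")
    converted = "/".join(table.get(step, step) for step in path.split("/"))
    return converted + sep + keyword


table = {"AAA": "1", "BBB": "2", "CCC": "3"}   # correspondence table of the example
print(convert_condition('/AAA/BBB="BACD"', table))   # /1/2="BACD"
```

Only the tag names are rewritten; the keyword “BACD” stays in its uncompressed form, and it is the keyword that the AC machine is later built from.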

Next, the AC machine construction unit 160c shown in FIG. 2 will be described. Based on the keywords included in the post-conversion extraction condition 150h, the AC machine construction unit 160c generates the AC machine 150d that matches those keywords. The AC machine construction unit 160c then generates the compressed data AC machine 150f by adding skip transitions to the AC machine 150d based on the compression dictionary data 150e.

The AC machine construction unit 160c is described in detail below. The process by which the AC machine construction unit 160c generates the AC machine 150d is described first, followed by the process of generating the compressed data AC machine 150f. The set of keywords included in the post-conversion extraction condition 150h is referred to as the pattern set Π, and for convenience of explanation, the pattern set Π = {AC, BA, BB, BAA, BACD}.

The process of generating the AC machine 150d includes a process of constructing a trie and a process of adding failure transitions to the trie. FIGS. 12 to 14 are diagrams for explaining the construction of the trie T.

In FIG. 12, the AC machine construction unit 160c creates the initial state 0 and sets all normal transition destinations in the state structure of the initial state to the initial state itself (FIG. 12, step S10). The AC machine construction unit 160c then takes the pattern “AC” from the pattern set Π. The AC machine construction unit 160c selects the character “A” and sets the normal transition destination of the initial state 0 by the character A to the normal state 1.

The AC machine construction unit 160c selects the character “C” and sets the normal transition destination of the normal state 1 by the character C to the collation state 2. In the state structure of the normal state 1, the AC machine construction unit 160c also sets the pattern list corresponding to the pointer g[C] to the collation state 2 to Plist[C]=AC (FIG. 12, step S11).

The AC machine construction unit 160c returns to the initial state 0 and takes the pattern “BA” from the pattern set Π. The AC machine construction unit 160c selects the character “B” and sets the normal transition destination of the initial state 0 by the character B to the normal state 3.

The AC machine construction unit 160c selects the character “A” and sets the normal transition destination of the normal state 3 by the character A to the collation state 4. In the state structure of the normal state 3, the AC machine construction unit 160c also sets the pattern list corresponding to the pointer g[A] to the collation state 4 to Plist[A]=BA (FIG. 12, step S12).

In FIG. 13, the AC machine construction unit 160c returns to the initial state 0, takes the pattern “BB” from the pattern set Π, and selects the character “B”. The normal transition destination of the initial state 0 by the character B is the normal state 3, which has already been created, so the current state moves to the normal state 3.

The AC machine construction unit 160c selects the character “B” and sets the normal transition destination of the normal state 3 by the character B to the collation state 5. In the state structure of state 3, the AC machine construction unit 160c also sets the pattern list corresponding to the pointer g[B] to the collation state 5 to Plist[B]=BB (FIG. 13, step S13).

The AC machine construction unit 160c returns to the initial state 0, takes the pattern “BAA” from the pattern set Π, and selects the character “B”. The normal transition destination of the initial state 0 by the character B is the normal state 3, which has already been created, so the current state moves to the normal state 3.

The AC machine construction unit 160c selects the character “A”. The normal transition destination of the normal state 3 by the character A is the collation state 4, which has already been created, so the current state moves to the collation state 4.

The AC machine construction unit 160c selects the character “A” and sets the normal transition destination of the collation state 4 by the character A to the collation state 6. In the state structure of state 4, the AC machine construction unit 160c also sets the pattern list corresponding to the pointer g[A] to the collation state 6 to Plist[A]=BAA (FIG. 13, step S14).

In FIG. 14, the AC machine construction unit 160c returns to the initial state 0, takes the pattern “BACD” from the pattern set Π, and selects the character “B”. The normal transition destination of the initial state 0 by the character B is the normal state 3, which has already been created, so the current state moves to the normal state 3.

The AC machine construction unit 160c selects the character “A”. The normal transition destination of the normal state 3 by the character A is the collation state 4, which has already been created, so the current state moves to the collation state 4.

The AC machine construction unit 160c selects the character “C” and sets the normal transition destination of the collation state 4 by the character C to the normal state 7. The AC machine construction unit 160c then selects the character “D” and sets the normal transition destination of the normal state 7 by the character D to the collation state 8 (FIG. 14, step S15). When step S15 is finished, all patterns included in the pattern set Π have been registered, and the construction of the trie T ends.
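Steps S10 to S15 amount to inserting each pattern into a trie while numbering states in creation order. The sketch below follows that procedure under a few assumptions: states are held in plain Python lists rather than the state structures of the description, the self-loop of the initial state is left implicit (a missing entry in `goto[0]` means "stay in state 0"), patterns are assumed non-empty, and the pattern list entry for the last character of “BACD” is added even though the text does not state it explicitly.

```python
def build_trie(patterns):
    """Build the trie T; returns (goto, plist).

    goto[s][c]  -> state reached from state s by character c
    plist[s][c] -> patterns recognized when taking character c from state s
    """
    goto = [{}]     # state 0 is the initial state
    plist = [{}]
    for pattern in patterns:
        state = 0                           # return to the initial state
        for ch in pattern:
            if ch not in goto[state]:       # create the destination state
                goto.append({})
                plist.append({})
                goto[state][ch] = len(goto) - 1
            prev, state = state, goto[state][ch]
        # The last transition of the pattern leads to a collation state.
        plist[prev].setdefault(ch, set()).add(pattern)
    return goto, plist


goto, plist = build_trie(["AC", "BA", "BB", "BAA", "BACD"])
for s, edges in enumerate(goto):
    print(s, edges, plist[s])
```

Run on the pattern set Π in the order given, the states come out numbered 0 to 8 exactly as in the walkthrough: 0 to 1 by A, 1 to 2 by C, 0 to 3 by B, 3 to 4 by A, 3 to 5 by B, 4 to 6 by A, 4 to 7 by C, and 7 to 8 by D.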

Next, the process by which the AC machine construction unit 160c adds failure transitions to the trie will be described. FIGS. 15 to 23 are diagrams for explaining the failure transition addition process. The AC machine construction unit 160c determines the states that are normal transition destinations of the initial state 0 and registers them in a queue. The AC machine construction unit 160c sets the failure transition destination of each state registered in the queue to the initial state 0. Here, the normal transition destinations of the initial state 0 are the normal states 1 and 3, so states 1 and 3 are registered in the queue, and the failure transition destinations of the normal states 1 and 3 are set to the initial state 0. See FIG. 15.

The AC machine construction unit 160c takes state 1 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. In this case, the AC machine construction unit 160c extracts the character C and stores it in the set X.

The AC machine construction unit 160c takes the character C from the set X and adds state 2, the normal transition destination of the state s, to the tail of the queue. The AC machine construction unit 160c moves to the initial state 0, which is the failure transition destination of state 1, and determines the normal transition destination for the character C there; the failure transition destination of the state next is thus determined to be the initial state 0. The AC machine construction unit 160c determines the state next reached from the state s (normal state 1) by the character C, and sets the failure transition destination of that state (collation state 2) to the initial state 0. See FIG. 16.

The AC machine construction unit 160c takes state 3 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. In this case, the AC machine construction unit 160c extracts the characters A and B and stores them in the set X.

The AC machine construction unit 160c takes the character A from the set X and adds state 4, the normal transition destination of the state s by the character A, to the tail of the queue. The AC machine construction unit 160c moves to the initial state 0, which is the failure transition destination of the state s, and determines the normal transition destination for the character A there; the failure transition destination of the state next is thus determined to be state 1. The AC machine construction unit 160c determines the state next reached from the state s (normal state 3) by the character A, and sets the failure transition destination of that state (collation state 4) to state 1. See state 4 in FIG. 17.

The AC machine construction unit 160c takes the character B from the set X and adds state 5, the normal transition destination of the state s (normal state 3) by the character B, to the tail of the queue. The AC machine construction unit 160c moves to the initial state 0, which is the failure transition destination of the state s, and determines the normal transition destination for the character B there; the failure transition destination of the state next is thus determined to be state 3. The AC machine construction unit 160c determines the state next reached from the state s by the character B, and sets the failure transition destination of that state (collation state 5) to state 3. See state 5 in FIG. 17.

The AC machine construction unit 160c takes state 2 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. The state s has no normal transition destination, so the process proceeds to the next step. See FIG. 18.

The AC machine construction unit 160c takes state 4 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. In this case, the AC machine construction unit 160c extracts the characters A and C and stores them in the set X.

The AC machine construction unit 160c takes the character A from the set X and adds state 6, the normal transition destination of the state s (collation state 4) by the character A, to the tail of the queue. The AC machine construction unit 160c moves to state 1, which is the failure transition destination of the state s. In state 1, the normal transition destination for the character A is Null, so the AC machine construction unit 160c follows the failure transition again and moves to the initial state 0.

Then, in the initial state 0, the AC machine construction unit 160c determines the normal transition destination for the character A; the failure transition destination of the state next is thus determined to be state 1. The AC machine construction unit 160c determines the state next reached from the state s (collation state 4) by the character A, and sets the failure transition destination of that state (collation state 6) to state 1. See state 6 in FIG. 19.

The AC machine construction unit 160c takes the character C from the set X and adds state 7, the normal transition destination of the state s (collation state 4) by the character C, to the tail of the queue. The AC machine construction unit 160c moves to state 1, which is the failure transition destination of the state s, and determines the normal transition destination for the character C there; the failure transition destination of the state next is thus determined to be state 2. The AC machine construction unit 160c determines the state next reached from the state s by the character C, and sets the failure transition destination of that state (normal state 7) to state 2.

Here, the failure transition destination of state 7 is the collation state 2. In this case, the AC machine construction unit 160c sets the pattern list corresponding to the pointer g[C] to state 7 to Plist[C]=AC in the state structure of state 4, and changes state 7 from a normal state to a collation state. See states 4 and 7 in FIG. 19.

The AC machine construction unit 160c takes state 5 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. The state s has no normal transition destination, so the process proceeds to the next step. See FIG. 20.

The AC machine construction unit 160c takes state 6 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. The state s has no normal transition destination, so the process proceeds to the next step. See FIG. 21.

The AC machine construction unit 160c takes state 7 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. In this case, the AC machine construction unit 160c extracts the character D and stores it in the set X.

The AC machine construction unit 160c takes the character D from the set X and adds state 8, the normal transition destination of the state s (collation state 7) by the character D, to the tail of the queue. The AC machine construction unit 160c moves to state 2, which is the failure transition destination of the state s. In state 2, the normal transition destination for the character D is Null, so the AC machine construction unit 160c follows the failure transition again and moves to the initial state 0.

In the initial state 0, the AC machine construction unit 160c determines the normal transition destination for the character D; the failure transition destination of the state next is thus determined to be the initial state 0. The AC machine construction unit 160c determines the state next reached from the state s (collation state 7) by the character D, and sets the failure transition destination of that state (collation state 8) to the initial state 0. See state 8 in FIG. 22.

The AC machine construction unit 160c takes state 8 from the head of the queue and sets it as the state s. The AC machine construction unit 160c extracts every character a for which g[code(a)]≠Null in the state s and stores it in the set X. The state s has no normal transition destination, so the process proceeds to the next step. When no state remains in the queue, the AC machine for the pattern set Π is complete. See FIG. 23.
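The queue-driven procedure of FIGS. 15 to 23 is a breadth-first traversal of the trie. The following sketch reproduces it on the trie of the example, with the transition table written out by hand from the walkthrough. It simplifies the description in one respect, which is an assumption of this sketch: recognized patterns are stored per state (`outputs`) rather than per transition in a Plist, so "changing state 7 to a collation state" appears as state 7 inheriting the pattern AC from its failure destination.

```python
from collections import deque

# Trie of the pattern set {AC, BA, BB, BAA, BACD}, as built in steps S10 to S15.
GOTO = {0: {"A": 1, "B": 3}, 1: {"C": 2}, 2: {}, 3: {"A": 4, "B": 5},
        4: {"A": 6, "C": 7}, 5: {}, 6: {}, 7: {"D": 8}, 8: {}}
MATCHES = {2: {"AC"}, 4: {"BA"}, 5: {"BB"}, 6: {"BAA"}, 8: {"BACD"}}


def step(goto, state, ch):
    """Normal transition; None stands for Null.

    The initial state loops to itself on every character it does not list.
    """
    if ch in goto[state]:
        return goto[state][ch]
    return 0 if state == 0 else None


def add_failure_transitions(goto, matches):
    fail, order = {}, []
    outputs = {s: set(p) for s, p in matches.items()}
    queue = deque()
    for child in goto[0].values():          # destinations of the initial state
        fail[child] = 0
        queue.append(child)
    while queue:
        s = queue.popleft()
        order.append(s)
        for ch, nxt in goto[s].items():     # the characters of the set X
            queue.append(nxt)
            f = fail[s]
            while step(goto, f, ch) is None:   # follow failure transitions
                f = fail[f]
            fail[nxt] = step(goto, f, ch)
            if fail[nxt] in outputs:        # failure destination recognizes a pattern
                outputs.setdefault(nxt, set()).update(outputs[fail[nxt]])
    return fail, outputs, order


fail, outputs, order = add_failure_transitions(GOTO, MATCHES)
print(order)       # [1, 3, 2, 4, 5, 6, 7, 8]
print(fail)        # {1: 0, 3: 0, 2: 0, 4: 1, 5: 3, 6: 1, 7: 2, 8: 0}
print(outputs[7])  # {'AC'}
```

The states leave the queue in the same order as in the walkthrough, and the failure destinations agree with FIGS. 15 to 23: states 1, 2, 3, and 8 fail to the initial state 0, states 4 and 6 to state 1, state 5 to state 3, and state 7 to state 2.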

Next, the process by which the AC machine construction unit 160c adds skip transitions to the AC machine 150d to generate the compressed data AC machine 150f will be described. When a character pair in the compression dictionary data 150e is a, b, the AC machine construction unit 160c scans the paths of the AC machine 150d and determines each location where a transition by the character a is immediately followed by a transition by the character b. The AC machine construction unit 160c then adds a skip transition at that location of the AC machine 150d.

The AC machine construction unit 160c also scans the paths of the AC machine 150d and determines each location where a failure transition lies between the transition by the character a and the transition by the character b. The AC machine construction unit 160c then adds a skip transition at that location of the AC machine 150d.

FIGS. 24 to 26 are diagrams for explaining the process of constructing the compressed data AC machine. The AC machine construction unit 160c adds skip transitions corresponding to the character pairs registered in the compression dictionary data 150e to the AC machine 150d.

First, the AC machine construction unit 160c adds to the AC machine 150d the skip transition corresponding to the character pair “B, A” registered in the first row of the compression dictionary data 150e. The AC machine construction unit 160c scans all paths included in the AC machine 150d and determines each location where a transition by the character B is immediately followed by a transition by the character A.

For states s and t on a path of the AC machine 150d, when the normal transition destination of state s by the character B is state t (s.g[code(B)]=t) and the normal transition destination of state t by the character A is state u (t.g[code(A)]=u), the AC machine construction unit 160c creates a skip transition X in state s (s.g[code(X)]=u). The pattern list of the skip transition X of state s is set to s.Plist[code(X)] = s.Plist[code(B)] ∪ t.Plist[code(A)].

Among all paths of the AC machine 150d, the only combination of states s, t, and u that satisfies (s.g[code(B)]=t) and (t.g[code(A)]=u) is initial state 0, normal state 3, and matching state 4. Accordingly, the AC machine construction unit 160c sets matching state 4 in the pointer to the normal transition destination on character X (g[code(X)]) of initial state 0, and registers "BA" in the corresponding pattern list (Plist[X]). See initial state 0 in FIG. 24.

Next, the AC machine construction unit 160c scans the paths of the AC machine 150d and identifies locations where the transition on character B and the transition on character A are separated by a failure transition. For states s and t on a path of the AC machine 150d, the AC machine construction unit 160c stores in a set F every state t such that the normal transition from state s on character B leads to state t (s.g[code(B)]=t) and state t has no normal transition on character A (t.g[code(A)]=Null).

Among all paths of the AC machine 150d, the state t that satisfies (s.g[code(B)]=t) and (t.g[code(A)]=Null) is matching state 5, so the AC machine construction unit 160c stores matching state 5 in the set F. The AC machine construction unit 160c then removes from the set F every state t whose failure-transition destination has no normal transition on character A (t.fail.g[code(A)]=Null). Matching state 5 does not satisfy (t.fail.g[code(A)]=Null), so it remains in the set F.

For each state t in the set F, the AC machine construction unit 160c follows the failure transition from state t and takes the normal transition destination on character A of the failure-transition destination as state u (t.fail.g[code(A)]=u). The AC machine construction unit 160c also takes the state satisfying (s.g[code(B)]=t) as state s. The pattern list of skip transition X in state s (s.Plist[code(X)]) is then set to s.Plist[code(X)]=s.Plist[code(B)]∪t.fail.Plist[code(A)].

The state s that satisfies (t.fail.g[code(A)]=u) and (s.g[code(B)]=t) is normal state 3. The AC machine construction unit 160c sets matching state 4 in the pointer to the normal transition destination on character X (g[code(X)]) of normal state 3, and registers "BB, BA" in the corresponding pattern list (Plist[X]). See normal state 3 in FIG. 25.

Next, the AC machine construction unit 160c adds, to the AC machine 150d, skip transitions corresponding to the character pair "X, C" registered in the second row of the compression dictionary data 150e. The AC machine construction unit 160c adds a skip transition at each corresponding location in the AC machine 150d.

For states s and t on a path of the AC machine 150d, when the normal transition from state s on character X leads to state t (s.g[code(X)]=t) and the normal transition from state t on character C leads to state u (t.g[code(C)]=u), the AC machine construction unit 160c creates a skip transition Y in state s (s.g[code(Y)]=u). The pattern list of skip transition Y in state s (s.Plist[code(Y)]) is set to s.Plist[code(Y)]=s.Plist[code(X)]∪t.Plist[code(C)].

Among all paths of the AC machine 150d, the combinations of states s, t, and u that satisfy (s.g[code(X)]=t) and (t.g[code(C)]=u) are (initial state 0, matching state 4, matching state 7) and (normal state 3, matching state 4, matching state 7).

The AC machine construction unit 160c sets matching state 7 in the pointer to the normal transition destination on character Y (g[code(Y)]) of initial state 0, and registers "BA, AC" in the corresponding pattern list (Plist[Y]). See initial state 0 in FIG. 26.

The AC machine construction unit 160c also sets matching state 7 in the pointer to the normal transition destination on character Y (g[code(Y)]) of normal state 3, and registers "BB, BA, AC" in the corresponding pattern list (Plist[Y]). See normal state 3 in FIG. 26.

Next, the AC machine construction unit 160c scans the paths of the AC machine 150d and identifies locations where the transition on character X and the transition on character C are separated by a failure transition. In the AC machine 150d shown in FIG. 26, no such location exists, so the AC machine shown in FIG. 26 becomes the compressed-data AC machine 150f.
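The skip-transition construction described above can be sketched as follows. This is an illustrative sketch only, not the implementation of the embodiment: the keyword set {AC, BA, BB, BACD} is an assumption chosen to be consistent with the pattern lists named in the walkthrough, the function names build_ac and add_skip are hypothetical, and the state numbers come from insertion order, so they need not match the numbers in FIGS. 24 to 26.

```python
from collections import deque

def build_ac(patterns):
    """Plain Aho-Corasick machine: goto tables, failure links, per-edge pattern lists."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:                          # build the trie
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())             # breadth-first failure links
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    # Pattern lists are kept per transition, as in the text: plist[s][c] holds
    # the keywords reported when the edge g[c] of state s is taken.
    plist = [{c: set(out[t]) for c, t in edges.items()} for edges in goto]
    return goto, fail, plist

def add_skip(goto, fail, plist, first, second, code):
    """Add skip transitions for one dictionary entry (first, second) -> code."""
    new_edges = []
    for s in range(len(goto)):
        t = goto[s].get(first)
        if t is None:
            continue
        if second in goto[t]:                   # two consecutive normal transitions
            u, hits = goto[t][second], plist[t][second]
        elif second in goto[fail[t]]:           # one failure transition in between
            u, hits = goto[fail[t]][second], plist[fail[t]][second]
        else:
            continue
        new_edges.append((s, u, plist[s][first] | hits))
    for s, u, hits in new_edges:                # add after the scan is complete
        goto[s][code] = u
        plist[s][code] = hits

# Hypothetical keyword set and the dictionary rows (B, A) -> X and (X, C) -> Y.
goto, fail, plist = build_ac(["AC", "BA", "BB", "BACD"])
for (first, second), code in [(("B", "A"), "X"), (("X", "C"), "Y")]:
    add_skip(goto, fail, plist, first, second, code)
```

Under these assumptions, the initial state receives Plist[X] = {BA} and Plist[Y] = {BA, AC}, and the state reached on B receives Plist[X] = {BB, BA} and Plist[Y] = {BB, BA, AC}, which agrees with the pattern lists given in the walkthrough. Dictionary rows are applied in order because a later row may refer to a code introduced by an earlier one.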

Next, the matching processing unit 160d shown in FIG. 2 will be described. When given a text to be matched, the matching processing unit 160d compares the text against the compressed-data AC machine 150f and determines whether the text contains a specific keyword.

FIGS. 27 to 32 are diagrams for explaining the matching process. Here, as an example, the text to be matched is "AXYYD". The matching processing unit 160d sets the current state s to initial state 0 of the compressed-data AC machine 150f. See FIG. 27.

The matching processing unit 160d reads the first character "A" of the text and determines the normal transition destination from initial state 0 on character A. Because that destination is normal state 1, the current state s is set to normal state 1. See FIG. 28.

The matching processing unit 160d reads the second character "X" of the text and determines the transition destination from normal state 1 on character X. Because normal state 1 has no normal transition on character X, a failure transition is taken from normal state 1 and the state s is set to initial state 0.

The matching processing unit 160d determines the normal transition destination from initial state 0 on character X. Because that destination is matching state 4, the current state s is set to matching state 4. Because matching state 4 has been hit, the matching processing unit 160d registers in a set R the pattern list Plist[X]={BA} associated with g[code(X)] of initial state 0. See FIG. 29.

The matching processing unit 160d reads the third character "Y" of the text and determines the normal transition destination from matching state 4 on character Y. Because matching state 4 has no normal transition on character Y, a failure transition is taken from matching state 4 and the state s is set to normal state 1.

The matching processing unit 160d determines the normal transition destination from normal state 1 on character Y. Because normal state 1 has no normal transition on character Y, a failure transition is taken from normal state 1 and the state s is set to initial state 0.

The matching processing unit 160d determines the normal transition destination from initial state 0 on character Y. Because that destination is matching state 7, the current state s is set to matching state 7. Because matching state 7 has been hit, the matching processing unit 160d registers in the set R the pattern list Plist[Y]={BA,AC} associated with g[code(Y)] of initial state 0. Because BA is already registered in the set R, only AC is newly registered. See FIG. 30.

The matching processing unit 160d reads the fourth character "Y" of the text and determines the normal transition destination from matching state 7 on character Y. Because matching state 7 has no normal transition on character Y, a failure transition is taken from matching state 7 and the state s is set to matching state 2.

The matching processing unit 160d determines the normal transition destination from matching state 2 on character Y. Because matching state 2 has no normal transition on character Y, a failure transition is taken from matching state 2 and the state s is set to initial state 0.

The matching processing unit 160d determines the normal transition destination from initial state 0 on character Y. Because that destination is matching state 7, the current state s is set to matching state 7. Because matching state 7 has been hit, the matching processing unit 160d processes the pattern list Plist[Y]={BA,AC} associated with g[code(Y)] of initial state 0. Because BA and AC are already registered in the set R, nothing is newly registered. See FIG. 31.

The matching processing unit 160d reads the fifth character "D" of the text and determines the normal transition destination from matching state 7 on character D. Because that destination is matching state 8, the current state s is set to matching state 8. Because matching state 8 has been hit, the matching processing unit 160d registers in the set R the pattern list Plist[D]={BACD} associated with g[code(D)] of matching state 7. See FIG. 32.

As shown in FIGS. 27 to 32, the matching processing unit 160d reads the text one character at a time, moves through the states of the compressed-data AC machine 150f, and executes the matching process, thereby determining, for example, that the text "AXYYD" contains the keywords "BA, AC, BACD". The matching processing unit 160d executes the matching process in the same manner as in FIGS. 27 to 32 even when it receives the text one character at a time instead of receiving the whole text "AXYYD" at once.
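The walkthrough of FIGS. 27 to 32 can be replayed with a small table-driven sketch. The tables below are transcribed from the states and transitions mentioned in the text, using the state numbers of the walkthrough; failure links that the text does not state explicitly are assumed to lead to initial state 0, and the names GOTO, FAIL, PLIST, and match are hypothetical.

```python
# Transition tables for the compressed-data AC machine, limited to the
# states that the walkthrough mentions (assumption: nothing else is needed).
GOTO = {
    0: {"A": 1, "B": 3, "X": 4, "Y": 7},
    1: {"C": 2},
    2: {},
    3: {"A": 4, "B": 5, "X": 4, "Y": 7},
    4: {"C": 7},
    5: {},
    7: {"D": 8},
    8: {},
}
# Failure links: 4->1, 1->0, 7->2, 2->0 are stated in the walkthrough;
# 5->3 follows from the construction example; 3->0 and 8->0 are assumed.
FAIL = {1: 0, 2: 0, 3: 0, 4: 1, 5: 3, 7: 2, 8: 0}
# Pattern lists are attached to transitions (state, character).
PLIST = {
    (0, "X"): {"BA"}, (0, "Y"): {"BA", "AC"},
    (3, "X"): {"BB", "BA"}, (3, "Y"): {"BB", "BA", "AC"},
    (1, "C"): {"AC"}, (3, "A"): {"BA"}, (3, "B"): {"BB"},
    (4, "C"): {"AC"}, (7, "D"): {"BACD"},
}

def match(text):
    """Return the set R of keywords found in a compressed text."""
    s, found = 0, set()
    for ch in text:
        # Follow failure transitions until a normal transition on ch exists.
        while s != 0 and ch not in GOTO[s]:
            s = FAIL[s]
        if ch in GOTO[s]:
            found |= PLIST.get((s, ch), set())
            s = GOTO[s][ch]
    return found

result = match("AXYYD")
```

With these tables, match("AXYYD") visits states 1, 0, 4, 1, 0, 7, 2, 0, 7, and 8 in the same order as the walkthrough and collects BA, AC, and BACD.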

Next, the cutout processing unit 160e shown in FIG. 2 will be described. The cutout processing unit 160e cooperates with the matching processing unit 160d to calculate the positions in the encoded data 150b that correspond to the post-conversion extraction condition 150h. In the following description, these positions are referred to as the cutout position list.

The cutout processing unit 160e cuts out from the encoded data 150b the information at the positions given in the cutout position list, and outputs the cut-out information to the output unit 120 or the communication control IF unit 130 as the answer to the search condition.

The processing of the cutout processing unit 160e is described concretely below. The cutout processing unit 160e scans the encoded data 150b one character at a time and executes processing according to the type of character. The description is divided into three cases: the scanned character belongs to a character string part, is a start tag "<>", or is an end tag "</>".

The case where the scanned character belongs to a character string part is described first. In this case, the cutout processing unit 160e outputs the scanned character to the matching processing unit 160d and obtains from the matching processing unit 160d information on whether the compressed-data AC machine 150f has moved to a matching state.

In addition to the processing described above, when the matching processing unit 160d receives a character from the cutout processing unit 160e, it advances the current state s and notifies the cutout processing unit 160e of whether a transition to a matching state occurred. When the cutout processing unit 160e is notified that a transition to a matching state occurred, it sets the keyword flag k to 1.

Next, the case where the scanned character is a start tag is described. In this case, the cutout processing unit 160e registers in a stack S the tag code and the current position (start: the number of characters counted from the first character of the encoded data 150b up to the "<" of the tag code).

The cutout processing unit 160e also determines whether the tag codes registered in the stack S match the item part of the post-conversion extraction condition 150h, and sets the item flag f to 1 if they match.

Next, the case where the scanned character is an end tag is described. In this case, the cutout processing unit 160e determines whether the keyword flag k and the item flag f are both 1. If both are 1, the cutout processing unit 160e registers the current position (the number of characters counted from the first character of the encoded data 150b up to the ">" of the tag code) as end, and obtains start from the last element of the stack S. The cutout processing unit 160e then registers the pair (start, end) in the cutout position list. Finally, the cutout processing unit 160e pops the tag code from the stack S and sets the keyword flag k to 0.

The processing of the cutout processing unit 160e is now described for the case where the encoded data 150b is the encoded data shown in FIG. 4 and the post-conversion extraction condition is /1/2="BACD". As initial settings, the keyword flag k and the item flag f are 0, the stack S is an empty stack, and the cutout position list R is an empty list.

The cutout processing unit 160e scans "<1>" in the encoded data 150b. "<1>" is a start tag. The cutout processing unit 160e registers the tag code "<1>" and the current position "1" in the stack S, giving S = {(<1>, 1)}. Because the tag code "<1>" in the stack S does not match the item part "/1/2" of the post-conversion extraction condition, the item flag f remains 0.

The cutout processing unit 160e scans "<2>" in the encoded data 150b. "<2>" is a start tag. The cutout processing unit 160e registers the tag code "<2>" and the current position "4" in the stack S, giving S = {(<1>, 1), (<2>, 4)}. Because the tag codes "<1>" and "<2>" in the stack S match the item part "/1/2" of the post-conversion extraction condition, the item flag f is set to 1.

The cutout processing unit 160e scans the character string "AXYYD" in the encoded data 150b. The cutout processing unit 160e outputs the character string "AXYYD" to the matching processing unit 160d and obtains the matching result. If the matching result indicates that a transition to a matching state occurred, the cutout processing unit 160e sets the keyword flag k to 1.

The cutout processing unit 160e scans "</2>" in the encoded data 150b. "</2>" is an end tag, and the item flag f and the keyword flag k are both 1. In this case, the cutout processing unit 160e registers in the cutout position list R the pair consisting of the start value (= 4) of the last element of the stack S and end = 15, giving R = {(4, 15)}. The cutout processing unit 160e pops (<2>, 4) from the stack S, giving S = {(<1>, 1)}, and sets the keyword flag k to 0.

The cutout processing unit 160e scans "<3>" in the encoded data 150b. "<3>" is a start tag. The cutout processing unit 160e registers the tag code "<3>" and the current position "16" in the stack S, giving S = {(<1>, 1), (<3>, 16)}. Because the tag codes "<1>" and "<3>" in the stack S do not match the item part "/1/2" of the post-conversion extraction condition, the item flag f is set to 0.

The cutout processing unit 160e scans "</3>" in the encoded data 150b. "</3>" is an end tag. The keyword flag k is 1 but the item flag f is 0, so the cutout processing unit 160e registers nothing in the cutout position list R. The cutout processing unit 160e pops (<3>, 16) from the stack S, giving S = {(<1>, 1)}, and sets the keyword flag k to 0.

The cutout processing unit 160e scans "</1>" in the encoded data 150b. "</1>" is an end tag. Because the keyword flag k and the item flag f are both 0, the cutout processing unit 160e registers nothing in the cutout position list R. The cutout processing unit 160e pops (<1>, 1) from the stack S, leaving S empty, and sets the keyword flag k to 0.

Having scanned the encoded data 150b to the end, the cutout processing unit 160e cuts out information from the encoded data 150b based on the cutout position list R = {(4, 15)}. In this case, the cutout processing unit 160e cuts out "<2>AXYYD</2>", which begins at the 4th character from the start of the encoded data 150b and ends just before the 16th character.
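The stack-and-flag procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: the sample string is hypothetical (it is not the data of FIG. 4, only laid out so that the positions 1, 4, 15, and 16 agree with the walkthrough), whole text segments are passed to a stand-in predicate is_hit instead of feeding characters to the AC machine, and the name cut_positions is hypothetical.

```python
import re

def cut_positions(encoded, path, is_hit):
    """Return (start, end) pairs, 1-based and inclusive, for matching items."""
    k = f = 0                      # keyword flag and item flag
    stack, cuts = [], []
    for m in re.finditer(r"<(/?)(\w+)>|([^<]+)", encoded):
        closing, tag, text = m.groups()
        if text is not None:       # character string part
            if is_hit(text):
                k = 1
        elif not closing:          # start tag: push (tag code, position of "<")
            stack.append((tag, m.start() + 1))
            f = 1 if [t for t, _ in stack] == path else 0
        else:                      # end tag: m.end() is the position of ">"
            if k and f:
                cuts.append((stack[-1][1], m.end()))
            stack.pop()
            k = 0
    return cuts

# Hypothetical encoded data and a stand-in for the AC machine's hit report.
sample = "<1><2>AXYYD</2><3>AC</3></1>"
cuts = cut_positions(sample, ["1", "2"], lambda text: text == "AXYYD")
pieces = [sample[start - 1:end] for start, end in cuts]
```

For this sample the cutout position list is [(4, 15)] and the extracted piece is "<2>AXYYD</2>". As in the text, the item flag f is updated only when a start tag is pushed, and the keyword flag k is cleared at every end tag.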

Next, the processing procedure of the search device 100 according to the second embodiment will be described. First, the compression process executed by the data compression processing unit 160b is described. FIG. 33 is a flowchart showing the procedure of the compression process.

As shown in FIG. 33, the data compression processing unit 160b executes the character string counting process (step S101) and then the dictionary construction and character string replacement process (step S102). The data compression processing unit 160b then executes the tag name replacement process (step S103).

The procedure of the character string counting process shown in step S101 of FIG. 33 is described here. FIG. 34 is a flowchart showing the procedure of the character string counting process. As shown in FIG. 34, the data compression processing unit 160b sets the previous character c0 to ε and assigns the first character of the XML data D (XML data 150a) to the current character c1 (step S111).

The data compression processing unit 160b initializes the count table T and sets the in-tag flag f to 0 (step S112). The data compression processing unit 160b determines whether the XML data D has a next character (step S113). If the XML data D has no next character (step S114, No), the data compression processing unit 160b outputs the count table T (step S115).

If the XML data D has a next character (step S114, Yes), the data compression processing unit 160b sets c0 = c1 and assigns the next character to c1 (step S116). The data compression processing unit 160b determines whether c1 is a tag start symbol (<) or a tag end symbol (>) (step S117).

If c1 is neither a tag start symbol nor a tag end symbol (step S118, No), the data compression processing unit 160b proceeds to step S121. If c1 is a tag start symbol or a tag end symbol (step S118, Yes), the data compression processing unit 160b sets f = 1 when c1 is a tag start symbol and f = 0 when c1 is a tag end symbol (step S119).

If f = 1 (step S120, No), the process returns to step S113. If f = 0 (step S120, Yes), the data compression processing unit 160b registers in the count table T the string of length 2 formed by concatenating c0 and c1. If that string is already registered, the data compression processing unit 160b increments its occurrence count by 1 (step S121), and the process returns to step S113.
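The purpose of the counting process, namely tallying adjacent character pairs that lie outside tags, can be sketched as below. This is a simplified reading and not a step-by-step transcription of FIG. 34: the sketch never forms a pair across a tag boundary, which is an assumption about the intent, and the function name count_pairs is hypothetical.

```python
from collections import Counter

def count_pairs(xml):
    """Count adjacent character pairs in the text parts of an XML string."""
    table = Counter()      # count table T: pair -> number of occurrences
    in_tag = False         # in-tag flag f
    prev = ""              # previous character c0 (empty string stands for epsilon)
    for ch in xml:
        if ch == "<":      # tag start symbol: stop pairing until the tag ends
            in_tag, prev = True, ""
        elif ch == ">":    # tag end symbol: text resumes after this character
            in_tag, prev = False, ""
        elif not in_tag:
            if prev:
                table[prev + ch] += 1
            prev = ch
    return table

table = count_pairs("<a>BABA</a><b>BAC</b>")
```

For the sample input, the text parts are "BABA" and "BAC", so the table holds BA three times and AB and AC once each. Tag names contribute nothing because they are handled separately by the tag name replacement process.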

Next, the procedure of the dictionary construction and character string replacement process shown in step S102 of FIG. 33 is described. FIG. 35 is a flowchart showing the procedure of this process. As shown in FIG. 35, the data compression processing unit 160b determines whether any character remains in the unused character set U (step S131).

If no character remains in the unused character set U (step S132, No), the data compression processing unit 160b outputs the XML data D as the string-compressed data C and outputs the compression dictionary data Dic (compression dictionary data 150e) (step S133).

If a character remains in the unused character set U (step S132, Yes), the data compression processing unit 160b searches the count table T for the most frequent character string s (the one with the largest occurrence count) among the strings not yet registered in the compression dictionary data 150e (step S134).

The data compression processing unit 160b determines whether the occurrence count of the character string s is at least the count threshold α (step S135). If the occurrence count of the character string s is less than the count threshold α (step S136, No), the data compression processing unit 160b proceeds to step S133.

If the occurrence count of the character string s is at least the count threshold α (step S136, Yes), the data compression processing unit 160b takes a character a from the unused character set U, adds the pair (s, a) to the compression dictionary data Dic, and deletes the character a from U (step S137).

The data compression processing unit 160b replaces every occurrence of the character string s in the XML data D with the character a (step S138), executes the character string counting process (step S139), and returns to step S131. The character string counting process in step S139 of FIG. 35 is the same as the character string counting process shown in FIG. 34.
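The loop of FIG. 35 can be sketched as follows. This is an illustrative sketch under stated assumptions: it operates on a single tag-free string rather than on XML data, it takes unused characters in the order given, and the name build_dictionary is hypothetical. The input "ABABACBACD" is an assumption obtained by expanding "AXYYD" with the dictionary rows (B, A) → X and (X, C) → Y.

```python
from collections import Counter

def build_dictionary(text, unused, alpha):
    """Repeatedly replace the most frequent pair with an unused character."""
    dic = []                                   # compression dictionary Dic
    unused = list(unused)                      # unused character set U
    while unused:                              # S131/S132
        counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
        if not counts:
            break
        pair, n = counts.most_common(1)[0]     # S134: most frequent pair
        if n < alpha:                          # S135/S136: below threshold
            break
        code = unused.pop(0)                   # S137: take a character from U
        dic.append((pair, code))
        text = text.replace(pair, code)        # S138: replace every occurrence
    return text, dic

compressed, dic = build_dictionary("ABABACBACD", "XY", 2)
```

With the threshold α = 2, the first round replaces "BA" with X and the second round replaces "XC" with Y, so the sample string becomes "AXYYD" and the dictionary holds the two rows used in the construction example. Recounting after each replacement matters because a new pair such as "XC" only exists once X has been introduced.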

Next, the procedure of the tag name replacement process shown in step S103 of FIG. 33 is described. FIG. 36 is a flowchart showing the procedure of the tag name replacement process. As shown in FIG. 36, the data compression processing unit 160b scans the XML data D and registers every tag name appearing in D in the correspondence table T (correspondence table 150c) (step S141).

The data compression processing unit 160b assigns a code to every tag name registered in the correspondence table T (step S142). The data compression processing unit 160b converts every tag name in the XML data D into the code assigned in the correspondence table T (step S143) and outputs the encoded data B (encoded data 150b) (step S144).
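A minimal sketch of the tag name replacement follows. It assumes tags without attributes and assigns sequential numeric codes in order of first appearance; both choices, the sample tag names, and the name replace_tag_names are assumptions made for illustration.

```python
import re

def replace_tag_names(xml):
    """Replace each tag name with a code and return the correspondence table."""
    table = {}                                  # correspondence table: name -> code

    def encode(m):
        slash, name = m.group(1), m.group(2)
        if name not in table:                   # S141/S142: register and assign a code
            table[name] = str(len(table) + 1)
        return f"<{slash}{table[name]}>"        # S143: rewrite the tag

    encoded = re.sub(r"<(/?)([^<>/]+)>", encode, xml)
    return encoded, table

encoded, table = replace_tag_names("<doc><title>AXYYD</title></doc>")
```

For the sample input, the result is "<1><2>AXYYD</2></1>" with the table mapping doc to 1 and title to 2. The text between tags is left untouched, which keeps the character string codes and the tag codes separate.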

Next, the construction process executed by the AC machine construction unit 160c, the matching processing unit 160d, and the cutout processing unit 160e is described. FIG. 37 is a flowchart showing the procedure of the construction process. As shown in FIG. 37, the AC machine construction unit 160c executes the extraction condition reception process (step S201).

そして、ＡＣマシン構築部１６０ｃは、ＡＣマシン構築処理を実行し（ステップＳ２０２）、照合処理部１６０ｄ、切出処理部１６０ｅは、切出し位置計算処理を実行する（ステップＳ２０３）。 Then, the AC machine construction unit 160c executes an AC machine construction process (Step S202), and the collation processing unit 160d and the cut-out processing unit 160e execute a cut-out position calculation process (Step S203).

ここで、図３７のステップＳ２０１に示した抽出条件受け付け処理の処理手順について説明する。図３８は、抽出条件受け付け処理の処理手順を示すフローチャートである。図３８に示すように、ＡＣマシン構築部１６０ｃは、抽出条件Ｑ（抽出条件１５０ｇ）を受け付ける（ステップＳ２１１）。 Here, the processing procedure of the extraction condition reception processing shown in step S201 of FIG. 37 will be described. FIG. 38 is a flowchart illustrating the processing procedure of the extraction condition reception processing. As shown in FIG. 38, the AC machine construction unit 160c receives the extraction condition Q (extraction condition 150g) (step S211).

ＡＣマシン構築部１６０ｃは、対応表１５０ｃを用いて抽出条件Ｑにおける項目名を書き換え（ステップＳ２１２）、書き換えた抽出条件Ｑを抽出条件Ｑ’とする（ステップＳ２１３）。 The AC machine construction unit 160c rewrites the item name in the extraction condition Q using the correspondence table 150c (step S212), and sets the rewritten extraction condition Q as the extraction condition Q '(step S213).
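As a small illustration of steps S212 and S213, the sketch below rewrites the item names of an extraction condition with the codes from a correspondence table. The representation of the condition as a pair of an item path and a keyword, the table contents, and the function name `rewrite_condition` are all assumptions made for this example and are not taken from the description.

```python
def rewrite_condition(condition, table):
    """Replace each item name in the condition with its code from the table."""
    item_path, keyword = condition
    # S212: look up every item name in the correspondence table.
    coded_path = [table[name] for name in item_path]
    # S213: the rewritten condition is treated as Q'.
    return coded_path, keyword

table = {"book": "A", "title": "B"}   # hypothetical correspondence table
q = (["book", "title"], "Aho")        # hypothetical extraction condition Q
q_prime = rewrite_condition(q, table)
print(q_prime)  # (['A', 'B'], 'Aho')
```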

Next, the procedure of the AC machine construction process shown in step S202 of FIG. 37 is described. FIG. 39 is a flowchart of the AC machine construction process. As shown in FIG. 39, the AC machine construction unit 160c constructs the trie T of Π (step S221) and adds failure transitions to the trie (Π) (step S222).

The AC machine construction unit 160c adds skip transitions to the AC machine AC(Π) (step S223) and outputs the AC machine for compressed data AC_b(Π, dic) (step S224).

Here, the process of constructing the trie T of Π shown in step S221 of FIG. 39 is described. FIG. 40 is a flowchart of this process. As shown in FIG. 40, the AC machine construction unit 160c creates the initial state (id = 0) and sets the trie (Π) to a trie consisting only of the initial state (step S231).

The AC machine construction unit 160c sets every normal transition destination of the initial state to the initial state (step S232) and determines whether a pattern exists in Π (step S233). When no pattern exists in Π (step S234, No), the AC machine construction unit 160c outputs the trie T(Π) of Π (step S235).

When a pattern exists in Π (step S234, Yes), the AC machine construction unit 160c takes one pattern out of Π and sets it as p (step S236). The AC machine construction unit 160c then executes the pattern registration process (step S237) and returns to step S233.

Next, the procedure of the pattern registration process shown in step S237 of FIG. 40 is described. FIG. 41 is a flowchart of the pattern registration process. As shown in FIG. 41, the AC machine construction unit 160c sets the current state s to the initial state of the trie T and sets the state r preceding s to the empty state (step S241).

The AC machine construction unit 160c determines whether the pattern p has a next character (step S242). When the pattern p has no next character (step S243, No), the AC machine construction unit 160c takes the last character a of the pattern p and the state r, assigns p to the pattern list Plist[code(a)] of r (step S244), and outputs the trie T (step S245).

When the pattern p has a next character (step S243, Yes), the AC machine construction unit 160c sets a to the next character and sets code(a) to the ASCII code of a (step S246). The AC machine construction unit 160c then determines whether the normal transition g[code(a)] of s is Null (step S247).

When the normal transition g[code(a)] is Null (step S248, Yes), the AC machine construction unit 160c creates a new state n and sets g[code(a)] = n (step S249). When g[code(a)] is not Null (step S248, No), it proceeds directly to step S250. The AC machine construction unit 160c assigns the state s to the state r, assigns g[code(a)] to the state s (step S250), and returns to step S242.
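The trie construction and pattern registration of FIGS. 40 and 41 can be sketched as below. This is a minimal illustration, not the implementation of the embodiment. It assumes that a new state is created only when the transition for the current character is still undefined (an absent dictionary key stands for Null), and it leaves the self-loops of the initial state from step S232 implicit. The class name `State` and the function name `build_trie` are hypothetical.

```python
class State:
    def __init__(self, ident):
        self.id = ident
        self.g = {}      # normal transitions: character -> State
        self.plist = {}  # pattern lists, keyed by the character of the last transition

def build_trie(patterns):
    root = State(0)                    # S231: initial state with id = 0
    count = 1
    for p in patterns:                 # S233 to S237: register each pattern in turn
        s, r, a = root, None, None     # S241: current state s, previous state r
        for a in p:                    # S242, S246: take the next character a
            if a not in s.g:           # S247: g[code(a)] is Null
                s.g[a] = State(count)  # S249: create a new state
                count += 1
            r, s = s, s.g[a]           # S250: r <- s, s <- g[code(a)]
        if r is not None:              # S244: record p on the last transition of r
            r.plist.setdefault(a, []).append(p)
    return root                        # S235: output the trie

root = build_trie(["he", "she", "his"])
print(sorted(root.g))     # ['h', 's']
print(root.g["h"].plist)  # {'e': ['he']}
```

Following the wording of step S244, the pattern is stored on the state r that precedes the final state, indexed by the last character, rather than on the final state itself.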

Next, the process of adding failure transitions to the trie (Π) shown in step S222 of FIG. 39 is described. FIGS. 42 and 43 are flowcharts of the process of adding failure transitions to the trie (Π).

As shown in FIG. 42, for every state s that can be reached from the initial state by a normal transition, the AC machine construction unit 160c sets the failure transition destination of s to the initial state and registers s in queue (step S251).

The AC machine construction unit 160c determines whether a state exists in the state list queue (step S252). When no state exists in the state list queue (step S253, No), the AC machine construction unit 160c outputs the current trie T as the AC machine α (step S254).

When a state exists in the state list queue (step S253, Yes), the AC machine construction unit 160c takes the state at the head of the state list queue as s and removes s from the state list queue (step S255).

The AC machine construction unit 160c determines whether all normal transitions of the state s are Null (step S256). When all normal transitions of the state s are Null (step S257, Yes), the AC machine construction unit 160c returns to step S252.

When not all normal transitions of the state s are Null (step S257, No), the AC machine construction unit 160c sets A to the set of all characters a for which g[code(a)] ≠ Null in the state s (step S258).

The AC machine construction unit 160c determines whether a character exists in the set A (step S259). When no character exists in the set A (step S260, No), the AC machine construction unit 160c returns to step S252.

When a character exists in the set A (step S260, Yes), the AC machine construction unit 160c moves to FIG. 43, takes one character out of the set A, and sets it as a (step S261).

The AC machine construction unit 160c appends the normal transition destination next = g[code(a)] of the state s to the tail of the state list queue (step S262). It then follows failure transitions repeatedly, starting from the state s, and sets f to the first state whose normal transition destination for the character a is not Null (step S263).

The AC machine construction unit 160c obtains the pointer fnext = g[code(a)] to the normal transition destination of the state f for the character a (step S264) and sets the failure transition destination of the state next to fnext (step S265). The AC machine construction unit 160c appends the pattern list of the transition leading to the state fnext to the pattern list of the transition leading to the state next (step S266) and returns to step S259 of FIG. 42.
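The breadth-first addition of failure transitions in FIGS. 42 and 43 corresponds to the usual Aho-Corasick construction, which can be sketched as follows. The sketch makes several simplifying assumptions that are not in the description: states are integers, transitions are dictionaries, and pattern lists are attached to the destination state rather than to the transition. The function names `build_ac` and `search` are hypothetical.

```python
from collections import deque

def build_ac(patterns):
    goto, fail, out = [{}], [0], [[]]   # state 0 is the initial state
    for p in patterns:                  # build the trie first
        s = 0
        for a in p:
            if a not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][a] = len(goto) - 1
            s = goto[s][a]
        out[s].append(p)
    queue = deque(goto[0].values())     # S251: depth-1 states fail to state 0
    while queue:                        # S252, S253
        s = queue.popleft()             # S255
        for a, nxt in goto[s].items():  # S258 to S261: each defined character a
            queue.append(nxt)           # S262
            f = fail[s]                 # S263: follow failure transitions
            while f != 0 and a not in goto[f]:
                f = fail[f]
            fnext = goto[f].get(a, 0)   # S264: the initial state loops to itself
            fail[nxt] = fnext           # S265
            out[nxt] = out[nxt] + out[fnext]  # S266: merge pattern lists
    return goto, fail, out

def search(text, machine):
    """Return (start index, pattern) for every occurrence found in text."""
    goto, fail, out = machine
    s, hits = 0, []
    for i, a in enumerate(text):
        while s != 0 and a not in goto[s]:
            s = fail[s]
        s = goto[s].get(a, 0)
        hits.extend((i - len(p) + 1, p) for p in out[s])
    return hits

print(search("ushers", build_ac(["he", "she", "hers"])))
# [(1, 'she'), (2, 'he'), (2, 'hers')]
```

Because states are processed in breadth-first order, the failure destination of a state is always finalized before the states below it are handled, which is why the merged pattern lists are complete when they are copied.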

Next, the procedure for adding skip transitions to the AC machine AC(Π), shown in step S223 of FIG. 39, is described. FIG. 44 is a flowchart of the process of adding skip transitions to the AC machine AC(Π). As shown in FIG. 44, the AC machine construction unit 160c sets n = 1 (step S271) and determines whether an n-th row exists in the compression dictionary data dic (step S272).

When no n-th row exists in the compression dictionary data dic (step S273, No), the AC machine construction unit 160c outputs the current AC machine AC(Π) as the AC machine for compressed data AC_b(Π, dic) (step S274).

When an n-th row exists in the compression dictionary data dic (step S273, Yes), the AC machine construction unit 160c takes the pair character string in the n-th row of the compression dictionary data dic as (a, b) and lets X be the character string with which the pair is replaced (step S275).

The AC machine construction unit 160c executes the first skip transition creation process (step S276) and then the second skip transition creation process (step S277). The AC machine construction unit 160c sets n = n + 1 (step S278) and returns to step S272.

Next, the procedure of the first skip transition creation process shown in step S276 of FIG. 44 is described. FIG. 45 is a flowchart of the first skip transition creation process. As shown in FIG. 45, the AC machine construction unit 160c lets P be the set of all paths of the AC machine AC(Π) (step S281) and determines whether a path exists in P (step S282).

When no path exists in P (step S283, No), the AC machine construction unit 160c outputs the current AC machine AC(Π) (step S284). When a path exists in P (step S283, Yes), the AC machine construction unit 160c takes an arbitrary path out of P (step S285).

The AC machine construction unit 160c scans the path from the initial state toward the leaf. When s.g[code(a)] = t and t.g[code(b)] = u hold for states s and t on the path, it creates the skip transition s.g[code(X)] = u (step S286). The AC machine construction unit 160c deletes the path from P (step S287) and returns to step S282.
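The idea behind the first skip transition creation process can be sketched as follows, assuming that the two consecutive transitions are taken on a and then b, the two characters of the pair that the compression dictionary replaces with X. The list-of-dictionaries representation and the function name `add_skip_transitions` are hypothetical, and the sketch visits states directly instead of enumerating paths, which covers the same transitions because every state of the trie lies on some path from the initial state.

```python
def add_skip_transitions(goto, pair, symbol):
    """For every s -a-> t -b-> u, add the skip transition s -symbol-> u."""
    a, b = pair
    for s in range(len(goto)):
        t = goto[s].get(a)
        if t is not None and b in goto[t]:
            goto[s][symbol] = goto[t][b]   # S286: s.g[code(X)] = u
    return goto

# Trie for the single pattern "abc": 0 -a-> 1 -b-> 2 -c-> 3
goto = [{"a": 1}, {"b": 2}, {"c": 3}, {}]
add_skip_transitions(goto, ("a", "b"), "X")
print(goto[0])  # {'a': 1, 'X': 2}

# The compressed text "Xc" now reaches the same state as the original "abc".
state = 0
for ch in "Xc":
    state = goto[state][ch]
print(state)  # 3
```

With the skip transition in place, the machine consumes one compressed symbol where the original machine would consume two characters, which is what allows matching on the compressed data without decompressing it.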

Next, the procedure of the second skip transition creation process shown in step S277 of FIG. 44 is described. FIG. 46 is a flowchart of the second skip transition creation process.

The AC machine construction unit 160c searches all states of the AC machine AC(Π) (step S291) and registers in the set F every state t for which s.g[code(a)] = t and t.g[code(b)] = NULL hold for some state s (step S292).

The AC machine construction unit 160c determines whether a state t exists in the set F (step S293). When no state t exists in the set F (step S294, No), the AC machine construction unit 160c ends the second skip transition creation process.

When a state t exists in the set F (step S294, Yes), the AC machine construction unit 160c selects an arbitrary state t from the set F (step S295). Letting u be the state given by t.fail.g[code(X)] and s be the state for which s.g[code(a)] = t, the AC machine construction unit 160c creates the skip transition s.g[code(X)] = u (step S296). The AC machine construction unit 160c deletes the selected state t from the set F (step S297) and returns to step S293.

Next, the procedure of the cut-out position calculation process shown in step S203 of FIG. 37 is described. By executing this process, the cutout processing unit 160e calculates the positions in the encoded data 150b that correspond to the extraction condition 150g and extracts the information at the calculated positions while it remains compressed. FIG. 47 is a flowchart of the cut-out position calculation process. As shown in FIG. 47, the cutout processing unit 160e scans the first character of the encoded data B (step S301) and determines whether the scanned character belongs to a character string part (step S302).

When the scanned character belongs to a character string part (step S303, Yes), the cutout processing unit 160e advances the AC machine for compressed data 150f by one character according to that character, sets k = 1 if a match is detected (step S304), and proceeds to step S311.

When the scanned character does not belong to a character string part (step S303, No), the cutout processing unit 160e determines whether the scanned character is a start tag or an end tag (step S305).

When the scanned character is an end tag (step S306, No) and f = 1 and k = 1 hold, the cutout processing unit 160e registers the current position as end, obtains start from the last element of the stack S, and registers (start, end) in R (step S307). Here, the current position registered as end is the count, from the first character of the encoded data 150b, up to the ">" of the end tag code. The cutout processing unit 160e pops the tag code from the stack S, sets k = 0 (step S308), and proceeds to step S310.

When the scanned character is a start tag (step S306, Yes), the cutout processing unit 160e pushes the tag code and the current position start onto the stack S (step S309). Here, the current position start is the count, from the first character of the encoded data 150b, up to the "<" of the start tag code.

The cutout processing unit 160e sets f = 1 when the contents of the stack S match the item part of the converted extraction condition Q' and f = 0 otherwise (step S310), and determines whether all characters have been scanned (step S311).

When not all characters have been scanned (step S312, No), the cutout processing unit 160e scans the next character (step S313) and returns to step S302. When all characters have been scanned (step S312, Yes), the cutout processing unit 160e outputs R (step S314).
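The stack-based position calculation of FIG. 47 can be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not in the description: the encoded data is tokenized with a regular expression instead of being scanned character by character, a plain substring test stands in for the AC machine for compressed data, and the converted extraction condition Q' is modeled as a list of tag codes plus a keyword. The function name `cut_positions` is hypothetical.

```python
import re

# Hypothetical tokenizer: a start or end tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([^<>]+)>|([^<>]+)")

def cut_positions(encoded, item_path, keyword):
    stack, results = [], []   # S holds (tag code, start); R holds (start, end)
    f = k = 0
    for m in TOKEN.finditer(encoded):
        closing, code, text = m.groups()
        if text is not None:              # character string part
            if keyword in text:           # stand-in for the AC machine match
                k = 1                     # S304
            continue
        if closing:                       # end tag
            if f == 1 and k == 1:         # S307: end is the index of ">"
                results.append((stack[-1][1], m.end() - 1))
            stack.pop()                   # S308
            k = 0
        else:                             # start tag
            stack.append((code, m.start()))  # S309: start is the index of "<"
        # S310: compare the tag codes on the stack with the item part of Q'
        f = 1 if [c for c, _ in stack] == item_path else 0
    return results                        # S314

encoded = "<A><B>Aho</B><B>Knuth</B></A>"
positions = cut_positions(encoded, ["A", "B"], "Aho")
print(positions)                          # [(3, 12)]
start, end = positions[0]
print(encoded[start:end + 1])             # <B>Aho</B>
```

The returned pairs are offsets into the encoded data, so the matching element can be sliced out directly without decoding anything.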

As described above, in the search device 100 according to the first embodiment, when the data compression processing unit 160b compresses the XML data 150a, it generates the encoded data 150b by encoding the character strings included in the XML data 150a separately from the delimiters included in the XML data 150a. Because the encoded data 150b encodes character strings and delimiters separately, an accurate search result is obtained even when the cutout processing unit 160e cuts the compressed data out of the encoded data 150b as it is, based on the extraction condition.

Of the processes described in this embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the description and drawings may be changed arbitrarily unless otherwise specified.

The components of each illustrated device are functional concepts and need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed in each device may be realized by a CPU and a program analyzed and executed by that CPU, or may be realized as hardware using wired logic.

FIG. 48 is a diagram illustrating the hardware configuration of a computer constituting the search device 100 according to the embodiment. As shown in FIG. 48, the computer (search device) 200 is configured by connecting, via a bus 209, an input device 201, a monitor 202, a RAM (Random Access Memory) 203, a ROM (Read Only Memory) 204, a communication device 205 that exchanges data with other devices (for example, a terminal device), a medium reading device 206 that reads data from a storage medium, a CPU (Central Processing Unit) 207, and an HDD (Hard Disk Drive) 208.

The HDD 208 stores a search program 208b that provides the same functions as the search device 100 described above. When the CPU 207 reads and executes the search program 208b, a search process 207a is started. Here, the search process 207a corresponds to the data management unit 160a, the data compression processing unit 160b, the AC machine construction unit 160c, the collation processing unit 160d, and the cutout processing unit 160e shown in FIG. 2.

The HDD 208 also stores various data 208a corresponding to the data stored in the storage unit 150. The CPU 207 reads the various data 208a stored in the HDD 208, stores it in the RAM 203, and uses the various data 203a stored in the RAM 203 to compress the XML data and to extract the data corresponding to the extraction condition while it remains compressed.

The search program 208b shown in FIG. 48 does not necessarily have to be stored in the HDD 208 from the beginning. For example, the search program 208b may be stored on a "portable physical medium" inserted into the computer, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card; on a "fixed physical medium" such as a hard disk drive (HDD) provided inside or outside the computer; or on "another computer (or server)" connected to the computer via a public line, the Internet, a LAN, a WAN, or the like, and the computer may read the search program 208b from any of these and execute it.

The following supplementary notes are further disclosed with respect to the embodiments, including the examples described above.

(Supplementary Note 1) A search system that executes search processing on a set of compressed data, the search system comprising:
a compression processing unit that acquires a data set in which each piece of data is delimited by delimiter information indicating an attribute of the data, performs replacement with encoded data based on the data included in the data set, and further performs replacement with encoded data based only on the delimiter information indicating the attribute of the data, thereby compressing the data set.

(Supplementary Note 2) The search system according to Supplementary Note 1, further comprising:
a replacement processing unit that, when a search condition having an attribute and data to be searched for is acquired for the data set compressed by the compression processing unit, replaces the attribute of the search condition with encoded data based on the relationship between the delimiter information of the compressed data set and the delimiter information before replacement; and
a search processing unit that, based on the search condition whose attribute has been replaced with encoded data and on the compressed data set, identifies the region of the compressed data set designated by the search condition and outputs the data included in the identified region.

(Supplementary Note 3) The search system according to Supplementary Note 2, wherein
the search processing unit reads the compressed data set from the beginning, extracts the encoded data corresponding to data enclosed by delimiter information at a position corresponding to the attribute of the search condition, determines whether the data corresponding to the extracted encoded data contains the data of the search condition, and identifies, based on the determination result, the region of the compressed data set designated by the search condition.

(Supplementary Note 4) A storage medium storing a program that causes a computer to realize:
a compression processing function that acquires a data set having a hierarchical structure in which each piece of data is delimited by delimiter information indicating an attribute of the data, performs replacement with encoded data based on the data included in the data set, and further performs replacement with encoded data based only on the delimiter information indicating the attribute of the data, thereby compressing the data set.

(Supplementary Note 5) The storage medium according to Supplementary Note 4, storing a program that further causes the computer to realize:
a replacement processing function that, when a search condition having an attribute and data to be searched for is acquired for the data set compressed by the compression processing function, replaces the attribute of the search condition with encoded data based on the relationship between the delimiter information of the compressed data set and the delimiter information before replacement; and
a search processing function that, based on the search condition whose attribute has been replaced with encoded data and on the compressed data set, identifies the region of the compressed data set designated by the search condition and outputs the data included in the identified region.

(Supplementary Note 6) The storage medium according to Supplementary Note 5, wherein
the search processing function reads the compressed data set from the beginning, extracts the encoded data corresponding to data enclosed by delimiter information at a position corresponding to the attribute of the search condition, determines whether the data corresponding to the extracted encoded data contains the data of the search condition, and identifies, based on the determination result, the region of the compressed data set designated by the search condition.

(Supplementary Note 7) A compression method in which a compression device executes:
a step of acquiring a data set having a hierarchical structure in which each piece of data is delimited by delimiter information indicating an attribute of the data; and
a step of compressing the data set by performing replacement with encoded data based on the data included in the data set and further performing replacement with encoded data based only on the delimiter information indicating the attribute of the data.

(Supplementary Note 8) A search method in which a search device executes:
a step of, when a search condition having an attribute and data to be searched for is acquired for the data set compressed by the compression device according to Supplementary Note 7, replacing the attribute of the search condition with encoded data based on the relationship between the delimiter information of the compressed data set and the delimiter information before replacement; and
a step of identifying, based on the search condition whose attribute has been replaced with encoded data and on the compressed data set, the region of the compressed data set designated by the search condition, and outputting the data included in the identified region.

(Supplementary Note 9) The search method according to Supplementary Note 8, wherein, in the step of outputting the data, the compressed data set is read from the beginning, the encoded data corresponding to data enclosed by delimiter information at a position corresponding to the attribute of the search condition is extracted, whether the data corresponding to the extracted encoded data contains the data of the search condition is determined, and the region of the compressed data set designated by the search condition is identified based on the determination result.

110 Input unit
120 Output unit
130 Communication control IF unit
140 Input/output control IF unit
150 Storage unit
150a XML data
150b Encoded data
150c Correspondence table
150d AC machine
150e Compression dictionary data
150f AC machine for compressed data
150g Extraction condition
150h Converted extraction condition
160 Control unit
160a Data management unit
160b Data compression processing unit
160c AC machine construction unit
160d Collation processing unit
160e Cutout processing unit
200 Computer (search device)
201 Input device
202 Monitor
203 RAM
203a, 208a Various data
204 ROM
205 Communication control device
206 Medium reading device
207 CPU
207a Search process
208 HDD
208b Search program
209 Bus

Claims

A search device that searches for designated data in replaced data, the replaced data being obtained from data including first data and second data indicating an attribute of the first data by replacing at least the second data with encoded data smaller in size than the second data, the search device comprising:
a replacement processing unit that, upon receiving a search condition including data indicating an attribute of the designated data, refers to a storage unit storing the relationship before and after replacement of the second data, and replaces the data indicating the attribute of the designated data with the encoded data; and
a search processing unit that, using the search condition in which the data indicating the attribute of the designated data has been replaced with the encoded data, acquires from the replaced data the data included in the area on the replaced data designated by the search condition.

A search method for a search device that searches for designated data in replaced data, the replaced data being obtained from data including first data and second data indicating an attribute of the first data by replacing at least the second data with encoded data smaller in size than the second data, the search method comprising executing processing of:
upon receiving a search condition including data indicating an attribute of the designated data, referring to a storage unit storing the relationship before and after replacement of the second data, and replacing the data indicating the attribute of the designated data with the encoded data; and
using the search condition in which the data indicating the attribute of the designated data has been replaced with the encoded data, acquiring from the replaced data the data included in the area on the replaced data designated by the search condition.

A search program that causes a search device to execute processing, the search device searching for designated data in replaced data, the replaced data being obtained from data including first data and second data indicating an attribute of the first data by replacing at least the second data with encoded data smaller in size than the second data, the processing comprising:
upon receiving a search condition including data indicating an attribute of the designated data, referring to a storage unit storing the relationship before and after replacement of the second data, and replacing the data indicating the attribute of the designated data with the encoded data; and
using the search condition in which the data indicating the attribute of the designated data has been replaced with the encoded data, acquiring from the replaced data the data included in the area on the replaced data designated by the search condition.
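
As an illustration only, the flow recited in the claims above (replace the attribute data with shorter codes, store the before/after relationship, replace the attribute in the search condition, then search the replaced data directly) can be sketched as follows. The sketch assumes that only the tags (the "second data") are replaced, that each tag maps to a single control character, and that the element text never contains those characters. The names `compress_tags` and `search_replaced` are hypothetical and do not appear in this publication.

```python
from typing import Dict, List, Tuple


def compress_tags(xml: str, tags: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each <tag> and </tag> with a one-character code and return
    the replaced data together with the before/after correspondence table."""
    table: Dict[str, str] = {}
    replaced = xml
    next_code = 1
    for tag in tags:
        for marker in (f"<{tag}>", f"</{tag}>"):
            code = chr(next_code)  # assumed unused in the element text
            next_code += 1
            table[marker] = code
            replaced = replaced.replace(marker, code)
    return replaced, table


def search_replaced(replaced: str, table: Dict[str, str], attribute: str) -> List[str]:
    """Replace the attribute in the search condition with its codes, then
    collect the data enclosed by those codes without restoring the tags."""
    open_code = table[f"<{attribute}>"]
    close_code = table[f"</{attribute}>"]
    results: List[str] = []
    pos = 0
    while True:
        start = replaced.find(open_code, pos)
        if start == -1:
            break
        end = replaced.find(close_code, start + 1)
        if end == -1:
            break
        results.append(replaced[start + 1 : end])
        pos = end + 1
    return results


xml = "<name>Alice</name><city>Tokyo</city><name>Bob</name><city>Osaka</city>"
replaced, table = compress_tags(xml, ["name", "city"])
cities = search_replaced(replaced, table, "city")
print(len(xml), len(replaced), cities)
```

In this sketch the dictionary `table` plays the role of the storage unit holding the relationship before and after replacement, and the search runs over the shorter replaced string rather than the original data.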