JP2011008811A

JP2011008811A - Program, and data extraction method

Info

Publication number: JP2011008811A
Application number: JP2010181831A
Authority: JP
Inventors: Masataku Matsuura; 正卓松浦; Hironari Hayashi; 宏也林; Masahiko Nagata; 真彦永田; Kiyohide Omiya; 清英大宮
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2010-08-16
Filing date: 2010-08-16
Publication date: 2011-01-13

Abstract

PROBLEM TO BE SOLVED: To provide a technique for extracting data that meets a designated extraction condition from available data, such that all required kinds of data can be obtained quickly even from a large quantity of data.
SOLUTION: A plurality of extraction conditions targeting different data can be input. In response to the input of one or more extraction conditions, data is extracted under each extraction condition, and the extracted data is output to the output destination corresponding to the extraction condition that the data meets. Accordingly, a user can obtain a plurality of extraction results at once by defining and inputting a plurality of extraction conditions. All required extraction results can thus be obtained quickly, and as a result, high work efficiency can be achieved easily.

Description

The present invention relates to a technique for extracting data that satisfies a specified extraction condition from obtainable data.

Data extraction apparatuses that can extract arbitrary data from obtainable data are now widely used for various purposes. In searching information published on the Internet, they are used as search engines. By using such a data extraction apparatus, a user can quickly obtain desired data from a large amount of data.

A data extraction apparatus extracts data in predetermined units, for example files or records. Documents and Web pages on the Internet correspond to files. Customer usage records such as POS (Point Of Sales) data and HHT (Hand Held Terminal) data are usually managed in units of records.

FIG. 1 is a diagram for explaining a conventional data extraction method. The method is described concretely below with reference to FIG. 1.
The conventional data extraction method shown in FIG. 1 is one performed, for example, at a credit card company. "JOURNAL" denotes a journal file in which fact data is stored in units of records. "MASTER" denotes a master file in which data on customers, that is, credit card holders, is stored in units of records. FIG. 1 thus shows an example in which SQL (Structured Query Language) is used to join (JOIN) the desired files from among multiple journal files and multiple master files, and the desired records are extracted from the join result.

The conditions for the journal file and the master file to be joined are described in the WHERE clause inside the FROM clause. Under those conditions, the current master file and the journal file for 2004 are selected. The FROM clause nested inside that FROM clause specifies that the correspondence between records in the two files is identified by credit card number. The data items stored in the records extracted from the join result are described in the SELECT clause: the customer's name (V.NAME), age (V.AGE), number of uses (V.SALES_NUM), and sales amount (V.SALES). The condition for the records to be extracted from the join result is described in the WHERE clause; here, the condition is that the card type is gold. As a result, the records of customers who used their card in 2004 and currently hold a gold card are extracted as the search result.

To extract different records from the join result, the extraction condition in the WHERE clause is changed. To extract the records of customers holding a silver card, for example, "GOLD" is changed to "SILVER" as shown in FIG. 2. The records of customers who used their card in 2004 and currently hold a silver card are then extracted as the search result.
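The one-query-per-condition pattern described above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not the SQL of FIG. 1: the table layout, the columns `CARD_NO`, `CARD_TYPE`, `YEAR`, and `AMOUNT`, and the sample rows are all assumptions; only `V.NAME`, `V.AGE`, `V.SALES_NUM`, and `V.SALES` come from the text.

```python
import sqlite3

# Hypothetical schema and rows; only NAME, AGE, SALES_NUM and SALES are named in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MASTER  (CARD_NO TEXT, NAME TEXT, AGE INTEGER, CARD_TYPE TEXT);
    CREATE TABLE JOURNAL (CARD_NO TEXT, YEAR INTEGER, AMOUNT INTEGER);
    INSERT INTO MASTER VALUES ('1001', 'Sato', 41, 'GOLD'),
                              ('1002', 'Ito', 35, 'SILVER');
    INSERT INTO JOURNAL VALUES ('1001', 2004, 300), ('1001', 2004, 200),
                               ('1002', 2004, 150), ('1002', 2003, 999);
""")

# The subquery in the FROM clause joins the two tables on the card number and
# keeps only 2004 journal rows; the outer WHERE clause selects the card type.
QUERY = """
    SELECT V.NAME, V.AGE, V.SALES_NUM, V.SALES
    FROM (SELECT M.NAME AS NAME, M.AGE AS AGE, M.CARD_TYPE AS CARD_TYPE,
                 COUNT(*) AS SALES_NUM, SUM(J.AMOUNT) AS SALES
          FROM MASTER M JOIN JOURNAL J ON M.CARD_NO = J.CARD_NO
          WHERE J.YEAR = 2004
          GROUP BY M.CARD_NO, M.NAME, M.AGE, M.CARD_TYPE) V
    WHERE V.CARD_TYPE = ?
"""

def extract(card_type):
    """Run one full search for one extraction condition."""
    return conn.execute(QUERY, (card_type,)).fetchall()

# Each additional extraction condition costs another complete query.
gold = extract("GOLD")
silver = extract("SILVER")
```

Here every change of condition repeats the join and the scan, which is the cost that grows with the number of extraction conditions.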

Thus, in the conventional data extraction method, an extraction condition for obtaining the desired data is determined, and a search is performed for each extraction condition. Consequently, the more purposes there are for extracting data, that is, the more extraction conditions are used for searching, the longer it takes to obtain all the extraction results, and the less efficiently the work can be done.

The kinds and amount of information handled as digital data are currently increasing greatly. It is therefore expected that conventional data extraction methods will have great difficulty coping in the future. For this reason as well, it is considered important to make it possible to obtain all the required kinds of data more quickly, even from a huge amount of data.

Patent Document 1: JP 2002-222194 A
Patent Document 2: JP 2005-70911 A
Patent Document 3: JP H06-319906 A

An object of the present invention is to provide a technique that makes it possible to obtain all the required kinds of data more quickly, even from a huge amount of data.
The programs according to the first and second aspects of the present invention are both premised on being executed by a computer in order to realize a data extraction apparatus that can extract data satisfying a specified extraction condition from obtainable data, and each causes the computer to realize the functions described below.

The program according to the first aspect realizes: a function of acquiring data; a function of inputting extraction conditions; a function of extracting data for each extraction condition, using the one or more extraction conditions input by the inputting function; and a function of outputting the data extracted for each extraction condition by the extracting function to a different output destination for each condition.

The program according to the second aspect realizes: a function of acquiring data; a function of inputting an extraction condition; and a function of extracting, from the data acquired by the acquiring function, the data that satisfies the extraction condition. The extracting function divides the conditional expression constituting the input extraction condition into a plurality of partial conditional expressions, converts the extraction condition into a form expressed as a combination of the partial conditional expressions obtained by the division, and checks, for each partial conditional expression, whether it is satisfied.

The data extraction method of the present invention is premised on being applied to extract data satisfying a specified extraction condition from obtainable data. The method allows a plurality of extraction conditions targeting different data to be input; when one or more extraction conditions are input, it extracts data for each extraction condition and outputs the data obtained by the extraction to the output destination corresponding to the extraction condition that the data satisfies.

In the present invention, a plurality of extraction conditions targeting different data can be input; when one or more extraction conditions are input, data is extracted for each extraction condition, and the data thus obtained is output to the output destination corresponding to the extraction condition that the data satisfies. A user can therefore obtain a plurality of extraction results at once by defining and inputting a plurality of extraction conditions. All the required extraction results can thus be obtained more quickly, and as a result, high work efficiency can be achieved easily.

In the present invention, the conditional expression constituting an input extraction condition is divided into a plurality of partial conditional expressions, the extraction condition is converted into a form expressed as a combination of the partial conditional expressions obtained by the division, and the data satisfying the extraction condition is extracted by checking, for each partial conditional expression, whether it is satisfied. Because the extraction condition is converted into a form expressed as a combination of partial conditional expressions, even if the same partial conditional expression appears in different conditional expressions, it is no longer necessary to check separately for each conditional expression whether the data satisfies that partial conditional expression. Data extraction can therefore be performed with a smaller load.

FIG. 1 is a diagram for explaining a conventional data extraction method.
FIG. 2 is a diagram for explaining the difference in extraction conditions used to extract different kinds of data with the conventional data extraction method.
FIG. 3 is a diagram for explaining the functional configuration of the data extraction apparatus according to the present embodiment.
FIG. 4 is a diagram for explaining the data extraction that the data extraction apparatus 100 according to the present embodiment can perform.
FIG. 5 is a diagram showing an example of the hardware configuration of a computer that can realize the data aggregation apparatus according to the present embodiment.
FIG. 6 is a diagram for explaining a configuration example of XML data.
FIG. 7 is a diagram for explaining a configuration example of CSV data.
FIG. 8 is a diagram for explaining an example of the contents of an extraction condition group.
FIG. 9 is a diagram for explaining an example of a tag DFA.
FIG. 10 is a diagram for explaining an example of a hierarchical matching NFA.
FIG. 11 is a diagram for explaining an example of a CSV analysis DFA.
FIG. 12 is a diagram for explaining an example of a keyword DFA.
FIG. 13 is a diagram for explaining an example of a logical table.
FIG. 14 is a diagram for explaining a method of managing output buffers.
FIG. 15 is a flowchart of the processing executed by the extraction condition input unit 110.
FIG. 16 is a flowchart of the processing executed by the data input structure search unit 120.
FIG. 17 is a flowchart of the processing executed by the extraction condition determination unit 130.
FIG. 18 is a flowchart of the processing executed by the data determination unit 140.
FIG. 19 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 1).
FIG. 20 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 2).
FIG. 21 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 3).
FIG. 22 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 4).
FIG. 23 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 5).
FIG. 24 is a diagram for explaining an application example of the data extraction apparatus according to the present embodiment (part 6).

Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 3 is a diagram for explaining the functional configuration of the data extraction apparatus according to the present embodiment.
The data extraction apparatus 100 is realized as an apparatus that receives text data as data 211 from an input device 210, sorts the data 211 according to a specified extraction condition group 220, and outputs it. To this end, it includes an extraction condition input unit 110, a data input structure search unit 120, an extraction condition determination unit 130, a data determination unit 140, an output buffer 150 for external output, and a data output unit 160. For convenience, only XML (eXtensible Markup Language) data as shown in FIG. 6 and CSV (Comma Separated Values) data as shown in FIG. 7 are assumed here as the data 211 input from the input device 210. Both are text data.

The extraction condition group 220 input by the extraction condition input unit 110 has, for example, the contents shown in FIG. 8. FIG. 8 shows an extraction condition and an output condition in each of (1) to (3). Each of these extraction conditions is used to extract the data 211 that the user wants. The output condition shown with each extraction condition specifies the output destination, and its file name, for the data 211 extracted under that extraction condition. The extraction condition group 220 thus specifies, for each kind of desired data 211, the extraction condition that the data 211 must satisfy and the name of the output destination file. The output destination of the data 211 can be specified freely so that the data 211 can be used more quickly in the desired form. Hereinafter, the extraction condition described in (1) is referred to as "extraction condition 1", and the others are referred to in the same way.

FIG. 4 is a diagram for explaining the data extraction that the data extraction apparatus 100 according to the present embodiment can perform. The data extraction is described concretely below with reference to FIG. 4.
The extraction condition group 220 shown in FIG. 8 assumes XML data as the data 211, whereas FIG. 4 shows an extraction condition group 220 that assumes CSV data. "Query" corresponds to an extraction condition, and "OutFile" corresponds to an output condition. In a Query (extraction condition), "$X" denotes the item name "X", and "$_" denotes an arbitrary item name. For example, "$X=='X1' OR $X=='Xa'" in Query 1 indicates that data 211 in which the value of item "X" is X1 or Xa is to be extracted. A Query written as "$_=='Xa'" indicates that data 211 in which Xa appears as the value of any item is to be extracted. Whether the data 211 is XML data or CSV data, it may be input collectively as a file or input sequentially one item at a time. When input one at a time, XML data takes the form shown in FIG. 6, and CSV data takes the form of the lines beginning with "000001" to "000007" in FIG. 7. For convenience, such a unit of data is called a record here, and a character string written between two single quotation marks (') is called a "keyword". In the extraction condition group 220 shown in FIG. 8, a keyword is a character string written between two double quotation marks.

In the present embodiment, a character string matching method is used to extract data 211 that satisfies any of the extraction conditions specified in the extraction condition group 220, and the data is output to the file whose name is specified by the output condition associated with the satisfied extraction condition. Data 211 satisfying Query 1 is thus output as file 231 with the file name "result1.csv", data 211 satisfying Query 2 as file 232 with the file name "result2.csv", and data 211 satisfying Query 3 as file 233 with the file name "result3.csv". The correspondence between the input data 211 and the data 211 output to files 231 to 233 is indicated by the labels (1) to (6) in the figure.
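The routing of records to per-condition output files can be sketched as below. This is only an illustrative sketch under stated assumptions: the sample CSV, the column names, and the exact forms of Query 2 and Query 3 are hypothetical (the text only states that Query 1 and Query 2 share `$X=='Xa'` and that Query 2 and Query 3 share `$X=='Xb'`), and conditions are evaluated as plain Python predicates rather than by the character string matching method of the embodiment. Output "files" are kept as in-memory lists.

```python
import csv
import io

# Hypothetical input; column names and values are made up for illustration.
CSV_TEXT = "ID,X,Y\n000001,X1,Y1\n000002,Xa,Y2\n000003,Xb,Y3\n000004,X9,Xa\n"

# Each entry pairs an extraction condition with its output file name.
QUERIES = [
    (lambda r: r["X"] in ("X1", "Xa"), "result1.csv"),  # $X=='X1' OR $X=='Xa'
    (lambda r: r["X"] in ("Xa", "Xb"), "result2.csv"),  # assumed form of Query 2
    (lambda r: r["X"] == "Xb", "result3.csv"),          # assumed form of Query 3
]

def route(csv_text, queries):
    """Read the input once and append each record to every destination it satisfies."""
    outputs = {name: [] for _, name in queries}
    for record in csv.DictReader(io.StringIO(csv_text)):  # single pass over the input
        for predicate, name in queries:
            if predicate(record):  # conditions need not be mutually exclusive
                outputs[name].append(record["ID"])
    return outputs

results = route(CSV_TEXT, QUERIES)
```

A record that satisfies two conditions lands in both destinations, and a record that satisfies none is dropped.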

Because each extraction condition is considered independently, every extraction condition can be defined freely. One or more extraction conditions can therefore be defined for each kind of data 211, such as XML data or CSV data, and one or more extraction conditions can also be defined for each data structure. Accordingly, however the schemas of the target data 211 differ, the effect of the difference can be reliably avoided.

For the reasons described above, extraction conditions need not be mutually exclusive. Query 1 and Query 2 therefore both extract data 211 satisfying the conditional expression (logical expression) "$X=='Xa'". Similarly, Query 2 and Query 3 both extract data satisfying the conditional expression "$X=='Xb'". As a result, the data 211 labeled (4) is output to both files 231 and 232, and the data 211 labeled (5) is output to both files 232 and 233.

Thus, when a plurality of extraction conditions are specified by the extraction condition group 220, the data 211 satisfying each extraction condition is sorted and output to the specified output destination. A user can therefore obtain a plurality of extraction results at once simply by defining a plurality of extraction conditions and output conditions as the extraction condition group 220. All the required extraction results can thus be obtained more quickly, and as a result, high work efficiency can be achieved easily.

As described above, the present embodiment employs a character string matching method. This method checks whether a character string specified in an extraction condition exists in the target data 211 by matching the string against the data 211 sequentially from the beginning toward the end. With this method, a single scan from the beginning to the end is enough to determine which of the extraction conditions defined in the extraction condition group 220 the data 211 satisfies. The data 211 to be extracted can therefore always be extracted quickly, regardless of the number of extraction conditions defined. Patent Documents 1 and 2, for example, can be cited as references.

The description now returns to FIG. 3.
The extraction condition input unit 110 receives the extraction condition group 220 described above, analyzes each extraction condition, and generates the corresponding automata. If an extraction condition is for XML data, a tag DFA (Deterministic Finite state Automaton) 170, a hierarchical matching NFA (Non-deterministic Finite state Automaton) 171, and a keyword DFA 180 are generated. If an extraction condition is for CSV data, a CSV analysis DFA 172 and a keyword DFA 180 are generated. Like the keyword DFA 180, the logical table 190 is generated regardless of the kind of data 211 that the extraction condition assumes.

The extraction condition group 220 is basically created through data entry by the user. When the extraction condition group 220 is created on a terminal device connected to the data extraction apparatus 100 according to the present embodiment, the user, for example, displays a screen for creating the extraction condition group 220 and enters the desired contents on that screen. When the user then instructs data extraction, the created extraction condition group 220 is sent to the data extraction apparatus 100.

When the extraction condition group 220 has the contents shown in FIG. 8, the extraction condition input unit 110 generates a logical table 190 such as that shown in FIG. 13. As shown in FIG. 13, the logical table 190 consists of an A logical table 190a and a Z logical table 190b.

The A logical table 190a is built by decomposing the conditional expressions (logical expressions) constituting the extraction conditions at their relational operators ("=" and "<" in FIG. 8), subdividing them according to the logic they express (in FIG. 8, the conditional expression "/root/Company/code<99" constituting extraction condition 2 is decomposed into "/root/Company/code" and "<99"), and assigning a unique logical number to each subdivided conditional expression (partial conditional expression). The Z logical table 190b expresses each conditional expression or extraction condition as a combination of the logical numbers assigned to partial conditional expressions or conditional expressions, and assigns a unique logical number to each such combination. The logical numbers combined may come from either the A logical table 190a or the Z logical table 190b. Expressing a conditional expression or extraction condition in terms of logical numbers makes it possible to identify the record (row) to be referenced in the A logical table 190a or the Z logical table 190b. Although not specifically illustrated, the Z logical table 190b can also store, for each combination of logical numbers, a code indicating whether the conditional expression or extraction condition expressed by that combination is satisfied. To distinguish the logical numbers assigned in the tables 190a and 190b, logical numbers of the A logical table 190a are hereinafter prefixed with "A" and those of the Z logical table 190b with "Z".

In the Z logical table 190b, the combination assigned the logical number Z1 is "A1×A2". This combination is a logical expression stating that the data 211 to be extracted is data for which the partial conditional expression of logical number A1 (/root/origin) holds and the partial conditional expression of logical number A2 ("atcg") also holds. The "×" in the combination (logical expression) "A1×A2" is thus a logical operator denoting the logical product (AND) of the partial conditional expressions A1 and A2. This logical expression represents extraction condition 1. Similarly, the logical expressions numbered Z4 and Z5 represent extraction conditions 3 and 2, respectively. Extraction condition 2 is Z5 = Z2×Z3. Within the table 190b, Z2 = A3×A4, where A3 = /root/Company/code and A4 = <99.

In addition, Z3 = A1×A5, where A1 = /root/origin and A5 = "gtac". Extraction condition 2 therefore corresponds, via the logical number Z5, to the logical numbers A3, A4, A1, and A5, and the logical product (AND) in extraction condition 2 shown in FIG. 8 is represented by the logical tables shown in FIG. 13 and the links between their elements. Likewise, extraction condition 3 in FIG. 8 is represented in FIG. 13 by the logical number Z4, the logical numbers A1 and A6, and the links between those elements; that is, extraction condition 3 corresponds to the A logical numbers as Z4 = A1×A6 (A1 = /root/origin, A6 = "aacg"). Using the logical tables built in this way from the logical numbers for each extraction condition, data can be judged against each extraction condition.
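The relationship between the A and Z logical tables can be illustrated with a small sketch. The logical numbers and their combinations follow the text (Z1 = A1×A2, Z2 = A3×A4, Z3 = A1×A5, Z4 = A1×A6, Z5 = Z2×Z3), but the record representation (a mapping from element path to text content), the sample record, and the way each partial conditional expression is tested are assumptions made for illustration, not the layout of FIG. 13.

```python
# Hypothetical record: element path -> text content.
record = {"/root/origin": "ccgtacaa", "/root/Company/code": "42"}

def code_below_99(r):
    value = r.get("/root/Company/code", "")
    return value.isdigit() and int(value) < 99

# A logical table: one entry per partial conditional expression.
A_TABLE = {
    "A1": lambda r: "/root/origin" in r,                  # /root/origin
    "A2": lambda r: "atcg" in r.get("/root/origin", ""),  # "atcg"
    "A3": lambda r: "/root/Company/code" in r,            # /root/Company/code
    "A4": code_below_99,                                  # <99
    "A5": lambda r: "gtac" in r.get("/root/origin", ""),  # "gtac"
    "A6": lambda r: "aacg" in r.get("/root/origin", ""),  # "aacg"
}

# Z logical table: each entry is a logical product of earlier logical numbers.
Z_TABLE = {
    "Z1": ("A1", "A2"),
    "Z2": ("A3", "A4"),
    "Z3": ("A1", "A5"),
    "Z4": ("A1", "A6"),
    "Z5": ("Z2", "Z3"),
}

# Extraction condition number -> logical number that represents it.
CONDITIONS = {1: "Z1", 2: "Z5", 3: "Z4"}

def evaluate(rec):
    """Test each partial expression once, then combine results by logical number."""
    truth = {number: test(rec) for number, test in A_TABLE.items()}
    for number, parts in Z_TABLE.items():  # insertion order keeps Z2 and Z3 before Z5
        truth[number] = all(truth[p] for p in parts)
    return [cond for cond, number in CONDITIONS.items() if truth[number]]

matched = evaluate(record)
```

Because A1 is shared by Z1, Z3, and Z4, it is tested once per record and its result is reused, which is the saving the partial-expression form is meant to provide.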

The search result determination information 195 shown in FIG. 13 collects, for each extraction condition, the logical number assigned to the combination of logical numbers expressing that extraction condition, the number of the output buffer 150 in which data 211 satisfying the extraction condition is to be stored (labeled "output buffer No." in the figure), and the file descriptor (the associated output condition). Data 211 that satisfies any of the extraction conditions is thus first output to the appropriate output buffer 150, determined by referring to the search result determination information 195, and then output to the appropriate file.

The automata described above (the tag DFA 170, the hierarchical matching NFA 171, the keyword DFA 180, and the CSV analysis DFA 172) are state transition tables for matching the character strings in the search conditions against the data 211. States are connected by arrows indicating the direction of transition. Starting from an initial state, the state changes sequentially according to the character string in the data 211. The states include one or more accepting states, each corresponding to the last character of a character string in the search conditions. An automaton is thus generated so that it reaches one of the accepting states if a character string to be detected exists in the data 211. When the automaton reaches an accepting state, it outputs the hit information for that state. The hit information is specific to the accepting state reached and is generated together with the automaton.
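A keyword automaton with accepting states and hit information can be sketched as follows. The sketch uses Aho-Corasick-style failure links so that one scan of the input detects all keywords; that construction is an assumption chosen for illustration, and the text does not say how the keyword DFA 180 is actually built. The keywords are those of the FIG. 8 example; the input string is made up.

```python
from collections import deque

def build_automaton(keywords):
    """Build a trie with failure links; hits[state] is the hit information."""
    goto = [{}]      # goto[state][char] -> next state
    fail = [0]       # fallback state when no transition exists
    hits = [set()]   # keywords recognised on reaching the state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                hits.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        hits[state].add(word)  # accepting state for the last character
    queue = deque(goto[0].values())
    while queue:  # breadth-first pass to fill in failure links
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            hits[nxt] |= hits[fail[nxt]]
    return goto, fail, hits

def scan(text, automaton):
    """Scan the text once from start to end and report (position, keyword) hits."""
    goto, fail, hits = automaton
    state, found = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in sorted(hits[state]):
            found.append((pos - len(word) + 1, word))
    return found

automaton = build_automaton(["atcg", "gtac", "aacg"])
found = scan("ccgtacgaatcg", automaton)
```

The number of keywords affects the size of the automaton but not the number of passes over the data, which matches the single-scan property described for the character string matching method.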

上記タグＤＦＡ１７０は、キーワードと照合すべき文字列（要素内容）が存在する要素までの検索パスを検出するためのものである。抽出条件群２２０が図８に示す内容であった場合、抽出条件入力部１１０によって図９に示すようなタグＤＦＡ１７０が最終的に生成される。図８に示す抽出条件群２２０では、検索パスとして「／ｒｏｏｔ／ｏｒｉｇｉｎ」及び「／ｒｏｏｔ／Ｃｏｍｐａｎｙ／ｃｏｄｅ」が存在することから、それぞれがタグ名である文字列「ｒｏｏｔ」「ｏｒｉｇｉｎ」「Ｃｏｍｐａｎｙ」及び「ｃｏｄｅ」をそれぞれ検出できるように生成されている。それらの文字列の最後に位置する文字「ｔ」「ｎ」「ｙ」及び「ｅ」の何れかに相当する受理状態まで遷移することで、その文字に対応する文字列が検出されたことを示すヒット情報１７０ａ〜ｄの何れかが出力される。 The tag DFA 170 is for detecting a search path up to an element having a character string (element content) to be matched with a keyword. When the extraction condition group 220 has the contents shown in FIG. 8, the extraction condition input unit 110 finally generates a tag DFA 170 as shown in FIG. In the extraction condition group 220 shown in FIG. 8, since there are “/ root / origin” and “/ root / Company / code” as search paths, the character strings “root”, “origin”, “Company”, which are tag names, respectively. And “code” can be detected. The transition to the acceptance state corresponding to any of the characters “t”, “n”, “y”, and “e” located at the end of those character strings indicates that the character string corresponding to the character has been detected. One of the pieces of hit information 170a-d shown is output.

階層照合ＮＦＡ１７１は、現在、対象とする検索パスを管理するためのものである。抽出条件群２２０が図８に示す内容であった場合、抽出条件入力部１１０によって図１０に示すような階層照合ＮＦＡ１７１が最終的に生成される。そのＮＦＡ１７１は、図１０に示すように、何れかの検索パスに記述されたタグ名を単位とした状態遷移が行われるように生成されている。このため、その状態遷移は開始タグ、及び終了タグによって発生する。ここでは、「４」、及び「２」を表記した状態が受理状態に相当する。 The hierarchical collation NFA 171 is for managing the target search path at present. When the extraction condition group 220 has the contents shown in FIG. 8, the extraction condition input unit 110 finally generates a hierarchical collation NFA 171 as shown in FIG. As shown in FIG. 10, the NFA 171 is generated so that state transition is performed in units of tag names described in any of the search paths. Therefore, the state transition is generated by the start tag and the end tag. Here, a state where “4” and “2” are written corresponds to an accepted state.

「４」を表記した受理状態に遷移したことは、検索パス「／ｒｏｏｔ／Ｃｏｍｐａｎｙ／ｃｏｄｅ」が検出されたことを意味する。それにより、その検索パスで指定されたノードでは、その値が９９未満か否か、つまり論理番号Ａ４の部分条件式（論理）が成立するか否かの照合を行うためのヒット情報１７１ａが出力される。そのヒット情報１７１ａは、照合の対象となる部分条件式を示す論理番号（ここではＡ４）、検索パスの階層の深さを示す階層情報、及びその部分条件式で関係を確認すべき内容を示す比較情報（ここでは＜９９）を含むものである。同様に「２」を表記した受理状態に遷移したことは、検索パス「／ｒｏｏｔ／ｏｒｉｇｉｎ」が検出されたことを意味するから、その検索パスで指定されたノード、つまりタグ名「ｏｒｉｇｉｎ」のタグでは、その文字列が「ａｔｃｇ」「ｇｔａｃ」或いは「ａａｃｇ」の何れと一致するか否かの照合を行うためのヒット情報１７１ｂ−ｄが出力される。それらのヒット情報１７１ｂ−ｄで比較情報を示していないのは、それらに表記した論理番号に対応する部分条件式の照合はキーワードＤＦＡ１８０により行うためである。 A transition to the accepting state labeled “4” means that the search path “/root/Company/code” has been detected. For the node specified by that search path, hit information 171a is therefore output so that it can be checked whether the node's value is less than 99, that is, whether the partial conditional expression (logic) with logical number A4 holds. The hit information 171a contains the logical number of the partial conditional expression to be checked (here, A4), hierarchy information indicating the depth of the search path, and comparison information indicating the relation to be confirmed by that partial conditional expression (here, <99). Similarly, a transition to the accepting state labeled “2” means that the search path “/root/origin” has been detected. For the node specified by that search path, that is, the element with the tag name “origin”, hit information 171b-d is output so that it can be checked whether its character string matches “atcg”, “gtac”, or “aacg”. The hit information 171b-d carries no comparison information because the partial conditional expressions corresponding to the logical numbers it contains are checked by the keyword DFA 180.

階層照合ＮＦＡ１７１における状態遷移は、図９に示すタグＤＦＡ１７０を用いて行われる。例えばタグ名である文字列「ｒｏｏｔ」をタグＤＦＡ１７０により検出すると、つまりタグＤＦＡ１７０によりヒット情報１７０ａを出力すると、ＮＦＡ１７１では「０」を表記した初期状態から「１」を表記した状態に遷移する。次にタグＤＦＡ１７０により文字列「ｏｒｉｇｉｎ」を検出すると、ＮＦＡ１７１では「１」を表記した状態から「２」を表記した状態に遷移する。このとき、タグＤＦＡ１７０により文字列「Ｃｏｍｐａｎｙ」を検出すると、ＮＦＡ１７１では「１」を表記した状態から「３」を表記した状態に遷移する。それらの何れの文字列もタグＤＦＡ１７０により検出できなければ、ＮＦＡ１７１では「１」を表記した状態から「０」を表記した初期状態に遷移する。そのように遷移させることにより、階層照合ＮＦＡ１７１を用いて検索パスに沿った階層の移動の有無を把握し、対象とする検索パスを管理する。 The state transition in the hierarchical collation NFA 171 is performed using a tag DFA 170 shown in FIG. For example, when the character string “root” which is the tag name is detected by the tag DFA 170, that is, when the hit information 170a is output by the tag DFA 170, the NFA 171 transitions from the initial state in which “0” is described to the state in which “1” is described. Next, when the character string “origin” is detected by the tag DFA 170, the NFA 171 transitions from a state where “1” is written to a state where “2” is written. At this time, when the character string “Company” is detected by the tag DFA 170, the NFA 171 transitions from a state where “1” is written to a state where “3” is written. If any of these character strings cannot be detected by the tag DFA 170, the NFA 171 transitions from a state in which “1” is written to an initial state in which “0” is written. By making such a transition, the presence / absence of movement of the hierarchy along the search path is grasped using the hierarchy verification NFA 171 and the target search path is managed.
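The interaction described above, in which tag names detected by the tag DFA 170 drive the hierarchical collation NFA 171, can be sketched as follows. This is a minimal illustration and not the patented implementation: the state numbers follow the description of FIG. 10, while the `DEAD` state, the stack used to return on end tags, and the names `HierarchyTracker` and `matched_paths` are assumptions introduced here for clarity.

```python
# Illustrative transition table for the search paths /root/origin and
# /root/Company/code; state numbers follow the description of FIG. 10.
TRANSITIONS = {
    (0, "root"): 1,
    (1, "origin"): 2,
    (1, "Company"): 3,
    (3, "code"): 4,
}
ACCEPTING = {2: "/root/origin", 4: "/root/Company/code"}
DEAD = -1  # assumed state for elements that lie outside every search path


class HierarchyTracker:
    """Tracks the current position in the element hierarchy (assumed design)."""

    def __init__(self):
        self.state = 0
        self.stack = []  # states to restore when an end tag is seen

    def start_tag(self, name):
        """Advance on a start tag; return the matched search path, if any."""
        self.stack.append(self.state)
        self.state = TRANSITIONS.get((self.state, name), DEAD)
        return ACCEPTING.get(self.state)

    def end_tag(self):
        """Return to the state that was current before the matching start tag."""
        self.state = self.stack.pop()


def matched_paths(events):
    """events: iterable of ("start", name) or ("end", name) tuples."""
    tracker = HierarchyTracker()
    hits = []
    for kind, name in events:
        if kind == "start":
            path = tracker.start_tag(name)
            if path is not None:
                hits.append(path)
        else:
            tracker.end_tag()
    return hits
```

Feeding the start and end tags of a document shaped like `<root><origin/><Company><code/></Company></root>` yields both search paths in document order. Unlike the description, which returns to the initial state when no tag name matches, this sketch parks in a separate dead state so that nested unknown elements cannot match by accident.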

ＣＳＶ解析ＤＦＡ１７２は、キーワードと照合すべき文字列（要素内容）が存在する要素までの検索パスを検出するためのものである。その要素が２つのダブルコーテーション間に存在するＣＳＶデータ（図７）では、抽出条件入力部１１０によって図１１に示すようなＣＳＶ解析ＤＦＡ１７２が生成される。図１１中に表記した「０ｘ」はそれに続くシンボルが１６進数表現であることを表している。 The CSV analysis DFA 172 is for detecting a search path up to an element having a character string (element content) to be matched with a keyword. In the CSV data (FIG. 7) in which the element exists between two double quotations, the extraction condition input unit 110 generates a CSV analysis DFA 172 as shown in FIG. “0x” shown in FIG. 11 indicates that the following symbol is expressed in hexadecimal.
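As a rough illustration of a DFA that locates cells enclosed in double quotation marks, the following sketch scans a CSV record byte by byte and reports where each cell's content starts and ends. The two-state design, the function name `cell_spans`, and the choice to ignore escaped quotation marks are assumptions made for this example; FIG. 11 is not reproduced here. The hexadecimal constant mirrors the “0x” notation mentioned above.

```python
QUOTE = 0x22  # the double quotation mark, written in hexadecimal as in FIG. 11

OUTSIDE, INSIDE = 0, 1  # the two assumed DFA states


def cell_spans(record: bytes):
    """Return (start, end) byte offsets of the content of each quoted cell.

    Escaped quotation marks inside a cell are not handled in this sketch.
    """
    state = OUTSIDE
    spans = []
    start = 0
    for pos, byte in enumerate(record):
        if state == OUTSIDE:
            if byte == QUOTE:
                state = INSIDE
                start = pos + 1  # content begins after the opening quote
        elif byte == QUOTE:
            spans.append((start, pos))  # content ends before the closing quote
            state = OUTSIDE
    return spans
```

The start offset of each span plays the role of the data position information from which keyword matching would begin.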

キーワードＤＦＡ１８０は、抽出条件により指定されたキーワードと一致する文字列をデータ２１１中から検出するためのものである。抽出条件群２２０が図８に示す内容であった場合、抽出条件入力部１１０によって図１２に示すようなキーワードＤＦＡ１８０が最終的に生成される。それに登録された何れかのキーワードの最後に位置する文字に相当する受理状態まで遷移した場合、つまり文字列「ａａｃｇ」「ａｃｇｔ」及び「ｇｔａｃ」の何れかを検出できた場合、検出された文字列に応じてヒット情報１８０ａ〜ｃの何れかが出力される。 The keyword DFA 180 is for detecting, in the data 211, a character string that matches a keyword specified by an extraction condition. When the extraction condition group 220 has the content shown in FIG. 8, the extraction condition input unit 110 finally generates a keyword DFA 180 as shown in FIG. 12. When a transition reaches the accepting state corresponding to the last character of any registered keyword, that is, when any of the character strings “aacg”, “acgt”, and “gtac” is detected, one of the pieces of hit information 180a to 180c is output according to the detected character string.
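A keyword automaton of this kind can be sketched as a trie-shaped transition table in which each accepting state carries its own hit information. The code below is only an illustration under stated assumptions: the assignment of the labels 180a to 180c to particular keywords is hypothetical (the description does not state it), and `build_keyword_dfa` and `match_at` are names invented for this example.

```python
def build_keyword_dfa(keywords):
    """Build a trie-shaped table: transitions[state] maps a character to a state.

    keywords maps a hit-information label to the keyword it reports.
    """
    transitions = [{}]  # state 0 is the initial state
    hits = {}           # accepting state -> hit information label
    for label, word in keywords.items():
        state = 0
        for ch in word:
            nxt = transitions[state].get(ch)
            if nxt is None:
                transitions.append({})
                nxt = len(transitions) - 1
                transitions[state][ch] = nxt
            state = nxt
        hits[state] = label  # the state reached by the last character accepts
    return transitions, hits


def match_at(data, pos, transitions, hits):
    """Run the automaton from data[pos]; return the first hit label, else None."""
    state = 0
    for ch in data[pos:]:
        state = transitions[state].get(ch)
        if state is None:
            return None  # no registered keyword starts at this position
        if state in hits:
            return hits[state]
    return None


# Hypothetical label assignment, for illustration only.
TABLE, HITS = build_keyword_dfa({"180a": "aacg", "180b": "acgt", "180c": "gtac"})
```

Matching starts at a given position, in line with the description that collation begins at the data position reported by the data input structure search unit 120.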

データ入力構造検索部１２０は、入力装置２１０から所定量ずつ連続的にデータ２１１を入力し、そのデータ２１１の種類に応じて、照合に用いるオートマトンを決定する。それにより、データ２１１がＸＭＬデータであれば、タグＤＦＡ１７０、及び階層照合ＮＦＡ１７１を用いて抽出条件の何れかに記述された検索パスの検出を行う。データ２１１がＣＳＶデータであれば、ＣＳＶ解析ＤＦＡ１７２を用いて抽出条件の何れかに記述された項目名の検出を行う。検索パス、或いは項目名を検出すると、その検索パスによって指定されたノード、或いはその項目名のセルが開始する位置を示すデータ位置情報、及び検出された文字列を示すノード・セル情報を抽出条件判定部１３０に通知する。それらの情報は例えばヒット情報として生成するものか、或いはそれを含むものである。それらの情報の通知は、データ２１１の終端を検出するまで、検索パス、或いは項目名を検出する度に行う。その終端の検出は、ＸＭＬデータではルートタグと組になる終了タグの検出に相当し、ＣＳＶデータでは所定個数のセルの検出に相当する。データ入力構造検索部１２０による検索パス、或いは項目名の検出は、Ａ論理テーブル１９０ａに格納された部分条件式が成立することの確認に相当する。 The data input structure search unit 120 continuously reads data 211 from the input device 210, a predetermined amount at a time, and selects the automaton to be used for matching according to the type of the data 211. If the data 211 is XML data, it uses the tag DFA 170 and the hierarchical collation NFA 171 to detect a search path described in any of the extraction conditions. If the data 211 is CSV data, it uses the CSV analysis DFA 172 to detect an item name described in any of the extraction conditions. When a search path or item name is detected, the unit notifies the extraction condition determination unit 130 of data position information, which indicates the position where the node specified by the search path or the cell with that item name starts, and of node/cell information, which indicates the detected character string. This information is, for example, generated as hit information or includes it. The notification is issued each time a search path or item name is detected, until the end of the data 211 is detected. Detecting the end corresponds to detecting the end tag paired with the root tag in XML data, and to detecting a predetermined number of cells in CSV data. Detection of a search path or item name by the data input structure search unit 120 corresponds to confirming that a partial conditional expression stored in the A logic table 190a holds.

抽出条件判定部１３０は、データ入力構造検索部１２０から通知されたデータ位置情報が示すデータ位置より、キーワードＤＦＡ１８０を用いた照合を行う。その照合の結果、そのデータ位置から何れかのキーワードと一致する文字列、或いは関係演算子が示す関係を満たす値（図８に示す抽出条件群２２０では９９未満の値）が存在することを確認すると、Ｚ論理テーブル１９０ｂの該当論理番号の箇所にそのことを示す符号（以降「真符号」と表記し、それと異なる符号を「偽符号」と表記する）を格納する。その確認ができる前にデータ２１１の終端を検出した場合には、その終端の位置を示すデータ位置情報をデータ入力構造検索部１２０に通知する。それにより、構造検索部１２０は、データ２１１の終端を自身が検出したか否かに係わらず、その終端まで走査が終了したことをデータ判定部１４０に通知する。 The extraction condition determination unit 130 performs collation using the keyword DFA 180 based on the data position indicated by the data position information notified from the data input structure search unit 120. As a result of the comparison, it is confirmed that there is a character string that matches any keyword from the data position or a value that satisfies the relationship indicated by the relational operator (value less than 99 in the extraction condition group 220 shown in FIG. 8). Then, a code indicating this (hereinafter referred to as “true code” and a different code as “false code”) is stored at the corresponding logical number in the Z logic table 190b. If the end of the data 211 is detected before the confirmation, the data input structure search unit 120 is notified of the data position information indicating the position of the end. Thereby, the structure search unit 120 notifies the data determination unit 140 that the scanning has been completed up to the end of the data 211 regardless of whether or not the end of the data 211 has been detected.

抽出条件判定部１３０は、上記通知を行うか、或いは構造検索部１２０が終端を検出するまで、構造検索部１２０から情報が通知される度にキーワードＤＦＡ１８０を用いた照合を行う。この結果、データ２１１が抽出条件２を満たしている場合には、論理番号Ｚ２、及びＺ３の符号として真符号が順次、格納され、最後に論理番号Ｚ５の符号として真符号が格納されることになる。そのようにして、対象とするデータ２１１が論理式を満たす論理番号の箇所にのみ真符号が格納されることから、Ｚ論理テーブル１９０ｂを参照することにより、データ２１１が満たす抽出条件を確認できるようになっている。 The extraction condition determination unit 130 performs collation using the keyword DFA 180 each time information is notified from the structure search unit 120, until it issues the above notification or the structure search unit 120 detects the end. As a result, when the data 211 satisfies extraction condition 2, the true code is stored in turn for logical numbers Z2 and Z3, and finally for logical number Z5. Because the true code is stored only at the logical numbers whose logical expressions the target data 211 satisfies, the extraction conditions that the data 211 satisfies can be confirmed by referring to the Z logic table 190b.

このようにして本実施の形態では、抽出条件を構成する条件式をそれが表現する論理により細分化し、その細分化によって得られた部分条件式（細分化論理）単位での照合を行うようにしている。それにより、一致する文字列、或いは検索パスの検出、関係演算子で表す関係の確認、及びそのようなことを行うべき箇所の特定、などをそれぞれ個別に実施している。そのようにすると、より柔軟に対応することが可能となり、データ２１１の種類やその構造などの情報がたとえ不足していたとしても、ユーザは得られている情報から所望のデータ２１１が満たす内容を抽出条件としてより容易に定義できるようになる。このため、ユーザにとっての高い利便性が実現される。 In this way, in this embodiment, the conditional expressions constituting an extraction condition are subdivided according to the logic they express, and collation is performed in units of the partial conditional expressions (subdivided logic) obtained by that subdivision. Detection of matching character strings or search paths, confirmation of the relations expressed by relational operators, and identification of the locations where such checks should be made are thus each carried out individually. This allows more flexible handling: even if information such as the type or structure of the data 211 is incomplete, the user can more easily define, from the information that is available, the content that the desired data 211 satisfies as an extraction condition. High convenience for the user is thereby achieved.

部分条件式（細分化論理）は、同じ、或いは他の抽出条件で別に存在する場合がある。図８に示す例では、部分条件式「／ｒｏｏｔ／ｏｒｉｇｉｎ」は抽出条件１〜３の何れにも記述されている。しかし、そのような複数の同じ記述は、条件式を細分化することにより、一つの部分条件式として残せば済むようになる。それにより、抽出条件の数や内容に係わらず、成立するか否か確認すべき部分条件式は必要最小限に抑えることができる。条件式、或いは抽出条件は複数の部分条件式の組み合わせで表現される。このため、それらが成立するか否かはより迅速に行えることとなる。 Partial conditional expressions (subdivision logic) may exist separately under the same or other extraction conditions. In the example illustrated in FIG. 8, the partial conditional expression “/ root / origin” is described in any of the extraction conditions 1 to 3. However, such a plurality of the same descriptions can be left as one partial conditional expression by subdividing the conditional expression. As a result, regardless of the number and contents of extraction conditions, the partial conditional expressions to be confirmed whether or not they are satisfied can be suppressed to the minimum necessary. A conditional expression or extraction condition is expressed by a combination of a plurality of partial conditional expressions. For this reason, whether or not they are established can be performed more quickly.

データ判定部１４０は、Ｚ論理テーブル１９０ｂを参照して、データ２１１が満たす抽出条件を確認する。その確認により、何れかの抽出条件を満たしていることが判明すると、検索結果判定情報１９５（図１３）を参照して、出力すべき出力バッファ１５０にデータ２１１を出力して格納する。 The data determination unit 140 refers to the Z logic table 190b and confirms the extraction condition that the data 211 satisfies. If it is determined by the confirmation that any one of the extraction conditions is satisfied, the data 211 is output and stored in the output buffer 150 to be output with reference to the search result determination information 195 (FIG. 13).
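The lookup performed by the data determination unit 140 can be pictured as follows. This is a sketch under assumptions, not the structure shown in FIG. 13: only the association between extraction condition 2 and logical number Z5 comes from the description, while the logical numbers Z4 and Z6, the buffer numbers, the file descriptor values, and the names `ResultEntry` and `satisfied_entries` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResultEntry:
    logical_number: str   # logical number assigned to the whole extraction condition
    buffer_no: int        # "output buffer No." of the destination buffer
    file_descriptor: int  # identifies the output file for this condition


# Hypothetical contents of the search result determination information.
RESULT_INFO = [
    ResultEntry("Z4", 0, 3),  # extraction condition 1 (assumed logical number)
    ResultEntry("Z5", 1, 4),  # extraction condition 2 (Z5 per the description)
    ResultEntry("Z6", 2, 5),  # extraction condition 3 (assumed logical number)
]


def satisfied_entries(z_table, result_info=RESULT_INFO):
    """Return the entries whose logical number holds the true code in z_table.

    z_table maps a logical number to True (true code) or False (false code);
    a missing logical number is treated as the false code.
    """
    return [entry for entry in result_info
            if z_table.get(entry.logical_number, False)]
```

With a Z logic table in which Z2, Z3, and Z5 hold the true code, only the entry for extraction condition 2 is returned, so the data would be routed to the output buffer numbered 1 in this hypothetical table.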

図１４は、出力バッファの管理方法を説明する図である。 FIG. 14 is a diagram for explaining an output buffer management method.
データ２１１を対応する出力バッファ１５０への出力は、出力バッファ情報１５１、及びバッファ情報１５２により管理している。出力バッファ情報１５１は、抽出条件群２２０により確保した出力バッファ１５０の数を示す取得バッファ数情報、及びバッファ情報１５２にアクセスするためのポインタ情報を備えている。そのバッファ情報１５２は、取得バッファ数情報が示す数のレコードを備えたものであり、各レコードには、対応する出力バッファ１５０（ここでは出力バッファ１５０ａ〜ｃのうちの一つ）に関する複数の情報を有する個別バッファ情報１５３（ここでは個別バッファ情報１５３ａ〜ｃのうちの一つ）がそれぞれ格納されている。それら出力バッファ情報１５１、及びバッファ情報１５２を格納するエリアは出力バッファ１５０と共に、データ抽出装置１００に搭載、或いは接続された記憶装置１４０１上に確保されている。タグＤＦＡ１７０、階層照合ＮＦＡ１７１、ＣＳＶ解析ＤＦＡ１７２、キーワードＤＦＡ１８０、及び論理テーブル１９０も例えばその記憶装置１４０１に格納される。 Output of the data 211 to the corresponding output buffer 150 is managed by output buffer information 151 and buffer information 152. The output buffer information 151 includes acquisition buffer number information indicating the number of output buffers 150 secured for the extraction condition group 220 and pointer information for accessing the buffer information 152. The buffer information 152 includes the number of records indicated by the acquisition buffer number information, and each record stores individual buffer information 153 (here, one of the individual buffer information 153a to 153c) holding a plurality of pieces of information related to the corresponding output buffer 150 (here, one of the output buffers 150a to 150c). The areas for storing the output buffer information 151 and the buffer information 152 are secured, together with the output buffers 150, on the storage device 1401 mounted on or connected to the data extraction device 100. The tag DFA 170, the hierarchical collation NFA 171, the CSV analysis DFA 172, the keyword DFA 180, and the logical table 190 are also stored in the storage device 1401, for example.

その個別バッファ情報１５３は、対応する出力バッファ１５０にアクセスするためのポインタ情報、そのデータ２１１を格納可能な全サイズを表す全バッファサイズ、そのサイズのなかでデータ２１１を格納可能な残りのサイズを表す残バッファサイズ、確保した出力バッファ１５０自体のサイズを表す出力バッファサイズ、を有している。各レコードに付した番号の大小関係は抽出条件の番号のそれと同じとさせている。つまり、レコード番号０のレコードは抽出条件１に対応している。それにより、データ２１１が満たす抽出条件に対応するレコードを特定できるようにさせている。 The individual buffer information 153 includes pointer information for accessing the corresponding output buffer 150, a total buffer size indicating the total size in which the data 211 can be stored, and a remaining size in which the data 211 can be stored. A remaining buffer size, and an output buffer size representing the size of the secured output buffer 150 itself. The size relationship between the numbers assigned to each record is the same as that of the extraction condition numbers. That is, the record with record number 0 corresponds to the extraction condition 1. Thereby, the record corresponding to the extraction condition satisfied by the data 211 can be specified.

上述したようなことから、データ判定部１４０は、Ｚ論理テーブル１９０ｂを参照してデータ２１１が満たす抽出条件が存在していることを確認すると、検索結果判定情報１９５を参照してその抽出条件を確認し、出力バッファ情報１５１、及びバッファ情報１５２を参照する。それにより、確認した抽出条件に対応するレコードをバッファ情報１５２から取り出し、そのレコードに格納された個別バッファ情報１５３により指定される出力バッファ１５０にデータ２１１を出力する。残バッファサイズは、出力するデータ２１１のサイズにより更新する。 As described above, when the data determination unit 140 refers to the Z logic table 190b and confirms that there is an extraction condition that the data 211 satisfies, the data determination unit 140 refers to the search result determination information 195 to determine the extraction condition. Confirmation is made and the output buffer information 151 and the buffer information 152 are referred to. Thereby, the record corresponding to the confirmed extraction condition is taken out from the buffer information 152, and the data 211 is output to the output buffer 150 designated by the individual buffer information 153 stored in the record. The remaining buffer size is updated according to the size of the data 211 to be output.

データ出力部１６０は、各出力バッファ１５０の例えば残バッファサイズを監視し、そのサイズが所定値以下になるか、或いは入力装置２１０から入力して処理するデータ２１１が無くなった場合に、検索結果判定情報１９５を参照して、出力バッファ１５０に格納されているデータ２１１を対応するファイルに出力する。それにより、出力条件で指定された出力先ファイル名のファイルに、これまでに抽出したデータ２１１を保存する。ここでは、３つのファイル２３１〜２３３は共に同じ出力装置２３０上に保存させている。 The data output unit 160 monitors, for example, the remaining buffer size of each output buffer 150, and determines the search result when the size is equal to or smaller than a predetermined value or when there is no data 211 to be input and processed from the input device 210. With reference to the information 195, the data 211 stored in the output buffer 150 is output to the corresponding file. Thereby, the data 211 extracted so far is stored in the file having the output destination file name designated by the output condition. Here, the three files 231 to 233 are all stored on the same output device 230.
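The buffer bookkeeping and the flush rule described above can be sketched as follows. This is a simplified model under assumptions: `OutputBuffer`, `BufferManager`, the `sink` callback standing in for the file write, and the decision to flush before a write that would not fit are all introduced for illustration and are not taken from the description.

```python
from dataclasses import dataclass, field


@dataclass
class OutputBuffer:
    total_size: int  # total number of bytes the buffer can hold
    data: bytearray = field(default_factory=bytearray)

    @property
    def remaining(self):
        """Remaining buffer size, derived from the stored data."""
        return self.total_size - len(self.data)


class BufferManager:
    def __init__(self, count, total_size, flush_threshold, sink):
        # One buffer per extraction condition; index 0 serves condition 1.
        self.buffers = [OutputBuffer(total_size) for _ in range(count)]
        self.flush_threshold = flush_threshold
        self.sink = sink  # callable(buffer_no, payload), stands in for the file

    def write(self, buffer_no, payload: bytes):
        buf = self.buffers[buffer_no]
        if len(payload) > buf.remaining:
            self.flush(buffer_no)  # assumed policy: make room first
        buf.data += payload
        if buf.remaining <= self.flush_threshold:
            self.flush(buffer_no)  # remaining size fell to the threshold or below

    def flush(self, buffer_no):
        buf = self.buffers[buffer_no]
        if buf.data:
            self.sink(buffer_no, bytes(buf.data))
            buf.data.clear()

    def close(self):
        """Flush every buffer once no more input data remains."""
        for buffer_no in range(len(self.buffers)):
            self.flush(buffer_no)
```

Calling `close()` corresponds to the case in which there is no more data 211 to process, so that whatever is still buffered reaches its file.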

図５は、データ抽出装置１００を実現できるコンピュータのハードウェア構成の一例を示す図である。抽出装置１００は複数のコンピュータ（データ処理装置）により実現させても良いが、ここでは図５に構成を示す１台のコンピュータによって実現されていることを前提として説明することとする。 FIG. 5 is a diagram illustrating an example of a hardware configuration of a computer that can implement the data extraction apparatus 100. Although the extraction device 100 may be realized by a plurality of computers (data processing devices), it will be described here on the assumption that it is realized by a single computer having the configuration shown in FIG.

図５に示すコンピュータは、ＣＰＵ５１、メモリ５２、入力装置５３、出力装置５４、外部記憶装置５５、媒体駆動装置５６、及びネットワーク接続装置５７を有し、これらがバス５８によって互いに接続された構成となっている。同図に示す構成は一例であり、これに限定されるものではない。 The computer shown in FIG. 5 includes a CPU 51, a memory 52, an input device 53, an output device 54, an external storage device 55, a medium drive device 56, and a network connection device 57, which are connected to each other via a bus 58. It has become. The configuration shown in the figure is an example, and the present invention is not limited to this.

メモリ５２は、データを一時的に格納するＲＡＭ等のメモリである。外部記憶装置５５、若しくは媒体駆動装置５６がアクセスする可搬記録媒体ＭＤに記憶されているプログラム、あるいはデータが一時的に格納される。ＣＰＵ５１は、プログラムをメモリ５２に読み出して実行することにより、全体の制御を行う。そのプログラムは、ネットワーク接続装置５７によりネットワークを介して取得したものであっても良い。 The memory 52 is a memory such as a RAM that temporarily stores data. A program or data stored in the portable recording medium MD accessed by the external storage device 55 or the medium driving device 56 is temporarily stored. The CPU 51 performs overall control by reading the program into the memory 52 and executing it. The program may be acquired by the network connection device 57 via the network.

入力装置５３は、例えば、キーボード、マウス等の入力機器と接続されているか、或いはそれらを有するものである。そのような入力機器に対するユーザの操作を検出し、その検出結果をＣＰＵ５１に通知する。 The input device 53 is connected to or has input devices such as a keyboard and a mouse, for example. The user's operation on such an input device is detected, and the detection result is notified to the CPU 51.

出力装置５４は、例えばディスプレイと接続されているか、或いはそれを有するものである。ＣＰＵ５１の制御によって送られてくるデータをディスプレイ上に出力させる。 The output device 54 is connected to or has a display, for example. The data sent under the control of the CPU 51 is output on the display.
ネットワーク接続装置５７は、例えばイントラネットやインターネット等のネットワークを介して、他の装置と通信を行うためのものである。外部記憶装置５５は、例えばハードディスク装置である。主に各種データやプログラムの保存に用いられる。 The network connection device 57 is for communicating with other devices via a network such as an intranet or the Internet. The external storage device 55 is a hard disk device, for example. It is mainly used for storing various data and programs.

記憶媒体駆動装置５６は、フレキシブル・ディスク、光ディスク（ここではＣＤ−ＲＯＭ、ＣＤ−Ｒ、及びＤＶＤ等を含む）、或いは光磁気ディスク等の可搬型の記録媒体ＭＤにアクセスするものである。 The storage medium driving device 56 accesses a portable recording medium MD such as a flexible disk, an optical disk (including CD-ROM, CD-R, and DVD), or a magneto-optical disk.

図３に示す出力装置２３０は、図５に示す構成では外部記憶装置５５、記録媒体ＭＤが装着された媒体駆動装置５６、或いはネットワーク接続装置５７によりアクセス可能な外部装置に相当する。入力装置２１０は、記録媒体ＭＤが装着された媒体駆動装置５６、或いはネットワーク接続装置５７によりアクセス可能な外部装置に相当する。抽出条件群２２０の入力は、入力装置５３、記録媒体ＭＤが装着された媒体駆動装置５６、或いはネットワーク接続装置５７により行うことができる。図１４に示す記憶装置１４０１は、例えば外部記憶装置５５、及びメモリ５２の少なくとも一方に相当する。 In the configuration shown in FIG. 5, the output device 230 shown in FIG. 3 corresponds to the external storage device 55, the medium drive device 56 loaded with the recording medium MD, or an external device accessible by the network connection device 57. The input device 210 corresponds to a medium driving device 56 in which the recording medium MD is mounted or an external device accessible by the network connection device 57. The extraction condition group 220 can be input by the input device 53, the medium driving device 56 to which the recording medium MD is mounted, or the network connection device 57. A storage device 1401 illustrated in FIG. 14 corresponds to at least one of the external storage device 55 and the memory 52, for example.

抽出条件入力部１１０は、例えば出力装置５４を除く各部５１〜５３、及び５５〜５８によって実現される。データ入力構造検索部１２０、及びデータ出力部１６０は共に、例えば入力装置５３、及び出力装置５４を除く各部５１、５２、及び５５〜５７によって実現される。抽出条件判定部１３０、及びデータ判定部１４０は共に、例えば入力装置５３、出力装置５４、及びネットワーク接続装置５７を除く各部５１、５２、５５、５６、及び５８によって実現される。 The extraction condition input unit 110 is realized by, for example, the units 51 to 53 and 55 to 58, that is, all units except the output device 54. The data input structure search unit 120 and the data output unit 160 are both realized by, for example, the units 51, 52, and 55 to 57, that is, excluding the input device 53 and the output device 54. The extraction condition determination unit 130 and the data determination unit 140 are both realized by, for example, the units 51, 52, 55, 56, and 58, that is, excluding the input device 53, the output device 54, and the network connection device 57.

次に、上述した各部１１０、１２０、１３０、及び１４０の動作について、図１５〜図１８に示す各処理のフローチャートを参照して詳細に説明する。それらの処理は何れも、例えばＣＰＵ５１が、外部記憶装置５５、若しくは媒体駆動装置５６に装着された可搬記録媒体ＭＤに記憶されているプログラムをメモリ５２に読み出して実行することにより実現される。 Next, the operation of each of the above-described units 110, 120, 130, and 140 will be described in detail with reference to the flowcharts of the processes shown in FIGS. All of these processes are realized by, for example, the CPU 51 reading out the program stored in the portable storage medium MD mounted on the external storage device 55 or the medium driving device 56 to the memory 52 and executing it.

図１５は、抽出条件入力部１１０が実行する処理のフローチャートである。始めに図１５を参照して、その処理について詳細に説明する。その処理は、例えば抽出条件群２２０の入力をユーザが入力装置５３、或いはネットワークを介して指示することで起動される。その場合、抽出条件群２２０は入力装置５３、或いはネットワーク接続装置５７を介して入力される。 FIG. 15 is a flowchart of processing executed by the extraction condition input unit 110. First, the process will be described in detail with reference to FIG. The process is started, for example, when the user instructs input of the extraction condition group 220 via the input device 53 or the network. In that case, the extraction condition group 220 is input via the input device 53 or the network connection device 57.

先ず、ステップ１１では、抽出条件群２２０を入力し、例えばメモリ５２に保存する。続くステップ１２では、保存した抽出条件群２２０のなかから１抽出条件を選択して読み出し、それを解析して対応するオートマトンの種類を特定する。その次に移行するステップ１３では、特定した種類のオートマトンを生成、或いは更新する。その生成、或いは更新により、抽出条件に記述された文字列が必要に応じてタグＤＦＡ１７０、階層照合ＮＦＡ１７１、或いはキーワードＤＦＡ１８０に登録される。 First, in step 11, the extraction condition group 220 is input and stored in the memory 52, for example. In the subsequent step 12, one extraction condition is selected from the stored extraction condition group 220 and read out, and analyzed to identify the corresponding automaton type. In the next step 13, the specified type of automaton is generated or updated. By the generation or update, the character string described in the extraction condition is registered in the tag DFA 170, the hierarchical collation NFA 171 or the keyword DFA 180 as necessary.

ステップ１３に続くステップ１４では、抽出条件群２２０のなかに選択していない他の抽出条件が有るか否か判定する。そのような抽出条件が残っていた場合、判定はＹＥＳとなって上記ステップ１２に戻り、他の選択条件を選択する。そうでない場合には、判定はＮＯとなり、ステップ１５で論理テーブル１９０の生成と併せて検索結果判定情報１９５（図１３）、出力バッファ情報１５１、及びバッファ情報１５２の生成を行い、抽出条件数に応じた出力バッファ１５０（図１４）の確保を行った後、一連の処理を終了する。このようにして、抽出条件群２２０の入力により、必要なオートマトンの生成に併せて、データ２１１を出力すべき出力先に出力するための準備が行われる。 In step 14 following step 13, it is determined whether there are other extraction conditions not selected in the extraction condition group 220. If such an extraction condition remains, the determination is yes and the process returns to step 12 to select another selection condition. Otherwise, the determination is no, and in step 15, the search result determination information 195 (FIG. 13), the output buffer information 151, and the buffer information 152 are generated together with the generation of the logical table 190, and the number of extraction conditions is set. After securing the corresponding output buffer 150 (FIG. 14), the series of processing ends. In this way, by inputting the extraction condition group 220, preparation for outputting the data 211 to an output destination to be output is performed together with generation of a necessary automaton.

図１６は、データ入力構造検索部１２０が実行する処理のフローチャートである。次に図１６を参照して、その処理について詳細に説明する。その処理は、例えばデータ２１１の入力装置２１０からの取り込みが指示されている間、実行される。 FIG. 16 is a flowchart of processing executed by the data input structure search unit 120. Next, the processing will be described in detail with reference to FIG. The process is executed while an instruction to fetch the data 211 from the input device 210 is given, for example.

先ず、ステップ２１では、入力装置２１０から入力すべきデータ２１１が有るか否か判定する。そのようなデータ２１１が無かった場合、判定はＮＯとなり、再度、その判定を行う。それにより、そのデータ２１１が生じるのを待つ。一方、そうでない場合には、判定はＹＥＳとなってステップ２２に移行する。 First, in step 21, it is determined whether there is data 211 to be input from the input device 210. If there is no such data 211, the determination is no and the determination is performed again. Thereby, it waits for the data 211 to be generated. On the other hand, when that is not right, determination will be YES and it will transfer to step 22.

ステップ２２では、入力装置２１０から所定量のデータ２１１を入力する。続くステップ２３では、入力したデータ２１１から一つを選択し、抽出条件入力部１１０によって決定したオートマトンを用いて、それに登録された文字列の何れかと一致する文字列の検索を行う。 In step 22, a predetermined amount of data 211 is input from the input device 210. In the subsequent step 23, one of the input data 211 is selected, and a character string that matches any of the character strings registered in it is searched using the automaton determined by the extraction condition input unit 110.

その検索は１文字単位で行い、その検索が終了するとステップ２４に移行して、対象となる文字列（検索パス、項目名、など）を検出できたか否か判定する。そのような文字列を検出できなかった場合、判定はＮＯとなってステップ２７に移行する。そうでない場合には、判定はＹＥＳとなってステップ２５に移行する。 The search is performed in units of one character. When the search is completed, the process proceeds to step 24 to determine whether or not the target character string (search path, item name, etc.) has been detected. If such a character string cannot be detected, the determination is no and the process moves to step 27. Otherwise, the determination is yes and the process moves to step 25.

ステップ２５では、データ位置情報等を抽出条件判定部１３０に通知する。その通知により、抽出条件判定部１３０はキーワードＤＦＡ１８０を用いた照合を行い、その照合によってデータ２１１の終端を検出すると、そのデータ位置情報を通知する。このことから、次のステップ２６では、その通知が有ったか否か判定する。その通知が有った場合、判定はＹＥＳとなってステップ２８に移行する。そうでない場合には、判定はＮＯとなって上記ステップ２３に戻り、検索を続行する。 In step 25, the data position information and the like are notified to the extraction condition determination unit 130. In response to the notification, the extraction condition determination unit 130 performs collation using the keyword DFA 180 and, if it detects the end of the data 211 during that collation, sends back the corresponding data position information. Accordingly, in the next step 26, it is determined whether or not that notification has been received. If it has, the determination is YES and the process moves to step 28. Otherwise, the determination is NO and the process returns to step 23 to continue the search.

上記ステップ２４の判定がＮＯとなって移行するステップ２７では、検索によってデータ２１１の終端を検出したか否か判定する。その終端を検出した場合、判定はＹＥＳとなってステップ２８に移行する。そうでない場合には、判定はＮＯとなって上記ステップ２３に戻り、検索を続行する。 In step 27 where the determination in step 24 is NO and the process proceeds, it is determined whether or not the end of the data 211 has been detected by the search. If the end is detected, the determination is yes and the process moves to step 28. Otherwise, the determination is no and the process returns to step 23 to continue the search.

ステップ２８では、データ２１１の終端が検出されたことをデータ判定部１４０に通知する。続くステップ２９では、入力したデータ２１１のなかで未選択のデータ２１１が有るか否か判定する。未選択のデータ２１１が存在する場合、判定はＹＥＳとなって上記ステップ２３に戻り、未選択のデータ２１１を選択して検索を開始する。そうでない場合には、判定はＮＯとなって上記ステップ２１に戻る。それにより、入力装置２１０に入力すべきデータ２１１が有るか否かの確認を行う。 In step 28, the data determination unit 140 is notified that the end of the data 211 has been detected. In the following step 29, it is determined whether or not there is unselected data 211 in the input data 211. If unselected data 211 exists, the determination is yes, the process returns to step 23, the unselected data 211 is selected, and the search is started. Otherwise, the determination is no and the process returns to step 21 above. Thereby, it is confirmed whether or not there is data 211 to be input to the input device 210.

図１７は、抽出条件判定部１３０が実行する処理のフローチャートである。次に図１７を参照して、その処理について詳細に説明する。 FIG. 17 is a flowchart of processing executed by the extraction condition determination unit 130. Next, the processing will be described in detail with reference to FIG. 17.
先ず、ステップ４１では、レコードの終了通知が通知されるのを待つ。その通知を受け取ると、判定がＮＯとなってステップ４２に移行し、通知されたデータ位置情報、及びキーワードＤＦＡ１８０を用いた照合を行う。その次に移行するステップ４３では、キーワードＤＦＡ１８０に登録されたキーワードの何れかと一致する文字列をデータ２１１から検出できたか否か判定する。そのような文字列を検出できた場合、判定はＹＥＳとなり、ステップ４４で論理テーブル１９０（Ｚ論理テーブル１９０ｂ）の該当論理番号の箇所に真符号を設定した後、上記ステップ４１に戻り、通知待ちの状態に移行する。そうでない場合には、判定はＮＯとなってステップ４５に移行する。 First, in step 41, the process waits for notification of the end of a record. When the notification is received, the determination is NO, the process proceeds to step 42, and collation is performed using the notified data position information and the keyword DFA 180. In the next step 43, it is determined whether or not a character string that matches any of the keywords registered in the keyword DFA 180 has been detected in the data 211. If such a character string has been detected, the determination is YES, a true code is set at the position of the corresponding logical number in the logical table 190 (Z logical table 190b) in step 44, and the process returns to step 41 and enters the notification-waiting state. Otherwise, the determination is NO and the process moves to step 45.

ステップ４５では、データ２１１の終端を検出したか否か判定する。照合によってその終端を検出した場合、判定はＹＥＳとなり、そのことを通知するためにデータ位置情報をデータ入力構造検索部１２０にステップ４６で通知した後、上記ステップ４１に戻る。そうでない場合には、判定はＮＯとなって上記ステップ４２に戻り、照合を続行する。 In step 45, it is determined whether or not the end of the data 211 has been detected. If the end is detected by collation, the determination is yes, and the data position information is notified to the data input structure search unit 120 in step 46 in order to notify that, and the process returns to step 41. Otherwise, the determination is no and the process returns to step 42 and the collation is continued.

As described above, the data input structure search unit 120 and the extraction condition determination unit 130 exchange the necessary information as needed, and each advances its processing according to that information. In this way, the extraction conditions satisfied by each item of data 211 are identified, and processing is performed according to the result.

FIG. 18 is a flowchart of the processing executed by the data determination unit 140. Finally, this processing is described in detail with reference to FIG. 18.
First, in step 51, the unit waits for the data input structure search unit 120 to report the end of the data 211. When the notification is received, the determination is NO and the process proceeds to step 52, where the logical table 190 is referenced to determine which extraction conditions the data 211 currently being processed satisfies. The process then proceeds to step 53.

In step 53, it is determined whether there is any extraction condition that the data 211 satisfies. If such an extraction condition exists, the determination is YES and the process proceeds to step 54, where the search result determination information 195 (FIG. 13), the output buffer information 151, and the buffer information 152 (FIG. 14) are referenced, the data 211 is output to the output buffer 150 to which it should be output, and the corresponding individual buffer information 153 is updated. The process then returns to step 51 and enters the notification wait state. Otherwise, the determination is NO and the process returns to step 51.
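Steps 52 to 54 can be illustrated with a short Python sketch. It assumes, purely for illustration, that each extraction condition is a predicate over the logic table and that each output buffer is a list keyed by condition name; the names `CONDITIONS` and `dispatch` are hypothetical and do not appear in the specification.

```python
from collections import defaultdict

# Hypothetical extraction conditions expressed over logical numbers.
CONDITIONS = {
    "cond1": lambda t: t[0] and t[2],  # keywords 0 AND 2 both matched
    "cond2": lambda t: t[1],           # keyword 1 matched
}


def dispatch(record, logic_table, conditions, buffers):
    """Append the record to the buffer of every condition it satisfies.

    Returns the names of the satisfied conditions; an empty list means
    the record is not output anywhere (the NO branch of step 53).
    """
    satisfied = [name for name, pred in conditions.items() if pred(logic_table)]
    for name in satisfied:
        buffers[name].append(record)
    return satisfied


buffers = defaultdict(list)
first = dispatch("r1", [True, False, True], CONDITIONS, buffers)
second = dispatch("r2", [False, False, False], CONDITIONS, buffers)
```

Because a record is appended once per satisfied condition, a single pass over the input yields all extraction results at the same time.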

FIGS. 19 to 24 are diagrams illustrating application examples of the data extraction device described above. Possible uses are described concretely below with reference to FIGS. 19 to 24, in which the data extraction device is labeled "extractor".

FIG. 19 shows an example in which a plurality of data extraction devices 100 are used in multiple stages. The data extraction device 100 that receives the data 1903 distributes that data to two concatenators 1910. One of the two concatenators 1910 concatenates the data of the master file 1901 with the data 1903 and outputs the result to another data extraction device 100, which distributes the concatenation result to two aggregators 1920. The two aggregators 1920 each output their aggregation results to a different data extraction device 100, and each data extraction device 100 that receives an aggregation result distributes that data among three files. The same applies to the other of the two concatenators 1910.

FIG. 20 shows an example in which the data extraction device 100 is used to sort input data. The input data is the data of each record stored in the journal file 2000. The data extraction device 100 is used to distribute data satisfying an extraction condition to one of the journal files 2001 to 2003. The data is distributed in this way because, for example, the concatenation conditions with the masters X to Z differ from one another. Distributing the data in this way allows it to be processed in parallel in three streams, which speeds up processing.
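The split-then-process-in-parallel arrangement of FIG. 20 might look like the following Python sketch. The record layout, the `ROUTES` table, and both function names are assumptions made for illustration; in-memory lists stand in for the journal files, and a thread pool stands in for the three parallel processing streams.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing: which master a record is to be concatenated with
# decides which journal it is written to.
ROUTES = {"X": "journal_2001", "Y": "journal_2002", "Z": "journal_2003"}


def split_journal(records, route):
    """Partition records by destination; records with no destination are dropped."""
    outputs = {}
    for record in records:
        dest = route(record)
        if dest is not None:
            outputs.setdefault(dest, []).append(record)
    return outputs


def process_in_parallel(outputs, worker):
    """Run worker over each partition concurrently and collect the results."""
    names = list(outputs)
    with ThreadPoolExecutor(max_workers=len(names) or 1) as pool:
        results = pool.map(worker, (outputs[n] for n in names))
        return dict(zip(names, results))


records = [
    {"master": "X", "amount": 10},
    {"master": "Y", "amount": 5},
    {"master": "X", "amount": 1},
    {"master": "Q", "amount": 7},  # matches no condition
]
parts = split_journal(records, lambda r: ROUTES.get(r["master"]))
totals = process_in_parallel(parts, lambda rs: sum(r["amount"] for r in rs))
```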

FIG. 21 shows an example in which the data extraction device 100 is used to sort concatenation result data. The concatenation result is obtained by concatenating master data and journal data. The data extraction device 100 is used to output data satisfying any of the extraction conditions 1 to 3 to one of the files 2101 to 2103 according to the extraction condition satisfied.

FIG. 22 shows an example in which the data extraction device 100 is used to sort aggregation result data. The aggregation result is obtained by performing an aggregation operation on the result of concatenating master data and journal data. The data extraction device 100 is used to output aggregation result data satisfying any of the extraction conditions 1 to 3 to one of the files 2201 to 2203 according to the extraction condition satisfied.

FIG. 23 shows an example in which the data extraction device 100 is used to provide a clipping service such as those run by newspaper companies. In this case, for each service registrant, an extraction condition satisfied by the article data to be sent to that registrant is defined in the data extraction device 100. Article data is input to the data extraction device 100 as it arrives and is output to the file corresponding to the extraction condition it satisfies. The article data output to each file is delivered periodically to the service registrant. Adding or deleting a service registrant, or changing a registrant's request, can be handled by adding, deleting, or modifying an extraction condition.
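The clipping service can be modeled in a few lines of Python. The sketch below assumes that each registrant's extraction condition is simply a set of keywords that must all appear in an article; the class `ClippingExtractor` and its methods are hypothetical names, and per-registrant lists stand in for the output files.

```python
class ClippingExtractor:
    """Toy model of the per-registrant extraction described for FIG. 23."""

    def __init__(self):
        self.conditions = {}  # registrant -> set of required keywords
        self.outboxes = {}    # registrant -> articles awaiting delivery

    def register(self, registrant, keywords):
        # Adding or changing a registrant is just adding or replacing a condition.
        self.conditions[registrant] = set(keywords)
        self.outboxes.setdefault(registrant, [])

    def unregister(self, registrant):
        # Deleting a registrant is just deleting the condition.
        self.conditions.pop(registrant, None)
        self.outboxes.pop(registrant, None)

    def feed(self, article):
        # One pass over the article serves every registered condition.
        for registrant, keywords in self.conditions.items():
            if all(k in article for k in keywords):
                self.outboxes[registrant].append(article)

    def deliver(self, registrant):
        """Return and clear the pending articles (periodic delivery)."""
        if registrant not in self.outboxes:
            return []
        pending, self.outboxes[registrant] = self.outboxes[registrant], []
        return pending


ex = ClippingExtractor()
ex.register("alice", ["rail", "fare"])
ex.register("bob", ["baseball"])
for text in ["rail fare increase announced", "baseball season opens", "rail strike"]:
    ex.feed(text)
```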

FIG. 24 shows an example in which the data extraction device 100 is used in a highway usage survey system. In this case, data is input to the data extraction device 100 from the highway monitoring system as it arrives. Extraction conditions for extracting only the necessary data are defined in the data extraction device 100, which then selects (filters) the data according to those conditions. The selected data is collated with master data by a concatenator and expanded into more detailed data. In the example, the company name "XX Transport" is added to the data whose vehicle number is "k2104". The data collated with the master data is then aggregated by an aggregator, for example per company, and output.
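The filter, concatenate, and aggregate chain of FIG. 24 can be sketched in Python as follows. The event fields (`car`, `gate`), the contents of `MASTER`, and the function name `survey` are assumptions for illustration; only the vehicle number "k2104" and the company name "XX Transport" come from the example above.

```python
from collections import Counter

# Hypothetical master data: vehicle number -> company name.
MASTER = {"k2104": "XX Transport", "k3350": "YY Logistics"}


def survey(events, condition, master):
    """Filter events, attach the company name, and count events per company."""
    # Extractor: keep only the events that satisfy the extraction condition.
    selected = (e for e in events if condition(e))
    # Concatenator: expand each event with the matching master data.
    joined = ({**e, "company": master.get(e["car"], "unknown")} for e in selected)
    # Aggregator: total the events per company.
    return Counter(e["company"] for e in joined)


events = [
    {"car": "k2104", "gate": "A"},
    {"car": "k2104", "gate": "B"},
    {"car": "k3350", "gate": "A"},
    {"car": "k9999", "gate": "A"},
]
per_company = survey(events, lambda e: e["gate"] == "A", MASTER)
```

Filtering before the concatenation keeps the master lookup limited to the events that are actually needed.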

In the present embodiment, the data whose output destination is sorted according to the extraction conditions is itself input from outside; however, the input data may instead be data used to generate or to identify the data that is actually sorted. In other words, it may be something like encoded compressed data. Such data may also be input by recording it on the recording medium MD.

Claims

A program for causing a computer to realize a data extraction device capable of extracting data satisfying a specified extraction condition from among obtainable data, the program causing the computer to realize:
a function of acquiring the data;
a function of inputting the extraction condition;
a function of extracting data for each extraction condition, using one or more extraction conditions input by the inputting function; and
a function of outputting the data extracted for each extraction condition by the extracting function to respectively different output destinations.