JP2009064430A

JP2009064430A - Device and method for creating test data

Info

Publication number: JP2009064430A
Application number: JP2008210122A
Authority: JP
Inventors: Katsuyasu Sato; 勝康佐藤
Original assignee: SYSTEMEXE Inc
Current assignee: SYSTEMEXE Inc
Priority date: 2007-08-16
Filing date: 2008-08-18
Publication date: 2009-03-26
Anticipated expiration: 2028-08-18
Also published as: JP5212980B2

Abstract

PROBLEM TO BE SOLVED: To achieve accurate concealment of production environment data with few resources.

SOLUTION: Feature descriptions for the item names of the production environment data stored in a database 41, for the attributes of the items, and for the format of the actual data are stored in a feature storage part 42 together with their points. Each feature description describes the feature pattern of a kind of data as a regular expression. When a feature determination part 10 determines that an item name, its attribute, or sample data extracted from the item matches a feature description, a point counting part 11 counts the points for the item. When the points reach a fixed value, a data content determination part 12 determines that the data of the item is a known type of data having the feature. Data belonging to an item whose feature has been determined by the data content determination part 12 is partially or entirely replaced with test data. The test data is transferred directly to a test environment or output in an arbitrary file format.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a test data creation apparatus and a test data creation method for creating test data, used during development of a database system, from data stored in an existing customer database.

When a system is tested during database system development, the test should run under conditions as close to the production environment as possible so that problems in the system can be identified. Conventionally, therefore, part of the production environment data has been used as test data without modification.

However, using part of the production environment data as-is for testing carries a risk of leaking information assets such as personal information. To address this, methods have been proposed, as shown in Patent Document 1 and Patent Document 2, that create test data concealing the content of the production environment data by converting part of that data into other characters.
Patent Document 1: JP 2004-326510 A
Patent Document 2: JP 2008-65687 A
Patent Document 3: JP H07-36873 A

However, the invention of Patent Document 1 requires the developer to specify what data structure and attributes each piece of production environment data has and which parts of that data are to be converted into test data. For a database that records many different types of data, this places a heavy burden on the developer.

The invention of Patent Document 2, on the other hand, creates intermediate data for export from data recorded in table form and creates test data by associating this intermediate data with replacement master data prepared in advance. As a result, the attributes and data structure of the production environment data cannot be kept identical in the finished test data.

The invention of Patent Document 3 prepares a dictionary database of family and given names and compares each piece of production environment data against the names recorded in that dictionary, thereby determining that the production environment data relates to personal names. Test data is then generated by replacing that production environment data with names from the dictionary database.

However, the invention of Patent Document 3 cannot determine what the production environment data actually contains, that is, whether it is a full name, a family name only, or a given name only. In addition, when the production database records many kinds of data, such as names, addresses, phonetic readings (furigana), postal codes, telephone numbers, dates, numerical values such as monetary amounts, and other character strings, it is difficult to determine which kind of data is stored in which item. For example, postal codes and telephone numbers could not be identified unless every number in the country was registered in the dictionary database.

The present invention has been proposed to solve the above problems of the prior art. Its object is to provide a test data creation apparatus and creation method that analyze the various data recorded in an existing database, both their content and their patterns, to identify the type of data in each item, and that convert the data content according to the identified type, thereby concealing the content of the existing database while producing test data close to its data structure.

The test data creation apparatus of the present invention is characterized by having the following components.
(a) A database that stores multiple types of data in table form, classified by item.
(b) A feature storage unit that stores, for each known type of data, the pattern of features of that data in a form described by a regular expression, together with a point value predetermined for each feature.
(c) A feature determination unit that extracts a plurality of sample data for each item from the data stored in the database and determines whether each sample matches a pattern described by a regular expression stored in the feature storage unit.
(d) A point counting unit that, when the feature determination unit detects in a sample a feature specific to an item, counts for that item the point value predetermined for the detected feature.
(e) A data content determination unit that, when the points counted by the point counting unit reach a fixed value, determines that the data of the item to which the samples belong is the known type of data having that feature.
(f) A data conversion unit that, for data in the database belonging to an item whose features have been determined by the data content determination unit, replaces part or all of that data with concealment data based on the determined data content.

A test data creation method whose constituent steps are the processes executed in the test data creation apparatus configured as above is also an aspect of the present invention.

In the present invention configured as above, it is also an aspect of the invention that the feature storage unit stores features specific to known item names together with points for those features; that the feature determination unit extracts the item names of the table stored in the database and analyzes whether each item name has a feature of a known item name stored in the feature storage unit; and that the point counting unit, in addition to the points for the sample data, counts for the item name the point value predetermined for the detected feature when the feature determination unit detects in that item name a feature specific to a known item name.

In the present invention configured as above, it is also an aspect of the invention that the database stores, for each item of the data stored in table form, the data format of the data belonging to that item as an attribute of the item, and that the data structure determination unit determines the data structure of the data belonging to the item based on the result of comparing this attribute with the data structures of known data, together with the points based on the sample data and/or the item name.

In the present invention configured as above, it is also an aspect of the invention that the data conversion unit converts data by referring to a conversion data storage unit that stores, for each known type of data, replacement rules for replacing part or all of the data stored in the database and the dummy data used by those replacement rules.

According to the present invention, determining whether sample data matches a pattern described by a regular expression stored in the feature storage unit makes it possible to determine the features of the data while taking into account the data-structure pattern of the data belonging to each item of the database. As a result, compared with using the character strings themselves as feature descriptions, the features of various data can be determined with fewer feature descriptions, and concealment of production environment data can be achieved accurately with few resources.

An embodiment of the present invention will be described concretely with reference to the drawings. FIG. 1 is a block diagram showing the configuration of this embodiment, and FIG. 2 is a flowchart showing its operation.

[Configuration and Operation of the Embodiment]
The test data creation apparatus of this embodiment comprises hardware including an input device 1 such as a keyboard; an arithmetic device 2 such as a CPU; a memory 3 into which programs and data are loaded; a storage device 4 such as a hard disk, used as the production environment database 41, the feature storage unit 42, the storage unit 43 for completed test data, and the conversion data storage unit 44, which stores the replacement rules for replacing production environment data with test data and the dummy data those rules use; and an output device 5 such as a display or printer.

By executing a computer program on this hardware, the feature determination unit 10, point counting unit 11, data content determination unit 12, data conversion unit 13, feature correction unit 14, and weight changing unit 15 that make up the test data creation apparatus 6 of the present invention are realized on the computer.

The production environment database 41 stores multiple types of data in table form, classified by item. In this embodiment, the database stores item names, item attributes, and the actual data classified under each item. Examples are as follows.

(1) Item name (column name)
(a) Date type
DATE
TIME
YEAR
(b) String type
ADDRESS
JYUSHO
PREFECTURE
NAME
(c) Numeric type
WEIGHT
LENGTH
AMOUNT
VOLUME

(2) Item attributes (data type)
(a) Date type
DATE
TIMESTAMP
(b) String type
VARCHAR
CHAR
(c) Numeric type
INTEGER
BIGINT
FLOAT
REAL
DOUBLE

(3) Format of actual data
(a) Date
Year, month, day, hour, minute, second
Year, month, day
yyyy/MM/dd hh:mm:ss
yyyy-MM-dd
(b) Character string
Address, from prefecture name down to details such as the street number
Prefecture
Postal code
(c) Numerical value
１２３
456
123,456

The feature storage unit 42 stores, together with their points, feature descriptions for (1) the item names, (2) the item attributes, and (3) the actual data and its format for the production environment data stored in the database 41. Each feature description expresses the feature pattern of a kind of data as a regular expression, for example as follows. Note that these example feature descriptions do not necessarily correspond to the production data examples above.

(1) Item name (column name)
(a) Date
.*[Dd][Aa][Tt][Ee].*
.*[Tt][Ii][Mm][Ee].*
(b) Character string
.*[Aa][Dd][Dd][Rr][Ee][Ss][Ss]$
.*([Jj][Yy]|[Ji])[Uu][Ss]([Hh]|[Yy])[Oo]$
.*[Pp][Rr][Ee][Ff][Ee][Cc][Tt][Uu][Rr][Ee].*
(c) Numerical value
.*([Ww]|[Hh])[Ee][Ii][Gg][Hh][Tt]$

(3) Format of actual data
(a) Date
^.*年.*月.*日.*時.*分.*秒$ … date and time (年 = year, 月 = month, 日 = day, 時 = hour, 分 = minute, 秒 = second)
.*年.*月.*日$ … date
^((19|[2-9][0-9])[0-9]{2})/([1-9]|(0[1-9]|1[0-2]))/([1-9]|(0[1-9]|([12][0-9]|3[01])))$ … Western calendar date
(b) Character string
^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本).* … full name (the alternatives are the family names Sato, Suzuki, Takahashi, Tanaka, Watanabe, Ito, and Yamamoto)
^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本)$ … family name
^[0-9０-９]{1,}[-−][0-9０-９]{1,}[-−][0-9０-９]{1,}$ … telephone number
^[0-9０-９]{3}[-−][0-9０-９]{4}$ … postal code
(c) Numerical value
^[0-9.,]{1,}$

The regular expression notation used in the feature descriptions has the following meanings.
(a) A period matches any single character except a newline.
(b) An asterisk matches zero or more repetitions of the immediately preceding expression.
(c) Brackets (square brackets) match any one of the enclosed characters.
(d) "|" denotes alternation between patterns; it is used to search for "this string or that string".
(e) Parentheses (round brackets) group a pattern for evaluation. Because grouping takes precedence, the content of the parentheses is evaluated first and then the whole.
(f) A dollar sign denotes the end of a line.
(g) A caret denotes the beginning of a line.
(h) {n}, {n,}, and {n,m} specify the number of repetitions: {n} is exactly n times, {n,} is n or more times, and {n,m} is at least n and at most m times.
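As a minimal sketch of how such feature descriptions behave in practice, the following Python snippet applies the telephone number, postal code, and ADDRESS item-name patterns quoted above using the standard `re` module. The function name `classify` and the sample values are illustrative assumptions, not part of the original text.

```python
import re

# Patterns copied from the feature descriptions in the text.
PHONE = re.compile(r"^[0-9０-９]{1,}[-−][0-9０-９]{1,}[-−][0-9０-９]{1,}$")
POSTAL = re.compile(r"^[0-9０-９]{3}[-−][0-9０-９]{4}$")
ADDRESS_COLUMN = re.compile(r".*[Aa][Dd][Dd][Rr][Ee][Ss][Ss]$")


def classify(value):
    """Return the labels of every sample-data pattern that the value matches."""
    labels = []
    if PHONE.search(value):
        labels.append("telephone number")
    if POSTAL.search(value):
        labels.append("postal code")
    return labels


# Hypothetical sample values, chosen only for illustration.
print(classify("03-1234-5678"))   # three digit groups, two hyphens
print(classify("１００-０００１"))  # full-width digits, 3 + 4 layout
print(ADDRESS_COLUMN.match("HOME_ADDRESS") is not None)
```

Both data types consist of digits and hyphens, so the number of digit groups and their lengths are what separate them here.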

Many such feature description patterns are prepared in advance to cover the item names, attributes, and actual data formats expected to be used in production environment databases.

The feature determination unit 10 extracts the item name, the item attribute, and sample data taken arbitrarily from the actual data (for example, some 10 to 20 values) from the production environment data stored in the database 41 (steps 1, 4, and 7 in FIG. 2), and determines whether each matches a pattern described by a regular expression stored in the feature storage unit 42 (steps 2, 5, and 8).
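The extraction step can be sketched as follows, assuming the production data sits in a SQLite database. The table name `customer`, its columns, and the helper `sample_column` are hypothetical; the text does not specify a database product or a sampling method, so this is only one possible reading.

```python
import random
import sqlite3


def sample_column(conn, table, column, size, rng):
    """Return up to `size` randomly chosen non-NULL values from one column.

    Identifiers are interpolated directly, so this sketch must only be
    used with trusted table and column names.
    """
    rows = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL'
    ).fetchall()
    values = [row[0] for row in rows]
    return rng.sample(values, min(size, len(values)))


# Hypothetical production table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (NAME VARCHAR(40), TEL VARCHAR(20))")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(f"name{i}", f"03-0000-{i:04d}") for i in range(50)],
)

# Item names and attributes (declared types) come from the table metadata.
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(customer)")]
samples = sample_column(conn, "customer", "TEL", 15, random.Random(0))
print(columns)
print(len(samples))
```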

For example, when "full name" data is recorded in the database, the item name will contain "NAMAE" or "SHIMEI", the item attribute will be either "VARCHAR" or "CHAR", and the extracted sample data should contain family names and given names common among Japanese people. The item name, attribute, and sample data are therefore each checked to determine which feature description stored in the feature storage unit 42 they match.

Specifically, the test data to be created differs depending on whether the database under development stores a name as a single "full name" item or splits it into two items, "family name" and "given name".

The feature storage unit 42 therefore holds the regular expression feature description ^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本).* for identifying a full name and the feature description ^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本)$ for identifying a family name only, and the feature determination unit 10 analyzes which of these the sample data extracted from the production environment data matches. A full name contains one of the family names common in Japan, such as 佐藤, 鈴木, 高橋, 田中, 渡辺, 伊藤, or 山本 (Sato, Suzuki, Takahashi, Tanaka, Watanabe, Ito, Yamamoto), followed by further characters matched by ".*" (presumably the given name). For family-name-only data, the regular expression ends in "$", so no characters follow the family name.
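The two name patterns can be tried directly in Python. This is only a sketch; the function name `name_features` and the sample values are assumptions made for illustration.

```python
import re

# Patterns quoted in the text for "full name" and "family name only".
FULL_NAME = re.compile(r"^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本).*")
FAMILY_NAME_ONLY = re.compile(r"^(佐藤|鈴木|高橋|田中|渡辺|伊藤|山本)$")


def name_features(sample):
    """Report which of the two name feature descriptions a sample matches."""
    return {
        "full_name": FULL_NAME.match(sample) is not None,
        "family_name_only": FAMILY_NAME_ONLY.match(sample) is not None,
    }


print(name_features("佐藤太郎"))  # family name followed by a given name
print(name_features("佐藤"))      # family name alone
print(name_features("太郎"))      # given name alone
```

Because ".*" also matches zero characters, a bare family name satisfies both patterns as written. In this sketch, then, a column is told apart by whether its samples ever match the full-name pattern without also matching the family-name-only pattern, which is presumably where the accumulated points come in.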

Similarly, for telephone numbers and postal codes, describing patterns such as the digits used, the number of digits, and the positions of hyphens as regular expressions makes it possible to determine what features the sample data has.

When the feature determination unit 10 finds that the item name, its attribute, or sample data extracted from the item matches a feature description (that is, when a feature specific to the item is detected), the point counting unit 11 counts for that item the point value predetermined for the detected feature (steps 3, 6, and 9).

That is, the point counting unit 11 adds points to the item each time a feature description is matched, as follows.
(a) When the item name matches a feature description, the points assigned to that feature description are counted.
(b) When the attribute matches a feature description, the points assigned to that feature description are counted.
(c) When one of the sample data matches a feature description, the points assigned to that feature description are counted.

When the points counted by the point counting unit 11 reach a fixed value (Yes in step 11), the data content determination unit 12 determines that the data of the item to which the item name, attribute, and sample data belong is the known type of data having that feature.
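The counting and threshold steps might be sketched as below. The sample-data patterns are taken from the text, but the item-name patterns for TEL and ZIP, the attribute pattern, every point value, and `THRESHOLD` are hypothetical choices made only to keep the example concrete.

```python
import re
from collections import defaultdict

# Each entry: (what is inspected, data type it suggests, regular expression, points).
# Item-name patterns, attribute pattern, point values, and THRESHOLD are hypothetical.
FEATURES = [
    ("item_name", "telephone number", r".*[Tt][Ee][Ll].*", 3),
    ("item_name", "postal code", r".*[Zz][Ii][Pp].*", 3),
    ("attribute", "telephone number", r"^(VARCHAR|CHAR)$", 1),
    ("attribute", "postal code", r"^(VARCHAR|CHAR)$", 1),
    ("sample", "telephone number",
     r"^[0-9０-９]{1,}[-−][0-9０-９]{1,}[-−][0-9０-９]{1,}$", 1),
    ("sample", "postal code", r"^[0-9０-９]{3}[-−][0-9０-９]{4}$", 1),
]
THRESHOLD = 8


def count_points(item_name, attribute, samples):
    """Add the points of every feature description matched for one item."""
    points = defaultdict(int)
    for target, data_type, pattern, value in FEATURES:
        regex = re.compile(pattern)
        if target == "item_name":
            candidates = [item_name]
        elif target == "attribute":
            candidates = [attribute]
        else:
            candidates = samples
        for text in candidates:
            if regex.search(text):
                points[data_type] += value
    return dict(points)


def determine_content(points):
    """Return the best-scoring data type if it reaches THRESHOLD, else None."""
    if not points:
        return None
    best = max(points, key=points.get)
    return best if points[best] >= THRESHOLD else None


samples = ["03-1234-5678", "090-1111-2222", "06-9999-0000",
           "045-123-4567", "011-222-3333"]
points = count_points("TEL_NO", "VARCHAR", samples)
print(points, determine_content(points))
```

Returning `None` below the threshold leaves the item undetermined, which in this sketch would mean the item is simply not converted.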

The data conversion unit 13 replaces part or all of the data in the production environment data stored in the database 41 that belongs to an item whose features have been determined by the data content determination unit 12 with concealment data, based on the determined data content. That is, once the features of the data belonging to an item have been determined from the item name, attribute, and sample data, the data conversion unit 13 refers to the conversion rules and conversion dummy data stored in the conversion data storage unit 44 (step 12) and converts the production environment data belonging to that item into test data (step 13).

For example, rules such as the following are prepared as conversion rules so that test data suited to verifying the operation of the database system under development can be created.
(a) When the data of the item is a "full name" or "address", convert it to dummy data extracted at random from the many names and addresses prepared in the conversion data storage unit 44.
(b) For telephone numbers and numerical values, replace them with random digits or asterisks.
(c) Swap values with other character strings in the production environment data, without preparing dummy data.
(d) For numerical values and dates, replace them only with numbers within a fixed range.
(e) Replace only the values at fixed positions within a character string or number with dummy data.
(f) The number of test data records to create.
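A few of these conversion rules can be sketched in Python as follows. The dummy names, the `RULES` mapping, and all function names are hypothetical, and only ASCII digits are handled, so this illustrates the idea rather than the patented implementation.

```python
import random
import re

# Hypothetical dummy data standing in for the conversion data storage unit 44.
DUMMY_NAMES = ["Test Taro", "Sample Hanako", "Dummy Jiro"]


def pick_dummy(value, rng):
    """Rule (a): replace the value with a randomly chosen dummy entry."""
    return rng.choice(DUMMY_NAMES)


def randomize_digits(value, rng):
    """Rule (b): replace each ASCII digit with a random digit, keeping separators."""
    return re.sub(r"[0-9]", lambda m: str(rng.randrange(10)), value)


def mask_digits(value, rng=None):
    """Rule (b), alternative: replace each ASCII digit with an asterisk."""
    return re.sub(r"[0-9]", "*", value)


def shuffle_column(values, rng):
    """Rule (c): reuse production values in a different order, with no dummy data."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical mapping from the determined data type to a per-value rule.
RULES = {
    "full name": pick_dummy,
    "telephone number": randomize_digits,
    "postal code": mask_digits,
}


def convert(data_type, value, rng):
    """Apply the rule for the determined data type; leave other items unchanged."""
    rule = RULES.get(data_type)
    return rule(value, rng) if rule else value


rng = random.Random(0)
print(convert("telephone number", "03-1234-5678", rng))
print(convert("postal code", "100-0001", rng))
```

Keeping the separators and the number of digits means the converted value still matches the pattern that identified it, so the test data keeps the format of the production data.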

The test data converted by the data conversion unit 13 is stored in the test data storage unit 43 and is then output externally from the output device 5 or used as data for the database system under development. As shown in FIG. 3, when a connection to the test environment is available, the created test data can be transferred directly to the test environment database 45. When no connection to the test environment is available, the test data can be output to a file in an arbitrary format such as XML, CSV, or TSV.
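For the file output path, a minimal sketch using Python's standard `csv` module is shown below. The function name `write_delimited`, the column names, and the row values are assumptions; an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_delimited(rows, header, stream, delimiter=","):
    """Write a header and converted rows as CSV (",") or TSV ("\t")."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Hypothetical converted rows; in practice `stream` would be an open file.
buffer = io.StringIO()
write_delimited([("Test Taro", "00-0000-0000")], ["NAME", "TEL"], buffer, delimiter="\t")
print(buffer.getvalue())
```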

FIG. 4 is an example screen showing test data obtained in this way. The upper section shows the item name, attribute, and so on of each item making up the database; the middle section shows the production environment data before conversion; and the lower section shows the test data after conversion.

[Effects of the Embodiment]
According to this embodiment configured as above, features of the item names, item attributes, and content of the actual data belonging to each item are described using regular expressions, so the features of item names and data content can be captured as patterns rather than only as exact matches of character strings or numbers. As a result, even data that use the same digits, such as telephone numbers and postal codes, can be told apart accurately by distinguishing their notation patterns. In addition, because regular expressions let a large number of varied features be consolidated into compact descriptions, the feature storage unit 42 requires less storage capacity.

In this embodiment, the item names, item attributes, actual data belonging to the items, and the features described by regular expressions are all expressed as character strings, as in the examples above, so they are easy to write and understand, and feature descriptions are easy to correct and add. In particular, the feature (pattern) descriptions of the feature storage unit 42 are written in an external file and can be corrected easily without affecting the operation of the program.

In addition, when the conversion data storage unit 44 defines a conversion rule that replaces only a certain range of a character string, test data can be created with full regard for consistency with the system under development. For example, when the production environment data is the number "X01-001", "X01-" is an identification code; if it were changed, the data would lose consistency, becoming a different identification code such as "YB4-", and the results would differ. With a conversion rule of the kind described above, however, "X01-" in "X01-001" is left unconverted and only the "001" part is converted, for example to "002", to obtain the test data.
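The partial replacement in this example can be sketched with a capturing group that separates the protected prefix from the replaceable serial part. The pattern `ID_PATTERN` and the function `replace_serial` are hypothetical; the text gives only the single value "X01-001" as an example.

```python
import re

# Hypothetical layout: one letter, two digits, a hyphen, then a 3-digit serial.
ID_PATTERN = re.compile(r"^([A-Z][0-9]{2}-)([0-9]{3})$")


def replace_serial(value, new_serial):
    """Keep the identification prefix and replace only the serial part."""
    match = ID_PATTERN.match(value)
    if match is None:
        return value  # leave values with an unexpected layout untouched
    prefix, serial = match.groups()
    return prefix + str(new_serial).zfill(len(serial))


print(replace_serial("X01-001", 2))
```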

[Other Embodiments]
The present invention is not limited to the embodiment described above and also includes other embodiments such as the following.
(a) In the embodiment above, features are determined and points are counted for all of the item name, the item attribute, and the actual data (sample data) belonging to the item, but the content of the data belonging to an item can also be determined from the sample data alone.

(b) The features of the various data stored in the feature storage unit 42 can be changed freely by adding or modifying regular expression descriptions. In particular, a feature correction unit 14 can be provided in the test data creation apparatus itself so that the regular expressions stored in the feature storage unit 42 are displayed as a list on the output device 5, such as a display, and corrected or added to from the input device 1.

(c) The points stored for each feature in the feature storage unit 42 together with the feature descriptions can be changed freely. By providing a weight changing unit 15 in the test data creation apparatus itself, the points for each feature can be changed from the input device 1 in the same way as the feature descriptions. In particular, each item has a single item name and a single attribute, so the points from those are fixed, whereas the points from the actual data belonging to the item accumulate with the number of sample data. It is therefore desirable to use the weight changing unit 15 to balance the points from the item name and item attribute against the points from the sample data.
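One possible way to strike this balance, shown purely as an assumption since the text does not prescribe a formula, is to scale the sample contribution by the number of samples so that it has a fixed ceiling regardless of how many samples were drawn. The function name `balanced_score` and the default `sample_weight` are hypothetical.

```python
def balanced_score(name_points, attribute_points, sample_points_total,
                   sample_count, sample_weight=10):
    """Combine fixed name/attribute points with a normalized sample contribution.

    The sample contribution is the average points per sample multiplied by
    `sample_weight`, so drawing 10 or 20 samples gives comparable scores.
    """
    if sample_count == 0:
        return float(name_points + attribute_points)
    average = sample_points_total / sample_count
    return name_points + attribute_points + sample_weight * average


# Every sample matched a 1-point feature, once with 10 samples and once with 20.
print(balanced_score(3, 1, 10, 10))
print(balanced_score(3, 1, 20, 20))
```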

FIG. 5 shows an example of the data content determination process of the present invention.
In the figure, 51 denotes the database storing the production environment data, in particular its sample data; 52 denotes the processing performed by the feature determination unit 10, point counting unit 11, and data content determination unit 12; 53 denotes the criteria for that determination; 54 denotes the definitions of the scores associated with the feature descriptions; 55 is a table in which the sample data have been scored by applying the definitions 54; 56 is a table in which the defined scores have been accumulated for each item; and 57 is a table showing the determination result for the data content of each item.

As can be seen from FIG. 5, the discrimination method using the evaluation definition information of the present invention (the feature descriptions explained in the above embodiment combined with the point definitions set for each type according to those features) switches the category of evaluation definition information for each of the numeric, character and date data types, assigns the score defined for each type, and reads the tendency of the population of collected data. The data type here means the type, as the term is used in databases, that defines the nature of the data and the range of values it can represent. The more scores are accumulated, the larger the population becomes, and the distribution of scores for each column of the database becomes clear as a ranking.

In this example, the table and column list information retrieved from the database is classified using the evaluation definition information. The criteria, corresponding to the feature descriptions of the above embodiment, are: 1. "match of specific characters"; 2. "match of a specific format"; 3. "character count"; 4. "determination by accumulation over the population".

As shown at 53 in FIG. 5, criterion 1 scores a "match of specific characters". If specific characters match in a table or column, the data type becomes easier to identify. For example, in 51 of FIG. 5, the first column of the table contains values such as "Tanaka" (田中) and "Sato" (佐藤). In everyday usage, "Tanaka" and "Sato" are widely recognized as common family names. A computer cannot recognize them as words, so they are registered in advance, as characteristic words highly likely to be names, in the definition information storage unit (feature storage unit 42 in FIG. 1) in XML format using regular expressions. With this definition information registered, one point is given when "Tanaka" or "Sato" matches the criterion.

The second column of 51 in FIG. 5 contains "Ichiro" (一郎) and "Mitsuko" (三子). "Ichiro" is widely recognized as a characteristic personal name, but registering whole names would require covering every variant, such as "Jiro" (次郎) and "Saburo" (三郎). In addition, "三子" could be taken as either a personal name or a place name.

Therefore, although a computer cannot recognize them as words, characters that tend to be used in names, such as "郎" (-ro) and "子" (-ko), are registered in advance in the definition information storage unit (feature storage unit 42 in FIG. 1) in XML format using regular expressions. As with the first column, when "郎" or "子" matches as criterion 2, two points are given.
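As a minimal sketch of how such XML-format definitions might be loaded and applied, the following Python fragment stores the name patterns from the two columns above together with their points. The element name `feature`, the attributes `type`, `points` and `pattern`, and the helper names are all hypothetical; the description states only that the definitions are held in XML format using regular expressions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical definition file; element and attribute names are assumptions.
DEFINITIONS_XML = """
<features>
  <feature type="family_name" points="1" pattern="^(田中|佐藤)$"/>
  <feature type="given_name" points="2" pattern="(郎|子)$"/>
</features>
"""


def load_features(xml_text):
    """Parse the definitions into (type, points, compiled regex) tuples."""
    root = ET.fromstring(xml_text)
    return [(f.get("type"), int(f.get("points")), re.compile(f.get("pattern")))
            for f in root.iter("feature")]


def score_value(value, features):
    """Return the points earned by one sample value, keyed by feature type."""
    return {ftype: points for ftype, points, rx in features if rx.search(value)}


features = load_features(DEFINITIONS_XML)
```

Because the suffix pattern matches any value ending in "郎" or "子", names such as "次郎" and "三郎" are covered without registering each one individually.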

Criterion 2 at 53 in FIG. 5 scores a "match of a specific format". Whether a value in a table or column matches a predetermined format makes it highly likely that one can tell, for example, that it is a "date". The definition information storage unit therefore holds definitions in XML format using regular expressions, divided by type such as the date data type, and values are evaluated for whether they match a format such as YYYY/MM/DD that is used as a notation in databases and elsewhere.

For example, the third column of 51 in FIG. 5 contains "2000/09/01". In computer processing, when the date data type of the database is used in the feature storage unit 42 of the above embodiment, the value is evaluated by whether it matches YYYY/MM/DD. In combination with criterion 1, the determination can also be made from the presence of specific characters such as "year, month, day" (年, 月, 日). As with the first column, when "2000/09/01" matches the criterion, four points are given.
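The format check of criterion 2 can be sketched in Python as follows. The pattern, the constant names and the function `date_points` are illustrative assumptions; the additional calendar validation with `datetime.strptime` goes beyond what the description states and is included only to show how a value such as "2000/13/01" could be rejected.

```python
import re
from datetime import datetime

# YYYY/MM/DD as four, two and two ASCII digits separated by slashes.
DATE_PATTERN = re.compile(r"[0-9]{4}/[0-9]{2}/[0-9]{2}")
DATE_POINTS = 4  # points given in the example for a date-format match


def date_points(value):
    """Return 4 points if the value looks like a valid YYYY/MM/DD date."""
    if not DATE_PATTERN.fullmatch(value):
        return 0
    try:
        # Assumed extra step: reject impossible dates such as month 13.
        datetime.strptime(value, "%Y/%m/%d")
    except ValueError:
        return 0
    return DATE_POINTS
```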

Criterion 3 at 53 in FIG. 5 scores the "character count". Whether a value in a table or column has a predetermined number of characters makes it more likely that a "postal code", "telephone number" or the like can be identified. The definitions, held in XML format in the definition information storage unit and divided by type such as the character data type, are therefore written as regular expressions so that they apply when a run of consecutive digits or characters, such as 1 to 9, is present.

For example, the fourth column of the first row in FIG. 5 contains "123-7456". In everyday usage, "123-7456" is recognized as a postal code, a prescribed seven-digit identification number. In computer processing, when the character data type described in XML format in the definition information storage unit is used, the value is evaluated by whether it contains the seven digits described by the regular expression "[0-9]{3}[-][0-9]{4}", and five points are given.
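A minimal Python sketch of this check is shown below, using the regular expression quoted above. The function name `postal_points` and the NFKC normalization step, which folds full-width digits and the full-width hyphen into their ASCII forms before matching, are assumptions added for illustration and are not part of the description.

```python
import re
import unicodedata

# Three digits, a hyphen, four digits: the pattern quoted in the description.
POSTAL_PATTERN = re.compile(r"[0-9]{3}[-][0-9]{4}")
POSTAL_POINTS = 5  # points given in the example for a postal-code match


def postal_points(value):
    """Return 5 points if the whole value has the postal-code shape."""
    # Assumed preprocessing: convert full-width characters to ASCII.
    normalized = unicodedata.normalize("NFKC", value)
    return POSTAL_POINTS if POSTAL_PATTERN.fullmatch(normalized) else 0
```

Using `fullmatch` rather than `search` ties the match to the character count, so a longer digit string that merely contains the pattern earns no points.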

Finally, criterion 4 at 53 in FIG. 5, "determination by accumulation over the population", judges by the total of the points. As scoring of the data proceeds, the result looks like the accumulation image shown at 56 in FIG. 5. The points obtained are totaled for each type, the type with the highest total is selected, and the determination converges on the appropriate definition information. As a result, the stored data is converted more appropriately when the data is replaced.
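The accumulation step can be sketched in Python as follows. The input shape (tuples of column, data type and points), the function names and the concrete threshold are assumptions for illustration; the description specifies only that points are totaled per type, that the highest-scoring type is chosen, and that a determination is made when the points reach a certain value.

```python
from collections import Counter, defaultdict


def accumulate(scored_samples):
    """Total the points per column and per candidate data type.

    scored_samples: iterable of (column, data_type, points) tuples.
    """
    totals = defaultdict(Counter)
    for column, data_type, points in scored_samples:
        totals[column][data_type] += points
    return totals


def decide(totals, threshold):
    """Pick the highest-scoring type per column, or None below threshold."""
    result = {}
    for column, counter in totals.items():
        data_type, points = counter.most_common(1)[0]
        result[column] = data_type if points >= threshold else None
    return result


samples = [
    ("col1", "family_name", 1), ("col1", "family_name", 1),
    ("col1", "family_name", 1), ("col1", "given_name", 2),
    ("col3", "date", 4), ("col3", "date", 4),
    ("col4", "postal_code", 5),
]
totals = accumulate(samples)
```

Columns whose best total stays below the threshold are left undetermined (`None`) in this sketch, so they would not be selected for replacement.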

Functional block diagram of an embodiment of the present invention
Processing flow diagram of the embodiment of FIG. 1
Diagram showing the output format of the created test data
Display screen showing an example of production environment data and the created test data
Diagram showing the processing of the evaluation definition information in an example of the present invention

Explanation of symbols

1 ... Input device
2 ... Arithmetic device
3 ... Memory
4 ... Storage device
41 ... Database of production environment data
42 ... Feature storage unit
43 ... Test data storage unit
44 ... Conversion data storage unit
45 ... Test environment database
5 ... Output device
6 ... Test data creation device
10 ... Feature determination unit
11 ... Point counting unit
12 ... Data content determination unit
13 ... Data conversion unit
14 ... Feature correction unit
15 ... Weighting change unit

Claims

A test data creation device characterized by comprising:
a database that stores multiple types of data in a table format classified by item;
a feature storage unit that stores patterns of the features of known types of data in a form described by regular expressions, and stores a predetermined point for each feature;
a feature determination unit that extracts a plurality of sample data for each item from the data stored in the database and determines whether each sample data conforms to a pattern described by a regular expression stored in the feature storage unit;
a point counting unit that, when the feature determination unit detects in a sample data a feature specific to an item, counts for that item the point predetermined for the detected feature;
a data content determination unit that, when the points counted by the point counting unit reach a certain value, determines that the data of the item to which the sample data belongs is a known type of data having that feature; and
a data conversion unit that, for data in the database belonging to an item whose data features have been determined by the data content determination unit, replaces part or all of the data with confidential data based on the determined data content.

The test data creation device according to claim 1, wherein:
the feature storage unit stores features specific to known item names and points for those features;
the feature determination unit extracts the item names of the table format stored in the database and analyzes whether each item name has a feature of a known item name stored in the storage unit; and
the point counting unit, in addition to the points of the sample data, counts for the item the point predetermined for the detected feature when the feature determination unit detects in the item name a feature specific to a known item name.

The test data creation device according to claim 1, wherein:
the database stores, for each item of the data stored in table format, the data format of the data belonging to that item as an item attribute; and
the data content determination unit determines the data structure of the data belonging to the item based on the result of comparing the attribute of the item with the data structure of known data, together with the points based on the sample data and/or the item.

The test data creation device according to claim 1, wherein the data conversion unit performs data conversion with reference to a conversion data storage unit that stores, for each known data type, a replacement rule for replacing part or all of the data stored in the database and the dummy data used by the replacement rule.

The test data creation device according to any one of claims 1 to 4, further comprising a feature correction unit for correcting a pattern of a feature, expressed by a regular expression and stored in the feature storage unit, by changing the description of the regular expression.

The test data creation device according to claim 2, further comprising a weighting change unit for changing the weighting, in the data content determination unit, between the points of the sample data counted by the point counting unit and the item name and/or item attribute points counted by the point counting unit.

A test data creation method characterized by comprising:
a process of extracting a plurality of sample data from data stored in a database that stores multiple types of data in a table format classified by item, and determining whether the sample data have the features of known types of data by comparing the sample data with feature descriptions of the known data described by regular expressions;
a process of counting, when a feature specific to a known type of data is detected in a sample data by the determination process, the point predetermined for the detected feature for the item in which the sample data is stored;
a process of determining, when the points counted by the point counting process reach a certain value, that the data of the item to which the sample data belongs is the known type of data; and
a process of converting part or all of the plurality of data in the database belonging to an item determined to be a known type of data by the determination process into confidential data, according to the features of the determined known type of data.