JP2012168671A

JP2012168671A - Data compression device, data compression method and data compression program

Info

Publication number: JP2012168671A
Application number: JP2011028083A
Authority: JP
Inventors: Shinichiro Tako; 真一郎多湖; Tatsuya Asai; 達哉浅井; Hiroya Inakoshi; 宏弥稲越; Nobuhiro Yugami; 伸弘湯上; Seishi Okamoto; 青史岡本; Hiroaki Morikawa; 裕章森川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2011-02-14
Filing date: 2011-02-14
Publication date: 2012-09-06
Anticipated expiration: 2031-02-14
Also published as: JP5682354B2

Abstract

PROBLEM TO BE SOLVED: To provide a data compression device capable of increasing compression efficiency without decreasing information accuracy. SOLUTION: A data compression device 1 comprises a storage part 10, a grouping part 11, an information accuracy determination part 13, an event conversion part 15 and a data compression part 17. The storage part 10 stores: setting information that shows the weight of consistency when item information is consistent, and a retention level of information accuracy; an information accuracy determination table that shows an original data retention level determined from a distribution of each item information and the retention level; and an identification determination table that shows conditions for grouping based on a distribution of the weight of consistency between data and on an appearance frequency. The grouping part 11 groups identifiable data together based on the setting information, with respect to each of data groups in event data 2. The information accuracy determination part 13 obtains the distribution of item information, and determines, based on the setting information, the original data retention level of each item, with respect to each group. The event conversion part 15 rewrites data information in accordance with the original data retention level. The data compression part 17 compresses a data group including the converted data.

Description

The present invention relates to a data compression technique, and more particularly to a technique for compressing data that has data items.

In recent years, large volumes of data such as sensor data and log data have come to be generated, and there is a growing effort to extract knowledge from such data. The data must therefore be stored and transferred in bulk, and data compression can be used to reduce the cost of storage and transfer.

A well-known form of data compression combines dictionary-based compression with entropy compression. Dictionary-based compression registers character strings found in the data in a dictionary and, when a previously seen string appears again, replaces it with a code indicating where it is registered in the dictionary. Entropy compression compresses data by deciding the length of the code assigned to each symbol according to how frequently that symbol appears.
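As a point of reference, the DEFLATE format used by Python's standard `zlib` module is one widely used combination of this kind (LZ77 back-references followed by Huffman coding). The sketch below is illustrative only: the patent does not name a specific codec, and the sample records are hypothetical.

```python
import zlib

# Hypothetical POS-style records (not taken from the patent figures).
# Repeated substrings are replaced by back-references in the dictionary
# stage; the entropy stage then gives frequent symbols shorter codes.
records = "\n".join(
    f"2011,02,14,10,{minute:02d},00,item-0001,white,M" for minute in range(60)
).encode("utf-8")

compressed = zlib.compress(records, 9)  # 9 = strongest compression level
restored = zlib.decompress(compressed)  # lossless round trip

print(f"{len(records)} bytes -> {len(compressed)} bytes")
```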

Even with such data compression, the compressed data can still be large depending on the amount of data to be compressed, and ways of improving compression efficiency have been studied.

One conventional method improves the compression ratio by lowering the information accuracy of the data to be compressed (the original data). For example, when the original data contains time data of the form “hour:minute:second” (for example, 10:36:24), preprocessing deletes the “minute:second” values and converts the data into time data that shows only the hour (10), which raises the compression ratio. As another example, preprocessing converts a numerical value in the original data (for example, 0.47924927897) into a value with fewer significant decimal digits (0.479).
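The two conventional preprocessing steps just described can be sketched as follows. The function names are hypothetical, and truncation (rather than rounding) of the decimal digits is an assumption; the text only gives the before and after values.

```python
def keep_hour_only(timestamp: str) -> str:
    """Drop the minute and second fields of an 'hour:minute:second' string."""
    return timestamp.split(":")[0]


def truncate_decimals(value: str, places: int = 3) -> str:
    """Keep at most `places` digits after the decimal point (no rounding)."""
    whole, _, fraction = value.partition(".")
    return f"{whole}.{fraction[:places]}" if fraction else whole


print(keep_hour_only("10:36:24"))
print(truncate_decimals("0.47924927897"))
```

Both steps are applied uniformly to every record, which is exactly the loss of accuracy that the invention aims to avoid.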

Also known is a process that reduces the volume of transferred data by extracting, compressing, and transmitting only the field data required for a given analysis.

Japanese Unexamined Patent Application Publication No. 2004-199377 (JP 2004-199377 A)

Conventionally, information accuracy had to be lowered in order to raise the efficiency of data compression. When the original data is preprocessed to lower the information accuracy of a specific data item, the accuracy of that item is lowered uniformly, which reduces the value of the item when the data is later used.

An object of the present invention is to provide a data compression technique that increases the data compression ratio without lowering information accuracy.

A data compression apparatus disclosed as one embodiment of the present invention includes: a storage unit that stores 1) an identification determination table in which conditions are set for collecting data into one group using a matching degree, calculated from the matching weights given to items whose information matches among a plurality of data, and the appearance frequency of data in a data group, and 2) an information accuracy determination table that defines, based on the distribution of information for each item, the correspondence between the holding level indicating the information accuracy set for the item and the minimum principal holding level indicating the information accuracy after conversion; 3) a group division unit that, for each piece of data in the data group to be processed, obtains a matching degree based on the matching weights with respect to the other data in the data group, using setting information in which a matching weight and a holding level are set for each item, and groups together data that satisfy the conditions of the identification determination table; 4) an information accuracy determination unit that, for each group, obtains the distribution of information appearing in each item and, based on the information accuracy determination table, determines the principal holding level of each item according to that distribution and the holding level in the setting information; 5) an event conversion unit that, for each group, converts the information of each item of the data in the group based on the principal holding level, in accordance with a conversion rule for rewriting data information according to the principal holding level; and 6) a data compression unit that performs data compression on the data group including the converted data.

A data compression program disclosed as another embodiment of the present invention causes a computer to execute the processing realized by the above apparatus.

A data compression method disclosed as another embodiment of the present invention is one in which a computer executes the processing steps realized by the above apparatus.

According to the data compression apparatus described above, the data in the data group to be compressed is grouped according to the distribution of matches in item information, and the data whose information has been rewritten according to the set information accuracy is then compressed. The efficiency of data compression can therefore be improved while information accuracy is maintained.

A diagram showing a configuration example in an embodiment of the data compression apparatus disclosed as one embodiment of the present invention.
A diagram showing an example of an event data group acquired by the data compression apparatus.
A diagram showing an example of the setting information.
A diagram showing an example of the identification determination table.
A diagram showing an example of the information accuracy determination table.
A diagram showing an example of the conversion rule.
A diagram showing an example of group division of event data.
A diagram showing an example of rewriting the information of event data.
A diagram showing an example of rewriting the information of event data.
A diagram showing an example of an event data group after the information rewriting process.
A diagram showing a data image when an event data group is compressed by a dictionary-based compression method.
A diagram showing an example of the processing flow of the data compression apparatus.
A diagram showing a hardware configuration example of the data compression apparatus.

FIG. 1 is a diagram illustrating a configuration example in an embodiment of a data compression apparatus disclosed as one embodiment of the present invention.

The data compression apparatus 1 shown in FIG. 1 performs compression processing on a data group (event data group) containing information corresponding to a plurality of items, and outputs compressed data 3.

The data group processed by the data compression apparatus 1 is a set of event data 2 (an event data group) of the kind generated for each event in a POS (Point of Sales) system.

The data compression apparatus 1 includes a storage unit 10, a group division unit 11, an information accuracy determination unit 13, an event conversion unit 15, and a data compression unit 17. The data compression apparatus 1 may also include a setting unit 19.

The storage unit 10 stores an identification determination table 101, an information accuracy determination table 103, a conversion rule 105, and setting information 107.

The identification determination table 101 is a data table in which conditions are set for collecting data into one group, using a matching degree, calculated from the matching weights given to items whose information matches among a plurality of data, and the appearance frequency of data in the data group.

The information accuracy determination table 103 is a data table that defines, based on the number of unique values of the information in each item, the correspondence between the holding level indicating the information accuracy set for the item and the principal holding level indicating the information accuracy of the information after conversion.

The conversion rule 105 is information that defines how information is converted for each principal holding level.

The setting information 107 is information in which a matching weight and a holding level are set for each item of the event data 2.

Based on the setting information 107, the group division unit 11 obtains, for each piece of event data 2 in the event data group to be processed, the distribution of matching weights with respect to the other event data 2, and collects data that satisfy the conditions of the identification determination table 101 into one group (event set).

For each group formed by the group division unit 11, the information accuracy determination unit 13 obtains the distribution of the information appearing in each item of the event data 2 in the group. Then, based on the information accuracy determination table 103, it determines for each item, from that distribution and the holding level in the setting information 107, the principal holding level, which indicates the information accuracy retained when the item's information is converted.

The distribution of information is specified, for example, by the number of unique values of the information.

The event conversion unit 15 converts, for each group, the information in each item of the event data in the group based on the principal holding level, in accordance with the conversion rule 105. The event conversion unit 15 adds each group (event set) of converted event data 2 to a conversion event group, and also adds to the conversion event group any event data 2 from the input event data group that was left over without being collected into an event set.

The data compression unit 17 performs data compression on the conversion event group and generates the compressed data 3.

The setting unit 19 accepts input of a matching weight and a holding level for each item of the event data 2, generates the setting information 107, and stores it in the storage unit 10. Alternatively, the setting unit 19 accepts the setting information 107 itself and stores it in the storage unit 10.

FIG. 2 is a diagram illustrating an example of an event data group acquired by the data compression apparatus 1.

Each row of the event data group shown in FIG. 2 represents an individual piece of event data 2.

The event data 2 contains information for a plurality of preset items; in this embodiment these are “year, month, day, hour, minute, second, customer ID, register number, cashier, product ID, color, size”.

FIG. 3 is a diagram illustrating an example of the setting information 107.

In the setting information 107, a “matching weight” and a “holding level” are set for each item of the event data 2 shown in FIG. 2.

The “matching weight” is a value given to an item of matching event data 2 when the information in that item matches among a plurality of event data 2. It indicates how much importance is placed on the item's information matching when the matching degree between event data 2 is judged. In the example of the setting information 107 in FIG. 3, the matching weight takes values from 0 to 1 in steps of 0.1; the closer the value is to 1, the more importance is placed on the item's information matching.

When event data 2 are collected into groups, emphasis is placed on matches in items whose matching weight is close to 1. In the example of FIG. 3, matches in the “product ID” and “size” items are emphasized.

The “holding level” is a value that sets the accuracy with which the information in an item of the event data 2 is retained. In the example of the setting information 107 in FIG. 3, one of four levels is set: “complete, range, distribution, deletable”.

“Complete” indicates an information accuracy at which the pieces of information appearing in the item are kept as they are. “Complete” is set for items in which each individual piece of information must remain identifiable.

“Range” indicates an information accuracy at which the pieces of information appearing in the item are represented by a pair of limit values. “Range” is set for items that are considered useful as long as the range of the information can be identified.

“Distribution” indicates an information accuracy at which the pieces of information appearing in the item are represented by the proportion of each piece of information relative to all the information in the group. “Distribution” is set for items that are considered useful as long as the distribution of the information can be identified.

“Deletable” indicates an information accuracy at which the pieces of information appearing in the item may be deleted when they are scattered and none of them reaches a certain count. It is set for items whose information is not considered significant when it varies widely.
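One way to picture the setting information 107 is as a per-item mapping, as in the sketch below. The item names in English, the `ItemSetting` type, and the `SETTING_INFORMATION` name are illustrative. The holding levels and most weights follow the values quoted in the worked example of this description; the weights for minute, second, customer ID, and register number are not stated individually, so 0.2 each is an assumption chosen only so that the total equals the stated 4.6.

```python
from dataclasses import dataclass
from enum import Enum


class HoldingLevel(Enum):
    COMPLETE = "complete"
    RANGE = "range"
    DISTRIBUTION = "distribution"
    DELETABLE = "deletable"


@dataclass(frozen=True)
class ItemSetting:
    weight: float               # matching weight, 0 to 1 in steps of 0.1
    holding_level: HoldingLevel


SETTING_INFORMATION = {
    "year":        ItemSetting(0.1, HoldingLevel.COMPLETE),
    "month":       ItemSetting(0.5, HoldingLevel.RANGE),
    "day":         ItemSetting(0.1, HoldingLevel.RANGE),
    "hour":        ItemSetting(0.5, HoldingLevel.DELETABLE),
    "minute":      ItemSetting(0.2, HoldingLevel.DELETABLE),   # weight assumed
    "second":      ItemSetting(0.2, HoldingLevel.DELETABLE),   # weight assumed
    "customer_id": ItemSetting(0.2, HoldingLevel.COMPLETE),    # weight assumed
    "register_no": ItemSetting(0.2, HoldingLevel.RANGE),       # weight assumed
    "cashier":     ItemSetting(0.1, HoldingLevel.DELETABLE),
    "product_id":  ItemSetting(1.0, HoldingLevel.COMPLETE),
    "color":       ItemSetting(0.5, HoldingLevel.DISTRIBUTION),
    "size":        ItemSetting(1.0, HoldingLevel.DISTRIBUTION),
}

total_weight = sum(s.weight for s in SETTING_INFORMATION.values())
print(f"total matching weight: {total_weight:.1f}")
```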

FIG. 4 is a diagram illustrating an example of the identification determination table 101.

In the identification determination table 101, a “frequency” is set in association with each “matching degree”.

The “matching degree” is a range of matching degrees, based on the matching weights, among a plurality of event data 2. In this embodiment, the matching degree is the ratio of the sum of the matching weights given to the items whose information matches to the sum of the matching weights of all items. In the example of FIG. 4, the matching degree is shown as a percentage; “10%-29%” denotes a matching degree in the range from 10% to 29%.

The “frequency” is the number of occurrences of event data 2 required in order to group event data 2 of the corresponding matching degree (that is, to collect them into one event set).

In the example shown in FIG. 4, when the event data group contains six or more pieces of event data 2 whose matching degree is 30%, those event data 2 are regarded as identical and collected into one group.
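The lookup in the identification determination table 101 can be sketched as a threshold check. Only part of FIG. 4 is described in the text: the 70-100% band requiring three or more events and the six-event requirement at 30% come from this description, while the upper edge of the 30% band and the count for the 10%-29% band are assumptions, as are the names `IDENTIFICATION_TABLE` and `can_group`.

```python
# (lower bound of the matching-degree band in %, minimum number of events),
# listed from the highest band to the lowest.
IDENTIFICATION_TABLE = [
    (70, 3),   # 70-100%: 3 or more (from the worked example in the text)
    (30, 6),   # from 30%: 6 or more (text); band assumed to run up to 69%
    (10, 10),  # 10%-29%: band from the text, the count of 10 is assumed
]


def can_group(matching_degree_pct: float, occurrences: int) -> bool:
    """Return True when events at this matching degree may form one event set."""
    for lower_bound, min_count in IDENTIFICATION_TABLE:
        if matching_degree_pct >= lower_bound:
            return occurrences >= min_count
    return False  # below the lowest band: never grouped (assumption)


print(can_group(83, 3), can_group(30, 5))
```

The design intent is visible in the numbers: the weaker the match, the more occurrences are required before events are treated as interchangeable.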

FIG. 5 is a diagram illustrating an example of the information accuracy determination table 103.

In the information accuracy determination table 103, for each holding level that can be set for an item in the setting information 107, the principal holding level, which is the information accuracy retained by the conversion, is set according to the number of unique values, which represents the distribution of the information appearing in that item of the grouped event data 2.

The principal holding level is information indicating the information accuracy that is retained when the information in each item of a group is converted, and is defined as one of “complete, distribution, range, delete”.

“Complete” indicates an information accuracy at which the pieces of information appearing in the item are kept as they are. “Range” indicates an information accuracy at which the pieces of information appearing in the item are represented by a pair of limit values. “Distribution” indicates an information accuracy at which the pieces of information appearing in the item are represented by the proportion of each piece of information relative to all the information in the group. “Delete” indicates that the pieces of information appearing in the item are deleted.

In the information accuracy determination table 103 of FIG. 5, when the holding level is “complete”, the principal holding level is set to “complete” regardless of the number of unique values.

When the holding level is “range”, the principal holding level is “complete” for a unique number of 1, “distribution” for a unique number of 2, and “range” for a unique number of 3, or of 4 or more.
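The two rows of the information accuracy determination table 103 described above can be written as a small lookup. This is a sketch limited to those two rows; the rows for “distribution” and “deletable” are in FIG. 5, which is not reproduced in the text, so the function (whose name is hypothetical) rejects them rather than guessing.

```python
def principal_holding_level(holding_level: str, unique_count: int) -> str:
    """Map (holding level, number of unique values) to a principal holding level."""
    if unique_count < 1:
        raise ValueError("unique_count must be at least 1")
    if holding_level == "complete":
        return "complete"  # regardless of the number of unique values
    if holding_level == "range":
        if unique_count == 1:
            return "complete"
        if unique_count == 2:
            return "distribution"
        return "range"     # 3, or 4 or more
    raise ValueError(f"row for {holding_level!r} is not described in the text")


print(principal_holding_level("range", 2))
```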

FIG. 6 is a diagram illustrating an example of the conversion rule 105.

The conversion rule 105 specifies, for each principal holding level, how information is converted; that is, how the information in each item of the event data 2 is converted and written, based on the principal holding level determined for each group.

In the conversion rule 105 of FIG. 6, for example, in the case of “complete”, the information appearing in the item is written as it is, without conversion. In the case of “range”, the upper and lower limit values (for example, the maximum and minimum values) of the information appearing in the item are obtained and written as “maximum value-minimum value”; when the unique number is 2, they are written as “maximum value;minimum value”. In the case of “distribution”, the count of each piece of information appearing in the item is obtained, the ratio (%) of each count to the total count is calculated, and when information A, B, and C account for a%, b%, and c%, this is written as “A-B-C=a%-b%-c%”. In the case of “delete”, the information appearing in the item is deleted.
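A minimal sketch of these four rules follows. Several details are assumptions: the function name `convert_item`, treating range values as integers stored in strings, listing distribution labels in order of first appearance, and representing a deleted value as an empty string. For the range notation the sketch writes the smaller limit first, which is how the worked example later in this description writes it (minutes 17, 24, 53 become “17-53”).

```python
from collections import Counter


def convert_item(values: list[str], level: str) -> list[str]:
    """Rewrite one item of every event in a group for a principal holding level."""
    if level == "complete":
        return list(values)              # keep each value as it is
    if level == "delete":
        return [""] * len(values)        # drop the information
    if level == "range":
        low = min(values, key=int)       # assumes integer-valued strings
        high = max(values, key=int)
        # Two unique values use the "max;min" form given in the rule.
        text = f"{high};{low}" if len(set(values)) == 2 else f"{low}-{high}"
        return [text] * len(values)
    if level == "distribution":
        counts = Counter(values)         # keeps first-appearance order
        labels = "-".join(counts)
        shares = "-".join(
            f"{round(100 * n / len(values))}%" for n in counts.values()
        )
        return [f"{labels}={shares}"] * len(values)
    raise ValueError(f"unknown principal holding level: {level!r}")


minutes = convert_item(["17", "24", "53"], "range")
colors = convert_item(
    ["white", "red", "sky blue", "white", "red", "sky blue", "white"],
    "distribution",
)
print(minutes[0], colors[0])
```

After conversion every event in the group carries the same string in the converted item, which is what gives the later dictionary-based compression more repeated substrings to exploit.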

Next, the processing of the data compression apparatus 1 is described more specifically.

Assume that the group division unit 11 has acquired the event data group shown in FIG. 2.

For each piece of event data 2 in the event data group, the group division unit 11 compares it with each of the other event data 2. Where the information stored in an item matches, it refers to the setting information 107 and gives the matching weight to that item. It then obtains the matching degree by dividing the sum of the matching weights given to the items by the sum of the matching weights set for all items.

The group division unit 11 then refers to the identification determination table 101 and, for the event data group shown in FIG. 2, groups together event data 2 that have a high matching degree and a high frequency (number of occurrences) to form an event set. When data with the same matching degree appear in at least a certain number, the data are grouped giving priority to the matching weights of high-priority items.

FIG. 7 is a diagram illustrating an example of group division of the event data 2.

The three pieces of event data 2 shown in FIG. 7 are event data 2 extracted as an event set by the group division unit 11 from the event data group shown in FIG. 2.

In the event data 2 of FIG. 7, the information in the items “year (0.1), month (0.5), day (0.1), hour (0.5), cashier (0.1), product ID (1), color (0.5), size (1)” matches, so the sum of the matching weights is 3.8. Since the sum of the matching weights of all items of the event data 2 is 4.6, the matching degree is calculated as 83%.
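The arithmetic above (3.8 / 4.6 ≈ 83%) can be checked with a short sketch. The event field values are hypothetical, and the four weights not quoted in the text (minute, second, customer ID, register number) are assumed to be 0.2 each so that the total is 4.6.

```python
WEIGHTS = {
    "year": 0.1, "month": 0.5, "day": 0.1, "hour": 0.5,
    "minute": 0.2, "second": 0.2, "customer_id": 0.2, "register_no": 0.2,  # assumed
    "cashier": 0.1, "product_id": 1.0, "color": 0.5, "size": 1.0,
}


def matching_degree(a: dict, b: dict, weights: dict = WEIGHTS) -> float:
    """Sum of weights of matching items divided by the sum of all weights."""
    matched = sum(w for item, w in weights.items() if a[item] == b[item])
    return matched / sum(weights.values())


# Hypothetical events that differ only in minute, second, customer and register.
event_a = {
    "year": "2011", "month": "02", "day": "14", "hour": "10",
    "minute": "17", "second": "01", "customer_id": "C001", "register_no": "1",
    "cashier": "A", "product_id": "P100", "color": "white", "size": "M",
}
event_b = dict(event_a, minute="24", second="11", customer_id="C002", register_no="2")

degree = matching_degree(event_a, event_b)
print(f"{degree:.0%}")
```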

Further, based on the identification determination table 101, the frequency condition (≥ 3) corresponding to the applicable matching degree (70-100%) is satisfied, so the three pieces of event data 2 shown in FIG. 7 are regarded as identifiable and collected into one event set.

Next, the information accuracy determination unit 13 refers to the setting information 107 and the information accuracy determination table 103 and determines the information accuracy of each item of the event data 2 for the event set formed by the group division unit 11. That is, the information accuracy determination unit 13 counts the number of unique values of the information for each item. It then refers to the information accuracy determination table 103 and determines the principal holding level from the holding level set in the setting information 107 and the unique number it has obtained.

More specifically, in the case of the event set shown in FIG. 7, “year” has holding level = complete and all of its information (values) is the same (unique number = 1), so its principal holding level is set to complete. “Month” and “day” have holding level = range and unique number = 1, so their principal holding level is set to complete. “Hour” has holding level = deletable and unique number = 1, so its principal holding level is set to complete. “Minute” and “second” have holding level = deletable and unique number = 3, so their principal holding level is set to range.

“Customer ID” has holding level = complete and unique number = 3, so its principal holding level is set to complete. “Register number” has holding level = range and unique number = 2, so its principal holding level is set to range. “Cashier” has holding level = deletable and unique number = 1, so its principal holding level is set to complete.

“Product ID” has holding level = complete and unique number = 1, so its principal holding level is set to complete. “Color” has holding level = distribution and unique number = 1, so its principal holding level is set to complete. “Size” has holding level = distribution and unique number = 1, so its principal holding level is set to complete.

Next, the event conversion unit 15 refers to the conversion rule 105 and, based on the information accuracy (principal holding level) determined by the information accuracy determination unit 13, rewrites the information in each item of the event data 2 collected into the event set. That is, the event conversion unit 15 rewrites the information in each item of each piece of event data 2 in accordance with the conversion rule 105 corresponding to that item's principal holding level.

FIGS. 8 and 9 are diagrams illustrating examples of rewriting the information of the event data 2.

The three pieces of event data 2 shown in FIG. 8 are the event data 2 of the event set shown in FIG. 7 as processed by the event conversion unit 15.

Here, the principal holding level of the “minute” information (17, 24, 53) has been determined to be range. The event conversion unit 15 obtains the minimum and maximum values from the information of the event data 2 in the event set, and the “minute” information of every piece of event data 2 in the event set is rewritten as “17-53”.

The principal holding level of the “second” information (01, 11, 24) has also been determined to be range, so the minimum and maximum values are obtained in the same way, and the “second” information of every piece of event data 2 in the event set is rewritten as “01-24”.

For the other items, the principal holding level has been determined to be complete, so the information of each piece of event data 2 in the event set is kept as it is.

The seven pieces of event data 2 shown in FIG. 9(A) form another event set extracted from the event data group shown in FIG. 2 and processed by the event conversion unit 15.

For the event set shown in FIG. 9(A) as well, the principal holding level is determined and the information of each item is rewritten in the same way as in the process described with reference to FIG. 8.

In the event set shown in FIG. 9(A), when the information accuracy determination unit 13 has determined the principal holding level of "color" to be "distribution", the event conversion unit 15 calculates the appearance ratio of each value of "color" (white, red, sky blue) and writes it in the notation defined by the conversion rule 105. That is, as shown in FIG. 9(B), the "color" information is rewritten to "white-red-sky blue = 43-29-29".

Similarly, the "size" information is rewritten to the distribution notation "M-L-LL = 29-43-29".
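The "distribution" rewrite can be sketched in the same way. The counts used below (3, 2, 2 for color and 2, 3, 2 for size, out of seven events) are inferred from the stated percentages, and the pairing of colors with sizes in each event is an assumption. The function name `rewrite_as_distribution` is hypothetical. Percentages are rounded to the nearest integer, which is why 43 + 29 + 29 does not sum to 100.

```python
from collections import Counter


def rewrite_as_distribution(events, item):
    """Replace `item` in every event with a "labels=percentages" notation.

    Labels are listed in order of first appearance; each percentage is the
    share of events in the group carrying that label, rounded to an integer.
    """
    counts = Counter(event[item] for event in events)  # keeps insertion order
    total = sum(counts.values())
    labels = "-".join(counts)
    shares = "-".join(str(round(100 * n / total)) for n in counts.values())
    notation = f"{labels}={shares}"
    return [{**event, item: notation} for event in events]


# Hypothetical seven-event set consistent with the stated ratios.
colors = ["white", "white", "white", "red", "red", "sky blue", "sky blue"]
sizes = ["M", "M", "L", "L", "L", "LL", "LL"]
events = [{"color": c, "size": s} for c, s in zip(colors, sizes)]

converted = rewrite_as_distribution(rewrite_as_distribution(events, "color"), "size")
```

Listing labels in order of first appearance is a design choice of this sketch; the actual ordering would be fixed by the conversion rule 105.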

The determination of information accuracy and the rewriting of information described above are performed for every event set formed from the event data group.

The event conversion unit 15 adds each event set that was extracted from the event data group and rewritten (the converted event data 2) to the converted event group for compression processing. The event conversion unit 15 then also adds to the converted event group the event data 2 that remained without being collected into any event set.

FIG. 10 is a diagram showing an example of an event data group whose information has been rewritten.

The event data group shown in FIG. 10 corresponds to the event data group shown in FIG. 2.

In the event sets G1, G2, and G3, the information of the items has been rewritten according to the corresponding principal holding level, so that it is identical within each event set.

The event data 2 in the bottom row of the event data group shown in FIG. 10 is data that the group dividing unit 11 did not collect into any event set. Event data 2 that did not meet the conditions of the identification determination table 101 and was therefore not grouped keeps its item information unchanged (at its original information accuracy).

The event data group shown in FIG. 10, which is the target of data compression, shows that statistical information on the items of interest is maintained at or above the set holding level (information accuracy) and can be read out as useful information in later analysis processing and the like.

More specifically, the event data group shown in FIG. 10 yields the following findings about sales trends: for size S of a certain product, only "pink" items were sold in the 10 o'clock hour on April 6, 2010. For size M, "pink" items were sold in the 10 o'clock hour from April 3 to April 14, 2010.

For other combinations of size and color, the information is held as statistical information. Its information accuracy is lower, but it is maintained at the set "distribution" accuracy.

On the other hand, event data 2 containing unusual information, such as the April 15 record, which a conventional method would delete or bury among other information, is maintained at its original information accuracy.

The data compression unit 17 performs data compression on the converted event group assembled by the event conversion unit 15 and generates compressed data 3. The data compression unit 17 uses a known data compression method; in this example, dictionary-based compression and entropy compression are used. Since these compression methods are known, their description is omitted.

FIG. 11 is a diagram showing a data image when the event data group shown in FIG. 10 is compressed by the dictionary-based compression method.

As shown in FIG. 11, because the information of each item has been rewritten to the same notation within each event set, the data is compressed at a higher compression ratio.
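The effect on compression ratio can be illustrated with Python's standard `zlib` module, whose DEFLATE format combines LZ77-style dictionary matching with Huffman (entropy) coding. DEFLATE is used here only as a convenient stand-in; the description does not name a specific algorithm. The synthetic 60-row event set, the CSV-like row layout, and the helper names `make_rows` and `serialize` are all assumptions made for the example.

```python
import zlib


def make_rows(n=60):
    """Build n synthetic (minute, second) pairs with varied values."""
    return [(f"{(i * 7) % 60:02d}", f"{(i * 13) % 60:02d}") for i in range(n)]


def serialize(rows):
    """Render rows as text; all items other than minute/second are identical."""
    return "".join(
        f"2010-04-06,10,{minute},{second},pink,S\n" for minute, second in rows
    ).encode("utf-8")


rows = make_rows()

# "Range" rewrite: every row in the group gets the same min-max notation.
# Zero-padded strings sort in numeric order, so plain min/max is sufficient.
minute_range = f"{min(m for m, _ in rows)}-{max(m for m, _ in rows)}"
second_range = f"{min(s for _, s in rows)}-{max(s for _, s in rows)}"
rewritten = [(minute_range, second_range)] * len(rows)

original_blob = zlib.compress(serialize(rows))
rewritten_blob = zlib.compress(serialize(rewritten))

print(len(serialize(rows)), "->", len(original_blob), "bytes (original)")
print(len(serialize(rewritten)), "->", len(rewritten_blob), "bytes (rewritten)")
```

In this sketch the rewritten text is longer than the original before compression, because "00-59" takes more characters than a two-digit value, yet it compresses to fewer bytes because every row is identical and can be encoded as back-references to the first row.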

FIG. 12 is a diagram showing an example of the processing flow of the data compression apparatus 1.

Step S10: The setting unit 19 of the data compression apparatus 1 receives, via an input device (not shown in FIG. 1), the matching weight and the holding level for each item of the event data 2, generates the setting information 107, and stores it in the storage unit 10. Alternatively, the setting unit 19 acquires existing setting information 107 and stores it in the storage unit 10.

Step S11: The group dividing unit 11 acquires the data group of the event data 2 and, based on the setting information 107, compares each piece of event data 2 with each of the other pieces of event data 2; for every item whose information matches, it assigns the matching weight set for that item. The group dividing unit 11 then calculates the sum of the assigned matching weights and calculates the matching degree (%), which is the ratio of that sum to the total of the matching weights set for all items. Based on the identification determination table 101, the group dividing unit 11 groups a set of event data 2 that has a high matching degree and a high frequency, and extracts it as an event set.
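The matching-degree calculation in step S11 can be sketched as a weighted ratio. The item names and weight values below are hypothetical, since the concrete contents of the setting information 107 are not given in this passage; the function name `matching_degree` is likewise an assumption. The grouping conditions of the identification determination table 101 (thresholds on matching degree and frequency) are not modeled here.

```python
def matching_degree(event_a, event_b, weights):
    """Return the matching degree (%) between two events.

    The degree is the sum of the weights of the items whose values match,
    divided by the sum of the weights of all items, expressed in percent.
    """
    total = sum(weights.values())
    matched = sum(
        weight
        for item, weight in weights.items()
        if event_a.get(item) == event_b.get(item)
    )
    return 100.0 * matched / total


# Hypothetical matching weights per item (stand-in for setting information 107).
weights = {"date": 2, "hour": 2, "minute": 1, "color": 3, "size": 2}

a = {"date": "2010-04-06", "hour": "10", "minute": "17", "color": "pink", "size": "S"}
b = {"date": "2010-04-06", "hour": "10", "minute": "24", "color": "pink", "size": "S"}

degree = matching_degree(a, b, weights)  # only "minute" differs: 9 of 10 weight units
```

Giving a low weight to an item such as "minute" lets events that differ only in that item still reach a high matching degree, which is what allows them to be grouped and later rewritten to a shared range.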

Step S12: For the extracted event set, the information accuracy determination unit 13 obtains the distribution of information for each item of the event data 2 and determines the principal holding level of each item based on that distribution and the holding level set for the item in the setting information 107.

Step S13: In accordance with the determined principal holding level and the conversion rule 105, the event conversion unit 15 converts the information of each item of all the event data 2 in the event set into a notation that expresses the distribution of that information, and adds the event data 2 of the converted event set to the converted event group for compression processing.

Step S14: The group dividing unit 11 checks whether the remainder of the acquired event data group still contains a set of event data 2 with a high matching degree and a high frequency. If such a set exists (Y in step S14), the process returns to step S11. If no such set exists (N in step S14), the process proceeds to step S15.

Step S15: The data compression unit 17 combines the event data 2 remaining in the acquired event data group with the event data 2 of the converted event group, performs data compression on the combined event data group, and outputs compressed data 3.
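Steps S11 to S15 can be tied together in a heavily simplified sketch. Several parts are assumptions that differ from the described apparatus: grouping is done by exact agreement on a chosen list of items rather than by weighted matching degree and the identification determination table 101; the principal holding level is passed in directly rather than derived from the information accuracy determination table 103; and `zlib` with a JSON encoding stands in for the unspecified compression method. All function and parameter names, and the contents of the April 15 record, are hypothetical.

```python
import json
import zlib
from collections import Counter, defaultdict


def convert(values, level):
    """Return the single notation shared by a group for one item."""
    if level == "range":
        return f"{min(values, key=int)}-{max(values, key=int)}"
    if level == "distribution":
        counts = Counter(values)
        total = sum(counts.values())
        shares = "-".join(str(round(100 * n / total)) for n in counts.values())
        return "-".join(counts) + "=" + shares
    raise ValueError(f"unsupported level: {level}")


def compress_events(events, group_key, levels, min_group_size=2):
    # S11/S14 (simplified): bucket events that agree on every item in group_key.
    buckets = defaultdict(list)
    for event in events:
        buckets[tuple(event[k] for k in group_key)].append(event)

    converted, leftovers = [], []
    for group in buckets.values():
        if len(group) < min_group_size:
            leftovers.extend(group)  # not grouped: kept at full accuracy
            continue
        # S12 (simplified): levels are given; items not listed stay "complete".
        # S13: rewrite each listed item to one notation shared by the group.
        shared = {
            item: convert([event[item] for event in group], level)
            for item, level in levels.items()
        }
        converted.extend({**event, **shared} for event in group)

    # S15: combine converted and remaining events, then compress.
    combined = converted + leftovers
    payload = json.dumps(combined, ensure_ascii=False).encode("utf-8")
    return combined, zlib.compress(payload)


events = [
    {"date": "2010-04-06", "hour": "10", "minute": "17", "color": "pink", "size": "S"},
    {"date": "2010-04-06", "hour": "10", "minute": "24", "color": "pink", "size": "S"},
    {"date": "2010-04-06", "hour": "10", "minute": "53", "color": "pink", "size": "S"},
    {"date": "2010-04-15", "hour": "21", "minute": "05", "color": "black", "size": "LL"},
]

rows, blob = compress_events(
    events, group_key=["date", "hour", "color", "size"], levels={"minute": "range"}
)
```

The three April 6 events end up with the shared minute notation "17-53", while the single April 15 event passes through unchanged, mirroring how ungrouped event data keeps its original information accuracy.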

The data compression apparatus 1 is realized either by a computer system comprising hardware such as a CPU and memory together with a software program, or by dedicated hardware.

FIG. 13 is a diagram showing an example of the hardware configuration of the data compression apparatus 1.

The data compression apparatus 1 can be implemented by a computer 100 having an arithmetic unit (CPU) 101, a temporary storage device (DRAM, flash memory, etc.) 102, and a persistent storage device (HDD, flash memory, etc.) 103, together with an input device (keyboard, mouse, etc.) 120 and an output device (display, printer, etc.) 130.

The data compression apparatus 1 can also be implemented by a program executable by the computer 100. In this case, a program describing the processing of the functions that the data compression apparatus 1 should have is provided. When the computer 100 executes the provided program, the processing functions of the data compression apparatus 1 described above are realized on the computer 100.

The computer 100 can also read the program directly from a portable recording medium and execute processing according to the program.

As shown in this embodiment, with the data compression apparatus 1, a holding level (information accuracy) is set for each item in the setting information 107. Data compression can therefore be performed while maintaining at least the minimum information accuracy for the items related to the information to be used, and while keeping information at a higher information accuracy where the distribution of the information allows.

That is, when the data compression apparatus 1 is applied to the data compression of the event data 2, the following effects are obtained.

(1) Data that yields a high compression ratio is collected and grouped for each item contained in the data to be compressed, so, depending on the data, compression efficiency can be improved without greatly reducing information accuracy.

(2) Data containing unusual information is compressed without being grouped, so its information accuracy is kept high.

In this embodiment, the case where the data compression apparatus 1 is applied to the data compression of event data has been described. The scope of application of the data compression apparatus 1 is not limited to this, however, and various modifications are naturally possible within the scope of the gist of the description.

DESCRIPTION OF SYMBOLS
1 Data compression apparatus
10 Storage unit
11 Group dividing unit
13 Information accuracy determination unit
15 Event conversion unit
17 Data compression unit
19 Setting unit
101 Identification determination table
103 Information accuracy determination table
105 Conversion rule
107 Setting information
2 Event data
3 Compressed data

Claims

A data compression apparatus for compressing a data group made up of data having information in a plurality of items, the apparatus comprising:
a storage unit that stores an identification determination table, in which conditions are set for collecting data into one group using the appearance frequency of the data in the data group and a matching degree calculated from the matching weights assigned to the items whose information matches between a plurality of pieces of data, and
an information accuracy determination table that defines, based on the distribution of the information of each item, a correspondence between a holding level indicating the information accuracy set for the item and a minimum principal holding level indicating the information accuracy after conversion;
a group division unit that, for each piece of data in the data group to be processed, obtains the matching degree with the other data in the data group from the matching weights, based on setting information in which the matching weight and the holding level are set for each item, and groups the data satisfying the conditions of the identification determination table;
an information accuracy determination unit that, for each group, obtains the distribution of the information appearing in each item and determines, based on the information accuracy determination table, the principal holding level of each item according to the distribution of the appearing information and the holding level in the setting information;
an event conversion unit that, for each group, converts the information of each item of the data in the group based on the principal holding level, in accordance with a conversion rule for rewriting the information of data according to the principal holding level; and
a data compression unit that performs data compression on the data group including the converted data.

The data compression apparatus according to claim 1, wherein the principal holding level includes any two or more of the levels "complete", "range", "distribution", and "deletable"; "complete" indicates an information accuracy at which the information appearing in each item is kept as it is; "range" indicates an information accuracy at which the information appearing in each item is converted into the maximum and minimum values of that information; "distribution" indicates an information accuracy at which the information appearing in each item is converted into the ratios of that information within the group; and "deletable" indicates an information accuracy at which the information appearing in each item is deleted.

The data compression apparatus according to claim 1 or 2, further comprising a setting unit that either acquires the setting information and stores it in the storage unit, or receives input of the matching weight and the holding level, generates the setting information, and stores it in the storage unit.

A data compression program for causing a computer to execute compression processing of a data group made up of data having information in a plurality of items,
the computer having a storage unit that stores an identification determination table, in which conditions are set for collecting data into one group using the appearance frequency of the data in the data group and a matching degree calculated from the matching weights assigned to the items whose information matches between a plurality of pieces of data, and
an information accuracy determination table that defines, based on the distribution of the information of each item, a correspondence between a holding level indicating the information accuracy set for the item and a minimum principal holding level indicating the information accuracy after conversion,
the program causing the computer to execute:
a process of obtaining, for each piece of data in the data group to be processed, the matching degree with the other data in the data group from the matching weights, based on setting information in which the matching weight and the holding level are set for each item, and grouping the data satisfying the conditions of the identification determination table;
a process of obtaining, for each group, the distribution of the information appearing in each item and determining, based on the information accuracy determination table, the principal holding level of each item according to the distribution of the appearing information and the holding level in the setting information;
a process of converting, for each group, the information of each item of the data in the group based on the principal holding level, in accordance with a conversion rule for rewriting the information of data according to the principal holding level; and
a process of performing data compression on the data group including the converted data.

A data compression method in which a computer compresses a data group made up of data having information in a plurality of items,
the computer having a storage unit that stores
an identification determination table, in which conditions are set for collecting data into one group using the appearance frequency of the data in the data group and a matching degree calculated from the matching weights assigned to the items whose information matches between a plurality of pieces of data, and
an information accuracy determination table that defines, based on the distribution of the information of each item, a correspondence between a holding level indicating the information accuracy set for the item and a minimum principal holding level indicating the information accuracy after conversion,
the method comprising: a processing step of obtaining, for each piece of data in the data group to be processed, the matching degree with the other data in the data group from the matching weights, based on setting information in which the matching weight and the holding level are set for each item, and grouping the data satisfying the conditions of the identification determination table;
a processing step of obtaining, for each group, the distribution of the information appearing in each item and determining, based on the information accuracy determination table, the principal holding level of each item according to the distribution of the appearing information and the holding level in the setting information;
a processing step of converting, for each group, the information of each item of the data in the group based on the principal holding level, in accordance with a conversion rule for rewriting the information of data according to the principal holding level; and
a processing step of performing data compression on the data group including the converted data.