CN100412863C - Huge amount of data compacting storage method and implementation apparatus therefor - Google Patents

Huge amount of data compacting storage method and implementation apparatus therefor

Publication number
CN100412863C
Authority
CN
China
Prior art keywords
data
storage
values
step
recorded
Prior art date
Application number
CN 200510089094
Other languages
Chinese (zh)
Other versions
CN1908932A (en)
Inventor
任永杰
冯玉
周庆庆
杜小勇
王珊
Original Assignee
北京人大金仓信息技术有限公司;王 珊;杜小勇;任永杰;周庆庆;冯 玉
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京人大金仓信息技术有限公司;王 珊;杜小勇;任永杰;周庆庆;冯 玉
Priority to CN 200510089094
Publication of CN1908932A
Application granted
Publication of CN100412863C

Abstract

The present invention is a mass data compaction storage method and an apparatus for carrying it out. The method comprises: a storage policy selection step, in which the CPU type is detected and a storage policy matching that CPU type is selected for data storage; an in-page redundant data compaction step, in which one record carrying complete information is kept within a page and the values of the other records refer to the values of that record; a composite index compaction step, in which the differences between index record values are recorded incrementally; a step of distinguishing historical data from active data, in which the header version-control data attached by the multi-version concurrency control protocol is removed from historical data and dynamic conversion between historical data and active data is supported; and a re-encoding step, in which, within a database storing mass data, the data values of ordinary relational data are re-encoded according to their differing frequencies and operations are performed on them without decoding. The invention can improve query efficiency and reduce storage space.

Description

Mass data compaction storage method and apparatus for carrying it out

Technical Field

The present invention relates to data storage technology, and in particular to a technique for storing mass data in a database management system (DBMS) that uses a multi-version concurrency control (MVCC) protocol. Specifically, it is a mass data compaction storage method and an apparatus for carrying it out.

Background Art

The database storage subsystem is an important part of a database system. It is responsible for keeping data on storage media and for providing data access to the query processing subsystem. A storage method is the format in which data is laid out on persistent media (such as disk or tape) and the way it is held in volatile media (such as ordinary memory). The usual storage methods are as follows.

1. Conservative storage. This is the traditional approach [1]. Because some CPUs require data in memory to be aligned according to its data type, this method accommodates those CPUs by storing data in aligned form both on disk and in memory. CPU efficiency is best in this case, but the stored data undoubtedly grows, which increases the amount of I/O.

2. Compact storage. Many systems use this strategy depending on the characteristics of the CPU [2]. If the CPU has no memory alignment requirement, as with Intel's 32-bit CPU family, the DBMS can store data unaligned on the storage medium. This shrinks the space occupied by the data, thereby reducing the amount of I/O and improving query efficiency. The approach places some extra burden on the CPU, because the alignment work is in effect done inside the CPU.

3. Compressed storage. Some systems save storage space by compressing part of the data [3]. Compressing the data requires a CPU-intensive LZ algorithm or another compression algorithm. Because compressed data loses its data format, CPU time must be spent again on decompression when the data is handed to the query subsystem.

Method 1 has the advantage in CPU-bound applications, but in I/O-bound applications running on a CPU that does not require alignment, method 2 has the advantage. Method 1 also wastes storage. Method 3 offers some benefit in applications where the CPU has enough headroom. None of these systems, however, optimizes according to the semantic characteristics of the data.
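The size difference between the conservative (aligned) and compact (unaligned) layouts can be illustrated with a small sketch. The record layout below is hypothetical and chosen only for illustration; it is not taken from the patent. Python's standard `struct` module is used because its `@` prefix applies the platform's native alignment while `=` applies no padding.

```python
import struct

# Hypothetical record layout (not from the patent): a 1-byte flag,
# a 4-byte integer, a 1-byte status code and an 8-byte float.
FIELDS = "bibd"

# "@" uses native sizes with native alignment, so padding bytes may be added.
aligned_size = struct.calcsize("@" + FIELDS)
# "=" uses standard sizes with no padding between fields.
packed_size = struct.calcsize("=" + FIELDS)


def page_capacity(page_size: int, record_size: int) -> int:
    """Number of fixed-size records that fit in one page of page_size bytes."""
    return page_size // record_size


print("aligned record size:", aligned_size)
print("packed record size:", packed_size)
print("records per 8192-byte page (aligned):", page_capacity(8192, aligned_size))
print("records per 8192-byte page (packed):", page_capacity(8192, packed_size))
```

The aligned size depends on the platform, which is exactly the trade-off the background describes: fewer padding bytes mean more records per page and less I/O, at the cost of unaligned access.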
Summary of the Invention

The object of the present invention is to provide a mass data compaction storage method and an apparatus for carrying it out, which address the problems and shortcomings of the existing storage methods described above, improve query efficiency, and reduce storage space.

The technical solution of the invention is a mass data compaction storage method comprising the following steps:

a storage policy selection step, in which the CPU type is detected and a storage policy matching that CPU type is selected for data storage;
an in-page redundant data compaction step, in which one record carrying complete information is kept within a page and the values of the other records refer to the values of that record;
a composite index compaction step, in which the differences between index record values are recorded incrementally;
a step of distinguishing historical data from active data, in which the header version-control data attached by the multi-version concurrency control (MVCC) protocol is removed from historical data and dynamic conversion between historical data and active data is supported;
a re-encoding step, in which, within a database storing mass data, the data values of ordinary relational data are re-encoded according to their differing frequencies and operations are performed on them without decoding.

In the in-page redundant data compaction step, the frequency with which duplicate values occur within a page is evaluated first. Once it reaches a predetermined frequency, one record is chosen to hold its values in full, and the values of the other records then refer to the values of that record.

In the re-encoding step, data values (numbers, characters, or strings) are re-encoded inside the database according to how frequently they occur in the database: high-frequency values receive short codes and low-frequency values receive long codes.

The invention also provides a mass data compaction storage apparatus comprising:

a storage policy selection unit, which detects the CPU type and invokes the storage policy matching that CPU type for data storage;
a redundant data compaction unit, which determines the frequency of duplicate values within a page and, once a predetermined frequency is reached, selects one record, records its values in full, and makes the values of the other records refer to the values of that record;
a historical data compaction unit, which removes from historical data the header control data attached by the multi-version concurrency control (MVCC) protocol and supports dynamic conversion between historical data and active data;
a re-encoding unit, which, within a database storing mass data, re-encodes the data values of ordinary relational data according to their differing frequencies and performs operations on them without decoding;
a composite index compaction unit, which records the differences between index record values incrementally.

The beneficial effects of the invention are as follows. I/O efficiency is improved: because the stored data is much smaller, fewer pages need to be read to access the same amount of data. Less CPU is consumed than with ordinary compression: the invention exploits the redundancy in the data through re-encoding and reference to adjacent tuple values, so it obtains the benefit of reduced data volume without a conventional compression algorithm, and most computations do not need to unpack the data. Storage space is saved because redundancy is reduced.
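The frequency-based re-encoding step can be sketched as follows. The patent does not specify the code itself, so everything here is an assumption for illustration: values are ranked by frequency, the rank becomes the code, and ranks are written as variable-length integers so that frequent values take fewer bytes. The function names are hypothetical.

```python
from collections import Counter


def build_codebook(values):
    """Map each distinct value to an integer code; code 0 is the most frequent."""
    counts = Counter(values)
    # Ties are broken by the string form of the value so the result is stable.
    ranked = sorted(counts, key=lambda v: (-counts[v], str(v)))
    return {value: code for code, value in enumerate(ranked)}


def encode_varint(code):
    """Encode a non-negative integer using 7 bits per byte (small codes are short)."""
    out = bytearray()
    while True:
        byte = code & 0x7F
        code >>= 7
        if code:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_column(values, codebook):
    """Replace every value in a column by its variable-length code."""
    return [encode_varint(codebook[v]) for v in values]


def count_equal(encoded, codebook, target):
    """Evaluate an equality predicate on the encoded column without decoding it."""
    needle = encode_varint(codebook[target])
    return sum(1 for item in encoded if item == needle)


column = ["Beijing"] * 5 + ["Shanghai"] * 2 + ["Lhasa"]
codebook = build_codebook(column)
encoded = encode_column(column, codebook)
print(codebook)
print(count_equal(encoded, codebook, "Shanghai"))
```

Only the constant in the predicate is encoded; the stored column is compared in its encoded form, which is one simple way to operate without decoding.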
Brief Description of the Drawings

Fig. 1 is a structural diagram of the Kingbase6 execution engine;
Fig. 2 is a flowchart of detecting the CPU and selecting a storage policy;
Fig. 3 is a flowchart of in-page redundant data compaction;
Fig. 4 is a flowchart of re-encoding;
Fig. 5 is a flowchart of distinguishing historical data from active data;
Fig. 6 is a structural block diagram of the apparatus of the invention.

Detailed Description

Specific embodiments of the invention are described below with reference to the drawings.

To address the problems and shortcomings of the existing storage methods described above, the invention proposes, for database systems that use the MVCC concurrency protocol, a data compaction storage method that improves query efficiency and reduces storage space. As shown in Fig. 1, the KingbaseES6 database execution engine 101 executes a database cleanup command and analyzes the data with a data semantics analysis module. The method comprises:

a storage policy selection step, in which the CPU type is detected and a storage policy matching that CPU type is selected for data storage.

Once the storage policy has been determined, the following steps are carried out:

an in-page redundant data compaction step, in which one record carrying complete information is kept within a page and the other records refer to that record;
a step of distinguishing historical data from active data, in which the header control data attached by the multi-version concurrency control protocol is removed from historical data and dynamic conversion between historical data and active data is supported;
a re-encoding step, in which data values are re-encoded within the database according to their differing frequencies and operations are performed on them without decoding;
a composite index compaction step, in which the differences between index records are recorded incrementally.

The method covers two aspects: compact storage of heap data and compact storage of index data. Here, heap means ordinary relational data, and index means B+tree or other index data.
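The in-page redundant data compaction step listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented page format: the reference record is taken to be the record sharing the most field values with the other records in the page, the other records keep only the fields that differ from it, and the threshold `min_dup_ratio` stands in for the "predetermined frequency". All names are hypothetical.

```python
def compact_page(records, min_dup_ratio=0.5):
    """Return (ref_index, reference, deltas), or None if duplication is too low.

    records is a list of equal-length tuples. deltas holds None for the
    reference record and {column: value} for fields that differ from it.
    """
    if len(records) < 2:
        return None

    def shared(i):
        # Count fields of the other records that equal the same field of record i.
        return sum(a == b
                   for j, rec in enumerate(records) if j != i
                   for a, b in zip(rec, records[i]))

    ref_index = max(range(len(records)), key=shared)
    reference = records[ref_index]
    total = (len(records) - 1) * len(reference)
    if total == 0 or shared(ref_index) / total < min_dup_ratio:
        return None  # not enough duplicate values to be worth compacting

    deltas = [None if j == ref_index
              else {c: v for c, v in enumerate(rec) if v != reference[c]}
              for j, rec in enumerate(records)]
    return ref_index, reference, deltas


def expand_page(reference, deltas):
    """Rebuild the full records from the reference record and the deltas."""
    return [tuple(reference) if d is None
            else tuple(d.get(c, v) for c, v in enumerate(reference))
            for d in deltas]


page = [("CN", "Beijing", 1), ("CN", "Beijing", 2),
        ("CN", "Shanghai", 3), ("CN", "Beijing", 4)]
result = compact_page(page)
```

Because each record keeps its position and its field structure, a reader can rebuild any single record from the reference without touching the rest of the page.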
Each formatted data item is called a tuple (or record). All tuples are stored in pages; a page is a fixed-size block of a file. Because this method uses far less CPU than ordinary methods based on compression algorithms, preserves the record structure, and allows data operations without unpacking, we call it data compaction to distinguish it from compression.

We first describe compact storage of heap data.

Platform-specific storage. For different platforms, including IA32 and SPARC, we combine the conservative storage policy with the compact policy. The choice can be specified by the user: for a CPU-bound application the user can specify the conservative policy; otherwise the compact policy can be used.

In-page compaction. When a large number of tuples within a page have duplicate values, we can reduce storage volume and I/O by reducing that redundancy. Relying on the fact that, under non-overwriting storage, the MVCC protocol does not delete historical data, we first evaluate how frequently duplicate values occur within a page. Once a certain frequency is reached, we pick one record to hold its values in full, and the other records refer to it. Since records are not deleted, this incurs no maintenance cost at all.

Distinguishing historical and active data. To mark the different versions of a record in heap data, MVCC attaches a record header of varying size to the head of each record. In a fairly large database this brings considerable storage overhead and lowers I/O efficiency. In practice, the overwhelming majority of the data in a database is observed to be historical data, that is, data produced by transactions that have already committed, whereas active data is produced by transactions still in progress. The control information for historical data can be represented very briefly, which reduces the size of the database. Because historical and active are relative notions and data must convert between the two, we implemented this automatic conversion together with our implementation of rollback segments.

Re-encoding for overall compaction. Because data values (numbers, characters, or strings) occur in the database with different frequencies, we can re-encode data values inside the database, using short codes for high-frequency values and long codes for low-frequency ones. For records with many repeated patterns this greatly reduces their length. Compared with applying an ordinary compression algorithm directly to the whole tuple, this has three advantages: (i) the tuple structure is preserved; (ii) the encoding is a linear transformation; (iii) it consumes less CPU. It still uses more CPU than the other methods described above, however, so we adopt three measures to improve it. (i) Selective processing: we perform re-encoding only when its predicted benefit reaches a certain level. (ii) Deferred unpacking: because our encoding is a linear transformation, data can usually be operated on without unpacking, and we unpack only when it cannot be avoided. (iii) Client-side unpacking: the user can choose whether unpacking takes place on the client or on the server.

We next describe compact storage of index data.

Optimizing composite indexes. Beyond the storage policy selection step and the in-page redundant data compaction step used for heap data, we optimize storage specifically for composite indexes. We observe that in a composite index the data in the leading columns (called the prefix), that is, all columns but the last, is almost always the same and is kept in order. We can therefore reduce redundancy in a way similar to the compaction of heap records described above. In addition, because index data is ordered, we record the differences between index records incrementally; heap data does not use this technique (because the degree of compaction it would yield there is not high enough). Indexes do not use the re-encoding step applied to heap data, because such a method is rarely applicable in an index.

The apparatus of the invention has been implemented in the KingbaseES 6 system.
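The incremental recording of composite index entries described above can be sketched as a prefix-delta scheme. This is an assumption about one reasonable realization, not the format used in KingbaseES: each sorted key stores how many leading columns it shares with the previous key, followed by the remaining columns. The function names are hypothetical.

```python
def delta_encode(sorted_keys):
    """Encode sorted composite keys as (shared_prefix_length, remaining_columns)."""
    encoded = []
    prev = ()
    for key in sorted_keys:
        shared = 0
        # Count the leading columns that are identical to the previous key.
        while (shared < len(prev) and shared < len(key)
               and prev[shared] == key[shared]):
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def delta_decode(encoded):
    """Rebuild the full keys by reusing the prefix of the previous key."""
    keys = []
    prev = ()
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


index_keys = [("CN", "Beijing", 1), ("CN", "Beijing", 7),
              ("CN", "Shanghai", 2), ("US", "Boston", 5)]
encoded_keys = delta_encode(index_keys)
print(encoded_keys)
```

The scheme depends on the keys being sorted, which is why the text applies it to index data and not to heap data.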
That is, a mass data compaction storage apparatus (shown in Fig. 6) comprises:

a storage policy selection unit, which detects the CPU type and invokes the storage policy matching that CPU type for data storage;
a redundant data compaction unit, which determines the frequency of duplicate values within a page and, once a predetermined frequency is reached, selects one record, records its values in full, and makes the other records refer to that record;
a historical data compaction unit, which removes from historical data the header control data attached by the multi-version concurrency control protocol and supports dynamic conversion between historical data and active data;
a re-encoding unit, which re-encodes data values within the database according to their differing frequencies, such that most operations can be performed without decoding;
a composite index compaction unit, which records the differences between index records incrementally.

As shown in Fig. 2, when the database is about to be initialized, the storage policy selection unit is executed. The installation script determines the CPU type, taking the user's requirements into account, and checks whether the CPU can handle unaligned data. If it can, data is stored unaligned; if it cannot, data is stored aligned. Database initialization then proceeds.
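The Fig. 2 decision can be sketched as below. The set of architecture names treated as tolerating unaligned access is an assumption made for this illustration (the text only names IA32 as not requiring alignment and SPARC as another supported platform), and `choose_storage_policy` is a hypothetical helper, not part of any installation script.

```python
import platform

# Assumption for illustration: machine names treated as able to handle
# unaligned data. A real installation script would need a vetted list.
UNALIGNED_OK = {"i386", "i686", "x86", "x86_64", "amd64"}


def choose_storage_policy(machine=None, prefer_conservative=False):
    """Return "aligned" (conservative) or "packed" (compact).

    machine defaults to the name reported by platform.machine().
    prefer_conservative models a user who asks for the conservative
    policy, for example because the workload is CPU-bound.
    """
    machine = (machine or platform.machine()).lower()
    if machine not in UNALIGNED_OK:
        return "aligned"  # the CPU cannot handle unaligned data
    return "aligned" if prefer_conservative else "packed"


print(choose_storage_policy())
```

An unknown or empty machine name falls back to the aligned policy, which is the safe choice when the CPU's behavior is not known.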
As shown in Fig. 3, when data is about to be inserted, the redundant data compaction unit is executed. For a new record to be inserted, it checks whether the page is full. If the page is full, it analyzes the data redundancy within the page, selects the best reference record for this round, and reorganizes the page. If this frees space, insertion continues on that page; if it frees no space, insertion moves to the next page.

As shown in Fig. 4, when re-encoding is about to be performed, the re-encoding unit is executed. A number of pages are sampled in proportion to the size of the relation and a value distribution is computed. Several encoding schemes are selected according to the distribution and tried on the sample pages. Using the SQL log recorded while user queries run, the unit decides whether re-encoding would benefit user queries: if so, re-encoding is carried out; otherwise the process ends. This check is made because the method may cause user queries to use more CPU; if the SQL log indicates a CPU bottleneck, this compaction method need not be used.

As shown in Fig. 5, when historical data is about to be separated out, the historical data compaction unit is executed. For a target relation A, a new relation B with the same structure is created. A tuple is read from A and checked to see whether it is known to have committed or known to have rolled back. If so, its header information is removed, a flag bit is set, and it is copied to relation B; if not, it is copied to relation B directly. The unit then checks whether relation A has more tuples: if so, the steps above are repeated; if not, relation A is replaced by relation B and the process ends.

Data processed by the units above is stored in the data dictionary and/or the business database for subsequent use.

In the concrete storage layout of the database, the page format is shown in Fig. 7; it implements the in-page redundant data compaction step and the composite index compaction step. The control format of each tuple is shown in Fig. 8; it implements the storage policy selection step, the step of distinguishing historical data from active data, and the re-encoding step. A "row information bits" field indicates whether the row is historical data and which compaction scheme the data uses (in conjunction with the dictionary information).

With this method, for typical OLTP applications (such as the TPCC test), the measured average data space usage decreases by 30% to 40%. In I/O-bound applications, database performance improves by 20% to 30%. Fig. 1 is a schematic of the method within the core execution engine of the KingbaseES 6 system.
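The Fig. 5 conversion from relation A to relation B can be sketched as follows. The tuple and header types are hypothetical simplifications made for this illustration: a real MVCC header carries more than one transaction id, and the patent does not give its layout. `finished_txns` stands for the set of transactions known to have committed or known to have rolled back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MvccHeader:
    """Hypothetical stand-in for the version-control header of a tuple."""
    txn_id: int


@dataclass
class HeapTuple:
    values: tuple
    header: Optional[MvccHeader]
    is_historical: bool = False  # plays the role of the flag bit


def separate_history(relation_a, finished_txns):
    """Build relation B from relation A, stripping headers of settled tuples."""
    relation_b = []
    for tup in relation_a:
        if tup.header is not None and tup.header.txn_id in finished_txns:
            # Settled tuple: drop the header and set the flag bit.
            relation_b.append(HeapTuple(tup.values, None, True))
        else:
            # Still active (or already converted): copy unchanged.
            relation_b.append(tup)
    return relation_b


relation_a = [HeapTuple((1, "a"), MvccHeader(10)),
              HeapTuple((2, "b"), MvccHeader(11))]
relation_b = separate_history(relation_a, finished_txns={10})
print(relation_b)
```

After the loop, the caller would replace relation A with relation B, as the text describes. The reverse conversion from historical back to active data is not shown here.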
Specifically, the invention has the following advantages:

1. I/O efficiency is improved: because the stored data is much smaller, fewer pages need to be read to access the same amount of data.
2. Less CPU is consumed than with ordinary compression: our method exploits the redundancy in the data through re-encoding and reference to adjacent tuple values, so it obtains the benefit of reduced data volume without a conventional compression algorithm, and most computations do not need to unpack the data.
3. Storage space is saved because redundancy is reduced.

The specific embodiments above serve only to illustrate the invention and are not intended to limit it.

Notes:
[1] See Database System Implementation, H. Garcia-Molina, J. D. Ullman, J. Widom, Prentice Hall.
[2] For example SQL Server 2000; see Microsoft's SQL Server 2000 Books Online.
[3] For example PostgreSQL; see the ALTER TABLE SET STORAGE clause in the PostgreSQL database documentation.

Claims (7)

1. A mass data compaction storage method, characterized by comprising the following steps: a storage policy selection step, in which the CPU type is detected and a storage policy matching said CPU type is selected for data storage; an in-page redundant data compaction step, in which one record carrying complete information is kept within a page and the values of the other records refer to the values of that record; a composite index compaction step, in which the differences between index record values are recorded incrementally; a step of distinguishing historical data from active data, in which the header control data attached by the multi-version concurrency control protocol is removed from historical data and dynamic conversion between historical data and active data is supported; and a re-encoding step, in which, within a database storing mass data, the data values of ordinary relational data are re-encoded according to their differing frequencies and operations are performed on them without decoding.
2. The method according to claim 1, characterized by further comprising: if the above steps yield processed data, storing the data in the database.
3. The method according to claim 1, characterized in that, by detecting the CPU type, the compact storage policy is selected for storing the data.
4. The method according to claim 1, characterized in that, in said in-page redundant data compaction step, the frequency with which duplicate values occur within a page is evaluated first; once a predetermined frequency is reached, one record is chosen to hold its values in full, and the values of the other records then refer to the values of that record.
5. The method according to claim 1, characterized in that, in said re-encoding step, data values are re-encoded inside the database according to the frequency with which they occur in the database, high-frequency data values receiving short codes and low-frequency data values receiving long codes.
6. A mass data compaction storage apparatus, characterized by comprising: a storage policy selection unit, which detects the CPU type and invokes the storage policy matching said CPU type for data storage; a redundant data compaction unit, which determines the frequency of duplicate values within a page and, once a predetermined frequency is reached, selects one record, records its values in full, and makes the values of the other records refer to the values of that record; a composite index compaction unit, which records the differences between index record values incrementally; a historical data compaction unit, which removes from historical data the header control data attached by the multi-version concurrency control protocol and supports dynamic conversion between historical data and active data; and a re-encoding unit, which, within a database storing mass data, re-encodes the data values of ordinary relational data according to their differing frequencies and performs operations on them without decoding.
7. The apparatus according to claim 6, characterized by further comprising: storing the data processed by the apparatus in the database.
CN 200510089094 2005-08-05 2005-08-05 Huge amount of data compacting storage method and implementation apparatus therefor CN100412863C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510089094 CN100412863C (en) 2005-08-05 2005-08-05 Huge amount of data compacting storage method and implementation apparatus therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510089094 CN100412863C (en) 2005-08-05 2005-08-05 Huge amount of data compacting storage method and implementation apparatus therefor

Publications (2)

Publication Number Publication Date
CN1908932A CN1908932A (en) 2007-02-07
CN100412863C CN100412863C (en) 2008-08-20

Family

ID=37700043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510089094 CN100412863C (en) 2005-08-05 2005-08-05 Huge amount of data compacting storage method and implementation apparatus therefor

Country Status (1)

Country Link
CN (1) CN100412863C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023978B (en) * 2009-09-15 2015-04-15 腾讯科技(深圳)有限公司 Mass data processing method and system
CN101794299B (en) 2010-01-27 2012-03-28 浪潮(山东)电子信息有限公司 Method for increment definition and processing of historical data management
CN102111617B (en) * 2010-12-15 2012-07-11 广州市动景计算机科技有限公司 Streaming media decoding method and device
CN103631774B (en) * 2012-08-20 2018-03-20 腾讯科技(深圳)有限公司 Method and system for data storage
CN103412864B (en) * 2013-06-06 2017-04-05 莱诺斯科技(北京)股份有限公司 A data compression method for storing

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1167951A (en) 1996-01-31 1997-12-17 株式会社日立制作所 Method of and apparatus for compressing and expanding data and data processing apparatus and network system using same
CN1459743A (en) 2002-05-24 2003-12-03 中国科学院软件研究所 Self adapting history data compression method

Also Published As

Publication number Publication date
CN1908932A (en) 2007-02-07

Similar Documents

Publication Publication Date Title
Tsirogiannis et al. Query processing techniques for solid state drives
US6208993B1 (en) Method for organizing directories
JP5445682B2 (en) Storage system
US7308456B2 (en) Method and apparatus for building one or more indexes on data concurrent with manipulation of data
EP0877327B1 (en) Method and apparatus for performing a join query in a database system
US6175835B1 (en) Layered index with a basic unbalanced partitioned index that allows a balanced structure of blocks
US6721749B1 (en) Populating a data warehouse using a pipeline approach
JP5466232B2 (en) Coding efficient column based data for large data storage
US8225029B2 (en) Data storage processing method, data searching method and devices thereof
JP4522170B2 (en) Relational database indexes additional program, the index adding unit and index adding method
Hoffer et al. The use of cluster analysis in physical data base design
US20090292947A1 (en) Cascading index compression
US7024414B2 (en) Storage of row-column data
CN102171680B (en) Efficient large-scale filtering and/or sorting for querying of column based data encoded structures
US5918225A (en) SQL-based database system with improved indexing methodology
JP5271359B2 (en) Multidimensional database architecture
JP4907600B2 (en) Block compression of tables with repeated values
CN1085863C (en) Memory management system of computer system
CN101937448B (en) A main memory storage device based on a column sequence of order-preserving dictionary compression
US20080059492A1 (en) Systems, methods, and storage structures for cached databases
CN101436207B (en) Data restoring and synchronizing method based on log snapshot
US20120254253A1 (en) Disk-Resident Streaming Dictionary
US20050256908A1 (en) Transportable database
CN102918494B (en) Agnostic model based on the database, and Platform agnostic data storing and accessing data storage workload agnostic models and / or retrieval systems and methods
US8255398B2 (en) Compression of sorted value indexes using common prefixes

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C56 Change in the name or address of the patentee

Owner name: BEIJING RENDA JINCANG INFORMATION TECHNOLOGY CO.,

Free format text: FORMER NAME OR ADDRESS: BEIJING RENDA JINCANG INFORMATION TECHNOLOGY CO.LTD.; WANG SHAN; DU XIAOYONG; REN YONGJIE ADDRESS

C56 Change in the name or address of the patentee

Owner name: BEIJING RENDA JINCANG INFORMATION TECHNOLOGY CO.,

Free format text: FORMER NAME OR ADDRESS: BEIJING RENDA JINCANG INFORMATION TECHNOLOGY CO., LTD.; WANG SHAN; DU XIAOYONG; REN YONGJIE

CF01 Termination of patent right due to non-payment of annual fee