WO2019128409A1 - Data compression and storage method and data compression and storage device - Google Patents

Data compression and storage method and data compression and storage device

Info

Publication number
WO2019128409A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
compression
fields
compressed
different
Prior art date
Application number
PCT/CN2018/111180
Other languages
English (en)
Chinese (zh)
Inventor
何东杰
Original Assignee
中国银联股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国银联股份有限公司
Publication of WO2019128409A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present invention relates to data processing technologies, and in particular, to a data compression storage method and a data compression storage device.
  • the compression tools that enterprises currently use to store data are general-purpose tools applied uniformly to all data.
  • because the characteristics of enterprise data are not fully considered, the compression efficiency achieved on that data is not very high.
  • the present invention is directed to a further data compression storage method and data compression storage device.
  • the compression step compresses different fields using different compression policies based on their data content, and stores the compressed data.
  • in the compression step, the strength of the association relationship between fields is determined, and the compression policy is set according to that strength.
  • said compressing step comprises the following substeps:
  • said compressing step comprises the following substeps:
  • multiple fields between which a correlation exists are combined, and the combined fields are compressed and stored using different compression policies for different data content.
  • said compressing step comprises the following substeps:
  • enumerated string fields are stored in binary encoding, and strings holding numeric values are converted to integer or floating-point types for compressed storage.
  • short fields are combined and then compressed; for approximate fields, the redundant information is compressed before storage; for fields that are the reverse order of one another, only one of them is stored; and where an inclusion relationship exists between fields, the redundant information is compressed before storage.
  • enumerated string fields are stored in binary encoding, and strings holding numeric values are converted to integer or floating-point types for compressed storage; in addition, short fields are combined and then compressed; for approximate fields, the redundant information is compressed before storage; for fields that are the reverse order of one another, only one of them is stored; and where an inclusion relationship exists between fields, the redundant information is compressed before storage.
  • the mapping relationship storage step establishes a mapping relationship between the original data and the compressed data and stores the relationship.
  • a segmentation module for splitting raw data into multiple fields
  • the compression module is configured to compress different fields using different compression policies according to their data content, and to store the compressed data.
  • the compression module determines the strength of the association relationship between fields and sets the compression policy according to that strength.
  • the mapping relationship storage module is configured to establish a mapping relationship between the original data and the compressed data and store the relationship.
  • the compression module has the following sub-modules:
  • the content analysis sub-module performs content analysis on data cut into multiple fields, and establishes an association relationship between the fields
  • the compression sub-module, for a single field, uses different compression policies for different data content for compressed storage.
  • the compression module is provided with:
  • the content analysis sub-module performs content analysis on data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation relationship between the data fields based on the data distribution graph and the association relationship graph;
  • the compression sub-module combines multiple fields with related relationships, and uses different compression policies for different data contents for compressed storage for the combined fields.
  • the compression module is provided with:
  • the content analysis sub-module performs content analysis on data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation relationship between the data fields based on the data distribution graph and the association relationship graph;
  • the compression sub-module, for a single field, uses different compression policies for different data content for compressed storage, and also combines multiple correlated fields and compresses and stores the combined fields using different compression policies for different data content.
  • the computer readable medium of the present invention has stored thereon a computer program, characterized in that the computer program is executed by a processor to implement the data compression storage method described above.
  • the computer device of the present invention comprises a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor executes the computer program to implement the steps of the data compression storage method described above.
  • an efficient data compression scheme tailored to the characteristics of enterprise data is proposed.
  • a corresponding efficient compression algorithm is used for each kind of data field, thereby improving the efficiency of data compression.
  • compared with general data compression tools such as GZIP and SNAPPY, there is a significant improvement in data compression rate.
  • FIG. 1 is a flow chart showing a data compression storage method of the present invention.
  • FIG. 2 is a block diagram showing the structure of a data compression storage device of the present invention.
  • the main idea of the data compression storage method and the data compression storage device of the present invention is to analyze the enterprise data content, establish a data distribution map, and, by combining the data content with the data distribution, adopt a corresponding optimized compression algorithm for each kind of information.
  • enumerated string fields are compressed using binary codes, and strings that store numbers are converted into integer or floating-point types.
  • associated fields in the data content can be combined and then compressed.
  • some fields are a combination of other fields, so only one copy of the data needs to be stored; other fields are the reverse order of another field, so again only one of them is stored.
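The redundancy idea in the preceding paragraphs (fields that are a combination of other fields, or the reverse order of another field) can be sketched as follows. This is a minimal illustration and not the patent's implementation: the function names, the `Rule` tuple layout, and the sample field names (`bank`, `branch`, `account_prefix`, `serial`, `serial_rev`) are all hypothetical, and "combination" is assumed here to mean plain string concatenation of two fields.

```python
from itertools import permutations
from typing import Dict, List, Optional, Set, Tuple

Row = Dict[str, str]
Rule = Tuple[str, ...]  # ("reverse", source) or ("concat", left, right)


def find_derived_fields(rows: List[Row]) -> Dict[str, Rule]:
    """Map each redundant field to a rule that rebuilds it from kept fields."""
    fields = list(rows[0]) if rows else []
    derived: Dict[str, Rule] = {}
    pinned: Set[str] = set()  # fields that some rule depends on; always kept
    for target in fields:
        if target in pinned:
            continue
        sources = [f for f in fields if f != target and f not in derived]
        rule: Optional[Rule] = None
        # Rule 1: target is the character-reversed copy of another field.
        for a in sources:
            if all(r[target] == r[a][::-1] for r in rows):
                rule = ("reverse", a)
                break
        # Rule 2: target is the concatenation of two other fields.
        if rule is None:
            for a, b in permutations(sources, 2):
                if all(r[target] == r[a] + r[b] for r in rows):
                    rule = ("concat", a, b)
                    break
        if rule is not None:
            derived[target] = rule
            pinned.update(rule[1:])
    return derived


def strip_derived(rows: List[Row], derived: Dict[str, Rule]) -> List[Row]:
    """Drop redundant fields so that only one copy of the information is stored."""
    return [{k: v for k, v in r.items() if k not in derived} for r in rows]


def restore_row(row: Row, derived: Dict[str, Rule]) -> Row:
    """Rebuild the dropped fields from the kept ones."""
    out = dict(row)
    for target, rule in derived.items():
        if rule[0] == "reverse":
            out[target] = out[rule[1]][::-1]
        else:
            out[target] = out[rule[1]] + out[rule[2]]
    return out


# Hypothetical sample data.
ROWS = [
    {"bank": "ICBC", "branch": "0102", "account_prefix": "ICBC0102",
     "serial": "12345", "serial_rev": "54321"},
    {"bank": "ABC", "branch": "0771", "account_prefix": "ABC0771",
     "serial": "90817", "serial_rev": "71809"},
]
DERIVED = find_derived_fields(ROWS)
STORED = strip_derived(ROWS, DERIVED)
```

The rules in `DERIVED` play the role of metadata: they must be kept alongside `STORED` so that the dropped fields can be rebuilt on read.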
  • FIG. 1 is a flow chart showing a data compression storage method of the present invention.
  • the data compression storage method of the present invention includes the following steps:
  • segmentation step S100: splitting the original data into a plurality of fields;
  • compression step S200: compressing different fields using different optimized compression policies according to their data content and storing the compressed data, wherein the strength of the association relationship between fields is determined and the compression policy is set according to that strength.
  • by analyzing the data content, the relationships among the fields of the data table are established, and the corresponding optimized compression algorithm is then applied, which increases the data compression rate.
  • a mapping relationship storing step S300 may further be provided, in which the mapping relationship between the original data and the compressed data is established in the metadata and stored, so that when the data is accessed externally, the original data can be parsed out successfully according to the mapping relationship.
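The three steps S100 to S300 can be sketched end to end as below. This is a minimal illustration under stated assumptions, not the method as claimed: the `|` separator, the `POLICIES` table, the two policy names `int` and `zlib`, and the per-value policy choice are hypothetical simplifications (the text describes policies chosen per field from content analysis). The fixed 8-byte integer encoding is for clarity only and is not necessarily smaller than the original text.

```python
import zlib
from typing import Callable, Dict, List, Tuple

Codec = Tuple[Callable[[str], bytes], Callable[[bytes], str]]

# Hypothetical policy table: policy name -> (encode, decode) for one field.
POLICIES: Dict[str, Codec] = {
    "int": (lambda s: int(s).to_bytes(8, "big", signed=True),
            lambda b: str(int.from_bytes(b, "big", signed=True))),
    "zlib": (lambda s: zlib.compress(s.encode("utf-8")),
             lambda b: zlib.decompress(b).decode("utf-8")),
}


def segment(record: str, sep: str = "|") -> List[str]:
    """S100: split the original record into fields."""
    return record.split(sep)


def choose_policy(value: str) -> str:
    """Pick "int" only when the text round-trips exactly through a 64-bit int."""
    try:
        n = int(value)
    except ValueError:
        return "zlib"
    fits = -(2 ** 63) <= n < 2 ** 63
    return "int" if fits and str(n) == value else "zlib"


def compress_record(record: str, sep: str = "|") -> Tuple[List[bytes], dict]:
    """S200 and S300: compress each field and record how it was compressed."""
    fields = segment(record, sep)
    policies = [choose_policy(f) for f in fields]
    blobs = [POLICIES[p][0](f) for p, f in zip(policies, fields)]
    mapping = {"sep": sep, "policies": policies}  # the stored mapping metadata
    return blobs, mapping


def restore_record(blobs: List[bytes], mapping: dict) -> str:
    """Use the stored mapping to parse the original record back out."""
    fields = [POLICIES[p][1](b) for p, b in zip(mapping["policies"], blobs)]
    return mapping["sep"].join(fields)


# Hypothetical sample record.
RECORD = "20181022|6222020200112345678|PURCHASE|-1500|merchant 007"
BLOBS, MAPPING = compress_record(RECORD)
```

Note that a value such as `007` is deliberately not treated as an integer, because converting it would lose the leading zeros and the original could not be restored exactly.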
  • FIG. 2 is a block diagram showing the structure of a data compression storage device of the present invention.
  • the data compression storage device of the present invention comprises:
  • a segmentation module 100 for dividing the original data into a plurality of fields
  • the compression module 200 is configured to compress different fields using different optimized compression policies and to store the compressed data, wherein the compression module 200 determines the strength of the association relationship between fields and sets the compression policy according to that strength.
  • a mapping relationship storage module 300 may further be provided, in which a mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed externally, the original data can be parsed out according to the mapping relationship.
  • the mapping relationship storage module 300 is not an essential structural unit of the data compression storage device of the present invention, but is a preferred module.
  • the first embodiment relates to an embodiment in which compressed storage is performed for each field using an optimized compression strategy.
  • the data compression storage method of the first embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:
  • for example, enumerated types are stored in binary encoding, strings holding numeric values are converted to integers or floating-point numbers for storage, and so on.
  • a mapping relationship storing step may further be provided, in which a mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed externally, the original data can be parsed out successfully according to the mapping relationship.
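The two single-field strategies named in the first embodiment (binary storage for enumerated strings, numeric conversion for strings that hold numbers) might look like the sketch below. The function names, the fixed-width bit packing, and the JSON fallback for columns that cannot be converted losslessly are assumptions made for illustration; the source does not specify an encoding.

```python
import json
import struct
from typing import List, Tuple


def encode_enum_column(values: List[str]) -> Tuple[bytes, List[str]]:
    """Replace each enumerated string with a fixed-width bit code."""
    symbols = sorted(set(values))
    bits = max(1, (len(symbols) - 1).bit_length())
    code = {s: i for i, s in enumerate(symbols)}
    acc = 0
    for v in values:
        acc = (acc << bits) | code[v]
    nbytes = (len(values) * bits + 7) // 8
    return acc.to_bytes(nbytes, "big"), symbols


def decode_enum_column(packed: bytes, symbols: List[str], count: int) -> List[str]:
    """Inverse of encode_enum_column; needs the symbol table and row count."""
    bits = max(1, (len(symbols) - 1).bit_length())
    acc = int.from_bytes(packed, "big")
    mask = (1 << bits) - 1
    return [symbols[(acc >> (bits * (count - 1 - i))) & mask] for i in range(count)]


def encode_numeric_column(values: List[str]) -> Tuple[str, bytes]:
    """Pack numeric strings as 64-bit ints or doubles, but only if lossless."""
    try:
        ints = [int(v) for v in values]
        if all(str(n) == v for n, v in zip(ints, values)):
            return "q", struct.pack(f">{len(ints)}q", *ints)
    except (ValueError, struct.error):
        pass  # not integers, or outside the signed 64-bit range
    try:
        floats = [float(v) for v in values]
        if all(repr(x) == v for x, v in zip(floats, values)):
            return "d", struct.pack(f">{len(floats)}d", *floats)
    except ValueError:
        pass
    return "raw", json.dumps(values).encode("utf-8")  # fallback: keep the text


def decode_numeric_column(kind: str, blob: bytes) -> List[str]:
    """Inverse of encode_numeric_column."""
    if kind == "raw":
        return json.loads(blob.decode("utf-8"))
    count = len(blob) // 8
    nums = struct.unpack(f">{count}{kind}", blob)
    return [str(n) if kind == "q" else repr(n) for n in nums]


# Hypothetical sample columns.
STATUSES = ["APPROVED", "DECLINED", "APPROVED", "PENDING", "APPROVED"]
PACKED, SYMBOLS = encode_enum_column(STATUSES)
AMOUNTS = ["1500", "-20", "999999"]
KIND, BLOB = encode_numeric_column(AMOUNTS)
```

In this sketch the symbol table, the row count, and the `kind` tag are exactly the sort of information a mapping relationship storing step would need to keep so the original strings can be restored.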
  • the data compression storage device of the first embodiment includes a segmentation module and a compression module.
  • the function of the segmentation module in the first embodiment is the same as that of the segmentation module 100 described above.
  • the compression module in the first embodiment specifically includes the following sub-modules:
  • the content analysis sub-module performs content analysis on data cut into multiple fields, and establishes an association relationship between the fields
  • the compression sub-module, for a single field, uses different compression policies for different data content for compressed storage.
  • a mapping relationship storage module may optionally be provided, in which the mapping relationship between the original data and the compressed data is established and stored.
  • the second embodiment relates to an implementation that uses an optimized compression strategy for multiple fields.
  • the data compression storage method of the second embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:
  • multiple fields between which a correlation exists are combined, and the combined fields are compressed and stored using different compression policies for different data content.
  • as the compression policy, multiple fields are combined and a correspondingly optimized data compression storage method is adopted; for example, short fields are combined and then compressed; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of one another, only one of them is stored; where an inclusion relationship exists between fields, the redundant information is compressed before storage; and so on.
  • a mapping relationship storing step may further be provided, in which a mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed externally, the original data can be parsed out successfully according to the mapping relationship.
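One reason to combine short fields before compressing is that every compressed stream carries fixed overhead. The sketch below illustrates this with Python's standard `zlib` module, which is an assumption here; the source does not name a compressor. The separator `SEP` and the function names are likewise hypothetical, and the row is assumed to be non-empty with no field containing the separator.

```python
import zlib
from typing import List

SEP = "\x1f"  # ASCII unit separator; assumed never to occur inside a field


def compress_separately(row: List[str]) -> List[bytes]:
    """Baseline: one zlib stream per field, each with its own header and checksum."""
    return [zlib.compress(v.encode("utf-8")) for v in row]


def compress_combined(row: List[str]) -> bytes:
    """Combine the short fields first, then compress the result once."""
    return zlib.compress(SEP.join(row).encode("utf-8"))


def decompress_combined(blob: bytes) -> List[str]:
    """Split the decompressed text back into the original fields."""
    return zlib.decompress(blob).decode("utf-8").split(SEP)


# Hypothetical row of short fields.
ROW = ["CN", "156", "CNY", "01"]
SEPARATE_SIZE = sum(len(b) for b in compress_separately(ROW))
COMBINED_SIZE = len(compress_combined(ROW))
```

For a row of very short fields like this one, `COMBINED_SIZE` comes out smaller than `SEPARATE_SIZE`, because the per-stream header and checksum are paid once instead of once per field.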
  • the data compression storage device of the second embodiment includes a segmentation module and a compression module.
  • the function of the segmentation module in the second embodiment is the same as that of the segmentation module 100 described above.
  • the compression module in the second embodiment specifically includes the following sub-modules:
  • the content analysis sub-module performs content analysis on data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation relationship between the data fields based on the data distribution graph and the association relationship graph;
  • the compression sub-module combines multiple fields with related relationships, and uses different compression policies for different data contents for compressed storage for the combined fields.
  • a mapping relationship storage module may optionally be provided, in which the mapping relationship between the original data and the compressed data is established and stored.
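The source does not say how the content analysis sub-module measures the strength of a relationship between fields. One plausible measure, shown purely as an assumption, is how well the value of one field predicts the value of another: a score of 1.0 means the second field is fully determined by the first. The function names and the `threshold` parameter below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Dict[str, str]


def association_strength(rows: List[Row], a: str, b: str) -> float:
    """Fraction of rows whose b value is the most common b for their a value."""
    if not rows:
        return 0.0
    groups: Dict[str, Counter] = defaultdict(Counter)
    for r in rows:
        groups[r[a]][r[b]] += 1
    predictable = sum(c.most_common(1)[0][1] for c in groups.values())
    return predictable / len(rows)


def strongly_associated_pairs(rows: List[Row], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """List ordered field pairs (a, b) whose association meets the threshold."""
    fields = list(rows[0]) if rows else []
    return [(a, b) for a in fields for b in fields
            if a != b and association_strength(rows, a, b) >= threshold]


# Hypothetical sample data.
ROWS = [
    {"country": "CN", "currency": "CNY", "amount": "10"},
    {"country": "CN", "currency": "CNY", "amount": "25"},
    {"country": "US", "currency": "USD", "amount": "10"},
    {"country": "US", "currency": "USD", "amount": "40"},
]
STRONG = strongly_associated_pairs(ROWS)
```

A known weakness of this measure is that a field with a unique value per row (an identifier, for example) trivially predicts every other field, so a real analysis would need to discount such fields.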
  • the third embodiment relates to an implementation that uses an optimized compression strategy for both a single field and multiple field combinations.
  • the data compression storage method of the third embodiment includes a segmentation step and a compression step, wherein the segmentation step is the same as the segmentation step S100 described above, and the compression step specifically includes the following substeps:
  • an optimized data compression storage method is adopted for both single fields and multiple fields; for example, enumerated types are stored in binary encoding, strings holding numeric values are converted to integers or floating-point numbers for storage, and short fields are combined and then compressed; for approximate fields, the redundant information is compressed before storage; for fields whose information is in reverse order of one another, only one of them is stored; where an inclusion relationship exists between fields, the redundant information is compressed before storage; and so on.
  • a mapping relationship storing step may further be provided, in which a mapping relationship between the original data and the compressed data is established and stored, so that when the data is accessed externally, the original data can be parsed out successfully according to the mapping relationship.
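For approximate fields, "compressing the redundant information" could be done by storing one field in full and the other only as its difference from the first. The sketch below uses a shared-prefix difference, which is an assumption chosen for simplicity; the source does not specify the technique, and the function names and timestamp values are hypothetical.

```python
from typing import Tuple


def encode_approx(base: str, other: str) -> Tuple[int, str]:
    """Describe `other` as (length of prefix shared with base, remaining suffix)."""
    n = 0
    limit = min(len(base), len(other))
    while n < limit and base[n] == other[n]:
        n += 1
    return n, other[n:]


def decode_approx(base: str, delta: Tuple[int, str]) -> str:
    """Rebuild the second field from the stored base field and the delta."""
    n, suffix = delta
    return base[:n] + suffix


# Hypothetical pair of near-identical fields from one record.
REQUEST_TIME = "2018-10-22 09:15:31.120"
RESPONSE_TIME = "2018-10-22 09:15:31.482"
DELTA = encode_approx(REQUEST_TIME, RESPONSE_TIME)
```

Here only `REQUEST_TIME` and the small `DELTA` would be stored, and `decode_approx(REQUEST_TIME, DELTA)` yields the second timestamp again.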
  • the data compression storage device of the third embodiment includes a segmentation module and a compression module.
  • the function of the segmentation module in the third embodiment is the same as that of the segmentation module 100 described above.
  • the compression module in the third embodiment specifically includes the following sub-modules:
  • the content analysis sub-module performs content analysis on the data cut into multiple fields, establishes a data distribution map and an association relationship diagram between the fields, and identifies a correlation between the data fields based on the data distribution graph and the association relationship graph;
  • the compression sub-module, for a single field, uses different compression policies for different data content for compressed storage, and also combines multiple correlated fields and compresses and stores the combined fields using different compression policies for different data content.
  • a mapping relationship storage module may optionally be provided, in which the mapping relationship between the original data and the compressed data is established and stored.
  • the present invention also provides a computer readable medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements a data compression storage method in accordance with each of the above embodiments.
  • the present invention also provides a computer device comprising a memory, a processor, and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of the data compression storage method described above when the computer program is executed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a data compression and storage method and a data compression and storage device. The data compression method comprises the following steps: a segmentation step of segmenting original data into a plurality of fields; and a compression step of compressing different fields using different compression policies based on different data content, and storing the compressed data. With the data compression and storage method and the data compression and storage device of the present invention, different compression methods can be used in consideration of different data content, data compression efficiency can be effectively improved, and, compared with common data compression tools such as GZIP and SNAPPY, the data compression rate is significantly improved.
PCT/CN2018/111180 2017-12-28 2018-10-22 Data compression and storage method and data compression and storage device WO2019128409A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711455790.2 2017-12-28
CN201711455790.2A CN108304472A (zh) 2017-12-28 2017-12-28 Data compression storage method and data compression storage device

Publications (1)

Publication Number Publication Date
WO2019128409A1 (fr) 2019-07-04

Family

ID=62867648

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/111180 WO2019128409A1 (fr) 2017-12-28 2018-10-22 Data compression and storage method and data compression and storage device

Country Status (3)

Country Link
CN (1) CN108304472A (fr)
TW (1) TWI683548B (fr)
WO (1) WO2019128409A1 (fr)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
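CN108304472A (zh) * 2017-12-28 2018-07-20 中国银联股份有限公司 Data compression storage method and data compression storage device
CN110134342A (zh) * 2019-05-28 2019-08-16 首都师范大学 Data approximation method and system, storage method and system, and reading method and system
CN110784227B (zh) * 2019-10-21 2021-07-30 清华大学 Multi-way compression method and device for data sets, and storage medium
CN111010189B (zh) * 2019-10-21 2021-10-26 清华大学 Multi-way compression method and device for data sets, and storage medium
CN111008230B (zh) * 2019-11-22 2023-08-04 远景智能国际私人投资有限公司 Data storage method and apparatus, computer device, and storage medium
CN111259107B (zh) * 2020-01-10 2023-08-18 北京百度网讯科技有限公司 Storage method and apparatus for row-and-column text, and electronic device
CN113220651B (zh) * 2021-04-25 2024-02-09 暨南大学 Operation data compression method and apparatus, terminal device, and storage medium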

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
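CN102708183A (zh) * 2012-05-09 2012-10-03 华为技术有限公司 Data compression method and apparatus
CN105308589A (zh) * 2013-04-17 2016-02-03 朗桑有限公司 Compressing data based on data content
CN106980639A (zh) * 2016-12-29 2017-07-25 中国银联股份有限公司 Short text data aggregation system and method
CN108304472A (zh) * 2017-12-28 2018-07-20 中国银联股份有限公司 Data compression storage method and data compression storage device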

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102638579B (zh) * 2012-03-29 2016-05-04 深圳市高正软件有限公司 Data processing method and system based on mobile device data transmission
CN103379136B (zh) * 2012-04-17 2017-02-22 中国移动通信集团公司 Log collection data compression method, decompression method, and apparatus
CN104424229B (zh) * 2013-08-26 2019-02-22 腾讯科技(深圳)有限公司 Multi-dimensional splitting computation method and system
CN104462524A (zh) * 2014-12-24 2015-03-25 福建江夏学院 Internet of Things data compression storage method
CN106156037B (zh) * 2015-03-26 2019-11-12 深圳市腾讯计算机系统有限公司 Data processing method, apparatus, and system
EP3229444B1 (fr) * 2015-12-29 2019-10-16 Huawei Technologies Co., Ltd. Server and method for compressing data by server
CN106019369B (zh) * 2016-06-28 2017-12-22 西南科技大学 Improved lossless compression algorithm for seismic data in SEG-Y files

Also Published As

Publication number Publication date
TWI683548B (zh) 2020-01-21
CN108304472A (zh) 2018-07-20
TW201931780A (zh) 2019-08-01

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18894483

Country of ref document: EP

Kind code of ref document: A1