CN112948386B - Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Info

Publication number
CN112948386B
Authority
CN
China
Prior art keywords
data
index
information
page
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110241032.0A
Other languages
Chinese (zh)
Other versions
CN112948386A (en)
Inventor
陈晖�
崔营
杨健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fifth Research Institute Of Telecommunications Technology Co ltd
Original Assignee
Fifth Research Institute Of Telecommunications Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-03-04
Filing date
2021-03-04
Publication date
2023-09-22
Application filed by Fifth Research Institute Of Telecommunications Technology Co ltd
Priority to CN202110241032.0A
Publication of CN112948386A
Application granted
Publication of CN112948386B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data. The mechanism is applied when abnormal data is written to disk (disk-dropped) during ETL operation, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data. The invention indexes the disk-dropped data with a simple index comprising a global index, a page index and a row index, which greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. The simple data compression and encryption scheme comprises displacement compression, dictionary compression and bitmap compression; the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.

Description

Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data
Technical Field
The invention belongs to the field of big data storage, and particularly relates to a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data.
Background
When an ETL (Extract-Transform-Load) model is designed, the designer always tries to cover every data type and every special case. However, when the ETL model faces a massive data source, abnormal data caused by design oversights is unavoidable, and the abnormal data needs to be dropped to disk so that business personnel can inspect it.
In current big-data ETL processing, even if the designer researches carefully and designs in detail, problems inevitably occur in actual use: the design fails to match the actual scenario, network communication is abnormal, business changes cause abnormal data, and so on, as shown in FIG. 1. In a production environment, abnormal data cannot simply be discarded; it is generally dropped to disk to await inspection by business personnel. Overlong data, for example, is a perennial headache for ETL engineers. Because a complete ETL process involves a large number of servers, these servers are usually shared with other services, which in turn involve many users; even abnormal data may be important and should not be visible to those other users. In addition, while ETL software is being upgraded, much data is still in the middle of ETL, and hundreds of thousands or even millions of records awaiting processing may be held in memory. This data must be dropped to disk before the software is upgraded; otherwise the records held in memory are lost.
Existing disk-dropping is basically done in plain text, so that business personnel can open the abnormal data directly in a text editor to find the cause of an error. However, an ETL server is generally a shared server that many people can access, so dropping data onto it in plain text carries a risk of data leakage.
Disclosure of Invention
The invention aims to solve the above problems by providing a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data. The mechanism is applied to the disk-dropping of abnormal data during ETL operation, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index records the basic information of the complete data file; the page index records the index information of the current index page; the row index records the row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the disk-dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row.
The beneficial effects of the invention are as follows. The global index indexes the data as a whole, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the disk-dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. A simple data compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted; the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.
Drawings
FIG. 1 is a schematic illustration of an application scenario of the present invention;
FIG. 2 is a logical framework diagram of the present invention;
FIG. 3 is a schematic diagram of an index structure;
FIG. 4 is a global index schematic;
FIG. 5 is a page index diagram;
FIG. 6 is a row index schematic;
FIG. 7 is a displacement compression schematic;
FIG. 8 is a dictionary compression diagram;
FIG. 9 is a bitmap compression schematic;
FIG. 10 is a data drop flow diagram;
FIG. 11 is a flowchart of a drop data query.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
As shown in FIG. 2, the simple indexing and encrypting disk-dropping mechanism for ETL abnormal data is applied to the disk-dropping of abnormal data during ETL operation, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index records the basic information of the complete data file; the page index records the index information of the current index page; the row index records the row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the disk-dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row.
Specifically, the basic information of the complete data file includes: a unique ID for the overall index, a creation timestamp, an export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, an overall index check value, and the basic information of each data page.
Specifically, the basic information of a data page includes the offset of the data page within the whole storage file, keyword information for the data page content, and the data row positions of important keywords.
Specifically, the index page comprises an index area and a data area. The index area stores the index header information, which comprises the page number, the number of data records, the index header size, the data size before compression, the data size after compression, a data check bit, and the offset of each data row. The data area comprises the data row index, the data information and the data row check information.
Specifically, the row index is used to index a single piece of data, and the key information of the current data row comprises the unique address of the index row, keyword information and data compression information.
Specifically, displacement compression records character string information and character string offset information, and is performed when the number of occurrences of a character string exceeds a set count; dictionary compression sets up a keyword information base and stores keywords by replacing them with symbol addresses; bitmap compression stores keywords that exceed a set string length using symbol addresses of a set size of several bytes.
Specifically, the key information of the current data row includes data information and a data check bit.
Specifically, when the size of the newly added data plus the data size of the current index page is smaller than a set value, the newly added data is added to the current index page; when it is larger than the set value, a new index page is created and the newly added data is added to the new index page. When a single piece of newly added data is itself larger than the set value, an index page is created and that single piece of data forms the index page on its own.
As shown in FIG. 3, the index structure includes a global index, a page index, and a row index. The global index records the basic information of the complete data file, the page index records the index information of the current index page, and the row index records the key information of the current data row.
As shown in the global index diagram of FIG. 4, the information recorded by the global index includes: a unique ID for the overall index, a creation timestamp, an export timestamp, the total number of data records, the total number of data pages, the overall compression algorithm, the data size before compression, the data size after compression, an overall index check value, and the basic information of each data page. The overall compression algorithm is an open-source compression algorithm and is applied to the complete data.
As shown in FIG. 5, the index page includes an index area and a data area. The index area stores the index header information, which includes the page number, the number of data records, the index header size, the data size before compression, the data size after compression, a data check bit, and the offset of each data row. The data area contains the data row index, the data information and the data row check information.
Each index page is 4G in size. When the size of the newly added data plus the data size of the current index page is smaller than 4G, the data is added to the current index page; when it is larger than 4G, a new index page is created and the data is added to the new index page. When a single piece of newly added data is itself larger than 4G, an index page is created and that single piece of data forms the index page on its own.
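The page allocation rule above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `place_record`, the list-of-sizes representation, and the reading of "4G" as 4 GiB are all assumptions, and the text does not say what happens when the combined size equals the limit exactly (the sketch treats that case as fitting).

```python
PAGE_LIMIT = 4 * 1024**3  # assumption: "4G" is read as 4 GiB


def place_record(page_sizes, record_size, limit=PAGE_LIMIT):
    """Decide which index page receives a record of record_size bytes.

    page_sizes is a list with the current data size of every index page,
    oldest first; it is updated in place. Returns the page number used.
    """
    # A single record larger than the limit forms an index page on its own.
    if record_size > limit:
        page_sizes.append(record_size)
        return len(page_sizes) - 1
    # The record fits: add it to the current (last) index page.
    if page_sizes and page_sizes[-1] + record_size <= limit:
        page_sizes[-1] += record_size
        return len(page_sizes) - 1
    # Otherwise create a new index page and add the record to it.
    page_sizes.append(record_size)
    return len(page_sizes) - 1
```

Because an oversized page already exceeds the limit, the next record after it always opens a fresh page, so the oversized record stays alone on its page.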
As shown in the row index diagram of FIG. 6, the row index is used to index a single piece of data, and records the key information and data compression information of that single piece of data.
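The three index levels described for FIGS. 3 to 6 might be modelled roughly as below. The class and field names are hypothetical, only a subset of the listed fields is shown, and CRC32 from `zlib` stands in for the unspecified check value.

```python
from dataclasses import dataclass, field
from typing import List
import zlib


@dataclass
class RowIndex:
    address: int            # unique address of the row within its page
    keywords: List[str]     # keyword information for the row
    compression: str        # name of the compression applied to the row


@dataclass
class IndexPage:
    page_number: int
    row_offsets: List[int] = field(default_factory=list)    # index area: offset of each row
    row_index: List[RowIndex] = field(default_factory=list)
    rows: List[bytes] = field(default_factory=list)          # data area

    def add_row(self, payload: bytes, keywords: List[str]) -> RowIndex:
        """Append a row to the data area and record its row index entry."""
        offset = sum(len(r) for r in self.rows)
        entry = RowIndex(address=offset, keywords=keywords, compression="none")
        self.row_offsets.append(offset)
        self.row_index.append(entry)
        self.rows.append(payload)
        return entry

    def check_value(self) -> int:
        """CRC32 over the data area (an assumed choice of check value)."""
        return zlib.crc32(b"".join(self.rows))


@dataclass
class GlobalIndex:
    index_id: str
    created_at: float
    algorithm: str = "zlib"   # assumed stand-in for the overall compression algorithm
    pages: List[IndexPage] = field(default_factory=list)

    @property
    def total_pages(self) -> int:
        return len(self.pages)
```

Keeping the per-row offsets in the page header is what lets a reader jump to one row after decompressing a page instead of scanning the whole data area.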
As shown in the displacement compression diagram of FIG. 7, if a certain character string (e.g., "ABC") appears often, displacement compression is used, where the ID is a distinctive unique key (e.g., an otherwise unused character such as "%" or "@" followed by a unique string of digits). For example, the string "XXXABCXXABCXXXXXABCXX" can be recorded as "XXX%20XX%20XXXXX%20XX" together with "%20, ABC, 3, 8, 16". The string comprises 21 characters with addresses numbered 0-20; the first "ABC" is at address 3, the second at address 8, and the third at address 16. Displacement compression thus records character string information and character string offset information.
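A minimal sketch of displacement compression, using a 21-character input that is consistent with the recorded offsets 3, 8 and 16 and the key "%20" from the example. The function names are hypothetical, and the sketch assumes the key never occurs in the uncompressed text.

```python
def displacement_compress(text, pattern, token):
    """Replace every occurrence of pattern with token.

    Returns the compressed text and a record (token, pattern, offsets),
    where offsets are the addresses of pattern in the ORIGINAL text.
    Assumes token does not occur in text.
    """
    offsets = []
    out = []
    i = 0
    while i < len(text):
        if text.startswith(pattern, i):
            offsets.append(i)       # remember where the string stood
            out.append(token)       # store the short key instead
            i += len(pattern)
        else:
            out.append(text[i])
            i += 1
    return "".join(out), (token, pattern, offsets)


def displacement_decompress(compressed, record):
    """Restore the original text from the compressed text and its record."""
    token, pattern, _offsets = record
    return compressed.replace(token, pattern)
```

With a three-character key replacing a three-character string the example saves no space; the benefit appears only when the repeated string is longer than its key, which is presumably why the scheme is applied only above a set occurrence count.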
As shown in the dictionary compression diagram of FIG. 8, if key information such as place names, countries, personal names and sensitive terms appears frequently, it is replaced with symbol addresses for storage. For example, "Chongqing" is stored as a first address ID1, "university" as a second address ID2, and "long-medium" as a third address ID3.
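Dictionary compression as described can be sketched with a keyword-to-symbol table. The helper names are hypothetical, and the symbols are wrapped in angle brackets (an assumption not in the text) so that one symbol is never a prefix of another; the sketch also assumes that neither keywords nor plain text contain such symbols.

```python
def build_dictionary(keywords):
    """Assign each keyword a symbol address: <ID1>, <ID2>, ..."""
    return {kw: f"<ID{i}>" for i, kw in enumerate(keywords, start=1)}


def dictionary_compress(text, dictionary):
    """Replace keywords with their symbol addresses, longest keyword first."""
    for kw in sorted(dictionary, key=len, reverse=True):
        text = text.replace(kw, dictionary[kw])
    return text


def dictionary_decompress(text, dictionary):
    """Replace symbol addresses with the keywords they stand for."""
    for kw, symbol in dictionary.items():
        text = text.replace(symbol, kw)
    return text
```

Without the dictionary the stored text shows only symbol addresses, which is the sense in which the scheme hides content while compressing; it is obfuscation and not a cryptographic cipher.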
As shown in FIG. 9, for some common phrases, if the ID strings produced by dictionary compression still occupy a large amount of space, bitmap compression is used. For example, a long integer has 64 bits, and these bits are partitioned by information category: one group of bits represents the first category of information, another group represents the second category, and so on. If "man", "woman", "Beijing", "Chongqing" and "adult" appear in the ID string, the addresses of "man" and "woman", which belong to the same category, are placed in one group of bits, and the addresses of "Beijing", "Chongqing" and "adult", which belong to another category, are placed in another group of bits.
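The bit-field idea can be illustrated by packing two categories into one integer. The bit widths (2 bits for the first category, 6 bits for the second), the value tables and the function names are assumptions made for this sketch; the text does not specify a layout.

```python
# Hypothetical value tables; address 0 is reserved for "no value".
GENDER = ["", "man", "woman"]
CITY = ["", "Beijing", "Chongqing"]

GENDER_BITS = 2   # assumed width of the first category field
CITY_BITS = 6     # assumed width of the second category field


def pack(gender, city):
    """Place each category's address in its own group of bits."""
    g = GENDER.index(gender)
    c = CITY.index(city)
    return g | (c << GENDER_BITS)


def unpack(word):
    """Read each group of bits back and look up the value it addresses."""
    g = word & ((1 << GENDER_BITS) - 1)
    c = (word >> GENDER_BITS) & ((1 << CITY_BITS) - 1)
    return GENDER[g], CITY[c]
```

Two keywords that would otherwise each need an ID string now share a single byte, and a 64-bit word leaves room for many more category fields.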
The data drop flow shown in FIG. 10 comprises the following steps:
the ETL program is started; the data acquisition, data conversion and data loading programs run; whether newly added data is abnormal is judged, and if so the data is dropped to disk, otherwise it is not dropped and the flow returns to the data acquisition, data conversion and data loading processes; whether the combined size of the newly added data and the last index page is larger than a set value is judged, and if so a new index page is created, otherwise the newly added data is written to the last index page; the row index of the newly added data is updated; and the global index information is updated.
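The drop flow can be sketched end to end with in-memory structures. Everything here is illustrative: `run_etl`, the dictionary layout of pages and indexes, and the caller-supplied `is_abnormal` predicate are assumptions, and nothing is actually written to disk or compressed.

```python
def run_etl(records, is_abnormal, page_limit):
    """Process records once; return (loaded, pages, global_index).

    Normal records go to `loaded`. Abnormal records are dropped into
    index pages, and the row index and global index are updated.
    """
    loaded, pages = [], []
    global_index = {"total_rows": 0, "total_pages": 0}
    for rec in records:
        if not is_abnormal(rec):
            loaded.append(rec)          # normal path: continue the ETL
            continue
        size = len(rec.encode("utf-8"))
        last = pages[-1] if pages else None
        # Create a new index page when the last one cannot take the record.
        if last is None or last["size"] + size > page_limit:
            last = {"rows": [], "row_index": [], "size": 0}
            pages.append(last)
        # Update the row index of the newly added data, then store it.
        last["row_index"].append({"offset": last["size"], "length": size})
        last["rows"].append(rec)
        last["size"] += size
        # Update the global index information.
        global_index["total_rows"] += 1
        global_index["total_pages"] = len(pages)
    return loaded, pages, global_index
```

For instance, treating any record longer than five characters as abnormal mimics the overlong-data case mentioned in the background.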
The disk-dropped data query flow shown in FIG. 11 comprises the following steps: the disk-dropped data query software is started; the global index is queried; the specific index page is located according to the global index; the index page data is decompressed; the row index information and data information are queried; and the queried data is returned.
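The query steps can be sketched as a keyword lookup over a small in-memory store. The store layout, the function names, the JSON page encoding, and the use of `zlib` as the open-source compression algorithm are all assumptions made for illustration.

```python
import json
import zlib


def build_store(pages):
    """Compress each page and build a global index over the result.

    pages is a list of pages; each page is a list of row dicts with
    'keywords' (list of str) and 'data' (str).
    """
    blob = b""
    global_index = []
    for rows in pages:
        packed = zlib.compress(json.dumps(rows).encode("utf-8"))
        keywords = sorted({k for r in rows for k in r["keywords"]})
        global_index.append(
            {"offset": len(blob), "length": len(packed), "keywords": keywords}
        )
        blob += packed
    return global_index, blob


def query(global_index, blob, keyword):
    """Return the data of every row tagged with keyword."""
    hits = []
    for page in global_index:                        # query the global index
        if keyword not in page["keywords"]:
            continue                                 # skip pages without the keyword
        start = page["offset"]                       # locate the index page
        raw = blob[start:start + page["length"]]
        rows = json.loads(zlib.decompress(raw))      # decompress the page data
        hits += [r["data"] for r in rows if keyword in r["keywords"]]  # row lookup
    return hits
```

Only pages whose global index entry lists the keyword are decompressed, which is where the claimed query speed-up on large volumes of disk-dropped data would come from.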
In the invention, the global index indexes the data as a whole, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the disk-dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. A simple data compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted; the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.
The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.

Claims (7)

1. A simple indexing and encrypting disk-dropping mechanism for ETL abnormal data, applied to the disk-dropping of abnormal data during ETL operation, characterized by comprising establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index records the basic information of the complete data file; the page index records the index information of the current index page; the row index records the row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the disk-dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row;
the basic information of the complete data file includes: a unique ID for the overall index, a creation timestamp, an export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, an overall index check value, and the basic information of each data page.
2. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the basic information of a data page comprises the offset of the data page within the whole storage file, keyword information for the data page content, and the data row positions of important keywords.
3. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the index page comprises an index area and a data area; the index area stores the index header information;
the index header information comprises the page number, the number of data records, the index header size, the data size before compression, the data size after compression, a data check bit, and the offset of each data row;
the data area comprises the data row index, the data information and the data row check information.
4. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the row index is used to index a single piece of data, and the key information of the current data row comprises the unique address of the index row, keyword information and data compression information.
5. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the displacement compression records character string information and character string offset information, and is performed when the number of occurrences of a character string exceeds a set count; the dictionary compression sets up a keyword information base and stores keywords by replacing them with symbol addresses; the bitmap compression stores keywords that exceed a set string length using symbol addresses of a set size of several bytes.
6. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the key information of the current data row includes data information and a data check bit.
7. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein when the size of the newly added data plus the data size of the current index page is smaller than a set value, the newly added data is added to the current index page, and when it is larger than the set value, a new index page is created and the newly added data is added to the new index page; when a single piece of newly added data is itself larger than the set value, an index page is created and that single piece of newly added data forms the index page on its own.
CN202110241032.0A 2021-03-04 2021-03-04 Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data Active CN112948386B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110241032.0A CN112948386B (en) 2021-03-04 2021-03-04 Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110241032.0A CN112948386B (en) 2021-03-04 2021-03-04 Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Publications (2)

Publication Number Publication Date
CN112948386A CN112948386A (en) 2021-06-11
CN112948386B CN112948386B (en) 2023-09-22

Family

ID=76247670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110241032.0A Active CN112948386B (en) 2021-03-04 2021-03-04 Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Country Status (1)

Country Link
CN (1) CN112948386B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105354247A (en) * 2015-10-13 2016-02-24 武汉大学 Geographical video data organization management method supporting storage and calculation linkage
CN105471856A (en) * 2015-11-19 2016-04-06 中国电子科技网络信息安全有限公司 System and method used for retrieving and sharing large data center platform encryption files
CN105701096A (en) * 2014-11-25 2016-06-22 腾讯科技(深圳)有限公司 Index generation method, data inquiry method, index generation device, data inquiry device and system
CN106599040A (en) * 2016-11-07 2017-04-26 中国科学院软件研究所 Layered indexing method and search method for cloud storage
CN108040074A (en) * 2018-01-26 2018-05-15 华南理工大学 A kind of real-time network unusual checking system and method based on big data
CN109299106A (en) * 2018-10-31 2019-02-01 中国联合网络通信集团有限公司 Data query method and apparatus
EP3442158A1 (en) * 2017-08-11 2019-02-13 Palo Alto Research Center Incorporated System and architecture for supporting analytics on encrypted databases
CN109492410A (en) * 2018-10-09 2019-03-19 华南农业大学 Data can search for encryption and keyword search methodology, system and terminal, equipment
CN109639811A (en) * 2018-12-21 2019-04-16 北京金山云网络技术有限公司 Data transmission method, date storage method, device, server and storage medium
CN111782661A (en) * 2020-07-21 2020-10-16 杭州海康威视数字技术股份有限公司 Data storage method, data query method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7730106B2 (en) * 2006-12-28 2010-06-01 Teradata Us, Inc. Compression of encrypted data in database management systems
US8965921B2 (en) * 2012-06-06 2015-02-24 Rackspace Us, Inc. Data management and indexing across a distributed database

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A low cost and inner-round pipelined design of ECB-AES-256 crypto engine for solid state disk;Fei Wu等;《Proceedings of the 2010 IEEE International Conference on Networking, Architecture, and Storage》;第485-491页 *
基于Ceph块存储的高可用ISCSI研究与应用;张泽军;《中国优秀硕士学位论文全文数据库 信息科技辑》;I137-34 *
流程工业分布式实时数据库研究与应用;李德文;《中国优博士学位论文全文数据库 信息科技辑》;流程工业分布式实时数据库研究与应用 *

Also Published As

Publication number Publication date
CN112948386A (en) 2021-06-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant