CN112948386B - Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data - Google Patents
- Publication number
- CN112948386B
- Authority
- CN
- China
- Prior art keywords
- data
- index
- information
- page
- compression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/602—Providing cryptographic facilities or services
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data. The mechanism is applied to writing abnormal data to disk (disk-dropping) while a data ETL process is running, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the dropped data. The dropped data is indexed with a simple three-level index consisting of a global index, a page index and a row index, which greatly improves query efficiency, so that target data can be found quickly even when the volume of dropped data is large. The simple compression and encryption scheme comprises displacement compression, dictionary compression and bitmap compression; the data is encrypted while it is compressed, which prevents the information leakage that plaintext data would cause.
Description
Technical Field
The invention belongs to the field of big data storage, and particularly relates to a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data.
Background
When an ETL (Extract-Transform-Load) model is designed, the designer tries to cover every data type and every special case. When the model faces a massive data source, however, abnormal data caused by design shortcomings is unavoidable, and the abnormal data must be dropped to disk so that service personnel can inspect it.
In current big data ETL processes, even when the designer has researched and designed the process carefully, problems inevitably arise in actual use: the design may not match the actual scenario, network communication may fail, and service changes may produce abnormal data, as shown in FIG. 1. In a production environment the abnormal data cannot simply be discarded; it is generally dropped to disk to await inspection by service personnel. Over-length data, for example, is a persistent headache for ETL engineers. A complete ETL process involves a large number of servers, these servers are usually shared with other services, and those services in turn involve many users. Even abnormal data may be important data, and it should not be visible to those other users. In addition, when ETL software is upgraded, many records are still being processed, and hundreds of thousands or even millions of pending records may be held in memory. These records must be dropped to disk before the software is upgraded; otherwise they are lost.
The existing disk-dropping mode is essentially plaintext: service personnel can open the abnormal data directly in a text editor to find the cause of an error. However, an ETL server is generally a shared server that many people can access, so dropping data onto it in plaintext carries a risk of data leakage.
Disclosure of Invention
The invention aims to solve the above problems by providing a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data, which is applied to disk-dropping of abnormal data while the data ETL is running and comprises establishing an index, encrypting and compressing the data, and querying and accessing the dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index is used for recording the basic information of the complete data file; the page index is used for recording index information of the current index page; the row index is used for recording row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row.
The invention has the following beneficial effects: the global index indexes the global data, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of dropped data is large. A simple compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted, so that the data is encrypted while it is compressed, which prevents the information leakage that plaintext data would cause.
Drawings
FIG. 1 is a schematic illustration of an application scenario of the present invention;
FIG. 2 is a logical framework diagram of the present invention;
FIG. 3 is a schematic diagram of an index structure;
FIG. 4 is a global index schematic;
FIG. 5 is a page index diagram;
FIG. 6 is a row index schematic;
FIG. 7 is a displacement compression schematic;
FIG. 8 is a dictionary compression diagram;
FIG. 9 is a bitmap compression schematic;
FIG. 10 is a flowchart of data disk-dropping;
FIG. 11 is a flowchart of a dropped-data query.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
As shown in FIG. 2, the simple indexing and encrypting disk-dropping mechanism for ETL abnormal data is applied to disk-dropping of abnormal data while the data ETL is running and comprises establishing an index, encrypting and compressing the data, and querying and accessing the dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index is used for recording the basic information of the complete data file; the page index is used for recording index information of the current index page; the row index is used for recording row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row.
Specifically, the basic information of the complete data file includes: a unique ID of the total index, a creation timestamp, an export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, a check value of the total index, and basic information of each data page.
Specifically, the basic information of a data page includes the offset position of the data page within the whole storage file, keyword information of the data page content, and the data row positions of important keywords.
Specifically, the index page comprises an index area and a data area; the index area is used for storing index header information; the index header information comprises the page number, the number of data records, the index header size, the data size before compression, the data size after compression, data check bits and the offset of each data row; the data area comprises a data row index, data information and data row check information.
Specifically, the row index is used for indexing a single piece of data, and the key information of the current data row comprises a unique address of the index row, keyword information and data compression information.
Specifically, the displacement compression records character string information and character string offset information, and is applied when the number of occurrences of a character string exceeds a set number; the dictionary compression sets up a keyword information base and stores symbol addresses in place of keywords; the bitmap compression stores keywords that exceed a set string length using symbol addresses of a set size of several bytes.
Specifically, the key information of the current data row includes data information and data check bits.
Specifically, when the size of the newly added data plus the data size of the current index page is smaller than a set value, the newly added data is added to the current index page; when it is larger than the set value, a new index page is created and the newly added data is added to the new index page; when a single piece of newly added data is larger than the set value, an index page is created and that single piece of data forms the index page on its own.
As shown in FIG. 3, the index structure includes a global index, a page index and a row index. The global index records the basic information of the complete data file, the page index records index information of the current index page, and the row index records key information of the current data row.
As shown in the global index diagram of FIG. 4, the information recorded by the global index includes: a unique ID of the total index, a creation timestamp, an export timestamp, the total number of data records, the total number of data pages, the overall compression algorithm, the data size before compression, the data size after compression, a check value of the total index, and basic information of each data page. The overall compression algorithm is an open-source compression algorithm and is applied to the complete data.
As shown in FIG. 5, the index page includes an index area and a data area. The index area stores index header information, which includes the page number, the number of data records, the index header size, the data size before compression, the data size after compression, data check bits and the offset of each data row. The data area contains the data row index, data information and data row check information.
Each index page is 4G in size. When the size of the newly added data plus the data size of the current index page is smaller than 4G, the newly added data is added to the current index page; when it is larger than 4G, a new index page is created and the newly added data is added to the new index page. When a single piece of newly added data is larger than 4G, an index page is created and that single piece of data forms the index page on its own.
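The page allocation rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `IndexPage` and `place_row` are hypothetical, "4G" is assumed to mean 4 GiB, and a row that exactly fills the page is assumed to fit, since the text does not specify the boundary case.

```python
from dataclasses import dataclass, field
from typing import List

PAGE_LIMIT = 4 * 1024**3  # "4G" in the text, assumed here to mean 4 GiB


@dataclass
class IndexPage:
    page_number: int
    rows: List[bytes] = field(default_factory=list)
    size: int = 0  # total bytes of row data held by this page

    def add(self, row: bytes) -> None:
        self.rows.append(row)
        self.size += len(row)


def place_row(pages: List[IndexPage], row: bytes, limit: int = PAGE_LIMIT) -> IndexPage:
    """Store `row` in the last page if it fits, otherwise in a new page."""
    fits = bool(pages) and pages[-1].size + len(row) <= limit
    if not fits:
        # No page exists yet, the current page would overflow, or the row
        # alone exceeds the limit; in each case a new page is started.
        pages.append(IndexPage(page_number=len(pages)))
    pages[-1].add(row)
    return pages[-1]
```

Because an oversized row always lands in a fresh page and leaves that page over the limit, the next row also opens a new page, so the oversized row ends up alone in its page, as the text requires.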
As shown in the row index diagram of FIG. 6, the row index is used to index a single piece of data, and records the key information and data compression information of that single piece of data.
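The three index levels can be pictured as plain record types. The sketch below only mirrors the fields listed in the description; every class and field name is an assumption chosen for illustration, and the helper `pages_for` is a hypothetical example of how page-level keyword information might narrow a search.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RowIndex:
    row_address: int      # unique address of the indexed row
    keywords: List[str]   # keyword information for the row
    compression: str      # data compression information for the row


@dataclass
class PageHeader:
    page_number: int
    row_count: int
    header_size: int
    size_before: int      # data size before compression, in bytes
    size_after: int       # data size after compression, in bytes
    checksum: int         # data check value
    row_offsets: List[int] = field(default_factory=list)


@dataclass
class PageInfo:
    offset_in_file: int   # where the page starts in the storage file
    keywords: List[str]   # keyword information for the page content
    keyword_rows: Dict[str, List[int]] = field(default_factory=dict)


@dataclass
class GlobalIndex:
    index_id: str
    created_at: float
    exported_at: float
    total_rows: int
    total_pages: int
    algorithm: str        # overall compression algorithm
    size_before: int
    size_after: int
    checksum: int
    pages: List[PageInfo] = field(default_factory=list)

    def pages_for(self, keyword: str) -> List[int]:
        """Page numbers whose keyword information mentions `keyword`."""
        return [i for i, p in enumerate(self.pages) if keyword in p.keywords]
```

Keeping per-page keyword information in the global index is what lets a query skip pages without decompressing them.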
As shown in the displacement compression diagram of FIG. 7, if a certain character string (e.g. "ABC") appears often, displacement compression is used. The ID is a distinctive unique key, for example an otherwise unused character such as "%" or "@" followed by a string of digits forming a unique value. For example, the string "XXXABCXXABCXXXXXABCXX" can be recorded as "XXX%20XX%20XXXXX%20XX" together with "%20, ABC, 3, 8, 16". The string comprises 21 characters whose addresses are numbered 0-20; the first "ABC" is at address 3, the second at address 8, and the third at address 16. The displacement compression thus records character string information and character string offset information.
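A minimal sketch of displacement compression, assuming the marker does not already occur in the text. The function names `displace` and `restore` are hypothetical, and the input is the 21-character string implied by the recorded offsets 3, 8 and 16 and the address range 0-20.

```python
from typing import List, Tuple

Record = Tuple[str, str, List[int]]  # (marker, original string, offsets)


def displace(text: str, target: str, marker: str) -> Tuple[str, Record]:
    """Replace each occurrence of `target` with `marker` and record offsets."""
    if not target:
        raise ValueError("target must be a non-empty string")
    if marker in text:
        raise ValueError("marker must not already occur in the text")
    offsets: List[int] = []
    start = text.find(target)
    while start != -1:
        offsets.append(start)  # offset in the original, uncompressed text
        start = text.find(target, start + len(target))
    return text.replace(target, marker), (marker, target, offsets)


def restore(compressed: str, record: Record) -> str:
    """Undo `displace` by substituting the original string for the marker."""
    marker, target, _offsets = record
    return compressed.replace(marker, target)


body, record = displace("XXXABCXXABCXXXXXABCXX", "ABC", "%20")
print(body)    # XXX%20XX%20XXXXX%20XX
print(record)  # ('%20', 'ABC', [3, 8, 16])
```

In this example the marker is as long as the string it replaces, so nothing is saved; the size benefit only appears when the repeated string is longer than its marker.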
As shown in the dictionary compression diagram of FIG. 8, if key information such as place names, countries, person names and sensitive entries appears frequently, symbol addresses are stored in its place. For example, "Chongqing" is replaced by a first address ID1, "university" by a second address ID2, and "long-medium" by a third address ID3.
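Dictionary compression can be sketched as a keyword table that maps each keyword to a symbol address. The helpers below are hypothetical and assume the text has already been split into tokens; the `ID1`, `ID2` naming follows the example in the text.

```python
from typing import Dict, List


def build_dictionary(keywords: List[str]) -> Dict[str, str]:
    """Assign symbol addresses ID1, ID2, ... to keywords in the given order."""
    return {word: f"ID{i}" for i, word in enumerate(keywords, start=1)}


def dict_compress(tokens: List[str], table: Dict[str, str]) -> List[str]:
    """Replace known keywords by their symbol address; keep other tokens."""
    return [table.get(token, token) for token in tokens]


def dict_decompress(tokens: List[str], table: Dict[str, str]) -> List[str]:
    """Map symbol addresses back to keywords.

    Assumes no plain token collides with a symbol address such as "ID1".
    """
    reverse = {address: word for word, address in table.items()}
    return [reverse.get(token, token) for token in tokens]


table = build_dictionary(["Chongqing", "university"])
packed = dict_compress(["Chongqing", "university", "library"], table)
print(packed)  # ['ID1', 'ID2', 'library']
```

Without the keyword table the stored addresses are not readable as plaintext, which is the sense in which the description treats compression as also concealing the data.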
As shown in FIG. 9, for some common phrases, if the ID strings substituted by dictionary compression still occupy a large amount of space, bitmap compression is used. For example, one long value has 64 bits, and these bits are assigned according to information category: one group of bits represents the first type of information, another group of bits represents the second type of information, and so on. If, for example, "man", "woman", "Beijing", "Chongqing" and "adult" appear in the ID string, the addresses of "man" and "woman", which belong to the same type of information, are placed in one group of bits, and the addresses of "Beijing", "Chongqing" and "adult", which belong to the same type of information, are placed in another group of bits.
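One way to read the bitmap scheme is as bit fields packed into a single 64-bit word, one field per information category. The layout below is purely an assumption for illustration: the field names, bit widths and value tables are hypothetical, and code 0 is reserved to mean that the category is absent.

```python
from typing import Dict, List, Tuple

# category -> (bit offset, bit width, known values); hypothetical layout
FIELDS: Dict[str, Tuple[int, int, List[str]]] = {
    "gender": (0, 2, ["man", "woman"]),
    "city": (2, 6, ["Beijing", "Chongqing"]),
}


def pack(record: Dict[str, str]) -> int:
    """Pack one value per category into a single integer word."""
    word = 0
    for name, (shift, width, values) in FIELDS.items():
        # Code 0 means "absent"; known values are numbered from 1.
        code = values.index(record[name]) + 1 if name in record else 0
        if code >= (1 << width):
            raise ValueError(f"too many values for field {name!r}")
        word |= code << shift
    return word


def unpack(word: int) -> Dict[str, str]:
    """Recover the category values stored in `word`."""
    record: Dict[str, str] = {}
    for name, (shift, width, values) in FIELDS.items():
        code = (word >> shift) & ((1 << width) - 1)
        if code:
            record[name] = values[code - 1]
    return record


print(pack({"gender": "woman", "city": "Chongqing"}))  # 10
```

With this layout the two keywords occupy 8 bits of the word instead of two separate ID strings, and the remaining 56 bits stay free for further categories.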
The data disk-dropping flow shown in FIG. 10 comprises the following steps:
the ETL program is started; the data acquisition, data conversion and data loading programs run; whether the newly added data is abnormal is judged: if so, the data is dropped to disk, otherwise it is not dropped and the flow returns to data acquisition, data conversion and data loading; whether the size of the newly added data plus the last index page is larger than a set value is judged: if so, a new index page is created, otherwise the newly added data is added to the last index page; the row index of the newly added data is updated; the global index information is updated.
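The flow of FIG. 10 can be sketched as a single loop. This is an in-memory illustration under stated assumptions: `run_etl` is a hypothetical name, a record counts as abnormal when `transform` or `load` raises an exception, and pages are plain lists rather than compressed files.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_etl(
    records: Iterable[str],
    transform: Callable[[str], object],
    load: Callable[[object], None],
    page_limit: int = 1024,
) -> Tuple[List[List[str]], List[Dict[str, int]], Dict[str, int]]:
    pages: List[List[str]] = []        # dropped rows, grouped by index page
    page_sizes: List[int] = []         # bytes held by each page
    row_index: List[Dict[str, int]] = []
    global_index = {"total_rows": 0, "total_pages": 0}

    for record in records:
        try:
            load(transform(record))
            continue                   # normal data is not dropped
        except Exception:
            pass                       # abnormal data takes the drop path

        size = len(record.encode("utf-8"))
        if not pages or page_sizes[-1] + size > page_limit:
            pages.append([])           # create a new index page
            page_sizes.append(0)
        pages[-1].append(record)
        page_sizes[-1] += size

        # Update the row index first, then the global index.
        row_index.append({"page": len(pages) - 1, "row": len(pages[-1]) - 1})
        global_index["total_rows"] += 1
        global_index["total_pages"] = len(pages)

    return pages, row_index, global_index
```

For instance, with `transform=int` the records `"x"` and `"yy"` fail conversion and are dropped, while `"1"` and `"22"` are loaded normally.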
As shown in FIG. 11, the dropped-data query flow comprises the following steps: the dropped-data query software is started; the global index is queried; the specific index page is queried according to the global index; the index page data is decompressed; the row index information and data information are queried; and the queried data is returned.
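The query path of FIG. 11 can be sketched with the standard-library `zlib` and `json` modules. The choice of zlib is an assumption (the description only says an open-source compression algorithm is used), and `build_store` and `query` are hypothetical helpers working on in-memory pages.

```python
import json
import zlib
from typing import Dict, List, Tuple


def build_store(pages: List[List[str]]) -> Tuple[List[bytes], Dict[str, list]]:
    """Compress each page and record its keywords in a global index."""
    blobs: List[bytes] = []
    global_index: Dict[str, list] = {"pages": []}
    for rows in pages:
        blobs.append(zlib.compress(json.dumps(rows).encode("utf-8")))
        keywords = sorted({word for row in rows for word in row.split()})
        global_index["pages"].append({"keywords": keywords})
    return blobs, global_index


def query(
    blobs: List[bytes], global_index: Dict[str, list], keyword: str
) -> List[Tuple[int, int, str]]:
    """Return (page number, row number, row) for rows containing `keyword`."""
    hits: List[Tuple[int, int, str]] = []
    for page_no, info in enumerate(global_index["pages"]):
        if keyword not in info["keywords"]:
            continue  # the global index rules this page out; no decompression
        rows = json.loads(zlib.decompress(blobs[page_no]).decode("utf-8"))
        hits.extend(
            (page_no, i, row) for i, row in enumerate(rows) if keyword in row.split()
        )
    return hits
```

Only pages whose keyword information matches are decompressed, which is the source of the query speed-up that the description attributes to the layered index.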
In the invention, the global index indexes the global data, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of dropped data is large. A simple compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted, so that the data is encrypted while it is compressed, which prevents the information leakage that plaintext data would cause.
The technical scheme of the invention is not limited to the specific embodiment described above, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.
Claims (7)
1. A simple indexing and encrypting disk-dropping mechanism for ETL abnormal data, applied to disk-dropping of abnormal data while the data ETL is running, characterized by comprising establishing an index, encrypting and compressing the data, and querying and accessing the dropped data;
establishing the index comprises establishing a global index, a page index and a row index; the global index is used for recording the basic information of the complete data file; the page index is used for recording index information of the current index page; the row index is used for recording row index information of the current data row;
encrypting and compressing the data comprises displacement compression, dictionary compression and bitmap compression;
querying and accessing the dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row;
the basic information of the complete data file includes: a unique ID of the total index, a creation timestamp, an export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, a check value of the total index, and basic information of each data page.
2. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the basic information of a data page comprises the offset position of the data page within the whole storage file, keyword information of the data page content, and the data row positions of important keywords.
3. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the index page comprises an index area and a data area; the index area is used for storing index header information;
the index header information comprises the page number, the number of data records, the index header size, the data size before compression, the data size after compression, data check bits and the offset of each data row;
the data area comprises a data row index, data information and data row check information.
4. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the row index is used for indexing a single piece of data, and the key information of the current data row comprises a unique address of the index row, keyword information and data compression information.
5. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the displacement compression records character string information and character string offset information, and is applied when the number of occurrences of a character string exceeds a set number; the dictionary compression sets up a keyword information base and stores symbol addresses in place of keywords; the bitmap compression stores keywords that exceed a set string length using symbol addresses of a set size of several bytes.
6. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the key information of the current data row includes data information and data check bits.
7. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein when the size of the newly added data plus the data size of the current index page is smaller than a set value, the newly added data is added to the current index page; when it is larger than the set value, a new index page is created and the newly added data is added to the new index page; when a single piece of newly added data is larger than the set value, an index page is created and that single piece of newly added data forms the index page on its own.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110241032.0A CN112948386B (en) | 2021-03-04 | 2021-03-04 | Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110241032.0A CN112948386B (en) | 2021-03-04 | 2021-03-04 | Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112948386A (en) | 2021-06-11 |
CN112948386B (en) | 2023-09-22 |
Family
ID=76247670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110241032.0A Active CN112948386B (en) | 2021-03-04 | 2021-03-04 | Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112948386B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105354247A (en) * | 2015-10-13 | 2016-02-24 | 武汉大学 | Geographical video data organization management method supporting storage and calculation linkage |
CN105471856A (en) * | 2015-11-19 | 2016-04-06 | 中国电子科技网络信息安全有限公司 | System and method used for retrieving and sharing large data center platform encryption files |
CN105701096A (en) * | 2014-11-25 | 2016-06-22 | 腾讯科技(深圳)有限公司 | Index generation method, data inquiry method, index generation device, data inquiry device and system |
CN106599040A (en) * | 2016-11-07 | 2017-04-26 | 中国科学院软件研究所 | Layered indexing method and search method for cloud storage |
CN108040074A (en) * | 2018-01-26 | 2018-05-15 | 华南理工大学 | A kind of real-time network unusual checking system and method based on big data |
CN109299106A (en) * | 2018-10-31 | 2019-02-01 | 中国联合网络通信集团有限公司 | Data query method and apparatus |
EP3442158A1 (en) * | 2017-08-11 | 2019-02-13 | Palo Alto Research Center Incorporated | System and architecture for supporting analytics on encrypted databases |
CN109492410A (en) * | 2018-10-09 | 2019-03-19 | 华南农业大学 | Data can search for encryption and keyword search methodology, system and terminal, equipment |
CN109639811A (en) * | 2018-12-21 | 2019-04-16 | 北京金山云网络技术有限公司 | Data transmission method, date storage method, device, server and storage medium |
CN111782661A (en) * | 2020-07-21 | 2020-10-16 | 杭州海康威视数字技术股份有限公司 | Data storage method, data query method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7730106B2 (en) * | 2006-12-28 | 2010-06-01 | Teradata Us, Inc. | Compression of encrypted data in database management systems |
US8965921B2 (en) * | 2012-06-06 | 2015-02-24 | Rackspace Us, Inc. | Data management and indexing across a distributed database |
Non-Patent Citations (3)
Title |
---|
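A low cost and inner-round pipelined design of ECB-AES-256 crypto engine for solid state disk; Fei Wu et al.; Proceedings of the 2010 IEEE International Conference on Networking, Architecture, and Storage; pp. 485-491 * |
Research and application of highly available iSCSI based on Ceph block storage; Zhang Zejun; China Master's Theses Full-text Database, Information Science and Technology; I137-34 * |
Research and application of a distributed real-time database for the process industry; Li Dewen; China Doctoral Dissertations Full-text Database, Information Science and Technology; Research and application of a distributed real-time database for the process industry * |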
Also Published As
Publication number | Publication date |
---|---|
CN112948386A (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2443564B1 (en) | Data compression for reducing storage requirements in a database system | |
US10467420B2 (en) | Systems for embedding information in data strings | |
US8255398B2 (en) | Compression of sorted value indexes using common prefixes | |
CN102867071B (en) | Management method for massive network management historical data | |
EP1866776B1 (en) | Method for detecting the presence of subblocks in a reduced-redundancy storage system | |
CN104462141B (en) | Method, system and the storage engines device of a kind of data storage and inquiry | |
CN111339103B (en) | Data exchange method and system based on full-quantity fragmentation and incremental log analysis | |
CN102622434B (en) | Data storage method, data searching method and device | |
WO2002065316A9 (en) | System and method of indexing unique electronic mail messages and uses for the same | |
WO1992021090A1 (en) | Relational data base memory utilization analyzer | |
US7627609B1 (en) | Index processing using transformed values | |
EP3788505B1 (en) | Storing data items and identifying stored data items | |
US7698325B1 (en) | Index processing for legacy systems | |
CN100498794C (en) | Method and device for compressing index | |
CN100383787C (en) | Multi-chart information initializing method of database | |
CN101937464B (en) | Ciphertext search method based on word-for-word indexing | |
CN111008183A (en) | Storage method and system for business wind control log data | |
CN108984626B (en) | Data processing method and device and server | |
CN112948386B (en) | Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data | |
Wang et al. | Storage and query over encrypted character and numerical data in database | |
CN102693315A (en) | Method and device for removing URL (uniform resource locator) duplicate on basis of shared memory mapping | |
CN108647243B (en) | Industrial big data storage method based on time series | |
CN110555021B (en) | Data storage method, query method and related device | |
US7536398B2 (en) | On-line organization of data sets | |
CN102597969A (en) | Database management device using key-value store with attributes, and key-value-store structure caching-device therefor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||