CN112948386B

CN112948386B - Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Info

Publication number: CN112948386B
Application number: CN202110241032.0A
Authority: CN
Inventors: 陈晖�; 崔营; 杨健
Original assignee: Fifth Research Institute Of Telecommunications Technology Co ltd
Current assignee: Fifth Research Institute Of Telecommunications Technology Co ltd
Priority date: 2021-03-04
Filing date: 2021-03-04
Publication date: 2023-09-22
Anticipated expiration: 2041-03-04
Also published as: CN112948386A

Abstract

The invention discloses a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data. It is applied to dropping abnormal data to disk while a data ETL process is running, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data. The invention indexes the disk-dropped data with a simple index comprising a global index, a page index and a row index, which greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. The simple compression and encryption scheme comprises displacement compression, dictionary compression and bitmap compression; the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.

Description

Simple indexing and encrypting disk-dropping mechanism for ETL abnormal data

Technical Field

The invention belongs to the field of big data storage, and particularly relates to a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data.

Background

When an ETL (Extract-Transform-Load) model is designed, the designer tries to cover every data type and every special case. Faced with massive data sources, however, abnormal data caused by design shortcomings is unavoidable, and so that service personnel can inspect it, the abnormal data needs to be dropped to disk.

In current big data ETL processes, even when the designer has researched and designed the model carefully, problems inevitably arise in actual use: the design may not match the actual scenario, network communication may fail, and service changes may produce abnormal data, as shown in FIG. 1. In a production environment, abnormal data cannot simply be discarded; it is generally dropped to disk to await inspection by service personnel. Over-length data, for example, is a perennial headache for ETL engineers. Because a complete ETL process involves a large number of servers, and these servers are typically shared with other services that in turn involve many users, even abnormal data may be important and should not be visible to other users. In addition, when ETL software is upgraded, a large amount of data is still being processed, and hundreds of thousands or even millions of records awaiting processing may be held in memory. This data must be dropped to disk before the software is upgraded; otherwise it is lost.

The existing disk-dropping approach is essentially plain text, so that service personnel can inspect abnormal data directly in a text editor to find the cause of an error. However, an ETL server is generally a shared server that many people can access, so dropping data onto it in plain text carries a risk of data leakage.

Disclosure of Invention

The invention aims to solve the above problems by providing a simple indexing and encrypting disk-dropping mechanism for ETL abnormal data, which is applied to dropping abnormal data to disk while a data ETL process is running, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data;

establishing the index comprises establishing a global index, a page index and a row index; the global index is used for recording the basic information of the complete data file; the page index is used for recording index information of the current index page; the row index is used for recording row index information of the current data row;

encrypting and compressing data, including displacement compression, dictionary compression and bitmap compression;

querying and accessing the disk-dropped data comprises querying the global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row.

The beneficial effects of the invention are as follows. The global index indexes the data as a whole, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the disk-dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. A simple compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted, so that the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.

Drawings

FIG. 1 is a schematic illustration of an application scenario of the present invention;

FIG. 2 is a logical framework diagram of the present invention;

FIG. 3 is a schematic diagram of an index structure;

FIG. 4 is a global index schematic;

FIG. 5 is a page index diagram;

FIG. 6 is a row index schematic;

FIG. 7 is a displacement compression schematic;

FIG. 8 is a dictionary compression diagram;

FIG. 9 is a bitmap compression schematic;

FIG. 10 is a data drop flow diagram;

FIG. 11 is a flowchart of a drop data query.

Detailed Description

The invention is further described below with reference to the accompanying drawings:

As shown in FIG. 2, the simple indexing and encrypting disk-dropping mechanism for ETL abnormal data is applied to dropping abnormal data to disk while a data ETL process is running, and comprises establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data.

Specifically, the basic information of the complete data file includes: the unique ID of the global index, the creation timestamp, the export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, the check value of the global index, and the basic information of each data page.

Specifically, the basic information of each data page includes the offset position of the data page within the whole storage file, keyword information about the data page content, and the data row positions of important keywords.

Specifically, the index page comprises an index area and a data area. The index area stores the index header information, which comprises the page number, the number of data records, the index header size, the data size before compression, the data size after compression, the data check bits, and the offset of each data row. The data area comprises the data row index, the data information and the data row check information.

Specifically, the row index is used for indexing single data, and the key information of the current data row comprises an index row unique address, key word information and data compression information.

Specifically, displacement compression records character string information and character string offset information, and is applied when a character string occurs more often than a set number of times; dictionary compression sets up a keyword information base and stores keywords by replacing them with symbol addresses; bitmap compression stores keywords that exceed a set character string length as a plurality of symbol addresses of a set byte size.

Specifically, the key information of the current data line includes data information and data check bits.

Specifically, when the size of the newly added data plus the current index page is smaller than a set value, the newly added data is appended to the current index page; when it is larger than the set value, a new index page is created and the newly added data is appended to the new page. When a single piece of newly added data is itself larger than the set value, a new index page is created and that single piece of data forms the index page on its own.

As shown in FIG. 3, the index structure includes a global index, a page index and a row index. The global index records the basic information of the complete data file, the page index records the index information of the current index page, and the row index records the key information of the current data row.

As shown in the global index diagram of FIG. 4, the global index records: the unique ID of the global index, the creation timestamp, the export timestamp, the total number of data records, the total number of data pages, the overall compression algorithm, the data size before compression, the data size after compression, the check value of the global index, and the basic information of each data page. The overall compression algorithm is an open-source compression algorithm and is applied to the complete data.
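The fields listed for the global index can be pictured as a simple record. The following Python sketch is illustrative only: the class and field names (`GlobalIndex`, `PageInfo` and so on), the use of `zlib` as the open-source algorithm, and the use of CRC32 as the check value are all assumptions, since the text does not specify types, an algorithm or a checksum.

```python
import time
import uuid
import zlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageInfo:
    """Basic information kept for each data page (hypothetical layout)."""
    offset: int                 # offset of the page within the storage file
    keywords: List[str]         # keyword information about the page content
    keyword_rows: List[int]     # row positions of important keywords


@dataclass
class GlobalIndex:
    """Basic information of the complete data file (hypothetical layout)."""
    index_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    created_at: float = field(default_factory=time.time)
    exported_at: float = 0.0
    total_records: int = 0
    total_pages: int = 0
    compression: str = "zlib"   # assumed stand-in for the open-source algorithm
    size_before: int = 0
    size_after: int = 0
    pages: List[PageInfo] = field(default_factory=list)

    def check_value(self) -> int:
        """CRC32 over the summary fields; the checksum choice is an assumption."""
        summary = (f"{self.index_id}|{self.total_records}|{self.total_pages}"
                   f"|{self.size_before}|{self.size_after}")
        return zlib.crc32(summary.encode("utf-8"))
```

Keeping the per-page information inside the global index is what lets a query pick the relevant page without opening every page.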

As shown in fig. 5, the index page includes an index area and a data area, where the index area is used to store index header information; the index header information includes a page number, a number of data pieces, an index header size, a data size before compression, a data size after compression, a data check bit, and an offset for each data line. The data area contains data line index, data information and data line check information.
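As a rough illustration of how an index area and a data area might fit together, the sketch below keeps the header fields and row offsets alongside a byte buffer of stored rows. The name `IndexPage`, the 4-byte length prefix, the per-row `zlib` compression and the CRC32 row check are assumptions made for the example; the text does not define a byte layout.

```python
import zlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class IndexPage:
    """One index page: header fields (index area) plus stored rows (data area)."""
    page_number: int
    row_offsets: List[int] = field(default_factory=list)  # offset of each data row
    size_before: int = 0      # bytes before compression
    size_after: int = 0       # bytes after compression
    data: bytearray = field(default_factory=bytearray)    # data area

    def append_row(self, raw: bytes) -> int:
        """Compress one row, store it with a CRC32 trailer, return its row number."""
        packed = zlib.compress(raw)
        check = zlib.crc32(packed).to_bytes(4, "big")      # row check information
        self.row_offsets.append(len(self.data))
        self.data += len(packed).to_bytes(4, "big") + packed + check
        self.size_before += len(raw)
        self.size_after += len(packed)
        return len(self.row_offsets) - 1

    def read_row(self, row: int) -> bytes:
        """Locate a row through its offset, verify the check value, decompress it."""
        start = self.row_offsets[row]
        length = int.from_bytes(self.data[start:start + 4], "big")
        packed = bytes(self.data[start + 4:start + 4 + length])
        check = int.from_bytes(self.data[start + 4 + length:start + 8 + length], "big")
        if zlib.crc32(packed) != check:
            raise ValueError("row check value mismatch")
        return zlib.decompress(packed)
```

Because every row has its own offset in the header, a single row can be read back without scanning the rest of the page.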

Each index page is 4 GB in size. When the size of the newly added data plus the current index page is smaller than 4 GB, the data is appended to the current index page; when it is larger than 4 GB, a new index page is created and the data is appended to the new page. When a single piece of newly added data is itself larger than 4 GB, a new index page is created and that single piece of data forms the index page on its own.
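The page allocation rule can be sketched as a small function. This is a minimal illustration, not the patented implementation: `place_record` and the list-of-sizes bookkeeping are hypothetical, and treating a record that fills the page exactly as still fitting is an assumption, because the text only covers the "smaller" and "larger" cases.

```python
PAGE_LIMIT = 4 * 1024**3  # page size limit in bytes, taken from the text


def place_record(page_sizes, record_size, limit=PAGE_LIMIT):
    """Return the index of the page that receives the record.

    page_sizes is a list of current page sizes in bytes (hypothetical
    bookkeeping) and is updated in place.
    """
    if record_size > limit:
        # An oversized record gets a page of its own.
        page_sizes.append(record_size)
    elif page_sizes and page_sizes[-1] + record_size <= limit:
        # The record fits: append it to the current (last) page.
        page_sizes[-1] += record_size
    else:
        # No page yet, or the current page would overflow: start a new page.
        page_sizes.append(record_size)
    return len(page_sizes) - 1
```

For instance, with `limit=100`, records of sizes 10, 50 and 60 land on pages 0, 0 and 1, and a record of size 500 is given page 2 to itself.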

As shown in the row index diagram of FIG. 6, the row index is used to index a single piece of data, and records the key information and data compression information of that piece of data.

As shown in the displacement compression diagram of FIG. 7, if a certain character string (e.g. "ABC") appears often, displacement compression is used, where the ID is a distinctive unique key (e.g. an otherwise unused character such as "%" or "@" followed by a unique number). For example, "XXXABCXXABCXXXXXABCXX" can be recorded as "XXX%20XX%20XXXXX%20XX" together with "%20, ABC, 3, 8, 16". The character string comprises 21 characters with addresses numbered 0 to 20; the first "ABC" starts at address 3, the second at address 8, and the third at address 16. Displacement compression thus records the character string information and the character string offset information.
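The displacement compression example can be reproduced with a short Python sketch. It uses a 21-character input that is consistent with the recorded offsets 3, 8 and 16. The function names, the `min_count` threshold parameter and the rule of skipping compression when the marker already occurs in the text are assumptions for illustration.

```python
def displacement_compress(text, pattern, marker, min_count=2):
    """Replace each occurrence of pattern with marker and record the offsets."""
    offsets = []
    pos = text.find(pattern)
    while pos != -1:
        offsets.append(pos)                       # offset in the original text
        pos = text.find(pattern, pos + len(pattern))
    if len(offsets) < min_count or marker in text:
        return text, None          # not frequent enough, or marker is not unique
    return text.replace(pattern, marker), (marker, pattern, offsets)


def displacement_decompress(body, record):
    """Restore the original text from the compressed body and its record."""
    if record is None:
        return body
    marker, pattern, _offsets = record
    return body.replace(marker, pattern)
```

Calling `displacement_compress("XXXABCXXABCXXXXXABCXX", "ABC", "%20")` yields the body `"XXX%20XX%20XXXXX%20XX"` and the record `("%20", "ABC", [3, 8, 16])`. The body is unreadable without the record, which is the sense in which the data is obscured while being compressed.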

As shown in the dictionary compression diagram of FIG. 8, key information that appears frequently, such as place names, countries, personal names and sensitive terms, is replaced by symbol addresses for storage. For example, "Chongqing" is stored as the first address ID1, "university" as the second address ID2, and "long-medium" as the third address ID3.
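Dictionary compression of this kind can be sketched as a lookup table from keywords to symbol addresses. The sketch assumes the data has already been split into tokens and that no token collides with an address such as "ID1"; the function names are hypothetical.

```python
def build_dictionary(keywords):
    """Assign each keyword a short symbol address (ID1, ID2, ...)."""
    return {word: f"ID{i}" for i, word in enumerate(keywords, start=1)}


def dictionary_compress(tokens, dictionary):
    """Replace every token found in the dictionary by its symbol address."""
    return [dictionary.get(tok, tok) for tok in tokens]


def dictionary_decompress(tokens, dictionary):
    """Map symbol addresses back to keywords; other tokens pass through."""
    reverse = {addr: word for word, addr in dictionary.items()}
    return [reverse.get(tok, tok) for tok in tokens]
```

With the dictionary built from `["Chongqing", "university"]`, the tokens `["Chongqing", "university", "campus"]` are stored as `["ID1", "ID2", "campus"]`, and only a holder of the dictionary can map the addresses back.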

As shown in FIG. 9, for some common phrases, if the ID string produced by dictionary compression still occupies a large amount of space, bitmap compression is used. For example, one long integer has 64 bits, and these bits are divided according to information category: one group of bits represents the first type of information, another group of bits represents the second type of information, and so on. If "man", "woman", "Beijing", "Chongqing" and "adult" appear in the ID string, the addresses of "man" and "woman", which belong to the same type of information, are placed in one group of bits, and the addresses of "Beijing", "Chongqing" and "adult", which belong to another type of information, are placed in another group of bits.
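The idea of assigning groups of bits to information categories can be sketched as bit-field packing. The field layout below (a 2-bit field and a 6-bit field, with code 0 meaning "absent") and the names `FIELDS`, `bitmap_pack` and `bitmap_unpack` are assumptions chosen for the example; the text does not fix the widths.

```python
# Hypothetical field layout: (name, bit width, allowed values). Code 0 = absent.
FIELDS = [
    ("gender", 2, ["man", "woman"]),
    ("place", 6, ["Beijing", "Chongqing"]),
]


def bitmap_pack(record):
    """Pack one keyword per category into a single integer (fits in 64 bits)."""
    word, shift = 0, 0
    for name, width, values in FIELDS:
        code = values.index(record[name]) + 1 if name in record else 0
        word |= code << shift      # place the code in this category's bit group
        shift += width
    return word


def bitmap_unpack(word):
    """Recover the keywords from the packed integer."""
    record, shift = {}, 0
    for name, width, values in FIELDS:
        code = (word >> shift) & ((1 << width) - 1)
        if code:
            record[name] = values[code - 1]
        shift += width
    return record
```

Here `{"gender": "woman", "place": "Chongqing"}` packs to the integer `0b1010`, so two keywords occupy eight bits instead of two ID strings.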

The data-drop flow chart shown in fig. 10 comprises the following steps:

the ETL program is started; the data acquisition, data conversion and data loading programs run; whether the newly added data is abnormal is judged, and if so it is dropped to disk, otherwise it is not dropped and the flow returns to the data acquisition, data conversion and data loading programs; whether the size of the newly added data plus the last index page is larger than a set value is judged, and if so a new index page is created, otherwise the newly added data is added to the last index page; the row index of the newly added data is updated; the global index information is updated.
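The flow above can be sketched as a loop over incoming records. This is a simplified in-memory illustration under several assumptions: a record counts as abnormal when the hypothetical `transform` callable raises `ValueError`, `zlib` stands in for the compression step, and `store` is a plain dictionary rather than a file on disk.

```python
import zlib


def run_etl(records, transform, store, page_limit=1024):
    """Load good records; compress and drop abnormal ones into index pages.

    store is a hypothetical dict:
    {"pages": [[bytes, ...], ...], "sizes": [int, ...], "total": int}.
    """
    loaded = []
    for rec in records:
        try:
            loaded.append(transform(rec))
            continue                        # normal data: nothing is dropped
        except ValueError:
            pass                            # abnormal data: drop it below
        packed = zlib.compress(rec.encode("utf-8"))
        sizes = store["sizes"]
        if not sizes or sizes[-1] + len(packed) > page_limit:
            store["pages"].append([])       # create a new index page
            sizes.append(0)
        store["pages"][-1].append(packed)   # add the row to the last index page
        sizes[-1] += len(packed)
        store["total"] += 1                 # update the global record count
    return loaded
```

Running it with `transform=int` on `["1", "x", "2", "y"]` loads `[1, 2]` and drops the two unparsable records into a single page.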

As shown in FIG. 11, the disk-dropped data query flow comprises the following steps: the query software for disk-dropped data is started; the global index is queried; the specific index page is located according to the global index; the index page data is decompressed; the row index information and data information are queried; and the queried data is returned.
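The query steps can be sketched as a lookup that narrows from the global index to a page and then to individual rows. The layouts used here are assumptions: the global index is modelled as a mapping from keyword to page numbers, and each page as a `zlib`-compressed blob of newline-separated rows.

```python
import zlib


def query(global_index, pages, keyword):
    """Return rows containing keyword, following global index -> page -> rows.

    global_index maps a keyword to page numbers; pages maps a page number to
    a zlib-compressed blob of newline-separated rows (hypothetical layouts).
    """
    results = []
    for page_no in global_index.get(keyword, []):      # query the global index
        blob = zlib.decompress(pages[page_no])         # decompress the index page
        rows = blob.decode("utf-8").split("\n")
        results.extend(row for row in rows if keyword in row)  # scan the rows
    return results
```

Only the pages named by the global index are decompressed, which is why the lookup stays fast as the number of pages grows.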

In the invention, the global index indexes the data as a whole, the page index indexes all data rows of the current page, and the row index indexes a single piece of data. Indexing the disk-dropped data with this simple three-level index greatly improves query efficiency, so that target data can be found quickly even when the volume of disk-dropped data is large. A simple compression and encryption scheme comprising displacement compression, dictionary compression and bitmap compression is adopted, so that the data is encrypted as it is compressed, which prevents the information leakage that plaintext data would cause.

The technical scheme of the invention is not limited to the specific embodiment, and all technical modifications made according to the technical scheme of the invention fall within the protection scope of the invention.

Claims

1. A simple indexing and encrypting disk-dropping mechanism for ETL abnormal data, applied to dropping abnormal data to disk while a data ETL process is running, characterized by comprising establishing an index, encrypting and compressing the data, and querying and accessing the disk-dropped data;

querying and accessing the disk-dropped data comprises querying a global index, querying an index page according to the global index, decompressing the index page data, and querying the row index information and the key information of the current data row;

the basic information of the complete data file includes: the unique ID of the global index, the creation timestamp, the export timestamp, the total number of data pages, the overall compression algorithm used to compress the data, the data size before compression, the data size after compression, the check value of the global index, and the basic information of each data page.

2. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the basic information of each data page comprises the offset position of the data page within the whole storage file, keyword information about the data page content, and the data row positions of important keywords.

3. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the index page comprises an index area and a data area; the index area is used for storing index header information;

the index header information comprises page number, data number, index header size, data size before compression, data size after compression, data check bit and offset of each data line;

the data area comprises a data line index, data information and data line verification information.

4. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the row index is used to index a single piece of data, and the key information of the current data row includes the unique address of the index row, keyword information, and data compression information.

5. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein said displacement compression is used for recording character string information and character string offset information, and when the occurrence frequency of the character string information and the character string offset information exceeds a set number of times, the displacement compression is performed; the dictionary compression is used for setting a keyword information base, and replacing and storing keywords by adopting symbol addresses; the bitmap compression is used for storing the keywords exceeding the set character string length according to a plurality of set byte size symbol addresses.

6. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein the key information of the current data row includes data information and data check bits.

7. The simple indexing and encrypting disk-dropping mechanism for ETL abnormal data according to claim 1, wherein when the size of the newly added data plus the current index page is smaller than a set value, the newly added data is appended to the current index page, and when the size of the newly added data plus the current index page is larger than the set value, a new index page is created and the newly added data is appended to the new index page; when a single piece of newly added data is larger than the set value, a new index page is created and that single piece of newly added data forms the index page on its own.