CN111291403B

CN111291403B - Data desensitizing device based on distributed cluster

Info

Publication number: CN111291403B
Application number: CN202010042550.5A
Authority: CN
Inventors: 程永新; 宋辉; 郭振宇
Original assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Current assignee: Shanghai New Torch Network Information Technology Ltd By Share Ltd
Priority date: 2020-01-15
Filing date: 2020-01-15
Publication date: 2023-09-19
Anticipated expiration: 2040-01-15
Also published as: CN111291403A

Abstract

The invention discloses a data desensitizing device based on a distributed cluster, comprising a master server, a thread master scheduler and a plurality of slave servers. Each slave server is provided with a thread scheduler for allocating the threads of that slave server. The master server slices each source data table to be desensitized in a database and places the slices into the slice queue of that source data table. Through the thread master scheduler, the master server allocates defined thread pipelines to each source data table; the thread master scheduler, via the thread schedulers of the slave servers, schedules threads to pull data from the slice queues, desensitize it, and load it into the target data table. Because the thread master scheduler coordinates the thread schedulers of the slave servers, threads are allocated dynamically and loading performance improves; the distributed cluster of master and slave servers scales well; and slicing the table data enables high-speed data extraction.

Description

Data desensitizing device based on distributed cluster

Technical Field

The invention relates to a data desensitizing device, in particular to a data desensitizing device based on a distributed cluster.

Background

Data desensitization refers to transforming certain sensitive information according to desensitization rules so that sensitive private data is reliably protected. The desensitized real data set can then be used safely in development, testing and other non-production environments, and in outsourced environments. Relational databases contain a large amount of sensitive information that requires desensitization.

Existing desensitization methods fall into the following two schemes:

Scheme 1: desensitization is performed using plain JDBC.

Scheme 2: data desensitization of multiple tables is performed on a single machine.

The existing desensitization methods have the following problems:

Scheme 1: JDBC can extract and load data, but when a single table holds on the order of a hundred million rows, extraction and loading become very slow; queries may even time out, so that the desensitization task cannot be completed.

Scheme 2: the CPU and memory of a single machine are limited. If the database to be desensitized contains thousands of tables with a total volume above the TB level, memory overflow may occur and the CPU cannot keep up with the processing.

In existing production environments, desensitization is mostly performed on a whole database or a whole set of databases, so massive data from many tables is desensitized at the same time. The desensitization algorithm applied to this data is complex and consumes considerable CPU resources, which makes the work on the desensitization server a CPU-intensive task. CPU resources are limited: if the number of threads started is not limited, frequent CPU context switching consumes a large amount of CPU time. At the same time, the CPU may fail to keep up with the incoming data, so that data backs up in the pipeline (the pipeline is a memory-based queue, ArrayBlockingQueue) and the JVM runs out of memory. For a single server with, for example, a 16-core CPU, a thread count of 16 x 2 is appropriate, but for data volumes above the TB level the CPU and memory of a single server are clearly insufficient. Accordingly, the prior art is in need of improvement.

Disclosure of Invention

The invention aims to provide a data desensitizing device based on a distributed cluster that solves the above problems.

The invention provides a data desensitizing device based on a distributed cluster, comprising a master server, a thread master scheduler and a plurality of slave servers. Each slave server is provided with a thread scheduler for allocating the threads of that slave server. The master server slices each source data table to be desensitized in a database and places the slices into the slice queue of that source data table. Through the thread master scheduler, the master server allocates a defined number of thread pipeline groups and a defined number of threads to each source data table; the thread master scheduler, via the thread schedulers of the slave servers, schedules threads to pull data from the slice queues, desensitize it, and load it into a target data table.

Further, the threads of the slave server comprise extraction threads, desensitization threads and loading threads. Each group of thread pipelines consists of a corresponding extraction thread, desensitization thread and loading thread, which pass data to one another through queues to form a serial thread pipeline: the extraction thread and the desensitization thread pass data through pipeline queue I, and the desensitization thread and the loading thread pass data through pipeline queue II.

Further, the step in which the slave server pulls data from the slice queue for desensitization specifically comprises: the extraction thread of the thread pipeline reads data from the slice queue and sends it to pipeline queue I; the desensitization thread of the thread pipeline pulls data from pipeline queue I, desensitizes it, and passes the desensitized data to pipeline queue II; the loading thread of the thread pipeline pulls data from pipeline queue II and loads it into the target data table.

Further, the total number of thread pipelines of each slave server is 32, and the thread master scheduler, via the thread scheduler of each slave server, schedules threads to pull data from the slice queues of the tables for desensitization, table by table in the order of the tables.

Further, the first thread pipelines of the slave servers pull slices in turn from the slice queue of the first source data table until the first thread pipeline of every slave server has been assigned. After the first thread pipeline of a slave server has finished desensitizing the slice it pulled, it continues to pull slices in turn from the slice queue of the first source data table until the first source data table is completely desensitized, and then moves on to a new source data table. Following the ordering of the tables, the second thread pipeline of each slave server pulls data in turn from the slice queue of the second source data table, the third thread pipeline of each slave server pulls data in turn from the slice queue of the third source data table, and so on until all thread pipelines of the slave servers have been assigned. Whenever a source data table has been completely desensitized, the thread pipelines that served it move on to further tables according to the ordering of the tables, until all source data tables have been desensitized.

Further, the master server slices the source data tables in the database uniformly. If the database is an ORACLE relational database, then for each source data table the sample() function of ORACLE is used to take N physical storage addresses (ROWIDs) uniformly from the table, with N adjusted dynamically according to the size of the table until a suitable number has been extracted; the ROWIDs are then sorted and paired into intervals, dividing the table into a plurality of slices, and once slicing is complete all slices of the table are put into the slice queue of the table. If the database is a MYSQL relational database, then for each source data table the maximum and minimum primary key values are obtained through max(id) and min(id) on the primary key, and the table is divided uniformly by id into a plurality of slices that are then put into the slice queue of the table. A normal table is sliced once as a whole; for a partitioned table, each partition is sliced once.

Further, when the slave server loads the desensitized data into the target data table: if the database is an ORACLE relational database, batch direct-path loading is adopted, in which the JDBC driver cache is used to assemble a plurality of rows and send them to the database server together; if the database is a MYSQL relational database, text loading is adopted, in which the data is written to a text file and loaded over a JDBC driver connection with compression enabled.

Furthermore, when the slave servers load the desensitized data into the target data tables, the target data tables are stored in different tablespaces, and different target data tables are mapped to different tablespace disks to balance I/O. In addition, each target data table is partitioned and different partitions are mapped to different disks to balance I/O, so that a plurality of disks are written simultaneously at high speed.

Further, each slave server periodically reports its execution state and heartbeat information to the master server. The master server collects the execution states of all slave servers and computes real-time statistics from them until all data of the source data tables has been processed; the master server then aggregates all performance data and persists it to the database.

Compared with the prior art, the invention has the following beneficial effects: in the data desensitizing device based on a distributed cluster, the thread master scheduler coordinates the thread schedulers of the slave servers, so that threads are allocated dynamically and loading performance improves; the distributed cluster of master and slave servers scales well; the table data is sliced uniformly, which improves extraction performance and enables high-speed data extraction; and a different loading method is adopted for each type of database to achieve high-speed loading.

Drawings

Fig. 1 is a schematic diagram of the desensitization process of a data desensitizing device based on a distributed cluster in an embodiment of the invention.

Detailed Description

The invention is further described below with reference to the drawings and examples.

Referring to Fig. 1, a data desensitizing device based on a distributed cluster according to an embodiment of the invention comprises a master server, a thread master scheduler and a plurality of slave servers. Each slave server is provided with a thread scheduler for allocating the threads of that slave server. The master server slices each source data table to be desensitized in a database and places the slices into the slice queue of that source data table. Through the thread master scheduler, the master server allocates a defined number of thread pipeline groups and a defined number of threads to each source data table; the thread master scheduler, via the thread schedulers of the slave servers, schedules threads to pull data from the slice queues, desensitize it, and load it into a target data table.

The data desensitizing device based on a distributed cluster in this embodiment uses one master server and three slave servers to desensitize 100 tables. The master server host acts as the overall coordinator of cluster scheduling and may be deployed as an active/standby pair for availability; it is responsible for scheduling, for slicing the database tables, and for managing the state of the slave servers. The three slave server hosts act as the task executors that desensitize the data; the number of executors can be expanded without limit, and each slave server is provided with a thread scheduler that is used for thread allocation.

Each slave server is allocated a total of 32 pipelines for desensitization tasks, and each pipeline starts 3 threads: the extraction thread is responsible for querying data, the desensitization thread for desensitizing data, and the loading thread for loading data. The master server is responsible for slicing the data of each table in turn into a plurality of slices.
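
The three-thread pipeline described above can be sketched as follows. This is a minimal single-process illustration in Python rather than the Java/JVM setting the description implies; the names `run_pipeline` and `SENTINEL`, the end-of-stream marker, and the in-memory list standing in for the target data table are all assumptions of the sketch, not part of the patent.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker used only in this sketch


def run_pipeline(slice_rows, desensitize, queue_size=100):
    """Run one extract -> desensitize -> load pipeline over the rows of one slice."""
    queue_one = queue.Queue(maxsize=queue_size)  # pipeline queue I
    queue_two = queue.Queue(maxsize=queue_size)  # pipeline queue II
    loaded = []  # stands in for the target data table

    def extract():
        for row in slice_rows:
            queue_one.put(row)  # blocks while queue I is full
        queue_one.put(SENTINEL)

    def mask():
        while (row := queue_one.get()) is not SENTINEL:
            queue_two.put(desensitize(row))  # blocks while queue II is full
        queue_two.put(SENTINEL)

    def load():
        while (row := queue_two.get()) is not SENTINEL:
            loaded.append(row)

    threads = [threading.Thread(target=stage) for stage in (extract, mask, load)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return loaded
```

Because each stage has exactly one producer and one consumer, rows leave the pipeline in the order they were extracted; the bounded queues are what keep a fast extraction thread from running ahead of the slower stages.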

The table data slicing principle is as follows:

ORACLE database: a number of ROWIDs are randomly sampled from the table through the sample() function and encapsulated into a plurality of query SQL statements, e.g., select * from tableA where rowid >= xxxx and rowid < xxxx.

Other relational databases: the maximum and minimum primary key values of the table are obtained through max(id) and min(id) on the primary key, and the id range is encapsulated into a plurality of query SQL statements, e.g., select * from tableA where id >= XXXXX and id < XXXXXXX.
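
Both slicing strategies amount to turning a list of boundaries into half-open range queries. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the ROWID boundaries are assumed to arrive already in ROWID order from the database, and the open-ended first and last ROWID slices are an addition of this sketch so that no rows fall outside the sampled boundaries. `CHARTOROWID` is the Oracle conversion function; the rest is plain string assembly.

```python
def rowid_slice_queries(table, boundaries):
    """Build one query per ROWID interval; `boundaries` must be in ROWID order."""
    if not boundaries:
        return [f"SELECT * FROM {table}"]
    # Open-ended first slice: everything below the first sampled ROWID.
    queries = [f"SELECT * FROM {table} WHERE ROWID < CHARTOROWID('{boundaries[0]}')"]
    for low, high in zip(boundaries, boundaries[1:]):
        queries.append(
            f"SELECT * FROM {table} WHERE ROWID >= CHARTOROWID('{low}') "
            f"AND ROWID < CHARTOROWID('{high}')"
        )
    # Open-ended last slice: everything from the last sampled ROWID upward.
    queries.append(
        f"SELECT * FROM {table} WHERE ROWID >= CHARTOROWID('{boundaries[-1]}')"
    )
    return queries


def id_slice_queries(table, min_id, max_id, slice_count):
    """Split the primary-key range [min_id, max_id] into at most slice_count queries."""
    total = max_id - min_id + 1
    step = -(-total // slice_count)  # ceiling division
    queries = []
    for low in range(min_id, max_id + 1, step):
        high = min(low + step, max_id + 1)  # half-open upper bound
        queries.append(f"SELECT * FROM {table} WHERE id >= {low} AND id < {high}")
    return queries
```

Note that splitting by id is only uniform in row count when the ids are densely populated; sparse or skewed keys would give uneven slices.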

The master server slices a table and puts all slices of the table into a slice queue; each table has its own slice queue. The slice queues are all placed in a shared space that every slave server can access. The master server notifies all slave servers that slice SQL statements are available to be pulled for desensitization, then goes on to slice the next table, and so on until all 100 tables have been sliced.

The first thread pipelines of the slave servers pull slices in turn from the slice queue of the first source data table until the first thread pipeline of every slave server has been assigned. After the first thread pipeline of a slave server has finished desensitizing the slice it pulled, it continues to pull slices in turn from the slice queue of the first source data table until the first source data table is completely desensitized, and then moves on to a new source data table. Following the ordering of the tables, the second thread pipeline of each slave server pulls data in turn from the slice queue of the second source data table, the third thread pipeline of each slave server pulls data in turn from the slice queue of the third source data table, and so on until all thread pipelines of the slave servers have been assigned. Whenever a source data table has been completely desensitized, the thread pipelines that served it move on to further tables according to the ordering of the tables, until all source data tables have been desensitized.

The thread pipelines of the slave servers may also be assigned in other ways. For example: slave server 1 has 32 pipelines. It starts pipeline 1 to pull one slice from table 1 and execute it, starts pipeline 2 to pull one slice from table 2, and so on, decrementing the number of available pipelines by 1 each time, until all 32 pipelines are in use and the number of available pipelines is 0. As soon as a pipeline finishes executing, a new pipeline is started to pull one slice from table 33, and so on until no slice remains to be pulled. The other slave servers poll the slices of the 100 tables by the same rule until all slices of the 100 tables have been pulled.
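
The alternative assignment rule above is essentially round-robin polling over the per-table slice queues. The following sketch simulates it for a single slave server in one process; `pull_next`, `initial_assignment`, and the use of in-memory deques in place of the shared slice queues are assumptions made for illustration only.

```python
from collections import deque


def pull_next(slice_queues, cursor):
    """Poll tables round-robin starting at `cursor`.

    Returns (table_index, slice, next_cursor), or None when every queue is empty.
    """
    table_count = len(slice_queues)
    for offset in range(table_count):
        index = (cursor + offset) % table_count
        if slice_queues[index]:
            return index, slice_queues[index].popleft(), (index + 1) % table_count
    return None


def initial_assignment(slice_queues, pipes=32):
    """Give each free pipeline one slice, moving to the next table each time."""
    cursor = 0
    assignment = []
    for pipe in range(pipes):
        pulled = pull_next(slice_queues, cursor)
        if pulled is None:  # no slices left for the remaining pipelines
            break
        table_index, table_slice, cursor = pulled
        assignment.append((pipe, table_index, table_slice))
    return assignment
```

When a pipeline finishes its slice, calling `pull_next` again with the saved cursor yields the next table's slice, which matches the "pull from table 33 once a pipeline frees up" behaviour in the example.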

Desensitization and loading are usually the slowest links and consume comparatively more CPU and memory resources. Therefore, in each of the 32 pipelines on a slave server, the 3 threads of the pipeline are flow-limited by means of the memory queue ArrayBlockingQueue. By default the queue limits flow with an item counter set to 10,000 rows; once the queue holds 10,000 rows, the extraction thread is blocked. An item counter, however, cannot measure the actual size of the data: when a row contains a large-text CLOB/BLOB field, memory usage may exceed expectations and cause a memory overflow. The queue is therefore optimized by adding a capacity counter, so that flow can also be limited by size. By default, 10,000 rows are reckoned as 64 MB; if the queued data exceeds 64 MB, the extraction thread is blocked, which effectively avoids the memory overflow problem.

After the extraction thread has queried the data, it puts the data into memory queue I, and the desensitization thread takes the data from memory queue I and desensitizes it. Take an identity card number as an example: digits 1 to 6 are the address code; digits 7 to 14 are the date of birth (year, month and day); digits 15 to 17 are the sequence code, which numbers the persons born on the same year, month and day within the area identified by the same address code, with odd numbers assigned to men and even numbers to women; digit 18 is the check code, computed using the ISO 7064:1983 MOD 11-2 check character system.
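
The capacity-based flow limiting described at the start of this passage can be illustrated with a queue that tracks both an item count and a byte count. This is a Python sketch of the idea, not the optimized ArrayBlockingQueue itself; the class name, the caller-supplied `size` argument, and the rule that a single oversized item is admitted into an empty queue (to avoid blocking forever) are assumptions of the sketch.

```python
import threading
from collections import deque


class CapacityBoundedQueue:
    """Blocking queue limited by item count and by total payload size in bytes."""

    def __init__(self, max_items=10_000, max_bytes=64 * 1024 * 1024):
        self._items = deque()
        self._bytes = 0
        self._max_items = max_items
        self._max_bytes = max_bytes
        self._cond = threading.Condition()

    def put(self, item, size):
        """Block the producer while either limit would be exceeded."""
        with self._cond:
            # An empty queue always admits one item, even an oversized one.
            while self._items and (
                len(self._items) >= self._max_items
                or self._bytes + size > self._max_bytes
            ):
                self._cond.wait()
            self._items.append((item, size))
            self._bytes += size
            self._cond.notify_all()

    def get(self):
        """Block the consumer while the queue is empty, then release capacity."""
        with self._cond:
            while not self._items:
                self._cond.wait()
            item, size = self._items.popleft()
            self._bytes -= size
            self._cond.notify_all()
            return item
```

With the defaults, a producer is blocked either at 10,000 queued rows or at 64 MB of queued data, whichever comes first, so a few rows with very large CLOB/BLOB fields cannot fill the heap unnoticed.
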
Digits 1 to 6 are desensitized using a seed plus an address dictionary: an address code is looked up in a fixed dictionary by means of the seed. Digits 7 to 14 are desensitized using a seed plus a reasonable date range: the year is selected by the seed from within 100 years, the month is selected by the seed from the 12 months, and the day is selected by the seed from 30 days and automatically matched to the month. The sequence code in digits 15 to 17 is selected by the seed from the police station codes of the locality, and the check code in digit 18 is calculated from the first 17 digits according to the standard formula. Such complex desensitization algorithms consume CPU resources, and this is where a distributed cluster, with multiple servers computing simultaneously, brings its capacity and high performance into play. When the desensitization thread has finished desensitizing the data, it puts the data into memory queue II.
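
As an illustration of the identity-number example, the sketch below derives a replacement number deterministically from a seed and recomputes the MOD 11-2 check character. The weights and check-character table are the standard ones for 18-digit identity numbers; everything else (the function names, the three-entry address dictionary, the `base_year` parameter, seeding a generator with the seed plus the original number, clamping the day with `calendar.monthrange`, and picking the sequence code directly from the generator rather than from a police station code table) is an assumption of this sketch.

```python
import calendar
import random

WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"  # indexed by (weighted sum mod 11)

# Hypothetical, tiny address dictionary; a real one would be far larger.
ADDRESS_CODES = ("110101", "310104", "440305")


def check_char(first17):
    """Compute the ISO 7064 MOD 11-2 check character for the first 17 digits."""
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def desensitize_id(id_number, seed, base_year=2020):
    """Map an 18-character identity number to a format-valid replacement.

    The same (seed, id_number) pair always yields the same output.
    """
    rng = random.Random(f"{seed}:{id_number}")
    address = rng.choice(ADDRESS_CODES)          # digits 1-6
    year = base_year - rng.randrange(100)        # within 100 years
    month = rng.randrange(1, 13)
    days_in_month = calendar.monthrange(year, month)[1]
    day = rng.randrange(1, days_in_month + 1)    # day matched to the month
    sequence = rng.randrange(1, 1000)            # digits 15-17
    first17 = f"{address}{year:04d}{month:02d}{day:02d}{sequence:03d}"
    return first17 + check_char(first17)
```

Deriving the generator from the seed and the original value keeps the mapping repeatable, so the same source number desensitizes to the same value wherever it appears, which is one common reason for seed-based schemes.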

The loading thread of each pipeline of each slave server obtains data from memory queue II, and a different high-speed loading technique is adopted according to the characteristics of each database.

ORACLE database: batch loading is combined with direct-path loading. Using the JDBC driver cache, every 5000 rows are assembled into a batch of statements of the form INSERT /*+ APPEND_VALUES(A) */ INTO T_APPEND A (ID, NAME) VALUES (3, 'ABC') and sent to the ORACLE server together to achieve high-speed loading.

MYSQL database: text-based high-speed loading is adopted. The data is written to a text file, compression is enabled on the JDBC driver connection, and the following SQL statement is executed: LOAD DATA INFILE 'f/Book1.csv' INTO TABLE test_Book FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY ',' LINES TERMINATED BY '\r\n' IGNORE 1 LINES (id, name, data) to achieve high-speed loading.
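
The two loading strategies can be sketched as follows. This is an illustration only: the description uses JDBC, whereas the sketch assumes a generic Python DB-API cursor (`executemany` is defined by PEP 249); the positional bind style `:1, :2`, the helper names, and the choice of a double quote as the enclosing character are assumptions, and whether the direct-path hint takes effect depends on the driver and the database.

```python
import csv
import io
from itertools import islice

# Direct-path insert with bind placeholders; the bind style is an assumption.
ORACLE_INSERT = "INSERT /*+ APPEND_VALUES */ INTO T_APPEND (ID, NAME) VALUES (:1, :2)"


def batches(rows, batch_size=5000):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_oracle(cursor, rows, batch_size=5000):
    """Send rows to the database one batch at a time through a DB-API cursor."""
    for batch in batches(rows, batch_size):
        cursor.executemany(ORACLE_INSERT, batch)


def render_load_text(header, rows):
    """Render rows as the comma-separated text a LOAD DATA statement would read."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\r\n")
    writer.writerow(header)   # skipped on load via IGNORE 1 LINES
    writer.writerows(rows)
    return buffer.getvalue()


def mysql_load_statement(path, table, columns):
    """Build a LOAD DATA statement matching the text produced above."""
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\r\\n' IGNORE 1 LINES "
        f"({', '.join(columns)})"
    )
```

Batching trades memory for fewer round trips: 5000 rows per batch, as in the description, keeps each network exchange large without holding a whole slice in memory.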

In addition, each table is partitioned, and different partitions are mapped to different disks to balance I/O. A plurality of disks can thus be written simultaneously at high speed, data is loaded concurrently, and distributed desensitization reaches its highest performance.

Each slave server periodically reports its execution state and heartbeat information to the master server. The master server collects the execution states of all slave servers and computes real-time statistics until the data of all 100 tables has been processed; the master server then aggregates all performance data, persists it to the database, and the task ends.
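
A minimal sketch of the master-side bookkeeping for these reports is shown below. The class name `MasterMonitor`, the shape of a report (a server id plus a cumulative row count), and the 30-second heartbeat timeout are all assumptions; the description does not specify the report format or any timeout.

```python
import time


class MasterMonitor:
    """Collects slave status reports and aggregates real-time progress."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout
        self._reports = {}  # server_id -> (rows_done, last_seen)

    def report(self, server_id, rows_done, now=None):
        """Record the latest cumulative row count and heartbeat time of a slave."""
        seen = time.monotonic() if now is None else now
        self._reports[server_id] = (rows_done, seen)

    def total_rows_done(self):
        """Sum the latest progress reported by every slave server."""
        return sum(rows for rows, _ in self._reports.values())

    def silent_servers(self, now=None):
        """Return the slaves whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(
            server_id
            for server_id, (_, seen) in self._reports.items()
            if now - seen > self.heartbeat_timeout
        )
```

The optional `now` argument exists only so that the timing logic can be exercised deterministically; in normal use the monotonic clock is read on each call.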

In summary, in the data desensitizing device based on a distributed cluster of this embodiment, the thread master scheduler coordinates the thread schedulers of the slave servers, so that threads are allocated dynamically and loading performance improves; the distributed cluster of master and slave servers scales well; the table data is sliced uniformly, which improves extraction performance and enables high-speed data extraction; and a different loading method is adopted for each type of database to achieve high-speed loading.

While the invention has been described with reference to the preferred embodiments, it is not intended to limit the invention thereto, and it is to be understood that other modifications and improvements may be made by those skilled in the art without departing from the spirit and scope of the invention, which is therefore defined by the appended claims.

Claims

1. A data desensitizing device based on a distributed cluster, characterized by comprising a master server, a thread master scheduler and a plurality of slave servers, wherein each slave server is provided with a thread scheduler for allocating the threads of that slave server, and the master server slices each source data table to be desensitized in a database and places the slices into the slice queue of that source data table; through the thread master scheduler, the master server allocates a defined number of thread pipeline groups and a defined number of threads to each source data table, and the thread master scheduler, via the thread schedulers of the slave servers, schedules threads to pull data from the slice queues, desensitize it, and load it into a target data table;

the threads of the slave server comprise extraction threads, desensitization threads and loading threads; each group of thread pipelines consists of a corresponding extraction thread, desensitization thread and loading thread, which pass data to one another through queues to form a serial thread pipeline; the extraction thread and the desensitization thread pass data through pipeline queue I, and the desensitization thread and the loading thread pass data through pipeline queue II;

the step in which the slave server pulls data from the slice queue for desensitization specifically comprises: the extraction thread of the thread pipeline reads data from the slice queue and sends it to pipeline queue I; the desensitization thread of the thread pipeline pulls data from pipeline queue I, desensitizes it, and passes the desensitized data to pipeline queue II; the loading thread of the thread pipeline pulls data from pipeline queue II and loads it into the target data table;

when the slave server loads the desensitized data into the target data table: if the database is an ORACLE relational database, batch direct-path loading is adopted, in which the JDBC driver cache is used to assemble a plurality of rows and send them to the database server together; and if the database is a MYSQL relational database, text loading is adopted, in which the data is written to a text file and loaded over a JDBC driver connection with compression enabled.

2. The data desensitizing device based on a distributed cluster according to claim 1, wherein the total number of thread pipelines of each slave server is 32, and the thread master scheduler, via the thread scheduler of each slave server, schedules threads to pull data from the slice queues of the tables for desensitization, table by table in the order of the tables.

3. The data desensitizing device based on a distributed cluster according to claim 2, wherein the first thread pipelines of the slave servers pull slices in turn from the slice queue of the first source data table until the first thread pipeline of every slave server has been assigned; after the first thread pipeline of a slave server has finished desensitizing the slice it pulled, it continues to pull slices in turn from the slice queue of the first source data table until the first source data table is completely desensitized, and then moves on to a new source data table; following the ordering of the tables, the second thread pipeline of each slave server pulls data in turn from the slice queue of the second source data table, and the third thread pipeline of each slave server pulls data in turn from the slice queue of the third source data table, until all thread pipelines of the slave servers have been assigned; whenever a source data table has been completely desensitized, the thread pipelines that served it move on to further tables according to the ordering of the tables, until all source data tables have been desensitized.

4. The data desensitizing device based on a distributed cluster according to claim 1, wherein the master server slices the source data tables in the database uniformly; if the database is an ORACLE relational database, then for each source data table the sample() function of ORACLE is used to take N physical storage addresses (ROWIDs) uniformly from the table, with N adjusted dynamically according to the size of the table until a suitable number has been extracted, the ROWIDs are then sorted and paired into intervals, dividing the table into a plurality of slices, and once slicing is complete all slices of the table are put into the slice queue of the table; if the database is a MYSQL relational database, then for each source data table the maximum and minimum primary key values are obtained through max(id) and min(id) on the primary key, and the table is divided uniformly by id into a plurality of slices that are then put into the slice queue of the table; a normal table is sliced once as a whole, and for a partitioned table each partition is sliced once.

5. The data desensitizing device based on a distributed cluster according to claim 1, wherein the slave servers load the desensitized data into the target data tables concurrently; the target data tables are stored in different tablespaces, and different target data tables are mapped to different tablespace disks to balance I/O; each target data table is partitioned, and different partitions are mapped to different disks to balance I/O, so that a plurality of disks are written simultaneously at high speed.

6. The data desensitizing device based on a distributed cluster according to claim 1, wherein each slave server periodically reports its execution state and heartbeat information to the master server; the master server collects the execution states of all slave servers and computes real-time statistics until all data of all source data tables has been processed, and the master server then aggregates all performance data and persists it to the database.