CN107391306B

CN107391306B - Heterogeneous database backup file recovery method

Info

Publication number: CN107391306B
Application number: CN201710622124.7A
Authority: CN
Inventors: 刘赛; 杨华飞; 聂庆节; 刘嘉华; 刘军; 张磊; 马悦皎; 缪骞云; 张翼; 张迎星
Original assignee: State Grid Corp of China SGCC; Nari Information and Communication Technology Co; Nanjing NARI Group Corp; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Nari Information and Communication Technology Co; Nanjing NARI Group Corp; Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2017-07-27
Filing date: 2017-07-27
Publication date: 2019-12-10
Anticipated expiration: 2037-07-27
Also published as: CN107391306A

Abstract

The invention discloses a heterogeneous database backup file recovery method comprising the following steps: normalizing and converting the data in a heterogeneous source database; clustering the data blocks as the preprocessing step of a DELTA compression algorithm based on K-medoids clustering, so that data blocks with high similarity are grouped into one class; compressing the data blocks of each class with the DELTA compression algorithm; and restoring the data with an SQL reproduction method, in which the database version at the restoring end is read from a configuration file, the metadata model is converted according to conversion rules into SQL statements supported by a database of that version, and the SQL statements are imported into the database after a data consistency check, thereby providing stable backup and recovery of heterogeneous databases. By extending the mapping rules, the invention can support a variety of source databases; it realizes backup of heterogeneous databases, supports efficient file compression, and reduces backup cost.

Description

Heterogeneous database backup file recovery method

Technical Field

The invention relates to a heterogeneous database backup file recovery method, and belongs to the technical field.

Background

In recent years, with the development of information technology, information management systems have come into wide use. Being fast, efficient and convenient, they have become platforms for publishing and exchanging information and have further promoted the digitization and informatization of society as a whole; together, these information systems make up today's "information world".

No industry can develop without data, such as product data, customer data and financial data, and the survival and growth of enterprises depend increasingly on IT systems. Computer viruses, network intrusion, physical damage, operator error and similar causes can damage information data on a large scale, so that an information system can no longer provide normal service; in industries with direct economic impact, such as banking, electric power and communications, this causes huge economic losses. Protecting data by means of backup allows local data to be recovered quickly after a fault occurs.

Database backup research is a demand-driven field that large corporations entered relatively early, and some backup technologies have been in use in a variety of application environments for a considerable time. Research on backup software outside China began in the mid-1980s, and mature commercial backup products now include Tivoli from EMC, NetVault from BakBone, BrightStor from CA, and others.

The software research institute of Zhongshan University, together with cantonese network technologies ltd., developed NetBunker2 for network backup and recovery on Linux backup servers. Heartone Backup Enterprise, from the Zhongshan same-direction company, provides distributed backup, realizes intelligent backup and recovery, and simplifies server and network storage environments.

In the open source field, backup software is developing vigorously and a large number of excellent open source backup programs have appeared; some of the better known are Amanda, Bacula, BackupPC, Restore and Burt. Although their technology is open, open source programs support only the most basic backup functions and are not suited to commercial scenarios. Theoretical study of some commercial functions is therefore necessary.

As enterprises develop, their data becomes large in volume, wide in origin, varied in type and complex in structure. Enterprises accumulate a large amount of business data that is very important to their normal operation, and because different database systems were used at different stages, how to back up heterogeneous data has become a key problem in the field of data backup. Although some large databases such as Oracle and SQL Server provide their own backup and restore tools, these tools support backup of only a single kind of database and do not solve the heterogeneity problem in the database backup process.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a heterogeneous database backup file recovery method, which solves the technical problem that heterogeneous data cannot be effectively backed up and recovered in the prior art.

To solve this technical problem, the invention adopts the following technical scheme: a heterogeneous database backup file recovery method comprising the following steps:

(1) Normalizing and converting data in a heterogeneous source database;

(2) Performing clustering preprocessing on the data blocks, compressing the data blocks of the same class with a DELTA compression algorithm to generate the corresponding binary storage files, and backing up the compressed backup files to a backup medium;

(3) Restoring the metadata in the backup file with an SQL (Structured Query Language) reproduction method, and reading the database version of the restoring end from the configuration file;

(4) Converting the metadata model into SQL statements supported by a database of the corresponding version according to the conversion rules, performing a data consistency check, and importing the SQL statements into the database, thereby recovering the backup file of the heterogeneous database.

The specific method of the step (1) is as follows:

101. Loading a driver: importing a driver into a development environment, and loading the driver through a class.

102. Creating a connection: after the driver is loaded, creating a database connection object through the getConnection() function of DriverManager, wherein the connection information comprises: protocol name, IP address, port number and database name;

103. Creating a Statement object: creating a Statement object through the createStatement() function of Connection;

104. Executing the SQL statement: executeQuery() is used when an SQL statement produces a single result set; executeUpdate() is used when no result is returned; execute() is used when multiple result sets are returned;

105. Obtaining the result: when execute() or executeQuery() of the Statement is executed, the returned result is a ResultSet object, and the data in it is read by advancing the cursor that points into the object with the next() function;

106. Loading a conversion rule according to the type of the database, and converting the heterogeneous data into metadata of a unified standard through an excData() function, wherein each element of the metadata comprises a key field, the identifier, which is used to check data consistency during recovery; if the metadata is changed during the backup process, the identifier is set to 1;

107. Writing the obtained data to a file in XML format through a wrtData() function to generate the corresponding backup file;

108. Closing the connection: when the database is no longer in use, closing the database connection with the close() method.
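The extraction steps above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for the JDBC driver and connection, and the names TYPE_RULES, extract_table and write_backup, the XML layout, and the sample table are all hypothetical; they are not the excData() and wrtData() functions of the method.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical conversion rule: source column type -> unified metadata type.
TYPE_RULES = {"INTEGER": "integer", "REAL": "real", "TEXT": "character", "BLOB": "byte"}


def extract_table(conn, table):
    """Steps 103-106: run a query, walk the result set, normalize each field."""
    cur = conn.cursor()                        # plays the role of the Statement object
    cur.execute(f"PRAGMA table_info({table})")
    columns = [(row[1], row[2].upper()) for row in cur.fetchall()]
    cur.execute(f"SELECT * FROM {table}")      # one result set, as with executeQuery()
    fields = []
    for record in cur:                         # cursor iteration, as with next()
        for (name, src_type), value in zip(columns, record):
            fields.append({
                "table": table,
                "name": name,
                "type": TYPE_RULES.get(src_type, "character"),
                "value": "" if value is None else str(value),
                "identifier": "1",             # assumed: 1 marks data pending restore
            })
    return fields


def write_backup(fields):
    """Step 107: serialize the normalized fields as XML text."""
    root = ET.Element("backup")
    for f in fields:
        node = ET.SubElement(root, "field", table=f["table"], name=f["name"],
                             type=f["type"], identifier=f["identifier"])
        node.text = f["value"]
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")             # steps 101-102: driver and connection
conn.execute("CREATE TABLE meter (id INTEGER PRIMARY KEY, reading REAL, site TEXT)")
conn.execute("INSERT INTO meter VALUES (1, 3.5, 'north')")
fields = extract_table(conn, "meter")
xml_text = write_backup(fields)
conn.close()                                   # step 108
```

Keeping the type normalization in a lookup table means that supporting another source database would, in this sketch, only require another rule table rather than new extraction code.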

The metadata is the minimum unit of the data model, and its structure is expressed by formula (1):

M = CS + SS (1)

wherein CS is the content structure, which defines the constituent elements of the metadata and their content, and SS is the syntax structure, which defines the format structure of the metadata and the specific description method;

The content structure expression is shown as formula (2):

CS = (T, Z, S, F) (2)

T represents the source table, i.e. the table structure of the multi-source database; it stores the table structure information of the data to be backed up, comprising: source table serial number, source table name, identifier, number of fields, field names and field type information;

Z represents a field, i.e. a data value of the multi-source database; it stores the specific value of a field in a table, comprising: field serial number, field name, field type, field value, table name and identifier;

S represents the predetermined set, which is the basic unit of backup and comprises a predetermined set number, source server, target server, start time, end time, backup serial number, source table serial number and field serial number; it is used to define the backup objects and to subdivide the backup process into units, so that when a backup task is interrupted it can be continued from the point of interruption;

F represents the constraint; the constraint element describes the field constraint information of a table and is used to record the special column information of the table, comprising a table name, constraint serial number, primary key column name, foreign key column name, index column name and identifier.

The special column information is recorded separately in order to give a complete description of the table structure.
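One possible in-memory shape for the content structure CS = (T, Z, S, F) is sketched below with Python dataclasses. The class and attribute names are hypothetical, the string-typed times and the default identifier of 1 are assumptions, and the syntax structure SS (the XML layout) is not modelled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceTable:            # T: structure of one source table
    serial: int
    name: str
    field_names: List[str]
    field_types: List[str]
    identifier: int = 1       # assumed: 1 = still to be restored, 0 = restored

    @property
    def field_count(self) -> int:
        return len(self.field_names)


@dataclass
class Field:                  # Z: one stored field value
    serial: int
    name: str
    type: str
    value: str
    table_name: str
    identifier: int = 1


@dataclass
class PredeterminedSet:       # S: basic unit of backup
    number: int
    source_server: str
    target_server: str
    start_time: str
    end_time: str
    backup_serial: int
    table_serials: List[int] = field(default_factory=list)
    field_serials: List[int] = field(default_factory=list)


@dataclass
class Constraint:             # F: special columns of one table
    table_name: str
    serial: int
    primary_key: str = ""
    foreign_key: str = ""
    index_column: str = ""
    identifier: int = 1


@dataclass
class ContentStructure:       # CS = (T, Z, S, F)
    tables: List[SourceTable] = field(default_factory=list)
    fields: List[Field] = field(default_factory=list)
    sets: List[PredeterminedSet] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)


cs = ContentStructure()
cs.tables.append(SourceTable(1, "meter", ["id", "site"], ["integer", "character"]))
cs.fields.append(Field(1, "site", "character", "north", "meter"))
cs.constraints.append(Constraint("meter", 1, primary_key="id"))
cs.sets.append(PredeterminedSet(1, "src-host", "dst-host", "", "", 1, [1], [1]))
```

Because a predetermined set refers to tables and fields only by serial number, an interrupted task could in principle be resumed by looking up which serial numbers of the set have not yet been processed.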

The specific method of step (2) is as follows:

201. Segmenting the file to be compressed, using 1 MB as the dividing unit; performing Delta compression between every pair of the resulting file blocks, and storing the size of each Delta-compressed result in a temporary matrix arr_DELTA[N][N], which serves as the similarity between data blocks;

202. Clustering the data blocks with a K-medoids clustering algorithm, using the similarity information stored in the similarity matrix as the clustering basis;

203. Selecting a feature set from the file by a content-independent method, and determining the number of intermediate fingerprints to generate and the file size according to the size of the allocable memory;

204. Setting the size of a sliding window, moving the window forward continuously, calculating the data fingerprint under the moving window, and mapping the data fingerprints into super-features, or a super-fingerprint set, with a hash function;

205. If the super-fingerprints match, searching the feature database for the reference file with the highest similarity, and, once it is found, compressing against it with the compression function D;

206. Encoding with the compression function D as an ordered string of commands: the ADD command, in the format (ADD, L, S), adds a character string S of length L at the specified position in V; the COPY command, in the format (COPY, L, O), copies a character string of length L at offset O from R to the specified position in V;

207. Recombining the compressed data blocks into a backup file.
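The ADD and COPY commands of step 206 can be illustrated with a small encoder and decoder. This is a sketch, not the compression function D itself: it assumes that R is the reference data and V the data being encoded, it works on text strings for readability, and it borrows the matching logic of Python's difflib.SequenceMatcher in place of whatever matching D actually performs. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def delta_encode(reference, version):
    """Describe `version` (V) as ADD/COPY commands relative to `reference` (R)."""
    commands = []
    matcher = SequenceMatcher(None, reference, version, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # (COPY, L, O): copy L characters starting at offset O of R.
            commands.append(("COPY", i2 - i1, i1))
        elif tag in ("replace", "insert"):
            # (ADD, L, S): append the literal string S of length L.
            commands.append(("ADD", j2 - j1, version[j1:j2]))
        # "delete" opcodes need no command: that part of R is simply not copied.
    return commands


def delta_decode(reference, commands):
    """Rebuild V by replaying the commands in order against R."""
    parts = []
    for name, length, arg in commands:
        if name == "COPY":
            parts.append(reference[arg:arg + length])
        else:
            parts.append(arg)
    return "".join(parts)


R = "the quick brown fox"
V = "the quick red fox jumps"
commands = delta_encode(R, V)
rebuilt = delta_decode(R, commands)
```

The more of V that can be expressed as COPY commands, the smaller the encoded result, which is why the size of the delta is usable as a similarity measure between blocks.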

The specific method of compressing the data blocks of the same class with the DELTA compression algorithm is as follows:

Partitioning the backup file into blocks and denoting the set of data blocks as S = {S1, S2, S3, …, Sn}; the data objects in S are to be clustered into K classes C = {C1', C2', C3', …, Ck'}, and the similarity between two data blocks is expressed as the DELTA distance between them, namely:

dist(Si, Sj) = delta(Si, Sj) (3)

Randomly selecting K data blocks in S as the center points of the clusters, denoted {m1, m2, m3, …, mk}, and assigning the points representing the remaining data blocks to the nearest cluster, to obtain the clusters C = {C1, C2, C3, …, Ck};

For each cluster Ci, i ∈ {1, 2, 3, …, k}, traversing the j-th non-center object Sj in the cluster and calculating, by formula (4), the total cost between each data block Sj and the remaining data blocks Sk in the cluster,

and selecting the point with the minimum total cost in each cluster as the center point of the new cluster; these steps are iterated until the center point of every cluster no longer changes, finally yielding K clusters C = {C1', C2', C3', …, Ck'}.
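A minimal sketch of this clustering loop follows. Two assumptions are made and should be read as such: the delta distance is approximated by the size of one block deflated with the other as a preset zlib dictionary, and the total cost of a candidate center is taken to be the sum of its distances to the other members of its cluster, which is the usual K-medoids cost but is not confirmed by the text, since formula (4) is not reproduced in it. The function names and the optional `initial` parameter are hypothetical.

```python
import random
import zlib


def delta_size(reference, target):
    """Stand-in for delta(Si, Sj): size of target deflated with reference as dictionary."""
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())


def distance_matrix(blocks):
    """Pairwise delta sizes, with zero on the diagonal."""
    n = len(blocks)
    return [[0 if i == j else delta_size(blocks[i], blocks[j]) for j in range(n)]
            for i in range(n)]


def k_medoids(dist, k, initial=None, seed=0, max_iter=100):
    """Return {center index: [member indices]} for a precomputed distance matrix."""
    n = len(dist)
    if initial is not None:
        medoids = list(initial)
    else:
        medoids = random.Random(seed).sample(range(n), k)
    clusters = {}
    for _ in range(max_iter):
        # Assignment: every block joins the cluster of its nearest center point.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update: the member with the smallest total cost becomes the new center.
        new_medoids = [min(members, key=lambda c: sum(dist[c][o] for o in members))
                       for members in clusters.values()]
        if new_medoids == medoids:      # no center changed: converged
            break
        medoids = new_medoids
    return clusters


rng = random.Random(1)


def noise(n):
    return bytes(rng.getrandbits(8) for _ in range(n))


base_a, base_b = noise(2000), noise(2000)
blocks = [base_a, base_a[:1990] + noise(10), base_b, base_b[:1990] + noise(10)]
dist = distance_matrix(blocks)
clusters = k_medoids(dist, k=2)
```

Because K-medoids only needs pairwise distances and always picks an actual block as the center, the chosen center can serve directly as the reference block for delta-compressing the rest of its cluster.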

The specific method of step (3) is as follows:

301. Reading the type and version number of the database at the recovery end, and loading the corresponding mapping rules according to the database version;

302. Reading the predetermined set serial number of the corresponding task from the recovery task information, and looking up the source table serial number, constraint serial number and field serial number to be recovered according to the predetermined set serial number;

303. Looking up the corresponding source table element and constraint element in the metadata according to the source table serial number and the constraint serial number, and checking the content of the corresponding identifier: if the identifier is 1, executing step 304; otherwise executing step 305;

304. Acquiring the specific information of the source table and the constraints, comprising: table name, field names in the table, field types, primary key, foreign key and index; generating the corresponding SQL statement and storing it in an sql file; and setting the identifier to 0 once the file has been generated;

305. Acquiring the corresponding field element according to the field serial number and checking the content of its identifier: if the identifier is 1, executing step 306; otherwise executing step 307;

306. Acquiring the specific field information, comprising the field name, field type, field value and the name of the source table to which the field belongs; generating the corresponding INSERT statement from this information to add the data, and storing it in the sql file; and setting the identifier to 0 once the file has been generated;

307. Executing the sql file through a control command to restore the data to the database.

When the SQL reproduction method is used to restore the metadata in the backup file, the value of the identifier in the metadata file is checked first:

If the identifier is 1, the data has not yet been recovered, and the content of the backup file is converted into SQL statements by applying the syntax mapping rules in reverse;

If the identifier is 0, the content was already restored to the database in a previous restoration task, and no conversion or restoration is needed.
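The identifier check can be sketched as a small SQL generator. Everything named here is hypothetical: the TYPE_MAP rules, the dictionary layout of the metadata, and the grouping of field values into one row per INSERT are simplifying assumptions made for illustration, not the mapping rules or file layout of the method.

```python
# Hypothetical mapping rules: unified metadata type -> column type of the target database.
TYPE_MAP = {
    "mysql": {"character": "VARCHAR(255)", "integer": "INT", "real": "DOUBLE"},
    "oracle": {"character": "VARCHAR2(255)", "integer": "NUMBER(10)",
               "real": "BINARY_DOUBLE"},
}


def literal(value):
    """Render a Python value as an SQL literal (strings quoted, quotes doubled)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def restore_sql(tables, rows, target):
    """Emit SQL only for elements whose identifier is 1, then reset it to 0."""
    rules = TYPE_MAP[target]
    statements = []
    for table in tables:                       # table structure and constraints
        if table["identifier"] != 1:
            continue                           # restored by an earlier task
        cols = ", ".join(f"{name} {rules[kind]}" for name, kind in table["columns"])
        if table.get("primary_key"):
            cols += f", PRIMARY KEY ({table['primary_key']})"
        statements.append(f"CREATE TABLE {table['name']} ({cols});")
        table["identifier"] = 0
    for row in rows:                           # field values
        if row["identifier"] != 1:
            continue
        names = ", ".join(row["values"])
        values = ", ".join(literal(v) for v in row["values"].values())
        statements.append(f"INSERT INTO {row['table']} ({names}) VALUES ({values});")
        row["identifier"] = 0
    return statements


tables = [{"name": "meter", "columns": [("id", "integer"), ("site", "character")],
           "primary_key": "id", "identifier": 1}]
rows = [
    {"table": "meter", "values": {"id": 1, "site": "north"}, "identifier": 1},
    {"table": "meter", "values": {"id": 2, "site": "south"}, "identifier": 0},
]
sql = restore_sql(tables, rows, "mysql")
```

Resetting each identifier to 0 after its statement has been generated makes the generator idempotent: running it a second time over the same metadata produces no statements, so an interrupted restore can be resumed without importing the same data twice.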

Compared with the prior art, the invention has the following beneficial effects:

The invention designs a universal metadata model, defines the mapping rules between that model and the data in the current mainstream databases Oracle, MySQL and PostgreSQL, normalizes the data into metadata and stores the metadata in an XML file;

An improved DELTA compression algorithm is provided that deduplicates the backup files and reduces backup cost;

The information-island problem caused by heterogeneous databases within an enterprise can be solved, a consistent backup framework oriented to enterprise requirements is provided, the utilization of backup media is improved, and backup cost is reduced;

For a recovery task, the metadata is converted into SQL statements supported by the database of the specified version according to the database configuration, and the data is imported into the database by executing the SQL statements; during recovery, data is restored selectively according to the modification marks in the source data model, which ensures data consistency.

Drawings

FIG. 1 is a schematic diagram of a backup system hierarchy;

FIG. 2 is a flow chart of the present invention;

FIG. 3 is a heterogeneous data extraction flow diagram;

FIG. 4 is a flow chart of data compression based on K-medoids clustering;

Fig. 5 is a data recovery flow chart.

Detailed Description

The invention provides a heterogeneous database backup file recovery method comprising the following steps: designing a metadata model, normalizing and converting the data in the heterogeneous source database, and storing the metadata model in an XML file; providing a DELTA compression algorithm based on K-medoids clustering, in which the data blocks are first clustered so that blocks with high similarity are grouped into one class, and the data blocks of each class are then compressed with the Delta compression algorithm; and restoring the data with an SQL reproduction method, in which the database version at the restoring end is read from a configuration file, the metadata model is converted according to conversion rules into SQL statements supported by a database of that version, and the SQL statements are imported into the database after a data consistency check, thereby providing stable backup and recovery of heterogeneous databases.

The invention is further described below with reference to the accompanying drawings. The following examples serve only to illustrate the technical solutions of the invention more clearly and do not limit its scope of protection.

The backup system has three functions: data extraction, data processing and data recovery. Data extraction uses the metadata model to describe the data types of different databases in a unified way, and extracts the source data and stores it in a backup file according to the backup task. Data processing uses a compression algorithm to compress the repeated content in the backup file, generates the corresponding binary storage file, and backs up the compressed backup file to a backup medium. Data recovery converts the metadata in the backup file with the SQL reproduction method to generate SQL files that databases of various versions can execute, and finally imports the data into the database system.

As shown in FIG. 1, the backup system is divided into three layers: a common connection layer, a business layer and an application layer.

(1) Common connection layer

The common connection layer is at the bottom of the system. It implements the database connection function, provides database connection and query services to the business layer, and provides encryption and decryption when databases with a higher security level are backed up, so as to guarantee a reliable connection between the layer and the heterogeneous data sources. Connections to the different databases are established mainly through JDBC.

(2) Business layer

The business layer implements the core functions of the system; every stage of database backup and recovery is implemented in this layer. Data conversion maps metadata and database data to each other and, through the mapping rules, hides the differences in data format, constraint rules and SQL syntax among heterogeneous databases, which is the difficult point in backing up heterogeneous databases.

The data compression function uses the DELTA compression algorithm based on K-medoids clustering. It is twice as efficient as the basic DELTA compression algorithm and can compress backup files to about one quarter of their original size, which reduces backup cost while increasing backup speed.

The consistency detection function protects data reliability by ensuring that, after a recovery task has been executed, the content of the database is the same as the content of the backup.

These functions depend on one another in the functional flow. In the backup stage, data conversion is performed first, and the converted data is then compressed and stored on the backup medium. In the recovery stage, the compressed file is first restored to a data file by decompression, the data content to be recovered is determined by checking the identifiers in the data, and that content is then converted into SQL statements by the conversion rules and imported into the database.

(3) Application layer

The application layer uses the services provided by the business layer and the common connection layer to solve practical problems; it mainly comprises the backup and recovery tasks or plans customized by the user. The interface of this layer is designed with Qt, which ensures the portability and extensibility of the whole system.

As shown in FIGS. 2 to 5, the heterogeneous database backup file recovery method provided by the invention specifically comprises the following steps:

(1) The specific method of normalizing and converting the data in the heterogeneous source database is as follows:

102. Creating a connection: after the driver is loaded, a database connection object is created through the getConnection() function of DriverManager, for example Connect = DriverManager.getConnection("url", "UserName", "PassWord"). Although the URLs of different databases have different formats, each should contain the protocol name, IP address, port number, database name and similar information. UserName and PassWord are the user name and password used to connect to the database management system;

103. Creating a Statement object: a Statement object is created through the createStatement() function of Connection; the Statement class is mainly used to execute SQL statements and obtain the results generated by their execution;

104. Executing the SQL statement: the Statement methods for executing SQL statements are mainly executeQuery(), executeUpdate() and execute(). executeQuery() is used when an SQL statement produces a single result set, executeUpdate() when no result is returned, and execute() when multiple result sets are returned;

108. Closing the connection: to avoid wasting resources, the database connection is closed with the close() method when the database is no longer in use.

The data formats of the various heterogeneous databases differ from the metadata format. The data formats of current mainstream databases offer powerful syntax and a rich set of data types; for example, the basic character types of Oracle include CHAR, VARCHAR2, NCHAR and NVARCHAR2. At the same time, a given data type may not be supported by every database system, so mapping rules are set up for conversion. The data mapping rules, also called the metadata dictionary, are the basis for normalizing heterogeneous data. They are designed around the data types of the heterogeneous source databases, which can be classified by the meaning they express into character, real number, integer and byte types. The specific mapping relationship is shown in Table 1.

TABLE 1 data type mapping rules
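As an illustration of how such a metadata dictionary might be applied, the sketch below maps source column types to the four unified categories named above. It is not a reproduction of Table 1: only the Oracle character types come from the text, every other entry is an assumption chosen for the example, and the names MAPPING_RULES and normalize_type are hypothetical.

```python
import re

UNIFIED_TYPES = ("character", "real", "integer", "byte")

# Illustrative rules only; every entry beyond the Oracle character types is an assumption.
MAPPING_RULES = {
    "oracle": {"CHAR": "character", "VARCHAR2": "character", "NCHAR": "character",
               "NVARCHAR2": "character", "NUMBER": "real", "BLOB": "byte"},
    "mysql": {"VARCHAR": "character", "TEXT": "character", "INT": "integer",
              "BIGINT": "integer", "DOUBLE": "real", "BLOB": "byte"},
    "postgresql": {"VARCHAR": "character", "TEXT": "character", "INTEGER": "integer",
                   "DOUBLE PRECISION": "real", "BYTEA": "byte"},
}


def normalize_type(database, source_type):
    """Map a source column type such as 'varchar2(30)' to a unified metadata type."""
    # Drop a length or precision suffix such as "(30)" before the lookup.
    base = re.sub(r"\(.*\)", "", source_type).strip().upper()
    try:
        return MAPPING_RULES[database][base]
    except KeyError:
        raise ValueError(f"no mapping rule for {database} type {source_type!r}")


oracle_kind = normalize_type("oracle", "varchar2(30)")
postgres_kind = normalize_type("postgresql", "double precision")
```

Raising an error for an unmapped type, rather than silently defaulting to a character type, is a design choice of this sketch: it surfaces gaps in the rule set at backup time instead of at restore time.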

M = CS + SS (1)

wherein CS (content structure) defines the constituent elements of the metadata and their content, and SS (syntax structure) defines the format structure of the metadata and the specific description method;

The content structure expression is shown as formula (2):

CS = (T, Z, S, F) (2)

Source table (T): this element represents the table structure of the multi-source database and stores the table structure information of the data to be backed up, comprising the source table serial number, source table name, identifier, number of fields, field names and field type information.

Field (Z): this element represents a data value of the multi-source database and stores the specific value of a field in a table, comprising the field serial number, field name, field type, field value, table name and identifier.

Predetermined set (S): this element defines the backup objects and subdivides the backup process into units, so that when a backup task is interrupted it can be continued from the point of interruption. This mechanism saves time and improves backup efficiency, helps ensure the consistency of backup results, and prevents the data redundancy that would result from backing up data that has already been backed up. The predetermined set is defined as the basic unit of backup and contains the objects to be backed up. It comprises the predetermined set number, source server, target server, start time, end time, backup serial number, source table serial number and field serial number.

Constraint (F): the constraint element describes the field constraint information of a table and is used to record the special column information of the table, comprising the table name, constraint serial number, primary key column name, foreign key column name, index column name and identifier. Because of its special function, the special column information must be recorded separately in order to describe the table structure completely.

(2) Clustering preprocessing is performed on the data blocks, the data blocks of the same class are compressed with the DELTA compression algorithm to generate the corresponding binary storage files, and the compressed backup files are backed up to a backup medium. The specific method is as follows:

202. Clustering the data blocks with the K-medoids clustering algorithm, using the similarity information stored in the similarity matrix as the clustering basis; the clustering result ensures that data blocks in the same class are highly similar to one another;

204. Setting the size of a sliding window, moving the window forward continuously, and calculating the data fingerprint under the moving window. To speed up retrieval and reduce search time, a hash function is used to map the fingerprints into super-features, or a super-fingerprint set;

205. If the super-fingerprints match, the two files are highly similar. The feature database is searched for a reference file that is highly similar to the current file, and once it is found the file is compressed against it with the compression function D;

207. Recombining the compressed data blocks into a backup file.
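The sliding-window fingerprinting and super-feature matching can be sketched as follows. The text does not specify the fingerprint function, the feature selection rule or the grouping, so all of these are assumptions here: CRC32 is used as the window fingerprint, each feature is the maximum of a linearly transformed fingerprint, and pairs of features are hashed into one super-feature. The constants and function names are hypothetical.

```python
import hashlib
import zlib

WINDOW = 16
MASK = 0xFFFFFFFF
# Hypothetical multiplier/offset pairs; each pair yields one feature.
TRANSFORMS = [(2654435761, 17), (40503, 101), (69069, 1), (1103515245, 12345)]


def window_fingerprints(data, window=WINDOW):
    """Fingerprint of every window position as the window slides forward one byte."""
    positions = max(1, len(data) - window + 1)
    return [zlib.crc32(data[i:i + window]) for i in range(positions)]


def features(data):
    """One feature per transform: the largest transformed fingerprint in the data."""
    fps = window_fingerprints(data)
    return [max((a * fp + b) & MASK for fp in fps) for a, b in TRANSFORMS]


def super_features(data, group=2):
    """Hash each group of features into a single super-feature."""
    feats = features(data)
    supers = []
    for i in range(0, len(feats), group):
        packed = b"".join(f.to_bytes(4, "big") for f in feats[i:i + group])
        supers.append(hashlib.sha1(packed).hexdigest()[:16])
    return supers


def is_similar(a, b):
    """Two blocks are candidates for delta compression if any super-feature matches."""
    return bool(set(super_features(a)) & set(super_features(b)))


block = bytes(range(256)) * 4
block_supers = super_features(block)
same = is_similar(block, bytes(block))
```

Comparing a handful of short super-features instead of whole blocks keeps the lookup in the feature database cheap; the expensive delta computation is then run only against the candidate reference that matched.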

dist(Si,Sj)＝delta(Si,Sj) (3)

(3) Restoring the metadata in the backup file with the SQL (Structured Query Language) reproduction method, and reading the database version of the restoring end from the configuration file; the conversion rules are applied in reverse to restore the metadata into SQL statements that the database can recognize, and the corresponding SQL files are generated. The specific method is as follows:

The above is only a preferred embodiment of the invention. It should be noted that those skilled in the art can make various modifications and variations without departing from the technical principle of the invention, and such modifications and variations shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A method for recovering backup files of a heterogeneous database, characterized by comprising the following steps:

(1) Normalizing and converting data in a heterogeneous source database;

(3) Restoring the metadata in the backup file by using an SQL (structured query language) reproduction method, and reading a database version of a restoring end according to the configuration file; the specific method comprises the following steps:

307. The sql file restores the data to the database;

2. The method for restoring the backup files of the heterogeneous database according to claim 1, wherein the specific method in step (1) is as follows:

3. The method for restoring the backup file of the heterogeneous database according to claim 1, wherein the metadata is a minimum unit of a data model, and a metadata structure expression is shown in formula (1):

M＝CS+SS (1)

wherein CS is the content structure, which refers to the constituent elements of the metadata and their content, and SS is the syntax structure, which defines the format structure of the metadata and the specific description method;

The content structure expression is shown as formula (2):

CS＝(T,Z,S,F) (2)

4. The method of claim 3, wherein the special column information is recorded separately to describe the integrity of the table structure.

5. The method for restoring the backup files of the heterogeneous database according to claim 1, wherein the specific method in step (2) is as follows:

207. Recombining the compressed data blocks into a backup file.

6. The method for restoring the backed-up files of the heterogeneous database according to claim 5, wherein the specific method for compressing the same type of data blocks by using a DELTA compression algorithm is as follows:

dist(Si,Sj)＝delta(Si,Sj) (3)

7. The method for restoring the backup file of the heterogeneous database according to claim 1, wherein, when the "SQL reproduction method" is used to restore the metadata in the backup file, the value of the identifier in the metadata file is checked first: