CN104239493B - cross-cluster data migration method and system - Google Patents

Publication number
CN104239493B
Authority
CN
China
Prior art keywords
cluster
data
tables
source
target cluster
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410455695.2A
Other languages
Chinese (zh)
Other versions
CN104239493A (en)
Inventor
黄刚
何洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410455695.2A
Publication of CN104239493A
Application granted
Publication of CN104239493B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application

Abstract

Embodiments of the invention provide a cross-cluster data migration method and system. All child nodes of the source cluster stop data operations and the in-memory data of the source cluster's distributed database is persisted, so that the data in the distributed database of the source cluster is persistent before migration. The data tables in the distributed database of the source cluster are compressed, which reduces the amount of data transferred; the compressed data tables are then migrated to the target cluster, which improves migration efficiency. Finally, the storage space occupied by the data tables in the source cluster's distributed database before migration, and their total number of file blocks, are matched against the storage space occupied by the data tables in the target cluster after migration and their total number of file blocks, so that the integrity of the migration can be verified from the matching result.

Description

Cross-cluster data migration method and system
Technical field
Embodiments of the present invention relate to the technical field of databases, and in particular to a cross-cluster data migration method and system.
Background art
With the development of Internet applications and the surge in user numbers, the volume of stored data has grown exponentially. Traditional single-machine storage technology can no longer meet the access requirements of massive data, and HDFS (Hadoop Distributed File System) and distributed databases have emerged in response.
HBase (Hadoop Database) is a scalable, column-oriented distributed database that uses HDFS as its file storage system and stores data in the form of data tables. On ordinary commodity hardware it can support large data tables with billions of rows and millions of columns, and it supports random writes and reads on data of this scale. It is widely used because of its high reliability, strong scalability, support for random access, and support for MapReduce parallel computing. Hadoop is a distributed system infrastructure developed by the Apache Foundation; users can develop distributed programs without understanding the low-level details of distribution, and can make full use of the power of a cluster for high-speed computation and storage of massive data.
In practice, data migration is unavoidable. In particular, when an online HBase cluster has to be taken offline, or when a machine room is relocated, migrating a massive amount of data becomes an urgent task: the data tables of the old cluster must be migrated to a new cluster, which then provides massive-data access services to the business parties.
Existing data migration techniques generally use the data copy component of Hadoop to perform a distributed copy, thereby migrating the data tables of one cluster to a new cluster. After the data copy completes, the relevant service processes of the new cluster are started.
These techniques have the following drawbacks. The integrity of the data after migration cannot be guaranteed. In addition, the migration time depends strictly on the scale of the data being migrated, so the time taken is hard to control: if the inter-cluster network bandwidth is limited and the amount of data to migrate is large, it is difficult to complete the migration within a short migration window. In other words, migration efficiency is low.
Summary of the invention
Embodiments of the present invention provide a cross-cluster data migration method and system to ensure the integrity and efficiency of cross-cluster data migration.
In a first aspect, an embodiment of the present invention provides a cross-cluster data migration method, including:
the master node of the source cluster calls a stop command to make each child node of the source cluster stop data operations;
the master node of the source cluster uses the buffer-flush component of the source cluster's distributed database to persist the data held in the memory of the distributed database to the distributed file system HDFS;
the master node of the source cluster compresses the data tables contained in the source cluster's distributed database using a set compression algorithm;
the master node of the source cluster counts the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database;
the master node of the source cluster migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, the master node of the target cluster counts the second storage space size and the second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and matches the second storage space size and second total number of file blocks against the first storage space size and first total number of file blocks;
if the match succeeds, the master node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm;
the master node of the target cluster starts the target cluster based on a startup strategy.
In a second aspect, an embodiment of the present invention further provides a cross-cluster data migration system including a source cluster and a target cluster, where the source cluster includes a master node and at least one child node, and the target cluster includes a master node and at least one child node;
the master node of the source cluster includes:
a stopping module, configured to call a stop command to make each child node of the source cluster stop data operations;
a persistence module, configured to use the buffer-flush component of the source cluster's distributed database to persist the data held in the memory of the distributed database to the distributed file system HDFS;
a compression module, configured to compress the data tables contained in the source cluster's distributed database using a set compression algorithm;
a statistics module, configured to count the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database;
a migration module, configured to migrate the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
the master node of the target cluster includes:
a statistics module, configured to, if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, count the second storage space size and the second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and match the second storage space size and second total number of file blocks against the first storage space size and first total number of file blocks;
a decompression module, configured to, if the match succeeds, decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm;
a starting module, configured to start the target cluster based on a startup strategy.
In the cross-cluster data migration method and system provided by the embodiments of the present invention, each child node of the source cluster stops data operations and the data in the memory of the source cluster's distributed database is persisted, so that the data in the distributed database of the source cluster is persistent before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, and migrating the compressed data tables to the target cluster improves migration efficiency. The storage space size and total number of file blocks occupied by the data tables in the source cluster's distributed database before migration are then matched against the storage space size and total number of file blocks occupied by the data tables of the target cluster after migration, so that the integrity of the migration can be verified from the matching result.
Brief description of the drawings
To illustrate the present invention more clearly, the drawings it requires are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a cross-cluster data migration method provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a cross-cluster data migration method provided by Embodiment 3 of the present invention;
Fig. 3a is a schematic structural diagram of the master node of the source cluster in a cross-cluster data migration system provided by Embodiment 4 of the present invention;
Fig. 3b is a schematic structural diagram of the master node of the target cluster in a cross-cluster data migration system provided by Embodiment 4 of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described in further detail below with reference to the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. It should be understood that the specific embodiments described here serve only to explain the present invention and do not limit it. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content.
Embodiment 1
Referring to Fig. 1, a flowchart of a cross-cluster data migration method provided by Embodiment 1 of the present invention is shown. The method of this embodiment applies to a cross-cluster data migration system that includes a source cluster and a target cluster; the source cluster includes a master node and at least one child node, and the target cluster includes a master node and at least one child node. The master node and at least one child node of the source cluster form an HDFS, and the data tables to be migrated are stored in the source cluster. The master node and at least one child node of the target cluster can likewise form an HDFS, which stores the data tables migrated from the source cluster.
The method includes:
Step 110: the master node of the source cluster calls a stop command to make each child node of the source cluster stop data operations;
In this step, each child node of the source cluster stops data operations so that the data in each node can be persisted before migration. Specifically, the business parties corresponding to each child node may first be notified to stop data write or read operations, after which the stop command is called to make each child node of the source cluster stop data operations. Alternatively, the stop command may be called directly to make each child node of the source cluster stop data operations.
Step 120: the master node of the source cluster uses the buffer-flush component of the source cluster's distributed database to persist the data held in the memory of the distributed database to HDFS;
This step persists the data in the distributed database of the source cluster.
The buffer-flush component is used to persist data temporarily held in the memory of the distributed database to the disks of HDFS.
Step 130: the master node of the source cluster compresses the data tables contained in the source cluster's distributed database using a set compression algorithm;
This step compresses the data tables to be migrated in the source cluster. Specifically, the compression state of each data table may be checked, and the tables that have not been compressed are compressed. The LZO (Lempel-Ziv-Oberhumer) compression algorithm, the SNAPPY compression algorithm, or another compression algorithm may be used.
LZO is a compression algorithm with a high compression ratio and extremely fast decompression; it is lossless, so the data can be reproduced exactly after decompression. SNAPPY is a library for compression and decompression that aims to provide high compression speed with a reasonable compression ratio.
The distributed database uses HDFS as its file storage system, stores data in the form of data tables, and can support large data tables with billions of rows and millions of columns on ordinary hardware. Compressing the data tables to be migrated therefore effectively reduces the amount of data transferred, which helps improve data migration efficiency.
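The effect of lossless compression on transfer volume can be illustrated with a minimal sketch. It uses Python's standard `zlib` module purely as a stand-in, because LZO and SNAPPY are not part of the standard library; the function names and the sample data are hypothetical and are not taken from the patent.

```python
import zlib


def compress_table_bytes(raw: bytes, level: int = 6) -> bytes:
    """Losslessly compress raw table bytes (zlib is a stand-in for LZO/SNAPPY)."""
    return zlib.compress(raw, level)


def decompress_table_bytes(packed: bytes) -> bytes:
    """Reverse compress_table_bytes, reproducing the original bytes exactly."""
    return zlib.decompress(packed)


# Hypothetical, highly repetitive sample rows; real table data compresses less well.
sample = b"order_id=1001,status=PAID;" * 1000
packed = compress_table_bytes(sample)
restored = decompress_table_bytes(packed)

# Fraction of the transfer volume saved by sending the compressed bytes instead.
saved_fraction = 1 - len(packed) / len(sample)
```

The round trip returns the original bytes unchanged, which is the property the method relies on when it decompresses the tables on the target cluster after verification.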
Step 140: the master node of the source cluster counts the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database;
In this step, the data tables contained in the source cluster's distributed database are the data tables to be migrated and may be stored under the root directory of the distributed database. Because the source cluster's distributed database uses HDFS as its file storage system, disk storage space is defined in terms of HDFS: it is the sum of the disk storage space of the child nodes of the source cluster. The storage space that the data tables occupy within the disk storage space of HDFS is the first storage space.
Because the data tables to be migrated hold a very large amount of data, they are stored in a distributed manner: each table to be migrated is split into multiple file blocks, and different file blocks are stored on the disks of different child nodes of the source cluster. The first total number of file blocks is the sum of the numbers of file blocks corresponding to the data tables to be migrated.
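As an illustration of the two statistics, the following sketch derives a total size and a total block count from a list of file sizes. The 128 MiB block size and the helper name `summarize_storage` are assumptions made for the example; the patent does not specify how the counts are obtained.

```python
MIB = 1024 * 1024
ASSUMED_BLOCK_SIZE = 128 * MIB  # assumption: block size is configurable per cluster


def summarize_storage(file_sizes, block_size=ASSUMED_BLOCK_SIZE):
    """Return (total_bytes, total_blocks) for the files backing the data tables.

    Each non-empty file occupies ceil(size / block_size) blocks; an empty
    file is counted as occupying no block.
    """
    total_bytes = sum(file_sizes)
    total_blocks = sum(
        (size + block_size - 1) // block_size  # integer ceiling division
        for size in file_sizes
        if size > 0
    )
    return total_bytes, total_blocks


# Hypothetical files: 300 MiB -> 3 blocks, 128 MiB -> 1 block, 1 byte -> 1 block
first_size, first_blocks = summarize_storage([300 * MIB, 128 * MIB, 1])
```

Under these assumptions the same calculation would be repeated on the target cluster after migration to obtain the second storage space size and second total number of file blocks.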
Step 150: the master node of the source cluster migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster, based on a previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
This step migrates the data tables in the source cluster's distributed database to the distributed database of the target cluster.
Note that step 110 stops only the data access service provided by the source cluster's distributed database; the HDFS service of the source cluster continues to run normally.
Note also that, from the mapping between the IP addresses and host names of the nodes contained in the target cluster, the master node of the source cluster can locate the HDFS of the target cluster. Based on this mapping, the data tables in the source cluster's distributed database can be migrated into the HDFS of the target cluster; and because the target cluster's distributed database uses HDFS as its file storage system, the data tables are thereby migrated into the distributed database of the target cluster.
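One way to picture the mapping is as a set of hosts-file style entries from which an address for the target HDFS can be formed. The sketch below is only an illustration: the host names, IP addresses, port 8020, and both helper functions are hypothetical, and the patent does not state how the mapping is stored or consumed.

```python
import ipaddress


def render_hosts_entries(ip_to_hostname):
    """Validate each IP address and return sorted 'IP<TAB>hostname' lines."""
    parsed = []
    for ip, hostname in ip_to_hostname.items():
        addr = ipaddress.ip_address(ip)  # raises ValueError for a malformed address
        parsed.append((addr.version, int(addr), str(addr), hostname))
    parsed.sort()
    return [f"{ip}\t{hostname}" for _, _, ip, hostname in parsed]


def target_hdfs_uri(namenode_host, port, path):
    """Build an hdfs:// URI pointing at a path on the target cluster."""
    return f"hdfs://{namenode_host}:{port}/{path.lstrip('/')}"


# Hypothetical target-cluster nodes
target_nodes = {
    "10.0.0.12": "target-dn2",
    "10.0.0.10": "target-master",
    "10.0.0.11": "target-dn1",
}
entries = render_hosts_entries(target_nodes)
destination = target_hdfs_uri("target-master", 8020, "/hbase")
```

Validating the addresses up front means a typo in the mapping fails before any data is copied rather than partway through the migration.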
Step 160: if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, the master node of the target cluster counts the second storage space size and the second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, and matches the second storage space size and second total number of file blocks against the first storage space size and first total number of file blocks;
Specifically, after completion of the data table migration has been detected, this step first counts the second storage space size and the second total number of file blocks of the corresponding HDFS occupied by the data tables in the target cluster's distributed database, then compares the second storage space size with the first storage space size, and compares the second total number of file blocks with the first total number of file blocks.
The second storage space size and second total number of file blocks in this step are defined analogously to the first storage space size and first total number of file blocks, and are not described again here.
MapReduce is a general programming model for distributed parallel computing tasks, used for the parallel processing of large-scale data. The web management interface of the MapReduce process can be used to monitor the migration in detail, for example the real-time migration rate, the percentage migrated, the estimated time remaining, and descriptive information about the data already migrated.
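The monitored quantities are related by simple arithmetic, sketched below. The function `migration_progress` and its inputs are hypothetical; this is not how the MapReduce web interface computes its figures, only one plausible way to derive an average rate, a percentage, and a remaining-time estimate from bytes transferred and elapsed time.

```python
def migration_progress(bytes_done, bytes_total, elapsed_seconds):
    """Derive average rate, percentage complete, and estimated seconds remaining."""
    if bytes_total <= 0 or elapsed_seconds <= 0:
        raise ValueError("bytes_total and elapsed_seconds must be positive")
    rate = bytes_done / elapsed_seconds            # average bytes per second so far
    percent = 100.0 * bytes_done / bytes_total
    # Without any progress the remaining time cannot be estimated.
    remaining = (bytes_total - bytes_done) / rate if rate > 0 else None
    return {"rate": rate, "percent": percent, "remaining_seconds": remaining}


# Hypothetical snapshot: 250 of 1000 bytes moved in 50 seconds
snapshot = migration_progress(250, 1000, 50)
```

An estimate of this kind assumes the average rate so far continues to hold, which limited inter-cluster bandwidth may or may not allow.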
Step 170: if the match succeeds, the master node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm;
A successful match in this step means that the first storage space size of the HDFS occupied by the data tables in the source cluster's distributed database before migration is consistent with the second storage space size of the HDFS of the target cluster occupied after migration, and that the total number of file blocks before migration is consistent with the total number of file blocks after migration. In other words, a successful match means that the data tables in the source cluster's distributed database have been migrated completely into the distributed database of the target cluster.
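The matching rule can be sketched as a comparison of two small records. The `StorageStats` type and `find_mismatches` function are hypothetical names; the sketch also assumes that both clusters use the same block size, since otherwise equal data could yield different block counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageStats:
    size_bytes: int    # storage space occupied by the data tables
    block_count: int   # total number of file blocks


def find_mismatches(source: StorageStats, target: StorageStats):
    """Return the names of the statistics that differ; an empty list means a match."""
    mismatches = []
    if source.size_bytes != target.size_bytes:
        mismatches.append("storage size")
    if source.block_count != target.block_count:
        mismatches.append("file block count")
    return mismatches


before = StorageStats(size_bytes=448_790_529, block_count=5)
after_ok = StorageStats(size_bytes=448_790_529, block_count=5)
after_short = StorageStats(size_bytes=314_572_800, block_count=3)

match_succeeded = not find_mismatches(before, after_ok)
```

Returning the list of differing statistics, rather than a bare boolean, makes it easier to report which check failed when decompression is withheld.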
Specifically, this step decompresses the data tables migrated to the target cluster after they have been migrated completely.
Step 180: the master node of the target cluster starts the target cluster based on a startup strategy.
This step starts the target cluster so that each node of the target cluster works normally.
In the technical solution of this embodiment, each child node of the source cluster stops data operations and the data in the memory of the source cluster's distributed database is persisted, so that the data in the distributed database of the source cluster is persistent before migration. Compressing the data tables in the source cluster's distributed database reduces the amount of data transferred, and migrating the compressed data tables to the target cluster improves migration efficiency. The storage space size and total number of file blocks occupied by the data tables in the source cluster's distributed database before migration are then matched against the storage space size and total number of file blocks occupied by the data tables of the target cluster after migration, so that the integrity of the migration can be verified from the matching result.
Embodiment 2
On the basis of the above embodiment, in this embodiment, before the master node of the source cluster counts the first storage space size and the first total number of file blocks of the HDFS occupied by the data tables in the source cluster's distributed database, the method further includes:
the master node of the source cluster uses the full file-merge component of the source cluster's distributed database to clear, from the disk storage space of the source cluster's distributed database, the data tables that satisfy a preset clearing strategy.
Specifically, after the data tables to be migrated have been compressed, this step clears invalid data tables from the disk storage space of the source cluster's distributed database, which further reduces the amount of data migrated and improves migration efficiency.
The preset clearing strategy can be implemented in various ways, for example including at least one of the following:
taking data tables that carry a delete marker in the disk storage space of the source cluster's distributed database as data tables to be cleared;
taking data tables that have reached their time to live in the disk storage space of the source cluster's distributed database as data tables to be cleared;
Note that the time to live of a data table can be preset as needed, and expired data tables are eliminated according to it. Taking an e-commerce platform as an example, the time to live of the corresponding data table can generally be set according to the duration of a promotional campaign, such as 30 days, 7 days, or a certain period within a specific day. For example, if a store anniversary sale runs from 10 a.m. to 10 p.m. on the anniversary day, then when the sale ends the data table for that sale has reached its time to live and is expired. Clearing expired data tables helps save storage space and improve migration efficiency.
taking data tables whose maximum version number exceeds a threshold in the disk storage space of the source cluster's distributed database as data tables to be cleared.
Note that the maximum number of versions of a data table can be preset as needed and is usually set to 3. For data tables that are updated frequently it can be set to 1, so that invalid data tables are eliminated quickly, which helps save storage space and improve migration efficiency.
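The three clearing criteria above can be combined into a single predicate, as in the following sketch. The `StoredTable` record, its field names, and `should_clear` are hypothetical illustrations of the strategy described in the text; they do not represent the interface of the full file-merge component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredTable:
    name: str
    created_at: float              # creation time, seconds since the epoch
    ttl_seconds: Optional[float]   # time to live; None means the table never expires
    delete_marked: bool            # True if the table carries a delete marker
    version_count: int             # number of versions currently stored


def should_clear(table: StoredTable, now: float, max_versions: int = 3) -> bool:
    """Apply the delete-marker, time-to-live, and maximum-version criteria in turn."""
    if table.delete_marked:
        return True
    if table.ttl_seconds is not None and now - table.created_at >= table.ttl_seconds:
        return True
    return table.version_count > max_versions


HOUR = 3600.0
# Hypothetical anniversary-sale table with a 12-hour time to live (10 a.m. to 10 p.m.)
sale = StoredTable("anniversary_sale", created_at=0.0, ttl_seconds=12 * HOUR,
                   delete_marked=False, version_count=1)
expired_after_sale = should_clear(sale, now=12 * HOUR)
still_live_during_sale = not should_clear(sale, now=6 * HOUR)
```

The default of 3 for `max_versions` mirrors the usual setting mentioned above; passing 1 models the stricter setting suggested for frequently updated tables.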
In the technical solution of this embodiment, after the data in the distributed database of the source cluster has been persisted before migration, the data tables to be migrated are compressed, which reduces the amount of data transferred, and invalid data tables are cleared from the disk storage space of the source cluster's distributed database, which further reduces the amount of data migrated. Migrating the compressed and cleaned data tables in the source cluster's distributed database to the target cluster improves migration efficiency. The storage space size and total number of file blocks occupied by the data tables of the source cluster's distributed database before migration are matched against the storage space size and total number of file blocks occupied by the data tables of the target cluster after migration, so that the integrity of the migration can be verified from the matching result.
In the above scheme, the buffer-flush component and the full file-merge component can be triggered by calling the command-line interface of the source cluster's distributed database.
The command-line interface is the interactive interface between the operating system and the user. In the Linux operating system the command-line interface is called the shell; its main role is to provide services to the user, such as receiving input from the keyboard or displaying execution results on the screen.
In the above scheme, before the master node of the target cluster decompresses the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm, the method preferably further includes:
the master node of the target cluster calls the consistency check component in the target cluster to check the consistency of the data tables contained in the target cluster's distributed database;
if they are consistent, the master node of the target cluster is triggered to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm.
Note that checking the consistency of a data table means checking whether the description information of the data table is consistent with the attribute information of the data table that actually exists in the HDFS of the target cluster. If they are consistent, the master node of the target cluster is triggered to decompress the data tables migrated to the target cluster using the decompression algorithm corresponding to the set compression algorithm; if they are inconsistent, the consistency check component is used to repair them. This step is a supplementary check performed after the integrity of the data table migration has been verified, and it improves the consistency of the data tables migrated to the target cluster.
Embodiment 3
Referring to Fig. 2, a flowchart of a cross-cluster data migration method provided by Embodiment 3 of the present invention is shown. On the basis of the above embodiments, this embodiment provides a preferred scheme by which the master node of the target cluster starts the target cluster based on the startup strategy. The preferred method includes:
Step 210: the master node of the target cluster calls a start command to start the target cluster;
Step 220: if no error log entries or warning log entries exist in the log files associated with the target cluster's distributed database, the master node of the target cluster calls the overall health check component of the target cluster's distributed database to check the overall health of the target cluster;
Specifically, this step examines the log files associated with the distributed database on the nodes contained in the target cluster. If error or warning log entries exist, the relevant components of the target cluster's distributed database are called, as prompted, to resolve the problem. If no error or warning log entries exist, the health check component of the target cluster's distributed database is used to check the overall health of the target cluster.
Checking the overall health of the target cluster includes checking whether the data tables of the target cluster are in a normal state.
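The branch taken in step 220 can be sketched as a scan of log lines for error or warning entries. The log line layout, the level keywords `ERROR` and `WARN`, and the returned action names are assumptions made for this illustration; the patent does not describe the log format.

```python
import re

# Assumed convention: the severity level appears as a standalone word in each line.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN(?:ING)?)\b")


def find_problem_lines(log_lines):
    """Return the log lines that contain an error or warning level keyword."""
    return [line.rstrip("\n") for line in log_lines if LEVEL_PATTERN.search(line)]


def next_startup_action(log_lines):
    """Choose the next step: resolve logged problems first, else run the health check."""
    if find_problem_lines(log_lines):
        return "resolve_logged_problems"
    return "run_health_check"


# Hypothetical log excerpts
clean_log = [
    "2014-09-09 10:00:01 INFO master started\n",
    "2014-09-09 10:00:05 INFO region server registered\n",
]
noisy_log = clean_log + ["2014-09-09 10:00:07 WARN slow sync on node target-dn1\n"]

clean_action = next_startup_action(clean_log)
noisy_action = next_startup_action(noisy_log)
```

With these inputs, the clean log leads to the health check and the log containing a warning leads to problem resolution, mirroring the two branches described above.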
Step 230, the main controlled node of target cluster call the command line interface of distributed data base by the tables of data State is set to enabled state.
This step specifically according to the testing result of the holistic health degree of the target cluster in step 220, makes target cluster In the state of tables of data maintain the normal condition of " enable ".
In the technical solution of this embodiment, after the target cluster is started, the log files associated with the distributed database on the nodes contained in the target cluster are examined. If error log information or warning log information exists, the relevant component of the distributed database of the target cluster is called, as indicated by the message, to resolve the problem; if no error log information or warning log information exists, the health check component of the distributed database of the target cluster is used to check the overall health of the target cluster. Based on the result of the overall health check, the data tables in the target cluster are kept in the normal "enabled" state, so that they can provide normal access services.
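The startup policy of steps 210 to 230 can be sketched as follows. This is a minimal illustration, not the patented implementation: the log-level tokens `ERROR` and `WARN`, the callables `health_check` and `enable_table`, and the returned dictionary are all hypothetical stand-ins for the health check component and the command-line interface of the distributed database.

```python
from typing import Callable, Dict, Iterable, List

PROBLEM_LEVELS = ("ERROR", "WARN")  # assumed log-level tokens


def find_problem_lines(log_lines: Iterable[str]) -> List[str]:
    """Return the log lines that carry an error or warning level token."""
    return [line for line in log_lines
            if any(tok.strip("[]:,") in PROBLEM_LEVELS for tok in line.split())]


def run_startup_policy(log_lines: Iterable[str],
                       health_check: Callable[[], Dict[str, str]],
                       enable_table: Callable[[str], None]) -> dict:
    """Check logs first, then table health, then enable any table that is not enabled."""
    problems = find_problem_lines(log_lines)
    if problems:
        # Errors or warnings must be resolved before the health check is run.
        return {"status": "needs_repair", "problems": problems}
    table_states = health_check()  # hypothetical: maps table name -> state
    newly_enabled = []
    for table, state in sorted(table_states.items()):
        if state != "enabled":
            enable_table(table)
            newly_enabled.append(table)
    return {"status": "started", "enabled": newly_enabled}
```

Passing the health check and the enable operation in as callables keeps the sketch independent of any particular database shell.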
In the above scheme, if the distributed database management page of the target cluster shows that the data tables contained in the distributed database of the target cluster are not in the enabled state, then after the master node of the target cluster calls the command-line interface of the distributed database to set the state of the data tables to the enabled state, the method preferably further includes:
sending the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the distributed database of the target cluster, to the business party, and notifying the business party to verify the data services of the target cluster.
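As a rough illustration of what could be handed to the business party, the sketch below renders the IP-to-host-name mapping as hosts-file style lines and bundles it with the connection information. The payload layout, the function names, and the `action` field are assumptions; the text does not define a message format.

```python
from typing import Dict, List


def render_hosts_entries(ip_to_hostname: Dict[str, str]) -> List[str]:
    """Render the mapping as 'IP<TAB>hostname' lines, sorted by IP string."""
    return [f"{ip}\t{host}" for ip, host in sorted(ip_to_hostname.items())]


def build_notification(ip_to_hostname: Dict[str, str],
                       connection_info: Dict[str, str]) -> dict:
    """Bundle the mapping and connection information into one hypothetical message."""
    return {
        "hosts": render_hosts_entries(ip_to_hostname),
        "connection": dict(connection_info),  # copied so the caller's dict is not shared
        "action": "verify data services of the target cluster",
    }
```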
Embodiment four
Refer to Fig. 3a and Fig. 3b. Embodiment four of the present invention provides a cross-cluster data migration system. The system includes a source cluster and a target cluster; the source cluster includes a master node and at least one child node, and the target cluster includes a master node and at least one child node.
The master node of the source cluster includes: a stopping module 310, a persistence module 320, a compression module 330, a statistics module 340 and a migration module 350.
The master node of the target cluster includes: a statistics module 360, a decompression module 370 and a starting module 380.
The stopping module 310 is configured to call a stop command to control each child node of the source cluster to stop data operations. The persistence module 320 is configured to use the buffer-flush component of the distributed database of the source cluster to persist the data in the memory of the distributed database into HDFS. The compression module 330 is configured to compress the data tables contained in the distributed database of the source cluster using a set compression algorithm. The statistics module 340 is configured to count the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster. The migration module 350 is configured to migrate the data tables in the distributed database of the source cluster into the distributed database of the target cluster, based on the previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster.
The statistics module 360 is configured to, if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, count the second storage size and the second total file block count of the HDFS occupied by the data tables in the distributed database of the target cluster, and to match the second storage size and the second total file block count against the first storage size and the first total file block count. The decompression module 370 is configured to, if the match succeeds, decompress the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm. The starting module 380 is configured to start the target cluster based on a startup policy.
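The matching performed by the statistics module 360 can be sketched as a comparison of two measurements. The class name `HdfsUsage`, its field names, and the reading of "matching" as exact equality are assumptions for illustration; how the figures are actually collected from HDFS is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsUsage:
    """Storage occupied by the data tables in HDFS (hypothetical container)."""
    size_bytes: int    # storage size
    total_blocks: int  # total file block count


def migration_is_complete(source: HdfsUsage, target: HdfsUsage) -> bool:
    """Treat the migration as complete only if both figures are equal on both sides."""
    return (source.size_bytes == target.size_bytes
            and source.total_blocks == target.total_blocks)
```

Comparing the block count as well as the size helps distinguish a complete copy from one that happens to have the same byte total but a different file layout.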
In the technical solution of this embodiment, having each child node of the source cluster stop data operations and persisting the data held in the memory of the distributed database of the source cluster ensures that the data in the distributed database of the source cluster is persisted before migration. Compressing the data tables in the distributed database of the source cluster reduces the amount of data to be transferred, and migrating the compressed data tables into the target cluster improves migration efficiency. Then, matching the storage size and total file block count occupied by the data tables in the distributed database of the source cluster before migration against the storage size and total file block count occupied by the data tables of the target cluster after migration allows the integrity of the migration to be verified from the matching result.
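The order of operations summarized above can be sketched as a small driver. The step names in `ops` and the function `run_migration` are hypothetical; each callable stands in for the corresponding module (stopping, persistence, compression, statistics, migration, decompression, starting), whose real behavior is outside this sketch.

```python
from typing import Callable, Dict, List


def run_migration(ops: Dict[str, Callable]) -> List[str]:
    """Run the migration steps in order and return the names of the steps taken."""
    log: List[str] = []

    def step(name: str):
        log.append(name)
        return ops[name]()

    step("stop_source_writes")      # child nodes stop data operations
    step("flush_memory_to_hdfs")    # persist in-memory data into HDFS
    step("compress_tables")         # compress with the set algorithm
    source_usage = step("measure_source")  # first size and block count
    step("copy_tables")             # migrate tables to the target cluster
    target_usage = step("measure_target")  # second size and block count
    if source_usage != target_usage:
        # The integrity check failed, so the tables are not decompressed or served.
        log.append("abort: usage mismatch")
        return log
    step("decompress_tables")
    step("start_target")
    return log
```

Returning the list of executed steps makes it easy to confirm that decompression and startup only happen after a successful match.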
In the above scheme, the master node of the source cluster preferably further includes a clearing module, configured to, before the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster are counted, use the full file-merge component of the distributed database of the source cluster to clear the data tables in the disk storage space of the distributed database of the source cluster that meet a preset clearing policy.
In the above scheme, the preset clearing policy includes at least one of the following:
taking the data tables carrying a delete marker in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables that have reached their time to live in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables whose maximum version number exceeds a threshold in the disk storage space of the distributed database of the source cluster as the data tables to be cleared.
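The three clearing criteria can be expressed as a single predicate. This is a sketch under stated assumptions: the record type `StoredItem`, its fields, and the function `should_clear` are hypothetical, and the actual clearing is performed by the full file-merge component of the distributed database rather than by code like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredItem:
    """Hypothetical view of one stored item considered for clearing."""
    has_delete_marker: bool
    written_at: float             # write time, seconds since the epoch
    ttl_seconds: Optional[float]  # time to live; None means no expiry
    version_count: int            # highest version number kept


def should_clear(item: StoredItem, now: float, max_versions: int) -> bool:
    """Return True if the item meets at least one of the three clearing criteria."""
    if item.has_delete_marker:
        return True
    if item.ttl_seconds is not None and now - item.written_at >= item.ttl_seconds:
        return True
    return item.version_count > max_versions
```

Clearing such items before the first measurement keeps stale data out of both the transfer and the size and block-count comparison.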
In the above scheme, the persistence module 320 is specifically configured to trigger the buffer-flush component by calling the command-line interface of the distributed database of the source cluster, so as to persist the data in the memory of the distributed database into the distributed file system HDFS. The clearing module is specifically configured to, before the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster are counted, trigger the full file-merge component by calling the command-line interface of the distributed database of the source cluster, so as to clear the data tables in the disk storage space of the distributed database of the source cluster that meet the preset clearing policy.
In the above scheme, the master node of the target cluster preferably further includes a consistency detection module, configured to, before the data tables migrated into the target cluster are decompressed using the decompression algorithm corresponding to the set compression algorithm, call the consistency detection component in the target cluster to detect the consistency of the data tables contained in the distributed database of the target cluster; if they are consistent, the master node of the target cluster is triggered to decompress the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm.
In the above scheme, the starting module 380 preferably includes: a starting unit, an overall health detection unit and a data table state setting unit.
The starting unit is configured to call a start command to start the target cluster. The overall health detection unit is configured to, if the log files associated with the distributed database of the target cluster contain no error log information or warning log information, call the overall health check component in the distributed database of the target cluster to check the overall health of the target cluster. The data table state setting unit is configured to call the command-line interface of the distributed database to set the state of the data tables to the enabled state.
In the above scheme, the starting module 380 may further include a data service verification unit, configured to, if the distributed database management page of the target cluster shows that the data tables contained in the distributed database of the target cluster are not in the enabled state, then after the command-line interface of the distributed database is called to set the state of the data tables to the enabled state, send the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the distributed database of the target cluster, to the business party, and notify the business party to verify the data services of the target cluster.
The master node of the source cluster and the master node of the target cluster in the cross-cluster data migration system provided by the embodiments of the present invention can perform the cross-cluster data migration method provided by any embodiment of the present invention, and have the functional modules and beneficial effects corresponding to the method performed.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them; the preferred implementations in the embodiments are likewise not limiting. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A cross-cluster data migration method, characterized by comprising:
the master node of the source cluster calls a stop command to control each child node of the source cluster to stop data operations;
the master node of the source cluster uses the buffer-flush component of the distributed database of the source cluster to persist the data in the memory of the distributed database into the distributed file system HDFS;
the master node of the source cluster compresses the data tables contained in the distributed database of the source cluster using a set compression algorithm;
the master node of the source cluster counts the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster;
the master node of the source cluster migrates the data tables in the distributed database of the source cluster into the distributed database of the target cluster, based on the previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, the master node of the target cluster counts the second storage size and the second total file block count of the HDFS occupied by the data tables in the distributed database of the target cluster, and matches the second storage size and the second total file block count against the first storage size and the first total file block count;
if the match succeeds, the master node of the target cluster decompresses the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm;
the master node of the target cluster starts the target cluster based on a startup policy.
2. The method according to claim 1, characterized in that, before the master node of the source cluster counts the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster, the method further comprises:
the master node of the source cluster uses the full file-merge component of the distributed database of the source cluster to clear the data tables in the disk storage space of the distributed database of the source cluster that meet a preset clearing policy.
3. The method according to claim 2, characterized in that the preset clearing policy includes at least one of the following:
taking the data tables carrying a delete marker in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables that have reached their time to live in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables whose maximum version number exceeds a threshold in the disk storage space of the distributed database of the source cluster as the data tables to be cleared.
4. The method according to claim 2, characterized in that the buffer-flush component and the full file-merge component are triggered by calling the command-line interface of the distributed database of the source cluster.
5. The method according to claim 1, characterized in that, before the master node of the target cluster decompresses the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm, the method further comprises:
the master node of the target cluster calls the consistency detection component in the target cluster to detect the consistency of the data tables contained in the distributed database of the target cluster;
if they are consistent, the master node of the target cluster is triggered to decompress the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm.
6. The method according to any one of claims 1-5, characterized in that the master node of the target cluster starting the target cluster based on a startup policy comprises:
the master node of the target cluster calls a start command to start the target cluster;
if the log files associated with the distributed database of the target cluster contain no error log information or warning log information, the master node of the target cluster calls the overall health check component in the distributed database of the target cluster to check the overall health of the target cluster;
the master node of the target cluster calls the command-line interface of the distributed database to set the state of the data tables to the enabled state.
7. The method according to claim 6, characterized in that, if the distributed database management page of the target cluster shows that the data tables contained in the distributed database of the target cluster are not in the enabled state, then after the master node of the target cluster calls the command-line interface of the distributed database to set the state of the data tables to the enabled state, the method further comprises:
sending the mapping between the IP addresses and host names of the nodes contained in the target cluster, together with the connection information of the distributed database of the target cluster, to the business party, and notifying the business party to verify the data services of the target cluster.
8. A cross-cluster data migration system, comprising a source cluster and a target cluster, the source cluster including a master node and at least one child node, and the target cluster including a master node and at least one child node, characterized in that:
the master node of the source cluster includes:
a stopping module, configured to call a stop command to control each child node of the source cluster to stop data operations;
a persistence module, configured to use the buffer-flush component of the distributed database of the source cluster to persist the data in the memory of the distributed database into the distributed file system HDFS;
a compression module, configured to compress the data tables contained in the distributed database of the source cluster using a set compression algorithm;
a statistics module, configured to count the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster;
a migration module, configured to migrate the data tables in the distributed database of the source cluster into the distributed database of the target cluster, based on the previously obtained mapping between the IP addresses and host names of the nodes contained in the target cluster;
the master node of the target cluster includes:
a statistics module, configured to, if a data-migration-complete message returned by the web management interface of the MapReduce process of the source cluster is obtained, count the second storage size and the second total file block count of the HDFS occupied by the data tables in the distributed database of the target cluster, and match the second storage size and the second total file block count against the first storage size and the first total file block count;
a decompression module, configured to, if the match succeeds, decompress the data tables migrated into the target cluster using the decompression algorithm corresponding to the set compression algorithm;
a starting module, configured to start the target cluster based on a startup policy.
9. The system according to claim 8, characterized in that the master node of the source cluster further includes:
a clearing module, configured to, before the first storage size and the first total file block count of the HDFS occupied by the data tables in the distributed database of the source cluster are counted, use the full file-merge component of the distributed database of the source cluster to clear the data tables in the disk storage space of the distributed database of the source cluster that meet a preset clearing policy.
10. The system according to claim 9, characterized in that the preset clearing policy includes at least one of the following:
taking the data tables carrying a delete marker in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables that have reached their time to live in the disk storage space of the distributed database of the source cluster as the data tables to be cleared;
taking the data tables whose maximum version number exceeds a threshold in the disk storage space of the distributed database of the source cluster as the data tables to be cleared.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410455695.2A CN104239493B (en) 2014-09-09 2014-09-09 cross-cluster data migration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410455695.2A CN104239493B (en) 2014-09-09 2014-09-09 cross-cluster data migration method and system

Publications (2)

Publication Number Publication Date
CN104239493A CN104239493A (en) 2014-12-24
CN104239493B true CN104239493B (en) 2017-05-10

Family

ID=52227552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410455695.2A Active CN104239493B (en) 2014-09-09 2014-09-09 cross-cluster data migration method and system

Country Status (1)

Country Link
CN (1) CN104239493B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808612B (en) * 2014-12-31 2019-08-27 北京嘀嘀无限科技发展有限公司 The method and apparatus of data for migrating data library
CN105069128B (en) * 2015-08-14 2018-11-09 北京京东尚科信息技术有限公司 Method of data synchronization and device
CN105159970B (en) * 2015-08-25 2019-03-15 浪潮(北京)电子信息产业有限公司 A kind of database data migration system and method
CN106484379B (en) * 2015-08-28 2019-11-29 华为技术有限公司 A kind of processing method and processing device of application
CN106933859B (en) * 2015-12-30 2020-10-20 中国移动通信集团公司 Medical data migration method and device
CN108021585B (en) * 2016-10-28 2022-01-18 腾讯科技(深圳)有限公司 Distributed data storage method and device
CN106777164B (en) * 2016-12-20 2020-07-10 东软集团股份有限公司 Data migration cluster and data migration method
CN108234566B (en) * 2016-12-21 2021-04-23 阿里巴巴集团控股有限公司 Cluster data processing method and device
CN107016075A (en) * 2017-03-27 2017-08-04 聚好看科技股份有限公司 Company-data synchronous method and device
CN107515782A (en) * 2017-07-26 2017-12-26 北京天云融创软件技术有限公司 Implementation method of the container across host migration under a kind of Docker environment
CN107704633A (en) * 2017-11-01 2018-02-16 郑州云海信息技术有限公司 A kind of method and system of file migration
CN110955720B (en) * 2018-09-27 2023-04-07 阿里巴巴集团控股有限公司 Data loading method, device and system
CN109376010B (en) * 2018-09-28 2020-11-27 上海思询信息科技有限公司 Method for realizing cross-cluster resource migration based on Openstack
CN109544072A (en) * 2018-11-21 2019-03-29 北京京东尚科信息技术有限公司 Method, system, equipment and medium are reduced in hot spot inventory localization
CN109542882B (en) * 2018-12-05 2020-11-06 南京中孚信息技术有限公司 Database migration method and device
CN109818794A (en) * 2019-01-31 2019-05-28 北京搜狐互联网信息服务有限公司 Cluster moving method and tool
CN111756562B (en) * 2019-03-29 2023-07-14 深信服科技股份有限公司 Cluster takeover method, system and related components
CN110209731A (en) * 2019-04-25 2019-09-06 深圳壹账通智能科技有限公司 Method of data synchronization, device and storage medium, electronic device
CN110209653B (en) * 2019-06-04 2021-11-23 中国农业银行股份有限公司 HBase data migration method and device
CN110263044B (en) * 2019-06-21 2023-03-31 深圳前海微众银行股份有限公司 Data storage method, device, equipment and computer readable storage medium
CN110704540B (en) * 2019-10-10 2023-05-02 云南中烟工业有限责任公司 Method for evaluating data quality of source end and target end in data acquisition process
CN111064789B (en) * 2019-12-18 2022-09-20 北京三快在线科技有限公司 Data migration method and system
CN111274213B (en) * 2020-02-13 2022-07-15 苏州浪潮智能科技有限公司 Distributed file system HDFS (Hadoop distributed file system) cross-Insight cluster real-time data transmission method and system
CN111367889B (en) * 2020-03-09 2023-08-04 中国工商银行股份有限公司 Cross-cluster data migration method and device based on webpage interface
CN113760856A (en) * 2020-06-05 2021-12-07 京东数字科技控股有限公司 Database management method and device, computer readable storage medium and electronic device
CN113297166A (en) * 2020-07-27 2021-08-24 阿里巴巴集团控股有限公司 Data processing system, method and device
CN112799912A (en) * 2021-01-27 2021-05-14 苏州浪潮智能科技有限公司 Data monitoring method, device and system of AMS (automatic monitoring system)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101958808A (en) * 2010-10-18 2011-01-26 华东交通大学 Cluster task dispatching manager used for multi-grid access
CN103207814A (en) * 2012-12-27 2013-07-17 北京仿真中心 Decentralized cross cluster resource management and task scheduling system and scheduling method
CN103329105A (en) * 2011-01-27 2013-09-25 国际商业机器公司 Application recovery in file system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9223845B2 (en) * 2012-08-01 2015-12-29 Netapp Inc. Mobile hadoop clusters

Also Published As

Publication number Publication date
CN104239493A (en) 2014-12-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant