CN110019092B

CN110019092B - Data storage method, controller and system

Info

Publication number: CN110019092B
Application number: CN201711443996.3A
Authority: CN
Inventors: 徐振鑫
Original assignee: Huawei Technologies Co Ltd
Current assignee: Shenzhen Huawei Cloud Computing Technology Co ltd
Priority date: 2017-12-27
Filing date: 2017-12-27
Publication date: 2021-07-09
Anticipated expiration: 2037-12-27
Also published as: CN110019092A

Abstract

The application provides a data storage method in which the controller of the main cluster does not need to perform the merge operation, thereby avoiding the heavy bandwidth pressure and I/O pressure of executing a merge and improving system performance. The method comprises: receiving a merged file, a first sequence number, and first identification information sent by a second controller, wherein the first sequence number is the sequence number of first data in the merged file, and the first identification information indicates that the area number of the merged file in the standby cluster is a first area number; determining the sequence number of the last written data in the merged file according to the first sequence number; comparing the sequence number of the last written data in the merged file with the sequence number of the last written data in an unmerged file; when the sequence number of the last written data in the unmerged file is smaller than the sequence number of the last written data in the merged file, deleting the unmerged file; and storing the merged file.

Description

Data storage method, controller and system

Technical Field

The present application relates to the field of storage, and more particularly, to a method, controller and system for data storage.

Background

Under a cross-availability-zone (AZ) active-active architecture, a main cluster provides read-write service for user equipment, while a standby cluster is used only for backing up the data of the main cluster and sending the backed-up data to the main cluster, so that the data can be restored when the main cluster fails and loses data; the standby cluster does not provide read-write service for the user equipment. Data is synchronized between the main cluster and the standby cluster by synchronizing the write-ahead log (WAL). Both clusters adopt the distributed storage system HBase, each generates its own meta and HFile from the WAL, and each executes the HFile merge (compact) operation. Executing a compact causes heavy bandwidth pressure and short-term I/O pressure and consumes computing capacity. Because the main cluster is the one that serves reads and writes for the user equipment, executing the compact on the main cluster greatly reduces HBase performance and degrades the user experience.

Disclosure of Invention

The application provides a data storage method, a controller and a system, wherein a controller of a main cluster receives a merged file sent by a controller of a standby cluster and deletes the unmerged file originally stored in the main cluster, the data of the unmerged file already being contained in the merged file sent by the controller of the standby cluster. The controller of the main cluster therefore does not need to perform the merge operation, so that the heavy bandwidth pressure and short-term I/O pressure of executing a compact are avoided, and both user experience and system performance are improved.

In a first aspect, a method for data storage is provided, where the method is applied to a cluster system, the system includes a main cluster and a standby cluster, the method is performed by a first controller, the first controller is a controller of the main cluster, and a second controller is a controller of the standby cluster, the method includes:

receiving a merged file, a first sequence number, and first identification information sent by the second controller, wherein the first sequence number is the sequence number of first data in the merged file, and the first identification information is used for indicating that the area number of the merged file in the standby cluster is a first area number; determining the sequence number of the last written data in the merged file according to the first sequence number; comparing the sequence number of the last written data in the merged file with the sequence number of the last written data in an unmerged file, wherein the unmerged file is a file whose area number in the main cluster is the first area number; when the sequence number of the last written data in the unmerged file is smaller than the sequence number of the last written data in the merged file, deleting the unmerged file; and storing the merged file in the main cluster.

Therefore, after receiving the merged file, the first sequence number and the first identification information sent by the second controller of the standby cluster, the first controller of the primary cluster determines the sequence number of the last written data in the merged file according to the first sequence number, compares the sequence number of the last written data in the merged file with the sequence number of the last written data in the un-merged file, deletes the un-merged file when the sequence number of the last written data in the un-merged file is smaller than the sequence number of the last written data in the merged file, and stores the merged file in the primary cluster. The first controller does not need to perform merging operation, so that the consumption of large bandwidth pressure, short IO pressure and computing capacity in the process of executing the compact is avoided, the performance of the main cluster is improved, and the user experience is improved.

With reference to the first aspect, in certain implementations of the first aspect, the method further includes: in the main cluster, the region number of the merged file is set as the first region number.

At this time, the area number of the merged file is set as the first area number, and the association between the first area and the merged file is established, so that the file is conveniently searched according to the first area number.

With reference to the first aspect, in certain implementations of the first aspect, the first sequence number is a sequence number of last written data in the merged file.

With reference to the first aspect, in some implementations of the first aspect, before the receiving the merged file, the first sequence number, and the first identification information sent by the second controller, the method further includes: generating a first log, wherein the first log comprises information of a first area with an area number of the first area in the main cluster, and the information of the first area comprises the unmerged file associated with the first area; and sending the first log to the second controller.

At this time, when the controller of the master cluster generates the first log including the information of the first area with the area number being the first area number in the master cluster, the controller of the master cluster sends the first log to the second controller of the standby cluster, so that the second controller obtains the information of the first area, and associates the first area with the area number being the first area number in the standby cluster and the unmerged file in the standby cluster according to the information of the first area, thereby keeping the consistency of the information of the first area in the master cluster and the information of the standby cluster.

With reference to the first aspect, in some implementations of the first aspect, before the receiving the merged file, the first sequence number, and the first identification information sent by the second controller, the method further includes: generating a second log, wherein the second log comprises data of the unmerged file; sending the second log to the second controller; and generating the unmerged file according to the data of the unmerged file.

In a second aspect, a method for storing data is provided, where the method is applied to a cluster system, the system includes a main cluster and a standby cluster, the method is performed by a second controller, the second controller is a controller of the standby cluster, and the first controller is a controller of the main cluster, and the method includes:

when the number of files which are not merged in a first area in the standby cluster reaches a first threshold value, merging the files which are not merged to obtain a merged file, and recording a first serial number, wherein the first serial number is the serial number of first data in the merged file; and sending the merged file, the first sequence number and first identification information to the first controller, wherein the first identification information is used for indicating that the area number of the merged file in the standby cluster is the first area number.

Therefore, the unmerged files in the first area in the standby cluster are merged to obtain a merged file, the first sequence number is recorded, and the merged file, the first sequence number, and the first identification information are sent to the first controller of the main cluster, so that the main cluster does not need to perform the merge operation. This avoids the heavy bandwidth pressure, short-term I/O pressure, and consumption of computing capacity incurred while executing the compact, thereby improving the performance of the main cluster and the user experience.
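The standby-side behavior of the second aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `StoreFile` and `merge_if_needed`, the tuple layout of an entry, and the choice to record the largest sequence number as the first sequence number are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical entry layout: (sequence number, key, value).
Entry = Tuple[int, str, str]


@dataclass
class StoreFile:
    """Hypothetical stand-in for one data file belonging to an area."""
    region_id: str
    entries: List[Entry]


def merge_if_needed(
    region_id: str, unmerged: List[StoreFile], threshold: int
) -> Optional[Tuple[StoreFile, int, str]]:
    """Merge the unmerged files once their count reaches the threshold.

    Returns (merged file, first sequence number, area number), i.e. the
    three items the standby controller would send to the main cluster,
    or None when the threshold has not been reached.
    Assumes every unmerged file holds at least one entry.
    """
    if len(unmerged) < threshold:
        return None
    # Sorting by sequence number keeps the merged file in write order.
    all_entries = sorted(e for f in unmerged for e in f.entries)
    merged = StoreFile(region_id, all_entries)
    # Assumption: the recorded first sequence number is that of the
    # last written data, i.e. the largest sequence number.
    first_seq = all_entries[-1][0]
    return merged, first_seq, region_id
```

With two files and a threshold of 2, the call returns the merged file together with the largest sequence number and the area number; with a threshold of 3 it returns `None` and nothing is sent.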

With reference to the second aspect, in certain implementations of the second aspect, the method further includes:

receiving a first log sent by the first controller, wherein the first log comprises information of a first area with an area number of the first area in the main cluster, and the information of the first area comprises the uncombined files associated with the first area; playing back the first log to obtain the information of the first area; and according to the information of the first area, associating the area number in the standby cluster as the first area of the first area number and the files which are not merged in the standby cluster.

With reference to the second aspect, in certain implementations of the second aspect, the method further includes: receiving a second log sent by the first controller, wherein the second log comprises data of the unmerged file; and generating the unmerged file according to the data of the unmerged file.

In a third aspect, a controller is provided, where the controller is applied to a cluster system, the system includes a main cluster and a standby cluster, the controller is a first controller, the first controller is a controller of the main cluster, and a second controller is a controller of the standby cluster, and the controller includes:

a receiving module, configured to receive the merged file, a first serial number, and first identification information sent by the second controller, where the first serial number is a serial number of first data in the merged file, and the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number; the processing module is used for determining the serial number of the last written data in the merged file according to the first serial number; the processing module is further configured to compare a sequence number of last written data in the merged file with a sequence number of last written data in an un-merged file, where the un-merged file is a file whose area number in the main cluster is the first area number; the processing module is further configured to delete the uncombined file when the sequence number of the last written data in the uncombined file is smaller than the sequence number of the last written data in the combined file; and the storage module is used for storing the merged file in the main cluster.

With reference to the third aspect, in some implementations of the third aspect, the processing module is further configured to: in the main cluster, the region number of the merged file is set as the first region number.

With reference to the third aspect, in some implementations of the third aspect, the first sequence number is a sequence number of last written data in the merged file.

With reference to the third aspect, in certain implementations of the third aspect, the processing module is further configured to generate a first log, where the first log includes information of a first area with an area number of the first area within the primary cluster, and the information of the first area includes the unmerged file associated with the first area; the controller further comprises: and the sending module is used for sending the first log to the second controller.

With reference to the third aspect, in certain implementations of the third aspect, the processing module is further configured to generate a second log, where the second log includes data of the unmerged file; the sending module is further configured to send the second log to the second controller; the storage module is further configured to generate the unmerged file according to the data of the unmerged file.

In a fourth aspect, a controller for data storage is provided, where the controller is applied to a cluster system, the system includes a main cluster and a standby cluster, the controller is a second controller, the second controller is a controller of the standby cluster, a first controller is a controller of the main cluster, and the controller includes:

a processing module, configured to merge the unmerged files in a first area in the standby cluster to obtain a merged file when the number of unmerged files in the first area in the standby cluster reaches a first threshold, and to record a first sequence number, where the first sequence number is the sequence number of first data in the merged file; and a sending module, configured to send the merged file, the first sequence number, and first identification information to the first controller, where the first identification information is used for indicating that the area number of the merged file in the standby cluster is the first area number.

With reference to the fourth aspect, in certain implementations of the fourth aspect, the controller further includes:

a receiving module, configured to receive a first log sent by the first controller, where the first log includes information of a first area whose area number is the first area number in the primary cluster, and the information of the first area includes the uncombined file associated with the first area; the processing module is further configured to play back the first log, and obtain information of the first area; the processing module is further configured to associate, according to the information of the first area, the first area with the area number in the standby cluster as the first area number and the unmerged file in the standby cluster.

With reference to the fourth aspect, in some implementations of the fourth aspect, the receiving module is further configured to receive a second log sent by the first controller, where the second log includes data of the unmerged file; the processing module is further configured to generate the unmerged file according to the data of the unmerged file.

In a fifth aspect, a system is provided, the system comprising the controller of the third aspect or any implementation of the third aspect, and the controller of the fourth aspect or any implementation of the fourth aspect.

In a sixth aspect, there is provided a controller comprising a processor, a memory and an interface, the interface being for communicating with a second controller, the memory being for storing computer program code comprising instructions which, when executed by the processor, perform the method of the first aspect or any of the optional implementations of the first aspect.

In a seventh aspect, there is provided a controller comprising a processor, a memory and an interface, the interface being for communicating with a first controller, the memory being for storing computer program code comprising instructions which, when executed by the processor, perform the method of the second aspect or any of the optional implementations of the second aspect.

In an eighth aspect, a computer-readable storage medium is provided, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by at least one processor of a controller, the controller executes the method in the first aspect or any optional implementation manner of the first aspect.

In a ninth aspect, there is provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by at least one processor of a controller, cause the controller to perform the method of the second aspect or any of the optional implementations of the second aspect.

In a tenth aspect, there is provided a computer program product comprising computer executable instructions stored on a computer readable storage medium from which at least one processor of a controller can read, the at least one processor executing the computer executable instructions to cause the controller to perform the method of the first aspect or any of the optional implementations of the first aspect.

In an eleventh aspect, there is provided a computer program product comprising computer executable instructions stored in a computer readable storage medium from which at least one processor of a controller can read the computer executable instructions, the at least one processor executing the computer executable instructions to cause the controller to perform the method of the second aspect or any optional implementation of the second aspect.

Drawings

FIG. 1 is a system architecture diagram of a controller and a method of data storage according to the present application.

Fig. 2 is a diagram showing an example of the structure of the controller 11 in fig. 1.

FIG. 3 is a schematic flow chart diagram of a method of data storage according to the present application.

FIG. 4 is a schematic flow chart diagram of a method of data storage according to the present application.

FIG. 5 is a schematic flow chart diagram of a method of data storage according to the present application.

FIG. 6 is a schematic block diagram of a controller according to the present application.

FIG. 7 is a schematic block diagram of a controller according to the present application.

Fig. 8 shows a schematic block diagram of the apparatus provided herein.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

FIG. 1 is a system architecture diagram 100 of a controller and method of data storage according to the present application. The system is a cross-AZ active-active architecture system. As shown in fig. 1, the system 100 includes a user equipment 10, a master cluster controller 11, and a standby cluster controller 12. The master cluster controller 11 and the standby cluster controller 12 may each be a computing device, such as a server or a desktop computer. An operating system and application programs are installed on the master cluster controller 11 and the standby cluster controller 12. The master cluster controller 11 may receive input/output (I/O) requests from the user equipment. The master cluster controller 11 may also store the data carried in an I/O request and write the data to a persistent storage device. The main cluster provides read-write service for the user equipment, while the standby cluster is used only for backing up the data of the main cluster and sending the backed-up data to the main cluster, so that the data can be restored when the main cluster fails and loses data; the standby cluster does not provide read-write service for the user equipment. Data is synchronized between the main cluster and the standby cluster by synchronizing the write-ahead log (WAL).

In fig. 1, only the master cluster controller 11 and the slave cluster controller 12 are shown, and other devices of the master cluster and the slave cluster are not shown, but other devices such as a server, a computer, and the like may be further included in the master cluster, and other devices such as a server, a computer, and the like may be further included in the slave cluster.

The main and standby clusters may employ the distributed storage system HBase. HBase is a highly reliable, high-performance, column-oriented, scalable, distributed storage system. When writing data, the user equipment first writes the WAL and, after the WAL has been written successfully, writes the cache memstore. Writing the data to the WAL first stores it on the hard disk, so the data remains in the WAL after a power failure; the data is then written to the memstore, and if the memstore loses power, the data can be read back from the hard disk. When the data cached in the memstore meets a certain condition, a flush operation is executed so that the data is actually written to disk, forming a data file HFile. As more data is written, the number of flushes grows and so does the number of HFiles. Too many data files increases the number of I/O operations per data query, so HBase continually tries to merge these files; this merging process is called compact.
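The write path described above (WAL first, then memstore, then flush to a data file) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyStore`, its count-based flush condition, and the `recover` method are hypothetical and do not reproduce HBase's actual classes or flush policy.

```python
class ToyStore:
    """Toy model of the WAL -> memstore -> flush write path."""

    def __init__(self, flush_threshold: int = 3):
        self.wal = []          # stands in for the on-disk write-ahead log
        self.memstore = {}     # in-memory cache, lost on power failure
        self.hfiles = []       # flushed data files, one dict per flush
        self.flushed_upto = 0  # number of WAL entries already in data files
        # Assumption: flush once this many keys are cached.
        self.flush_threshold = flush_threshold

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))  # 1. persist to the log first
        self.memstore[key] = value     # 2. then update the cache
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the cached data out as a new sorted data file."""
        if self.memstore:
            self.hfiles.append(dict(sorted(self.memstore.items())))
            self.memstore = {}
            self.flushed_upto = len(self.wal)

    def recover(self) -> None:
        """Rebuild the cache by replaying WAL entries not yet flushed."""
        self.memstore = {}
        for key, value in self.wal[self.flushed_upto:]:
            self.memstore[key] = value
```

Each flush adds one more data file, which is why the file count grows with write volume and why a merge step eventually becomes necessary.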

HBase stores data in the form of a table. The table is composed of rows and columns, and its logical structure is shown in Table 1.

Table 1: Logical structure of an HBase data table

Compared with a traditional row-column two-dimensional table, each row of HBase data has one row-key. The data in a table is kept in order, sorted bytewise by row-key + CF:qualifier + timestamp, so that a data query can locate data quickly by binary search. The row-key may be any string with a maximum length of 64 KB, and row-keys are sorted in lexicographic order. The data in a row is grouped by column family (CF). Column families also affect the physical storage of HBase data, so they must be defined in advance and are not easily modified. Every row in a table has the same column families, although a row need not store data in each column family. Each column in a CF is one qualifier, and one qualifier can belong to only one CF. Data in a column family is located by its column qualifier. Column qualifiers do not have to be defined in advance and need not be consistent between rows. Each row of data in HBase has one timestamp; by default HBase can store 3 different versions of the same row, and the timestamp can be assigned automatically by HBase (the current system time, accurate to the millisecond) or assigned explicitly when the data is written. In HBase, row-key + CF + qualifier + timestamp locates one cell.
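The sorted-key lookup described above can be sketched with Python's standard `bisect` module. The helper names `cell_key`, `build_index`, and `latest` are hypothetical. The sketch also assumes that, within one row-key, CF, and qualifier, newer timestamps sort first, which is modeled here by negating the timestamp.

```python
import bisect


def cell_key(row: str, family: str, qualifier: str, timestamp: int):
    """Sort key: bytewise row-key, CF, qualifier, then newest timestamp first."""
    return (row.encode(), family.encode(), qualifier.encode(), -timestamp)


def build_index(cells):
    """cells: iterable of (row, family, qualifier, timestamp, value)."""
    ordered = sorted((cell_key(r, f, q, ts), v) for r, f, q, ts, v in cells)
    keys = [k for k, _ in ordered]
    values = [v for _, v in ordered]
    return keys, values


def latest(keys, values, row: str, family: str, qualifier: str):
    """Binary-search for the newest version of one cell, or None if absent."""
    # -inf sorts before every real (negated) timestamp, so bisect_left
    # lands on the first, i.e. newest, version of the requested cell.
    probe = (row.encode(), family.encode(), qualifier.encode(), float("-inf"))
    i = bisect.bisect_left(keys, probe)
    if i < len(keys) and keys[i][:3] == probe[:3]:
        return values[i]
    return None
```

Because the keys are kept sorted, a lookup costs a logarithmic number of comparisons instead of a scan over every cell.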

The physical storage of HBase data is briefly described below. For physical storage, one logical two-dimensional table is split into several folders by CF: one CF corresponds to one folder, and each folder holds several HFiles. The Master of the HBase system divides the table in the row direction; several rows of records form one HRegion, so one table is divided into several HRegions, and the HRegion is the minimum unit of HBase distributed storage and load balancing. An HRegion contains several CFs, and HBase physical storage is organized by CF, so an HRegion is internally divided into several HStores, each of which is responsible for the physical storage of one CF.

The HStore is internally composed of a memory part (Memstore) and a disk part (HFile), data is written into the Memstore firstly, and the HFile is generated when the Memstore overflows, and finally falls on a Block of the HDFS.

In the prior art, the main and standby clusters each generate meta and HFile according to the WAL, and each executes the HFile merge (compact) operation. Executing the compact causes heavy bandwidth pressure and short-term I/O pressure. Because the main cluster provides read-write service for the user equipment, executing the compact on the main cluster greatly reduces HBase performance and degrades the user experience.

Fig. 2 is a diagram showing an example of the structure of the controller 11 in fig. 1, and as shown in fig. 2, the controller 11 includes an interface card 110, a processor 112, a memory 111, and an interface card 113.

The interface card 110 is used for communicating with the user equipment and receiving instructions sent by the user equipment, and the controller 11 can receive writing instructions of the user equipment through the interface card 110. For example, a write instruction includes a key (key) and a value (value), the key being an identification of the value. As a specific example, the value may be various information of a student, and the keyword may be a student number or other identifier indicating an attribute of an aspect of the student.

An interface card 113 for communicating with the standby cluster controller 12.

The processor 112 is a Central Processing Unit (CPU). In an embodiment of the present invention, the processor 112 may be configured to receive a write instruction or a read instruction from a user device and process the instruction. The processor 112 may also send the data in the write instruction to the persistent storage device. The processor 112 is further configured to allocate a logical address to the data and store the correspondence between the keyword and the allocated logical address, so that the data can later be read according to that correspondence.

The memory 111 includes volatile memory, non-volatile memory, or a combination thereof. Volatile memory is, for example, random-access memory (RAM). Non-volatile memory is, for example, a floppy disk, a hard disk, a Solid State Disk (SSD), an optical disk, or various other machine-readable media capable of storing program code. The memory 111 has a power-protection function, which means that data stored in the memory 111 is not lost when the system is powered off and powered on again. There may be one or more memories 111, used for temporarily storing data received from the host or data read from the persistent storage device. For example, when the controller 11 receives a plurality of write commands sent by the user device, the data in the plurality of write commands may be temporarily stored in the memory 111.

For a better understanding of the present application, embodiments of the present invention will now be described with reference to fig. 3 to 8, taking as an example a system identical or similar to that shown in fig. 1.

Fig. 3 is a schematic flow chart of a method 200 for data storage according to the present application, which is applied to a cluster system including a master cluster and a slave cluster, wherein the first controller is a controller of the master cluster, and the second controller is a controller of the slave cluster. As shown in fig. 3, the method 200 includes the following.

At 210, when the number of files that are not merged in the first area in the standby cluster reaches a first threshold, the second controller merges the files that are not merged to obtain a merged file, and records a first sequence number, where the first sequence number is a sequence number of first data in the merged file.

In 220, the second controller sends the merged file, the first serial number, and first identification information to the first controller, where the first identification information is used to indicate that the area number of the merged file in the standby cluster is the first area number.

At 230, the first controller receives the merged file, the first serial number and the first identification information sent by the second controller.

At 240, the first controller determines a sequence number of the last written data in the merged file based on the first sequence number.

At 250, the first controller compares the sequence number of the last written data in the merged file with the sequence number of the last written data in the un-merged file, where the un-merged file is the file with the region number in the main cluster being the first region number.

At 260, the first controller deletes the unmerged file when the sequence number of the last written data in the unmerged file is less than the sequence number of the last written data in the merged file.

At 270, the first controller stores the merged file within the primary cluster.

Therefore, in this embodiment of the present application, when the number of unmerged files in the first area in the standby cluster reaches the first threshold, the second controller merges the unmerged files to obtain a merged file and records a first sequence number, and then sends the merged file, the first sequence number, and the first identification information to the first controller. The first controller determines the sequence number of the last written data in the merged file according to the first sequence number and compares it with the sequence number of the last written data in the unmerged file. When the sequence number of the last written data in the unmerged file is smaller than the sequence number of the last written data in the merged file, the first controller deletes the unmerged file and stores the merged file. The first controller of the main cluster therefore does not need to perform the merge operation, which avoids the heavy bandwidth pressure, short-term I/O pressure, and consumption of computing capacity incurred while executing the compact, and improves both the HBase performance of the main cluster and the user experience.
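The main-cluster side of method 200 (steps 230 to 270) can be sketched as follows. The names `StoreFile`, `PrimaryRegionStore`, and `apply_merged` are hypothetical. The sketch assumes the first sequence number is already the sequence number of the last written data, and it keeps any unmerged file whose last sequence number is equal or larger, since the text only specifies deletion for the strictly smaller case.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StoreFile:
    """Hypothetical data file, tagged with its last written sequence number."""
    name: str
    last_seq: int


class PrimaryRegionStore:
    """Hypothetical per-area file list kept by the main cluster controller."""

    def __init__(self) -> None:
        self.files: Dict[str, List[StoreFile]] = {}

    def apply_merged(
        self, merged: StoreFile, first_seq: int, region_id: str
    ) -> List[str]:
        """Apply a merged file received from the standby cluster.

        Returns the names of the unmerged files that were deleted.
        """
        # Step 240: assumed convention, first_seq is the last written one.
        merged_last_seq = first_seq
        current = self.files.get(region_id, [])
        # Steps 250-260: delete files fully covered by the merged file.
        deleted = [f.name for f in current if f.last_seq < merged_last_seq]
        kept = [f for f in current if f.last_seq >= merged_last_seq]
        # Step 270: store the merged file under the same area number.
        self.files[region_id] = kept + [merged]
        return deleted
```

A file written after the standby cluster performed its merge carries a larger sequence number, so it survives the comparison and remains readable next to the merged file.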

It should be understood that an area in this application may be referred to as a Region, and may also be expressed in other ways.

Optionally, the method 200 further comprises: in the main cluster, the region number of the merged file is set as the first region number.

Optionally, the first sequence number is a sequence number of last written data in the merged file.

Specifically, each piece of data is written with a sequence number, and the sequence numbers are arranged in an increasing order. The second controller records the sequence number of the last written data when performing the file merge, that is, records the currently largest sequence number.

It should be understood that the first sequence number may also be a sequence number of the first data in the merged file. The first data may be the last written data in the merged file, the first written data in the merged file, or any written data in the merged file.

For example, the second controller may, according to a predetermined convention, record the sequence number of the second-to-last data. When the first controller receives this sequence number, it determines, according to the same convention, that it is the sequence number of the second-to-last data in the merged file, and increments it by 1 to obtain the sequence number of the last written data in the merged file.

For another example, when the second controller sends the first sequence number to the first controller without prior agreement, the second controller also carries a parameter indicating a difference between the sequence numbers of the first data and the last written data in the merged file. The first controller determines a sequence number of last written data in the merged file according to the first sequence number and the parameter when receiving the first sequence number and the parameter.
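The two ways of deriving the sequence number of the last written data, by a predetermined convention or by an explicitly carried difference parameter, can be sketched as below. The function name `last_written_seq` and its parameter names are hypothetical and chosen only for illustration.

```python
from typing import Optional


def last_written_seq(first_seq: int, offset: Optional[int] = None, agreed_offset: int = 0) -> int:
    """Derive the sequence number of the last written data in the merged file.

    `offset` models the difference parameter sent together with the first
    sequence number; when it is absent, `agreed_offset` models the value
    fixed in advance by convention (0 if the first sequence number already
    is the last one, 1 if it is the second-to-last one, and so on).
    """
    return first_seq + (agreed_offset if offset is None else offset)


# The first sequence number already is the last written one.
direct = last_written_seq(20000)
# Convention: the second-to-last sequence number is recorded.
by_convention = last_written_seq(19999, agreed_offset=1)
# No convention: the difference is carried as a parameter.
by_parameter = last_written_seq(19990, offset=10)
```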

Optionally, before receiving the merged file, the first sequence number, and the first identification information sent by the second controller, the method further includes: the first controller generating a first log, the first log including information of a first area whose area number within the primary cluster is the first area number, the information of the first area including the unmerged file associated with the first area; and sending the first log to the second controller.

Specifically, when the first area of the main cluster changes, a first log is generated, and the first controller sends the first log to the second controller so that the second controller can generate the first area of the standby cluster according to the first log.

Optionally, the first log includes a start key and an end key of the first area.

For example, in the HBase system, the view information of Region is stored in the meta table, which holds some information to maintain the cluster and the architecture of the cluster. When the view information of the Region of the main cluster is changed, a WAL log is generated, the first controller sends the WAL log to the second controller, the view information of the changed Region can be synchronized to the second controller, the second controller plays back the WAL log, and the Region of the standby cluster is generated according to the view information of the changed Region.

Optionally, the second controller receives a first log sent by the first controller, where the first log includes information of a first area whose area number in the primary cluster is the first area number, and the information of the first area includes the unmerged file associated with the first area; plays back the first log to obtain the information of the first area; and, according to the information of the first area, associates the first area whose area number in the standby cluster is the first area number with the unmerged file in the standby cluster.
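Playing back the first log on the standby side can be pictured with the following sketch. The log entry layout (`AreaInfo` with an area number, a start key, an end key, and a list of unmerged file names) is an assumption made for illustration; the application does not specify the format of the log, and this is not the HBase WAL format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AreaInfo:
    """Hypothetical content of one first-log entry."""
    area_number: str
    start_key: str
    end_key: str
    unmerged_files: List[str] = field(default_factory=list)


def replay_first_log(log: List[AreaInfo], standby_view: Dict[str, AreaInfo]) -> None:
    """Replay log entries in order so the standby view mirrors the main cluster.

    Each entry creates or replaces the area with the same area number and
    associates it with the unmerged files named in the entry.
    """
    for entry in log:
        standby_view[entry.area_number] = AreaInfo(
            entry.area_number, entry.start_key, entry.end_key, list(entry.unmerged_files)
        )


# Example data, invented for illustration.
view: Dict[str, AreaInfo] = {}
replay_first_log(
    [AreaInfo("01", "a", "m", ["f1"]), AreaInfo("01", "a", "m", ["f1", "f2"])],
    view,
)
```

After the replay, the area numbered `01` in the standby view carries the same key range and the same list of unmerged files as the latest entry generated by the main cluster.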

Optionally, before receiving the merged file, the first sequence number, and the first identification information sent by the second controller, the method further includes: generating a second log, where the second log includes data of the unmerged file; sending the second log to the second controller; and generating the unmerged file according to the data of the unmerged file.

Optionally, the second controller receives a second log sent by the first controller, where the second log includes data of the unmerged file; and generates the unmerged file according to the data of the unmerged file.

Fig. 4 is a schematic flow chart of a method 300 for storing data according to the present application. The method is applied to a cluster system that includes a primary cluster and a secondary cluster. The third controller is a controller of the primary cluster and is used for managing the primary cluster system; the fourth controller is a controller of the secondary cluster and is used for managing the secondary cluster system. The first server is used for storing data of the primary cluster, and the second server is used for storing data of the secondary cluster; the third controller manages the first server, and the fourth controller manages the second server. It should be appreciated that fig. 4 illustrates a method of data storage when the primary and secondary clusters each deploy the control system and the storage system separately. As shown in fig. 4, the method 300 includes the following.

In 301, when the number of files that are not merged in the first area in the standby cluster reaches a first threshold, the fourth controller merges the files that are not merged to obtain a merged file, and records a first sequence number, where the first sequence number is a sequence number of first data in the merged file.

In 302, the fourth controller sends the merged file and the first serial number to the second server.

In 303, the second server receives the merged file and the first serial number, and stores the merged file.

In 304, the second server sends the merged file, the first serial number, and first identification information to the first server, where the first identification information is used to indicate that the area number of the merged file in the standby cluster is the first area number.

In 305, the first server receives the merged file, the first serial number and the first identification information sent by the second server.

In 306, the fourth controller sends a first message to the third controller, where the first message is used to indicate that the second server has sent the merged file, the first sequence number, and the first identification information to the first server.

In 307, the third controller sends a second message to the first server after receiving the first message, where the second message is used to instruct the first server to perform the delete operation.

In 308, the first server receives the second message and determines the sequence number of the last written data in the merged file according to the first sequence number.

In 309, the first server compares the sequence number of the last written data in the merged file with the sequence number of the last written data in the un-merged file, where the un-merged file is the file with the region number in the main cluster being the first region number.

In 310, the first server deletes the un-merged file when the sequence number of the last written data in the un-merged file is smaller than the sequence number of the last written data in the merged file.

In 311, the first server stores the merged file within the primary cluster.

Therefore, in this embodiment of the present application, when the number of unmerged files in the first area of the standby cluster reaches the first threshold, the fourth controller merges the unmerged files to obtain a merged file and records the first sequence number. The second server then sends the merged file, the first sequence number, and the first identification information to the first server. On receiving the second message instructing the delete operation, the first server determines the sequence number of the last written data in the merged file according to the first sequence number and compares it with the sequence number of the last written data in the unmerged file. When the sequence number of the last written data in the unmerged file is smaller than the sequence number of the last written data in the merged file, the first server deletes the unmerged file and stores the merged file. The third controller of the main cluster therefore does not need to perform the merge operation, which avoids the large bandwidth pressure, the short-term IO pressure, and the consumption of computing power incurred when executing a compact, and thereby improves the HBase performance of the main cluster and the user experience.
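The main-cluster side of steps 305 to 311 can be sketched as a small message-driven model. The classes `FirstServer` and `ThirdController`, their method names, and the dictionary standing in for stored files are hypothetical; the sketch also assumes that the first sequence number is already the sequence number of the last written data in the merged file.

```python
from typing import Dict, Optional, Tuple

Files = Dict[str, int]  # file name -> sequence number of its last written data


class FirstServer:
    """Hypothetical storage server of the main cluster."""

    def __init__(self, areas: Dict[str, Files]) -> None:
        self.areas = areas
        self.pending: Optional[Tuple[str, int, str]] = None

    def receive(self, merged_name: str, first_seq: int, area_number: str) -> None:
        # Step 305: hold the merged file until the controller's message arrives.
        self.pending = (merged_name, first_seq, area_number)

    def on_second_message(self) -> None:
        # Steps 308 to 311, triggered by the second message.
        if self.pending is None:
            return
        merged_name, first_seq, area_number = self.pending
        last_seq = first_seq  # assumption: first sequence number is the last written one
        files = self.areas.setdefault(area_number, {})
        # Collect names first so the dict is not mutated while iterating.
        for name in [n for n, seq in files.items() if seq < last_seq]:
            del files[name]
        files[merged_name] = last_seq
        self.pending = None


class ThirdController:
    """Hypothetical controller of the main cluster."""

    def __init__(self, server: FirstServer) -> None:
        self.server = server

    def on_first_message(self) -> None:
        # Step 307: forward the delete instruction to the first server.
        self.server.on_second_message()


# Example data, invented for illustration.
server = FirstServer({"01": {"f1": 10, "f2": 20, "f3": 30}})
controller = ThirdController(server)
server.receive("merged", 25, "01")   # sent by the second server (steps 304 and 305)
controller.on_first_message()        # first message from the fourth controller (step 306)
print(server.areas["01"])
```

Separating `receive` from `on_second_message` mirrors the split between data transfer (server to server) and control (controller to server) in the method 300.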

Optionally, the method 300 further comprises: in the main cluster, the third controller sets the region number of the merged file as the first region number.

Specifically, each piece of data is written with a sequence number, and the sequence numbers are arranged in an increasing order. The fourth controller records the sequence number of the last written data when performing the file merge, that is, records the currently largest sequence number.

For the description of the first sequence number, reference may be made to the related description in the method 200, and details are not repeated here to avoid repetition.

Optionally, the method further comprises: the third controller generating a first log including information of a first area having an area number within the primary cluster as the first area number, the information of the first area including the uncombined file associated with the first area; sending the first log to the fourth controller.

Specifically, when the first area of the main cluster changes, a first log is generated, and the third controller sends the first log to the fourth controller so that the fourth controller can generate the first area of the standby cluster according to the first log.

For example, in the HBase system, the view information of Region is stored in the meta table, which holds some information to maintain the cluster and the architecture of the cluster. When the view information of the Region of the main cluster is changed, a WAL log is generated, the third controller sends the WAL log to the fourth controller, the view information of the changed Region can be synchronized to the fourth controller, the fourth controller plays back the WAL log, and the Region of the standby cluster is generated according to the view information of the changed Region.

Optionally, the fourth controller receives a first log sent by the third controller, where the first log includes information of a first area whose area number in the primary cluster is the first area number, and the information of the first area includes the unmerged file associated with the first area; plays back the first log to obtain the information of the first area; and, according to the information of the first area, associates the first area whose area number in the standby cluster is the first area number with the unmerged file in the standby cluster.

Optionally, the method further comprises: generating a second log, where the second log includes data of the unmerged file; sending the second log to the fourth controller; and generating the unmerged file according to the data of the unmerged file.

Optionally, the fourth controller receives a second log sent by the third controller, where the second log includes data of the unmerged file; and generates the unmerged file according to the data of the unmerged file.

For a better understanding of the embodiments of the present application, a data storage method of the present application is now described with reference to fig. 5. As shown in fig. 5, the main cluster is divided into an area 01 and an area 02, and the sequence ID of the most recently stored data recorded under the area 01 is 26000. The area division of the standby cluster is identical to that of the main cluster: it is also divided into the area 01 and the area 02, and the sequence ID of the most recently stored data recorded under its area 01 is 26000. The standby cluster performs a merge operation, merging the files with sequence numbers less than or equal to 20000 into one file and recording 20000 as the last sequence number of the merged file. The standby cluster sends the merged file and the sequence number 20000 to the main cluster; the main cluster deletes all files with sequence numbers smaller than 20000 and associates the merged file with the area 01 of the main cluster.

It should be understood that fig. 5 is described from the perspective of the cluster system; for the specific actions performed by the controllers or servers in the clusters, reference may be made to the method 200 or the method 300, and details are not described herein again.

FIG. 6 is a schematic block diagram of a controller according to the present application. As shown in fig. 6, the controller 400 includes the following.

A receiving module 410, configured to receive the merged file, a first serial number and first identification information sent by the second controller, where the first serial number is a serial number of first data in the merged file, and the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number.

The processing module 420 is configured to determine a sequence number of the last written data in the merged file according to the first sequence number.

The processing module 420 is further configured to compare a sequence number of last written data in the merged file with a sequence number of last written data in an un-merged file, where the un-merged file is a file whose area number in the main cluster is the first area number.

The processing module 420 is further configured to delete the uncombined file when the sequence number of the last written data in the uncombined file is smaller than the sequence number of the last written data in the combined file.

A storage module 430, configured to store the merged file in the main cluster.

Optionally, the processing module 420 is further configured to: in the main cluster, the region number of the merged file is set as the first region number.

Optionally, the processing module 420 is further configured to generate a first log, where the first log includes information of a first area whose area number within the primary cluster is the first area number, and the information of the first area includes the unmerged file associated with the first area; the controller further comprises a sending module, configured to send the first log to the second controller.

Optionally, the processing module 420 is further configured to generate a second log, where the second log includes data of the unmerged file; the sending module is further configured to send the second log to the second controller; and the storage module is further configured to generate the unmerged file according to the data of the unmerged file.

The controller 400 corresponds fully to the first controller in the embodiment of the method 200, with each module executing the corresponding steps; for details, reference may be made to the corresponding method embodiments.

It should be noted that the receiving module 410, the processing module 420 and the storage module 430 may be separately arranged, or may be integrated together and implemented by a single processing chip.

FIG. 7 is a schematic block diagram of a controller according to the present application. As shown in fig. 7, the controller 500 includes:

a processing module 510, configured to, when the number of files that are not merged in a first area in the backup cluster reaches a first threshold, merge the files that are not merged to obtain a merged file, and record a first sequence number, where the first sequence number is a sequence number of first data in the merged file;

a sending module 520, configured to send the merged file, the first sequence number, and first identification information to the first controller, where the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number.

Optionally, the controller further comprises: a receiving module, configured to receive a first log sent by the first controller, where the first log includes information of a first area whose area number in the primary cluster is the first area number, and the information of the first area includes the unmerged file associated with the first area; the processing module is further configured to play back the first log to obtain the information of the first area; the processing module is further configured to associate, according to the information of the first area, the first area whose area number in the standby cluster is the first area number with the unmerged file in the standby cluster.

Optionally, the receiving module is further configured to receive a second log sent by the first controller, where the second log includes data of the unmerged file; the processing module is further configured to generate the unmerged file according to the data of the unmerged file.

The controller 500 corresponds fully to the second controller in the embodiment of the method 200, with each module executing the corresponding steps; for details, reference may be made to the corresponding method embodiments.

It should be noted that the processing module 510 and the sending module 520 may be separately configured, or may be integrated together and implemented by one processing chip.

Fig. 8 shows a schematic block diagram of an apparatus 600 provided herein. The apparatus 600 comprises a memory 610, a processor 620, and an input/output interface 630. The memory 610, the processor 620, and the input/output interface 630 are connected through an internal connection path. The memory 610 is used for storing program instructions, and the processor 620 is used for executing the program instructions stored in the memory 610 to control the input/output interface 630 to receive input data and information and to output data such as operation results.

Optionally, when the program instructions are executed, the processor 620 may implement the operations of the method 200 or the method 300, which are not described herein again for brevity. In this case, the apparatus 600 may be a first controller or a second controller.

It should be understood that, in the embodiment of the present application, the processor 620 may be a Central Processing Unit (CPU), and the processor 620 may also be other general-purpose processors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on.

The input/output interface 630 may be for implementing signal transmission and reception functions.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in practice: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for storing data, the method being applied to a cluster system, the cluster system including a main cluster and a standby cluster, the method being performed by a first controller, the first controller being a controller of the main cluster, and a second controller being a controller of the standby cluster, the method comprising:

receiving the merged file, a first serial number and first identification information sent by the second controller, where the first serial number is a serial number of first data in the merged file, and the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number;

determining the sequence number of the last written data in the merged file according to the first sequence number;

comparing the serial number of the last written data in the merged file with the serial number of the last written data in the un-merged file, wherein the un-merged file is a file of which the area number in the main cluster is the first area number;

deleting the uncombined file when the serial number of the last written data in the uncombined file is smaller than the serial number of the last written data in the combined file;

and storing the merged file in the main cluster.

2. The method of claim 1, further comprising:

setting the region number of the merged file as the first region number in the main cluster.

3. The method of claim 1, wherein the first sequence number is a sequence number of last written data in the merged file.

4. The method according to any one of claims 1 to 3, wherein before the receiving the merged file, the first sequence number and the first identification information sent by the second controller, the method further comprises:

generating a first log, wherein the first log comprises information of a first area with an area number of the first area in the main cluster, and the information of the first area comprises the files which are not merged and are associated with the first area;

sending the first log to the second controller.

5. The method of claim 4, wherein before the receiving the merged file, the first sequence number, and the first identification information sent by the second controller, the method further comprises:

generating a second log, the second log comprising data of the unmerged file;

sending the second log to the second controller;

and generating the uncombined file according to the data of the uncombined file.

6. A method for storing data, the method being applied to a cluster system, the cluster system including a main cluster and a standby cluster, the method being performed by a second controller, the second controller being a controller of the standby cluster, and a first controller being a controller of the main cluster, the method comprising:

when the number of files which are not merged in a first area in the backup cluster reaches a first threshold value, merging the files which are not merged to obtain a merged file, and recording a first serial number, wherein the first serial number is the serial number of first data in the merged file;

and sending the merged file, the first sequence number and first identification information to the first controller, wherein the first identification information is used for indicating that the area number of the merged file in the standby cluster is a first area number.

7. The method of claim 6, further comprising:

receiving a first log sent by the first controller, wherein the first log comprises information of a first area with an area number of the first area in the main cluster, and the information of the first area comprises the files which are not merged and are associated with the first area;

playing back the first log to acquire the information of the first area;

and according to the information of the first area, associating the first area whose area number in the standby cluster is the first area number with the files which are not merged in the standby cluster.

8. The method according to claim 6 or 7, characterized in that the method further comprises:

receiving a second log sent by the first controller, wherein the second log comprises data of the uncombined file;

and generating the uncombined file according to the data of the uncombined file.

9. A controller, the controller is applied to a cluster system, the cluster system includes a main cluster and a standby cluster, the controller is a first controller, the first controller is the controller of the main cluster, a second controller is the controller of the standby cluster, and the controller includes:

a receiving module, configured to receive a merged file, a first serial number, and first identification information sent by the second controller, where the first serial number is a serial number of first data in the merged file, and the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number;

the processing module is used for determining the serial number of the last written data in the merged file according to the first serial number;

the processing module is further configured to compare a sequence number of last written data in the merged file with a sequence number of last written data in an un-merged file, where the un-merged file is a file whose area number in the main cluster is the first area number;

the processing module is further configured to delete the uncombined file when the sequence number of the last written data in the uncombined file is smaller than the sequence number of the last written data in the combined file;

and the storage module is used for storing the merged file in the main cluster.

10. The controller of claim 9, further comprising:

11. The controller of claim 9, wherein the first sequence number is a sequence number of last written data in the merged file.

12. The controller according to any one of claims 9 to 11, wherein the processing module is further configured to generate a first log, the first log including information of a first area with an area number of the first area within the primary cluster, the information of the first area including the unmerged file associated with the first area;

the controller further includes:

and the sending module is used for sending the first log to the second controller.

13. The controller of claim 12, wherein the processing module is further configured to generate a second log, the second log comprising data of the unmerged file;

the sending module is further configured to send the second log to the second controller;

the storage module is further configured to generate the uncombined file according to the data of the uncombined file.

14. A controller for data storage, the controller being applied to a cluster system, the cluster system including a main cluster and a standby cluster, the controller being a second controller, the second controller being a controller of the standby cluster, and a first controller being a controller of the main cluster, the controller comprising:

a processing module, configured to, when the number of files which are not merged in a first area in the standby cluster reaches a first threshold value, merge the files which are not merged to obtain a merged file and record a first serial number, wherein the first serial number is the serial number of first data in the merged file;

and a sending module, configured to send the merged file, the first sequence number, and first identification information to the first controller, where the first identification information is used to indicate that an area number of the merged file in the standby cluster is a first area number.

15. The controller of claim 14, further comprising:

a receiving module, configured to receive a first log sent by the first controller, where the first log includes information of a first area whose area number is the first area number in the primary cluster, and the information of the first area includes the non-merged file associated with the first area;

the processing module is further configured to play back the first log to obtain information of the first area;

the processing module is further configured to associate, according to the information of the first area, the first area whose area number in the standby cluster is the first area number with the unmerged file in the standby cluster.

16. The controller of claim 15,

the receiving module is further configured to receive a second log sent by the first controller, where the second log includes data of the file that is not merged;

the processing module is further configured to generate the uncombined file according to the data of the uncombined file.

17. A system comprising a controller as claimed in any one of claims 9 to 13 and a controller as claimed in any one of claims 14 to 16.