CN117033381A - Multi-copy data storage method, device, equipment and medium - Google Patents

Multi-copy data storage method, device, equipment and medium

Info

Publication number
CN117033381A
Authority
CN
China
Prior art keywords
data
node
nodes
storage
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310996285.8A
Other languages
Chinese (zh)
Inventor
杨琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Oceanbase Technology Co Ltd
Original Assignee
Beijing Oceanbase Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Oceanbase Technology Co Ltd
Priority to CN202310996285.8A
Publication of CN117033381A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2272 Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

One or more embodiments of the present specification provide a multi-copy data storage method, apparatus, device, and medium. The method comprises: obtaining a copy deployment mode preset in a target data table in a database; if the target data table has no copy deployment mode set, obtaining the copy deployment mode indicated by the configuration file of the database; setting the storage formats of a plurality of nodes according to the copy deployment mode; and storing the data of the target data table in the nodes according to each node's storage format. Placing row-store nodes and column-store nodes in the same database architecture meets the different storage-format requirements of different query workloads. Compared with placing row storage and column storage in separate databases, the database can be maintained with a single software system, and nodes with different storage formats do not depend on a data synchronization tool outside the database, which makes monitoring and maintenance easier.

Description

Multi-copy data storage method, device, equipment and medium
Technical Field
One or more embodiments of the present disclosure relate to the field of data processing technology, and in particular, to a method, an apparatus, a device, and a medium for storing multi-copy data.
Background
Distributed databases typically use a multi-copy mechanism: multiple copies of the data are stored on different nodes so that when one node fails, the other nodes can continue to provide service. The data storage formats of the nodes include a row-store format and a column-store format, which have different advantages in terms of data storage, query performance, and data analysis.
At present, row-store data and column-store data are usually stored in two different distributed database systems. Two software systems are therefore needed to maintain the databases with different storage formats, and synchronizing data between the two formats depends on a data synchronization tool outside the database systems.
Disclosure of Invention
In view of this, one or more embodiments of the present description provide a multi-copy data storage method, apparatus, device, and medium.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
According to a first aspect of one or more embodiments of the present specification, there is provided a multi-copy data storage method applied to a distributed database, the database comprising a plurality of nodes, the method comprising:
acquiring a copy deployment mode preset in a target data table in the database, wherein the copy deployment mode indicates storage format information of the nodes;
in a case where the target data table has no copy deployment mode set, acquiring the copy deployment mode indicated by the configuration file of the database;
setting the storage formats of the plurality of nodes according to the copy deployment mode, wherein the storage formats comprise a row-store format and a column-store format;
and storing the data in the target data table in the nodes according to the storage format of each node.
In some embodiments, the copy deployment mode includes a number of row-store copies and a number of column-store copies, and setting the storage formats of the plurality of nodes according to the copy deployment mode includes:
setting as many of the plurality of nodes as the number of row-store copies to the row-store format;
and setting as many of the plurality of nodes as the number of column-store copies to the column-store format.
In some embodiments, the plurality of nodes includes a master node and at least one replica node, and storing the data in the target data table in the nodes according to the storage format of each node includes:
storing the data in the master node so that the master node synchronizes the data to the replica node in the form of a log, wherein the log is in the row-store format, and the replica node stores the data in the received log according to its own storage format.
In some embodiments, where the database is based on a log-structured merge tree, the method further comprises:
the replica node writes the row data in the received log into a memory table in memory;
when the amount of data in the memory table reaches a set threshold: if the replica node is in the row-store format, the row data in the memory table is written to a sorted string table on disk; if the replica node is in the column-store format, the data in the memory table is split by column, and each column of data is stored in a separate data block in the sorted string table.
In some embodiments, the method further comprises:
in a case where all replica nodes in the database are in the column-store format, each replica node calculates a check value for each column of data and writes the check values into a check value table;
in a case where a replica node in the row-store format exists in the database, the replica node in the row-store format splits its row data by column, calculates a check value for each column of data, and writes the check values into a check value table.
In some embodiments, the method further comprises:
the master node acquires the check value table of each replica node;
and confirms the data consistency of the replica nodes according to the check value of each column of data in the check value tables.
In some embodiments, in the case of node replenishment, the method further comprises:
if the storage formats of the plurality of nodes are the same, replenishing the node by copying the data stored in a first node, wherein the first node is any one of the plurality of nodes;
if the storage formats of the first node and the node to be replenished are different, converting the format of the data stored in the first node and then storing it in the node to be replenished.
In some embodiments, converting the format of the data stored in the first node and storing it in the node to be replenished includes:
in a case where the first node is in the row-store format and the node to be replenished is in the column-store format, reading each row of data from the first node, splitting it by column, and storing each column in a separate data block in the node to be replenished;
in a case where the first node is in the column-store format and the node to be replenished is in the row-store format, reading each column of data from the first node, splicing the columns into rows in order, and storing the row data in the node to be replenished.
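The replenishment logic described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiments: the function names (`rows_to_columns`, `columns_to_rows`, `replenish`), the format labels `"row"` and `"column"`, and the in-memory representation (a list of tuples for row-store data, a dict of lists for column-store data) are all hypothetical.

```python
import copy
from typing import Any, Dict, List, Sequence, Tuple

Rows = List[Tuple[Any, ...]]      # hypothetical row-store representation
Columns = Dict[str, List[Any]]    # hypothetical column-store representation


def rows_to_columns(rows: Rows, column_names: Sequence[str]) -> Columns:
    """Split every row by column; each list stands in for one separate data block."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


def columns_to_rows(columns: Columns, column_names: Sequence[str]) -> Rows:
    """Splice the columns back into rows, following the given column order."""
    return list(zip(*(columns[name] for name in column_names)))


def replenish(source_format: str, source_data: Any,
              target_format: str, column_names: Sequence[str]) -> Any:
    """Return the data to store in the node to be replenished."""
    if source_format == target_format:
        # Same storage format: a plain copy of the first node's data is enough.
        return copy.deepcopy(source_data)
    if source_format == "row":
        return rows_to_columns(source_data, column_names)
    return columns_to_rows(source_data, column_names)


names = ("id", "city")
row_data = [(1, "Beijing"), (2, "Hangzhou")]
column_data = replenish("row", row_data, "column", names)
print(column_data)
print(replenish("column", column_data, "row", names))
```

Converting in both directions and comparing with the original data is a simple way to check that no row or column is lost during conversion.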
According to a second aspect of one or more embodiments of the present specification, there is provided a multi-copy data storage apparatus applied to a distributed database, the database comprising a plurality of nodes, the apparatus comprising:
a first acquisition unit, configured to acquire a copy deployment mode preset in a target data table in the database, wherein the copy deployment mode indicates storage format information of the nodes;
a second acquisition unit, configured to acquire the copy deployment mode indicated by the configuration file of the database in a case where the target data table has no copy deployment mode set;
a setting unit, configured to set the storage formats of the plurality of nodes according to the copy deployment mode, wherein the storage formats comprise a row-store format and a column-store format;
and a storage unit, configured to store the data in the target data table in the nodes according to the storage format of each node.
In some embodiments, the copy deployment mode includes a number of row-store copies and a number of column-store copies, and the setting unit is specifically configured to:
set as many of the plurality of nodes as the number of row-store copies to the row-store format;
and set as many of the plurality of nodes as the number of column-store copies to the column-store format.
In some embodiments, the plurality of nodes includes a master node and at least one replica node, and the storage unit is specifically configured to:
store the data in the master node so that the master node synchronizes the data to the replica node in the form of a log, wherein the log is in the row-store format, and the replica node stores the data in the received log according to its own storage format.
In some embodiments, in a case where the database is based on a log-structured merge tree, the apparatus further comprises a synchronization unit, configured such that:
the replica node writes the row data in the received log into a memory table in memory;
when the amount of data in the memory table reaches a set threshold: if the replica node is in the row-store format, the row data in the memory table is written to a sorted string table on disk; if the replica node is in the column-store format, the data in the memory table is split by column, and each column of data is stored in a separate data block in the sorted string table.
In some embodiments, the apparatus further comprises a check value writing unit, configured such that:
in a case where all replica nodes in the database are in the column-store format, each replica node calculates a check value for each column of data and writes the check values into a check value table;
in a case where a replica node in the row-store format exists in the database, the replica node in the row-store format splits its row data by column, calculates a check value for each column of data, and writes the check values into a check value table.
In some embodiments, the apparatus further comprises a verification unit, configured such that:
the master node acquires the check value table of each replica node;
and confirms the data consistency of the replica nodes according to the check value of each column of data in the check value tables.
In some embodiments, in the case of node replenishment, the apparatus further comprises a replenishment unit, configured to:
if the storage formats of the plurality of nodes are the same, replenish the node by copying the data stored in a first node, wherein the first node is any one of the plurality of nodes;
if the storage formats of the first node and the node to be replenished are different, convert the format of the data stored in the first node and then store it in the node to be replenished.
In some embodiments, when converting the format of the data stored in the first node and storing it in the node to be replenished, the replenishment unit is specifically configured to:
in a case where the first node is in the row-store format and the node to be replenished is in the column-store format, read each row of data from the first node, split it by column, and store each column in a separate data block in the node to be replenished;
in a case where the first node is in the column-store format and the node to be replenished is in the row-store format, read each column of data from the first node, splice the columns into rows in order, and store the row data in the node to be replenished.
According to a third aspect of one or more embodiments of the present specification, there is provided an electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the methods set forth in one or more embodiments of the present specification by executing the executable instructions.
According to a fourth aspect of one or more embodiments of the present specification, there is provided a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method set forth in one or more embodiments of the present specification.
In the embodiments of the present specification, a copy deployment mode preset in a target data table in a database is acquired; if the target data table has no copy deployment mode set, the copy deployment mode indicated by the configuration file of the database is acquired; the storage formats of a plurality of nodes are set according to the copy deployment mode; and the data of the target data table is stored in the nodes according to each node's storage format. Placing row-store nodes and column-store nodes in the same database architecture meets the different storage-format requirements of different query workloads. Compared with placing row storage and column storage in separate databases, the database can be maintained with a single software system, and nodes with different storage formats do not depend on a data synchronization tool outside the database, which makes monitoring and maintenance easier.
Drawings
FIG. 1 is a schematic illustration of an application environment for a multi-copy data storage method provided by an exemplary embodiment.
FIG. 2 is a flowchart of a method for multi-copy data storage provided by an exemplary embodiment.
FIG. 3A is a schematic diagram of the row-store format.
FIG. 3B is a schematic diagram of the column-store format.
FIG. 4 is a schematic diagram of storage format settings according to a replica deployment mode, as provided by an example embodiment.
FIG. 5 is a block diagram of a multi-copy storage device provided by an exemplary embodiment.
Fig. 6 is a schematic diagram of an apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of one or more embodiments of the present description as detailed in the accompanying claims.
It should be noted that: in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification. In some other embodiments, the method may include more or fewer steps than described in this specification. Furthermore, individual steps described in this specification, in other embodiments, may be described as being split into multiple steps; while various steps described in this specification may be combined into a single step in other embodiments.
To assist in understanding the present specification, the technical terms it uses are introduced first.
Distributed database: also referred to as a distributed cluster; a database that runs and stores data across multiple computers. The database distributes data across multiple nodes, which may be in the same geographic location or spread across multiple geographic locations. Storing data in this manner can improve its reliability, availability, and performance.
Copy mechanism: multiple copies of data are stored on different nodes so that in the event of a failure of one node, other nodes can continue to provide service.
LSM tree: the log-structured merge tree (Log-Structured Merge-Tree) is a hierarchical, ordered, disk-oriented data structure. In a database that uses an LSM tree structure as its storage engine, each data slice of each data table corresponds to one LSM tree structure. The LSM tree includes a memory table (Memtable) in memory and a sorted string table (Sorted String Table, SSTable) on disk. The data stored in the database is divided into baseline data and incremental data. The baseline data is stored on disk. When data is written, the incremental data is first cached in memory and is flushed to disk once it reaches a certain threshold; when the incremental data on disk reaches a certain threshold, it is merged with the baseline data. The core design idea of the LSM tree is to keep modifications to the data in memory and write them to disk in batches once a specified size limit is reached, which greatly improves write performance.
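The write path described for the LSM tree can be illustrated with a toy model. This is a sketch under stated assumptions rather than the storage engine of any real database: the class name `TinyLSM`, both limits, and the use of entry counts as the "threshold" are hypothetical choices made only for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


class TinyLSM:
    """Toy LSM tree: a Memtable in memory, incremental and baseline tables 'on disk'."""

    def __init__(self, memtable_limit: int = 2, incremental_limit: int = 2) -> None:
        self.memtable_limit = memtable_limit        # assumed threshold, counted in entries
        self.incremental_limit = incremental_limit  # assumed threshold, counted in tables
        self.memtable: Dict[str, Any] = {}
        self.incremental: List[List[Tuple[str, Any]]] = []  # oldest table first
        self.baseline: List[Tuple[str, Any]] = []            # sorted by key

    def put(self, key: str, value: Any) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Flush the in-memory increment to disk as one sorted table.
        self.incremental.append(sorted(self.memtable.items()))
        self.memtable = {}
        if len(self.incremental) >= self.incremental_limit:
            self._merge()

    def _merge(self) -> None:
        # Merge incremental data into the baseline; newer tables overwrite older values.
        merged = dict(self.baseline)
        for table in self.incremental:
            merged.update(table)
        self.baseline = sorted(merged.items())
        self.incremental = []

    def get(self, key: str) -> Optional[Any]:
        # Read the newest data first: Memtable, then incremental tables, then baseline.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.incremental):
            for k, v in table:
                if k == key:
                    return v
        for k, v in self.baseline:
            if k == key:
                return v
        return None


lsm = TinyLSM()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    lsm.put(key, value)
print(lsm.baseline, lsm.memtable)
```

With both limits set to 2, the first four writes produce two flushes and one merge, so the updated value of `a` ends up in the baseline while `d` is still held in memory.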
Distributed databases typically use a multi-copy mechanism, and the data storage formats of nodes in the database may include a row-store format and a column-store format, which have different advantages in terms of data storage, query performance, and data analysis.
Specifically, in the row-store format each data block contains all of the data fields of a row, so query operations can be processed row by row, which provides efficient transaction processing and low-latency query performance; the row-store format is commonly used in online transaction processing (OLTP) applications. In the column-store format each data block contains the data of only one column, so query operations can be processed column by column, which improves batch query performance and data analysis capability. Because data is stored by column, adjacent values are more similar, which improves the compression ratio and saves storage space; the column-store format is commonly used in large data warehouse and online analytical processing (OLAP) applications.
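The compression argument can be made concrete with a small example. The sketch below uses run-length encoding, which is an assumption chosen for illustration only (the specification does not name a compression scheme); the function name `run_length_encode` and the sample data are hypothetical.

```python
from itertools import groupby
from typing import Any, List, Tuple


def run_length_encode(values: List[Any]) -> List[Tuple[Any, int]]:
    """Collapse each run of equal adjacent values into a (value, run length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]


rows = [(1, "CN"), (2, "CN"), (3, "CN"), (4, "US")]

# Row-store order: the fields of each row are adjacent, so neighbours rarely repeat.
row_major = [field for row in rows for field in row]

# Column-store order: the values of one column are adjacent, so runs appear.
country_column = [row[1] for row in rows]

print(len(run_length_encode(row_major)))
print(run_length_encode(country_column))
```

In this sample the interleaved row-store sequence has no repeated neighbours and stays at eight runs, while the `country` column collapses to two runs.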
Currently, row-store data and column-store data are typically stored in two different distributed database systems.
For example, a hybrid storage scheme can be implemented by combining the row-store database Aurora with the column-store database Redshift and using a data migration service to synchronize data from Aurora into Redshift, so that the advantages of each are used. In this scheme, ordinary low-latency transactional data uses Aurora's row storage, while data for bulk analysis uses Redshift's column storage.
The hybrid storage scheme has the following problems:
(1) The row-store data and column-store data are stored in two different database systems, so two software systems must be maintained;
(2) Data in the two databases with different storage formats is synchronized by a data synchronization tool outside the database systems, which makes monitoring and maintenance difficult;
(3) Once the table structure of a data table in the row-store database Aurora changes, the data table can no longer be synchronized to the column-store database Redshift and must be re-created.
In view of this, one or more embodiments of the present disclosure provide a multi-copy data storage scheme in which the storage formats of a plurality of nodes are set according to the copy deployment mode indicated by a target data table in a database or, failing that, by the configuration file of the database, and the data of the target data table is stored according to the storage formats so set, so that row-store nodes and column-store nodes coexist in the same database architecture.
To better understand the multi-copy data storage method, apparatus, device, and medium provided in the embodiments of the present specification, an application environment suitable for the embodiments is described below. Referring to FIG. 1, FIG. 1 is a schematic diagram of an application environment of a multi-copy data storage method according to an embodiment of the present specification. In one implementation, the multi-copy data storage method is applied to a distributed database, for example OceanBase. The system architecture shown in FIG. 1 includes a plurality of database nodes, e.g., node 1, node 2, …, node N, each of which may be used to store a copy of the data.
The database nodes shown in FIG. 1 are only examples; the number of nodes in an actual application may be set according to actual requirements and is not specifically limited here. The client in the system architecture shown in FIG. 1 may run an upper-layer application corresponding to the distributed database, specifically the database management system software of the distributed database and applications that perform query, create, delete, update, and similar operations. In one example, the multi-copy data storage method may be performed by a master node in the distributed database, which may be any of the plurality of nodes in FIG. 1.
FIG. 2 is a flowchart of a multi-copy data storage method according to an exemplary embodiment of the present specification. The method includes steps 201 to 204.
In step 201, a copy deployment mode preset in a target data table in the database is acquired.
The target data table may be any database table in the database.
When the target data table is created, a copy deployment mode may be specified; it indicates storage format information for the plurality of nodes in the database. The storage format information may, for example, indicate the storage format of each node, that is, whether each node is in the row-store format or the column-store format. It may also indicate the number of row-store copies and the number of column-store copies among the plurality of nodes, where the number of row-store copies is the number of nodes in the row-store format and the number of column-store copies is the number of nodes in the column-store format.
When the target data table specifies a copy deployment mode, all partitions of the table use that copy deployment mode. In the embodiments of the present specification, the copy deployment mode specified in the target data table may be referred to as a table-level replica configuration.
In step 202, in a case where the target data table has no copy deployment mode set, the copy deployment mode indicated by the configuration file of the database is acquired.
If no copy deployment mode is specified when the target data table is created, the copy deployment mode indicated by the configuration file of the database is acquired. The copy deployment mode indicated by the configuration file of the database may be referred to as a cluster-level replica configuration.
In the embodiments of the present specification, a cluster-level replica configuration must be set when the cluster is created, so the cluster-level replica configuration can always be acquired when no table-level replica configuration is specified.
It can be seen that if a table-level replica configuration is specified when the target data table is created, the table-level replica configuration is utilized as a replica deployment mode, and all partitions of the table use the table-level replica configuration; if the table level replica configuration is not specified when the data table is created, the table defaults to using the cluster level replica configuration.
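The fallback between the two configuration levels in steps 201 and 202 can be sketched as below. The names `ReplicaDeployment` and `resolve_deployment` are hypothetical and introduced only for illustration, as is the representation of a copy deployment mode as a pair of copy counts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReplicaDeployment:
    """Hypothetical container for a copy deployment mode expressed as copy counts."""
    row_copies: int
    column_copies: int


def resolve_deployment(table_level: Optional[ReplicaDeployment],
                       cluster_level: ReplicaDeployment) -> ReplicaDeployment:
    """Prefer the table-level configuration; otherwise use the cluster-level one."""
    if table_level is not None:
        return table_level
    return cluster_level


# The cluster-level configuration always exists because it is set at cluster creation.
cluster_config = ReplicaDeployment(row_copies=3, column_copies=0)
table_config = ReplicaDeployment(row_copies=2, column_copies=1)

print(resolve_deployment(table_config, cluster_config))
print(resolve_deployment(None, cluster_config))
```

Making the cluster-level argument non-optional mirrors the requirement that a cluster-level replica configuration is always set, so the lookup cannot fail.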
In step 203, a storage format of the plurality of nodes is set according to the copy deployment mode, where the storage format includes a row storage format and a column storage format.
As can be seen from the descriptions of step 201 and step 202, when the target data table specifies a table-level replica configuration, the table-level replica configuration is used as the copy deployment mode to set whether each of the plurality of nodes is in the row-store format or the column-store format; when no table-level replica configuration is specified, the cluster-level replica configuration is used as the copy deployment mode to set the storage formats of the nodes.
Specifically, setting the storage format of a node to the row-store format means that, when storing data, the node stores the data of an entire row in one data block; as shown in FIG. 3A, each data block includes all data fields of the row. Setting the storage format of a node to the column-store format means that, when storing data, the node stores all values of each column as a separate data block; as shown in FIG. 3B, each data block contains the data of only one column.
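The two block layouts can be shown side by side on a small table. This is a simplified sketch: the schema `COLUMN_NAMES`, the sample rows, and the function names are hypothetical, and treating one Python list as one "data block" is an assumption made for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

COLUMN_NAMES = ("id", "name", "age")  # hypothetical schema
table: List[Tuple[Any, ...]] = [(1, "Ann", 30), (2, "Bo", 25), (3, "Cy", 41)]


def row_store_blocks(rows: List[Tuple[Any, ...]]) -> List[List[Any]]:
    """Row-store format: each block holds all data fields of one row."""
    return [list(row) for row in rows]


def column_store_blocks(rows: List[Tuple[Any, ...]],
                        column_names: Sequence[str]) -> Dict[str, List[Any]]:
    """Column-store format: each block holds all values of one column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


row_blocks = row_store_blocks(table)
column_blocks = column_store_blocks(table, COLUMN_NAMES)
print(row_blocks[0])
print(column_blocks["age"])
```

Fetching one complete record touches a single block in the first layout, while scanning one attribute across all records touches a single block in the second.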
In step 204, the data in the target data table is stored in the node according to the storage format of the node.
When the data in the target data table is written into a plurality of nodes in the database, the data can be written according to the storage format of each node set in step 203, so that the plurality of nodes in the database realize multi-copy data storage with different storage formats.
In the embodiments of the present specification, a copy deployment mode preset in a target data table in a database is acquired; if the target data table has no copy deployment mode set, the copy deployment mode indicated by the configuration file of the database is acquired; the storage formats of a plurality of nodes are set according to the copy deployment mode; and the data of the target data table is stored in the nodes according to each node's storage format. Placing row-store nodes and column-store nodes in the same database architecture meets the different storage-format requirements of different query workloads. Compared with placing row storage and column storage in separate databases, the database can be maintained with a single software system, and nodes with different storage formats do not depend on a data synchronization tool outside the database, which makes monitoring and maintenance easier.
When the copy deployment mode indicates the number of row-store copies and the number of column-store copies, as many of the nodes in the database as the number of row-store copies may be set to the row-store format, and as many as the number of column-store copies may be set to the column-store format. FIG. 4 is a schematic diagram of setting storage formats according to the copy deployment mode, as provided by an exemplary embodiment.
The configuration file of the distributed database cluster indicates a copy deployment mode with 3 row-store copies; as shown in FIG. 4, the cluster is configured with three row-store copies.
Data table A (hereinafter, table A) specifies a copy deployment mode at creation time: 2 row-store copies and 1 column-store copy. The nodes in the database therefore set their storage formats according to the table-level replica configuration specified by table A: two of the three nodes are set to the row-store format, and one node is set to the column-store format.
Data table B (hereinafter, table B) also specifies a copy deployment mode at creation time: 1 row-store copy and 2 column-store copies. The nodes in the database therefore set their storage formats according to the table-level replica configuration specified by table B: one of the three nodes is set to the row-store format, and two nodes are set to the column-store format.
Data table C (hereinafter, table C) does not specify a copy deployment mode at creation time, that is, it has no table-level replica configuration, so its copies are deployed using the cluster configuration by default. Specifically, the three nodes are set as the three row-store copies indicated by the cluster configuration.
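The three cases of FIG. 4 can be reproduced with a short sketch. Several details are assumptions not stated in the specification: the function name `assign_formats`, the node names, the requirement that the two counts add up to the number of nodes, and the choice to give the row-store format to the first nodes in the list (the specification does not say which nodes receive which format).

```python
from typing import Dict, List, Optional, Tuple


def assign_formats(nodes: List[str], row_copies: int, column_copies: int) -> Dict[str, str]:
    """Set row_copies nodes to the row-store format and column_copies nodes to column-store."""
    if row_copies + column_copies != len(nodes):
        # Assumption: every node holds exactly one copy of the table.
        raise ValueError("copy counts must add up to the number of nodes")
    return {node: ("row" if index < row_copies else "column")
            for index, node in enumerate(nodes)}


nodes = ["node1", "node2", "node3"]
cluster_default: Tuple[int, int] = (3, 0)  # cluster-level configuration: three row-store copies

# Table-level configurations as (row-store copies, column-store copies); None means unset.
table_modes: Dict[str, Optional[Tuple[int, int]]] = {"A": (2, 1), "B": (1, 2), "C": None}

layout = {name: assign_formats(nodes, *(mode or cluster_default))
          for name, mode in table_modes.items()}
for name, formats in layout.items():
    print(name, formats)
```

Because the format is resolved per table, the same three nodes can hold table A as two row-store copies and one column-store copy while holding table C as three row-store copies.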
In the embodiments of the present disclosure, the nodes in the row-store format and the nodes in the column-store format exist in the same database system, that is, all copies share one set of data table structure information. When the data table structure changes, all copies therefore complete the table structure change synchronously, without additional operations.
For a distributed database, data in a data table may be sent to all nodes synchronously, or the data may be sent to a master node among the plurality of nodes, and the master node may then synchronize the data to the other replica nodes (hereinafter also referred to as replicas). The master node may be any one of the plurality of nodes.
In some embodiments, after receiving the data in the target database and storing the data according to its set storage format, the master node may synchronize the data to the replica nodes in the form of a log, where the log is in the row-store format. After receiving the log, a replica node stores the data in the log according to its own set storage format.
Where the distributed database is based on an LSM tree (log-structured merge tree), the replica node writes the row data in the received log into a memory table (Memtable) in memory. When the amount of data in the Memtable reaches a set threshold: if the replica node is in the row-store format, the data in the Memtable is written into a sorted string table (SSTable) on disk, and the resulting SSTable is in the row-store format; if the replica node is in the column-store format, the data in the Memtable is split by column, each column of data is stored in a separate data block in the SSTable, and the resulting SSTable is in the column-store format.
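The flush step described above can be sketched as follows. This is a simplified in-memory model, not the storage engine of any real database: the class `Replica`, the dictionary used as a stand-in for an on-disk SSTable, the row-count threshold, and sorting by the first column are all assumptions made for illustration.

```python
from typing import List, Tuple

Row = Tuple  # one row of values, in schema column order


class Replica:
    """Hypothetical replica node that flushes its Memtable by storage format."""

    def __init__(self, storage_format: str, columns: List[str], flush_threshold: int):
        self.storage_format = storage_format  # "row" or "column"
        self.columns = columns
        self.flush_threshold = flush_threshold  # assumed to be a row count
        self.memtable: List[Row] = []
        self.sstables: List[dict] = []  # stand-in for SSTables on disk

    def apply_log(self, log_rows: List[Row]) -> None:
        # The log always carries row-format records, whatever the replica format.
        for row in log_rows:
            self.memtable.append(row)
            if len(self.memtable) >= self.flush_threshold:
                self.flush()

    def flush(self) -> None:
        # Assumption: the first column is the sort key of the SSTable.
        rows = sorted(self.memtable, key=lambda r: r[0])
        if self.storage_format == "row":
            sstable = {"format": "row", "rows": rows}
        else:
            # Split the rows by column; each column gets its own data block.
            blocks = {name: [r[i] for r in rows] for i, name in enumerate(self.columns)}
            sstable = {"format": "column", "blocks": blocks}
        self.sstables.append(sstable)
        self.memtable = []


log = [(2, "b"), (1, "a")]
row_replica = Replica("row", ["id", "name"], flush_threshold=2)
col_replica = Replica("column", ["id", "name"], flush_threshold=2)
row_replica.apply_log(log)
col_replica.apply_log(log)
```

Both replicas consume the same row-format log; only the layout produced at flush time differs.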
In this way, data synchronization among copies in different storage formats can be achieved.
In the related art, row-store data and column-store data are stored in two separate database systems, so a data consistency check between the row-store data and the column-store data cannot be performed. The embodiments of this specification provide a method for checking data consistency between copies in different storage formats within the same database system.
The data consistency check may be performed by the master node. Each replica node other than the master node first persists a data snapshot, then calculates check values and writes them into a check value table. The master node can obtain the check value information of all copies by polling the check value tables of the replica nodes, and can thereby confirm whether the data is consistent. If the data is inconsistent, an inconsistency error message can be written into the check value table so that the user can be notified.
Where all replica nodes in the database are in the same storage format, each replica node persists a data snapshot at the same point in time, after which the check value of the data can be calculated from the binary files of the data blocks. Because the data formats of the copies are the same, the check values of the data blocks can be compared directly to perform the data consistency check.
Specifically, where all replica nodes are in the column-store format, each replica node calculates a check value for each column of data and writes the check value of each column into the check value table. After the master node obtains the check value tables, it confirms the data consistency of the replica nodes by comparing the check values column by column.
Where a replica node in the row-store format exists in the database, that replica node also calculates check values by column. Specifically, the replica node in the row-store format splits its row data by column, calculates a check value for each column of data, and writes the check value of each column into the check value table. After the master node obtains the check value tables, it confirms the data consistency of the replica nodes by comparing the check values column by column.
In the embodiments of this specification, check values are calculated by column for both row-store data and column-store data, and the master node completes the data consistency check, thereby ensuring data security.
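The column-wise check can be sketched as below. The specification does not name a check value algorithm; CRC32 from Python's standard `zlib` module is used here purely as an assumption, as are the function names and the length-prefixed encoding of each value. The key point illustrated is that a row-store replica splits its rows by column before computing check values, so its results are comparable with those of a column-store replica.

```python
import zlib
from typing import Dict, List, Sequence, Tuple


def column_checksum(values: Sequence) -> int:
    """CRC32 over one column; CRC32 is an assumed choice of check value."""
    crc = 0
    for v in values:
        data = repr(v).encode("utf-8")
        # Length-prefix each value so ("ab", "c") differs from ("a", "bc").
        crc = zlib.crc32(len(data).to_bytes(4, "big") + data, crc)
    return crc


def checksums_from_rows(columns: List[str], rows: List[Tuple]) -> Dict[str, int]:
    # Row-store replica: split row data by column, then check each column.
    return {name: column_checksum([r[i] for r in rows]) for i, name in enumerate(columns)}


def checksums_from_columns(blocks: Dict[str, list]) -> Dict[str, int]:
    # Column-store replica: each column block is checked directly.
    return {name: column_checksum(vals) for name, vals in blocks.items()}


def verify(check_tables: Dict[str, Dict[str, int]]) -> List[str]:
    """Master-side comparison: return the columns whose check values differ."""
    tables = list(check_tables.values())
    reference = tables[0]
    return sorted(
        col for col in reference
        if any(t.get(col) != reference[col] for t in tables[1:])
    )


columns = ["id", "name"]
rows = [(1, "a"), (2, "b")]
blocks = {"id": [1, 2], "name": ["a", "b"]}
check_tables = {
    "row_replica": checksums_from_rows(columns, rows),
    "col_replica": checksums_from_columns(blocks),
}
mismatched = verify(check_tables)
```

An empty result from `verify` means the replicas agree on every column; a non-empty result names the columns for which an inconsistency would be reported.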
The table-level copy configuration of a data table may be changed after the table is created, and the affected copies can be adjusted by deleting copies or replenishing copies.
Specifically, if the number of row-store copies or the number of column-store copies decreases, the corresponding number of copies is deleted; if the number of row-store copies or the number of column-store copies increases, the corresponding number of copies is replenished.
For example, where the table-level copy configuration indicates that the number of row-store copies is M and the number of column-store copies is N (M and N being positive integers), if the number of row-store copies changes to (M+1), one row-store copy is added; if the number of column-store copies changes to (N-1), one column-store copy is deleted.
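The adjustment rule above amounts to comparing the old and new copy counts per storage format. A minimal sketch follows; the function name `plan_adjustment` and the tuple representation of a configuration are hypothetical.

```python
from typing import List, Tuple


def plan_adjustment(old: Tuple[int, int], new: Tuple[int, int]) -> List[Tuple[str, str, int]]:
    """Compare (row_copies, column_copies) before and after a change.

    Returns a list of (action, storage_format, count) steps.
    """
    plan = []
    for fmt, before, after in (("row", old[0], new[0]), ("column", old[1], new[1])):
        if after > before:
            plan.append(("replenish", fmt, after - before))
        elif after < before:
            plan.append(("delete", fmt, before - after))
    return plan


# M = 2, N = 2: row-store copies change to M + 1, column-store copies to N - 1.
add_row = plan_adjustment((2, 2), (3, 2))
drop_column = plan_adjustment((2, 2), (2, 1))
```

With these inputs, `add_row` contains a single step replenishing one row-store copy, and `drop_column` contains a single step deleting one column-store copy.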
When a copy is replenished, if the storage formats of the plurality of copies are the same, the node can be replenished by copying the data stored in a first node, where the first node is any one of the plurality of nodes; if the storage formats of the copies differ, format conversion of the copy data is required. Specifically, where the storage format of the first node used for replenishment differs from the storage format of the node to be replenished, the data stored in the first node is format-converted and then stored in the node to be replenished.
When the first node is in the row-store format and the node to be replenished is in the column-store format, the column-store copy can be replenished by reading out each row of data in the first node and storing the row data, split by column, in separate data blocks in the node to be replenished.
When the first node is in the column-store format and the node to be replenished is in the row-store format, the row-store copy can be replenished by reading out each column of data in the first node, splicing the column data into rows in order, and storing the row data in the node to be replenished.
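The two conversion directions can be sketched as a pair of inverse functions. This is an in-memory illustration under the assumption that a row-store copy is a list of tuples and a column-store copy is a mapping from column name to a list of values; the function names are hypothetical.

```python
import copy
from typing import Dict, List, Tuple


def rows_to_columns(columns: List[str], rows: List[Tuple]) -> Dict[str, list]:
    # Row-store source, column-store target: split each row by column
    # and collect every column into its own block.
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def columns_to_rows(columns: List[str], blocks: Dict[str, list]) -> List[Tuple]:
    # Column-store source, row-store target: splice the column values
    # back into rows, in schema column order.
    return list(zip(*(blocks[name] for name in columns)))


def replenish(source_format: str, target_format: str, columns: List[str], data):
    """Produce the data for the node to be replenished from the first node's data."""
    if source_format == target_format:
        return copy.deepcopy(data)  # same format: a plain copy suffices
    if source_format == "row":
        return rows_to_columns(columns, data)
    return columns_to_rows(columns, data)


columns = ["id", "name"]
row_data = [(1, "a"), (2, "b")]
column_copy = replenish("row", "column", columns, row_data)
row_copy = replenish("column", "row", columns, column_copy)
```

Converting to the column-store layout and back yields the original rows, which is the property a replenished copy needs in order to pass the column-wise consistency check.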
Referring to Fig. 5, Fig. 5 shows a multi-copy data storage apparatus provided in an exemplary embodiment, applied to a distributed database, the database including a plurality of nodes, the apparatus comprising:
a first obtaining unit 501, configured to obtain a copy deployment mode preset in a target data table in the database, where the copy deployment mode indicates storage format information of a node;
a second obtaining unit 502, configured to obtain, when no copy deployment mode is set for the target data table, the copy deployment mode indicated by a configuration file of the database;
a setting unit 503, configured to set storage formats of the plurality of nodes according to the copy deployment mode, where the storage formats include a row storage format and a column storage format;
a storage unit 504, configured to store the data in the target data table in the node according to a storage format of the node.
In some embodiments, the copy deployment mode includes a number of row-store copies and a number of column-store copies, and the setting unit is specifically configured to:
set, among the plurality of nodes, a number of nodes equal to the number of row-store copies to the row-store format;
and set, among the plurality of nodes, a number of nodes equal to the number of column-store copies to the column-store format.
In some embodiments, the plurality of nodes includes a master node and at least one replica node, and the storage unit is specifically configured to:
store the data in the master node, so that the master node synchronizes the data to the replica node in the form of a log, where the log is in the row-store format, and the replica node is configured to store the data in the received log according to its storage format.
In some embodiments, where the database is based on a log-structured merge tree, the apparatus further comprises a synchronization unit configured such that:
the replica node writes the row data in the received log into a memory table in memory;
when the amount of data in the memory table reaches a set threshold, if the replica node is in the row-store format, the row data in the memory table is written into a sorted string table on disk; and if the replica node is in the column-store format, the data in the memory table is split by column, and each column of data is stored in a separate data block in the sorted string table.
In some embodiments, the apparatus further comprises a check value writing unit configured such that:
where all replica nodes in the database are in the column-store format, each replica node calculates a check value for each column of data and writes the check value of each column of data into a check value table;
and where a replica node in the row-store format exists in the database, the replica node in the row-store format splits its row data by column, calculates a check value for each column of data, and writes the check value of each column of data into a check value table.
In some embodiments, the apparatus further comprises a verification unit configured such that:
the master node obtains the check value table of each replica node;
and the data consistency of the replica nodes is confirmed according to the check value of each column of data in the check value tables.
In some embodiments, in the case of node replenishment, the apparatus further comprises a replenishment unit configured to:
if the storage formats of the plurality of nodes are the same, replenish the node by copying the data stored in a first node, where the first node is any one of the plurality of nodes;
if the storage formats of the first node and the node to be replenished are different, format-convert the data stored in the first node and then store it in the node to be replenished.
In some embodiments, when format-converting the data stored in the first node and storing it in the node to be replenished, the replenishment unit is specifically configured to:
where the first node is in the row-store format and the node to be replenished is in the column-store format, read out each row of data in the first node and store the row data, split by column, in separate data blocks in the node to be replenished;
and where the first node is in the column-store format and the node to be replenished is in the row-store format, read out each column of data in the first node, splice the column data into rows in order, and store the row data in the node to be replenished.
Fig. 6 is a schematic block diagram of a device provided in an exemplary embodiment. Referring to Fig. 6, at the hardware level the device includes a processor 602, an internal bus 604, a network interface 606, a memory 608, and a non-volatile memory 610, and may of course also include hardware required for other services. One or more embodiments of this specification may be implemented in software, for example by the processor 602 reading the corresponding computer program from the non-volatile memory 610 into the memory 608 and then running it. Of course, in addition to a software implementation, one or more embodiments of this specification do not exclude other implementations, such as a logic device or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or a logic device.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may be in the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory among computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The terminology used in the one or more embodiments of the specification is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the specification. As used in this specification, one or more embodiments and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments of this specification. Depending on the context, the word "if" as used herein may be interpreted as "when ..." or "upon ..." or "in response to determining".
The foregoing description of the preferred embodiment(s) is (are) merely intended to illustrate the embodiment(s) of the present invention, and it is not intended to limit the embodiment(s) of the present invention to the particular embodiment(s) described.

Claims (11)

1. A multi-copy data storage method applied to a distributed database, the database comprising a plurality of nodes, the method comprising:
acquiring a copy deployment mode preset for a target data table in the database, wherein the copy deployment mode indicates storage format information of the nodes;
where no copy deployment mode is set for the target data table, acquiring the copy deployment mode indicated by a configuration file of the database;
setting storage formats of the plurality of nodes according to the copy deployment mode, wherein the storage formats comprise a row-store format and a column-store format;
and storing the data in the target data table in each node according to the storage format of that node.
2. The method of claim 1, the copy deployment mode comprising a number of row-store copies and a number of column-store copies, the setting storage formats of the plurality of nodes according to the copy deployment mode comprising:
setting, among the plurality of nodes, a number of nodes equal to the number of row-store copies to the row-store format;
and setting, among the plurality of nodes, a number of nodes equal to the number of column-store copies to the column-store format.
3. The method of claim 1, the plurality of nodes including a master node and at least one replica node, the storing the data in the target data table in each node according to the storage format of that node comprising:
storing the data in the master node, so that the master node synchronizes the data to the replica node in the form of a log, wherein the log is in the row-store format, and the replica node is configured to store the data in the received log according to its storage format.
4. The method of claim 3, where the database is based on a log-structured merge tree, the method further comprising:
writing, by the replica node, the row data in the received log into a memory table in memory;
when the amount of data in the memory table reaches a set threshold, if the replica node is in the row-store format, writing the row data in the memory table into a sorted string table on disk; and if the replica node is in the column-store format, splitting the data in the memory table by column, and storing each column of data in a separate data block in the sorted string table.
5. The method of claim 3, the method further comprising:
where all replica nodes in the database are in the column-store format, calculating, by each replica node, a check value for each column of data, and writing the check value of each column of data into a check value table;
and where a replica node in the row-store format exists in the database, splitting, by the replica node in the row-store format, its row data by column, calculating a check value for each column of data, and writing the check value of each column of data into a check value table.
6. The method of claim 5, the method further comprising:
acquiring, by the master node, the check value table of each replica node;
and confirming the data consistency of the replica nodes according to the check value of each column of data in the check value tables.
7. The method of claim 1, in the case of node replenishment, further comprising:
if the storage formats of the plurality of nodes are the same, replenishing the node by copying the data stored in a first node, wherein the first node is any one of the plurality of nodes;
if the storage formats of the first node and the node to be replenished are different, format-converting the data stored in the first node and then storing it in the node to be replenished.
8. The method of claim 7, wherein the format-converting the data stored in the first node and then storing it in the node to be replenished comprises:
where the first node is in the row-store format and the node to be replenished is in the column-store format, reading out each row of data in the first node and storing the row data, split by column, in separate data blocks in the node to be replenished;
and where the first node is in the column-store format and the node to be replenished is in the row-store format, reading out each column of data in the first node, splicing the column data into rows in order, and storing the row data in the node to be replenished.
9. A multi-copy data storage device applied to a distributed database, the database comprising a plurality of nodes, the device comprising:
a first acquisition unit, configured to acquire a copy deployment mode preset for a target data table in the database, wherein the copy deployment mode indicates storage format information of the nodes;
a second acquisition unit, configured to acquire, where no copy deployment mode is set for the target data table, the copy deployment mode indicated by a configuration file of the database;
a setting unit, configured to set storage formats of the plurality of nodes according to the copy deployment mode, wherein the storage formats comprise a row-store format and a column-store format;
and a storage unit, configured to store the data in the target data table in each node according to the storage format of that node.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any of claims 1-8 by executing the executable instructions.
11. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the steps of the method of any of claims 1-8.
CN202310996285.8A 2023-08-08 2023-08-08 Multi-copy data storage method, device, equipment and medium Pending CN117033381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310996285.8A CN117033381A (en) 2023-08-08 2023-08-08 Multi-copy data storage method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310996285.8A CN117033381A (en) 2023-08-08 2023-08-08 Multi-copy data storage method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN117033381A true CN117033381A (en) 2023-11-10

Family

ID=88638545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310996285.8A Pending CN117033381A (en) 2023-08-08 2023-08-08 Multi-copy data storage method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN117033381A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination