WO2023197404A1 - Object storage method and apparatus based on a distributed database - Google Patents

Object storage method and apparatus based on a distributed database

Info

Publication number
WO2023197404A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
disk
memory
internal request
distributed database
Prior art date
Application number
PCT/CN2022/094380
Other languages
English (en)
Chinese (zh)
Inventor
刘森
蔡攀龙
Original Assignee
上海川源信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海川源信息科技有限公司
Publication of WO2023197404A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/176 Support for shared access to files; File sharing support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Definitions

  • the present application relates to the technical fields of distributed databases and object-oriented storage, and in particular to an object storage method and device based on a distributed database.
  • Object storage, also known as object-oriented storage, is a storage technology suited to unstructured data. At present, it is often the best solution for reading and writing massive numbers of small files. Object storage uses an append-write mode to aggregate small-file writes, thereby greatly improving read/write IOPS (Input/Output Operations Per Second, the number of read and write operations per second) and bandwidth.
  • IOPS (Input/Output Operations Per Second): the number of read and write operations per second.
  • an index records the location and size of a specific small file within the aggregate file, that is, the mapping relationship between the small file and the aggregate file.
  • although object storage technology solved the problem of reading and writing massive numbers of small files, it also introduced a large amount of new small data (namely, the numerous index entries), and with it a new read/write problem. In other words, the sheer volume of index data has become a new bottleneck restricting read and write performance.
  • one solution is to use an in-memory database such as Redis, relying on the fast read/write characteristics of memory to solve the performance problem of reading and writing index data.
  • however, memory has the disadvantage of non-persistence, which brings a series of problems, and the cost of memory is also high.
  • Another approach in the existing technology is to use a disk-based database such as MySQL, which guarantees the durability of the index itself at the cost of read and write performance; in addition, a large part of the disk's IOPS is consumed in processing the index.
  • This application provides an object storage method and device based on a distributed database, which achieves high-concurrency read and write performance while maintaining the persistence of the entire system, thereby truly solving the problem of reading and writing massive numbers of small files.
  • an object storage method based on a distributed database is provided.
  • the method is used in a distributed database system.
  • the distributed database system includes multiple nodes, and the nodes share a disk and memory; the method includes:
  • when the current node receives an internal request for index writing, the behavior log of the application programming interface (API) of the current node is written in the disk and the memory of the current node, and then the index corresponding to the internal request for index writing is written into a first-in-first-out FIFO queue in the memory of the current node; all indexes in the queue are written to the disk periodically or when the queue is full;
  • when the current node receives an internal request for index reading, the index corresponding to the internal request for index reading is read from the disk and returned.
  • writing the behavior log of the application program interface API of the current node in the disk may include:
  • the target log file is a log file on the disk corresponding to the current node, and the file names of log files corresponding to different nodes are mutually exclusive;
  • the behavior log is written to the target log file in an append-write manner.
  • the method further includes:
  • a timestamp is used to determine whether data is newer or older, so as to ensure that non-latest data does not overwrite the latest data.
  • the method also includes:
  • Time consistency verification is performed on each node every preset period.
  • the method further includes:
  • the duplicate index is deleted in the memory.
  • the method also includes:
  • the converted internal request is sent to the selected node.
  • the load balancing strategy includes:
  • a hash value is first calculated based on the file name of the file, and then a node is selected based on the hash value to achieve load balancing among nodes.
  • each node of the distributed database system shares the disk and memory through the Gluster file system.
  • an object storage device based on a distributed database is provided.
  • the device is used in a distributed database system.
  • the distributed database system includes multiple nodes, and the nodes share a disk and memory;
  • the device includes:
  • a logging unit configured to write the behavior log of the application program interface API of the current node in the disk and the memory of the current node when the current node receives an internal request for index writing, and then trigger the first index writing unit;
  • the first index writing unit is used to write the index corresponding to the internal request for index writing into the first-in-first-out FIFO queue in the memory of the current node, where the index includes keywords, and the keywords are the file names of the files operated by the user;
  • a second index writing unit configured to write all indexes in the queue to the disk periodically or when the queue is full;
  • the index reading unit is used to read the index corresponding to the internal request for index reading from the disk and return it when the current node receives an internal request for index reading.
  • when the logging unit is used to write the behavior log of the application program interface API of the current node in the disk, it is specifically configured to:
  • the target log file is a log file on the disk corresponding to the current node, and the file names of log files corresponding to different nodes are mutually exclusive;
  • the behavior log is written to the target log file in an append-write manner.
  • the second index writing unit is also used for:
  • a timestamp is used to determine whether data is newer or older, so as to ensure that non-latest data does not overwrite the latest data.
  • the device also includes:
  • the time consistency check unit is used to check the time consistency of each node every preset period.
  • the index reading unit is also used for:
  • the duplicate index is deleted in the memory.
  • the device also includes:
  • An internal request generation unit used to obtain the user's operation on the file, and convert the operation into an internal request of the distributed database system, wherein the internal request is divided into an internal request for index writing and an internal request for index reading;
  • a task allocation unit is used to select a node in the distributed database system according to a preset load balancing policy, and send the converted internal request to the selected node.
  • the load balancing strategy includes:
  • a hash value is first calculated based on the file name of the file, and then a node is selected based on the hash value to achieve load balancing among nodes.
  • disks and memory are shared between nodes of the distributed database system through the Gluster file system.
  • the solution of this application improves on a traditional disk-based database by introducing a memory index mechanism: index data is not written directly into the disk-based database, but is first written into memory, which is used as a buffer pool, and is then written to the disk at an appropriate time.
  • at the same time, behavior logs are recorded on the disk and in memory before the operation, for data recovery in the event of a failure.
  • Figure 1 is a schematic flow chart of an object storage method based on a distributed database provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the work flow of nodes in the embodiment of this application.
  • Figure 3 is another schematic flow chart of an object storage method based on a distributed database provided by an embodiment of the present application
  • Figure 4 is a schematic diagram of an object storage device based on a distributed database provided by an embodiment of the present application.
  • Figure 1 is a schematic flow chart of an object storage method based on a distributed database provided by an embodiment of the present application.
  • the method can be used in a distributed database system, which can include multiple nodes, and disks and memory are shared between each node.
  • the nodes of the distributed database system may share disks and memories through the Gluster file system.
  • GlusterFS (Gluster File System, a cluster file system)
  • the method can be applied to any node in the distributed database system. As shown in Figure 1, the method may include the following steps:
  • step S101: when an internal request for index writing is received, the behavior log (log) of the application program interface API of the current node is written in the disk and in the memory of the current node, and then the index corresponding to the internal request for index writing is written into a first-in-first-out FIFO queue in the memory of the current node, where the index includes a keyword, and the keyword is the file name of the file operated by the user.
  • a mapping relationship is formed between small files and large files (that is, aggregate files).
  • This mapping relationship is an index.
  • the index can include a keyword (key), and the keyword is the file name of the file operated by the user (that is, the small file).
  • the index can also include the location of the small file in the aggregated file, the size of the small file, etc.
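  • As a minimal illustrative sketch of what one such index record could hold (the class and field names below are hypothetical, not taken from this application):

```python
from dataclasses import dataclass

@dataclass
class IndexEntry:
    key: str          # keyword: file name of the small file operated by the user
    offset: int       # location of the small file inside the aggregate file
    size: int         # size of the small file
    timestamp: float  # write time, used later to tell newer data from older data
```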
  • Disks and memory are shared among the nodes (for example, through the Gluster file system), which ensures the synchronization of data between different nodes, thereby ensuring data persistence when a node fails.
  • the API behavior log will be recorded first for data recovery in case of failure.
  • writing the behavior log of the application program interface API of the current node in the disk may specifically include:
  • the target log file is a log file on the disk corresponding to the current node, and the file names of log files corresponding to different nodes are mutually exclusive;
  • the behavior log is written to the target log file in the form of write-write.
  • the current node will directly write the API behavior log to the shared disk to ensure data recovery in the event of node failure.
  • the log file can be given a distinctive name, for example by using the node number as a suffix or by appending a random hash to the node number, to ensure mutual exclusion between nodes, as in the sketch below.
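  • A minimal sketch of such mutually exclusive naming and append-only logging, assuming a shared mount path and a JSON-lines record layout (the function names and record format are illustrative assumptions, not prescribed by this application):

```python
import json
import os
import time
import uuid

def node_log_path(shared_mount: str, node_id: int) -> str:
    """Build a log file name that only this node will use: the node number is kept
    as a suffix and a random hash is appended, as suggested above. Computed once
    at node start-up and reused for all subsequent writes."""
    return os.path.join(shared_mount, f"api_behavior_node{node_id}_{uuid.uuid4().hex}.log")

def append_behavior_log(log_path: str, api_name: str, payload: dict) -> None:
    """Append one API behavior record to the node's log file on the shared disk."""
    record = {"ts": time.time(), "api": api_name, "payload": payload}
    with open(log_path, "a", encoding="utf-8") as f:   # append-write, never rewrite
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # push the record down to the shared disk
```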
  • in addition, the shared memory mechanism can be used to write the API behavior log into memory, so that when a node fails unexpectedly, this part of the memory data can be accessed by other nodes.
  • a ring structure can be used to ensure primary and backup memory data between multiple nodes.
  • node 2 has the backup data of node 1
  • node 3 has the backup data of node 2, and so on.
  • although there is also a behavior log on the disk, recovery efficiency is improved if the shared memory data of other nodes is available.
  • the two copies of data on disk and in memory can mutually prove the credibility of the backup data.
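  • The ring layout described above can be captured in a few lines; the 0-based node numbering below is an assumption made only for illustration:

```python
def backup_holder(node_index: int, node_count: int) -> int:
    """Return the node that holds the backup of `node_index`'s in-memory data.
    With a ring, node i+1 backs up node i (node 2 backs up node 1, node 3 backs
    up node 2, and so on), wrapping around at the end."""
    return (node_index + 1) % node_count

# With three nodes 0, 1, 2: backups of 0, 1, 2 live on nodes 1, 2, 0 respectively.
assert [backup_holder(i, 3) for i in range(3)] == [1, 2, 0]
```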
  • step S102 all indexes in the queue are written to the disk periodically or when the queue is full.
  • FIG. 2 is a schematic diagram of the work flow of nodes in the embodiment of the present application.
  • Each node maintains a FIFO in memory.
  • the queue is equipped with a mechanism that flushes to the disk when the queue is full and flushes to the disk periodically. Outside these two occasions (periodic and full), index data is not written directly to the disk database, so the write rate is greatly improved; a minimal sketch of this buffering follows below.
  • data persistence can be ensured, thereby supporting the stability of the entire data system.
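  • A minimal sketch of this buffering behaviour, reusing the hypothetical IndexEntry above; the class name, the capacity/interval parameters and the flush_to_disk callback are illustrative assumptions rather than the claimed implementation:

```python
import threading
import time
from collections import deque

class IndexBuffer:
    """In-memory FIFO queue used as a buffer pool for index data (IndexEntry objects)."""

    def __init__(self, capacity, flush_interval, flush_to_disk):
        self._queue = deque()
        self._capacity = capacity
        self._flush_to_disk = flush_to_disk   # callable that persists a batch of entries
        self._lock = threading.Lock()
        # periodic flush: a background thread flushes the queue every flush_interval seconds
        self._timer = threading.Thread(
            target=self._periodic_flush, args=(flush_interval,), daemon=True)
        self._timer.start()

    def put(self, entry) -> None:
        with self._lock:
            self._queue.append(entry)
            full = len(self._queue) >= self._capacity
        if full:                              # flush when the queue is full
            self.flush()

    def flush(self) -> None:
        with self._lock:
            batch, self._queue = list(self._queue), deque()
        if batch:
            self._flush_to_disk(batch)        # only here does index data reach the disk

    def _periodic_flush(self, interval: float) -> None:
        while True:
            time.sleep(interval)
            self.flush()
```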
  • here, contention (competition) conflicts are key conflicts.
  • Key (keyword in the index) is the file name of the small file operated by the user.
  • when one node is preparing to write an index for a key, another node may also be about to write an index for the same key, for example when a user performs multiple operations and each operation is distributed to a different node by the load balancing mechanism.
  • this kind of contention can be resolved through the operation-time attribute of the database: by adding timestamps and using conditional SQL statements when flushing, it can be ensured that non-latest data does not overwrite the latest data and is instead discarded correctly, as in the sketch below.
  • the method may also include:
  • a timestamp is used to determine whether data is newer or older, so as to ensure that non-latest data does not overwrite the latest data.
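  • A sketch of such a timestamp-guarded flush, assuming a MySQL-style table accessed through a DB-API connection; the table name small_file_index, its columns, and the use of ON DUPLICATE KEY UPDATE are illustrative assumptions, not the wording of this application:

```python
# Conditional upsert: a row is only replaced when the incoming timestamp is newer,
# so non-latest data can never overwrite the latest data and is simply discarded.
UPSERT_SQL = """
INSERT INTO small_file_index (file_key, offset, size, ts)
VALUES (%s, %s, %s, %s)
ON DUPLICATE KEY UPDATE
    offset = IF(VALUES(ts) > ts, VALUES(offset), offset),
    size   = IF(VALUES(ts) > ts, VALUES(size),   size),
    ts     = IF(VALUES(ts) > ts, VALUES(ts),     ts)
"""

def flush_to_disk(conn, batch) -> None:
    """Persist a batch of IndexEntry objects to the shared disk-based database."""
    cur = conn.cursor()
    try:
        cur.executemany(
            UPSERT_SQL,
            [(e.key, e.offset, e.size, e.timestamp) for e in batch])
        conn.commit()
    finally:
        cur.close()
```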
  • the method may also include:
  • Time consistency verification is performed on each node every preset period.
  • step S103 when an internal request for index reading is received, the index corresponding to the internal request for index reading is read from the disk and returned.
  • in the process of reading, from the disk, the index corresponding to the internal request for index reading, the method may also include:
  • the duplicate index is deleted in the memory.
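  • One possible reading of this step, sketched under the assumption that an entry for the same key may still remain in the in-memory queue after being persisted and should be dropped once the disk copy is read (the table and column names reuse the earlier hypothetical sketch, and the queue is assumed here to be a plain list of IndexEntry):

```python
def read_index(conn, queue, queue_lock, file_key):
    """Read the index for `file_key` from the shared disk-based database and,
    along the way, delete any duplicate entry for the same key that remains
    in the in-memory queue."""
    cur = conn.cursor()
    try:
        cur.execute(
            "SELECT offset, size, ts FROM small_file_index WHERE file_key = %s",
            (file_key,))
        row = cur.fetchone()
    finally:
        cur.close()
    with queue_lock:
        queue[:] = [e for e in queue if e.key != file_key]  # drop the duplicate index
    return row
```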
  • the method may also include:
  • step S301 the user's operation on the file is obtained.
  • the files here are small files. A user's operation on a small file will be converted into an internal request of the distributed database system. For example, if the user modifies a small file, the index of the small file will usually change accordingly, and the system will generate an index write request.
  • step S302 the operation is converted into an internal request of the distributed database system, where the internal request is divided into an internal request for index writing and an internal request for index reading.
  • step S303 a node is selected in the distributed database system according to a preset load balancing strategy.
  • This embodiment does not limit the specific load balancing strategy. Those skilled in the art can choose and design one according to different needs and scenarios, and such choices and designs do not depart from the spirit and protection scope of this application.
  • the load balancing strategy may specifically include:
  • a hash value is first calculated based on the file name of the file, and then a node is selected based on the hash value to achieve load balancing among nodes.
  • step S304 the converted internal request is sent to the selected node.
  • the selected node is also the current node in step S101.
  • requests can be distributed evenly among the nodes.
  • the hash value of the file name in the request is calculated, and the request is distributed to a node based on that hash value. Allocation by hashing may weaken load balancing to a certain extent, but it reduces data contention in the background and implicitly improves concurrency performance.
  • in practice, the distribution of load may differ. For example, when multiple files operated by a user have similar file names, they may all be hashed and assigned to the same node, causing the load to become unbalanced; in this case, the load balancing rules can be adjusted according to the situation.
  • for example, the file name can be hashed to a number from 1 to 15 (a sketch of this range-based dispatch follows the list):
  • 1 to 5 are assigned to node A
  • 6 to 10 are assigned to node B
  • 11 to 15 are assigned to node C to achieve balance.
  • you may find that the numbers that appear are all from 1 to 10.
  • you can adjust the strategy for example, assign 1 to 3 to node A, 4 to 7 to node B, and 8 to 10 to node C.
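  • A sketch of this range-based dispatch; the md5 hash, the bucket count of 15, and the boundary lists are taken from the example above and are illustrative only:

```python
import hashlib
from bisect import bisect_left

def dispatch(file_name: str, boundaries, nodes):
    """Pick a node from the hash of the file name.

    `boundaries` holds the inclusive upper bound of each node's bucket range:
    boundaries=[5, 10, 15] with nodes=["A", "B", "C"] reproduces the
    1-5 / 6-10 / 11-15 split above, and boundaries=[3, 7, 10] the adjusted
    1-3 / 4-7 / 8-10 split.
    """
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % boundaries[-1] + 1  # 1..max bucket
    return nodes[bisect_left(boundaries, bucket)]

# The same file name always maps to the same node, so all index writes for one
# key stay on a single node, reducing key contention between nodes.
print(dispatch("photo_0001.jpg", [5, 10, 15], ["A", "B", "C"]))
```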
  • in summary, this embodiment improves on the traditional disk-based database by introducing a memory index mechanism: index data is not written directly into the disk-based database, but is first written to memory, which is used as a buffer pool, and is then written to the disk at an appropriate time. At the same time, behavior logs are recorded on the disk and in memory before the operation, for data recovery in case of failure.
  • Figure 4 is a schematic diagram of an object storage device based on a distributed database provided by an embodiment of the present application.
  • the device is used in a distributed database system, which includes multiple nodes, and disks and memories are shared between each node.
  • disks and memory can be shared between nodes of the distributed database system through the Gluster file system.
  • the device may include:
  • the logging unit 401 is configured to write the behavior log of the application program interface API of the current node in the disk and the memory of the current node when the current node receives an internal request for index writing, and then trigger the first index writing unit;
  • the first index writing unit 402 is used to write the index corresponding to the internal request for index writing into the first-in-first-out FIFO queue in the memory of the current node, where the index includes keywords, and the keywords are the file names of the files operated by the user;
  • the second index writing unit 403 is used to write all indexes in the queue to the disk regularly or when the queue is full;
  • the index reading unit 404 is configured to read the index corresponding to the internal request for index reading from the disk and return it when the current node receives an internal request for index reading.
  • when the logging unit is used to write the behavior log of the application programming interface API of the current node in the disk, it can be specifically configured to:
  • the target log file is a log file on the disk corresponding to the current node, and the file names of log files corresponding to different nodes are mutually exclusive;
  • the behavior log is written to the target log file in an append-write manner.
  • the second index writing unit can also be used for:
  • a timestamp is used to determine whether data is newer or older, so as to ensure that non-latest data does not overwrite the latest data.
  • the device may further include:
  • the time consistency check unit is used to check the time consistency of each node every preset period.
  • the index reading unit can also be used for:
  • the duplicate index is deleted in the memory.
  • the device may further include:
  • An internal request generation unit used to obtain the user's operation on the file, and convert the operation into an internal request of the distributed database system, wherein the internal request is divided into an internal request for index writing and an internal request for index reading;
  • a task allocation unit is used to select a node in the distributed database system according to a preset load balancing policy, and send the converted internal request to the selected node.
  • the load balancing strategy may specifically include:
  • a hash value is first calculated based on the file name of the file, and then a node is selected based on the hash value to achieve load balancing among nodes.
  • in summary, this embodiment improves on the traditional disk-based database by introducing a memory index mechanism: index data is not written directly into the disk-based database, but is first written to memory, which is used as a buffer pool, and is then written to the disk at an appropriate time. At the same time, behavior logs are recorded on the disk and in memory before the operation, for data recovery in case of failure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An object storage method and apparatus based on a distributed database, the method and apparatus being used in a distributed database system in which a disk and a memory are shared between nodes. The method comprises: when the current node receives an internal request for index writing, writing a behavior log of an API of the current node to a disk and a memory of the current node, and then writing an index corresponding to the index write into a FIFO queue in the memory of the current node (S101); writing all the indexes in the queue to the disk either periodically or when the queue is full (S102); and when an internal request for index reading is received, reading, from the disk, an index corresponding to the index read and returning the index (S103). Indexes are first written into a memory serving as a buffer pool, the indexes are then written to a disk at an appropriate time, and a behavior log is recorded in advance. In this way, the memory and the disk are both used and complement each other, so that the write rate can be increased and high-concurrency read and write characteristics can be met. In addition, by means of a memory and disk sharing mechanism and the recording of a behavior log both in memory and on disk, high availability of the index data is ensured.
PCT/CN2022/094380 2022-04-14 2022-05-23 Object storage method and apparatus based on a distributed database WO2023197404A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210391992.XA CN114741449A (zh) 2022-04-14 2022-04-14 Object storage method and apparatus based on a distributed database
CN202210391992.X 2022-04-14

Publications (1)

Publication Number Publication Date
WO2023197404A1 true WO2023197404A1 (fr) 2023-10-19

Family

ID=82280812

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/094380 WO2023197404A1 (fr) 2022-04-14 2022-05-23 Procédé et appareil de stockage d'objet basés sur une base de données distribuée

Country Status (2)

Country Link
CN (1) CN114741449A (fr)
WO (1) WO2023197404A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115543214B (zh) * 2022-11-25 2023-03-28 深圳华锐分布式技术股份有限公司 低时延场景下的数据存储方法、装置、设备及介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020029214A1 (en) * 2000-08-10 2002-03-07 Nec Corporation Synchronizable transactional database method and system
US20040107381A1 (en) * 2002-07-12 2004-06-03 American Management Systems, Incorporated High performance transaction storage and retrieval system for commodity computing environments
CN103577339A (zh) * 2012-07-27 2014-02-12 深圳市腾讯计算机系统有限公司 一种数据存储方法及系统
CN104133867A (zh) * 2014-07-18 2014-11-05 中国科学院计算技术研究所 分布式顺序表片内二级索引方法及系统
CN104731921A (zh) * 2015-03-26 2015-06-24 江苏物联网研究发展中心 Hadoop分布式文件系统针对日志型小文件的存储和处理方法
CN111046044A (zh) * 2019-12-13 2020-04-21 南京富士通南大软件技术有限公司 一种基于内存型数据库的分布式对象存储系统的高可靠性架构
CN113961153A (zh) * 2021-12-21 2022-01-21 杭州趣链科技有限公司 一种索引数据写入磁盘的方法、装置及终端设备

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117596211A (zh) * 2024-01-18 2024-02-23 湖北省楚天云有限公司 Ip分片多核负载均衡装置及方法
CN117596211B (zh) * 2024-01-18 2024-04-05 湖北省楚天云有限公司 Ip分片多核负载均衡装置及方法

Also Published As

Publication number Publication date
CN114741449A (zh) 2022-07-12

Similar Documents

Publication Publication Date Title
US11153380B2 (en) Continuous backup of data in a distributed data store
US10642840B1 (en) Filtered hash table generation for performing hash joins
US10423493B1 (en) Scalable log-based continuous data protection for distributed databases
US11720594B2 (en) Synchronous replication in a distributed storage environment
US10853182B1 (en) Scalable log-based secondary indexes for non-relational databases
US9946735B2 (en) Index structure navigation using page versions for read-only nodes
CN105393243B (zh) 事务定序
US11841844B2 (en) Index update pipeline
WO2023197404A1 (fr) Object storage method and apparatus based on a distributed database
US8868487B2 (en) Event processing in a flash memory-based object store
US8700842B2 (en) Minimizing write operations to a flash memory-based object store
US20120158650A1 (en) Distributed data cache database architecture
US20070288526A1 (en) Method and apparatus for processing a database replica
US20180218023A1 (en) Database concurrency control through hash-bucket latching
JPWO2011108695A1 (ja) 並列データ処理システム、並列データ処理方法及びプログラム
WO2021057108A1 (fr) Procédé de lecture de données, procédé d'écriture de données et serveur
WO2022111188A1 (fr) Procédé de traitement de transaction, système, appareil, dispositif, support d'enregistrement et produit-programme
WO2023165196A1 (fr) Procédé et appareil d'accélération de stockage de journal, et dispositif électronique et support de stockage lisible non volatil
US20240028598A1 (en) Transaction Processing Method, Distributed Database System, Cluster, and Medium
WO2023077971A1 (fr) Procédé et appareil de traitement de transactions, dispositif informatique et support de stockage
WO2019109256A1 (fr) Procédé de gestion de journal, serveur et système de base de données
WO2020119709A1 (fr) Procédé de mise en œuvre de fusion de données, dispositif, système et support de stockage
US11442663B2 (en) Managing configuration data
JP7450735B2 (ja) 確率的データ構造を使用した要求の低減
US10922012B1 (en) Fair data scrubbing in a data storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937045

Country of ref document: EP

Kind code of ref document: A1