RU2777270C1

RU2777270C1 - Method and system for distributed storage of recovered data which ensures integrity and confidentiality of information

Info

Publication number: RU2777270C1
Application number: RU2021123038A
Authority: RU
Inventors: Сергей Станиславович Чайковский
Original assignee: Коннект Медиа Лтд
Filing date: 2021-08-03
Publication date: 2022-08-01

Abstract

FIELD: data storage.

SUBSTANCE: the invention relates to a system and a method for distributed storage of recoverable data. The method for distributed storage of recoverable data with information integrity is performed by at least one system, which contains a plurality of storage nodes connected by a network and storing data in the form of extents, data service (DS) agents for managing extents, metadata service (MDS) agents for managing metadata related to nodes and extents, and cluster manager (CM) agents. The method consists of the following steps: generating a checksum for each data extent; detecting a node failure in the system by one of the CM agents, or a disk failure in the system by a DS agent; notifying the DS or MDS agents of the failure; independently generating, by the notified DS or MDS agents, a recovery plan for the data extents affected by the failure; and collectively restoring the affected extents, using the checksums, based on the plan generated in the previous step. Each node includes a plurality of disks, and a disk failure is detected by the DS agent based on the disk error rate.

EFFECT: improving the reliability of data recovery.

17 cl, 9 dwg

Description

FIELD OF TECHNOLOGY

[001] This technical solution relates generally to computer data storage systems and, in particular, to a system for distributed storage of recoverable data and to a method for recovering data after the failure of a storage node in the system or of a disk in one of the nodes.

BACKGROUND OF THE INVENTION

[002] Currently, large volumes of computer data are usually stored in distributed storage systems. Distributed storage systems offer several benefits, such as the ability to increase storage capacity as user requirements grow, data reliability based on data redundancy, and flexibility in servicing and replacing failed components. Distributed storage systems have been implemented in various forms, such as Redundant Array of Independent Disks (RAID) systems, which are widely known in the art.

[003] For example, the article "TickerTAIP Parallel RAID Architecture" (Cao et al.) describes a disk array (RAID) system that includes a number of worker nodes and originator nodes. Each worker node has several disks connected through a bus. The originator nodes provide connections to client computers. When a disk or node fails, the system reconstructs the lost data using the partial redundancy provided in the system. The described method applies only to RAID storage systems. In addition, it does not address a distributed storage system composed of a very large number of independent storage arrays.

[004] Also known from the prior art is US patent US 6438661, "Method, system, and program for managing meta data in a storage system and rebuilding lost meta data in cache" (assignee: International Business Machines Corp), which describes a method and system for rebuilding lost metadata in a cache. The metadata provides information about user data stored on a storage device. The method determines whether metadata tracks in the cache have been modified, indicates in non-volatile memory that the metadata tracks have been modified, and rebuilds the metadata tracks. Rebuilding includes accessing the data tracks associated with the metadata tracks, staging the accessed data tracks into the cache, and processing the data tracks to rebuild the metadata tracks. The described method does not recover data lost due to a node failure in a distributed storage system.

[005] Patent application US20020062422A1, "Method for rebuilding meta-data in a data storage system and a data storage system" (assignee: International Business Machines Corp), describes another method for rebuilding metadata in a storage system in which data streams are written to the system as segments. The method scans the metadata in each segment to identify the last segment written from each stream. It then rebuilds the metadata using the metadata in the segments, excluding the metadata of the identified last segments. The described method is not applicable to a distributed multi-node storage system and does not address data recovery after a node failure.

SUMMARY OF THE INVENTION

[006] The technical problem (or technical task) addressed by this technical solution is the implementation of a method and system for distributed storage of recoverable data that ensure the integrity and confidentiality of information.

[007] The technical result achieved by solving the above technical problem is increased reliability of data recovery.

[008] The specified technical result is achieved by implementing a method for distributed storage of recoverable data that ensures the integrity and confidentiality of information. The method is performed by at least one system that contains a plurality of storage nodes connected by a network and storing data in the form of extents, data service (DS) agents for managing extents, metadata service (MDS) agents for managing metadata related to nodes and extents, and cluster manager (CM) agents. The method consists of the following steps: detecting a disk failure in the system by one of the CM agents; notifying the DS or MDS agents of the failure; independently generating, by the notified DS or MDS agents, a recovery plan for the data extents affected by the failure; and collectively restoring the affected extents based on the plan generated in the previous step. Each node includes a plurality of disks, and a disk failure is detected by the DS agent based on the disk error rate.

[009] The specified technical result is also achieved by implementing a system for distributed storage of recoverable data that ensures the integrity and confidentiality of information. The system contains: a plurality of storage nodes connected by a network, each node storing data in the form of extents; a data service (DS) agent in each node for managing the extents in that node; a plurality of metadata service (MDS) agents for managing metadata related to nodes and extents, the MDS agents operating on a subset of the nodes; a cluster manager (CM) agent in each node for detecting a failure in the system and notifying a subset of the DS or MDS agents of the failure, whereupon the notified subset of DS or MDS agents independently generates a plan for restoring the extents affected by the failure and collectively restores the affected extents based on the plan; and a persistent map that correlates data extents with nodes, each MDS agent managing a subset of the map.
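As an illustration of the last element, the sketch below shows one way a map that correlates extents with nodes could be split among MDS agents so that each agent manages only a subset of it. The partitioning rule (extent ID modulo the number of MDS agents), the class names, and the in-memory dictionaries are assumptions made for this example; the description does not specify how the map is partitioned or persisted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MdsAgent:
    """Hypothetical MDS agent holding one partition of the extent-to-node map."""
    agent_id: int
    extent_to_nodes: Dict[int, List[int]] = field(default_factory=dict)


class PartitionedExtentMap:
    """Routes each extent ID to the MDS agent that owns its map entry."""

    def __init__(self, mds_count: int) -> None:
        self.agents = [MdsAgent(i) for i in range(mds_count)]

    def owner(self, extent_id: int) -> MdsAgent:
        # Assumed rule: modulo partitioning. Any deterministic rule would do.
        return self.agents[extent_id % len(self.agents)]

    def record(self, extent_id: int, node_ids: List[int]) -> None:
        # Store a copy so later changes to the caller's list do not leak in.
        self.owner(extent_id).extent_to_nodes[extent_id] = list(node_ids)

    def nodes_for(self, extent_id: int) -> List[int]:
        return self.owner(extent_id).extent_to_nodes.get(extent_id, [])


emap = PartitionedExtentMap(mds_count=3)
emap.record(extent_id=7, node_ids=[1, 4, 5])
emap.record(extent_id=9, node_ids=[2, 3, 4])
print(emap.owner(7).agent_id, emap.nodes_for(9))
```

With a deterministic ownership rule, any agent can find the owner of an entry without consulting a central directory.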

[0010] In some embodiments of the technical solution, the system further comprises an interface that allows an application to access the data stored in the system.

[0011] In some embodiments of the technical solution, the system further comprises means for determining the subset of nodes on which the MDS agents operate.

[0012] In some embodiments of the technical solution, each CM agent maintains an ordered list of the nodes that are currently operating in the system.
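A minimal sketch of such an ordered list is shown below. Ordering by numeric node ID and the method names `node_joined` and `node_failed` are assumptions made for the example; the description states only that the list is ordered.

```python
import bisect
from typing import List


class ClusterManagerAgent:
    """Hypothetical CM agent that tracks live nodes in sorted order."""

    def __init__(self) -> None:
        self.live_nodes: List[int] = []

    def node_joined(self, node_id: int) -> None:
        # Insert at the sorted position; a node already present is ignored.
        pos = bisect.bisect_left(self.live_nodes, node_id)
        if pos == len(self.live_nodes) or self.live_nodes[pos] != node_id:
            self.live_nodes.insert(pos, node_id)

    def node_failed(self, node_id: int) -> None:
        # Remove the node if it is present; unknown nodes are ignored.
        pos = bisect.bisect_left(self.live_nodes, node_id)
        if pos < len(self.live_nodes) and self.live_nodes[pos] == node_id:
            del self.live_nodes[pos]


cm = ClusterManagerAgent()
for n in (5, 2, 9, 2):
    cm.node_joined(n)
cm.node_failed(5)
print(cm.live_nodes)
```

If every CM agent applies the same ordering rule, agents that have observed the same joins and failures hold identical lists.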

[0013] In some embodiments of the technical solution, each CM agent includes means for detecting a node failure.

[0014] In some embodiments of the technical solution, each CM agent includes means for detecting a new node in the system.

[0015] In some embodiments of the technical solution, said means for detecting a new node includes means for monitoring network messages from the new node.

[0016] In some embodiments of the technical solution, each DS agent propagates updates to the extents in its associated node to the other DS agents in the system.

[0017] In some embodiments of the technical solution, each DS agent manages data caching in its associated node.

[0018] In some embodiments of the technical solution, each node contains a plurality of data disks, and each DS agent includes means for detecting a disk failure in its associated node.

[0019] In some embodiments of the technical solution, the means for detecting a disk failure is based on the disk error rate.
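One possible reading of error-rate-based detection is sketched below: a DS agent records the outcome of recent I/O operations on a disk and declares the disk failed once the fraction of errors in a sliding window reaches a threshold. The window size, the threshold, and the class name `DiskErrorMonitor` are assumptions made for the example; the description does not define how the error rate is measured.

```python
from collections import deque


class DiskErrorMonitor:
    """Flags a disk as failed when its recent I/O error rate is too high."""

    def __init__(self, window: int = 100, threshold: float = 0.2) -> None:
        self.window = window        # assumed: number of recent I/Os considered
        self.threshold = threshold  # assumed: error fraction treated as failure
        self.results = deque(maxlen=window)

    def record_io(self, ok: bool) -> None:
        # The deque drops the oldest result once the window is full.
        self.results.append(ok)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def has_failed(self) -> bool:
        # Wait for a full window so a single early error is not decisive.
        return (len(self.results) == self.window
                and self.error_rate() >= self.threshold)


monitor = DiskErrorMonitor(window=10, threshold=0.3)
for ok in [True] * 7 + [False] * 3:
    monitor.record_io(ok)
print(monitor.error_rate(), monitor.has_failed())
```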

[0020] In some embodiments of the technical solution, the recovery plan includes a list of the extents that must be rebuilt to restore data redundancy in the system.

[0021] In some embodiments of the technical solution, the extents to be rebuilt are restored jointly by the DS agents.

[0022] In some embodiments of the technical solution, space is allocated on the nodes that are still operating to replace the extents affected by the failure.

[0023] In some embodiments of the technical solution, the data in the affected extents is determined and transferred to the allocated space.

[0024] In some embodiments of the technical solution, the DS agents are notified of the failure, and the notified DS agents determine which extents have data on the failed node.
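The embodiments in paragraphs [0020] to [0024] can be combined into a small sketch of plan generation: find the extents that have data on the failed node, then allocate space for each on a node that is still operating. The least-loaded target rule, the `RecoveryStep` type, and the function name are assumptions made for the example, not the patented procedure.

```python
from typing import Dict, List, NamedTuple


class RecoveryStep(NamedTuple):
    extent_id: int
    sources: List[int]  # surviving nodes that still hold data of the extent
    target: int         # live node chosen to host the rebuilt data


def build_recovery_plan(extent_to_nodes: Dict[int, List[int]],
                        failed_node: int,
                        live_nodes: List[int]) -> List[RecoveryStep]:
    # Count existing fragments per live node so new ones go to the emptiest.
    load = {n: 0 for n in live_nodes}
    for nodes in extent_to_nodes.values():
        for n in nodes:
            if n in load:
                load[n] += 1

    plan = []
    for extent_id in sorted(extent_to_nodes):
        nodes = extent_to_nodes[extent_id]
        if failed_node not in nodes:
            continue  # extent is not affected by the failure
        sources = [n for n in nodes if n != failed_node]
        candidates = [n for n in live_nodes if n not in nodes]
        if not candidates:
            continue  # no eligible node; a real system would have to report this
        # Ties are broken by node ID so the result is deterministic.
        target = min(candidates, key=lambda n: (load[n], n))
        load[target] += 1
        plan.append(RecoveryStep(extent_id, sources, target))
    return plan


extent_map = {1: [1, 2, 3], 2: [2, 3, 4], 3: [1, 3, 4]}
plan = build_recovery_plan(extent_map, failed_node=2, live_nodes=[1, 3, 4])
for step in plan:
    print(step)
```

Because the function is deterministic, agents that start from the same map and the same list of live nodes compute the same plan without coordinating, which is one way to read the requirement that agents generate the plan independently.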

BRIEF DESCRIPTION OF THE DRAWINGS

[0025] FIG. 1 is a block diagram showing a typical configuration of a collective storage system according to the invention.

[0026] FIG. 2 is a block diagram showing a typical physical configuration of a storage node of the collective storage system.

[0027] FIG. 3 illustrates an example of a data extent and its fragments, which reside on different storage nodes.

[0028] FIG. 4 is a block diagram showing a typical configuration of the collective storage system with agents that support data recovery operations in accordance with the invention.

[0029] FIG. 5 illustrates an example of a Fragment_To_Disk map in accordance with the invention.

[0030] FIG. 6 illustrates an example of an Extent_To_Node map in accordance with the invention.

[0031] FIG. 7 is a flowchart showing the general process of data recovery after a node failure or a disk failure in the collective storage system.

[0032] FIG. 8 is a flowchart showing a preferred process for creating a data recovery plan after a node failure.

[0033] FIG. 9 is a flowchart showing a preferred process for creating a data recovery plan after a disk failure.

DETAILED DESCRIPTION OF THE INVENTION

[0034] The terms used in the description of the technical solution, and their definitions, are discussed in detail below.

[0035] In this invention, a system means a computer system, a computer, a CNC (computer numerical control) system, a PLC (programmable logic controller), computerized control systems, and any other devices capable of performing a given, well-defined sequence of operations (actions, instructions), as well as centralized and distributed databases and smart contracts.

[0036] A command processing device means an electronic unit or an integrated circuit (microprocessor) that executes machine instructions (programs), a smart contract, an Ethereum virtual machine (EVM), or the like. The command processing device reads and executes machine instructions (programs) from one or more data storage devices. Data storage devices may include, but are not limited to, hard disk drives (HDD), flash memory, ROM (read-only memory), solid-state drives (SSD), and optical drives.

[0037] A program is a sequence of instructions intended to be executed by the control unit of a computer or by a command processing device.

[0038] A cluster manager is typically backend software with a graphical user interface (GUI) or command-line interface that runs on one or all of the cluster nodes (in some cases it runs on a different server or on a cluster of management servers). The cluster manager works together with a cluster management agent. These agents run on each node of the cluster to manage and configure services, a set of services, or the entire cluster server itself (see supercomputers). In some cases, the cluster manager is mainly used to dispatch the work that the cluster (or cloud) must perform. In that case, a subset of the cluster manager may be a remote desktop application that is used not for configuration but simply for submitting work and retrieving results from the cluster. In other cases, the cluster is concerned more with availability and load balancing than with computation or specific service clusters.

[0039] The invention will be described primarily as a system for distributed storage of recoverable data that ensures the integrity and confidentiality of information. However, those skilled in the art will appreciate that a system, which may be a data processing apparatus including a central processing unit, memory, an input/output unit, program storage, a connecting bus, and other appropriate components, may be programmed or otherwise designed to facilitate the practice of the invention. Such a system would include appropriate software for carrying out the operations of the invention.

[0040] In addition, a technical solution such as a pre-recorded disk or other similar computer program product, for use with a data processing system, may include a storage medium and software recorded on it for directing the data processing system so as to facilitate the practice of the invention. Such devices and technical solutions also fall within the spirit and scope of the invention.

[0041] FIG. 1 is a high-level block diagram of a system 100 for distributed storage of recoverable data that ensures the integrity and confidentiality of information according to the invention. The system 100 includes a large number of storage nodes 101, which are typically connected to a node fabric 102. Each node 101 includes one or more privately attached disks 103 that contain client data. The node fabric 102 is preferably a Fibre Channel network (a very high-speed, 1 Gbit/s and above, full-duplex data transfer scheme), an iSCSI network, or another type of network. The storage nodes 101 are described in more detail below with reference to FIGS. 2-6. The integrity of the data stored on the storage devices depends on the physical state of those devices, namely on whether they are functioning properly or are, for example, at least partly faulty.

[0042] The data clients supported by the collective storage system 100 are called hosts. Hosts can be connected to the storage system 100 using two types of interface: a host driver or a gateway. FIG. 1 shows two hosts, 107 and 108. In the case of host 107, its application 109 accesses data in the storage nodes 101 using a host driver 110. In the case of host 108, its application 111 accesses client data using a gateway agent 112, which resides in a gateway node 113. The functions of the host driver 110 and the gateway agent 112 are described in more detail below, together with the other agents in the system 100. Typically, the applications 109 and 111 are file systems or database systems.

[0043] FIG. 2 illustrates a preferred embodiment of a storage node 101 in terms of its physical components. The node 101 has a fabric interface 201 for connecting to the node fabric 102, RAM 202 and NVRAM 203 for storing data while the node processes it, a processor 204 for controlling and processing data, and a disk interface 205. The node 101 further includes a set of disks 207 for storing client data, which are connected to the disk interface 205 via a disk array 206.

[0044] The storage nodes 101 in the collective storage system 100 are not redundant; that is, the hardware components in the nodes, such as the processor 204, the memory 202-203, and the disks 207, are not duplicated. Thus, a failure of one or more components in a storage node 101 can bring down the entire node and prevent access to the disks 207 on that node. This lack of redundancy, combined with the higher failure rate that results from the large number of disks 207 and nodes 101, means that the software in the system 100 must handle failures and make client data sufficiently redundant to keep the data accessible despite failures.

[0045] В системе 100 коллективного хранения доступность данных клиента может быть обеспечена с использованием методики, основанной на избыточности, такой как разделение данных RAID, простая репликация. В каждом из этих методов данные о клиентах хранятся в наборе блоков или фрагментов физической памяти, где каждый блок данных находится на отдельном узле.[0045] In the shared storage system 100, the availability of client data can be ensured using a technique based on redundancy such as RAID data partitioning, simple replication. In each of these methods, customer data is stored in a set of blocks or chunks of physical memory, with each block of data located on a separate node.

[0046] The placement of client data across the nodes 101 and disks 103 is performed by a distribution process that aims to achieve one or more system goals. A desirable data distribution process has two key goals: availability and load balancing. As to availability, when a node 101 or a disk 103 fails, all storage fragments on that node 101 or disk 103 may be lost. The other nodes must therefore hold enough redundant data to reconstruct the lost fragments. As to load balancing, the data read/write load should be spread across the nodes 101 to avoid bottlenecks.

[0047] The organization of data is described below.

[0048] Client data in the system 100 is organized as fixed-size blocks that are linearly addressed within SCSI logical units (that is, disks). Each logical unit has a number called a LUN (logical unit number). The system appears as a SCSI block device on a single bus with many logical units (although this is flexible for host operating systems that have limited LUN capabilities). Internally, a logical unit is mapped to a container. A container is a set of objects called extents. The attributes of a container include an identifier (container ID) that uniquely identifies the container within the storage system 100. The host LUN mapping, that is, the mapping function between a LUN and a container ID, is explicitly stored and managed by the system 100.

[0049] An extent, in file systems, is a contiguous area of a storage medium. In file systems that support extents, large files typically consist of several extents that are not directly linked to one another. An extent is stored as a group of fragments, with each fragment located on a different node. A fragment is uniquely identified by its extent ID and fragment ID. The group of fragments that makes up an extent is called an extent group (or fragment relationship group). FIG. 3 illustrates an extent 300 that comprises three fragments 301, 302, and 303, located on node 1, node 2, and node 3, respectively.

[0050] An extent refers to a series of contiguous blocks on a storage medium, and its size may vary from application to application. For example, one application may divide a disk into extents of one size, while another application may divide a disk into extents of a different size.

[0051] If a block on a disk is modified after a shadow copy has been created, then before the block is modified, the extent containing that block is copied to a storage location in the differential area. For a given shadow copy, an extent is copied only the first time any block within it is modified. When a request is received for information contained in the shadow copy, a check is first made to determine whether the block has changed in the original volume (for example, by checking whether the extent containing the block is present in the differential area). If the block has not changed, the data is retrieved from the original volume and returned. If the block has changed, the data is retrieved from the differential area and returned. Note that if a block is overwritten with the same data, the extent containing that block is not written to the differential area.
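The copy-on-write behaviour described in this paragraph can be sketched roughly as follows. This is a minimal in-memory illustration, not the implementation of the described system: the class name `ShadowCopyVolume`, the fixed `extent_size`, and the list-of-blocks volume are assumptions introduced only for the example.

```python
class ShadowCopyVolume:
    """Copy-on-write shadow copy at extent granularity (hypothetical sketch)."""

    def __init__(self, blocks, extent_size):
        self.blocks = list(blocks)       # original volume, one entry per block
        self.extent_size = extent_size   # blocks per extent (assumed fixed)
        self.diff_area = {}              # extent index -> saved copy of that extent

    def write(self, block_no, data):
        if self.blocks[block_no] == data:
            return  # overwritten with the same data: nothing goes to the differential area
        ext = block_no // self.extent_size
        if ext not in self.diff_area:
            # Only the first change inside an extent copies the extent.
            start = ext * self.extent_size
            self.diff_area[ext] = self.blocks[start:start + self.extent_size]
        self.blocks[block_no] = data

    def read_snapshot(self, block_no):
        ext = block_no // self.extent_size
        if ext in self.diff_area:
            # The extent changed after the shadow copy: serve the saved data.
            return self.diff_area[ext][block_no % self.extent_size]
        return self.blocks[block_no]     # unchanged: serve from the original volume
```

Reading through `read_snapshot` always returns the data as it was when the shadow copy was taken, while `blocks` holds the current state of the volume.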

[0052] The attributes of an extent include its identifier (extent ID), its redundancy level, and a list of the nodes 101 that hold data dependencies (replicas, erasure code, and so on) on the client data stored in that extent. The extent ID is unique within a container. The reliability and availability of the extent 300 are achieved by distributing its fragments 301-303 across separate nodes 101.

[0053] In some implementations of the system, various cryptographic functions may be used to ensure data integrity, such as SHA-1 (Secure Hash Algorithm), MD5, and RSA, and several encryption options are provided, including DES, 3DES, and AES. In some embodiments, a cyclic redundancy check (CRC) is used; it is one of the algorithms for computing a checksum intended for verifying data integrity. In accordance with non-limiting embodiments of the present technology, the cyclic redundancy check may be implemented by polynomial division over a finite field.
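To illustrate what polynomial division over a finite field means in practice, the sketch below computes a CRC bit by bit over GF(2), where subtraction is XOR. The source does not say which CRC variant is used; the common CRC-32 (reflected polynomial 0xEDB88320) is assumed here purely as an example, and the function name `crc32_bitwise` is hypothetical.

```python
import zlib


def crc32_bitwise(data: bytes) -> int:
    """CRC-32 as long division of the message polynomial over GF(2)."""
    crc = 0xFFFFFFFF                      # standard CRC-32 initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; if it was 1, subtract (XOR) the generator polynomial.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF               # standard CRC-32 final inversion


# The bitwise result agrees with the table-driven implementation in zlib.
assert crc32_bitwise(b"123456789") == zlib.crc32(b"123456789")
```

Production code would normally call `zlib.crc32` directly; the explicit loop is shown only to make the division step visible.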

[0054] In some implementations, a checksum is computed for each extent for integrity control. Computing the checksum of each extent, for example with the MD5 algorithm, yields a set of values: 1bc29b36f623ba82aaf6724fd3b16718, …, 026f8e459c8f89ef75fa7a78265a0025. A checksum is a value calculated from a data set by a computational algorithm and used to verify the integrity of the data during transmission or storage.
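A per-extent checksum of this kind could be computed and later verified along the following lines. This is a sketch using Python's standard `hashlib`; the function names and the dictionary of extents keyed by extent ID are assumptions made for the example, not structures defined in the source.

```python
import hashlib


def extent_checksums(extents):
    """Map each extent ID to the MD5 hex digest of its data."""
    return {extent_id: hashlib.md5(data).hexdigest()
            for extent_id, data in extents.items()}


def verify_extent(extent_id, data, checksums):
    """Return True if the data still matches the checksum recorded for the extent."""
    return hashlib.md5(data).hexdigest() == checksums[extent_id]
```

After an extent is rebuilt on another node, `verify_extent` can be called on the restored data to confirm that it matches the checksum recorded before the failure.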

[0055] In a preferred embodiment of the method, a data node is divided into three regions, and each of the three regions is divided into a replacement sub-region and a data sub-region. The sub-regions are divided into fixed-length blocks, each of which is protected by a CRC (cyclic redundancy check) checksum used to verify the integrity of the data in the block. If an incorrect checksum is detected for a block in the data sub-region, the corresponding block from the replacement sub-region is used in its place.
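The fallback from the data sub-region to the replacement sub-region can be sketched as follows. The in-memory layout (lists of `(payload, crc)` pairs), the use of CRC-32 from `zlib`, and the names `protect` and `read_block` are assumptions made for illustration; the source does not specify them.

```python
import zlib


def protect(payload: bytes):
    """Store a fixed-length block together with its CRC checksum."""
    return (payload, zlib.crc32(payload))


def read_block(data_sub, replacement_sub, index):
    """Read a block, falling back to the replacement sub-region on a CRC mismatch."""
    payload, crc = data_sub[index]
    if zlib.crc32(payload) == crc:
        return payload
    # The block in the data sub-region is corrupt: use the corresponding replacement block.
    payload, crc = replacement_sub[index]
    if zlib.crc32(payload) != crc:
        raise IOError(f"block {index} is corrupt in both sub-regions")
    return payload
```

Raising an error when both copies fail the check is a choice made for this sketch; the source does not say what happens in that case.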

[0056] When data is restored on the data nodes, the extent checksums are used to verify integrity.

[0057] Data confidentiality in the present system is ensured by additionally encrypting the data. Encryption is the traditional way of keeping stored data confidential. The present solution uses an encryption algorithm based on block ciphers. Such ciphers operate on fixed-length pieces of data (blocks) and combine cryptographic strength with high speed.

[0058] The trend toward larger volumes of stored information and higher data transfer rates demands high performance from the block ciphers used. An effective way to speed up an encryption algorithm is parallel computation. For a software implementation, one way to organize parallel computation is to use SIMD technologies, in which a single processor instruction simultaneously processes several pieces of data previously placed in one register. SIMD technologies are widespread and are supported on most modern computing platforms, including general-purpose Intel and AMD processors. The present solution may use several standard SIMD instruction sets, widely known in the art, each of which is designed to work with registers of a particular length.

[0059] In the present technical solution, SIMD technologies are used in the block encryption algorithms to process several data blocks at once efficiently. With SIMD, an encryption algorithm designed to process one block is executed on several blocks simultaneously. The efficiency of this approach depends directly on whether the transformations and operations used in the encryption algorithm can be executed in parallel, which in turn depends on whether the computing platform provides the corresponding SIMD instructions. If every operation can be parallelized, several data blocks are processed in the time needed to process one block; that is, the performance of the algorithm grows in proportion to the number of blocks processed simultaneously. Since that number is determined by the length of the registers used, performance grows in proportion to the register length. The Kuznyechik ("Grasshopper") encryption algorithm may be used in the present technical solution. Kuznyechik (also transliterated Kuznechik) is a symmetric block cipher with a 128-bit block size and a 256-bit key length; it is built as an SP network, and its round keys are generated using a Feistel network.
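As a rough worked illustration of this proportionality (an idealized estimate that assumes every operation parallelizes perfectly and ignores memory bandwidth; it is not a measurement from the source), the number of blocks n processed per instruction for a register of length L bits and a block of size b bits is:

```latex
n = \frac{L}{b}, \qquad b = 128:\quad L = 128 \Rightarrow n = 1,\quad L = 256 \Rightarrow n = 2,\quad L = 512 \Rightarrow n = 4
```

On x86 processors, registers of these lengths are provided, for example, by the SSE, AVX2, and AVX-512 instruction sets, respectively.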

[0060] The system configuration is described in detail below.

[0061] FIG. 4 is a block diagram showing a typical configuration of a collective storage system 400 with the agents needed to support data recovery operations. Each node 401 of the collective storage system 400 concurrently runs one or more software agents, also called modules. There are four types of agents in the collective storage system 400: cluster manager (CM) agents 410, data service (DS) agents 420, metadata service (MDS) agents 430, and a host driver (HD) agent 440. Not all four agent types run on every storage node 401. Furthermore, in preferred embodiments of the invention, no more than one instance of each agent type runs on a node 401.

[0062] The cluster manager (CM) agent operates as follows.

[0063] The cluster manager (CM) agent 410 is responsible for maintaining an ordered list of the nodes 401 currently operating in the collective storage system 400, for detecting node failures, and for detecting new nodes in the system 400. In preferred embodiments of the invention, only one CM agent 410 runs on each node 401. Each CM agent 410 in the system 400 has a unique name, which is preferably the unique identifier of the corresponding node. The ordered lists of currently functioning nodes 401 maintained by the CM agents 410 in the system 400 are identical to one another. All nodes 401 in the system 400 thus have the same information about which nodes 401 are currently operating, that is, an identical view of the system.

[0064] The cluster manager agent 410 detects a node failure when it does not receive an expected response from that node within the failure detection interval. Failure detection is widely known in the art.
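Timeout-based detection of this kind can be sketched in a few lines. The function name, the dictionary of last response times, and the way the interval is supplied are assumptions made for illustration; the source does not describe the CM agent's data structures.

```python
def detect_failed_nodes(last_response_time, now, detection_interval):
    """Return the nodes whose last response is older than the detection interval."""
    return sorted(node for node, t in last_response_time.items()
                  if now - t > detection_interval)
```

For example, with a detection interval of 5 seconds, a node last heard from 8 seconds ago is reported as failed, while a node heard from 2 seconds ago is not.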

[0065] When several nodes 401 attempt to join or leave the collective storage system 400 at the same time, each CM agent 410 ensures that the join and leave notifications are delivered sequentially, in the same order, to all nodes 401 in the system 400.

[0066] The data service (DS) agent operates as follows.

[0067] The data service (DS) agent 420 is responsible for managing the raw data on a storage node 401 in the form of the data fragments residing on that node. This includes data caching and data redundancy. Data caching concerns write caching and read-ahead. To provide data redundancy for an extent, an update to any of its data fragments is propagated to the nodes that hold the related fragments in the same extent group. In addition, the DS agent 420 maintains a persistent map called the Fragment_To_Disk map. The Fragment_To_Disk map holds information that correlates the data fragments on the node 101 on which the DS agent runs with logical block addresses on the disks 103 of that node. FIG. 5 shows an example Fragment_To_Disk map 400, which indicates that fragment 0001 of extent ABC0052 is located on disk 01 at disk address 1001.
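The single entry described for FIG. 5 could be represented as follows. The dictionary layout and the helper `locate_fragment` are assumptions made for illustration; the source names the map but does not define its in-memory or on-disk format.

```python
# (extent ID, fragment ID) -> (disk ID, logical block address on that disk)
fragment_to_disk = {
    ("ABC0052", "0001"): ("01", 1001),   # the entry described for FIG. 5
}


def locate_fragment(fmap, extent_id, fragment_id):
    """Return (disk, address) for a fragment on this node, or None if it is not held here."""
    return fmap.get((extent_id, fragment_id))
```

Keying the map by the (extent ID, fragment ID) pair follows from the statement that a fragment is uniquely identified by those two identifiers.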

[0068] The DS agent 420 also manages a local copy of a second persistent map, called the Extent_To_Node map. This map holds information that correlates an extent with the working nodes that contain the fragments making up that extent. The DS agent 420 that holds the first fragment in an extent group is called the leader for that extent. The DS agents 420 that hold the remaining fragments of the extent are called followers. The leader of an extent group is responsible for driving data repair, recovery, and updates to the followers. Leadership is transferred when a disk or node failure causes the loss of the leader. Consistency control for reads and writes to the fragments of the same extent is driven by the leader. That is, all I/O operations are sent to the leader of the extent group, and the leader acts as the serialization point for reads and writes to that extent.
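The leader rule can be illustrated with a small sketch. The node names and the function `current_leader` are hypothetical, and the rule used here for transferring leadership (the next surviving node in fragment order) is an assumption; the source states only that leadership is transferred when the leader is lost.

```python
# Extent ID -> nodes holding its fragments, in fragment order (node names are made up).
extent_to_node = {"ABC0052": ["node1", "node2", "node3"]}


def current_leader(emap, extent_id, working_nodes):
    """Return the first working node in fragment order, or None if none survives."""
    for node in emap[extent_id]:
        if node in working_nodes:
            return node      # holder of the first surviving fragment acts as leader
    return None
```

With all three nodes working, `node1` is the leader; if `node1` fails, this sketch hands leadership to `node2`.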

[0069] The leaders of the extent groups in the system 400 are spread among the nodes 401 by a distribution algorithm to ensure that no single node becomes a bottleneck by virtue of being a leader. The requirement that all reads go through the leader can be relaxed. The requirement that all writes be sent to the leader can also be relaxed by using closely synchronized node clocks and timestamps. FIG. 6 shows an example Extent_To_Node map for extent ABC0052.

[0070] All the DS agents 420 in the storage system 400 collectively manage the extent groups in the system during reads and writes of client data, recovery from disk or node failures, reorganization, and snapshot operations.

[0071] In preferred embodiments of the invention, one instance of the DS agent runs on each node of the collective storage system 400. Data extents are laid out wholly on a disk; that is, an extent does not span multiple disks. As far as extents are concerned, therefore, the failure of a disk containing a fragment of an extent is equivalent to the failure of the node 401 containing that fragment.

[0072] The data service agent 420 detects a disk failure by means of timeouts on regular polling (ping) requests. For example, while a disk is idle, the DS agent 420 polls the disk at regular intervals to determine its state. If requests time out, the DS agent 420 may periodically check the error rate of the disk to determine the disk's status.
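A decision rule combining poll timeouts with the disk error rate might look like the sketch below. The function name and both thresholds (`max_timeouts`, `max_error_rate`) are hypothetical values chosen for the example; the source does not give concrete limits.

```python
def disk_failed(timed_out_polls, error_count, request_count,
                max_timeouts=3, max_error_rate=0.1):
    """Treat a disk as failed after repeated poll timeouts or a high error rate."""
    if timed_out_polls >= max_timeouts:
        return True
    if request_count == 0:
        return False         # no requests issued yet, so no error rate to judge by
    return error_count / request_count > max_error_rate
```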

[0073] The metadata service (MDS) agent is described in detail below.

[0074] The metadata service (MDS) agent 430 is responsible for managing the Extent_To_Node map, a persistent map that correlates all data extents in the system with the nodes that contain the fragments making up those extents. Each DS agent 420 (described above) has an Extent_To_Node map that contains entries only for the extents whose fragments it manages. The MDS agent 430, by contrast, has an Extent_To_Node map for all extents in the storage system 400. The MDS map is indexed with the extent ID as the primary key. A secondary index on nodes is also maintained, which is useful when generating a recovery plan in response to a node failure. Furthermore, unlike the DS agent 420, the MDS agent 430 does not run on every node 401. Instead, an adaptive method is used to determine the set of nodes 401 that will run MDS agents 430. The MDS agent 430 also stores and manages extent metadata and performs extent creation and deletion.
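The two indexes can be sketched as a pair of dictionaries kept in step with each other. The class `MdsMap` and its method names are hypothetical, and persistence is omitted; the source describes the primary key and the secondary index but not their implementation.

```python
from collections import defaultdict


class MdsMap:
    """Extent_To_Node map with a secondary index on nodes (in-memory sketch)."""

    def __init__(self):
        self.by_extent = {}               # primary index: extent ID -> list of nodes
        self.by_node = defaultdict(set)   # secondary index: node -> extent IDs

    def add_extent(self, extent_id, nodes):
        self.by_extent[extent_id] = list(nodes)
        for node in nodes:
            self.by_node[node].add(extent_id)

    def nodes_for(self, extent_id):
        """Locate an extent: the nodes that hold its fragments."""
        return self.by_extent[extent_id]

    def extents_on(self, node):
        """List the extents affected if the given node fails."""
        return sorted(self.by_node.get(node, ()))
```

The secondary index lets the affected extents be found directly from the failed node, without scanning every entry of the primary index.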

[0075] The mapping of an extent in a container to its extent group is performed by the MDS agent 430. Extent allocation is likewise provided by the MDS agent 430, which forms the extent group for a new extent. The MDS agent 430 also manages the list of configured containers and their attributes.

[0076] The host driver (HD) agent is described below.

[0077] The host driver (HD) agent 440 is the interface through which a client application 402 can access data in the collective storage system 400. The HD agent 440 communicates with the application 402 in terms of logical units (LUNs) and with the rest of the storage system 400 in terms of containers and data extents. The HD agent 440 typically resides in the host 403 where the application 402 runs. However, its interface functions may instead be provided in a storage node 404 in the form of a gateway agent 405. The host application then accesses data in the storage system 400 through the gateway agent 405, which resides in the gateway node 404.

[0078] To access client data in the system 400, the HD agent 440 or the gateway agent 405 determines the extent and the leader for that extent. The HD agent 440 or the gateway agent 405 then accesses the data on the storage node that hosts the leader of the extent. To access a data extent, a client must first obtain the location of the extent. This function is provided by the MDS agent 430. Given an extent ID, the MDS agent 430 returns a list of the nodes where the fragments of that extent can be found.

[0079] Data recovery is described in detail below.

[0080] The process of recovering data in the collective storage system 400 after a node failure or a disk failure is now described. Recovery is the process of recreating the data lost with a failed disk or node on new nodes or disks, in order to guard against data loss from subsequent failures. A desirable recovery process has the following properties:

[0081] a) reliability: all affected client data is eventually recovered.

[0082] b) fault tolerance: if a second failure occurs during data recovery, property (a) still holds.

[0083] c) efficiency: the recovery process requires as few message exchanges and data movements between nodes as possible.

[0084] d) balance: the data recovery work is distributed across the nodes to minimize the impact on concurrent access to client data.

[0085] e) scalability: the time required to recover from a failed disk or node should scale inversely with the size of the system, i.e., the larger the system, the shorter the recovery time. This is important because the system has to cope with a higher failure rate as it grows.
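Property (e) can be made concrete with a toy estimate. The model below is an assumption introduced only for illustration and is not taken from the description: the lost data is rebuilt in equal shares by all surviving nodes in parallel, each contributing the same bandwidth. The function name `estimated_recovery_time` and its parameters are hypothetical.

```python
def estimated_recovery_time(lost_bytes: float,
                            per_node_bandwidth: float,
                            node_count: int) -> float:
    """Toy model: the surviving nodes rebuild equal shares in parallel.

    Returns the time in seconds when lost_bytes is given in bytes and
    per_node_bandwidth in bytes per second.
    """
    survivors = node_count - 1
    if survivors < 1:
        raise ValueError("at least one surviving node is required")
    return lost_bytes / (per_node_bandwidth * survivors)
```

Under this model, the same amount of lost data is rebuilt in half the time when the number of surviving nodes doubles, which is the inverse scaling that property (e) asks for.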

[0086] FIG. 7 is a flowchart showing the general process for recovering lost data after a node failure or a disk failure in the shared storage system 400. At step 701, a failure is detected by an agent in the system 400. As described above for the CM and DS agents, a node failure is detected by a CM agent and a disk failure is detected by a DS agent. At step 702, the agents in the system 400 responsible for coordinating data recovery are notified of the failure. In one preferred embodiment of the invention, recovery is coordinated by one of the MDS agents in the system. In another embodiment of the invention, recovery is coordinated by one of the DS agents in the system. At step 703, the coordinating agent generates a data recovery plan, as described below with reference to FIGS. 8 and 9. The plan identifies the data extents in the system that are affected by the failure. At step 704, the agents in the system collectively recover the affected extents based on the recovery plan.
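The four steps of FIG. 7 can be read as a small pipeline. The Python sketch below is illustrative only: `Failure`, `DETECTOR` and `run_recovery` are hypothetical names, and the plan generator and the per-extent restore action are passed in as callables because their details are given separately with reference to FIGS. 8 and 9.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Failure:
    kind: str    # "node" or "disk"
    target: str  # identifier of the failed node or disk


# Which agent type detects which kind of failure (step 701).
DETECTOR: Dict[str, str] = {"node": "CM", "disk": "DS"}


def run_recovery(failure: Failure,
                 make_plan: Callable[[Failure], List[str]],
                 restore_extent: Callable[[str], None]) -> List[str]:
    """Walk through steps 701-704 and return a log of what happened."""
    log = []
    detector = DETECTOR[failure.kind]
    log.append(f"701: {detector} agent detected {failure.kind} "
               f"failure on {failure.target}")
    log.append("702: coordinating agents notified")
    plan = make_plan(failure)              # step 703: affected extents
    log.append(f"703: plan covers {len(plan)} extent(s)")
    for extent_id in plan:                 # step 704: restore each extent
        restore_extent(extent_id)
    log.append("704: affected extents restored")
    return log
```

For example, `run_recovery(Failure("disk", "disk-3"), lambda f: ["e1", "e2"], print)` would log a DS-detected disk failure and invoke the restore action once per planned extent.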

[0087] FIG. 8 is a flowchart showing a preferred process for generating a data recovery plan after a node failure in accordance with the invention. At step 801, the MDS agent determines the data extents affected by the node failure using the Extent_To_Node map. Entries in this map correlate the extents in the storage system with the nodes of the system. Scanning the map for the identifier of the failed node identifies the affected extents. At step 802, the method determines whether any extent leaders have been lost. If a leader has been lost due to the failure, a new leader for that extent is chosen from the remaining fragments in the extent group at step 803. A simple choice is to make the DS agent that holds the first of the remaining fragments in the extent the new leader, if it was not already the leader. Then, at step 804, the MDS agent allocates new space on the still-functioning nodes to hold the data of the affected extents.
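Steps 801 to 804 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the Extent_To_Node map is modeled as a dictionary from extent identifier to an ordered list of nodes, the first node in each list is assumed to hold the leader, and replacement space is assigned round-robin to surviving nodes that do not already hold a fragment of the extent. `ExtentPlan` and `plan_node_recovery` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class ExtentPlan(NamedTuple):
    extent_id: str
    new_leader: str   # node holding the leader once the plan is applied
    target_node: str  # surviving node where replacement space is allocated


def plan_node_recovery(extent_to_node: Dict[str, List[str]],
                       failed_node: str,
                       live_nodes: List[str]) -> List[ExtentPlan]:
    plan: List[ExtentPlan] = []
    # Step 801: scan the map for extents that had a fragment on the failed node.
    for extent_id, nodes in extent_to_node.items():
        if failed_node not in nodes:
            continue
        survivors = [n for n in nodes if n != failed_node]
        if not survivors:
            raise RuntimeError(f"no surviving fragment for {extent_id}")
        # Steps 802-803: if the leader (assumed to be nodes[0]) was lost,
        # promote the holder of the first remaining fragment.
        new_leader = nodes[0] if nodes[0] != failed_node else survivors[0]
        # Step 804: allocate space on a live node without a fragment of this
        # extent (round-robin choice is an assumption of this sketch).
        candidates = [n for n in live_nodes
                      if n != failed_node and n not in survivors]
        if not candidates:
            raise RuntimeError(f"no node available for {extent_id}")
        target = candidates[len(plan) % len(candidates)]
        plan.append(ExtentPlan(extent_id, new_leader, target))
    return plan
```

Extents that have no fragment on the failed node are skipped, so the resulting plan lists only the extents that must be rebuilt.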

[0088] FIG. 9 is a flowchart showing a preferred process for generating a data recovery plan after a disk failure in accordance with the invention. At step 901, the DS agent associated with the failed disk determines the data extents affected by the disk failure based on the Fragment_To_Disk map in the DS agent. This map lists the fragment identifiers of the fragments on each disk. Scanning the map for the identifier of the failed disk identifies the fragments that were on that disk. At step 902, the method determines whether any extent leaders have been lost. If a leader has been lost due to the failure, a new leader for that extent is chosen from the remaining fragments in the extent group at step 903. A simple choice is to make the DS agent that holds the first of the remaining fragments in the extent the new leader, if it was not already the leader. At step 904, new space is allocated on the still-functioning nodes to hold the data of the affected extents. In one preferred embodiment of the invention, the DS agents start the recovery task and therefore allocate the new space. In another preferred embodiment of the invention, the MDS agent is responsible for coordinating the recovery work and handles the allocation task.
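Steps 901 and 902 can be sketched in Python as follows. The encoding is an assumption made for illustration: the Fragment_To_Disk map is modeled as a dictionary from disk identifier to a list of fragment identifiers, each fragment identifier is a pair of extent identifier and fragment index, and index 0 is assumed to mark the leader fragment. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical fragment identifier: (extent id, fragment index).
FragmentId = Tuple[str, int]


def fragments_on_failed_disk(fragment_to_disk: Dict[str, List[FragmentId]],
                             failed_disk: str) -> List[FragmentId]:
    """Step 901: look up the fragments that lived on the failed disk."""
    return list(fragment_to_disk.get(failed_disk, []))


def affected_extents(lost: List[FragmentId]) -> List[str]:
    """Extents with at least one lost fragment, in first-seen order."""
    seen: List[str] = []
    for extent_id, _ in lost:
        if extent_id not in seen:
            seen.append(extent_id)
    return seen


def lost_leaders(lost: List[FragmentId]) -> List[str]:
    """Step 902: extents whose leader fragment (index 0, assumed) was lost."""
    return [extent_id for extent_id, index in lost if index == 0]
```

The extents returned by `lost_leaders` are the ones that need a new leader at step 903; every extent returned by `affected_extents` needs replacement space at step 904.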

[0089] The above example shows that the claimed method and system for distributed storage of recoverable data, which ensure the integrity and confidentiality of information, function correctly, are technically feasible, and solve the stated problem.

[0090] The elements of the claimed technical solution are functionally interrelated, and their joint use results in a new and unique technical solution. Thus, all blocks are functionally connected.

[0091] All blocks used in the system can be implemented with the electronic components used to build digital integrated circuits, as is evident to a person skilled in the art. Without limitation, it is possible to use microcircuits whose logic is fixed at manufacture, or field-programmable gate arrays (FPGAs), whose logic is set by programming. Programming is done with programmers and debugging environments that allow the desired structure of the digital device to be specified as a circuit diagram or as a program in a hardware description language such as Verilog, VHDL, or AHDL. Alternatives to FPGAs include programmable logic controllers (PLCs); gate arrays (BMK), which require a factory production process for programming; and ASICs, application-specific custom large-scale integrated circuits (LSI), which are substantially more expensive in small-batch and one-off production.

[0092] An FPGA chip itself typically consists of the following components:

• configurable logic blocks that implement the required logic function;

• programmable electronic interconnects between the configurable logic blocks;

• programmable input/output blocks that connect the external pins of the chip to the internal logic.

[0093] The blocks can also be implemented using read-only memory devices.

[0094] Thus, all of the blocks used can be implemented by standard means based on the classical principles of computer engineering.

[0095] As will be appreciated by a person skilled in the art, aspects of the present technical solution may be embodied as a system, a method, or a computer program product. Accordingly, various aspects of the present technical solution may be implemented entirely in hardware, entirely in software (including application software and so on), or as an embodiment combining software and hardware aspects, any of which may generally be referred to as a "module", "system", or "architecture". In addition, aspects of the present technical solution may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.

[0096] Any combination of one or more computer-readable media may also be used. A computer-readable storage medium may be, without limitation, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specifically, examples (a non-exhaustive list) of a computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or Flash memory), a fiber-optic connection, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any combination of the above. In the context of this description, a computer-readable storage medium may be any flexible storage medium that can contain or store a program for use by, or in connection with, a system, device, or apparatus.

[0097] Program code embodied on a computer-readable medium may be transmitted using any medium, including, without limitation, wireless, wired, fiber-optic, infrared, or any other suitable network, or any combination of the above.

[0098] Computer program code for carrying out the steps of the present technical solution may be written in any programming language or combination of programming languages, including object-oriented programming languages such as Python, R, Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer. In the last case, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (e.g., over the Internet through an Internet service provider).

[0099] Aspects of the present technical solution have been described in detail with reference to flowcharts, circuit diagrams, and/or block diagrams of methods, devices (systems), and computer program products in accordance with embodiments of the present technical solution. It should be understood that each block of the flowcharts and/or diagrams, and combinations of blocks in the flowcharts and/or diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing device to create a procedure, such that the instructions, when executed by the processor of the computer or other programmable data processing device, create means for implementing the functions/actions specified in the block or blocks of the flowchart and/or diagram.

[00100] These computer program instructions may also be stored on a computer-readable medium that can direct a computer, another programmable data processing device, or other devices to function in a particular manner, such that the instructions stored on the computer-readable medium produce an article that includes instructions implementing the functions/actions specified in the block or blocks of the flowchart and/or diagram.

Claims

1. A method for distributed storage of recoverable data with information integrity, performed by at least one system that contains a plurality of storage nodes connected by a network and storing data in the form of extents, data service (DS) agents for managing extents, metadata service (MDS) agents to manage metadata related to nodes and extents and cluster manager (CM) agents, the method consisting of the following steps:

• generating a checksum for each data extent;

• detecting a node failure in the system by one of the CM agents or a disk failure in the system by the DS agent;

• notifying DS or MDS agents of the failure;

• generating, independently by the notified DS or MDS agents, a recovery plan for the data extents affected by the failure; and

• collectively recovering the affected extents, using the checksums, based on the plan generated in the previous step, wherein each node includes a plurality of disks and a disk failure is detected by the DS agent based on the disk error rate.

2. A system for distributed storage of recoverable data with information integrity, containing

• a plurality of storage nodes connected by a network, each node storing data in the form of extents over which checksums are calculated;

• a data service (DS) agent in each node to manage the extents in the node;

• a plurality of metadata service (MDS) agents for managing metadata related to nodes and extents, the MDS agents operating on a subset of the nodes;

• a cluster manager (CM) agent in each node to detect node failure in the system;

• wherein, upon notification of a failure, a subset of the DS or MDS agents independently generates a plan to restore, using the checksums, the extents affected by the failure, and the agents collectively restore the affected extents based on the plan; and

• a persistent map that correlates data extents with nodes, with each MDS agent managing a subset of the map.

3. The storage system of claim 2, further comprising an interface allowing an application to access data stored on the system.

4. The storage system of claim 2, further comprising means for determining a subset of nodes running MDS agents.

5. The storage system of claim 2, wherein each CM agent maintains an ordered list of nodes that are currently running on the system.

6. The storage system of claim 2, wherein each CM agent includes means for detecting node failure.

7. The storage system of claim 2, wherein each CM agent includes means for discovering a new node in the system.

8. The storage system of claim 7, wherein said means for discovering a new node includes means for monitoring network communications from the new node.

9. The storage system of claim 2, wherein each DS agent distributes updates to extents in an associated node to other DS agents in the system.

10. The storage system of claim 2, wherein each DS agent manages data caching at an associated node.

11. The storage system of claim 2, wherein each node includes a plurality of data disks and each DS agent includes means for detecting disk failure in an associated node.

12. The storage system of claim 11, wherein said disk failure detection means is based on the error rate of the disks.

13. The storage system of claim 2, wherein the recovery plan includes a list of extents that must be recovered to restore data redundancy to the system.

14. The storage system of claim 13, wherein the extents to be restored are jointly restored by the DS agents.

15. The storage system of claim 13, wherein space is allocated in nodes that are still running to replace extents affected by the failure.

16. The storage system of claim 15, wherein the data in the affected extents is determined and transferred to the allocated space.

17. The storage system of claim 2, wherein the DS agents are notified of the failure and the notified DS agents determine those extents that have data on the failed node.