EP2212791A2

EP2212791A2 - Improved computer system comprising multiple network nodes

Info

Publication number: EP2212791A2
Application number: EP08837674A
Authority: EP
Inventors: Michaël DUSSERE; Samuel Richard
Original assignee: Seanodes
Current assignee: Seanodes
Priority date: 2007-10-12
Filing date: 2008-07-23
Publication date: 2010-08-04
Also published as: FR2922335A1; WO2009053557A1; WO2009053557A9; EP2212792A1; WO2009047398A2; WO2009047397A2; WO2009047398A3; WO2009047397A3

Abstract

Computer tool for storing data, comprising a matching module (40) that is connected to storage units (38) and is designed to determine a match between virtual addresses and physical addresses in the storage units, each virtual address being assigned to at least two storage addresses. Said computer tool is characterized in that the matching module maintains a first table containing data for identifying faulty storage units as well as a second table containing data for modifying blocks of virtual addresses. Furthermore, the computer tool also comprises a recovery unit (406) which is designed, once a faulty storage unit has been restored, to update the storage addresses of said storage unit by calling the matching module by means of virtual addresses extracted from the second table on the basis of the data of the first table.

Description

Improved computer system comprising a plurality of networked nodes

The invention relates to computer systems comprising several computer stations called nodes interconnected in a network.

Modern networks include user stations that are connected to one or more servers and can share applications and / or storage spaces locally or remotely.

For shared applications that use or share a large amount of data, specialized storage systems such as Storage Area Networks (SAN) are used.

The use of these advanced systems has certain disadvantages, such as the associated costs, the limits on performance and extensibility, and the overall complexity of the corresponding installation.

Moreover, with modern networks, the use of these advanced systems represents an underutilization of the hardware already present in the network.

Finally, the systems that have been proposed that use the hardware already present in the network have unsatisfactory performance, especially in terms of fault management.

The invention improves the situation.

For this purpose, the invention proposes a computer data storage tool comprising a correspondence module connected to storage units, said correspondence module comprising a correspondence function for determining at least a first and a second storage address from an incoming virtual address.

According to a particular aspect, the correspondence module maintains a first table comprising data identifying failed storage units, as well as a second table comprising data for modifying virtual address blocks, and the computer tool includes a recovery unit arranged, on the recovery of a failed storage unit, to update the storage addresses of this storage unit by calling the correspondence module with virtual addresses derived from the second table, on the basis of the data of the first table.

Other advantages and features of the invention will become more apparent on reading the following description of examples, given by way of illustration and without limitation, with reference to the drawings in which:

FIG. 1 shows a general functional view of a computer system according to the invention,

FIG. 2 shows an example of a logical implementation of the system of FIG. 1,

FIG. 3 shows an exemplary composition of an element of FIG. 2,

FIG. 4 shows a method of accessing a file in the system of FIG. 1,

FIG. 5 shows an exemplary implementation of an element of FIG. 3,

FIG. 6 shows a correspondence between logical spaces and physical spaces managed by the element of FIG. 5,

FIG. 7 shows an example of a function implemented by the element of FIG. 5 to establish the correspondence of FIG. 6,

FIG. 8 shows an exemplary implementation of a part of FIG. 7,

FIGS. 9 and 10 show examples of functions running in parallel with the function of FIG. 7,

FIG. 11 shows an allocation of the logical spaces over the physical spaces as a variant of the correspondence represented in FIG. 6,

FIG. 12 shows a variant of the function of FIG. 8 adapted to take account of the allocation of FIG. 11,

FIG. 13 shows an exemplary implementation of a part of FIG. 12,

FIG. 14 shows an example of a function running in parallel with the function of FIG. 7, and

FIG. 15 shows a variant of FIGS. 8 and 12 which implements both the assignment shown in FIG. 6 and that shown in FIG. 11.

The drawings and the description below contain, for the most part, elements of a definite nature. They can therefore serve not only to better understand the present invention, but also to contribute to its definition, where appropriate.

FIG. 1 represents a general diagram of a computer system according to the invention. In this system, an application environment 2 has access to a file system manager 4. A virtualization layer 6 establishes the correspondence between the file system manager 4 and storage servers 8.

FIG. 2 represents a logical implementation of the system of FIG. 1. In this implementation, a set of stations 10, also referred to herein as nodes, are interconnected in a network of which they constitute the physical and application resources.

In the example described here, the network consists of 5 stations, denoted Ni with i varying between 1 and 5. The application environment 2 is made up of a distributed application layer 12 on N1, N2 and N3, an application layer 14 on N4 and an application layer 16 on N5.

Note that the term station used here should be interpreted broadly, as designating network computing elements on which applications or server programs run, or both.

The file system manager 4 is implemented as a distributed file system 18 and two non-distributed file systems 20 and 22. The system 18 is distributed over N1, N2 and N3 and defines all the files accessible from the distributed application layer 12. The file systems 20 and 22 respectively define the sets of files accessible from the application layers 14 and 16.

The files designated by the file systems 18, 20 and 22 are stored in a virtual storage space 24 which is distributed over the set of Ni with i varying between 1 and 5. The virtual storage space 24 is here divided into a shared logical space 26, and two private logical spaces 28 and 30.

The shared logical space 26 corresponds to the space accessible from the distributed application layer 12 by means of the distributed file system 18, and the private logical spaces 28 and 30 to the space accessible from the application layers 14 and 16 by means of the file systems 20 and 22.

The shared logical space 26 is distributed over N1, N2 and N3, the private logical space 28 over N3 and N4, and the private logical space 30 over N5.

Thus, an application of the layer 12 (respectively 14, 16) "sees" the data stored in the logical space 26 (respectively 28, 30) by means of the file system 18 (respectively 20, 22), although these data are not necessarily physically present on one of the storage disks of the station 10 that uses this application.

Furthermore, the spaces 26, 28 and 30 are purely logical, that is, they do not directly represent physical storage spaces. Logical spaces are mapped using virtual addresses that are referenced or contained in file systems 18, 20, and 22. To access the data of these files, it is necessary to use a correspondence module. The correspondence module contains a table of correspondence between the virtual addresses of the data in the logical spaces and physical addresses that designate the physical storage spaces in which these data are actually stored.

Several embodiments are possible for the correspondence module. The distribution of the physical storage spaces described here is an example intended to show the very general scope of the invention.

As can be seen in the example presented, each station is used for both the application layer and the storage layer. This multifunctionality makes it possible to use the free space on all the stations of the network, rather than leaving this space unoccupied.

In the context of the invention, however, it would be possible to specialize some of the stations, and create a node dedicated to storage or a node dedicated to applications.

This means that, in the context of the invention, any station can play an application node role, a storage node role, or both these roles at once.

All the application, storage and file system resources can be integrated locally on each station, or distributed on the stations of the network.

This is for example the case of the stations N1, N2 and N3, whose resources are fully distributed, both at the application level and at the level of the file system and storage.

FIG. 3 represents an exemplary architecture of a station 10 of FIG. 2. The station represented in this example can represent one of the stations N1, N2 or N3.

Station Nx individually has a structure similar to that of the global structure shown in Figure 1. It thus comprises an application layer 32, a file system 34, a virtualization layer 36 and a storage space 38 in the form of a local memory with direct access.

The virtualization layer 36 comprises an engine 40 and a correspondence table 42. The direct access to the storage space 38 is managed by a storage client 44 and a storage server 46. The roles and operations of these elements will be specified below.

The example described here represents an improved embodiment of the invention, in which all the resources, both application and storage, are distributed over the network.

This means, for example, that the file system 34 is not entirely present on this station, but distributed over several of them, and that access to it implies communication with other nodes of the network which contain the data sought.

The same applies to the virtualization layer 36, the storage client 44 and the storage server 46. The distribution of these elements is managed by means of an administration module 48.

For the rest of the description, it does not matter whether the resources in question are distributed or not.

The administration module 48 is mainly used during the creation and updating of the logical spaces. When a logical space is created or modified, the administration module 48 calls the virtualization layer 36 to create the correspondence table between each virtual address of the logical space and a physical address on a given storage node.

Then, the correspondences between a file accessible through this file system and the virtual addresses of the data that make up this file are established at the level of the file system that uses this logical space, the "physical" data being stored at the physical addresses associated with the virtual addresses in the correspondence table, in accordance with the mapping established during the creation of the logical space.

This means that, as soon as a logical space is created by the administration module, the correspondences between the virtual addresses and the physical addresses are established. The virtual addresses appear "empty" to the file system accessing the logical space, although the physical addresses that correspond to them are already "reserved" through the correspondence table.

It is when the link between the data of the files of this space and the virtual addresses of these data is established that the physical addresses are filled.

In the embodiment described here, the correspondence table 42 is a table that contains information for retrieving the correspondences. When an application uses a given memory space, the engine 40 interacts with the table 42 to establish the corresponding physical address.

As will become clearer later, the correspondence table 42 does not contain all the correspondences, but only a much smaller set of information, sufficient to restore a correspondence very quickly.

In order to better understand the invention, it is necessary to differentiate the application layer from the storage layer. Indeed, the way in which access to the data stored in the storage layer is managed has many advantages over existing approaches.

Figure 4 shows a method implemented by the system to access a file.

The access to a file by an application of the application layer of a given node is initiated by a file access request 50. The file access request 50 comprises:

- an identifier of the file concerned for the file system and an address in this file,

- the size of the request, that is to say the number of bits to be accessed after the targeted address in the file, and

- the type of request, namely reading or writing.

In a step 52, the file system determines one or more virtual addresses for the data of this file, and generates one or more virtual access requests based on the request 50 and these virtual addresses.

Virtual access requests each include:

- the targeted virtual address,

- the size of the request, that is to say the number of bits to be accessed following the targeted virtual address, and

- the type of request, which is identical to that of the request 50.

If we refer to the system described in FIG. 2, step 52 consists in determining the logical space and the virtual address(es) on this space designated by the request 50, and in producing one or more "virtual" requests.

There is a difference in level between file access requests and virtual access requests. Indeed, a file access request targets the content of a large number of virtual addresses, to enable the content of a file to be reconstructed, whereas a virtual request targets the content of the data block associated with one address.

The resulting virtual access request(s) are then transmitted to the virtualization layer, which determines the physical address(es) and the corresponding storage spaces in a step 54.

To determine the physical addresses, the virtualization layer operates using the engine 40 and the correspondence table 42.

In the context of a read access request, the file sought already exists in a storage space 38, and the engine 40 calls the correspondence table 42 with the virtual address or addresses to determine by correspondence the physical address or addresses of the data of the file.

In the context of a write access request, the file does not necessarily exist beforehand in a storage space 38. Nevertheless, as we have seen above, the correspondences between virtual addresses and physical addresses are fixed, and the engine 40 therefore operates in the same way as for a read request to determine the physical address or addresses of the data.

In any case, once the engine 40 has determined the physical addresses, it generates in a step 56 physical access requests that it transmits to the storage client 44. In step 56, the physical access requests are generated based on the request 50 and the physical address(es) determined in step 54.

These requests include:

- the targeted physical address;

- the size of the request, that is to say the number of bits to be accessed following the physical address targeted by the request; and

- the type of action aimed at, namely reading or writing.

The physical address and the size of the request are obtained directly from step 54, and the type of the request is inherited from the type of the virtual access request concerned.

A loop is then initiated, in which a stopping condition 58 is reached when a physical access request has been issued to the storage client 44 for all the physical addresses obtained in step 54.

In fact, each physical access request is placed in a request queue of the storage client 44 for execution in a step 60. The storage client 44 may optionally include several queues, for example one queue per storage server 46 with which it interacts.

In this loop, all physical access requests in step 56 are represented as successively performed for simplicity. However, the execution can also be performed in parallel, and not only in series.
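The chain of requests described above (file access request 50, virtual access requests of step 52, physical access requests of step 56, queues of step 60) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the invention: the record names, the fixed `BLOCK` size, the contiguous placement of a file from a base virtual address, and the `resolve` callback standing in for the engine 40 are all hypothetical.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

BLOCK = 4096  # assumed size of the data block behind one virtual address


@dataclass(frozen=True)
class FileRequest:       # request 50
    file_id: str
    offset: int          # address inside the file
    size: int
    kind: str            # "read" or "write"


@dataclass(frozen=True)
class VirtualRequest:    # produced in step 52
    vaddr: int
    size: int
    kind: str


@dataclass(frozen=True)
class PhysicalRequest:   # produced in step 56
    server: str
    paddr: int
    size: int
    kind: str


def to_virtual(req, file_base):
    """Step 52: split a file request into block-bounded virtual requests.

    file_base maps a file id to its first virtual address (a contiguous
    layout is assumed purely for illustration).
    """
    start = file_base[req.file_id] + req.offset
    end = start + req.size
    out = []
    while start < end:
        chunk = min(end, (start // BLOCK + 1) * BLOCK) - start
        out.append(VirtualRequest(start, chunk, req.kind))
        start += chunk
    return out


def to_physical(vreq, resolve):
    """Steps 54 and 56: one physical request per storage address.

    Size and type are inherited from the virtual request.
    """
    return [PhysicalRequest(server, paddr, vreq.size, vreq.kind)
            for server, paddr in resolve(vreq.vaddr)]


def enqueue(preqs, queues):
    """Step 60: here, one queue per storage server."""
    for p in preqs:
        queues[p.server].append(p)


def resolve(vaddr):
    """Hypothetical resolver: two replicas on two fictitious servers."""
    return [("N1", vaddr), ("N2", vaddr + 10 * BLOCK)]


queues = defaultdict(deque)
for v in to_virtual(FileRequest("f", 100, 5000, "write"), {"f": 8192}):
    enqueue(to_physical(v, resolve), queues)
```

In this sketch the 5000-unit write crosses one block boundary, so it yields two virtual requests and, with two storage addresses each, two physical requests per server queue.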

In the example described, requests are transmitted from layer to layer, down to the physical access layer. However, it would be possible to determine and transmit only addresses (virtual and physical), and to recover, at the physical layer level, selected properties of the initial file request to form the physical access requests.

For the execution of a given physical access request, the storage client 44 interacts with the storage server 46 of the storage station that contains the storage space 38 on which the physical address designated by the physical access request concerned is located.

FIG. 5 represents an exemplary embodiment of the virtualization layer 36 of FIG. 3.

The engine 40 includes a queue 402, an address determination unit 404 and a recovery unit 406.

Queue 402 receives all virtual access requests for the determination of the corresponding physical addresses. The determination of the physical addresses is carried out by the address determination unit 404, in collaboration with the correspondence table 42.

In the example described here, the correspondence table 42 contains only an extremely limited set of data, which will be described later. Indeed, the invention proposes several schemes for assigning virtual spaces to physical spaces. These assignments make it possible to determine the virtual address / physical address correspondences quickly and inexpensively, on the basis of lightweight algorithms, while offering a high quality of service. This is much more efficient in terms of processor and memory usage than a direct correspondence table ("look-up table").

The function of the recovery unit 406 is to update certain physical addresses when a storage space of a given station has ceased to function, as will be described below.

FIG. 6 illustrates a first scheme for allocating virtual spaces to physical spaces so as to tolerate so-called correlated failures. A correlated failure means a failure that renders inoperative a set of storage spaces or units (hereinafter disks) that are linked together (hereinafter a failure group).

In other embodiments, the disks can be grouped by failure group on the basis of a fault dependency criterion. Such a criterion aims to bring together disks for which the probability of a simultaneous failure is high, so as to ensure that the data of these disks are not replicated on one of them.

As examples of fault dependency criteria, mention may be made of belonging to the same node, from both a hardware and a software point of view, connection to the same network node, from both a hardware and a software point of view, geographical proximity, etc.

To prevent such failures, the allocation of virtual space to the physical spaces is carried out so that:

- the data of a virtual address are stored on disks that belong to distinct fault groups; and

- the data of consecutive virtual addresses are gathered into hatches (stripes) that extend over all the disks.

Thus, in the example shown, we start from four nodes N1 to N4. The node N1 here comprises three disks MU11, MU12 and MU13, the node N2 three disks MU21, MU22 and MU23, the node N3 a disk MU31, and the node N4 three disks MU41, MU42 and MU43.

In the example described here, each node forms a failure group, i.e., in terms of failure, it is assumed that the disks of a node depend on it, but that the disks of distinct nodes are independent of each other. The failure group N1 thus consists of the disks MU11 to MU13, the failure group N2 of the disks MU21 to MU23, the failure group N3 of the disk MU31, and the failure group N4 of the disks MU41 to MU43.

In other embodiments, the failure groups could be defined differently, for example by belonging to a network unit. It is therefore important to understand that disks are first grouped by failure group and not only by node.

Initially, the assignment shown in Figure 6 starts from failure groups to define replication groups that allow for secure duplication of data.

The assignment of the failure group disks to a replication group follows an allocation algorithm that implements the constraints described above.

To account for the first constraint, there must be at least as many replication groups as there are disks in the failure group that has the largest number of disks, according to equation 10 of Appendix A.

Then, in order to make the distribution as flexible as possible in terms of load, the number of disks per replication group is fixed according to equation 11 of Appendix A.

Finally, the disks are assigned individually to each replication group according to equation 12 of Appendix A.

Thus, in the example of FIG. 6, the replication group GR1 comprises four disks, namely the disks MU11, MU21, MU31 and MU43, and the other disks MU12 and MU13 of the failure group N1 are assigned respectively to the groups GR2 and GR3.

Other disk allocation algorithms are possible, as those skilled in the art will recognize. It will be noted, for example, that for allocating the disks to the replication groups, it would also be possible to take into account the space available on the disks, so as to obtain replication groups of uniform size. Another possibility would be to take into account their performance, so as to obtain replication groups of homogeneous performance.

This could for example be done:

- when assigning disks to replication groups, at the cost of a more complicated algorithm, or

- after the allocation described above, by exchanging disks between replication groups.
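The allocation of disks to replication groups can be illustrated by a short sketch. Equations 10 to 12 of Appendix A are not reproduced in this text, so the round-robin dealing below is only an assumption, chosen because it satisfies the first constraint (no two disks of one failure group in the same replication group) and gives the same GR1 as the example of FIG. 6. The function name `replication_groups` is hypothetical.

```python
def replication_groups(failure_groups):
    """Deal disks to replication groups in round-robin order.

    failure_groups: list of lists of disk names, one list per failure group.
    The number of replication groups is the size of the largest failure
    group, so consecutive disks of one failure group never share a group.
    """
    n_groups = max(len(g) for g in failure_groups)
    groups = [[] for _ in range(n_groups)]
    disks = [d for g in failure_groups for d in g]  # sorted by failure group
    for n, disk in enumerate(disks):
        groups[n % n_groups].append(disk)
    return groups


nodes = [
    ["MU11", "MU12", "MU13"],  # failure group N1
    ["MU21", "MU22", "MU23"],  # failure group N2
    ["MU31"],                  # failure group N3
    ["MU41", "MU42", "MU43"],  # failure group N4
]
groups = replication_groups(nodes)
```

With the four nodes of the example, the first group contains MU11, MU21, MU31 and MU43, while MU12 and MU13 fall into the second and third groups. A variant that balances free space or performance would replace the simple `n % n_groups` rule.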

In a second step, to take into account the second constraint, the virtual addresses are assigned to the disks sorted by replication group. In each disk shown, each number represents the association of virtual addresses with physical addresses.

Indeed, the virtual addresses are grouped in ascending order into hatch units ("striping units" in English) of defined size. The hatch units are themselves associated with physical addresses on the disks.

As can be seen in the upper part of Figure 6, the hatch units are assigned to all disks of all replication groups in ascending order, line by line. The virtual addresses are thus allocated in blocks to the replication groups, in increasing order.

Thus, the first hatch unit of the first disk of the first replication group receives the index 0, the first hatch unit of the second disk of the first replication group receives the index 1, and so on. A line of hatch units will subsequently be called a real hatch. Note also in the upper part of Figure 6 that on every other line, the hatch units are not ordered by increasing index.

This is because, to account for the first constraint, the data for each real hatch is replicated within the replication group to which the first version is assigned. As a result, corresponding real hatches are combined in pairs to form a main hatch each time.

To determine the allocation of hatch units for replication, we use an equation that ensures that:

- two replicas of the same hatch unit are not stored on the same disk, and

- the offset applied for this purpose is not constant, so that when a given disk fails, the load is distributed over the other disks.

Thus, in the example shown in FIG. 6, the physical addresses of the disk MU11 of N1 receive the following hatch units: 0, 3, 10, 12, 20, 21, 30, 33, 40 and 42.

In addition, in the example described here, the data is replicated to directly consecutive real hatches within the main hatch. In other embodiments, the replicated data could be in non-consecutive real hatches.

As can be seen from the above, it is therefore sufficient to know the total number of disks as well as the number of replication groups to determine a physical address from a virtual address. The correspondence table 42 is thus reduced to its simplest expression. Note also that, because of the data replication mentioned above, there are twice as many physical addresses as virtual addresses.

In the example shown in Figure 6, as well as in the remainder of this document, a simple replication of the data is performed, which results in doubling the amount of data stored. However, it would be possible to make more than two replicas of the data.

Figure 7 shows an example of a function that makes it possible to find a physical address from a virtual address.

In the example described here, this function is implemented in the address determination unit 404. It starts in an operation 2000 from a virtual address, for example taken from the queue 402.

In an operation 2020, a test is performed to determine whether the virtual address is related to a write request. If this is the case, in an operation 2040, a test is performed to determine whether the system is in a degraded state, that is, whether one of the disks has failed. If this is the case, a function Wrt_Drt_Z() is called in an operation 2060.

The operations 2040, 2060 and 2080 are connected to functions enabling the rapid recovery of a disk failure which will be described further with FIGS. 9 and 10.

In the case where the request associated with the virtual address is a read request, or when the state of the system is not degraded, or when the function Wrt_Drt_Z() ends, a function SU_Ind() is called in an operation 2070. The function SU_Ind() starts from the virtual address whose correspondence is sought, and returns the index SU_Ind of the hatch unit associated with it. This is easily achievable as the size of each hatch unit is known.

On the basis of this index, a function Get_Phy_Ind() is called in an operation 2080, which determines the corresponding physical address indices.

Finally, in an operation 2100, the physical address indices are converted into physical addresses by a function Phy_Ad(). This is easily achievable as the size of each hatch unit is known.
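The flow of Figure 7 can be sketched as follows. This is an illustration under assumptions, not the function of the invention: the value of `SU_SIZE`, the representation of a physical address index as a `(disk, row)` pair, and the two callbacks `mark_dirty` and `get_phy_ind` (standing in for the marking function and for operation 2080) are hypothetical.

```python
SU_SIZE = 64 * 1024  # assumed size of a hatch (striping) unit, in address units


def su_ind(vaddr):
    """Index of the hatch unit that contains a virtual address."""
    return vaddr // SU_SIZE


def phy_ad(row, vaddr):
    """Address on a disk for a hatch unit located on a given row.

    The offset inside the hatch unit is carried over unchanged.
    """
    return row * SU_SIZE + vaddr % SU_SIZE


def resolve(vaddr, is_write, degraded, mark_dirty, get_phy_ind):
    """Return one (disk, physical address) pair per replica."""
    if is_write and degraded:             # operations 2020 and 2040
        mark_dirty(vaddr)                 # operation 2060
    su = su_ind(vaddr)                    # operation 2070
    return [(disk, phy_ad(row, vaddr))    # operation 2100
            for disk, row in get_phy_ind(su)]  # operation 2080
```

Both conversions reduce to an integer division and a remainder, which is why knowing the hatch unit size is sufficient.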

The function Get_Phy_Ind() will now be described with Figure 8. The function Get_Phy_Ind() receives the following two arguments (operation 2082):

- the index of the hatch unit to be matched, and

- an array Ngr[] containing a row whose values indicate the number of disks in each replication group.

The array Ngr[] is useful because it makes it possible to quickly find the total number N of disks and the number Ngr of replication groups, and to access just as quickly the number of disks per replication group.

In another implementation, it is possible to directly pass N and Ngr as arguments, but a computation is then necessary when one needs the number of disks within a given replication group.

The role of the Get_Phy_Ind() function is to find the index of the hatch unit in the sort by failure group from its index in the sort by replication group.

In a first operation 2084, a function Strip() determines a main hatch index k, as well as the index m1, in the sort by replication group, of the first disk on which the virtual address is stored. This is accomplished by applying Equations 20 and 21 of Appendix A.

In an operation 2086, a function Repl () determines the actual hatch indices k1 and k2 to account for data replication. This is accomplished by applying Equations 22 and 23 of Appendix A.

In an operation 2088, a function Spl () determines the index p of the replication group that corresponds to the disk index m1, as well as the index q1 of the disk m1 within this replication group. This is accomplished by applying Equations 24 and 25 of Appendix A.

In an operation 2090, a function Shft() determines an index q2, within the replication group of index p, of the disk on which the replicated data is stored. This is accomplished by applying Equation 26 of Appendix A.

Other equations would obviously be usable, such as a simple offset of the index by one unit. In this simpler case, each disk within a replication group contains all the replicated data of another disk in the same group.

In an operation 2097, a function Mrg() determines a disk index m2 which corresponds to the index q2 within the replication group of index p. This is accomplished by applying Equation 27 of Appendix A.

In an operation 2098, the indices m1 and m2 of the disks sorted by replication group are converted into indices n1 and n2 of the disks sorted by failure group by a function Get_Dsk_Ind(). This function performs the inverse operation of equation 12 and applies equations 28 and 29 of Appendix A.

Finally, the Get_Phy_Ind() function returns, in an operation 2099, the physical address indices thus determined.
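Operations 2084 to 2098 can be sketched in a few lines. Equations 20 to 29 of Appendix A are not reproduced in this text, so every formula below is an assumption: the shift used for the replica was chosen because it is never zero, varies with the main hatch index, and is consistent with the hatch units listed for the disk MU11 in the example of FIG. 6; the final conversion assumes that disks were dealt to replication groups round-robin in failure-group order. Each replication group is assumed to hold at least two disks.

```python
def split(m, ngr):
    """Replication group p and rank q inside it for disk index m."""
    p = 0
    while m >= ngr[p]:
        m -= ngr[p]
        p += 1
    return p, m


def get_phy_ind(su, ngr):
    """Return [(disk, row), (disk, row)] for hatch unit index su.

    ngr[p] is the number of disks in replication group p. Disks are
    returned as indices in failure-group order, rows as real hatch indices.
    """
    n_disks = sum(ngr)                          # total number of disks N
    k, m1 = divmod(su, n_disks)                 # main hatch, first disk
    k1, k2 = 2 * k, 2 * k + 1                   # the two real hatches
    p, q1 = split(m1, ngr)                      # group and rank of disk m1
    q2 = (q1 + 1 + k % (ngr[p] - 1)) % ngr[p]   # assumed shift, never zero
    # Back to failure-group order, assuming round-robin dealing of disks.
    n1 = q1 * len(ngr) + p
    n2 = q2 * len(ngr) + p
    return [(n1, k1), (n2, k2)]


NGR = [4, 3, 3]  # group sizes of the example of FIG. 6
on_mu11 = sorted(
    (row, su)
    for su in range(50)
    for disk, row in get_phy_ind(su, NGR)
    if disk == 0  # MU11 is the first disk in failure-group order
)
units = [su for _, su in on_mu11]
```

With the group sizes of the example, `units` comes out as 0, 3, 10, 12, 20, 21, 30, 33, 40, 42, which is the sequence given for MU11. Only the array of group sizes is needed, which illustrates why the correspondence table can stay so small.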

The function Wrt_Drt_Z() will now be described in connection with Figures 9 and 10. As mentioned above, the invention relies in part on the replication of data in case of failure. The function Wrt_Drt_Z() as well as the function described in FIG. 9 allow a fast reintegration of a disk after a failure. This reintegration is described with Figure 10.

The principle of this reintegration is based on the marking of the virtual zones associated with a failed disk during a write.

Indeed, both during a read request and when there is no access, the stored data are not modified. Therefore, when a disk comes back from a failure, it is not necessary to update these data, since they have not been modified.

On the other hand, if there has been writing, the stored data is potentially distinct, and consistency must be restored with the version of the data that has been written.

For this, two tables are maintained during execution. The first table, called the fault table, contains two rows, one of which receives disk identifiers and the other an increasing index. The second table, called the zone table, contains two rows, one of which receives zone identifiers and the other a modification index.

In the example described here, the fault table is stored on each of the disks in a space reserved by the administration module which is not usable for storing the data. This table is kept consistent on all the disks by the administration module. This table has a fixed size equal to the total number of disks. The fault table is filled by means of the function shown in FIG. 9.

Thus, when a fault is detected, for example by the administration module 48 (operation 2200), this function receives the identifiers of all the failed disks.

In an operation 2202, a function Is_Tgd_Dsk() searches for each failed disk identifier in the fault table.

For each identifier that is missing, a new entry is created in the fault table, which receives the disk identifier in the first row and an incremented index in the second row (operation 2204). Otherwise the function processes the next identifier (operation 2206).

Alternatively, the fault table is implemented as a stack or a linked list. In that case, it comprises only one row, which receives the disk identifiers, and it is the position of each identifier in the table which serves as the increasing index.

The function Wrt_Drt_Z() relies on the indices of the fault table to maintain an up-to-date view of the zones associated with a failed disk that have been modified.

Thus, as we have seen above, the function Wrt_Drt_Z () is called for each write request, and its function is to mark in the zone table the highest index of the fault table.
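The two tables and the marking performed on each write can be sketched as follows. This is a minimal illustration: the class name `DirtyZones`, the use of dictionaries for the two tables and the `zone_size` parameter are assumptions, and persistence of the tables on the disks is left out.

```python
class DirtyZones:
    """In-memory stand-in for the fault table and the zone table."""

    def __init__(self, zone_size):
        self.zone_size = zone_size  # contiguous virtual addresses per zone
        self.failures = {}          # disk id -> increasing index
        self.zones = {}             # zone id -> modification index
        self._next = 1

    def on_failure(self, disk_ids):
        """Record each failed disk that is not yet in the fault table."""
        for disk in disk_ids:
            if disk not in self.failures:
                self.failures[disk] = self._next
                self._next += 1

    def wrt_drt_z(self, vaddr):
        """Mark the zone of vaddr with the highest index of the fault table."""
        if self.failures:
            zone = vaddr // self.zone_size
            self.zones[zone] = max(self.failures.values())


tracker = DirtyZones(zone_size=100)
tracker.on_failure(["d1"])
tracker.wrt_drt_z(250)           # zone 2 written after the failure of d1
tracker.on_failure(["d1", "d2"])  # d1 is already known, d2 is added
tracker.wrt_drt_z(30)            # zone 0 written after the failure of d2
```

After these calls, zone 2 carries the index of d1 and zone 0 the index of d2, so a later comparison can tell which zones were written after which failure.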

Thus, when a zone has an index greater than or equal to that of a disk in the fault table, this means that it has been modified after the failure of the disk considered.

On this basis, the recovery unit 406 may perform the function of Figure 10 to restore a disk after a failure. For this, the unit 406 starts with an index i set to zero (operation 2300) and goes through the zone table.

Each time, a comparison between the index associated with the restored disk in the fault table and the index of the zone i in the zone table indicates whether the zone has been modified after the failure of the disk considered (operation 2302).

If so, then a function Upd_Z (i) updates the data of the area concerned by retrieving the corresponding replicated data (operation 2304). Next, the zone table is updated to reflect this operation (2306).

Once the index field i is updated, an End_Drt_Zone () function deletes the entry from the fault table associated with the restored disk, and goes through the fault table to increase the index of the zones by the maximum of the remaining indices. This ensures slow index growth and avoids processing too much data.

If there are still areas to browse, the index i is incremented (operation 2310). Otherwise, the function ends in operation 2312.
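The interplay between the two tables can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the description: the fault table is a dictionary mapping a disk identifier to its index, the zone table is a list of indices initialised to 0, the cap on the zone indices is applied once after the whole zone table has been scanned, and the names mark_dirty_zone, recover_disk and update_zone are hypothetical stand-ins for Wrt_Drt_Z (), the function of FIG. 10 and Upd_Z ().

```python
def mark_dirty_zone(zone_table, fault_table, zone):
    """Sketch of Wrt_Drt_Z (): tag a written zone with the highest fault index."""
    zone_table[zone] = max(fault_table.values(), default=0)


def recover_disk(zone_table, fault_table, disk_id, update_zone):
    """Sketch of FIG. 10: refresh only the zones modified since disk_id failed."""
    disk_index = fault_table[disk_id]
    refreshed = []
    for i, zone_index in enumerate(zone_table):
        # A zone index >= the disk index means the zone was written after the failure.
        if zone_index >= disk_index:
            update_zone(i)  # stand-in for Upd_Z (i)
            refreshed.append(i)
    # Sketch of End_Drt_Zone (): drop the disk, then cap the zone indices
    # at the highest index still present in the fault table.
    del fault_table[disk_id]
    ceiling = max(fault_table.values(), default=0)
    for i, zone_index in enumerate(zone_table):
        zone_table[i] = min(zone_index, ceiling)
    return refreshed


faults = {"A": 1}              # disk A fails first
zones = [0, 0, 0, 0]
mark_dirty_zone(zones, faults, 1)
faults["B"] = 2                # disk B fails later
mark_dirty_zone(zones, faults, 2)
print(recover_disk(zones, faults, "B", lambda i: None), zones)
print(recover_disk(zones, faults, "A", lambda i: None), zones)
```

In this run, restoring B refreshes only the zone written after B failed, while restoring A refreshes both written zones, since both writes took place after A failed.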

It should be noted that the zone table can receive zones of configurable size. Indeed, an entry in the zone table is associated with a plurality of contiguous virtual addresses.

In the embodiment described here, this table is stored in the reserved data area of the logical volume to which the virtual addresses belong, that is, it is equally distributed on all the disks. Note that the reserved data area of the logical volume is not extensible indefinitely. It should also be noted that the call to the zone table constitutes a read request in the system.

It is therefore necessary to find a compromise between the granularity of the data of the zone table (which increases the performance of the recovery mechanism), and the cost which is associated with the multiplication of the requests (which increases with the granularity).
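This trade-off can be made concrete with a small sketch. It assumes, which the description does not state, that zones are fixed-size runs of contiguous virtual addresses numbered from zero; the names zone_of and zone_table_entries and the volume size used are hypothetical.

```python
def zone_of(virtual_address, zone_size):
    """Index of the zone table entry that covers a virtual address."""
    return virtual_address // zone_size


def zone_table_entries(volume_size, zone_size):
    """Number of zone table entries needed for a volume (ceiling division)."""
    return -(-volume_size // zone_size)


# Hypothetical volume of 1,000,000 virtual addresses.
for zone_size in (1_000, 10_000, 100_000):
    print(zone_size, zone_table_entries(1_000_000, zone_size))
```

Smaller zones mean that less unmodified data is recopied during a recovery, at the price of a larger table in the reserved area and more table accesses.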

The functions described in FIGS. 9 and 10 can be seen as functions that loop in parallel with the main execution of the system. Indeed, to ensure maximum information security, these functions constitute "interruptions".

This means that if their launch condition (a disk failing or a disk being restored) is met, the physical address determination requests in progress are canceled and replayed after these functions have executed.

These functions can therefore be performed directly in the virtualization layer or in the administration module, as separate functions as presented here, or as an integral part of the functions presented above.

Figure 11 shows an assignment of virtual spaces to physical spaces as an alternative to that of Figure 6. Here, fault tolerance is also increased, this time by deliberately leaving free spaces, called resiliency units, within the replication groups.

For the sake of clarity, this figure shows a single replication group comprising seven disks. Across these seven disks, each hatch is defined with four hatch units for storing data and three resiliency units for fault tolerance.

As seen in Figure 11, the data is still replicated once, in the same manner as before, but there is also an offset at each hatch. This offset ensures that the free spaces are distributed over all the disks, which gives a better load balance.

As for the hatch units, two physical addresses on the disks are associated with each resiliency unit Sp0, Sp1 and Sp2. In Fig. 11, the physical addresses associated with the same resiliency unit in a given principal hatch are referenced identically.

For the determination of physical address with the assignment described with FIG. 11, the function described with FIG. 7 remains usable, with some modifications to the function Get_Phy_lnd ().

Such an alternative function Get_Phy_lnd () will now be described with FIG. 12. First of all, the difference of context between the attribution presented in FIG. 6 and that shown in FIG. 11 will be noted.

In the first case, the disks are divided into several groups to avoid correlated failures, that is to say affecting several disks at a time.

In the second case, some of the hatch units are used as free space to compensate for a failure of one or more of the disks, as will be shown below.

There is therefore no a priori grouping of disks by replication group, although this is possible as will be seen in FIG. 15. The definition of the resilience units in addition to the hatch units amounts to distributing part of the physical addresses into a working group (those receiving the data from the hatch units) on the one hand, and into fault groups (those that receive data from the resiliency units) on the other hand, according to a fault tolerance criterion.

Such fault tolerance criterion relates for example to the number of successive failures that one wishes to support, and therefore the number of fault groups to manage. In the present example, this number is three. Other criteria could nevertheless be used.

In the example of Figure 12, the function Get_Phy_lnd () can be seen as two successive blocks:

* a processing block A for determining the disk and hatch indices as before, without taking into account the presence of the resilience units, and

* a block B which processes the indices thus calculated to take into account the presence of the resilience units and obtain the real indices.

The function Get_Phy_lnd () receives the following three arguments (operation 2482):

* the index of the hatch unit for which the correspondence is sought,

* the number of disks in the system, and

* the number of resilience units.

In a first operation 2484, the Strip () function determines a principal hatch index k, as well as the index mm1 of the first disk on which the virtual address is stored.

Operation 2484 differs from operation 2084 of Figure 8 in that the Strip () function is called with the number of hatch units, i.e. the number of disks minus the number of resilience units. In an operation 2486, the function Repl () determines the actual hatch indices k1 and k2 to account for the replication of the data, as in operation 2086.

In an operation 2490, the Shft () function determines a mm2 index of the disk that receives the replicated data. Operation 2490 differs from operation 2090 of Figure 8 in that the function Shft () is called with the number of hatch units, i.e., the number of disks minus the number of units of resilience.

In an operation 2492, a function Cp_Spr () determines an index m1 which corresponds to the real index of the disk associated with the index mm1. This function is used to modify the mm1 index to take into account the presence of the resilience units. As will be seen below, the index m1 returned by the function Cp_Spr () can be the index of a resilience unit.

In an operation 2494, the function Cp_Spr () determines an index m2 which corresponds to the real index of the disk associated with the index mm2. This function is used to modify the mm2 index to take into account the presence of the resilience units. As will be seen below, the index m2 returned by the function Cp_Spr () can be the index of a resilience unit.

Once the indices m1 and m2 are determined, the physical address indices are returned in 2499.

The function Cp_Spr () will now be described with FIG. 13. This function receives as arguments a disk index mm, a hatch unit index k, a total number of disks N and a number of resilience units S. The Cp_Spr () function starts by executing a Spr () function in 2602.

The Spr () function implements Equation 30 of Appendix A. The Spr () function receives three input arguments:

* the disk index mm,

* the hatch unit index k, and

* the total number of disks N.

As the index mm has been established on an N-S number of disks, the function Spr () thus makes it possible to establish an index m which takes into account the presence of the S units of resilience.

Next, a test determines in an operation 2604 whether the disk associated with the real index m has failed and whether a resiliency unit has been assigned to it.

One way to implement the test of operation 2604 is to maintain a table, here called the resilience table. This table contains a single row, in which each column corresponds to the disk whose index m is the number of the column.

The value stored in each column indicates:

* that the disk of index m has not failed, or

* that the disk of index m has failed, the stored value then being a resilience unit index from which the index of the disk hosting the resiliency unit associated with the disk of index m can be determined.

Such a resilience table is stored on each of the disks and is synchronized continuously, together with the fault table for example.

If a resiliency unit is to be used, then the index mm is updated in an operation 2606 by a function Spr_2 () which implements Equation 31 of Annex A, using as arguments the total number of disks N, the number of resilience units S, and the index m that has just been calculated. This function assumes that the data of the disk of index m is stored, in each hatch, on the resilience unit whose resilience index is indicated by the resilience table, and therefore needs only the hatch index k to determine the index of the disk on which the desired resilience unit is allocated.

For this, after updating the mm index, the Cp_Spr () function is restarted. Thus, in the case where two failures take place successively, if the resiliency unit associated with one of the failed disks is located on the second failed disk, the function Cp_Spr () is repeated again, so that the returned index corresponds to a resiliency unit on a functional disk.
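The repetition of Cp_Spr () can be sketched as a bounded loop. Equations 30 and 31 of Annex A are not reproduced in this text, so the functions demo_spr and demo_spr_2 below are hypothetical stand-ins (a rotation by the hatch index, and a mapping of a resilience index to the positions following the hatch units); they are not the formulas of the description. The representation of the resilience table as a list in which None means "not failed" is likewise an assumption.

```python
def cp_spr(mm, k, n_disks, n_spares, resilience_table, spr, spr_2):
    """Sketch of FIG. 13: follow resilience units until a usable disk is found."""
    for _ in range(n_spares + 1):
        m = spr(mm, k, n_disks)              # real disk index (operation 2602)
        spare = resilience_table[m]          # test of operation 2604
        if spare is None:
            return m                         # no resilience unit to follow
        mm = spr_2(n_disks, n_spares, spare) # operation 2606, then restart
    raise RuntimeError("no usable disk found")


def demo_spr(mm, k, n_disks):
    """Hypothetical stand-in for Equation 30: shift by the hatch index."""
    return (mm + k) % n_disks


def demo_spr_2(n_disks, n_spares, spare):
    """Hypothetical stand-in for Equation 31: position of a resilience unit."""
    return n_disks - n_spares + spare


table = [None] * 7   # seven disks, three resilience units, no failure yet
table[2] = 0         # disk 2 has failed and uses resilience unit 0
table[5] = 1         # disk 5, which hosts that unit in hatch 1, fails too
print(cp_spr(1, 1, 7, 3, table, demo_spr, demo_spr_2))
```

With these stand-ins, the request aimed at disk 2 is first redirected to disk 5, then redirected again because disk 5 has itself failed, which mirrors the successive-failure case described above.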

In this way, multiple failures can be tolerated, as the resiliency units can be used to replace other resiliency units. To maintain the resilience table, and to fill the physical addresses associated with the resiliency units during a failure, a function will now be described with FIG.

When the system detects that a disk fails (operation 2700), this function receives the IDs of all failed disks.

In an operation 2702, a function Spr_Dsk () modifies the resilience table entry of each failed disk with which no resilience unit is yet associated: its resilience index is set to the first resilience index not already assigned.

Then, in an operation 2704, the replicated data of the hatch units that were located on the failed disk are copied to the associated resiliency units by means of a Wrt_Spr_Dsk () function.

To improve performance, the copy is not carried out immediately: the Wrt_Spr_Dsk () function generates requests to write the data available on the remaining hatch unit to the resiliency unit, and these requests are executed concurrently with the other access requests. This means that the resiliency unit cannot be used until this write request has been carried out. Finally, the function ends in 2706.

Alternatively, the Wrt_Spr_Dsk () function generates the write requests on the resiliency units and executes these requests before any other access request.
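The two steps of FIG. 14 can be sketched as follows, under assumptions that are not in the description: the resilience table is a list in which None means "not failed", the pending access requests are held in a double-ended queue, and a request is a simple tuple. The lower-case function names and the priority flag, which distinguishes the deferred variant from the alternative just mentioned, are hypothetical.

```python
from collections import deque


def spr_dsk(resilience_table, failed_disks, n_spares):
    """Sketch of operation 2702: give each newly failed disk the first free unit."""
    for disk in failed_disks:
        if resilience_table[disk] is not None:
            continue  # a resilience unit is already associated with this disk
        used = {s for s in resilience_table if s is not None}
        free = [s for s in range(n_spares) if s not in used]
        if not free:
            raise RuntimeError("no resilience unit left")
        resilience_table[disk] = free[0]
    return resilience_table


def wrt_spr_dsk(queue, disk, hatches, priority=False):
    """Sketch of operation 2704: queue the copies towards the resilience units."""
    requests = [("copy_to_spare", disk, k) for k in hatches]
    if priority:
        # Alternative: the copies run before any other access request.
        queue.extendleft(reversed(requests))
    else:
        # Default: the copies are queued alongside the other access requests.
        queue.extend(requests)


table = spr_dsk([None] * 7, [2], 3)
pending = deque([("read", 0, 0)])
wrt_spr_dsk(pending, 2, [0, 1])
print(table, list(pending))
```

Queuing the copies keeps the system responsive after a failure, while the priority variant shortens the window during which the data exists in a single copy.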

Note that when the disk is restored, it is also necessary to copy the replicated data back onto it.

The function described in Figure 14 can be seen as a function that loops in parallel with the main execution of the system. Indeed, to ensure maximum information security, this function constitutes an "interruption".

This means that if the condition of its launch (disk crashing) is encountered, the physical address determination requests that are running are canceled and replayed after this function has been executed.

This function can therefore be performed directly in the virtualization layer or in the administration module, as a separate function as presented here, or as an integral part of the functions presented above.

In an advantageous embodiment, the allocation described with FIG. 6 and that described with FIG. 11 are mixed, to offer an optimal support of failure.

For the determination of physical address with such an assignment, the function described with FIG. 7 remains usable, with some modifications to the function Get_Phy_lnd (). Such an alternative Get_Phy_lnd () function will now be described with FIG.

The function shown in FIG. 15 is a mixture of the variants of the Get_Phy_lnd () function described with FIGS. 8 and 12. Thus, it has great similarities with them. This is why some operations have not been renumbered.

In an operation 2482, the Get_Phy_lnd () function receives the following arguments:

* the index of the hatch unit for which the correspondence is sought,

* an array Ngr [] containing a row whose values indicate the number of disks for each replication group, and

* the number of resilience units.

As for Figure 8, the table Ngr [] is useful because it makes it possible to quickly find the total number of disks N and the number of replication groups Ngr, and to access just as quickly the number of disks per replication group.

Operations 2484 and 2486 are then performed identically to those of FIG. 12, and operation 2488 is performed as step 2088 of FIG.

Then, the operations 2490 to 2494 are performed as in the case of Figure 12, taking into account that this operation is performed within a replication group. The indices of the disks q1 and q2 within the replication group p are then transformed into disk indices m1 and m2 in operations 2496 and 2497 in a similar manner to the operation 2097 of FIG. 8.

Then the indices m1 and m2 of the disks classified by replication group are converted into indices n1 and n2 of the disks classified by failure group in an operation 2498, as in operation 2098 of Figure 8.

Finally, the operation 2499 is performed as the operation 2099 of Figure 8, in order to return the physical address indices.

This embodiment is very advantageous because it offers a very high tolerance to various failures, both through the distribution of the data over disks belonging to different failure groups and through the use of the resiliency units, which make it possible to contain failures within the replication groups.

In addition, the hatch distribution described in these embodiments yields much improved performance, since the accesses are distributed over separate disks.

The hatching units are here described as having a fixed size. It would nevertheless be possible to implement the invention by assigning different sizes of hatch units according to the replication groups with some modifications to the described functions.

The engine and lookup table described herein form a correspondence module capable of assigning the virtual addresses to physical addresses based on a rule having an arithmetic formula defined by the above functions. Although in the embodiments described above these functions constitute the essence of the assignment rule, it would be possible to use a rule that assigns a certain part of the virtual addresses in a fixed manner, and assigns another part of the virtual addresses by means of functions.

Although the embodiments described above link the virtualization layer to an upstream application environment and to physical disks (or local memory unit) downstream, those skilled in the art will understand that the virtualization layer presented here could be isolated from these elements.

It is therefore necessary to consider this layer independently, as a logical brick (virtualization) above and below which can be stacked additional logical bricks.

In this context, obtaining addresses that are called "physical addresses" means above all a de-abstraction of the virtualization layer. These addresses could indeed themselves be virtual addresses of a logical space of a storage server. In the same way, the virtual addresses received upstream could themselves correspond to a de-abstraction of a higher virtualization layer.

The application that accesses the stored data may include a driver that manages the relationships between the various elements, such as the application/file system interaction, the file system/correspondence module interaction, and the correspondence module/storage client interaction, implementing the storage server policy by obtaining a result from each element and calling the next element with that result (or a modified form of that result).

As a variant, the system is autonomous and does not depend on the application that calls the data, and the elements are able to communicate with each other, so that the information travels down and then back up the layers from element to element.

Similarly, the communications between these elements can be provided in different ways, for example by means of the POSIX interface, the IP, TCP and UDP protocols, shared memory, or RDMA (Remote Direct Memory Access). It should be borne in mind that the object of the invention is to provide the advantages of specialized storage systems based on existing network resources.

An exemplary embodiment of the system described above is based on a network in which the stations are made with computers comprising:

* a specialized or general-purpose processor (for example of the CISC or RISC type, or of another type),

* one or more storage disks (for example Serial ATA or SCSI hard disk drives, or other) or any other type of storage,

* a network interface (for example Gigabit Ethernet, Infiniband, SCI ...),

* an operating system-based application environment (e.g. Linux) to support applications and provide a file system manager,

* an application set for implementing the correspondence module, for example the Clustered Logical Volume Manager module of the Exanodes (registered trademark) application of the company Seanodes (registered trademark),

* an application set for implementing the storage client and the storage server of each NBD, for example the Exanodes Network Block Device module of the Exanodes (registered trademark) application of the company Seanodes (registered trademark), and

* an application set for managing the distributed elements, for example the Exanodes Clustered Service Manager module of the Exanodes (registered trademark) application of the company Seanodes (registered trademark).

This type of system can be realized in a network comprising:

* conventional user stations, adapted for application use on a network and acting as application nodes, and

* a set of computer devices made in accordance with the above, which act as network servers and storage nodes.

Other materials and applications will be apparent to those skilled in the art for making alternative devices within the scope of the invention.

The invention encompasses the computer system comprising the application nodes and the storage nodes as a whole. It also encompasses the individual elements of this computer system, and in particular the application nodes and the storage nodes in their individuality, as well as the various means for carrying them out.

Similarly, the data management method is to be considered in its entirety, that is to say in the interaction of the application nodes and the storage nodes, but also in the individuality of the computer stations adapted to implement the application nodes and the storage nodes of this method.

The above description is intended to describe a particular embodiment of the invention. It cannot be considered limiting or as describing the invention in a limiting manner, and covers in particular all combinations of characteristics of the variants described.

The invention also covers a method for storing data comprising determining at least a first and a second storage address from a virtual address received as input, the storage addresses being associated with storage units, and storing data associated with said virtual address at said first and second determined storage addresses, the method being characterized in that:

* a first table comprising identification data of failed storage units, and a second table comprising modification data of virtual address blocks, are maintained, and

* upon recovery of a failed storage unit, the storage addresses of this storage unit are updated from virtual addresses taken from the second table, based on the data of the first table.

The method may further be characterized in that:

* upon modification of a virtual address, the second table stores an index for the block of virtual addresses to which this address belongs, said index being defined to indicate posteriority with respect to the most recent failure indicated in the first table;

* upon recovery of a storage unit, the storage addresses of this storage unit which correspond in the correspondence module to virtual addresses which belong to blocks of virtual addresses whose index in the second table indicates posteriority relative to the index of the recovered storage unit in the first table are updated;

* in the event of a failure of a storage unit, the first table stores an index for this storage unit, said index being defined higher than the indices already defined in the first table;

* upon modification of a virtual address, the second table stores an index for the block of virtual addresses to which this address belongs, said index being defined equal to the maximum of the indices defined in the first table at the time of the modification;

* upon recovery of a storage unit, the storage addresses of this storage unit which correspond in the correspondence module to virtual addresses which belong to blocks of virtual addresses whose index in the second table is greater than or equal to the index of the recovered storage unit in the first table are updated; and

* when the storage addresses of a recovered storage unit have been updated, the first table is updated by deleting the index corresponding to the recovered storage unit, and the second table is updated by setting the indices of the second table that are greater than the maximum of the indices defined in the updated first table equal to this maximum.

The invention also covers, as products, the software elements described, made available on any computer-readable "medium". The term "computer-readable medium" includes data storage media, whether magnetic, optical and/or electronic, as well as a transmission medium or vehicle, such as an analog or digital signal.

ANNEX A

SECTION 1

SECTION 2

SECTION 3

Claims

1. A computer data storage tool comprising a correspondence module (40) connected to storage units (38), said correspondence module (40) being arranged to determine a correspondence between virtual addresses and physical addresses on the storage units, each virtual address being assigned to at least two storage addresses, characterized in that the correspondence module maintains a first table comprising identification data of failed storage units, as well as a second table comprising modification data of virtual address blocks, the computer tool further comprising a recovery unit (406) arranged, upon recovery of a failed storage unit, to update the storage addresses of that storage unit by calling the correspondence module with virtual addresses taken from the second table, based on data from the first table.

2. Computer tool according to claim 1, characterized in that the correspondence module is arranged, during a failure of a storage unit, for storing in the first table an index for this storage unit, said index being defined to indicate the posteriority of this failure compared to the failures already indicated in the first table.

3. Computer tool according to claim 2, characterized in that the correspondence module is arranged, at the modification of a virtual address, to store in the second table an index for the block of virtual addresses to which this address belongs, said index being defined to indicate the posteriority with respect to the most recent failure indicated in the first table.

4. Computer tool according to claim 3, characterized in that the recovery unit is arranged, at the recovery of a storage unit, to update the storage addresses of this storage unit that correspond in the correspondence module to virtual addresses that belong to virtual address blocks whose index in the second table indicates a posteriority with respect to the index of the storage unit restored in the first table.

5. Computer tool according to one of the preceding claims, characterized in that the correspondence module is arranged, during a failure of a storage unit, for storing in the first table an index for this storage unit, said index being defined higher than the indices already defined in the first table.

6. Computer tool according to claim 5, characterized in that the correspondence module is arranged, at the modification of a virtual address, to store in the second table an index for the block of virtual addresses to which this address belongs, said index being defined equal to the maximum of the indices defined in the first table during the modification.

7. Computer tool according to claim 6, characterized in that the recovery unit is arranged, at the recovery of a storage unit, to update the storage addresses of this storage unit that correspond in the correspondence module to virtual addresses that belong to virtual address blocks whose index in the second table is greater than or equal to the index of the storage unit restored in the first table.

8. Computer tool according to claim 7, characterized in that the correspondence module is arranged, when the recovery unit has updated the storage addresses of a re-established storage unit, to update the first table by erasing the index corresponding to the reestablished storage unit, and to update the second table by defining the indices of the second table which are greater than the maximum of the indices defined in the first table updated as equal to this maximum.