WO2016079809A1 - Storage unit, file server, and data storage method

Info

Publication number: WO2016079809A1
Application number: PCT/JP2014/080513
Authority: WO
Inventors: ファビオ平良
Original assignee: 株式会社日立製作所; 株式会社日立情報通信エンジニアリング
Priority date: 2014-11-18
Filing date: 2014-11-18
Publication date: 2016-05-26

Abstract

The present invention reduces processing for replacing data in a memory with data in a storage device and thereby improves the performance of the storage unit. When it is determined that a working chunk is not identical to a matching chunk, a processor according to the present invention puts the working chunk in a working container. The processor determines whether the working container satisfies a predetermined closing condition, and, if it is determined that the working container satisfies the closing condition, determines whether the working container satisfies a predetermined registration condition. If it is determined that the working container satisfies the registration condition, the processor causes chunk information based on each chunk in the working container to be included in index information. If it is determined that the working container satisfies the closing condition, the processor destages the working container to the storage device.

Description

Storage device, file server, and data storage method

The present invention relates to a storage apparatus.

A deduplication technique is known in which a plurality of data overlapping each other is determined in a storage device, and only one of the plurality of data is stored. When the storage device receives the write data, it reads the data related to the partial data of the write data out of the data stored in the storage device into the memory, and determines whether or not the partial data is duplicated with the read data. The partial data determined to be not duplicated is written to the storage device. When the amount of data read into the memory reaches a predetermined upper limit, the storage apparatus reads another data from the storage device and replaces the data in the memory with another data.

Patent Document 1 describes a technique in which the backup device uses first duplication determination information to determine that content is not stored in the storage device, and the storage device uses second duplication determination information to store the content in the storage device. The storage device transmits the second duplication determination information to the backup device, and the backup device incorporates the received second duplication determination information into the first duplication determination information.

Patent Document 1: International Publication No. 2014/085508

If the process of replacing the data in the memory with the data in the storage device occurs frequently in order to determine duplication, the performance of the storage device will deteriorate.

In order to solve the above problem, a storage apparatus according to an aspect of the present invention includes a storage device, a memory, and a processor connected to the storage device, the memory, and a host computer. The processor receives write data from the host computer, creates in the memory a work container, which is a container for holding data, and divides the write data into a plurality of chunks. The processor selects one of the plurality of chunks as a work chunk in order of chunk position, which indicates the position of each chunk in the write data. Based on index information that includes chunk information based on chunks in the storage device, the processor determines whether a matching chunk that may overlap with the work chunk is stored in the storage device. If it is determined that the matching chunk is stored in the storage device, the processor reads a conforming container, which is a container including the matching chunk, into the memory and determines, based on the conforming container, whether the work chunk overlaps with the matching chunk. If it is determined that the work chunk does not overlap with the matching chunk, the processor includes the work chunk in the work container. The processor determines whether the work container satisfies a predetermined close condition. If it is determined that the work container satisfies the close condition, the processor determines whether the work container satisfies a predetermined registration condition. If it is determined that the work container satisfies the registration condition, the processor includes chunk information based on each chunk in the work container in the index information. If it is determined that the work container satisfies the close condition, the processor writes the work container to the storage device.
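The flow described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the class name StorageSketch, the use of SHA-1 as chunk information, the fixed chunk length, the concrete close condition (a maximum chunk count, plus closing at the end of the write data), and all numeric values are assumptions made for illustration. The registration condition uses the interval evaluation described later in this document.

```python
import hashlib
from typing import Dict, List, Tuple

Chunk = Tuple[int, bytes]  # (chunk position, chunk data)


class StorageSketch:
    """Hypothetical model of the described write flow (illustration only)."""

    def __init__(self, chunk_length: int, max_chunks: int,
                 interval_threshold: float, min_chunks: int) -> None:
        self.chunk_length = chunk_length
        self.max_chunks = max_chunks                  # assumed close condition
        self.interval_threshold = interval_threshold  # registration condition
        self.min_chunks = min_chunks                  # registration condition
        self.index: Dict[bytes, int] = {}             # fingerprint -> container ID
        self.disk: Dict[int, List[Chunk]] = {}        # container ID -> chunks

    def write(self, write_data: bytes) -> None:
        work: List[Chunk] = []                        # work container in memory
        starts = range(0, len(write_data), self.chunk_length)
        for position, start in enumerate(starts):
            chunk = write_data[start:start + self.chunk_length]
            container_id = self.index.get(hashlib.sha1(chunk).digest())
            if container_id is not None:
                staged = self.disk[container_id]      # read the conforming container
                if any(data == chunk for _, data in staged):
                    continue                          # duplicate, not stored again
            work.append((position, chunk))
            if len(work) >= self.max_chunks:          # close condition satisfied
                self._close(work)
                work = []
        if work:
            self._close(work)                         # assumption: close at end of data

    def _close(self, work: List[Chunk]) -> None:
        container_id = len(self.disk)
        head, tail = work[0][0], work[-1][0]
        registered = ((tail - head) / len(work) < self.interval_threshold
                      and len(work) > self.min_chunks)
        if registered:                                # only then update the index
            for _, data in work:
                self.index[hashlib.sha1(data).digest()] = container_id
        self.disk[container_id] = list(work)          # write to the storage device


storage = StorageSketch(chunk_length=4, max_chunks=8,
                        interval_threshold=2.0, min_chunks=1)
storage.write(b"AAAABBBBCCCC")   # three new chunks, container 0 is registered
storage.write(b"AAAABBBBCCCC")   # all chunks are duplicates, nothing is written
storage.write(b"AAAAXXXXCCCC")   # one new chunk, container 1 is not registered
```

In this sketch the third write produces a container with a single chunk, which fails the assumed lower limit on stored chunks, so its chunk never enters the index and the container is never read back for later duplication determination.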

Processing to replace data in the memory with data in the storage device can be reduced, and the performance of the storage apparatus can be improved.

FIG. 1 shows the deduplication processing of the comparative example in the first state.
FIG. 2 shows the deduplication processing of the comparative example in the second state.
FIG. 3 shows stage processing.
FIG. 4 shows destage processing.
FIG. 5 shows the deduplication processing of the comparative example in the third state.
FIG. 6 shows the deduplication processing of the comparative example in the fourth state.
FIG. 7 shows the deduplication processing of the comparative example in the fifth state.
FIG. 8 shows the deduplication processing of the comparative example in the sixth state.
FIG. 9 shows the deduplication processing of the embodiment in the first state.
FIG. 10 shows the deduplication processing of the embodiment in the second state.
FIG. 11 shows the deduplication processing of the embodiment in the third state.
FIG. 12 shows the deduplication processing of the embodiment in the seventh state.
FIG. 13 shows the deduplication processing of the embodiment in the eighth state.
FIG. 14 shows the configuration of a file server.
FIG. 15 shows a functional configuration of the storage apparatus 10.
FIG. 16 shows the configuration of the processing data 420.
FIG. 17 shows backup processing.
FIG. 18 shows chunk write processing.
FIG. 19 shows conforming container stage processing.
FIG. 20 shows new chunk storage processing.
FIG. 21 shows container close determination processing.
FIG. 22 shows container close processing.
FIG. 23 shows read processing.

In the following description, the information of the present embodiment is described using the expressions “file” and “index”, but this information need not be expressed by these data structures. For example, it may be expressed by a data structure such as a “table”, “list”, “DB (database)”, or “queue”. Therefore, “file”, “index”, “table”, “list”, “DB”, “queue”, and the like may simply be referred to as “information” to show that they do not depend on the data structure. Further, in describing the contents of each piece of information, the expressions “identification information”, “identifier”, “name”, “ID”, and “number” are used, and these expressions are interchangeable.

In the following description, a “program” may be the subject of a sentence. However, because a program performs its defined processing by being executed by a CPU (Central Processing Unit) while using a memory and a communication port (communication control device), the CPU may equally be regarded as the subject. Processing disclosed with a program as the subject may be processing performed by a computer such as a server computer, a storage controller, a management computer, or an information processing apparatus. Part or all of a program may be realized by dedicated hardware, or may be modularized. Various programs may be installed in each computer by a program distribution server or from a storage medium.

The management computer has input/output devices. Examples of input/output devices include a display, a keyboard, and a pointer device, but other devices may be used. As an alternative to the input/output devices, a serial interface or an Ethernet interface may be used as the input/output device. In that case, a display computer having a display, a keyboard, or a pointer device is connected to the interface, the display information is transmitted to the display computer, and input information is received from the display computer, so that the display computer performs the display and accepts the input in place of the input/output devices.

Hereinafter, a set of one or more computers that manage the computer system and display the display information of the present invention may be referred to as a management system. When the management computer displays the display information, the management computer is the management system, and a combination of the management computer and the display computer is also a management system. In addition, in order to increase the speed and reliability of the management processing, a plurality of computers may realize processing equivalent to that of the management computer. In this case, the plurality of computers (including the display computer, if the display computer performs the display) is the management system.

Hereinafter, deduplication processing by the storage device of the comparative example will be described.

FIG. 1 shows the deduplication processing of the comparative example in the first state.

This figure shows a first state in which the storage apparatus 90b of the comparative example receives new backup data BD1 from the host computer 80. In this case, as deduplication processing, the storage apparatus 90b divides the backup data BD1 into chunks a, b, ..., z (S11). Each chunk has a preset chunk length. After that, the storage apparatus 90b sequentially determines whether each of the chunks a, b, ..., z overlaps with an existing chunk stored in the storage apparatus 90b, and determines that none of the chunks a, b, ..., z overlaps with an existing chunk, that is, that all of them are new chunks. In this case, the storage apparatus 90b creates a new container CN1 for storing new chunks on the memory as a work container, and sequentially stores the new chunks in the work container (S12). Hereinafter, the number of chunks stored in the work container is referred to as the number of stored chunks. When the number of stored chunks reaches a preset upper limit value of the number of stored chunks, the storage apparatus 90b creates a new container CN2 as a work container and stores subsequent new chunks in that work container.
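The division into fixed-length chunks and the filling of containers up to the upper limit can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function names, the chunk length of 2 bytes, and the upper limit of 3 chunks are hypothetical, and the sketch opens the next container lazily when the next new chunk arrives rather than at the moment the limit is reached.

```python
from typing import List


def split_into_chunks(data: bytes, chunk_length: int) -> List[bytes]:
    """Split backup data into chunks of a preset length (the last may be shorter)."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    return [data[i:i + chunk_length] for i in range(0, len(data), chunk_length)]


def pack_new_chunks(chunks: List[bytes], max_stored_chunks: int) -> List[List[bytes]]:
    """Store new chunks in order, starting a new container once the limit is hit."""
    containers: List[List[bytes]] = []
    for chunk in chunks:
        if not containers or len(containers[-1]) >= max_stored_chunks:
            containers.append([])  # corresponds to creating CN1, then CN2, ...
        containers[-1].append(chunk)
    return containers


chunks = split_into_chunks(b"abcdefghij", chunk_length=2)
containers = pack_new_chunks(chunks, max_stored_chunks=3)
```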

FIG. 2 shows the deduplication processing of the comparative example in the second state.

This figure shows a second state in which the storage apparatus 90b receives the next backup data BD2 from the host computer 80 after the first state. In this case, the storage apparatus 90b divides the backup data BD2 into a plurality of chunks as deduplication processing (S21), and determines that the chunks f1, o1, and p1 are new chunks. In this case, the storage apparatus 90b creates a new container CN3 as a work container and stores the new chunk in the work container (S22).

FIG. 3 shows stage processing.

The storage device 90b includes a memory 91b and a disk 92b. The disk 92b stores a container. The memory 91b stores backup data received from the host computer 80 and a container read from the disk 92b for duplication determination.

In the deduplication processing, the storage apparatus 90b selects chunks in the backup data in order from the top of the backup data, and performs duplication determination, that is, determines whether the selected chunk duplicates data stored in the storage apparatus 90b. Here, the storage apparatus 90b reads (stages) the container related to the chunk i in the backup data from the disk 92b to the memory 91b (S31), and compares the chunk i with the chunks on the memory 91b.

FIG. 4 shows the destage processing.

Hereinafter, the number of containers staged from the disk 92b to the memory 91b is referred to as the number of stage containers. In the storage apparatus 90b, a stage container number upper limit value, which is an upper limit value of the number of stage containers, is set in advance. When the number of stage containers reaches the upper limit value of the number of stage containers in the deduplication processing, the storage apparatus 90b invalidates (destages) the container with the oldest staged time (S41). Thereafter, the storage apparatus 90b stages the container including the next chunk for duplication determination (S42).

Hereinafter, problems of deduplication processing by the storage device of the comparative example will be described.

FIG. 5 shows the deduplication processing of the comparative example in the third state.

This figure shows a third state in which the storage apparatus 90b receives the next backup data BD5 from the host computer 80 after receiving the backup data from the host computer 80. In this case, as a deduplication process, the storage apparatus 90b divides the backup data BD5 into a plurality of chunks, and determines that chunks a1, k1, and o1 are new chunks (S51). In this case, the storage apparatus 90b creates a new container CN5, stores the new chunks a1, k1, and o1 in the container (S52), and destages the container (S53).

FIG. 6 shows the deduplication process of the comparative example in the fourth state.

This figure shows a fourth state in which the storage apparatus 90b receives the next backup data BD6 from the host computer 80 after the third state. In this case, the storage apparatus 90b divides the backup data BD6 into a plurality of chunks and, in order to perform duplication determination on the received chunk a1, selects the container CN5 including the chunks a1, k1, and o1 from the disk 92b (S61). The storage apparatus 90b then stages the selected container CN5 (S62) and compares the received chunk a1 with the staged chunk a1.

FIG. 7 shows the deduplication processing of the comparative example in the fifth state.

This figure shows a fifth state after the fourth state, in which the storage apparatus 90b has performed duplication determination on the chunks b through i in the backup data BD6 and now performs duplication determination on the chunk j. In this case, the storage apparatus 90b determines that the containers CN5, CN6, and CN7 have been staged by the duplication determinations performed so far and that the number of stage containers has reached the upper limit value of the number of stage containers (S71). The storage apparatus 90b therefore destages the container CN5, which has the oldest staged time (S72), and stages the container CN8 including the chunk j from the disk 92b (S73).

FIG. 8 shows the deduplication processing of the comparative example in the sixth state.

This figure shows a sixth state after the fifth state, in which the storage apparatus 90b performs duplication determination on the chunk k1 that follows the chunk j in the backup data BD6. In this case, because the number of stage containers has reached the upper limit value of the number of stage containers as a result of the duplication determinations up to the chunk j (S81), the storage apparatus 90b destages the container CN6, which has the oldest staged time (S82), and stages again the container CN5, which includes the chunk k1, from the disk 92b (S83). That is, the storage apparatus 90b stages the container CN5 immediately after destaging it. As the number of destages and stages increases, the performance of the storage apparatus 90b decreases.
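The repeated staging in the fourth to sixth states can be reproduced with a small simulation. The sketch below assumes an upper limit of three stage containers and models "oldest staged time" as first-in, first-out order. The class name StageCache, the counters, and the loader function are hypothetical names introduced only for this illustration.

```python
from collections import OrderedDict


class StageCache:
    """Hypothetical in-memory set of staged containers with oldest-first destaging."""

    def __init__(self, max_staged_containers: int) -> None:
        self.max_staged = max_staged_containers
        self.staged = OrderedDict()   # container ID -> container, in staging order
        self.stage_count = 0
        self.destage_count = 0

    def get(self, container_id, load_from_disk):
        if container_id in self.staged:
            return self.staged[container_id]      # already in memory
        if len(self.staged) >= self.max_staged:
            self.staged.popitem(last=False)       # destage the oldest staged container
            self.destage_count += 1
        self.staged[container_id] = load_from_disk(container_id)  # stage
        self.stage_count += 1
        return self.staged[container_id]


def load_from_disk(container_id):
    # Stand-in for reading a container from the disk.
    return "data of " + container_id


cache = StageCache(max_staged_containers=3)
# Containers needed for chunks a1, b..i, j, and k1 of backup data BD6.
for container_id in ["CN5", "CN6", "CN7", "CN8", "CN5"]:
    cache.get(container_id, load_from_disk)
```

Under these assumptions the container CN5 is staged twice, and five stage operations and two destage operations occur for only four distinct containers.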

Hereinafter, an outline of an embodiment of the present invention will be described.

FIG. 9 shows the deduplication processing of the embodiment in the first state.

In the first state, the storage apparatus 90 of the present embodiment divides the backup data BD1 into chunks a, b,... Z, and assigns chunk numbers to each chunk in order from the top of the backup data BD1 (S1111). Instead of the chunk number, a chunk position indicating the position of the chunk in the backup data BD1, such as an offset from the head of the backup data BD1, may be used. Thereafter, the storage apparatus 90 performs duplication determination of each chunk in order of the chunk number, opens the new container CN1 as a work container on the memory, and stores the new chunks in the work container in order (S1112). To open a work container is to create a work container on the memory and make it possible to add a new chunk to the work container. When the new chunk is stored in the work container, the storage device 90 records the first chunk number of the work container as head (starting chunk number) and the last chunk number of the working container as tail (endpoint chunk number). When the number of stored chunks reaches the upper limit value of the number of stored chunks, the storage device 90 closes the work container and creates a new container CN2 as a work container. Closing the work container means that a new chunk cannot be added to the work container, the work container is written to the disk, and the work container in the memory is invalidated (destaged).
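The bookkeeping of head and tail for a work container can be sketched as follows. This is an illustrative data structure only: the class name WorkContainer, its fields, and the example chunk numbers are hypothetical. Because chunks are processed in chunk number order, the sketch takes the number of the most recently added chunk as tail.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class WorkContainer:
    container_id: str
    chunks: List[Tuple[int, bytes]] = field(default_factory=list)
    head: Optional[int] = None   # starting chunk number
    tail: Optional[int] = None   # endpoint chunk number
    is_open: bool = True

    def add(self, chunk_number: int, data: bytes) -> None:
        if not self.is_open:
            raise RuntimeError("a closed container cannot accept new chunks")
        if self.head is None:
            self.head = chunk_number   # first chunk stored in this container
        self.tail = chunk_number       # chunks arrive in chunk number order
        self.chunks.append((chunk_number, data))

    def close(self) -> None:
        # Writing to disk and invalidating the in-memory copy are omitted here.
        self.is_open = False


container = WorkContainer("CN1")
for number, data in [(0, b"a"), (1, b"b"), (2, b"c")]:
    container.add(number, data)
container.close()
```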

FIG. 10 shows the deduplication processing of the embodiment in the second state.

In the second state, the storage apparatus 90 divides the backup data BD2 into a plurality of chunks (S1121), opens the new container CN3 as a work container, detects the new chunks f1, o1, and p1 in the backup data BD2, stores the new chunks f1, o1, and p1 in the work container (S1122), and closes the work container.

FIG. 11 shows the deduplication processing of the embodiment in the seventh state.

This figure shows a seventh state in which the storage apparatus 90 receives the next backup data BD5 from the host computer 80 after receiving earlier backup data. The storage apparatus 90 includes a memory 91 and a disk 92. In this case, the storage apparatus 90 creates a work container on the memory 91 by opening the new container CN5 as a work container, detects the new chunks a1, k1, and o1 in the backup data BD5, and stores the new chunks a1, k1, and o1 in the work container (S1131). Furthermore, the storage apparatus 90 determines whether the work container satisfies a predetermined registration condition. For example, the registration condition is that an interval evaluation value indicating the spread of chunk numbers in the work container is smaller than a predetermined interval evaluation threshold, and that the number of stored chunks is larger than a predetermined lower limit value of the number of stored chunks. The interval evaluation value is represented by, for example, “(tail - head) / number of stored chunks”. The interval evaluation value increases as the intervals between the chunk numbers of the chunks in the work container increase. When it is determined that the work container satisfies the registration condition, the storage apparatus 90 writes the work container to the disk 92 as a registered container. Here, the storage apparatus 90 includes the chunk information indicating the chunks in the registered container in the index information, so that the chunks in the container can be referred to, based on the index information, in subsequent duplication determination. On the other hand, when it is determined that the work container does not satisfy the registration condition, the storage apparatus 90 writes the work container to the disk 92 as an unregistered container (S1132). Here, by not including the chunk information indicating the chunks in the unregistered container in the index information, the storage apparatus 90 prevents the chunks in this container from being referred to in subsequent duplication determination. Further, the storage apparatus 90 closes the work container (S1133).
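The example registration condition can be written as a short check. In the sketch below the function names, the interval evaluation threshold of 2.0, and the lower limit of 2 stored chunks are hypothetical, and the chunk numbers 0, 10, and 14 for the chunks a1, k1, and o1 are assumed from their alphabetical positions.

```python
def interval_evaluation_value(head: int, tail: int, stored_chunks: int) -> float:
    """(tail - head) / number of stored chunks."""
    return (tail - head) / stored_chunks


def satisfies_registration_condition(head: int, tail: int, stored_chunks: int,
                                     interval_threshold: float,
                                     min_stored_chunks: int) -> bool:
    if stored_chunks == 0:
        return False  # an empty container has nothing to register
    return (interval_evaluation_value(head, tail, stored_chunks) < interval_threshold
            and stored_chunks > min_stored_chunks)


# Scattered chunks a1, k1, o1 (assumed chunk numbers 0, 10, 14).
scattered = satisfies_registration_condition(0, 14, 3, 2.0, 2)
# Three consecutive chunks (chunk numbers 0, 1, 2).
consecutive = satisfies_registration_condition(0, 2, 3, 2.0, 2)
```

With these assumed values the scattered container has an interval evaluation value of 14 / 3 and is written as an unregistered container, while the consecutive container has a value of 2 / 3 and is registered.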

FIG. 12 shows the deduplication processing of the embodiment in the eighth state.

This figure shows an eighth state in which the storage apparatus 90 receives the next backup data BD6 from the host computer 80 after the seventh state. In this case, when the storage apparatus 90 performs duplication determination on the chunk a1 in the backup data BD6 (S1141), the container CN5 on the disk 92, which includes the chunks a1, k1, and o1, is an unregistered container, so duplication determination using the container CN5 is not performed (S1142). The storage apparatus 90 therefore determines that the chunk a1 is a new chunk (S1143), opens the new container CN12 as a work container, stores the chunk a1 in the work container, and closes the work container. The storage apparatus 90 thus does not stage the unregistered container CN5. Thereby, the storage apparatus 90 can reduce the number of stages and destages compared to the storage apparatus 90b.

In this deduplication processing, the work container may come to store a first new chunk, a second new chunk whose chunk number is close to that of the first new chunk, and a third new chunk whose chunk number is far from that of the second new chunk. In this case, the deduplication processing determines that the work container is an unregistered container, and duplication determination is not performed for any of the chunks in the unregistered container, so that the deduplication rate decreases. Therefore, the deduplication processing of this embodiment performs the following processing.

FIG. 13 shows the deduplication processing of the embodiment in the ninth state.

This figure shows a ninth state in which the storage apparatus 90 receives the next backup data BD7 from the host computer 80 after the eighth state. In this case, as deduplication processing, the storage apparatus 90 divides the backup data BD7 into a plurality of chunks and determines that the chunks a1, b1, c1, and o1 are new chunks. The storage apparatus 90 creates the new container CN13 as a work container and sequentially selects the new chunks as work chunks. Here, the storage apparatus 90 determines whether the work container satisfies a predetermined separation condition. For example, when the chunk number of a work chunk is referred to as a work chunk number, the separation condition is that a distance evaluation value, which indicates the distance from the chunk number of the immediately preceding work chunk to the chunk number of the current work chunk, is larger than a predetermined distance evaluation threshold, and that the number of stored chunks is larger than a separation chunk number threshold. The distance evaluation value is represented by, for example, “work chunk number - tail”. When it is determined that the work container does not satisfy the separation condition, the storage apparatus 90 stores the work chunk in the work container (S1151). When it is determined that the work container satisfies the separation condition, the storage apparatus 90 closes the work container, creates a new container CN14 as a work container, and stores the work chunk in that work container (S1152). As a result, the storage apparatus 90 can store chunks whose positions in the backup data are close to each other in one container, and can store chunks whose positions in the backup data are far from each other in different containers. Thereby, the storage apparatus 90 can use a container including a plurality of chunks close to each other, such as a plurality of consecutive chunks, as a registered container, and can suppress a decrease in the deduplication rate.
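The separation condition can be sketched in the same way. The function names, the distance evaluation threshold of 5, and the separation chunk number thresholds below are hypothetical, and the chunk numbers 0, 1, 2, and 14 for the chunks a1, b1, c1, and o1 are assumed from their alphabetical positions. Containers are modeled simply as lists of chunk numbers.

```python
from typing import List


def distance_evaluation_value(work_chunk_number: int, tail: int) -> int:
    """work chunk number - tail."""
    return work_chunk_number - tail


def satisfies_separation_condition(work_chunk_number: int, tail: int,
                                   stored_chunks: int, distance_threshold: int,
                                   separation_chunk_threshold: int) -> bool:
    return (distance_evaluation_value(work_chunk_number, tail) > distance_threshold
            and stored_chunks > separation_chunk_threshold)


def assign_to_containers(new_chunk_numbers: List[int], distance_threshold: int,
                         separation_chunk_threshold: int) -> List[List[int]]:
    containers: List[List[int]] = [[]]
    for number in new_chunk_numbers:
        current = containers[-1]
        if current and satisfies_separation_condition(
                number, current[-1], len(current),
                distance_threshold, separation_chunk_threshold):
            containers.append([])      # close the work container, open a new one
        containers[-1].append(number)
    return containers


separated = assign_to_containers([0, 1, 2, 14], distance_threshold=5,
                                 separation_chunk_threshold=2)
kept_together = assign_to_containers([0, 1, 2, 14], distance_threshold=5,
                                     separation_chunk_threshold=3)
```

With a separation chunk number threshold of 2, the chunk o1 goes into its own container and the container holding a1, b1, and c1 remains a candidate for registration. With a threshold of 3, all four chunks stay in one container.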

Hereinafter, a file server according to an embodiment of the present invention will be described.

Here, the configuration of the file server of this embodiment will be described.

FIG. 14 shows the configuration of the file server of this embodiment.

The file server of this embodiment includes the storage device 10 and the host computer 30.

The storage device 10 corresponds to the storage device 90 described above. The storage apparatus 10 includes a plurality of nodes 40 (N0, N1,... Nn) and a disk array 60. Each of the plurality of nodes 40 is connected to the host computer 30 via a communication network such as a LAN (Local Area Network). The disk array 60 corresponds to the disk 92 described above. The disk array 60 is connected to a plurality of nodes 40 via a communication network such as an FC (Fibre Channel) cable. The number of nodes 40 may be one. Instead of the disk array 60, other storage devices including storage devices such as HDD (Hard Disk Drive) and SSD (Solid State Drive) may be used.

The node 40 includes a CPU (Central Processing Unit) 110, a memory 120, an FC port 130, an HDD 140, and a network interface (NW I / F) 150. The HDD 140 stores programs and data for the node 40. The CPU 110 executes processing according to a program in the HDD 140. The memory 120 corresponds to the memory 91 described above. The memory 120 stores programs read from the HDD 140, data used for processing, data transmitted to and received from the host computer 30, data communicated to the disk array 60, and the like. The FC port 130 is connected to the disk array 60 and communicates with the disk array 60 in accordance with an instruction from the CPU 110. The NW I / F 150 is connected to the host computer 30 and communicates with the host computer 30 in accordance with an instruction from the CPU 110.

The disk array 60 includes two duplicated controllers 210 (CTL0, CTL1) and a plurality of HDDs 230. The controller 210 is connected to the node 40 and performs I / O processing for a plurality of HDDs 230 in accordance with instructions from the node 40. The plurality of HDDs 230 store data from the node 40. The number of controllers 210 may be one. The controller 210 creates an LU using a plurality of HDDs 230 and provides it to the node 40.

The controller 210 includes a CPU 211, a memory 212, an FC port 213, and a disk interface (I / F) 214. The memory 212 stores programs and data. The CPU 211 executes processing such as RAID (Redundant Arrays of Inexpensive Disks) according to a program in the memory 212. The FC port 213 is connected to the node 40 via the FC, and communicates with the node 40 in accordance with an instruction from the CPU 211. The disk I / F 214 is connected to a plurality of HDDs 230 and accesses the HDDs 230 according to instructions from the CPU 211. Other SAN (Storage Area Network) may be used instead of FC.

The host computer 30 is connected to a terminal device via a communication network. The host computer 30 stores data accessed from the terminal device. The host computer 30 creates backup data based on the stored data, and transmits a request to write the backup data to the storage apparatus 10. In addition, the host computer 30 transmits a request for reading backup data to the storage apparatus 10 in order to restore the data.

Note that the file server may include a backup server connected to the host computer 30 and the storage apparatus 10 via a communication network. In this case, the host computer 30 stores the file, and transmits a write request for the file to be backed up from the stored file to the backup server. The backup server receives a file from the host computer 30, creates backup data based on the received file, and transmits it to the storage device 10. Further, the host computer 30 transmits a file read request to the backup server. The backup server reads backup data including the specified file from the storage apparatus 10 and transmits the specified file to the host computer 30.

FIG. 15 shows a functional configuration of the storage apparatus 10.

The node 40 includes a deduplication processing unit 310 and a file system management unit 320 as functions by programs in the memory 120. The deduplication processing unit 310 receives the backup data 410 from the host computer 30 and performs deduplication processing on the backup data 410. The file system management unit 320 creates a file system (FS) 330 using the LU provided from the disk array 60. In addition, the file system management unit 320 writes processing data 420 that is backup data after deduplication processing to the file system 330 based on an instruction from the deduplication processing unit 310. Further, the file system management unit 320 reads the processing data 420 from the file system 330 based on an instruction from the deduplication processing unit 310. The deduplication processing unit 310 reconstructs the backup data 410 from the processing data 420 and transmits it to the host computer 30. The memory 120 further stores work data 440 for the deduplication processing unit 310.

Further, the deduplication processing unit 310 uses the file system management unit 320 to acquire setting values from the setting data 460 in the file system 330. The setting data 460 may include any of the upper limit value of the number of stored chunks, the lower limit value of the number of stored chunks, the interval evaluation threshold, the distance evaluation threshold, the separation chunk number threshold, and the upper limit value of the number of stage containers. As a result, the administrator of the storage apparatus 10 can adjust the balance between the performance of the storage apparatus 10 and the deduplication rate. The management computer connected to the node 40, the node 40, the host computer 30, or the like may write the setting data 460. The setting data 460 may be written in the memory 120, the HDD 140, or the like.

The disk array 60 is file-accessed from the node 40, but may be block-accessed from the node 40. In this case, the chunk may be a block. Further, the controller 210 may have the function of the node 40.

Hereinafter, the configuration of the processing data 420 stored in the disk array 60 will be described.

FIG. 16 shows the configuration of the processing data 420.

The processing data 420 includes a plurality of contents 510, a plurality of container indexes 520, a plurality of containers 530, and a plurality of chunk indexes 540.

The content 510 for reconstructing one piece of backup data includes a content ID 610 indicating the content, and a plurality of pieces of chunk information 620 indicating a plurality of chunks included in the content. The chunk information 620 indicating one chunk includes an offset 621 from the beginning of the backup data to the chunk, a length 622 of the chunk, a container ID 623 of the container 530 including the chunk, and a fingerprint 624 of the chunk. The fingerprint 624 is a hash value obtained from the chunk, that is, a value shorter than the chunk data itself. As a result, a chunk can be searched for at a higher speed than by comparing chunk data. The deduplication processing unit 310 can identify the chunks included in the backup data and the containers including those chunks based on the content 510.
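The structure of the content 510 can be sketched as a list of records. The document says only that the fingerprint is a hash value, so the use of SHA-1 here is an assumption, as are the class name ContentChunkInfo, the function names, and the example container IDs.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContentChunkInfo:
    offset: int         # offset 621 from the beginning of the backup data
    length: int         # length 622 of the chunk
    container_id: str   # container ID 623 of the container holding the chunk
    fingerprint: bytes  # fingerprint 624 of the chunk


def fingerprint(chunk_data: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the unspecified hash function.
    return hashlib.sha1(chunk_data).digest()


def build_content(chunks: List[bytes], container_ids: List[str]) -> List[ContentChunkInfo]:
    infos: List[ContentChunkInfo] = []
    offset = 0
    for data, container_id in zip(chunks, container_ids):
        infos.append(ContentChunkInfo(offset, len(data), container_id, fingerprint(data)))
        offset += len(data)
    return infos


content = build_content([b"abcd", b"efgh", b"ij"], ["CN1", "CN1", "CN2"])
```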

The container index 520 indicating one container includes a container ID 630 indicating the container and at least one chunk information 640 indicating the chunk included in the container. The chunk information 640 indicating one chunk includes a fingerprint 641 of the chunk, an offset 642 from the top of the container to the chunk, and a length 643 of the chunk. The deduplication processing unit 310 can identify the chunk included in the container based on the container index 520.

The container 530 includes a container ID 650 indicating the container and at least one chunk information 660 indicating the chunk included in the container. The chunk information 660 indicating one chunk includes a length 661 of the chunk and chunk data 662 that is data of the chunk. The beginning of the chunk data 662 in the container 530 is indicated by an offset 642 in the container index 520. The length of the chunk data 662 in the container 530 is indicated by a length 643 in the container index 520 and a length 661 in the container 530. The deduplication processing unit 310 can acquire the chunk data 662 from the container 530.
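The relationship between the container index 520 and the container 530 can be sketched as follows. The on-disk layout is not specified in this document, so the 4-byte big-endian length prefix, the SHA-1 fingerprint, and the function names are assumptions. The index entry holds the offset at which the chunk data begins and its length.

```python
import hashlib
import struct
from typing import Dict, List, Tuple


def build_container(chunks: List[bytes]) -> Tuple[bytes, Dict[bytes, Tuple[int, int]]]:
    """Return container bytes and a container index: fingerprint -> (offset, length)."""
    body = bytearray()
    index: Dict[bytes, Tuple[int, int]] = {}
    for data in chunks:
        body += struct.pack(">I", len(data))          # length 661 (assumed 4 bytes)
        fp = hashlib.sha1(data).digest()              # fingerprint 641 (assumed SHA-1)
        index[fp] = (len(body), len(data))            # offset 642 and length 643
        body += data                                  # chunk data 662
    return bytes(body), index


def read_chunk(container: bytes, index: Dict[bytes, Tuple[int, int]], fp: bytes) -> bytes:
    offset, length = index[fp]
    return container[offset:offset + length]


container, container_index = build_container([b"alpha", b"be"])
restored = read_chunk(container, container_index, hashlib.sha1(b"be").digest())
```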

The chunk index 540 indicating a group of chunks that share a part of the fingerprint includes a group ID 670 indicating a part of the fingerprint and at least one chunk information 680 indicating a chunk belonging to the group. The chunk information 680 indicating one chunk includes a fingerprint 681 of the chunk and a container ID 682 indicating a container to which the chunk belongs. By searching the chunk index 540, the deduplication processing unit 310 can identify a chunk that may overlap with a work chunk from the processing data 420, and can identify a container that includes the chunk.

The chunk index 540 of the present embodiment includes the chunk information 680 of chunks in registered containers and does not include the chunk information 680 of chunks in unregistered containers. As a result, an unregistered container is never found by a subsequent duplicate determination and is therefore never staged.

The deduplication processing unit 310 instructs the file system management unit 320 to write out the processing data 420. In response to the instruction, the file system management unit 320 writes each of the plurality of contents 510, the plurality of container indexes 520, the plurality of containers 530, and the plurality of chunk indexes 540 to the disk array 60 as files.

Hereinafter, the operation of the deduplication processing unit 310 will be described.

FIG. 17 shows the backup process.

When the deduplication processing unit 310 receives a backup data write request from the host computer 30, the deduplication processing unit 310 starts backup processing. The backup data may be a file or may include a plurality of files.

The deduplication processing unit 310 performs backup initial processing for backup initialization (S110).

Here, the backup initial processing will be explained. The deduplication processing unit 310 opens a work container that is a new container (S120). Here, the deduplication processing unit 310 creates an updatable work container in the work data 440. Thereafter, the deduplication processing unit 310 opens a work container index which is a new container index corresponding to the work container (S130). Here, the deduplication processing unit 310 creates an updatable work container index in the work data 440. Thereafter, the deduplication processing unit 310 opens work content that is content corresponding to the backup data from the processing data 420 (S140). Here, the deduplication processing unit 310 copies the work content from the processing data 420 on the disk array 60 to the work data 440 on the memory 120 so that the work content can be updated. The above is the backup initial processing.

Following the backup initial processing, the deduplication processing unit 310 divides the backup data into a plurality of chunks and assigns chunk numbers to the divided chunks (S150).

Thereafter, the deduplication processing unit 310 selects a work chunk number in order from the plurality of chunk numbers, performs a chunk write process (described later) on the work chunk, which is the chunk having the work chunk number (S160), and determines whether the chunk write process has been performed on all of the divided chunks (S170). When it is determined that the chunk write process has not been performed on all the chunks (S170: no), the deduplication processing unit 310 returns to S160 and selects the next work chunk number.

On the other hand, when it is determined that the chunk write process has been performed on all the chunks (S170: yes), the deduplication processing unit 310 performs a backup end process for ending the backup (S180), and ends this flow.

Here, the backup end process will be described. The deduplication processing unit 310 closes the work content (S210). Here, the deduplication processing unit 310 replaces the work content in the processing data 420 with the work content in the work data 440. Thereafter, the deduplication processing unit 310 closes the work container index (S220). Here, the deduplication processing unit 310 adds the work container index in the work data 440 to the processing data 420. Thereafter, the deduplication processing unit 310 enables the close flag (S230). The deduplication processing unit 310 stores the close flag on the memory 120. The close flag indicates whether to close the work container. Thereafter, the deduplication processing unit 310 performs the container close process (described later) (S240). Here, the deduplication processing unit 310 adds the work container in the work data 440 to the processing data 420. The above is the backup end process.

The above is the backup process.
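The outer loop of the backup process (S150 to S180) can be sketched as follows. This is an illustration under stated assumptions, not the embodiment itself: fixed-size chunking is assumed only for simplicity, and `split_into_chunks`, `backup`, `write_chunk`, and `finish` are hypothetical names, with the two callbacks standing in for the chunk write process and the backup end process.

```python
from typing import Callable, List


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """S150: divide backup data; list indices serve as chunk numbers."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def backup(data: bytes, chunk_size: int,
           write_chunk: Callable[[int, bytes], None],
           finish: Callable[[], None]) -> int:
    """Run S150-S180 and return the number of chunks processed."""
    chunks = split_into_chunks(data, chunk_size)
    for number, chunk in enumerate(chunks):   # S160/S170: one pass per chunk number
        write_chunk(number, chunk)
    finish()                                  # S180: backup end process
    return len(chunks)
```

Processing the chunks in chunk-number order matters later, because the close and registration decisions depend on how far apart the chunk numbers stored in one work container are.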

The file system management unit 320 reads a file from the disk array 60 to the memory 120 in accordance with an instruction to open the file in the processing data 420 by the deduplication processing unit 310. Further, the file system management unit 320 writes the file from the memory 120 to the disk array 60 in accordance with the instruction to close the file in the processing data 420 by the deduplication processing unit 310.

Here, the chunk writing process in S160 of the above-described backup process will be described.

FIG. 18 shows chunk write processing.

The deduplication processing unit 310 acquires the designated work chunk (S310). Thereafter, the deduplication processing unit 310 determines whether or not the work chunk overlaps with a staged chunk (S320). Here, when the chunk data of the work chunk is the same as any of the chunk data in the containers staged in the work data 440, the deduplication processing unit 310 determines that the work chunk overlaps with a staged chunk. Thereby, the deduplication processing unit 310 can determine at high speed that the work chunk overlaps with a chunk staged in the memory 120.

If it is determined that the work chunk overlaps with the staged chunk (S320: yes), the deduplication processing unit 310 shifts the process to S410.

When it is determined that the work chunk does not overlap with the staged chunk (S320: no), the deduplication processing unit 310 determines whether the work chunk is registered in the chunk index 540 (S330). Here, the deduplication processing unit 310 calculates the fingerprint of the work chunk and, when it detects that fingerprint in the chunk index 540, determines that the work chunk is registered in the chunk index 540. Furthermore, the deduplication processing unit 310 identifies the chunk indicated by the detected fingerprint 681 as a conforming chunk, and identifies the container indicated by the container ID 682, which includes the conforming chunk, as a conforming container. As a result, the deduplication processing unit 310 can determine whether or not a conforming chunk that may overlap with the work chunk is stored in the disk array 60. Note that the deduplication processing unit 310 may speed up the determination in S330 using a Bloom filter.
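The lookup in S330 can be sketched as a two-level dictionary keyed first by group ID and then by fingerprint. This is a minimal illustration: SHA-1 and the two-byte group ID are assumptions chosen for the example (the text only says the fingerprint is a hash value and the group ID is part of it), and `fingerprint`, `register`, and `find_conforming_container` are hypothetical names.

```python
import hashlib
from typing import Dict, Optional

GROUP_ID_BYTES = 2  # assumption: how much of the fingerprint forms the group ID

# group ID -> {fingerprint -> container ID}; stands in for chunk index 540
ChunkIndex = Dict[bytes, Dict[bytes, int]]


def fingerprint(chunk: bytes) -> bytes:
    """Hash of the chunk data (SHA-1 chosen only for illustration)."""
    return hashlib.sha1(chunk).digest()


def register(index: ChunkIndex, fp: bytes, container_id: int) -> None:
    """Record that the chunk with fingerprint fp lives in container_id."""
    index.setdefault(fp[:GROUP_ID_BYTES], {})[fp] = container_id


def find_conforming_container(index: ChunkIndex, chunk: bytes) -> Optional[int]:
    """S330: return the conforming container ID, or None for a new chunk."""
    fp = fingerprint(chunk)
    return index.get(fp[:GROUP_ID_BYTES], {}).get(fp)
```

A hit here only names a candidate container; the chunk data is still compared in S350 after the container is staged.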

If it is determined that the work chunk is not registered in the chunk index 540 (the work chunk is a new chunk) (S330: no), the deduplication processing unit 310 shifts the process to S360.

When it is determined that the work chunk is registered in the chunk index 540 (S330: yes), the deduplication processing unit 310 performs a conforming container stage process (described later) for the conforming container (S340).

Thereafter, the deduplication processing unit 310 determines whether or not the work chunk is included in the conforming container (S350). Here, the deduplication processing unit 310 determines that the work chunk is included in the conforming container when the chunk data of the work chunk is the same as the chunk data of the conforming chunk in the conforming container staged by the stage process. Thereby, the deduplication processing unit 310 can determine that the work chunk overlaps with a chunk stored in the processing data 420 in the disk array 60.

When it is determined that the work chunk is not included in the conforming container (the work chunk is a new chunk) (S350: no), the deduplication processing unit 310 performs a new chunk storage process (described later) (S360), and shifts the process to S410.

On the other hand, when it is determined that the work chunk is included in the conforming container (S350: yes), the deduplication processing unit 310 performs the container close determination process (described later) (S410), and then performs the container close process (described later) (S420).

Thereafter, the deduplication processing unit 310 determines whether or not the work container is closed (S430). When it is determined that the work container is not closed (S430: no), the deduplication processing unit 310 shifts the process to S450. When it is determined that the work container is closed (S430: yes), the deduplication processing unit 310 opens a work container that is a new container (S440).

Thereafter, the deduplication processing unit 310 registers the work chunk information in the work content in the work data 440 (S450), and ends this flow.

The above is the chunk writing process.
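The branch structure of S320 to S350 can be condensed into one function that classifies a work chunk. The sketch below is illustrative only: `classify_chunk` and its parameters are hypothetical, staged containers are modelled as lists of chunk data, and the callbacks `lookup` and `stage` stand in for the chunk index search and the conforming container stage process.

```python
from typing import Callable, Dict, List, Optional


def classify_chunk(
    chunk: bytes,
    staged: Dict[int, List[bytes]],
    lookup: Callable[[bytes], Optional[int]],
    stage: Callable[[int], List[bytes]],
) -> str:
    """Return 'staged-duplicate', 'stored-duplicate', or 'new' for a work chunk."""
    # S320: compare against chunk data already staged in memory.
    if any(chunk in chunks for chunks in staged.values()):
        return "staged-duplicate"
    # S330: consult the chunk index for a conforming container.
    container_id = lookup(chunk)
    if container_id is None:
        return "new"
    # S340: stage the conforming container if it is not staged yet.
    if container_id not in staged:
        staged[container_id] = stage(container_id)
    # S350: confirm by comparing chunk data, not only the index entry.
    return "stored-duplicate" if chunk in staged[container_id] else "new"
```

Only the "new" result leads to the new chunk storage process (S360); all three results then continue to the close determination in S410.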

Here, the stage process in S340 of the above-described chunk write process will be described.

FIG. 19 shows the conforming container stage process.

The deduplication processing unit 310 determines whether or not the conforming container is staged (S510). Thereby, even if there is a period between S320 and S340 during which the work data 440 is not locked, the deduplication processing unit 310 can confirm that the conforming container has not already been staged.

When it is determined that the conforming container is staged (S510: yes), the deduplication processing unit 310 ends this flow.

When it is determined that the conforming container is not staged (S510: no), the deduplication processing unit 310 determines whether or not the number of stage containers has reached the upper limit value of the number of stage containers (S520).

When it is determined that the number of stage containers has not reached the upper limit value of the number of stage containers (S520: no), the deduplication processing unit 310 shifts the process to S540.

On the other hand, when it is determined that the number of stage containers has reached the upper limit value of the number of stage containers (S520: yes), the deduplication processing unit 310 selects the oldest staged container among the staged containers and destages the selected container (invalidates the container on the memory) (S530). Thereafter, the deduplication processing unit 310 stages the conforming container (S540), and ends this flow. At this time, the deduplication processing unit 310 instructs the file system management unit 320 to stage the conforming container. In accordance with the instruction, the file system management unit 320 reads the container from the disk array 60 into the work data 440 in the memory 120.

The above is the conforming container stage process. According to this process, when the number of stage containers reaches the upper limit value of the number of stage containers, the storage apparatus 10 can replace the oldest staged container in the memory 120 with the conforming container read from the disk array 60.

In S520 described above, the deduplication processing unit 310 may instead measure the amount of containers staged in the memory 120 and determine whether or not the measured amount is equal to or greater than a predetermined stage amount upper limit value. Here, the amount of containers staged in the memory 120 may be the number of containers staged in the memory 120, the total size of the containers staged in the memory 120, or the total number of chunks in the containers staged in the memory 120.
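The stage process amounts to a bounded in-memory set of containers with oldest-first replacement. A minimal sketch, assuming the limit is counted in containers and using hypothetical names (`StageArea`, `read_container`), could look like this:

```python
from collections import OrderedDict
from typing import Callable, List


class StageArea:
    """Holds staged containers in memory, evicting the oldest staged one first."""

    def __init__(self, limit: int, read_container: Callable[[int], List[bytes]]):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.read_container = read_container   # stands in for the read from disk
        self.staged: "OrderedDict[int, List[bytes]]" = OrderedDict()

    def stage(self, container_id: int) -> List[bytes]:
        if container_id in self.staged:            # S510: already staged
            return self.staged[container_id]
        if len(self.staged) >= self.limit:         # S520: upper limit reached
            self.staged.popitem(last=False)        # S530: drop the oldest staged
        self.staged[container_id] = self.read_container(container_id)  # S540
        return self.staged[container_id]
```

Note that a hit in S510 does not refresh the container's position, so this is first-in-first-out by staging time rather than least-recently-used, which matches the "oldest staged container" wording above.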

Here, the new chunk storage process in S360 of the above-described chunk write process will be described.

FIG. 20 shows a new chunk storage process.

The deduplication processing unit 310 determines whether or not the number of stored chunks has reached the upper limit value of the number of stored chunks (S610). The fact that the number of stored chunks has reached the upper limit value of the number of stored chunks is sometimes called a stored chunk number condition.

When it is determined that the number of stored chunks has not reached the upper limit value of the number of stored chunks (S610: no), the deduplication processing unit 310 shifts the process to S710.

On the other hand, when it is determined that the number of stored chunks has reached the upper limit value of the number of stored chunks (S610: yes), the deduplication processing unit 310 enables the close flag (S620). As a result, the deduplication processing unit 310 can limit the number of stored chunks. Thereafter, the deduplication processing unit 310 performs container close processing (described later) (S630). Thereby, the work container is closed. Thereafter, the deduplication processing unit 310 opens the new container as a work container (S640).

Thereafter, the deduplication processing unit 310 stores the work chunk in the work container (S710). Thereafter, the deduplication processing unit 310 performs section setting processing (S720).

Here, the section setting process will be described. The deduplication processing unit 310 records the work chunk number in the tail (S730). Thereafter, the deduplication processing unit 310 determines whether a chunk number is recorded in the head (S740). When it is determined that the head stores the chunk number (S740: yes), the deduplication processing unit 310 ends the section setting process. On the other hand, when it is determined that the head does not store the chunk number (S740: no), the deduplication processing unit 310 records the work chunk number in the head (S750), and ends the section setting process. The above is the section setting process.

In the subsequent new chunk storage process, the deduplication processing unit 310 registers the chunk information 640 of the work chunk in the work container index (S760), and ends this flow.

The above is the new chunk storage process. According to this processing, the storage apparatus 10 can store the work chunk in the work container and record the head and tail.
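The new chunk storage process, including the head and tail bookkeeping of the section setting process, can be sketched as follows. The names `WorkContainer` and `store_new_chunk` are hypothetical, the registration in the work container index (S760) is omitted, and closing is reduced to appending the full container to a list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkContainer:
    chunks: List[bytes] = field(default_factory=list)
    head: Optional[int] = None   # chunk number of the first stored chunk
    tail: Optional[int] = None   # chunk number of the most recently stored chunk

    def store(self, chunk_number: int, chunk: bytes) -> None:
        self.chunks.append(chunk)        # S710: store the work chunk
        self.tail = chunk_number         # S730: always update the tail
        if self.head is None:            # S740: head not recorded yet?
            self.head = chunk_number     # S750: record the head once


def store_new_chunk(work: WorkContainer, chunk_number: int, chunk: bytes,
                    max_chunks: int, closed: List[WorkContainer]) -> WorkContainer:
    """Return the work container to use after storing the chunk."""
    if len(work.chunks) >= max_chunks:   # S610: stored chunk number condition
        closed.append(work)              # S620-S630: close the full container
        work = WorkContainer()           # S640: open a new work container
    work.store(chunk_number, chunk)
    return work
```

Because the head is written only once and the tail on every store, the pair always brackets the range of chunk numbers held by the work container.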

Here, the container close determination process in S410 of the above-described chunk write process will be described.

FIG. 21 shows container close determination processing.

The deduplication processing unit 310 determines whether or not the distance evaluation value is larger than the distance evaluation threshold (S810). When it is determined that the distance evaluation value is equal to or less than the distance evaluation threshold (S810: no), the deduplication processing unit 310 ends this flow. On the other hand, when it is determined that the distance evaluation value is greater than the distance evaluation threshold (S810: yes), the deduplication processing unit 310 determines whether the number of stored chunks is equal to or greater than the separation chunk number threshold (S820). When it is determined that the number of stored chunks is smaller than the separation chunk number threshold (S820: no), the deduplication processing unit 310 ends this flow. On the other hand, when it is determined that the number of stored chunks is equal to or greater than the separation chunk number threshold (S820: yes), the deduplication processing unit 310 enables the close flag (S830) and ends this flow.

The above is the container close determination process. According to this process, the storage apparatus 10 can suppress a decrease in the deduplication rate due to the container close process described later by setting a container including a plurality of chunks whose positions in the backup data are close to each other as a registered container. Further, the storage apparatus 10 can keep the number of stored chunks of each container at or above the separation chunk number threshold by using the separation chunk number threshold.
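The two tests in S810 and S820 can be written as a single predicate. This sketch assumes that chunk positions are chunk numbers and that the distance evaluation value is the difference between the work chunk number and the tail of the work container; `should_close` and its parameter names are hypothetical.

```python
from typing import Optional


def should_close(tail: Optional[int], work_chunk_number: int, stored_chunks: int,
                 distance_threshold: int, separation_chunk_threshold: int) -> bool:
    """S810-S830: decide whether to enable the close flag (separation condition)."""
    if tail is None:                       # empty work container: nothing to compare
        return False
    distance = work_chunk_number - tail    # assumed form of the distance evaluation value
    if distance <= distance_threshold:     # S810: no
        return False
    return stored_chunks >= separation_chunk_threshold  # S820
```

For example, with a distance threshold of 4 and a separation chunk number threshold of 3, a work container whose tail is chunk 10 and which holds 5 chunks is flagged for closing when the work chunk is number 20, but not when it is number 12.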

Here, the container close process in S240 of the backup process and S420 of the chunk write process will be described.

FIG. 22 shows the container close process.

The deduplication processing unit 310 determines whether or not the close flag is valid (S910). When it is determined that the close flag is invalid (S910: no), the deduplication processing unit 310 ends this flow. On the other hand, when it is determined that the close flag is valid (S910: yes), the deduplication processing unit 310 performs a registered container determination process (S920).

Here, the registered container determination process will be described. The deduplication processing unit 310 determines whether or not the interval evaluation value is smaller than the interval evaluation threshold (S930). When it is determined that the interval evaluation value is greater than or equal to the interval evaluation threshold (S930: no), the deduplication processing unit 310 ends the registered container determination process. On the other hand, when it is determined that the interval evaluation value is smaller than the interval evaluation threshold (S930: yes), the deduplication processing unit 310 determines whether or not the number of stored chunks is equal to or greater than the lower limit value of the number of stored chunks (S940). When it is determined that the number of stored chunks is smaller than the lower limit value of the number of stored chunks (S940: no), the deduplication processing unit 310 ends the registered container determination process. When it is determined that the number of stored chunks is equal to or greater than the lower limit value of the number of stored chunks (S940: yes), the deduplication processing unit 310 registers the chunk information 680 of all the chunks in the work container in the chunk index 540 (S950), and ends the registered container determination process.

In the subsequent container close process, the deduplication processing unit 310 closes the work container (S960), and ends this flow. At this time, the deduplication processing unit 310 instructs the file system management unit 320 to destage the work container. In response to the instruction, the file system management unit 320 writes the work container from the memory 120 to the disk array 60 and invalidates the work container on the memory 120.

The above is the container close process. According to this process, the storage apparatus 10 can close the work container when the work container satisfies either the separation condition or the stored chunk number condition. Further, by treating a container that includes new chunks whose positions in the backup data are far apart from each other, and a container with a small number of stored chunks, as unregistered containers, the storage apparatus 10 can reduce the number of stage and destage operations and thereby suppress a decrease in its performance. Further, the storage apparatus 10 can keep the number of stored chunks of each registered container equal to or greater than the lower limit value of the number of stored chunks by using that lower limit value.
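The registration decision in S930 and S940 can be sketched as below. The claims describe the interval evaluation value as the tail position minus the head position, divided by a count; this sketch assumes the divisor is the number of stored chunks, and the function and parameter names are hypothetical.

```python
def interval_evaluation(head: int, tail: int, stored_chunks: int) -> float:
    """(tail - head) / stored chunks; the divisor is an assumption of this sketch."""
    if stored_chunks <= 0:
        raise ValueError("work container holds no chunks")
    return (tail - head) / stored_chunks


def should_register(head: int, tail: int, stored_chunks: int,
                    interval_threshold: float, min_stored_chunks: int) -> bool:
    """S930-S940: True when the closed container's chunks go into the chunk index."""
    # The count check (S940) is done first here only to avoid dividing by zero;
    # both conditions must hold, so the order does not change the result.
    if stored_chunks == 0 or stored_chunks < min_stored_chunks:
        return False
    return interval_evaluation(head, tail, stored_chunks) < interval_threshold  # S930
```

With an interval threshold of 2.0 and a lower limit of 4 chunks, a container holding 10 chunks numbered between 0 and 9 would be registered (value 0.9), while one holding 10 chunks spread between 0 and 100 would not (value 10.0).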

Here, the read process will be described.

FIG. 23 shows the Read process.

When the deduplication processing unit 310 receives a backup data read request from the host computer 30, the deduplication processing unit 310 starts read processing.

The deduplication processing unit 310 opens the work content that is the content corresponding to the backup data in the processing data 420 (S2110). Thereafter, the deduplication processing unit 310 selects chunks as selected chunks in the order of offset 621 from the work content, and acquires the chunk information 620 of each selected chunk (S2120). Thereafter, the deduplication processing unit 310 identifies the container ID 623 of the container including the selected chunk and the fingerprint 624 of the selected chunk, identifies the offset 642 and the length 643 corresponding to that fingerprint in the container index 520 corresponding to the identified container ID, and reads the chunk data 662 at the identified offset and length from the container 530 corresponding to the identified container ID (S2130). Thereafter, the deduplication processing unit 310 determines whether all chunks in the work content have been read (S2140).

If it is determined that all the chunks in the work content have not yet been read (S2140: no), the deduplication processing unit 310 shifts the processing to S2120 and selects the next selected chunk. If it is determined that all chunks in the work content have been read (S2140: yes), the deduplication processing unit 310 closes the work content (S2150), and ends this flow.

The above is the Read process. According to this processing, the backup data designated by the host computer 30 can be reconstructed from the processing data 420, and the reconstructed backup data can be transmitted to the host computer 30.
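The reconstruction loop in S2120 to S2140 can be sketched with dictionaries standing in for the stored files. The function `read_backup` and the shapes of its arguments are assumptions made for the example: the content is a list of (offset, container ID, fingerprint) tuples, each container index maps a fingerprint to (offset in container, length), and each container is a byte string.

```python
from typing import Dict, List, Tuple


def read_backup(content: List[Tuple[int, int, bytes]],
                container_indexes: Dict[int, Dict[bytes, Tuple[int, int]]],
                containers: Dict[int, bytes]) -> bytes:
    """Rebuild backup data by following content -> container index -> container."""
    out = bytearray()
    # S2120: visit chunk information in order of offset within the backup data.
    for _, container_id, fp in sorted(content, key=lambda c: c[0]):
        # S2130: locate the chunk inside its container via the container index.
        offset, length = container_indexes[container_id][fp]
        out += containers[container_id][offset:offset + length]
    return bytes(out)
```

A chunk that occurs more than once in the backup data appears as several content entries pointing at the same container location, which is how deduplicated data is expanded again on read.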

The terms used to express the present invention will be described. As the storage apparatus, the storage apparatus 90, the storage apparatus 10, or the like may be used. As the host computer, the host computer 80, the host computer 30, or the like may be used. As the memory, the memory 91, the memory 120, or the like may be used. As the processor, the CPU 110 or the like may be used. As the storage device, the disk 92, the disk array 60, or the like may be used. As the chunk information, a fingerprint or the like may be used. As the index information, the chunk index 540 or the like may be used. As the closing condition, the separation condition, the stored chunk number condition, or the like may be used.

DESCRIPTION OF SYMBOLS 10 ... Storage device, 30 ... Host computer, 40 ... Node, 60 ... Disk array, 80 ... Host computer, 90 ... Storage device, 91 ... Memory, 92 ... Disk, 120 ... Memory, 130 ... FC port, 140 ... HDD, 210 ... Controller, 230 ... HDD

Claims

A storage device;
Memory,
A processor connected to the storage device, the memory, and a host computer;
With
The processor receives write data from the host computer,
The processor creates a work container in the memory that is a container for containing data;
The processor divides the write data into a plurality of chunks,
The processor selects one of the plurality of chunks as a work chunk in order of a chunk position indicating a position of each of the plurality of chunks in the write data,
The processor determines, based on index information including chunk information based on each chunk in the storage device, whether a conforming chunk that may overlap with the work chunk is stored in the storage device;
If it is determined that the conforming chunk is stored in the storage device, the processor reads a conforming container, which is a container containing the conforming chunk, into the memory;
The processor determines, based on the conforming container, whether the work chunk overlaps with the conforming chunk;
If it is determined that the work chunk does not overlap with the conforming chunk, the processor includes the work chunk in the work container;
The processor determines whether the work container satisfies a predetermined closing condition;
If it is determined that the work container satisfies the closing condition, the processor determines whether the work container satisfies a predetermined registration condition;
If it is determined that the work container satisfies the registration condition, the processor includes chunk information based on each chunk in the work container in the index information,
If it is determined that the work container satisfies the closing condition, the processor writes the work container to the storage device;
Storage device.
The processor calculates the number of stored chunks, which is the number of chunks in the work container;
The processor calculates an interval evaluation value indicating a spread of a chunk position in the work container;
The processor determines that the registration condition is satisfied when the interval evaluation value is smaller than a predetermined interval evaluation threshold value and the number of stored chunks is equal to or greater than a predetermined storage chunk number lower limit value.
The storage apparatus according to claim 1.
The processor calculates a distance evaluation value indicating a distance from a chunk position of the last chunk in the work container to a chunk position of the work chunk;
The processor determines that the closing condition is satisfied when the distance evaluation value is greater than a predetermined distance evaluation threshold and the number of stored chunks is equal to or greater than a predetermined separation chunk number threshold.
The storage apparatus according to claim 2.
The processor calculates the interval evaluation value by dividing a value obtained by subtracting the chunk position of the first chunk in the work container from the chunk position of the last chunk in the work container by the number of storage containers,
The storage device according to claim 3.
The processor determines that the closing condition is satisfied when the number of stored chunks is equal to a predetermined upper limit value of the number of stored chunks;
The storage apparatus according to claim 4.
The processor calculates a hash value of the work chunk as chunk information of the work chunk;
The storage apparatus according to claim 1.
The processor determines whether the work chunk overlaps with a chunk in the memory;
If it is determined that the work chunk does not overlap with a chunk in the memory, the processor determines whether the index information includes chunk information of the conforming chunk;
If it is determined that the index information includes chunk information of the conforming chunk, the processor determines that the conforming chunk is stored in the storage device;
The storage apparatus according to claim 1.
When it is determined that the conforming chunk is stored in the storage device, the processor determines whether the amount of containers in the memory is equal to or greater than a predetermined stage amount upper limit value, and, if it is determined that the amount of containers in the memory is equal to or greater than the stage amount upper limit value, selects the oldest loaded container among the containers in the memory and replaces the selected container with the conforming container.
The storage apparatus according to claim 1.
The storage device stores setting information including any of the interval evaluation threshold value, the storage chunk number lower limit value, the distance evaluation threshold value, and the separation chunk number threshold value.
The storage device according to claim 3.
The processor writes content information including chunk information of the plurality of chunks and the index information to the storage device;
The storage apparatus according to claim 1.
A storage device;
A host computer connected to the storage device;
With
The storage device
A storage device;
Memory,
A processor connected to the storage device, the memory, and the host computer;
With
The host computer stores data, generates write data based on the data, sends the write data to the storage device,
The processor receives write data from the host computer,
The processor creates a work container in the memory that is a container for containing data;
The processor divides the write data into a plurality of chunks,
The processor selects one of the plurality of chunks as a work chunk in order of a chunk position indicating a position of each of the plurality of chunks in the write data,
The processor determines, based on index information including chunk information based on each chunk in the storage device, whether a conforming chunk that may overlap with the work chunk is stored in the storage device;
If it is determined that the conforming chunk is stored in the storage device, the processor reads a conforming container, which is a container containing the conforming chunk, into the memory;
The processor determines, based on the conforming container, whether the work chunk overlaps with the conforming chunk;
If it is determined that the work chunk does not overlap with the conforming chunk, the processor includes the work chunk in the work container;
The processor determines whether the work container satisfies a predetermined closing condition;
If it is determined that the work container satisfies the closing condition, the processor determines whether the work container satisfies a predetermined registration condition;
If it is determined that the work container satisfies the registration condition, the processor includes chunk information based on each chunk in the work container in the index information,
If it is determined that the work container satisfies the closing condition, the processor writes the work container to the storage device;
file server.
Using a processor connected to the storage device, the memory and the host computer, the write data is received from the host computer,
Using the processor, a work container that is a container for containing data is created in the memory,
Using the processor, the write data is divided into a plurality of chunks,
Using the processor, one of the plurality of chunks is selected as a work chunk in the order of chunk positions indicating the positions of the plurality of chunks in the write data,
Using the processor, it is determined, based on index information including chunk information based on each chunk in the storage device, whether a conforming chunk that may overlap with the work chunk is stored in the storage device;
If it is determined that the conforming chunk is stored in the storage device, the processor is used to read a conforming container that is a container containing the conforming chunk into the memory;
Using the processor to determine whether the work chunk overlaps with the conforming chunk based on the conforming container;
If it is determined that the work chunk does not overlap with the conforming chunk, the processor is used to include the work chunk in the work container;
Using the processor to determine whether the work container satisfies a predetermined closing condition;
If it is determined that the work container satisfies the closing condition, the processor is used to determine whether the work container satisfies a predetermined registration condition;
When it is determined that the work container satisfies the registration condition, using the processor, the chunk information based on each chunk in the work container is included in the index information,
If it is determined that the work container satisfies the closing condition, the processor is used to write the work container to the storage device;
A data storage method comprising: