JP6888446B2

JP6888446B2 - Information processing device, deduplication rate identification method and deduplication rate identification program

Info

Publication number: JP6888446B2
Application number: JP2017134894A
Authority: JP
Inventors: 和彦臼井
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-07-10
Filing date: 2017-07-10
Publication date: 2021-06-16
Anticipated expiration: 2037-07-10
Also published as: JP2019016293A

Description

The present invention relates to an information processing device, a deduplication rate identification method, and a deduplication rate identification program.

A storage device reduces disk capacity consumption with a deduplication function that avoids storing identical data more than once. Deduplication is performed per thin provisioning pool. Here, a thin provisioning pool is a physical area in which data is stored by thin provisioning. In the following description, a thin provisioning pool is simply called a pool.

The storage device hashes unit data of a predetermined size using the SHA-1 function and stores the resulting 20-byte hash value in main memory. Here, SHA-1 is a hash function. The storage device also stores the unit data on disk.

When unit data is updated, the storage device hashes the updated data with the SHA-1 function into a 20-byte hash value and searches for an identical existing hash value to determine whether identical data already exists.

If no identical hash value is found, the storage device stores the hash value in main memory and stores the updated unit data on disk. If an identical hash value is found, the storage device increments the reference count associated with that hash value by one. The reference count is the number of unit data items in the pool that correspond to the hash value.
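
The write path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DedupPool`, the in-memory dictionaries standing in for main memory and disk, the 4 KB chunk size, and the initial reference count of 1 for a newly stored chunk are all hypothetical and are not taken from the patent.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # assumed unit-data size in bytes


class DedupPool:
    """Toy model of one pool (hypothetical, for illustration only)."""

    def __init__(self):
        self.ref_counts = {}  # SHA-1 digest -> reference count (kept in memory)
        self.disk = {}        # SHA-1 digest -> chunk bytes (stand-in for the disk)

    def write_chunk(self, chunk: bytes) -> bool:
        """Store one chunk; return True if it was physically written."""
        digest = hashlib.sha1(chunk).digest()  # 20-byte hash value
        if digest in self.ref_counts:
            # Identical data already exists: only bump the reference count.
            self.ref_counts[digest] += 1
            return False
        # New data: remember the hash and store the chunk itself.
        self.ref_counts[digest] = 1  # assumption: a new chunk starts at 1
        self.disk[digest] = chunk
        return True


pool = DedupPool()
first = pool.write_chunk(b"a" * CHUNK_SIZE)   # new data, written
second = pool.write_chunk(b"a" * CHUNK_SIZE)  # duplicate, not written
third = pool.write_chunk(b"b" * CHUNK_SIZE)   # new data, written
```

After these three writes the toy pool holds two physical chunks, and the chunk of `a` bytes has a reference count of 2.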

In connection with techniques that check data identity for deduplication, there is a technique for classifying texts. This technique performs textual entailment recognition among the given texts, selects individual texts, generates for each selected text a group whose members are the texts that entail it, and merges groups when a predetermined condition based on the degree of member overlap between groups is satisfied. With this technique, multiple texts can be classified into groups that give an overview, and texts that are semantically in an entailment relation can be placed in the same group even when no entailment relation was determined between them.

In connection with the allocation of storage devices to users, there is a technique for efficiently and fairly managing the storage capacity used by each user in a storage system that has a hierarchical structure containing multiple storage devices. This technique calculates, from storage cost coefficients and user cost distribution information, an ideal usage amount for each user, which represents the ideal distribution of used capacity across the storage devices, and assigns the ideal usage amount for each user to the storage devices in descending order of performance.

Japanese Patent No. 6008067
Japanese Translation of PCT International Application Publication No. 2012-516479

When business operations run on multiple pools, it is sometimes desirable to merge those pools. However, because data may be duplicated between the pools, the size of the physical area required after merging cannot be estimated in advance. The required size can be calculated by actually reading the data of the pools, computing the hash values, and checking for duplicates, but that processing takes time.

In one aspect, an object of the present invention is to efficiently calculate the deduplication rate obtained when multiple pools are merged, so that the size of the physical area required after merging can be estimated in advance.

In one aspect, an information processing device has a first calculation unit and a second calculation unit. For unit data of a predetermined size, the first calculation unit calculates a duplicate count, which indicates the number of unit data items duplicated between two virtual storage pools, and the total number of unit data items contained in the two virtual storage pools. The second calculation unit uses the duplicate count and the total number calculated by the first calculation unit to calculate the reduction rate of the physical area achieved by deduplication, and calculates the deduplication rate from the reduction rate.

In one aspect, the present invention can efficiently calculate the deduplication rate obtained when multiple pools are merged.

FIG. 1 is a diagram showing the hardware configuration of an information processing system according to an embodiment.
FIG. 2 is a diagram showing the functional configuration of a CM.
FIG. 3 is a diagram showing an example of a hash table.
FIG. 4 is a diagram for explaining the acquisition of hash values and reference counts from another CM.
FIG. 5 is a flowchart showing the flow of the deduplication rate calculation process.
FIG. 6 is a flowchart showing the flow of the duplicate count calculation process.

Embodiments of the information processing device, the deduplication rate identification method, and the deduplication rate identification program disclosed in the present application are described in detail below with reference to the drawings. The embodiments do not limit the disclosed technology.

First, the hardware configuration of the information processing system according to the embodiment is described. FIG. 1 is a diagram showing the hardware configuration of the information processing system according to the embodiment. As shown in FIG. 1, the information processing system 1 according to the embodiment has a host 2 and a storage device 3. The host 2 performs information processing using the storage device 3. The storage device 3 stores the data used by the host 2.

The storage device 3 has a CM (Controller Module) 4 and a volume storage device 9. The CM 4 is a control device that controls the storage device 3 and is also an information processing device that performs information processing. The volume storage device 9 stores a plurality of volumes 9a. The volume storage device 9 consists of, for example, multiple HDDs (Hard Disk Drives) or SSDs (Solid State Drives).

The CM 4 has a CA (Channel Adapter) 5, a CPU (Central Processing Unit) 6, a main memory 7, and two FCs (Fiber Channel) 8.

The CA 5 is an interface to the host 2. The CPU 6 is a central processing unit that reads a program from the main memory 7 and executes it. The main memory 7 is a RAM (Random Access Memory) that stores programs, intermediate results of program execution, and the like. The FC 8 is an interface to the volume storage device 9. The FC 8 is provided redundantly.

Although only one storage device 3 is shown here for convenience of explanation, the information processing system 1 may have multiple storage devices 3. The storage device 3 may also have multiple CMs 4.

Next, the functional configuration of the CM 4 is described. FIG. 2 is a diagram showing the functional configuration of the CM 4. As shown in FIG. 2, the CM 4 has a deduplication rate identification unit 4a. The deduplication rate identification unit 4a identifies the deduplication rate between two pools using the following equation (1).

(deduplication rate) = 1 − ((total physical usage) − (number of duplicate hash values) × (chunk size)) / ((total number of reference counts) × (chunk size)) ... (1)

Here, the total physical usage is the sum of the physical usage of the two pools. The total number of reference counts is the sum of the reference counts of the two pools. The chunk size is the size of the unit data, which is the unit of deduplication, and is, for example, 4 kilobytes (KB).

The deduplication rate is a value based on the ratio of the post-merge physical usage ((total physical usage) − (number of duplicate hash values) × (chunk size)) to the logical usage ((total number of reference counts) × (chunk size)). The post-merge physical usage is obtained by subtracting the size of the portion duplicated between the two pools ((number of duplicate hash values) × (chunk size)) from (total physical usage).
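
As a worked illustration of this relationship, the following Python sketch computes the post-merge physical usage and the deduplication rate from the quantities named above. The function name `dedup_rate`, its parameter names, and the sample numbers are hypothetical; the rate is returned as a fraction between 0 and 1, which is an assumption about how the value is represented.

```python
def dedup_rate(total_physical, duplicate_hashes, total_ref_count, chunk_size=4096):
    """Return (post-merge physical usage in bytes, deduplication rate as a fraction)."""
    # Physical usage after merging: remove the chunks stored in both pools.
    merged_physical = total_physical - duplicate_hashes * chunk_size
    # Logical usage: every reference counts as one chunk.
    logical = total_ref_count * chunk_size
    if logical == 0:
        raise ValueError("no unit data in either pool")
    reduction_rate = merged_physical / logical
    return merged_physical, 1 - reduction_rate


# Hypothetical numbers: the pools hold 100 and 60 unique 4 KB chunks,
# 20 hash values appear in both pools, and 400 references exist in total.
merged, rate = dedup_rate(total_physical=160 * 4096,
                          duplicate_hashes=20,
                          total_ref_count=400)
print(merged, round(rate, 2))  # 573440 0.65
```

With these sample values the merged pool needs 140 chunks of physical space for 400 logical chunks, so the reduction rate is 0.35 and the deduplication rate is 0.65.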

The deduplication rate identification unit 4a accepts the identifiers of two pools entered by the user with, for example, a keyboard or a mouse, identifies the deduplication rate that results when the two pools are merged, and outputs the identified deduplication rate together with the physical usage required when the two pools are merged. The deduplication rate identification unit 4a has a storage unit 40, a duplicate count calculation unit 41, a deduplication rate calculation unit 42, and a communication unit 43.

The storage unit 40 stores the hash table information. The storage unit 40 also includes an area for data used temporarily while the deduplication rate is calculated, a buffer used for communication with other CMs 4, and the like.

FIG. 3 is a diagram showing an example of a hash table. As shown in FIG. 3, the hash table associates a LUN (Logical Unit Number), an LBA (Logical Block Address), and a reference count with each hash value. The location in the storage device 3 of the unit data corresponding to a hash value is identified by the combination of LUN and LBA.

For example, the unit data whose hash value is "h" is stored at the location in the storage device 3 where the LUN is "l" and the LBA is "a", and the pool contains "n" copies of the same unit data.
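
One way to picture such a table is a mapping from hash value to a small record. The Python sketch below is a hypothetical in-memory model: the names `HashEntry`, `locate`, and `total_reference_count`, as well as the sample LUN, LBA, and count values, are assumptions made for illustration and do not describe the actual layout used by the CM.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class HashEntry:
    lun: int        # logical unit holding the unit data
    lba: int        # logical block address within that unit
    ref_count: int  # number of identical unit data items in the pool


def locate(table, hash_value):
    """Return the (LUN, LBA) pair that identifies where the unit data is stored."""
    entry = table[hash_value]
    return entry.lun, entry.lba


def total_reference_count(table):
    """Sum of the reference counts of every entry in one pool's table."""
    return sum(entry.ref_count for entry in table.values())


# Hypothetical table for one pool, keyed by 20-byte SHA-1 digests.
h1 = hashlib.sha1(b"unit data one").digest()
h2 = hashlib.sha1(b"unit data two").digest()
hash_table = {
    h1: HashEntry(lun=1, lba=0x2000, ref_count=3),
    h2: HashEntry(lun=1, lba=0x2008, ref_count=1),
}
```

Only the hash values and reference counts are needed for the rate calculation; the LUN and LBA matter when the unit data itself has to be read.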

The duplicate count calculation unit 41 uses the hash tables of the two pools to calculate the duplicate count, that is, the number of hash values duplicated between the two pools, and the total number of reference counts. Specifically, the duplicate count calculation unit 41 determines whether a hash value of the first hash table exists in the second hash table and, if it does, increments the duplicate count by one (1). The duplicate count calculation unit 41 also adds the reference count of that hash value in the first hash table to the total number of reference counts (2). The duplicate count calculation unit 41 performs processes (1) and (2) for every hash value in the first hash table. Separately, the duplicate count calculation unit 41 adds all reference counts of the second hash table to the total number of reference counts.

The deduplication rate calculation unit 42 calculates the deduplication rate using the duplicate count and the total number of reference counts calculated by the duplicate count calculation unit 41. The deduplication rate calculation unit 42 then outputs the calculated deduplication rate together with the physical usage required when the two pools are merged. For example, the deduplication rate calculation unit 42 displays the required physical usage and the deduplication rate on the display device of the storage device 3.

When the two pools to be merged are controlled by different CMs 4, the communication unit 43 acquires the hash values and reference counts registered in the hash table from the other CM 4. When the communication unit 43 receives from another CM 4 a request to acquire the hash values and reference counts registered in its hash table, it transmits those hash values and reference counts.

FIG. 4 is a diagram for explaining the acquisition of hash values and reference counts from another CM 4. FIG. 4 shows a case in which CM#1 controls pool #1, CM#2 controls pool #2, and pool #1 and pool #2 are to be merged. The communication unit 43 of CM#2 acquires the hash table information of pool #1.

As shown in FIG. 4, the communication unit 43 of CM#2 prepares a buffer for storing the hash values and reference counts of pool #1. To acquire the hash values and reference counts of pool #1, the communication unit 43 of CM#2 then issues an acquisition request with acquisition position information attached (1). Here, the acquisition position information is, for example, information such as "q entries starting from the p-th entry of the hash table".

On receiving the acquisition request, the communication unit 43 of CM#1 reads hash values and reference counts from the hash table of pool #1 according to the acquisition position information and transmits them to CM#2 (2). The communication unit 43 of CM#2 receives the transmitted hash values and reference counts and stores them in the buffer.

The duplicate count calculation unit 41 of CM#2 calculates the duplicate count and the total number of reference counts using the hash table of pool #2 held by CM#2 and the hash values and reference counts stored in the buffer. When all hash values and reference counts stored in the buffer have been processed, the communication unit 43 of CM#2 issues another acquisition request with acquisition position information attached.
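
The exchange between the two CMs can be sketched as a paged fetch loop. In the Python sketch below, the function names `serve_request` and `count_across_cms`, the list-of-pairs representation of the remote table, and the rule that an empty reply marks the end of the table are all assumptions; the description only states that each request carries a start position and an entry count.

```python
def serve_request(remote_entries, start, count):
    """CM#1 side: return up to `count` (hash, ref_count) pairs from position `start`."""
    return remote_entries[start:start + count]


def count_across_cms(remote_entries, local_table, batch_size):
    """CM#2 side: fetch pool #1 entries in batches and compare with pool #2.

    remote_entries: list of (hash, ref_count) pairs held by CM#1 (pool #1)
    local_table:    dict hash -> ref_count held by CM#2 (pool #2)
    """
    duplicates = 0
    total_refs = sum(local_table.values())  # all reference counts of pool #2
    position = 0
    while True:
        # Request `batch_size` entries starting at `position`; the reply fills the buffer.
        buffer = serve_request(remote_entries, position, batch_size)
        if not buffer:  # assumption: an empty reply means the table is exhausted
            break
        for hash_value, ref_count in buffer:
            if hash_value in local_table:
                duplicates += 1
            total_refs += ref_count
        position += len(buffer)
    return duplicates, total_refs


pool1_entries = [(b"h1", 2), (b"h2", 1), (b"h3", 4)]
pool2_table = {b"h2": 5, b"h9": 1}
result = count_across_cms(pool1_entries, pool2_table, batch_size=2)
print(result)  # (1, 13)
```

Because only one batch is held at a time, the buffer on CM#2 can stay small regardless of how many entries the remote hash table contains.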

The deduplication rate identification unit 4a is realized when a deduplication rate identification program having the same functions is read from the main memory 7 shown in FIG. 1 and executed by the CPU 6. The deduplication rate identification program is stored on a CD-R (Compact Disc Recordable), which is one example of a recording medium readable by the CM 4, and is read from the CD-R and stored in the volume storage device 9. Alternatively, the deduplication rate identification program is stored in a database or the like of a computer system connected via a network, and is read from that database and stored in the volume storage device 9. The deduplication rate identification program stored in the volume storage device 9 is loaded into the main memory 7 and executed by the CPU 6.

Next, the flow of the deduplication rate calculation process is described. FIG. 5 is a flowchart showing the flow of the deduplication rate calculation process. As shown in FIG. 5, the deduplication rate calculation unit 42 reserves, in the storage unit 40, the areas needed to store information temporarily (step S1).

The deduplication rate calculation unit 42 then requests the duplicate count calculation unit 41 to execute the duplicate count calculation process, which calculates the duplicate count and the total number of reference counts (step S2). The deduplication rate calculation unit 42 then acquires the physical usage of pool #1 and pool #2 (step S3) and calculates the deduplication rate (step S4).

The deduplication rate calculation unit 42 then outputs the required physical usage and the deduplication rate (step S5) and releases the areas that were reserved for storing information temporarily (step S6).

Because the deduplication rate calculation unit 42 calculates and outputs the deduplication rate using the duplicate count, the total number of reference counts, and the physical usage of pool #1 and pool #2, the user can learn the physical usage that will be required after the two pools are merged.

Next, the flow of the duplicate count calculation process is described. FIG. 6 is a flowchart showing the flow of the duplicate count calculation process. As shown in FIG. 6, the duplicate count calculation unit 41 initializes the duplicate count and the total number of reference counts to 0 (step S21).

The duplicate count calculation unit 41 then acquires the first hash value and reference count of pool #1 (step S22). The duplicate count calculation unit 41 determines whether the acquired hash value exists in the hash table of pool #2 (step S23) and, if it does, increments the duplicate count (step S24).

The duplicate count calculation unit 41 then adds the reference count from pool #1 to the total number of reference counts (step S25). If the lookup just performed was for the first hash value of pool #1, the duplicate count calculation unit 41 also adds all reference counts of pool #2 to the total number of reference counts (step S26).

The duplicate count calculation unit 41 then determines whether all hash values of pool #1 have been processed (step S27) and, if so, ends the process. If pool #1 still has unprocessed hash values, the duplicate count calculation unit 41 acquires the next hash value and reference count of pool #1 (step S28) and returns to step S23.
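
The steps of FIG. 6 can be followed line by line in a short Python sketch. The function name `count_duplicates` and the representation of each hash table as a dictionary from hash value to reference count are assumptions made for illustration; the step labels in the comments refer to the flowchart steps described above.

```python
def count_duplicates(pool1, pool2):
    """Return (duplicate count, total number of reference counts).

    pool1, pool2: dict mapping hash value -> reference count (hypothetical model).
    """
    duplicates = 0
    total_refs = 0                       # S21: initialize both values to 0
    first = True
    for hash_value, ref_count in pool1.items():  # S22 (first entry) / S28 (next entry)
        if hash_value in pool2:          # S23: does the hash exist in pool #2?
            duplicates += 1              # S24: increment the duplicate count
        total_refs += ref_count          # S25: add the pool #1 reference count
        if first:                        # S26: only on the first pool #1 entry,
            total_refs += sum(pool2.values())  # add all pool #2 reference counts
            first = False
    return duplicates, total_refs        # S27: all pool #1 entries processed


pool1 = {"a": 2, "b": 1, "c": 3}
pool2 = {"b": 4, "d": 1}
print(count_duplicates(pool1, pool2))  # (1, 11)
```

As in the flowchart, the pool #2 reference counts are added while the first entry of pool #1 is handled, so this sketch would not count them if pool #1 were empty.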

Because the duplicate count calculation unit 41 calculates the duplicate count and the total number of reference counts in this way, the deduplication rate calculation unit 42 can calculate the deduplication rate.

As described above, in the embodiment, the duplicate count calculation unit 41 calculates the duplicate count of unit data between the two pools and the total number of reference counts of the two pools. The deduplication rate calculation unit 42 then calculates the deduplication rate using the duplicate count and the total number of reference counts. The deduplication rate identification unit 4a can therefore calculate the deduplication rate that results when the two pools are merged, and the user can learn, before merging, the physical usage that will be required once the two pools are merged.

The higher the deduplication rate, the smaller the physical usage after merging, and so the greater the benefit of merging. The sum of the physical usage of the individual pools before merging is compared with the physical usage after merging; if the physical usage after merging is smaller, merging is beneficial.

In addition, because the deduplication rate identification unit 4a does not need to read and process the actual data, it takes less processing time than reading the actual data to determine the deduplication rate. The main memory 7 only needs 8 bytes to store the duplicate count, 8 bytes × 2 to store the physical usage of pools #1 and #2, 8 bytes to store the total number of reference counts of pools #1 and #2, and 1 byte to store the deduplication rate, so the memory footprint can be kept small. Even when pool #1 and pool #2 are controlled by different CMs 4, a buffer large enough to hold the hash values and reference counts acquired from the other pool's hash table is sufficient.
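
Adding up the sizes listed above gives 8 + 16 + 8 + 1 = 33 bytes of working memory. The Python sketch below checks that arithmetic with the standard `struct` module. The field order, the little-endian packed layout, and the idea that the 1-byte deduplication rate holds a whole-number percentage are assumptions made for illustration only.

```python
import struct

# Hypothetical packed layout ("<" means little-endian with no padding):
#   Q  duplicate count                      (8 bytes)
#   Q  physical usage of pool #1            (8 bytes)
#   Q  physical usage of pool #2            (8 bytes)
#   Q  total number of reference counts     (8 bytes)
#   B  deduplication rate, assumed to be a whole percentage (1 byte)
WORK_AREA = struct.Struct("<QQQQB")

print(WORK_AREA.size)  # 33

# Sample values: 20 duplicates, 100 and 60 chunks of 4 KB, 400 references, 65 %.
packed = WORK_AREA.pack(20, 100 * 4096, 60 * 4096, 400, 65)
unpacked = WORK_AREA.unpack(packed)
```

The working set is independent of the amount of data in the pools, which is the point the paragraph above makes about memory size.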

In the embodiment, the deduplication rate calculation unit 42 multiplies the duplicate count by the chunk size to obtain the duplicate size, and subtracts the duplicate size from the total physical usage to obtain the post-deduplication size. The deduplication rate calculation unit 42 also multiplies the total number of reference counts by the chunk size to obtain the total size of the combined area of the two pools. The deduplication rate calculation unit 42 then divides the post-deduplication size by the total size to obtain the reduction rate, and subtracts the reduction rate from 1 to obtain the deduplication rate. The deduplication rate calculation unit 42 can therefore calculate the deduplication rate accurately.

In the embodiment, when pool #1 and pool #2 are controlled by different CMs 4, the communication unit 43 of the CM 4 that controls pool #2 acquires the hash values and reference counts registered in the hash table of pool #1. The duplicate count calculation unit 41 of the CM 4 that controls pool #2 then calculates the duplicate count and the total number of reference counts using the hash values and reference counts acquired by the communication unit 43 and the hash table of pool #2. The duplicate count calculation unit 41 can therefore calculate the duplicate count and the total number of reference counts even when the two pools are controlled by different CMs 4.

Although the embodiment describes the case of thin provisioning pools, the present invention is not limited to this and can be applied to any virtual storage pool in which deduplication is performed.

Although the embodiment describes the case of merging two pools, the present invention is not limited to this and can likewise be applied to merging three or more pools.
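
The description does not spell out how the calculation extends beyond two pools. One possible generalization, offered here purely as an assumption, is to treat the number of redundant chunks as the sum of the unique hash values of each pool minus the number of unique hash values in their union; for two pools this equals the duplicate count used above. The function name `merged_dedup_rate` and the dictionary model of the hash tables are hypothetical.

```python
def merged_dedup_rate(pools, physical_usages, chunk_size=4096):
    """Estimate (post-merge physical usage, deduplication rate) for any number of pools.

    pools:           list of dicts mapping hash value -> reference count
    physical_usages: physical usage in bytes of each pool, in the same order
    This generalization is an assumption, not the method stated in the patent.
    """
    unique_per_pool = sum(len(p) for p in pools)
    unique_merged = len(set().union(*pools))       # distinct hashes across all pools
    redundant_chunks = unique_per_pool - unique_merged
    total_refs = sum(sum(p.values()) for p in pools)
    if total_refs == 0:
        raise ValueError("no unit data in any pool")
    merged_physical = sum(physical_usages) - redundant_chunks * chunk_size
    return merged_physical, 1 - merged_physical / (total_refs * chunk_size)


three_pools = [{"a": 1, "b": 2}, {"b": 1, "c": 1}, {"a": 3, "b": 1, "d": 1}]
usages = [2 * 4096, 2 * 4096, 3 * 4096]
merged_bytes, merged_rate = merged_dedup_rate(three_pools, usages)
print(merged_bytes, round(merged_rate, 2))  # 16384 0.6
```

In this example the three pools store seven chunks separately but only four distinct chunks once merged, against ten logical references.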

Although the embodiment describes the case in which the CM 4 calculates the deduplication rate, the present invention is not limited to this and can likewise be applied to the case in which the host 2 calculates the deduplication rate.

1 Information processing system
2 Host
3 Storage device
4 CM
4a Deduplication rate identification unit
5 CA
6 CPU
7 Main memory
8 FC
9 Volume storage device
9a Volume
40 Storage unit
41 Duplicate count calculation unit
42 Deduplication rate calculation unit
43 Communication unit

Claims

An information processing device comprising: a first calculation unit that, for unit data of a predetermined size, calculates a duplicate count indicating the number of unit data items duplicated between two virtual storage pools and the total number of unit data items contained in the two virtual storage pools; and
a second calculation unit that calculates the reduction rate of the physical area achieved by deduplication using the duplicate count and the total number calculated by the first calculation unit, and calculates the deduplication rate from the reduction rate.

The information processing device according to claim 1, wherein the second calculation unit calculates the duplicate size of the duplicated area by multiplying the duplicate count by the predetermined size, calculates the post-deduplication size by subtracting the duplicate size from the sum of the sizes of the physical areas used by the two virtual storage pools, calculates the total size of the combined area of the two virtual storage pools by multiplying the total number by the predetermined size, and calculates, as the deduplication rate, the value obtained by subtracting from 1 the reduction rate obtained by dividing the post-deduplication size by the total size.

The information processing device according to claim 1 or 2, wherein the first calculation unit calculates the duplicate count and the total number using hash correspondence information that associates each hash value with the number of unit data items in the virtual storage pool for which that hash value is obtained by hash calculation.

The information processing device according to claim 3, wherein one of the two virtual storage pools is controlled by another information processing device,
information on the virtual storage pool controlled by the information processing device itself is registered in the hash correspondence information,
the information processing device further comprises a receiving unit that receives, from the other information processing device, information registered in the hash correspondence information of the other information processing device, and
the first calculation unit calculates the duplicate count and the total number using the information registered in its own hash correspondence information and the information received by the receiving unit from the other information processing device.

A deduplication rate identification method in which a computer executes a process comprising:
calculating, for unit data of a predetermined size, a duplicate count indicating the number of unit data items duplicated between two virtual storage pools and the total number of unit data items contained in the two virtual storage pools; and
calculating the reduction rate of the physical area achieved by deduplication using the calculated duplicate count and total number, and calculating the deduplication rate from the reduction rate.

A deduplication rate identification program that causes a computer to execute a process comprising:
calculating, for unit data of a predetermined size, a duplicate count indicating the number of unit data items duplicated between two virtual storage pools and the total number of unit data items contained in the two virtual storage pools; and
calculating the reduction rate of the physical area achieved by deduplication using the calculated duplicate count and total number, and calculating the deduplication rate from the reduction rate.