CN109582640B - Sliding window-based data deduplication storage method and device and storage medium

Info

Publication number: CN109582640B
Application number: CN201811359237.3A
Authority: CN
Inventors: 赵磊
Original assignee: Shenzhen Coocaa Network Technology Co Ltd
Current assignee: Shenzhen Coocaa Network Technology Co Ltd
Priority date: 2018-11-15
Filing date: 2018-11-15
Publication date: 2020-12-01
Anticipated expiration: 2038-11-15
Also published as: CN109582640A

Abstract

The invention discloses a sliding window-based data deduplication storage method, device and storage medium. The method comprises the following steps: segmenting the stored data to obtain each piece of fragment data and the optimal number of pieces of fragment data; establishing a query index for each piece of fragment data, and establishing a variable sliding window; detecting whether there is new data to be written and, if so, sending a query instruction to each piece of segmented fragment data through the variable sliding window to judge whether the new data to be written duplicates the current fragment data, in which case the new data is marked as duplicate data; if duplicate data exists, discarding the new data; and if no duplicate data exists, writing the new data into the current fragment data. According to the regularity with which duplicate data occurs, the invention selects, from the time and space dimensions, the segmentation mode that yields the largest number of fragments, and dynamically performs deduplication queries on the fragment data by adjusting the variable sliding window, thereby optimizing the query, improving query efficiency, reducing the overall performance overhead of the cluster, and providing convenience to users.

Description

Sliding window-based data deduplication storage method and device and storage medium

Technical Field

The invention relates to the technical field of data storage, in particular to a sliding window-based data deduplication storage method and device and a storage medium.

Background

With the development of technology and the growth of demand, ever more space is needed to store data, yet in practice storage space is limited. Each time file data is added, the uniqueness of the stored data must be guaranteed, so an index is established and all files are queried to check whether the data already exists. As the data volume keeps growing, however, and especially with massive data, this duplicate check takes a large amount of time and is inefficient. With the traditional query approach, the time cost of a query also grows linearly, which increases the performance overhead of the whole cluster, and conventional single-file storage can no longer meet the performance requirements of storing large volumes of data.

Accordingly, the prior art is yet to be improved and developed.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a sliding window-based data deduplication storage method, device and storage medium that, according to the fixed regularity with which duplicate data occurs, segment the data into the largest number of fragments and perform deduplication queries on them in varied ways by continuously adjusting a variable sliding window, thereby optimizing the query, improving query efficiency, reducing the overall performance overhead, and providing convenience to users.

The technical scheme adopted by the invention for solving the technical problems is as follows:

the invention provides a data deduplication storage method based on a sliding window, which comprises the following steps:

acquiring stored data, recording the stored data as a data source, and segmenting to obtain each piece of fragment data and the optimal number of the piece of fragment data;

establishing a query index for each fragment data, and establishing a variable sliding window;

detecting whether there is new data to be written and, if so, sending a query instruction to each piece of segmented fragment data through the variable sliding window to judge whether the new data to be written duplicates the current fragment data, in which case the new data is marked as duplicate data;

if duplicate data exists, discarding the new data and not writing it into the current fragment data; and if no duplicate data exists, writing the new data into the current fragment data.

The sliding window-based data deduplication storage method includes that the obtaining of the storage data as a data source and the segmentation are performed, and obtaining of each piece of fragment data and the optimal number of the piece of fragment data specifically includes:

acquiring streaming storage data, and recording the streaming storage data as a data source;

and selecting an optimal segmentation mode to segment the data source, so that the number of segmented fragment data is the largest, marking the segmented fragment data as the optimal fragment data number, and simultaneously acquiring each segmented fragment data.

The sliding window-based data deduplication storage method includes the steps of selecting an optimal segmentation mode to segment the data source, enabling the number of segmented fragment data to be the largest, marking the segmented fragment data as the optimal fragment data number, and meanwhile obtaining each segmented fragment data specifically includes the following steps:

comparing the number of the first sliced data obtained by the first slicing mode with the number of the second sliced data obtained by the second slicing mode;

when the number of first sliced data acquired by a first slicing mode is larger than that of second sliced data acquired by a second slicing mode, the first slicing mode is used as an optimal slicing mode to slice the data source, the number of sliced data is marked as the optimal number of sliced data, and each piece of first sliced data after slicing is acquired and respectively marked; and when the number of the second sliced data acquired by the second slicing mode is larger than that of the first sliced data acquired by the first slicing mode, slicing the data source by taking the second slicing mode as an optimal slicing mode, marking the number of the sliced data as the optimal number of the sliced data, and acquiring and marking each piece of the second sliced data after slicing.

In the sliding window-based data deduplication storage method, the first segmentation mode specifically includes:

acquiring a preset maximum time range threshold value of occurrence of repeated data;

setting a fixed value as the number of the first slicing mode according to experience and the maximum time range threshold, namely the number of first slicing data;

and acquiring the value of the number of the first sliced data.

In the sliding window-based data deduplication storage method, the second segmentation mode specifically includes:

acquiring a preset maximum time range threshold value of occurrence of the repeated data and a maximum storage capacity value which can be stored in all the fragmented data after the second segmentation mode is segmented;

carrying out time conversion on the maximum storage capacity value to obtain a maximum time threshold value for storage;

dividing the maximum time range threshold by the maximum time threshold to obtain the maximum sliced data number, namely the second sliced data number; and acquiring the value of the number of the second fragment data.

The sliding window-based data deduplication storage method includes the steps of establishing a query index for each piece of sliced data, and establishing a variable sliding window specifically includes:

acquiring each piece of marked segmented data after the optimal segmentation mode is segmented, and respectively establishing query indexes;

and establishing a variable sliding window, and setting an initial value of the window size of the variable sliding window as the value of the number of the first sliced data segmented by the first segmentation mode.

The sliding window-based data deduplication storage method includes the steps of detecting whether new data to be written exists or not, and if yes, sending a query instruction to each sliced piece of data through the variable sliding window to judge whether the new data to be written and the current sliced piece of data are repeated or not specifically includes:

and detecting whether new data to be written in exists, if so, adjusting the window size value of the variable sliding window and sending a query instruction to each segmented fragment data to realize effective query, judging whether the new data to be written in is repeated with the current fragment data, and simultaneously returning a query result and the current window size value of the variable sliding window.

In the sliding window-based data deduplication storage method, discarding the new data without writing it into the current fragment data if duplicate data exists, and writing the new data into the current fragment data if no duplicate data exists, specifically includes:

when the returned query result is 1, namely the repeated data exists, discarding the new data and not writing the new data into the current fragment data;

and when the returned query result is 0, namely no repeated data exists, writing the new data into the current fragment data.

The sliding window-based data deduplication storage method includes that the effective query is realized by adjusting the value of the window size of the variable sliding window and sending a query instruction to each sliced piece of data after slicing, and returning the current value of the window size of the variable sliding window specifically includes:

defining the number of the sliced fragment data as N and the value of the window size of the variable sliding window as M;

when it is detected that no duplicate data exists between the Nth fragment data and the new data to be written, and duplicate data exists between the (N-1)th fragment data and the new data to be written, the value of M is kept unchanged, and the variable sliding window does not need to be adjusted, to realize a first effective query;

when it is detected that duplicate data exists between the Nth fragment data and the new data to be written, the value of M is increased by 1, namely the variable sliding window is adjusted to M = M + 1, to realize a second effective query;

when it is detected that duplicate data exists between the Xth fragment data and the new data to be written, where X is not more than N-1, the variable sliding window is adjusted to M = X + 1, to realize a third effective query;

wherein N is more than or equal to 3, M is more than or equal to 3, X is more than or equal to 1, and N, M and X are integers.

The invention also provides a data deduplication storage device based on the sliding window, which comprises a processor and a memory connected with the processor, wherein the memory stores a data deduplication storage program based on the sliding window, and the data deduplication storage program based on the sliding window is used by the processor to realize the steps of the data deduplication storage method based on the sliding window.

The invention also provides a storage medium, wherein the storage medium stores a data deduplication storage program based on a sliding window, and the data deduplication storage program based on the sliding window is used for realizing the data deduplication storage method based on the sliding window when being executed by a processor.

The sliding window-based data deduplication storage method, device and storage medium provided by the invention have the following advantages:

1. The segmented data can be stored on multiple devices in a distributed manner, which reduces the storage pressure on any single device and improves caching capacity.

2. The window size of the variable sliding window is adjusted in varied ways so that deduplication queries on the fragment data are performed dynamically; the query mode is flexible and its coverage is wide, so queries over large data volumes can be completed quickly, storage capacity is used to the fullest, and wasted space is greatly reduced.

3. The method suits complex scenarios with large data volumes; using distributed storage to optimize queries reduces the overall time overhead and speeds up storage.

Drawings

FIG. 1 is a flow chart of a first preferred embodiment of a sliding window based data deduplication storage method according to the present invention.

FIG. 2 is a block diagram of a sliding window based data deduplication storage method of the present invention.

FIG. 3 is a schematic diagram of a deduplication query interaction of the data deduplication storage method based on a sliding window.

FIG. 4 is a functional block diagram of a sliding window based data deduplication storage apparatus provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In the invention, "data deduplication" refers to dividing a file into data pieces (namely fragmented data) with substantially equal lengths, and only one data piece with the same content is stored in a file system, so as to avoid storing duplicate data and wasting space and resources.

Example one

Referring to fig. 1, fig. 1 is a flowchart illustrating a sliding window based data deduplication storage method according to a first preferred embodiment of the present invention.

As shown in fig. 1, a sliding window based data deduplication storage method includes the following steps:

Step S100, acquiring the stored data, recording the stored data as a data source, and segmenting it to obtain each piece of fragment data and the optimal number of pieces of fragment data.

In the present invention, the data source refers to data with streaming characteristics, such as network data: qualifying data units are transmitted over an agreed channel and output one by one, and the order in which they are arranged is the same as the order in which they are output. In a specific implementation, streaming storage data, such as massive file data, is acquired and used as the data source for deduplication queries, so as to optimize the query and reduce the overall performance overhead.

Owing to the temporal locality of the data, that is, because the acquired storage data is arranged in time order, massive storage data can be segmented by time into multiple pieces of fragment data that are stored on multiple storage devices, which improves storage stability. Experiments on massive data show that, when new data to be written duplicates stored data, the duplicate necessarily lies within a fixed time range before a certain piece of fragment data among all the stored data, and the fixed time range is several. In the present invention, the maximum time range within which duplicate data occurs is preset to 1 hour, that is, the maximum fixed time range threshold is 1 hour. Of course, the maximum time range threshold for the occurrence of duplicate data is not limited to this value; it may be set according to user requirements, or according to the data storage time and the storage capacity.

In order to reduce the overall query time overhead as far as possible and to improve storage speed and space utilization, an optimal segmentation mode is selected to segment the data source so that the number of pieces of segmented fragment data is the largest; this number is marked as the optimal fragment data number, and each piece of segmented fragment data is obtained at the same time. Specifically, of the following two segmentation modes, the one that yields the larger number of fragments is selected as the optimal segmentation mode for segmenting the data source. For ease of description, the optimal fragment data number is denoted N.

First segmentation mode (time dimension):

First, the preset maximum time range threshold for the occurrence of duplicate data is acquired, here 1 hour. Then, based on experience and the maximum time range threshold, the number of fragments for the first segmentation mode is set to a fixed value, N1 = 3; that is, according to engineering experience, at least 3 pieces of fragment data are stored in the time dimension, namely within the maximum time range threshold, so the first fragment data number N1 is 3. For example, if the maximum time range for the occurrence of duplicate data is 1 hour and the number of fragments is 3, then 3 fragments are stored, and each fragment stores 20 minutes of data.
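The arithmetic of the first segmentation mode can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `fragment_span` and `fragment_index`, and the idea of locating a record's fragment from its timestamp, are hypothetical and are not taken from the patent text.

```python
from datetime import datetime, timedelta


def fragment_span(max_dup_window: timedelta, n1: int = 3) -> timedelta:
    """Time span covered by each fragment when the window is split into n1 fragments."""
    if n1 < 1:
        raise ValueError("n1 must be at least 1")
    return max_dup_window / n1


def fragment_index(ts: datetime, window_start: datetime, span: timedelta) -> int:
    """Zero-based index of the fragment whose time span contains ts (hypothetical helper)."""
    return (ts - window_start) // span


# 1 hour split into N1 = 3 fragments, as in the example in the text.
span = fragment_span(timedelta(hours=1))
start = datetime(2018, 11, 15, 0, 0)
# A record arriving 25 minutes into the window lands in the second fragment.
idx = fragment_index(start + timedelta(minutes=25), start, span)
```

With these inputs each fragment covers 20 minutes, matching the example above.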

Second segmentation mode (space dimension):

First, the preset maximum time range threshold for the occurrence of duplicate data, for example 1 hour, is likewise acquired, and the maximum size that the pieces of fragment data can each store after segmentation, namely the maximum storage capacity value, is determined. Then, the maximum storage capacity value is converted into time to obtain the maximum time threshold, that is, the time needed to store a fragment of that maximum size. The maximum time range threshold is then divided by the maximum time threshold to obtain the maximum number of fragments, namely the second fragment data number N2 obtained with the second segmentation mode. For example, if each fragment can store at most 10G of data and storing 10G takes 10 minutes, then 6 fragments can be stored within the 1-hour maximum time range threshold for the occurrence of duplicate data, i.e. 1 hour / 10 minutes = 6 fragments, so N2 = 6.
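A minimal Python sketch of the second segmentation mode follows. The text does not say how the capacity value is converted into time, so the constant ingest rate used by `fill_time` is an assumption made for illustration, as are both function names.

```python
from datetime import timedelta


def fill_time(capacity_gb: float, ingest_gb_per_minute: float) -> timedelta:
    """Convert a fragment's capacity into the time needed to fill it.

    Assumes a constant ingest rate; the patent only states that the
    capacity value is "converted into time".
    """
    if ingest_gb_per_minute <= 0:
        raise ValueError("ingest rate must be positive")
    return timedelta(minutes=capacity_gb / ingest_gb_per_minute)


def space_dimension_count(max_dup_window: timedelta, max_time_threshold: timedelta) -> int:
    """N2: maximum time range threshold divided by the maximum time threshold."""
    if max_time_threshold <= timedelta(0):
        raise ValueError("time threshold must be positive")
    return max_dup_window // max_time_threshold


# Example from the text: 10G per fragment, 10 minutes to store 10G, 1-hour window.
threshold = fill_time(10, 1.0)
n2 = space_dimension_count(timedelta(hours=1), threshold)
```

Floor division is used so that a window that is not an exact multiple of the fill time still yields a whole number of fragments; that rounding choice is also an assumption.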

The first fragment data number N1 obtained with the first segmentation mode is compared with the second fragment data number N2 obtained with the second segmentation mode, and the mode that yields the larger number of fragments is selected as the optimal segmentation mode. In the example above, N2 is greater than N1, so the second segmentation mode is selected as the optimal segmentation mode to segment the data source, yielding 6 pieces of second fragment data, each of which is marked, for example second fragment data 1, second fragment data 2, second fragment data 3 … second fragment data 6; the optimal fragment data number is N = 6. Conversely, if N1 > N2, the first segmentation mode is selected as the optimal segmentation mode to segment the data source, yielding N1 pieces of first fragment data, and the optimal fragment data number is N = N1.
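The selection between the two modes can be sketched as a small Python function. The name `choose_segmentation` and the mode labels are hypothetical. The text does not say what happens when N1 equals N2, so falling back to the first mode in that case is an assumption.

```python
from typing import Tuple


def choose_segmentation(n1: int, n2: int) -> Tuple[str, int]:
    """Return (mode label, optimal fragment count N) for the larger of n1 and n2."""
    if n2 > n1:
        return "second (space dimension)", n2
    # N1 > N2 per the text; the tie N1 == N2 is handled here by assumption.
    return "first (time dimension)", n1


mode, n = choose_segmentation(n1=3, n2=6)
```

With the example values N1 = 3 and N2 = 6, the second mode is selected and N = 6.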

Namely, step S100 specifically includes:

step S101, streaming storage data is recorded as a data source;

step S102, the data source is segmented in an optimal segmentation mode, so that the number of segmented fragment data is the largest, the segmented fragment data is marked as the optimal fragment data number, and each segmented fragment data is obtained at the same time.

Wherein, step S102 specifically includes:

step S1021, comparing the number of first fragment data obtained by the first segmentation mode with the number of second fragment data obtained by the second segmentation mode;

step S1022, when the number of first fragment data obtained by the first segmentation mode is greater than the number of second fragment data obtained by the second segmentation mode, segmenting the data source with the first segmentation mode as the optimal segmentation mode, marking the number of pieces of segmented fragment data as the optimal fragment data number, and acquiring and separately marking each piece of first fragment data after segmentation;

step S1023, when the number of second fragment data obtained by the second segmentation mode is greater than the number of first fragment data obtained by the first segmentation mode, segmenting the data source with the second segmentation mode as the optimal segmentation mode, marking the number of pieces of segmented fragment data as the optimal fragment data number, and acquiring and separately marking each piece of second fragment data after segmentation.

Of course, in some embodiments, the maximum number of fragments can also be obtained by combining the time dimension and the space dimension, that is, the number of fragments is calculated as the product of the time-dimension value and the space-dimension value. The time dimension follows a minimum principle: at least 3 fragments are stored per unit capacity (1G), so the time-dimension value is fixed at 3. The space-dimension value is determined by the size of the storage space and the storage size of each fragment; it can be a system default or can be set according to user requirements.
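The combined calculation can be illustrated with a short Python sketch. The text only says that the space-dimension value depends on the storage space size and the per-fragment storage size; deriving it by integer division, as below, is an assumption, and the function name is hypothetical.

```python
def combined_fragment_count(total_storage_gb: int, fragment_size_gb: int,
                            time_dimension: int = 3) -> int:
    """Fragment count as time-dimension value times space-dimension value."""
    if fragment_size_gb <= 0:
        raise ValueError("fragment size must be positive")
    # Assumed derivation of the space-dimension value; it could equally be
    # a system default or a user-supplied setting.
    space_dimension = total_storage_gb // fragment_size_gb
    return time_dimension * space_dimension


count = combined_fragment_count(total_storage_gb=60, fragment_size_gb=10)
```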

Step S200, establishing a query index for each piece of fragment data, and establishing a variable sliding window.

In the invention, stored data must be unique, so storage holding large data volumes is usually queried through an index to improve query efficiency and accuracy. Here, based on the pieces of fragment data obtained in step S100, a query index is established for each piece of fragment data, the fragment data is stored in a distributed manner, that is, across multiple connected storage devices, and a variable sliding window is established. As shown in fig. 2, the variable sliding window is a sliding window of variable capacity: its window size changes as the deduplication query conditions change, and it is used to adjust the window size in real time according to the results fed back by query instructions, so that the fragment data is queried dynamically and the deduplication query is optimized. In a specific implementation, the initial window size of the variable sliding window is set to the number of fragments N1 of the first segmentation mode; for ease of description, the window size is denoted M, so the initial value is M = N1 = 3.

That is, step S200 specifically includes:

step S201, acquiring each piece of marked segmented data after the optimal segmentation mode is segmented, and respectively establishing query indexes;

step S202, a variable sliding window is established, and meanwhile, the initial value of the window size of the variable sliding window is set to be the value of the number of the first sliced data which are sliced in the first slicing mode.
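Steps S201 and S202 might be sketched in Python as below. The patent does not specify what the query index contains, so modelling it as a set of SHA-256 fingerprints is an assumption, and the `Fragment` class and its members are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Fragment:
    label: str
    records: List[bytes] = field(default_factory=list)
    index: Set[str] = field(default_factory=set)

    def build_index(self) -> None:
        """S201: build this fragment's query index (assumed: content fingerprints)."""
        self.index = {hashlib.sha256(r).hexdigest() for r in self.records}

    def contains(self, data: bytes) -> bool:
        """Answer a query instruction: is this content already indexed here?"""
        return hashlib.sha256(data).hexdigest() in self.index


N1 = 3
# Six marked fragments, as in the example where the optimal count is N = 6.
fragments = [Fragment(f"fragment-{i}") for i in range(1, 7)]
fragments[0].records.append(b"hello")
for frag in fragments:
    frag.build_index()

# S202: the window size M starts at N1.
window_size = N1
```

Keeping one index per fragment lets each storage device answer queries for its own data, which fits the distributed storage described above.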

Step S300, detecting whether there is new data to be written and, if so, sending a query instruction to each piece of segmented fragment data through the variable sliding window to determine whether the new data to be written duplicates the current fragment data, in which case the new data is marked as duplicate data.

In the embodiment of the present invention, after step S200 is performed, whether there is new data to be written is detected in real time. If so, an effective query is carried out by adjusting the window size of the variable sliding window and sending a query instruction to each piece of segmented fragment data, whether the new data to be written duplicates the current fragment data is determined, and the query result and the current window size of the variable sliding window are returned. In a specific implementation scenario, deduplication over all the fragment data is optimized by adjusting the window size of the variable sliding window to an appropriate value, as shown in fig. 3, which is a deduplication query interaction diagram of the sliding window-based data deduplication storage method provided by the present invention:

when it is detected that no duplicate data exists between the Nth fragment data and the new data to be written, and duplicate data exists between the (N-1)th fragment data and the new data to be written, the value of M is kept unchanged and the variable sliding window is not adjusted, realizing the first effective query; this corresponds to scenario 1 in fig. 3, in which the effective query is achieved with the optimal value of M already set, no space is wasted, and storage utilization is improved;

when it is detected that duplicate data exists between the Nth fragment data and the new data to be written, the value of M is increased by 1, that is, the variable sliding window is adjusted to M = M + 1, realizing the second effective query; this corresponds to scenario 2 in fig. 3, in which adjusting the value of M reduces wasted space and increases the transmission rate;

when it is detected that duplicate data exists between the Xth fragment data and the new data to be written, where X is not more than N-1, the variable sliding window is adjusted to M = X + 1, realizing the third effective query; this corresponds to scenario 3 in fig. 3, in which a supplementary query is performed by increasing the value of M so that the query covers the data comprehensively and meets different query requirements.

The symbol "+1" above indicates that, in the query scenario, the variable sliding window moves forward by one piece of fragment data, i.e. its value increases by 1; similarly, "-1" indicates that the variable sliding window moves backward by one piece of fragment data, i.e. its value decreases by 1. N, M and X are integers, where N is not less than 3, M is not less than 3, and X is not less than 1. The changed value of the variable sliding window M is returned at the same time.
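The three adjustment rules can be sketched in Python as below. Several points are assumptions rather than statements from the patent: the function name `adjust_window`; that the query reports a single fragment number `x`, taken to be the highest-numbered fragment holding a duplicate; that the rules are checked in the order Nth fragment, then (N-1)th fragment, then any earlier fragment, since the third rule as written also covers X = N-1; and that M is left unchanged when no duplicate is found. The sketch applies the rules literally and does not clamp the result.

```python
from typing import Optional, Tuple


def adjust_window(m: int, n: int, x: Optional[int]) -> Tuple[int, int]:
    """Return (query_result, new_m) for window size m over n fragments.

    x is the 1-based number of the fragment where a duplicate was found,
    or None when no duplicate exists (query result 0).
    """
    if n < 3:
        raise ValueError("N must be at least 3")
    if x is None:
        return 0, m          # no duplicate: window left as is (assumption)
    if not 1 <= x <= n:
        raise ValueError("X must satisfy 1 <= X <= N")
    if x == n:
        return 1, m + 1      # scenario 2: duplicate in the Nth fragment
    if x == n - 1:
        return 1, m          # scenario 1: duplicate in the (N-1)th fragment only
    return 1, x + 1          # scenario 3: duplicate in an earlier fragment X


result, m = adjust_window(m=3, n=6, x=6)
```

Returning the pair mirrors the text, which says the query result and the current window size are returned together.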

It should be noted that, after the data source is segmented, new data is written into one and only one piece of fragment data, the current fragment data; that is, at any given time one and only one file is being written, and the remaining files are read-only.

Step S400, if duplicate data exists, the new data is discarded and not written into the current fragment data; if no duplicate data exists, the new data is written into the current fragment data.

Based on step S300, when duplicate data exists, that is, when the returned query result is 1, the new data is discarded and is not written into the current fragment data; when no duplicate data exists, that is, when the returned query result is 0, the new data is written into the current fragment data.
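The write decision of step S400 can be sketched in Python as follows. For brevity each fragment's query index is modelled as a plain set of record contents, which is an assumption; the function name `write_if_new` and its parameters are hypothetical.

```python
from typing import List, Set


def write_if_new(data: bytes, indexes: List[Set[bytes]], current: int) -> int:
    """Query every fragment index; return 1 (duplicate, discarded) or 0 (written).

    `current` is the position of the single writable fragment; all other
    fragments are only read.
    """
    if any(data in index for index in indexes):
        return 1                  # duplicate exists: discard the new data
    indexes[current].add(data)    # no duplicate: write into the current fragment
    return 0


indexes: List[Set[bytes]] = [set(), set(), set()]
first = write_if_new(b"record-a", indexes, current=2)
second = write_if_new(b"record-a", indexes, current=2)
```

The first call stores the record in the current fragment and the second call finds it there, so only one copy is kept.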

It should be noted that, the technical solution for implementing data deduplication storage in the present invention is to perform storage operation based on a distributed storage manner.

Further, the sliding window based data deduplication storage method provided by the invention can be applied to different network environments or different devices, such as a mobile terminal for communication, where the mobile terminal includes a processor and a memory connected to the processor, and the memory stores a sliding window based data deduplication storage program for being executed by the processor to implement the sliding window based data deduplication storage method.

Of course, those skilled in the art will understand that all or part of the processes of the methods in the above embodiments may be implemented by a sliding window-based data deduplication storage program instructing the related hardware (e.g., a processor, a controller, etc.); the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a memory, a magnetic disk, an optical disk, etc.

Example two

The present invention further provides a sliding window based data deduplication storage apparatus, as shown in fig. 4, where the sliding window based data deduplication storage apparatus includes a processor 10 and a memory 20 connected to the processor 10, where the memory 20 stores a sliding window based data deduplication storage program, and the sliding window based data deduplication storage program is used by the processor 10 to implement the steps of the sliding window based data deduplication storage method according to the first embodiment, specifically as described above.

Example three

The invention also provides a storage medium, wherein the storage medium stores a sliding window-based data deduplication storage program, and the program, when executed by the processor 10, implements the sliding window-based data deduplication storage method, specifically as described above.

In summary, the present invention discloses a sliding window-based data deduplication storage method, device and storage medium, wherein the method includes: acquiring stored data, recording the stored data as a data source, and segmenting it to obtain each piece of fragment data and the optimal number of pieces of fragment data; establishing a query index for each piece of fragment data, and establishing a variable sliding window; detecting whether there is new data to be written and, if so, sending a query instruction to each piece of segmented fragment data through the variable sliding window to judge whether the new data to be written duplicates the current fragment data, in which case the new data is marked as duplicate data; if duplicate data exists, discarding the new data and not writing it into the current fragment data; and if no duplicate data exists, writing the new data into the current fragment data. According to the fixed regularity with which duplicate data occurs, the invention selects, from the time and space dimensions, the segmentation mode that yields the largest number of fragments, and dynamically performs deduplication queries on the fragment data in varied ways by adjusting the variable sliding window, thereby reducing the performance overhead of the whole cluster, improving query efficiency, and providing convenience to users.

It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims

1. A sliding window based data deduplication storage method is characterized by comprising the following steps:

detecting whether new data to be written exists or not, if so, adjusting the window size value of the variable sliding window and sending a query instruction to each segmented fragment data to realize effective query, and judging whether the new data to be written is repeated with the current fragment data or not; simultaneously returning a query result and the value of the current window size of the variable sliding window;

2. The sliding-window-based data deduplication storage method according to claim 1, wherein the acquiring of the storage data is recorded as a data source and the segmenting is performed, and obtaining each piece of sliced data and an optimal number of the piece of sliced data specifically includes:

3. The sliding-window-based data deduplication storage method according to claim 2, wherein the selecting an optimal segmentation manner to segment the data source so that the number of segmented fragment data is the largest and is marked as the optimal fragment data number, and acquiring each segmented fragment data specifically includes the following steps:

4. The sliding-window-based data deduplication storage method of claim 3, wherein the first cut-off manner specifically comprises:

and acquiring the value of the number of the first sliced data.

5. The sliding-window-based data deduplication storage method according to claim 3, wherein the second slicing manner specifically includes:

6. The sliding-window-based data deduplication storage method of claim 1, wherein the establishing a query index for each sliced data, and the establishing a variable sliding window specifically comprises:

7. The sliding-window-based data deduplication storage method according to claim 1, wherein adjusting the value of the window size of the variable sliding window and sending a query instruction to each sliced piece of data to implement effective query, and returning the current value of the window size of the variable sliding window specifically includes:

8. A sliding-window based data deduplication storage apparatus comprising a processor and a memory connected to the processor, wherein the memory stores a sliding-window based data deduplication storage program, and the sliding-window based data deduplication storage program is used by the processor to implement the steps of the sliding-window based data deduplication storage method according to any one of claims 1-7.

9. A storage medium storing a sliding window based data deduplication storage program, wherein the sliding window based data deduplication storage program is configured to implement the sliding window based data deduplication storage method according to any one of claims 1-7 when executed by a processor.