CN114356212A - Data processing method, system and computer readable storage medium - Google Patents

Info

Publication number
CN114356212A
Authority
CN
China
Prior art keywords
data
data set
slice range
target
deduplication
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111397650.0A
Other languages
Chinese (zh)
Other versions
CN114356212B (en)
Inventor
舒治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111397650.0A
Publication of CN114356212A
Application granted
Publication of CN114356212B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a system, and a computer-readable storage medium. The method comprises the following steps: acquiring a data set to be processed, and calculating a score of the data set to be processed in each of at least one slice range, wherein the score represents the deduplication efficiency of performing data deduplication on the data set to be processed, and the slice range represents the number of bytes of the plurality of data blocks obtained by segmenting the data set to be processed; determining, among the at least one slice range, a target slice range matching the data set to be processed according to the score corresponding to each slice range; and segmenting the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises a plurality of data blocks. The application solves the technical problem in the prior art that an unreasonably set slice range leads to low deduplication efficiency for the data set.

Description

Data processing method, system and computer readable storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method, system, and computer-readable storage medium.
Background
During data processing, backing up data that contains duplicates usually requires first identifying the duplicate data and then deleting it so that only one copy is kept, with the duplicate data indexed to the same data block.
When duplicate data is deleted, the slice size directly determines both the efficiency of deleting the duplicate data and the efficiency of storing the data.
However, data from different data sets often has different characteristics. Determining an appropriate slice range is therefore very important for improving the deduplication efficiency of a data set.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing system and a computer readable storage medium, which are used for at least solving the technical problem of low efficiency of data set deduplication caused by unreasonable slice range setting in the prior art.
According to an aspect of an embodiment of the present application, there is provided a data processing method, including: acquiring a data set to be processed, and calculating the score of the data set to be processed in each slice range of at least one slice range, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, and the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing of the data set to be processed; determining a target slice range matched with the data set to be processed in at least one slice range according to the score corresponding to each slice range; and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises a plurality of data blocks.
Optionally, the data processing method further includes: sampling the data set to be processed to obtain an initial data set, and calculating an initial score of the initial data set in each slice range, so as to determine the score of the data set to be processed in each slice range.
Optionally, the data processing method further includes: obtaining a deduplication rate, a data processing duration, and data read/write parameters of the initial data set in the at least one slice range, wherein the data processing duration represents the execution time of deduplicating the initial data set, and the data read/write parameters represent the degree to which the at least one slice range affects data read/write efficiency; determining a deduplication coefficient corresponding to the current slice range according to the deduplication rate in the at least one slice range; determining a processing efficiency corresponding to the current slice range according to the data processing duration corresponding to the at least one slice range; determining a data read/write efficiency corresponding to the current slice range according to the data read/write parameters corresponding to the at least one slice range; and computing a weighted sum of the deduplication coefficient, the processing efficiency, and the data read/write efficiency to obtain the initial score.
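The weighted summation described above can be sketched as follows. This is a minimal illustration only: the `Weights` class, the function name, and the weight values are hypothetical, since the application does not state how the weights are chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # ASSUMPTION: illustrative weight values; the source does not specify them.
    dedup: float = 0.5
    processing: float = 0.3
    read_write: float = 0.2


def initial_score(dedup_coefficient: float,
                  processing_efficiency: float,
                  read_write_efficiency: float,
                  weights: Weights = Weights()) -> float:
    """Weighted sum of the three components for one slice range."""
    return (weights.dedup * dedup_coefficient
            + weights.processing * processing_efficiency
            + weights.read_write * read_write_efficiency)
```

Computing this value once per candidate slice range yields the set of initial scores from which the target slice range is later chosen.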
Optionally, the data processing method further includes: obtaining a first deduplication rate corresponding to the current slice range; sorting the deduplication rates corresponding to the at least one slice range to obtain a first sorting result; determining a second deduplication rate from the deduplication rates corresponding to the at least one slice range according to the first sorting result; and calculating the ratio of the first deduplication rate to the second deduplication rate to obtain the deduplication coefficient.
Optionally, the data processing method further includes: obtaining a first data volume corresponding to the initial data set; obtaining a second data volume corresponding to a first data set, wherein the first data set is obtained by deduplicating the initial data set based on the current slice range; and calculating the ratio of the second data volume to the first data volume to obtain the first deduplication rate.
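The two ratios above can be sketched together. The function names and the string keys used for slice ranges are hypothetical. The application does not say which element of the sorting result serves as the second deduplication rate, so the sketch assumes the smallest one and marks that choice in the code.

```python
def dedup_rate(first_data_volume: int, second_data_volume: int) -> float:
    """Ratio of the volume after deduplication to the volume before it."""
    if first_data_volume <= 0:
        raise ValueError("first_data_volume must be positive")
    return second_data_volume / first_data_volume


def dedup_coefficient(current_range: str, rates: dict[str, float]) -> float:
    """Ratio of the current range's rate to a reference rate from the sorted rates."""
    first_rate = rates[current_range]
    ranked = sorted(rates.values())  # first sorting result, ascending
    second_rate = ranked[0]  # ASSUMPTION: the smallest rate is the reference
    if second_rate == 0:
        raise ValueError("reference deduplication rate must be non-zero")
    return first_rate / second_rate
```

Under this assumption, the slice range that leaves the least data after deduplication gets a coefficient of 1, and every other range gets a larger value.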
Optionally, the data processing method further includes: calculating the average of the data processing durations corresponding to the at least one slice range to obtain an average processing duration; sorting the data processing durations corresponding to the at least one slice range to obtain a second sorting result; determining a target processing duration from the data processing durations corresponding to the at least one slice range according to the second sorting result; and calculating the ratio of the average processing duration to the target processing duration to obtain the processing efficiency.
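A minimal sketch of this ratio follows. The function name and the dictionary layout are hypothetical, and the application does not state which entry of the sorting result becomes the target processing duration. The sketch assumes the shortest duration.

```python
from statistics import mean


def processing_efficiency(durations: dict[str, float]) -> float:
    """Average processing duration divided by a target duration."""
    if not durations:
        raise ValueError("at least one slice range is required")
    average = mean(durations.values())
    ranked = sorted(durations.values())  # second sorting result, ascending
    target = ranked[0]  # ASSUMPTION: the shortest duration is the target
    if target <= 0:
        raise ValueError("durations must be positive")
    return average / target
```

For example, durations of 2, 4, and 6 seconds give an average of 4 seconds and, under the stated assumption, a processing efficiency of 2.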
Optionally, the data processing method further includes: obtaining the maximum number of requests and the maximum read speed of a storage unit, wherein the storage unit is used for storing the data set to be processed; determining target data according to the maximum number of requests and the maximum read speed, wherein the target data is the file size of the smallest file that affects the read/write performance of the storage unit; obtaining the number of bytes corresponding to the plurality of data blocks; and determining the data read/write efficiency corresponding to the current slice range according to the number of bytes corresponding to the plurality of data blocks and the target data.
Optionally, the data processing method further includes: determining, from the scores corresponding to each of the at least one slice range, the minimum score as a target score, and determining the slice range corresponding to the target score as the target slice range.
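The selection step can be sketched in a few lines. The function name and the representation of a slice range as a `(min_bytes, max_bytes)` tuple are assumptions made for illustration.

```python
def select_target_slice_range(
        scores: dict[tuple[int, int], float]) -> tuple[int, int]:
    """Return the slice range whose score is the minimum, as described above."""
    if not scores:
        raise ValueError("at least one slice range is required")
    return min(scores, key=scores.get)
```

With scores of 1.8 for `(4096, 16384)` and 1.2 for `(8192, 32768)`, the second range is returned as the target slice range.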
Optionally, the data processing method further includes: after segmenting the data set to be processed based on the target slice range to obtain the target data set, calculating fingerprint information corresponding to the plurality of data blocks contained in the target data set, wherein the fingerprint information is used for identifying the data blocks; when a preset fingerprint library does not contain the fingerprint information, backing up the data block corresponding to the fingerprint information and storing the fingerprint information in the preset fingerprint library; and when the preset fingerprint library contains the fingerprint information, recording the number of repetitions corresponding to the fingerprint information.
Optionally, the data processing method further includes: after recording the number of repetitions corresponding to the fingerprint information, performing data deduplication on the data set to be processed according to the number of repetitions and the fingerprint information.
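The fingerprint lookup described above can be sketched as follows. The application does not name a fingerprint algorithm, so SHA-256 is an assumption here, as are the function name and the use of in-memory dictionaries for the fingerprint library and the backup store.

```python
import hashlib


def backup_blocks(blocks: list[bytes],
                  fingerprint_library: dict[str, int],
                  backup_store: dict[str, bytes]) -> None:
    """Back up unseen blocks and count repetitions of blocks already seen."""
    for block in blocks:
        # ASSUMPTION: SHA-256 stands in for the unspecified fingerprint function.
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in fingerprint_library:
            backup_store[fingerprint] = block      # back up the new block
            fingerprint_library[fingerprint] = 0   # no repetitions yet
        else:
            fingerprint_library[fingerprint] += 1  # record one more repetition
```

After processing the blocks `b"a"`, `b"b"`, `b"a"`, the backup store holds two blocks and the fingerprint of `b"a"` has a repetition count of 1.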
According to another aspect of the embodiments of the present application, there is also provided a data processing method, including: reading a data set to be processed; in response to a deduplication operation on the data set to be processed, determining a target slice range for segmenting the data set to be processed; segmenting the data set to be processed based on the target slice range to obtain a target data set, wherein the target slice range represents the number of bytes of the plurality of data blocks obtained by segmenting the data set to be processed; and displaying the deduplication result of performing data deduplication on the target data set.
According to another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the above data processing method when running.
According to another aspect of embodiments of the present application, there is also provided an electronic device, including one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out a method for operating the program, wherein the program is arranged to carry out the above-mentioned data processing method when executed.
According to another aspect of the embodiments of the present application, there is also provided a data processing system, including: a data source unit, configured to store a data set to be processed; a processing unit, configured to calculate a score of the data set to be processed in each of at least one slice range, determine, among the at least one slice range, a target slice range matching the data set to be processed according to the score corresponding to each slice range, and then segment the data set to be processed based on the target slice range to obtain a target data set, wherein the score represents the deduplication efficiency of performing data deduplication on the data set to be processed, the slice range represents the number of bytes of the plurality of data blocks obtained by segmenting the data set to be processed, and the target data set comprises a plurality of data blocks; a fingerprint library, configured to store preset fingerprint information corresponding to a preset data set, wherein the processing unit is further configured to calculate target fingerprint information corresponding to the plurality of data blocks contained in the target data set and to perform deduplication processing on the preset fingerprint information stored in the fingerprint library according to the target fingerprint information; and a backup library, configured to store a first data set, wherein fingerprint information corresponding to the first data set is not stored in the fingerprint library.
In the embodiment of the application, a mode of calculating the score of the data to be processed in each slice range is adopted, the data set to be processed is obtained, the score of the data set to be processed in each slice range in at least one slice range is calculated, a target slice range matched with the data set to be processed is determined in at least one slice range according to the score corresponding to each slice range, and then the data set to be processed is subjected to data segmentation based on the target slice range, so that a target data set is obtained, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, the slice range represents the byte number of a plurality of data blocks obtained by data segmentation processing of the data set to be processed, and the target data set comprises a plurality of data blocks.
As can be seen from the above, in the embodiment of the present application, for an acquired data set, a score of the acquired data set in each slice range is calculated, where the score represents deduplication efficiency for data deduplication of the data set, and the slice range represents byte numbers of a plurality of data blocks obtained by data segmentation processing on the data set, and therefore, a slice range that is most matched with the data set to be processed, that is, a target slice range, can be selected according to the score, so that when the data set to be processed is subjected to deduplication, a most reasonable slice range can be determined, data segmentation is performed on the data set, and a problem of low deduplication efficiency of the data set due to an unreasonable slice range in the prior art is solved. In addition, the data set is divided in a reasonable slicing range, so that the generated target data set is more reasonable and accurate, the problem that residual repeated data exist in the data set is avoided, and the effect of improving the data storage efficiency is realized.
Therefore, the scheme provided by the embodiment of the application achieves the purpose of carrying out data division on the data set in a reasonable slicing range, so that the technical effect of improving the deduplication efficiency of the data set is achieved, and the technical problem of low deduplication efficiency of the data set caused by unreasonable slicing range setting in the prior art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of a hardware configuration of a computer terminal for implementing a data processing method;
FIG. 2 is a flow chart of an alternative data processing method according to embodiment 1 of the present application;
FIG. 3 is a flow chart of an alternative data processing method according to embodiment 1 of the present application;
FIG. 4 is a flow chart of an alternative data processing method according to embodiment 1 of the present application;
FIG. 5 is a block diagram of an alternative data processing system according to embodiment 1 of the present application;
FIG. 6 is a flow chart of an alternative data processing method according to embodiment 2 of the present application;
FIG. 7 is a schematic diagram of an electronic device for implementing a data processing method.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the technical scheme of the embodiment of the application, the acquisition, storage, application and the like of the personal information of the related user all accord with the regulations of related laws and regulations, and do not violate the good custom of the public order.
Example 1
According to an embodiment of the present application, an embodiment of a data processing method is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that described here.
The method provided by embodiment 1 of the present application can be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware configuration block diagram of a computer terminal for implementing the data processing method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal 10 may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power source, and/or a camera. It will be understood by those skilled in the art that the structure shown in FIG. 1 is only an illustration and is not intended to limit the structure of the electronic device. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuitry described above may be referred to generally herein as "data processing circuitry". The data processing circuitry may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Further, the data processing circuit may be a single stand-alone processing module, or incorporated in whole or in part into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the application, the data processing circuit acts as a processor control (e.g. selection of a variable resistance termination path connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the data processing method in the embodiment of the present application, and the processor 102 executes various functional applications and data processing by executing the software programs and modules stored in the memory 104, that is, implementing the data processing method of the application program. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network interface controller (NIC) that can be connected to other network devices through a base station to communicate with the internet. In one example, the transmission module 106 can be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in FIG. 1 described above may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that FIG. 1 is only one specific example and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
In addition, it should be noted that a processor may be used as an execution subject of the data processing method in the present embodiment.
In the above operating environment, the application provides a data processing method as shown in FIG. 2. FIG. 2 is a flowchart of a data processing method according to embodiment 1 of the present application. As can be seen from FIG. 2, the method comprises the following steps:
Step S202, acquiring a data set to be processed.
Optionally, in step S202, the data set may be, but is not limited to, any of various types of data sets, such as a text data set, a picture data set, an audio data set, or a video data set, or a data set composed of a mixture of multiple types of data. The data sets may also come from various sources, such as data sets downloaded from websites, data sets stored on NAS (network attached storage), data sets stored on local file systems, and data sets stored by various types of terminal objects. The data set to be processed may be acquired by, but not limited to, setting up a backup task, in which case the processor acquires the data set to be backed up by executing the program of the backup task.
Further, the process of acquiring the data set to be processed can be applied to a data backup scene, and can also be applied to a series of data processing scenes such as data transmission, data restoration, data encryption, data security and the like.
Step S204, calculating the score of the data set to be processed in each slice range in at least one slice range.
Optionally, in step S204, the score represents the deduplication efficiency of performing data deduplication on the data set to be processed, and the slice range represents the number of bytes of the plurality of data blocks obtained by segmenting the data set to be processed. The deduplication efficiency may be scored with reference to a number of factors, including but not limited to the deduplication rate, the data processing duration, and the data read/write parameters. In addition, a data block is the form into which data is recombined after being segmented according to the slice range. For example, duplicate data in the data set is identified on the basis of the slice range, only one copy of the duplicate data is kept, identical data is indexed to the same data block, and different data is indexed to different data blocks. The number of bytes may differ from one data block to another and is determined by the slice range.
It should be noted that, in the data deduplication process, if the slice range is too large, the probability that the processor finds duplicate data becomes small, which reduces the efficiency of deleting duplicate data and increases the storage space used. If the slice range is too small, the step of querying whether data is duplicated must be executed many times, which reduces the throughput of the entire data processing system. Through the above steps, the slice range with the highest deduplication efficiency is determined according to the score, which solves the problem that a slice range cannot be reasonably determined and improves both the deduplication efficiency for duplicate data and the utilization of storage space.
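The trade-off between large and small slices can be illustrated with a small experiment. The sketch below is not the method of the application: it uses fixed-size blocks and SHA-256 fingerprints as simplifying assumptions, and the function name is hypothetical. It reports the fraction of bytes that must still be stored after deduplication together with the number of fingerprint lookups performed.

```python
import hashlib


def dedup_stats(data: bytes, block_size: int) -> tuple[float, int]:
    """Return (stored_bytes / total_bytes, number_of_lookups) for one block size."""
    if not data or block_size <= 0:
        raise ValueError("data must be non-empty and block_size positive")
    seen: set[bytes] = set()
    stored = 0
    lookups = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        lookups += 1  # one duplicate query per block
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            stored += len(block)  # only the first copy is kept
    return stored / len(data), lookups
```

For a 16 KiB input made of alternating 1 KiB runs of `A` and `B`, a 16 KiB block finds no duplicates with a single lookup, a 1 KiB block stores one eighth of the data with 16 lookups, and a 64-byte block stores even less but needs 256 lookups.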
Step S206, determining a target slice range matched with the data set to be processed in at least one slice range according to the score corresponding to each slice range.
Optionally, in step S206, the target slice range is one of the at least one slice range, namely the slice range with the highest deduplication efficiency when data deduplication is performed on the data set to be processed. The target slice range may be selected automatically by a model according to the scores, or determined manually by an operator who compares all the scoring results.
It should be noted that, because the sources of the data sets may be different, the data in the data sets may also be different, and therefore, for each data set, the most matched target slice range may also be different, and through the above steps, the target slice range matched with the data set is determined according to the score, thereby achieving an effect of intelligently determining the target slice range corresponding to the data set, and avoiding a problem of poor accuracy of the obtained target data set due to uniform division of the slice ranges for all the data sets.
Step S208, performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set.
Optionally, in step S208, the target data set includes a plurality of data blocks. The target slice range may be at the KB level, and the processor may perform sliding-block data segmentation on the data set to be processed based on the target slice range to obtain the target data set.
It should be noted that, in the above process, since the data set is divided in a reasonable target slicing range, the generated target data set is more reasonable and accurate, the problem of residual repeated data in the data set is avoided, and the effect of improving the data storage efficiency is achieved. Moreover, data segmentation is carried out on the data set according to the target slice range, and the deduplication efficiency of the data set can be effectively improved.
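One way to picture segmentation within a slice range is a simple content-defined chunker that cuts a block once it has reached a minimum size and a rolling value hits a boundary pattern, and that forces a cut at a maximum size. This is a sketch under stated assumptions: the application does not disclose its segmentation algorithm, and the function name, the toy rolling value, and the `mask` parameter are hypothetical.

```python
def split_into_blocks(data: bytes, min_bytes: int, max_bytes: int,
                      mask: int = 0x3FF) -> list[bytes]:
    """Split data into blocks whose sizes fall within [min_bytes, max_bytes]."""
    if not 0 < min_bytes <= max_bytes:
        raise ValueError("require 0 < min_bytes <= max_bytes")
    blocks: list[bytes] = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # ASSUMPTION: a toy shift-and-add value, not a production rolling hash.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_bytes and (rolling & mask) == 0
        if at_boundary or length == max_bytes:
            blocks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        blocks.append(data[start:])  # trailing block may be shorter than min_bytes
    return blocks
```

Every block except possibly the last has a size inside the slice range, and concatenating the blocks reproduces the input.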
Based on the contents of the above steps S202 to S208, in the embodiment of the present application, a mode of calculating a score of the to-be-processed data in each slice range is adopted, a target slice range matched with the to-be-processed data set is determined in at least one slice range according to the score corresponding to each slice range by obtaining the to-be-processed data set and calculating the score of the to-be-processed data set in each slice range in at least one slice range, and then data segmentation is performed on the to-be-processed data set based on the target slice range to obtain the target data set, where the score represents deduplication efficiency of data deduplication of the to-be-processed data set, the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing of the to-be-processed data set, and the target data set includes a plurality of data blocks.
It is easy to note that, in the embodiment of the present disclosure, a score is calculated for the acquired data set in each slice range, where the score represents the deduplication efficiency of performing data deduplication on the data set, and the slice range represents the number of bytes of the plurality of data blocks obtained by segmenting the data set. The slice range that best matches the data set to be processed, that is, the target slice range, can therefore be selected according to the score, so that when the data set to be processed is deduplicated, the most reasonable slice range can be determined and used to segment the data set. This solves the problem in the prior art that an unreasonable slice range leads to low deduplication efficiency for the data set. In addition, because the data set is divided using a reasonable slice range, the generated target data set is more reasonable and accurate, residual duplicate data in the data set is avoided, and data storage efficiency is improved.
Therefore, the scheme provided by the embodiment of the disclosure achieves the purpose of data division of the data set in a reasonable slicing range, thereby achieving the technical effect of improving the deduplication efficiency of the data set, and further solving the technical problem of low deduplication efficiency of the data set caused by unreasonable slicing range setting in the prior art.
In an alternative embodiment, the score of the data set to be processed in each slice range may be determined by processing the data set to be processed with a model to obtain an initial data set, and calculating an initial score of the initial data set in each slice range.
Optionally, as shown in fig. 3, the model may first perform preprocessing on the data set to be processed, where the preprocessing includes performing feature sampling on the data set to be processed, so as to obtain an initial data set.
Optionally, the model may select a part of data in the data set according to the size of the data set to perform feature sampling, or may perform feature sampling on all data in the data set.
Further, after the initial data set is obtained, the model may perform multiple rounds of learning and calculation on the initial data set to calculate the initial score of the initial data set in the at least one slice range. The scoring criteria for the initial score may take into account influencing factors such as the deduplication rate, the data processing duration and the data read-write parameters, as well as the characteristics of the data set to be processed. The initial score is then determined as the score of the data set to be processed in each slice range.
Through the above process, before data segmentation is performed on the data set, a sample of the data set is evaluated by an intelligent learning method to obtain the initial score of the initial data set in the at least one slice range. This solves the problem that a reasonable slice range cannot be determined and improves the deduplication efficiency of the data set.
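The preprocessing step above, which uses either part or all of the data set depending on its size, can be sketched as follows. This is only an illustrative sketch: the text does not specify how feature sampling is performed, so plain random sampling is swapped in here, and the function name `sample_data_set`, the threshold `full_scan_limit` and the fraction `sample_fraction` are hypothetical.

```python
import random


def sample_data_set(items, full_scan_limit=1000, sample_fraction=0.1, seed=0):
    """Return the initial data set used for scoring.

    Assumption (not from the source): small data sets are used in full,
    large ones are randomly sampled. Threshold and fraction are illustrative.
    """
    items = list(items)
    if len(items) <= full_scan_limit:
        # Small data set: sample all data.
        return items
    # Large data set: sample only a part of the data.
    k = max(1, int(len(items) * sample_fraction))
    return random.Random(seed).sample(items, k)
```

A fixed seed is used so that repeated scoring runs see the same sample; a real system might prefer a sampling strategy that reflects the characteristics of the data set.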
In an optional embodiment, the processor obtains the deduplication rate, the data processing duration and the data read-write parameters of the initial data set in the at least one slice range, where the data processing duration represents the execution duration of deduplication processing on the initial data set, and the data read-write parameters represent the degree to which the at least one slice range affects data read-write efficiency. The processor then determines the deduplication coefficient corresponding to the current slice range according to the deduplication rate in the at least one slice range; determines the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to the at least one slice range; determines the data read-write efficiency corresponding to the current slice range according to the data read-write parameters corresponding to the at least one slice range; and finally performs a weighted summation of the deduplication coefficient, the processing efficiency and the data read-write efficiency to obtain the initial score.
Optionally, the processor may determine to obtain the initial score according to the deduplication rate, the data processing duration, and the data read-write parameter through the established model. As shown in the following equation 1:
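$$\text{score} = \alpha \cdot \frac{dup_i}{\max(dup)} + \beta \cdot \frac{avg(time)}{\min(time)} + \gamma \cdot (\text{data read-write efficiency term, a ratio involving } x) \qquad (1)$$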
where score is the initial score; a lower score indicates that the current slice range better matches the data set to be processed and is closer to the expected balance. In addition, the model may determine the deduplication coefficient corresponding to the current slice range according to the deduplication rate in the at least one slice range, determine the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to the at least one slice range, and determine the data read-write efficiency corresponding to the current slice range according to the data read-write parameters corresponding to the at least one slice range. In formula 1, $\frac{dup_i}{\max(dup)}$ is the deduplication coefficient, i.e., the ratio of the i-th deduplication rate to the maximum deduplication rate, and a smaller ratio indicates a better deduplication effect; $\frac{avg(time)}{\min(time)}$ is the processing efficiency, which reflects the average execution duration of the data processing algorithm, and a smaller ratio indicates higher algorithm execution efficiency; the third term (Figure BDA0003370588480000094) is the data read-write efficiency, in which x represents the size of the smallest file that affects the storage read-write performance, and a smaller ratio indicates a smaller influence on the storage read-write efficiency. α, β and γ represent the weight of the deduplication coefficient, the weight of the processing efficiency and the weight of the data read-write efficiency, respectively.
Further, the values of the three weights α, β and γ may be set and adjusted by an operator according to actual conditions, to indicate how much each aspect contributes when judging whether the overall data processing algorithm performs well.
Through the above process, the deduplication coefficient, the processing efficiency and the data read-write efficiency of the current slice range can be determined from the deduplication rate, the data processing duration and the data read-write parameters, and the initial score of the current slice range is obtained from them. Assigning different weights to the three terms allows the effect of segmenting the initial data set with the current slice range to be judged accurately, and considering all three aspects together ensures the accuracy of the initial score.
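The weighted summation described above can be illustrated with a short sketch. The function name `initial_score` and the default weights are hypothetical, and the three terms are passed in as already-computed numbers because the text defines them separately.

```python
def initial_score(dedup_coefficient, processing_efficiency, rw_efficiency,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; a lower score is a better match.

    alpha, beta, gamma are the operator-chosen weights. The default
    values of 1.0 are illustrative assumptions, not from the source.
    """
    return (alpha * dedup_coefficient
            + beta * processing_efficiency
            + gamma * rw_efficiency)


# Illustrative values only: 0.5*0.6 + 0.3*1.25 + 0.2*0.5
example = initial_score(0.6, 1.25, 0.5, alpha=0.5, beta=0.3, gamma=0.2)
```

Raising one weight relative to the others makes the corresponding aspect dominate the choice of slice range.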
In an optional embodiment, the processor obtains a first deduplication rate corresponding to a current slice range, and sorts the deduplication rates corresponding to at least one slice range to obtain a first sorting result, so as to determine a second deduplication rate from the deduplication rates corresponding to at least one slice range according to the first sorting result, and calculate a ratio of the first deduplication rate to the second deduplication rate to obtain a deduplication coefficient.
Optionally, the first deduplication rate is the deduplication rate corresponding to the current slice range; for example, in formula 1, the deduplication rate of the current slice range is $dup_i$, where the current slice range is the i-th slice range. The second deduplication rate may be the maximum deduplication rate among the deduplication rates corresponding to the plurality of slice ranges: the processor may sort the deduplication rates corresponding to the plurality of slice ranges from small to large and select, according to the sorting result, the maximum deduplication rate as the second deduplication rate (max(dup) in formula 1). Finally, the processor calculates the ratio of the first deduplication rate to the second deduplication rate to obtain the deduplication coefficient ($\frac{dup_i}{\max(dup)}$ in formula 1).
In the above process, according to the first deduplication rate of the current slice range and the maximum deduplication rate among the multiple slice ranges, the slice range with the best deduplication effect can be determined, thereby improving the deduplication efficiency of the initial data set.
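A minimal sketch of the deduplication coefficient follows. The function name `dedup_coefficients` and the sample rates are hypothetical; the sketch uses `max()` directly, which yields the same value as sorting the rates and taking the largest.

```python
def dedup_coefficients(dedup_rates):
    """Map each slice range to dup_i / max(dup).

    dedup_rates: dict of slice range (bytes) -> deduplication rate.
    """
    if not dedup_rates:
        raise ValueError("at least one slice range is required")
    max_rate = max(dedup_rates.values())  # the second deduplication rate
    if max_rate <= 0:
        raise ValueError("the maximum deduplication rate must be positive")
    return {size: rate / max_rate for size, rate in dedup_rates.items()}


# Illustrative rates only (remaining data / original data per slice range).
coefficients = dedup_coefficients({4096: 0.6, 8192: 0.8, 16384: 0.9})
```

The slice range with the largest rate always gets a coefficient of 1.0, and a smaller coefficient indicates a better deduplication effect.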
In an optional embodiment, the processor obtains a first data volume corresponding to the initial data set, and obtains a second data volume corresponding to the first data set, which is obtained by performing data deduplication processing on the initial data set based on the current slice range, so as to calculate a ratio of the second data volume to the first data volume, and obtain the first deduplication rate.
Optionally, the processor may perform data deduplication processing on the initial data set based on the current slice range to obtain the first data set, and obtain the second data volume. For example, if the initial data set contains 10 pieces of data in total and 6 pieces of data remain in the first data set after data deduplication is performed according to the current slice range, then the second data volume is 6 and the first data volume is 10. The processor may then calculate the ratio of the second data volume to the first data volume to obtain the first deduplication rate, e.g., 6/10 = 0.6.
In the above process, according to the statistics of the first data volume and the second data volume, the first deduplication rate corresponding to the current slice range can be obtained, and further, the deduplication rate corresponding to each slice range can be accurately calculated.
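The 10-to-6 example above can be reproduced with a small sketch. It assumes that each piece of data is a block of bytes and that identical blocks count as duplicates; the function name `dedup_rate` is hypothetical.

```python
def dedup_rate(blocks):
    """Ratio of the second data volume (after deduplication) to the
    first data volume (before deduplication), counted in blocks."""
    first_volume = len(blocks)
    if first_volume == 0:
        raise ValueError("the initial data set is empty")
    second_volume = len(set(blocks))  # distinct blocks that remain
    return second_volume / first_volume


# 10 blocks, of which 6 are distinct.
blocks = [b"a", b"b", b"c", b"d", b"e", b"f", b"a", b"b", b"c", b"d"]
rate = dedup_rate(blocks)
```

With this definition a smaller rate means more duplicate data was removed.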
In an optional embodiment, the processor calculates an average value of the data processing durations corresponding to the at least one slice range to obtain an average processing duration, and sorts the data processing durations corresponding to the at least one slice range to obtain a second sorting result, so as to determine a target processing duration from the data processing durations corresponding to the at least one slice range according to the second sorting result, and calculate a ratio of the average processing duration to the target processing duration to obtain the processing efficiency.
Optionally, the processor may calculate the average of the data processing durations corresponding to the multiple slice ranges. For example, the data processing durations corresponding to the individual slice ranges differ, some being shorter and some longer; the processor may sum the data processing durations corresponding to all the slice ranges and divide the sum by the number of slice ranges to obtain the average processing duration, i.e., avg(time) in formula 1. The processor also sorts the data processing durations corresponding to the multiple slice ranges from large to small and selects the minimum data processing duration as the target processing duration, i.e., min(time) in formula 1. Finally, the processor calculates the ratio of the average processing duration to the target processing duration to obtain the processing efficiency, i.e., $\frac{avg(time)}{\min(time)}$ in formula 1.
Through the above process, the slicing range in which the data processing duration is closest to the minimum data processing duration can be determined, thereby achieving the effect of improving the data deduplication speed.
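The ratio of the average processing duration to the target (minimum) processing duration can be sketched as below. The function name `processing_efficiency` and the sample durations are hypothetical.

```python
from statistics import mean


def processing_efficiency(durations):
    """Return avg(time) / min(time).

    durations: dict of slice range (bytes) -> processing duration in seconds.
    """
    values = list(durations.values())
    if not values:
        raise ValueError("at least one slice range is required")
    target = min(values)  # target processing duration
    if target <= 0:
        raise ValueError("durations must be positive")
    return mean(values) / target


# Illustrative durations only.
efficiency = processing_efficiency({4096: 3.0, 8192: 2.0, 16384: 1.0})
```

As described in the text, both avg(time) and min(time) are taken over all candidate slice ranges, so this sketch returns one value for the whole candidate set.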
In an optional embodiment, the processor obtains the maximum number of requests and the maximum read speed of a storage unit, where the storage unit is configured to store the data set to be processed, and determines target data according to the maximum number of requests and the maximum read speed, where the target data is the file size of the smallest file that affects the read-write performance of the storage unit. The processor then obtains the number of bytes corresponding to the plurality of data blocks, and determines the data read-write efficiency corresponding to the current slice range according to that number of bytes and the target data.
Optionally, the storage unit may be a solid state disk or a mechanical hard disk installed on the server or the terminal device. The processor may obtain a maximum number of requests and a maximum read speed for the memory unit. The target data is the file size of the smallest file that affects the read and write performance of the storage unit, such as x in equation 1.
In the process, the slice range with the highest data reading and writing efficiency can be determined, so that the effect of improving the data throughput efficiency is realized.
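The text does not state how the target data x is derived from the maximum number of requests and the maximum read speed, nor the exact form of the read-write ratio. The sketch below is therefore only one possible reading, with both formulas being assumptions: x is taken as the maximum read speed divided by the maximum number of requests per second, and the ratio is taken as x divided by the block size. All names are hypothetical.

```python
def min_impact_file_size(max_requests_per_s, max_read_bytes_per_s):
    """Assumed definition of x: bytes per request at which the request
    limit and the read-speed limit of the storage unit coincide."""
    if max_requests_per_s <= 0:
        raise ValueError("max_requests_per_s must be positive")
    return max_read_bytes_per_s / max_requests_per_s


def rw_efficiency(block_size, x):
    """Assumed ratio x / block_size: smaller means the block size has
    less influence on storage read-write efficiency."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return x / block_size


# Illustrative storage unit: 10,000 requests/s and 500 MB/s.
x = min_impact_file_size(10_000, 500_000_000)
ratio = rw_efficiency(100_000, x)
```

Under these assumptions, blocks smaller than x are limited by the number of requests rather than by read speed, which is why larger blocks give a smaller ratio.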
In an alternative embodiment, the processor determines the minimum score among the scores corresponding to each slice range in the at least one slice range as the target score, and determines the slice range corresponding to the target score as the target slice range.
Optionally, as shown in fig. 3, each of the multiple slice ranges has a corresponding score. The processor selects the smallest score as the target score and determines the slice range corresponding to the target score as the target slice range; that is, the processor learns, through the model, the slice range most suitable for the initial data set, which is the reasonable slice range. The processor may then apply the target slice range to the data processing algorithm to perform deduplication processing on the data set to be processed according to the target slice range.
In the process, the reasonable slicing range of the data set to be processed is determined, so that the effect of improving the deduplication efficiency of the data set is achieved.
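Selecting the target slice range amounts to taking the slice range with the minimum score. A minimal sketch, with a hypothetical function name and illustrative scores:

```python
def select_target_slice_range(scores):
    """Return the slice range whose score is the smallest.

    scores: dict of slice range (bytes) -> score (lower is better).
    """
    if not scores:
        raise ValueError("at least one slice range is required")
    return min(scores, key=scores.get)


# Illustrative scores only.
target = select_target_slice_range({4096: 1.9, 8192: 1.4, 16384: 1.7})
```

If two slice ranges tie on the minimum score, `min` returns the first one encountered; the text does not say how ties should be resolved.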
In an optional embodiment, after data segmentation is performed on the data set to be processed based on the target slice range to obtain the target data set, the processor calculates fingerprint information corresponding to the plurality of data blocks included in the target data set, where the fingerprint information is used to identify the data blocks. When the preset fingerprint library does not include the fingerprint information, the data block corresponding to the fingerprint information is backed up and the fingerprint information is stored in the preset fingerprint library; when the preset fingerprint library already includes the fingerprint information, the number of repetitions corresponding to the fingerprint information is recorded.
Optionally, as shown in fig. 4, after the processor reads the data to be backed up from the data set to be processed and performs data segmentation on the data set to be processed according to the target slice range, the processor calculates fingerprint information corresponding to the plurality of data blocks in the target data set. The fingerprint information is used to identify a data block, that is, to represent its data: different data yield different fingerprints, and the fingerprints can be stored in the preset fingerprint library.
Further, the processor may determine whether the current fingerprint already exists in the preset fingerprint library. When the processor identifies that the current fingerprint does not exist in the preset fingerprint library, it backs up the data block corresponding to the fingerprint information, for example to a backup library used for storing backup data, and at the same time stores the fingerprint information in the preset fingerprint library. If the preset fingerprint library already contains the fingerprint information, the processor records the number of repetitions corresponding to the fingerprint information.
In addition, as shown in fig. 5, the data processing system includes not only the preset fingerprint library but also a backup source, a backup program, and a backup library. The backup source is a data source (a data set to be processed) that needs to be backed up, the backup library is used for storing backup data, and the backup program is a program for executing a backup task and is used for reading the data that needs to be backed up from the backup source and backing up the data in the backup source into the backup library. The backup program can also slice the read backup data, calculate fingerprint information, inquire whether the fingerprint information exists in a preset fingerprint database, and store the data which does not exist in the preset fingerprint database in the backup database.
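The fingerprint check performed by the backup program can be sketched as follows. The text does not name a fingerprint algorithm, so SHA-256 from Python's standard `hashlib` module is used here as an assumption; the function name `backup_blocks` and the use of plain dictionaries for the fingerprint library and backup library are likewise illustrative.

```python
import hashlib


def backup_blocks(blocks, fingerprint_library, backup_library):
    """Back up only blocks whose fingerprint is not yet known.

    fingerprint_library: dict of fingerprint -> number of repetitions.
    backup_library: dict of fingerprint -> stored data block.
    """
    for block in blocks:
        # Assumption: SHA-256 serves as the fingerprint of a block.
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in fingerprint_library:
            backup_library[fingerprint] = block   # back up new data
            fingerprint_library[fingerprint] = 0  # no repetitions yet
        else:
            fingerprint_library[fingerprint] += 1  # record the repetition


fingerprints, backups = {}, {}
backup_blocks([b"aa", b"bb", b"aa", b"aa"], fingerprints, backups)
```

After this run the backup library holds two blocks, and the fingerprint of `b"aa"` has a repetition count of 2.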
In an optional embodiment, after recording the number of repetitions corresponding to the fingerprint information, the processor performs data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
Optionally, identical fingerprint information represents identical data, and the number of repetitions indicates how many copies of that data exist, so the processor can perform data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
Through the process, the processor can delete the repeated data in the data set, and the utilization rate of the storage space is improved.
As can be seen from the above, in the embodiment of the present disclosure, for an acquired data set, a score of the acquired data set in at least one slice range is calculated, where the score represents deduplication efficiency for data deduplication of the data set, and the slice range represents byte numbers of a plurality of data blocks obtained by data segmentation processing on the data set, and therefore, a slice range that is most matched with the data set to be processed, that is, a target slice range, may be selected according to the score, so that when deduplication processing is performed on the data set to be processed, a most reasonable slice range may be determined, and data segmentation is performed on the data set, and a problem in the prior art that deduplication efficiency of the data set is low due to an unreasonable slice range is solved. In addition, the data set is divided in a reasonable slicing range, so that the generated target data set is more reasonable and accurate, the problem that residual repeated data exist in the data set is avoided, and the effect of improving the data storage efficiency is realized.
Example 2
According to an embodiment of the present application, there is further provided an embodiment of a data processing method, where fig. 6 is a flowchart of a data processing method according to embodiment 2 of the present application, where the method includes the following steps:
step S602, a data set to be processed is read.
In step S602, the data set may be, but is not limited to, a data set of any of several types, such as a text data set, a picture data set, an audio data set or a video data set, or a data set composed of a mixture of multiple types of data. The data sets may also come from various sources, such as data sets downloaded from websites, data sets stored on NAS (network attached storage), data sets stored on local file systems, and data sets stored by various types of terminal objects. The data set to be processed may be read by, but is not limited to, setting up a backup task, in which case the processor acquires the data set to be backed up by executing the program of the backup task.
In addition, the process of reading the data set to be processed can be applied to the scene of data backup, and can also be applied to a series of data processing scenes such as data transmission, data restoration, data encryption, data security and the like.
Step S604, responding to the deduplication operation of the data set to be processed, determining a target slice range for performing data segmentation processing on the data set to be processed, and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set.
In step S604, the target slice range represents the number of bytes of the data blocks obtained by performing data segmentation on the data set to be processed, and the target data set includes a plurality of data blocks. The target slice range may be at the KB level, and the processor may perform sliding-block data segmentation on the data set to be processed based on the target slice range to obtain the target data set. The target slice range is one of the at least one slice range, namely the slice range with the highest deduplication efficiency when data deduplication is performed on the data set to be processed. The target slice range may be selected automatically by a model according to the scores, or determined manually by an operator by comparing all the scoring results.
It should be noted that, because the sources of the data sets may be different, and the data in the data sets may also be different, for each data set, the most matched target slice range may also be different, and through the above steps, the data to be processed is segmented according to the target slice range to obtain the target data set, thereby avoiding the problem of low efficiency of data set deduplication caused by uniformly dividing the slice range for all the data sets. In addition, the data set is divided in a reasonable target slicing range, so that the generated target data set is more reasonable and accurate, the problem that residual repeated data exist in the data set is avoided, and the effect of improving the data storage efficiency is realized.
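Segmenting a data set by the target slice range can be illustrated with a simple fixed-size split. This is an assumption for illustration: the text refers to sliding-block segmentation without giving details, and here the target slice range is treated as a single block size in bytes. The function name `split_into_blocks` is hypothetical.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split data into consecutive blocks of block_size bytes.

    The last block may be shorter than block_size.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A 10-byte data set split with a 4-byte slice size.
blocks = split_into_blocks(b"abcdefghij", 4)
```

In practice the block size would be at the KB level, as noted in step S604.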
Step S606, displaying the deduplication result of performing data deduplication on the target data set.
In step S606, the deduplication result of data deduplication of the target data set may be displayed by a display device, where the display device may be a display screen of the computer device itself or a third party display screen connected to the computer device. Computer devices include, but are not limited to: desktop computers, notebook computers, smart phones, smart tablets, servers, and the like. The deduplication result may be information related to the final remaining data, including but not limited to: the information about the number of remaining data, the size information of remaining data, the detail information of remaining data, and the number information of deleted data.
In the process, the duplicate removal operation of the data is realized, the duplicate data is deleted, the utilization rate of the storage space is improved, and the storage cost is saved.
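The kinds of information listed for the deduplication result can be gathered in a small summary structure before display. The sketch below is illustrative; the function name `dedup_result` and the dictionary keys are hypothetical.

```python
def dedup_result(blocks):
    """Summarise deduplication of a list of data blocks."""
    seen = set()
    remaining = []
    for block in blocks:
        if block not in seen:  # keep only the first copy of each block
            seen.add(block)
            remaining.append(block)
    return {
        "remaining_count": len(remaining),
        "remaining_bytes": sum(len(b) for b in remaining),
        "deleted_count": len(blocks) - len(remaining),
    }


summary = dedup_result([b"abcd", b"efgh", b"abcd", b"ij"])
```

The resulting dictionary can then be rendered on whichever display device is attached to the computer device.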
Based on steps S602 to S606, the embodiment of the present application segments the data set to be processed based on the target slice range. Specifically, in response to a deduplication operation on the data set to be processed, the target slice range for data segmentation of the data set to be processed is determined, data segmentation is performed on the data set to be processed based on the target slice range to obtain a target data set, and the deduplication result of data deduplication of the target data set is then displayed. The target slice range represents the number of bytes of the plurality of data blocks obtained by data segmentation of the data set to be processed.
It is easy to note that, in the embodiment of the present disclosure, for a read data set, data segmentation is performed on the data set to be processed according to a target slice range to obtain a target data set, so that a problem of low efficiency of data set deduplication caused by uniform slice range division on all data sets is avoided. In addition, the data set is divided in a reasonable target slicing range, so that the generated target data set is more reasonable and accurate, the problem that residual repeated data exist in the data set is avoided, and the effect of improving the data storage efficiency is realized.
Therefore, the scheme provided by the embodiment of the disclosure achieves the purpose of data division of the data set in a reasonable slicing range, thereby achieving the technical effect of improving the deduplication efficiency of the data set, and further solving the technical problem of low deduplication efficiency of the data set caused by unreasonable slicing range setting in the prior art.
In an alternative embodiment, the processor calculates the score of the data set to be processed in each of the at least one slice range, and determines the minimum score among the scores corresponding to the at least one slice range as the target score, so as to determine the slice range corresponding to the target score as the target slice range. The score characterizes the deduplication efficiency of data deduplication of the data set to be processed.
Optionally, the score represents the deduplication efficiency of data deduplication of the data set to be processed, and the slice range represents the number of bytes of the plurality of data blocks obtained by data segmentation of the data set to be processed. The deduplication efficiency may be scored on several aspects, including but not limited to the deduplication rate, the data processing duration and the data read-write parameters, and each aspect may be given a different weight to better meet actual data processing requirements.
It should be noted that, in the data deduplication process, if the slice range is too large, the probability that the processor finds duplicate data becomes small, which reduces the efficiency of deleting duplicate data and increases the storage space occupied; if the slice range is too small, the step of querying whether data is duplicated must be executed many more times, which reduces the throughput of the entire data processing system. Through the above steps, the slice range with the highest deduplication efficiency is determined according to the scores, which solves the problem that the slice range cannot be reasonably determined and improves both the deduplication efficiency of duplicate data and the utilization of storage space.
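The trade-off described above can be seen on a toy data set. The sketch below assumes fixed-size blocks and counts one duplicate query per block; the function name `chunking_stats` and the sample data are hypothetical.

```python
def chunking_stats(data: bytes, block_size: int) -> dict:
    """Count duplicate queries and bytes stored for a given block size."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    unique_blocks = set(blocks)
    return {
        "lookups": len(blocks),  # one duplicate query per block
        "stored_bytes": sum(len(b) for b in unique_blocks),
    }


# 64 bytes: mostly repeats of one 8-byte pattern, with one variant.
data = b"abcdefgh" * 3 + b"ABCDEFGH" + b"abcdefgh" * 4
stats = {size: chunking_stats(data, size) for size in (4, 8, 32)}
```

Here the 32-byte blocks find no duplicates and store all 64 bytes, while 4-byte and 8-byte blocks both store 16 bytes, but the 4-byte blocks need 16 queries instead of 8.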
In addition, the method for obtaining the target data set and performing data deduplication on the target data set has been described in detail in the above embodiment 1, and will not be repeated herein.
Example 3
According to an embodiment of the present application, there is also provided an embodiment of a data processing system, where the system includes: the system comprises a data source unit, a processor, a fingerprint library and a backup library.
The data source unit is used for storing a data set to be processed; the processor is used for calculating the score of the data set to be processed in each slice range of at least one slice range, determining a target slice range matched with the data set to be processed in at least one slice range according to the score corresponding to each slice range, and then performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, the slice range represents the number of bytes of a plurality of data blocks obtained by performing data segmentation processing on the data set to be processed, and the target data set comprises a plurality of data blocks; the fingerprint database is used for storing preset fingerprint information corresponding to the preset data set; the processor is also used for calculating target fingerprint information corresponding to a plurality of data blocks contained in the target data set and carrying out duplicate removal processing on preset fingerprint information stored in the fingerprint database according to the target fingerprint information; and the backup library is used for storing the first data set, wherein the fingerprint information corresponding to the first data set is not stored in the fingerprint library.
Optionally, the data source unit may be the backup source in the data processing system shown in fig. 5, and the processor may run the backup program in fig. 5. As can be seen from fig. 5, the backup source is the data source (i.e., the data set to be processed) that needs to be backed up, the backup library is used to store backup data, and the backup program is a program that executes a backup task, reading the data that needs to be backed up from the backup source and backing it up into the backup library. The backup program can also slice the read backup data, calculate fingerprint information, query whether the fingerprint information exists in the preset fingerprint library, and store in the backup library the data whose fingerprint information does not exist in the preset fingerprint library.
In an optional embodiment, after recording the number of repetitions corresponding to the fingerprint information, the processor may further perform data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
Optionally, identical fingerprint information represents identical data, and the number of repetitions indicates how many copies of that data exist, so the processor can perform data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
Through the above process, the processor can delete the duplicate data in the data set, which improves the utilization of storage space.
As can be seen from the above, the embodiment of the present application calculates a score of the data set to be processed in each slice range. Specifically, the data set to be processed is obtained, its score in each slice range of the at least one slice range is calculated, a target slice range matching the data set to be processed is determined from the at least one slice range according to the score corresponding to each slice range, and data segmentation is then performed on the data set to be processed based on the target slice range to obtain a target data set. The score represents the deduplication efficiency of data deduplication of the data set to be processed, the slice range represents the number of bytes of the plurality of data blocks obtained by data segmentation of the data set to be processed, and the target data set includes the plurality of data blocks.
As can be seen from the above, in the embodiment of the present disclosure, for an acquired data set, a score of the acquired data set in each slice range is calculated, where the score represents deduplication efficiency for data deduplication of the data set, and the slice range represents byte numbers of a plurality of data blocks obtained by data segmentation processing on the data set, and therefore, a slice range that is most matched with the data set to be processed, that is, a target slice range, may be selected according to the score, so that when processing is performed on the data set to be processed, a most reasonable slice range may be determined, data segmentation is performed on the data set, and a problem in the prior art that deduplication efficiency of the data set is low due to an unreasonable slice range is solved. In addition, the data set is divided in a reasonable slicing range, so that the generated target data set is more reasonable and accurate, the problem that residual repeated data exist in the data set is avoided, and the effect of improving the data storage efficiency is realized.
Therefore, the scheme provided by the embodiment of the disclosure achieves the purpose of data division of the data set in a reasonable slicing range, thereby achieving the technical effect of improving the deduplication efficiency of the data set, and further solving the technical problem of low deduplication efficiency of the data set caused by unreasonable slicing range setting in the prior art.
Example 4
According to an embodiment of the present application, there is also provided a processor, where the processor is configured to execute a program, where the program executes the data processing method of embodiment 1 and embodiment 2.
Example 5
Embodiments of the present application may provide an electronic device, which may be any computer terminal device in a group of computer terminals. Optionally, in this embodiment, the electronic device may be replaced with a terminal device such as a mobile terminal.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
In this embodiment, the electronic device may execute program code for the following steps of the data processing method: acquiring a data set to be processed; calculating the score of the data set to be processed in each slice range of at least one slice range, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, and the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing of the data set to be processed; determining a target slice range matched with the data set to be processed in the at least one slice range according to the score corresponding to each slice range; and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises the plurality of data blocks.
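The four steps above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: a slice range is modeled as a hypothetical `(min_bytes, max_bytes)` tuple, `split_fixed` is a stand-in splitter that cuts fixed-size blocks at the upper bound of the range, `score_fn` is supplied by the caller, and the lowest score is treated as the best match, following the optional selection step described later in this embodiment.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation of a slice range: (min_bytes, max_bytes) of a block.
SliceRange = Tuple[int, int]


def split_fixed(data: bytes, slice_range: SliceRange) -> List[bytes]:
    """Stand-in splitter (assumption): cut blocks of max_bytes each."""
    size = slice_range[1]
    if size <= 0:
        raise ValueError("slice range upper bound must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def select_and_split(
    data: bytes,
    slice_ranges: List[SliceRange],
    score_fn: Callable[[bytes, SliceRange], float],
) -> Tuple[SliceRange, List[bytes]]:
    """Score every candidate slice range, pick the target range, then split."""
    if not slice_ranges:
        raise ValueError("at least one slice range is required")
    # One score per candidate slice range.
    scores: Dict[SliceRange, float] = {r: score_fn(data, r) for r in slice_ranges}
    # Assumption taken from the optional step in the text: the lowest score wins.
    target = min(scores, key=lambda r: scores[r])
    # Segment the data set to be processed with the target slice range.
    return target, split_fixed(data, target)
```

A real implementation would likely use a content-defined splitter that honors both bounds of the range; the fixed-size splitter is used here only to keep the sketch short.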
Optionally, FIG. 7 is a block diagram of an electronic device according to an embodiment of the present invention. As shown in FIG. 7, the electronic device 100 may include: one or more processors 702 (only one is shown), a memory 706, and a peripheral interface 706.
The memory may be configured to store software programs and modules, such as program instructions/modules corresponding to the data processing method and apparatus in the embodiments of the present application, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, so as to implement the data processing method. The memory may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Optionally, the processor may call the information and the application program stored in the memory through the transmission device to execute the following steps: acquiring a data set to be processed; calculating the score of the data set to be processed in each slice range of at least one slice range, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, and the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing of the data set to be processed; determining a target slice range matched with the data set to be processed in at least one slice range according to the score corresponding to each slice range; and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises a plurality of data blocks.
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: sampling the data set to be processed to obtain an initial data set; calculating an initial score of the initial data set in each slice range; and determining the initial score as the score of the data set to be processed in each slice range.
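The text does not specify how the data set is sampled. As one possible sketch, the hypothetical helper `sample_dataset` below builds the initial data set by concatenating a few evenly spaced windows of the input; the window size and window count are illustrative parameters, not values taken from the application.

```python
def sample_dataset(data: bytes, window: int, count: int) -> bytes:
    """Concatenate `count` windows of `window` bytes taken at evenly spaced offsets.

    Assumption: evenly spaced windows are representative of the whole data set.
    """
    if window <= 0 or count <= 0:
        raise ValueError("window and count must be positive")
    # Small inputs are used in full; sampling would not save any work.
    if len(data) <= window * count:
        return data
    # Spread the window start offsets from the first byte to the last full window.
    stride = (len(data) - window) // (count - 1) if count > 1 else 0
    return b"".join(data[i * stride:i * stride + window] for i in range(count))
```

The score computed on this smaller initial data set is then used as the score of the full data set for the same slice range, which avoids deduplicating the entire data set once per candidate range.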
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring a deduplication rate, data processing duration and data reading and writing parameters of an initial data set in at least one slice range, wherein the data processing duration represents execution duration for performing deduplication processing on the initial data set, and the data reading and writing parameters represent influence degree of at least one slice range on data reading and writing efficiency; determining a deduplication coefficient corresponding to the current slice range according to the deduplication rate under at least one slice range; determining the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to at least one slice range; determining data reading and writing efficiency corresponding to the current slice range according to the data reading and writing parameters corresponding to at least one slice range; and carrying out weighted summation calculation on the deduplication coefficient, the processing efficiency and the data reading and writing efficiency to obtain an initial score.
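The weighted summation in the last step can be illustrated as follows. The weights are not given in the text, so the default values below are placeholders chosen only for the example.

```python
from typing import Tuple


def initial_score(
    dedup_coefficient: float,
    processing_efficiency: float,
    rw_efficiency: float,
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> float:
    """Weighted sum of the three per-slice-range quantities."""
    w_dedup, w_proc, w_rw = weights
    return (
        w_dedup * dedup_coefficient
        + w_proc * processing_efficiency
        + w_rw * rw_efficiency
    )
```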
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring a first deduplication rate corresponding to a current slice range; sorting the deduplication rates corresponding to at least one slice range to obtain a first sorting result; determining a second deduplication rate from the deduplication rates corresponding to the at least one slice range according to the first sorting result; and calculating the ratio of the first deduplication rate to the second deduplication rate to obtain a deduplication coefficient.
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring a first data volume corresponding to the initial data set; acquiring a second data volume corresponding to a first data set obtained by carrying out data deduplication processing on the initial data set based on the current slice range; and calculating the ratio of the second data volume to the first data volume to obtain a first deduplication rate.
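The deduplication rate and deduplication coefficient described in the two preceding paragraphs can be sketched as below. Two points are assumptions rather than statements of the application: blocks are cut at a fixed `block_size` and identified by SHA-256, and the second deduplication rate is taken as the first entry of the rates sorted in ascending order, because the text does not say which entry of the sorting result is used.

```python
import hashlib
from typing import List


def dedup_rate(sample: bytes, block_size: int) -> float:
    """Ratio of the data volume after deduplication to the volume before it."""
    if not sample or block_size <= 0:
        raise ValueError("sample must be non-empty and block_size positive")
    blocks = [sample[i:i + block_size] for i in range(0, len(sample), block_size)]
    # Keep one copy per distinct block; the dict maps digest -> block length.
    unique = {hashlib.sha256(b).digest(): len(b) for b in blocks}
    return sum(unique.values()) / len(sample)


def dedup_coefficient(current_rate: float, all_rates: List[float]) -> float:
    """Ratio of the current slice range's rate to a reference rate."""
    reference = sorted(all_rates)[0]  # assumption: smallest rate is the reference
    return current_rate / reference
```

With this definition a lower rate means more data was removed, which is consistent with selecting the slice range that has the minimum score.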
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: calculating the average value of the data processing durations corresponding to the at least one slice range to obtain an average processing duration; sorting the data processing durations corresponding to the at least one slice range to obtain a second sorting result; determining a target processing duration from the data processing durations corresponding to the at least one slice range according to the second sorting result; and calculating the ratio of the average processing duration to the target processing duration to obtain the processing efficiency.
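A minimal sketch of this calculation follows. The text does not state which entry of the sorted durations becomes the target processing duration, so the position is exposed as a hypothetical `rank` parameter; the default of 0 (the shortest duration) is an assumption.

```python
from statistics import mean
from typing import List


def processing_efficiency(durations: List[float], rank: int = 0) -> float:
    """Average processing duration divided by the target processing duration."""
    if not durations:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations)      # the sorting result
    target = ordered[rank]           # assumption: the entry at `rank` is the target
    return mean(durations) / target
```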
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring the maximum number of requests and the maximum reading speed of a storage unit, wherein the storage unit is used for storing the data set to be processed; determining target data according to the maximum number of requests and the maximum reading speed, wherein the target data is the file size of the smallest file that affects the read/write performance of the storage unit; acquiring the number of bytes corresponding to the plurality of data blocks; and determining the data reading and writing efficiency corresponding to the current slice range according to the number of bytes corresponding to the plurality of data blocks and the target data.
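The text gives no formula for the target data or for the resulting efficiency, so the sketch below is only one plausible reading. It assumes the maximum number of requests is a per-second limit and the maximum reading speed is in bytes per second, so that their quotient is the block size below which the request limit, rather than bandwidth, becomes the bottleneck. Both function names and the final ratio are hypothetical.

```python
from typing import List


def min_efficient_file_size(max_read_bytes_per_s: float, max_requests_per_s: float) -> float:
    """Bytes per request at which the request limit and bandwidth limit coincide."""
    if max_requests_per_s <= 0:
        raise ValueError("max_requests_per_s must be positive")
    return max_read_bytes_per_s / max_requests_per_s


def read_write_efficiency(block_sizes: List[int], threshold_bytes: float) -> float:
    """Assumed penalty: grows as the average block falls below the threshold."""
    if not block_sizes:
        raise ValueError("at least one block size is required")
    average = sum(block_sizes) / len(block_sizes)
    return threshold_bytes / average
```

Under this reading, a slice range that produces many blocks smaller than the threshold receives a larger value, which raises its score and makes it less likely to be chosen when the minimum score is selected.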
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: determining the minimum score among the scores corresponding to each slice range in the at least one slice range as a target score; and determining the slice range corresponding to the target score as the target slice range.
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following steps: after data segmentation is performed on the data set to be processed based on the target slice range to obtain the target data set, calculating fingerprint information corresponding to the plurality of data blocks contained in the target data set, wherein the fingerprint information is used for identifying the data blocks; when the preset fingerprint library does not contain the fingerprint information, performing backup processing on the data block corresponding to the fingerprint information, and storing the fingerprint information into the preset fingerprint library; and when the preset fingerprint library contains the fingerprint information, recording the number of repetitions corresponding to the fingerprint information.
Optionally, the processor may further call the information and the application program stored in the memory through the transmission device to perform the following step: after recording the number of repetitions corresponding to the fingerprint information, performing data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
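The fingerprint check described above can be sketched as follows. SHA-256 is used as the fingerprint only as an assumption, since the text does not name a hash function, and the in-memory `set`, `dict` and `Counter` are hypothetical stand-ins for the preset fingerprint library, the backup store and the repetition record.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Set


def backup_blocks(
    blocks: Iterable[bytes],
    fingerprint_library: Set[str],
    backup_store: Dict[str, bytes],
    repeat_counts: Counter,
) -> None:
    """Back up unseen blocks and count repetitions of blocks already seen."""
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in fingerprint_library:
            # New block: back it up and register its fingerprint.
            backup_store[fingerprint] = block
            fingerprint_library.add(fingerprint)
        else:
            # Known block: only record that it occurred again.
            repeat_counts[fingerprint] += 1
```

Only one copy of each distinct block reaches the backup store, while the repetition counts keep the information needed to account for the removed duplicates.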
It can be understood by those skilled in the art that the structure shown in FIG. 7 is only illustrative, and the electronic device may also be a terminal device such as a smart phone (e.g., an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 7 does not limit the structure of the electronic device. For example, the electronic device 100 may include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 7, or have a configuration different from that shown in FIG. 7.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing hardware associated with the terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
Example 6
Embodiments of the present application also provide a computer-readable storage medium. Optionally, in this embodiment, the computer-readable storage medium may be used to store the program code executed by the data processing methods provided in Embodiments 1 and 2.
Optionally, in this embodiment, the computer-readable storage medium may be located in any one of a group of computer terminals in a computer network, or in any one of a group of mobile terminals.
Optionally, in this embodiment, the computer-readable storage medium is configured to store program codes for performing the following steps: acquiring a data set to be processed; calculating the score of the data set to be processed in each slice range of at least one slice range, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, and the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing of the data set to be processed; determining a target slice range matched with the data set to be processed in at least one slice range according to the score corresponding to each slice range; and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises a plurality of data blocks.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: sampling the data set to be processed to obtain an initial data set; calculating an initial score of the initial data set in each slice range; and determining the initial score as the score of the data set to be processed in each slice range.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: acquiring a deduplication rate, data processing duration and data reading and writing parameters of an initial data set in at least one slice range, wherein the data processing duration represents execution duration for performing deduplication processing on the initial data set, and the data reading and writing parameters represent influence degree of at least one slice range on data reading and writing efficiency; determining a deduplication coefficient corresponding to the current slice range according to the deduplication rate under at least one slice range; determining the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to at least one slice range; determining data reading and writing efficiency corresponding to the current slice range according to the data reading and writing parameters corresponding to at least one slice range; and carrying out weighted summation calculation on the deduplication coefficient, the processing efficiency and the data reading and writing efficiency to obtain an initial score.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: acquiring a first deduplication rate corresponding to a current slice range; sorting the deduplication rates corresponding to at least one slice range to obtain a first sorting result; determining a second deduplication rate from the deduplication rates corresponding to the at least one slice range according to the first sorting result; and calculating the ratio of the first deduplication rate to the second deduplication rate to obtain a deduplication coefficient.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: acquiring a first data volume corresponding to the initial data set; acquiring a second data volume corresponding to a first data set obtained by carrying out data deduplication processing on the initial data set based on the current slice range; and calculating the ratio of the second data volume to the first data volume to obtain a first deduplication rate.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: calculating the average value of the data processing durations corresponding to the at least one slice range to obtain an average processing duration; sorting the data processing durations corresponding to the at least one slice range to obtain a second sorting result; determining a target processing duration from the data processing durations corresponding to the at least one slice range according to the second sorting result; and calculating the ratio of the average processing duration to the target processing duration to obtain the processing efficiency.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: acquiring the maximum number of requests and the maximum reading speed of a storage unit, wherein the storage unit is used for storing the data set to be processed; determining target data according to the maximum number of requests and the maximum reading speed, wherein the target data is the file size of the smallest file that affects the read/write performance of the storage unit; acquiring the number of bytes corresponding to the plurality of data blocks; and determining the data reading and writing efficiency corresponding to the current slice range according to the number of bytes corresponding to the plurality of data blocks and the target data.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: determining the minimum score among the scores corresponding to each slice range in the at least one slice range as a target score; and determining the slice range corresponding to the target score as the target slice range.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following steps: after data segmentation is performed on the data set to be processed based on the target slice range to obtain the target data set, calculating fingerprint information corresponding to the plurality of data blocks contained in the target data set, wherein the fingerprint information is used for identifying the data blocks; when the preset fingerprint library does not contain the fingerprint information, performing backup processing on the data block corresponding to the fingerprint information, and storing the fingerprint information into the preset fingerprint library; and when the preset fingerprint library contains the fingerprint information, recording the number of repetitions corresponding to the fingerprint information.
Optionally, the computer-readable storage medium is configured to store program codes for performing the following step: after recording the number of repetitions corresponding to the fingerprint information, performing data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the data processing method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method of the embodiments of the present application.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, a division of a unit is merely a division of a logic function, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method of the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered to fall within the protection scope of the present application.

Claims (14)

1. A data processing method, comprising:
acquiring a data set to be processed;
calculating the score of the data set to be processed in each slice range of at least one slice range, wherein the score represents the deduplication efficiency of data deduplication on the data set to be processed, and the slice range represents the number of bytes of a plurality of data blocks obtained by data segmentation processing on the data set to be processed;
determining a target slice range matched with the data set to be processed in the at least one slice range according to the score corresponding to each slice range;
and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target data set comprises the data blocks.
2. The method of claim 1, wherein calculating the score of the data set to be processed in each slice range of at least one slice range comprises:
sampling the data set to be processed to obtain an initial data set;
calculating an initial score of the initial data set in each slice range;
determining the initial score as the score of the data set to be processed in each slice range.
3. The method of claim 2, wherein calculating the initial score of the initial data set in each slice range comprises:
acquiring a deduplication rate, data processing duration and data reading and writing parameters of the initial data set in the at least one slice range, wherein the data processing duration represents execution duration for performing deduplication processing on the initial data set, and the data reading and writing parameters represent influence degree of the at least one slice range on data reading and writing efficiency;
determining a deduplication coefficient corresponding to the current slice range according to the deduplication rate under the at least one slice range;
determining the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to the at least one slice range;
determining data reading and writing efficiency corresponding to the current slice range according to the data reading and writing parameters corresponding to the at least one slice range;
and carrying out weighted summation calculation on the deduplication coefficient, the processing efficiency and the data reading and writing efficiency to obtain the initial score.
4. The method of claim 3, wherein determining the deduplication coefficient corresponding to the current slice range according to the deduplication rate under the at least one slice range comprises:
acquiring a first deduplication rate corresponding to the current slice range;
sorting the deduplication rates corresponding to the at least one slice range to obtain a first sorting result;
determining a second deduplication rate from the deduplication rates corresponding to the at least one slice range according to the first sorting result;
and calculating the ratio of the first deduplication rate to the second deduplication rate to obtain the deduplication coefficient.
5. The method of claim 4, wherein obtaining the first deduplication rate corresponding to the current slice range comprises:
acquiring a first data volume corresponding to the initial data set;
acquiring a second data volume corresponding to a first data set obtained by carrying out data deduplication processing on the initial data set based on the current slice range;
and calculating the ratio of the second data volume to the first data volume to obtain the first deduplication rate.
6. The method of claim 3, wherein determining the processing efficiency corresponding to the current slice range according to the data processing duration corresponding to the at least one slice range comprises:
calculating the average value of the data processing duration corresponding to the at least one slice range to obtain the average processing duration;
sorting the data processing duration corresponding to the at least one slice range to obtain a second sorting result;
determining a target processing duration from the data processing durations corresponding to the at least one slice range according to the second sorting result;
and calculating the ratio of the average processing duration to the target processing duration to obtain the processing efficiency.
7. The method according to claim 3, wherein determining the data read-write efficiency corresponding to the current slice range according to the data read-write parameter corresponding to the at least one slice range comprises:
acquiring the maximum number of requests and the maximum reading speed of a storage unit, wherein the storage unit is used for storing the data set to be processed;
determining target data according to the maximum number of requests and the maximum reading speed, wherein the target data is the file size of the smallest file that affects the read/write performance of the storage unit;
acquiring the number of bytes corresponding to the plurality of data blocks;
and determining the data reading and writing efficiency corresponding to the current slice range according to the number of bytes corresponding to the plurality of data blocks and the target data.
8. The method of claim 1, wherein determining a target slice range matching the data set to be processed in the at least one slice range according to the score corresponding to each slice range comprises:
determining the minimum score among the scores corresponding to each slice range in the at least one slice range as a target score;
and determining the slice range corresponding to the target score as the target slice range.
9. The method of claim 1, wherein after performing data segmentation on the data set to be processed based on the target slice range to obtain the target data set, the method further comprises:
calculating fingerprint information corresponding to a plurality of data blocks contained in the target data set, wherein the fingerprint information is used for identifying the plurality of data blocks;
when the preset fingerprint library does not contain the fingerprint information, performing backup processing on the data block corresponding to the fingerprint information, and storing the fingerprint information into the preset fingerprint library;
and when the preset fingerprint library contains the fingerprint information, recording the number of repetitions corresponding to the fingerprint information.
10. The method of claim 9, wherein after recording the number of repetitions corresponding to the fingerprint information, the method further comprises:
and performing data deduplication processing on the data set to be processed according to the number of repetitions and the fingerprint information.
11. A data processing method, comprising:
reading a data set to be processed;
in response to a deduplication operation on the data set to be processed, determining a target slice range for performing data segmentation processing on the data set to be processed, and performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the target slice range represents the number of bytes of a plurality of data blocks obtained by performing data segmentation processing on the data set to be processed;
and displaying a deduplication result of data deduplication of the target data set.
12. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to execute the data processing method of any one of claims 1 to 11 when executed.
13. An electronic device, characterized in that the electronic device comprises one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement a method for running a program, wherein the program is arranged to perform the data processing method of any of claims 1 to 11 when run.
14. A data processing system, comprising:
the data source unit is used for storing a data set to be processed;
the processing unit is used for calculating the score of a data set to be processed in each slice range of at least one slice range, determining a target slice range matched with the data set to be processed in the at least one slice range according to the score corresponding to each slice range, and then performing data segmentation on the data set to be processed based on the target slice range to obtain a target data set, wherein the score represents the deduplication efficiency of data deduplication of the data set to be processed, the slice range represents the number of bytes of a plurality of data blocks obtained by performing data segmentation on the data set to be processed, and the target data set comprises the data blocks;
the fingerprint library is used for storing preset fingerprint information corresponding to a preset data set;
the processing unit is further configured to calculate target fingerprint information corresponding to a plurality of data blocks included in the target data set, and perform deduplication processing on preset fingerprint information stored in the fingerprint library according to the target fingerprint information;
the backup library is used for storing a first data set, wherein the fingerprint information corresponding to the first data set is not stored in the fingerprint library.
CN202111397650.0A 2021-11-23 2021-11-23 Data processing method, system and computer readable storage medium Active CN114356212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111397650.0A CN114356212B (en) 2021-11-23 2021-11-23 Data processing method, system and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN114356212A (en) 2022-04-15
CN114356212B (en) 2024-06-14

Family

ID=81095512


Country Status (1)

Country Link
CN (1) CN114356212B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240028460A1 (en) * 2022-07-25 2024-01-25 Dell Products L.P. Method and system for grouping data slices based on average data file size for data slice backup generation
US12007845B2 (en) 2022-07-25 2024-06-11 Dell Products L.P. Method and system for managing data slice backups based on grouping prioritization

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140095439A1 (en) * 2012-10-01 2014-04-03 Western Digital Technologies, Inc. Optimizing data block size for deduplication
CN104102646A (en) * 2013-04-07 2014-10-15 腾讯科技(深圳)有限公司 Method, device and system for processing data
WO2016070529A1 (en) * 2014-11-07 2016-05-12 中兴通讯股份有限公司 Method and device for achieving duplicated data deletion
CN106469152A (en) * 2015-08-14 2017-03-01 阿里巴巴集团控股有限公司 A kind of document handling method based on ETL and system
US20180039423A1 (en) * 2015-05-12 2018-02-08 Hitachi, Ltd. Storage system and storage control method
US20180107424A1 (en) * 2015-03-31 2018-04-19 International Business Machines Corporation Selecting a set of storage units in a dispersed storage network
JP6341307B1 (en) * 2017-03-03 2018-06-13 日本電気株式会社 Information processing device
CN111984203A (en) * 2020-09-27 2020-11-24 苏州浪潮智能科技有限公司 Data deduplication method and device, electronic equipment and storage medium
CN113126879A (en) * 2019-12-30 2021-07-16 中国移动通信集团四川有限公司 Data storage method and device and electronic equipment
US20210319354A1 (en) * 2020-04-10 2021-10-14 International Business Machines Corporation Performance measurement of predictors


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
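曹晖; 张秦正: "Deduplication performance analysis based on the FSL data set", 电子科技大学学报 (Journal of University of Electronic Science and Technology of China), no. 04, pages 143-147 *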


Also Published As

Publication number Publication date
CN114356212B (en) 2024-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant