CN116011403B - Repeated data identification method for computer data storage - Google Patents

Repeated data identification method for computer data storage

Info

Publication number
CN116011403B
Authority
CN
China
Prior art keywords
code
processed
segment
coding
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310300571.6A
Other languages
Chinese (zh)
Other versions
CN116011403A (en)
Inventor
张福美
吕太峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Laiwu Vocational and Technical College
Original Assignee
Laiwu Vocational and Technical College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Laiwu Vocational and Technical College filed Critical Laiwu Vocational and Technical College
Priority to CN202310300571.6A priority Critical patent/CN116011403B/en
Publication of CN116011403A publication Critical patent/CN116011403A/en
Application granted granted Critical
Publication of CN116011403B publication Critical patent/CN116011403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of data processing, in particular to a repeated data identification method for computer data storage. The method comprises the following steps: collecting computer storage data to generate a coding section to be processed; performing frequency analysis on the codes to be processed in the code segment to be processed to obtain the repetition degree of the codes to be processed, and clustering the codes to be processed according to the repetition degree to generate code point clusters to be processed; determining a target point cluster and target codes, traversing the coding section to be processed, and marking the target codes in the coding section to be processed to obtain marking information of the target codes; dividing the coding section to be processed into coding identification sub-sections, and determining the identification priority of the coding identification sub-sections; and combining the target codes to obtain code segments to be identified, determining repeated code segments according to the identification priority, and taking computer storage data corresponding to the repeated code segments as repeated data. In summary, the invention can effectively improve the recognition efficiency of the repeated data.

Description

Repeated data identification method for computer data storage
Technical Field
The invention relates to the technical field of data processing, in particular to a repeated data identification method for computer data storage.
Background
The data stored in a computer often contains a large amount of repeated data. Repeated data occupies disk space and wastes the storage space of the computer, reduces the execution efficiency of data interaction and database access, and degrades the data processing performance of the computer. The repeated data therefore needs to be identified and processed.
In the related art, the computer storage data is divided into a plurality of sub-blocks, each sub-block is assigned a corresponding identifier, and repeated data is identified by searching the computer storage data for identical identifiers.
Disclosure of Invention
In order to solve the technical problem that the repeated data identification efficiency is obviously insufficient, the invention provides a repeated data identification method for computer data storage, which adopts the following technical scheme:
the invention provides a repeated data identification method for computer data storage, which comprises the following steps:
collecting computer storage data, carrying out binary conversion on the computer storage data, and generating a code to be processed, wherein the code to be processed forms a code section to be processed; performing frequency analysis on the code to be processed in the code segment to be processed to obtain the repetition degree of the code to be processed, and performing clustering processing on the code to be processed according to the repetition degree to generate at least one code point cluster to be processed;
determining a target point cluster from the to-be-processed code point cluster according to the repetition degree of all the to-be-processed codes in the to-be-processed code point cluster, taking the to-be-processed codes in the target point cluster as target codes, traversing the to-be-processed code segments, and marking the target codes in the to-be-processed code segments to obtain marking information of the target codes;
dividing the coding section to be processed into at least two coding identification subsections according to the total number of the coding to be processed and the marking information, and determining the identification priority of the coding identification subsections according to the frequency of the target coding in the coding identification subsections;
and carrying out coding combination on the target codes according to the marking information and the repetition degree to obtain code segments to be identified, identifying the code segments to be identified according to the identification priority of the code identification sub-segments, determining repeated code segments, and taking the computer storage data corresponding to the repeated code segments as repeated data.
Further, the dividing the code segment to be processed into at least two code identification subsections according to the total number of codes to be processed and the marking information comprises the following steps:
dividing the code segment to be processed equally into two initial sub-segments according to the total number of the codes to be processed;
judging whether the initial sub-segment meets a preset dividing condition according to the marking information, and if not, directly taking the initial sub-segment as the coding identification sub-segment; and if so, taking the initial sub-segment as a new coding segment to be processed for average division until the new initial sub-segment obtained after the average division does not meet the preset division condition, ending the average division, and taking the new initial sub-segment obtained after the ending of the division as the coding identification sub-segment.
Further, the marking information includes a number of marks, and the determining, according to the marking information, whether the initial sub-segment meets a preset dividing condition includes:
determining the two initial sub-segments with the largest numbers of marks, and taking the number of marks in each of them as a sub-segment mark number; calculating the difference between the two sub-segment mark numbers as a coding number difference, and calculating the ratio of the coding number difference to the number of marks as a coding number difference ratio;
when the coding quantity difference ratio is larger than a preset difference ratio threshold value, determining that the initial subsection meets the preset dividing condition;
and when the code quantity difference ratio is smaller than or equal to the preset difference ratio threshold, determining that the initial sub-segment does not meet the preset dividing condition.
Further, the coding combination is performed on the target code according to the marking information and the repetition degree to obtain a code segment to be identified, which comprises the following steps:
determining other codes adjacent to the target code as neighborhood codes according to the marking information;
determining that the difference in the repetition level of the neighborhood code and the repetition level of the target code is a repetition level difference;
judging whether the neighborhood code and the target code meet a preset combination condition according to the repetition degree difference, if not, directly taking the target code as the code section to be identified, if so, taking the combination of the neighborhood code and the target code as a new target code, determining a new neighborhood code adjacent to the new target code, combining the new target code and the new neighborhood code until the new target code and the new neighborhood code do not meet the preset combination condition, and taking the target code obtained after the combination is ended as the code section to be identified.
Further, the determining whether the neighborhood code and the target code meet a preset combination condition according to the repetition degree difference includes:
determining that the preset combination condition is met when the repetition degree difference of the target code and the neighborhood code is larger than a preset repetition degree difference threshold;
and when the repetition degree difference of the target code and the neighborhood code is smaller than or equal to a preset repetition degree difference threshold, determining that the preset combination condition is not met.
Further, the identifying the code segment to be identified according to the identification priority of the code identification sub-segment, and determining the repetition code segment includes:
taking the coded identification sub-segment with the highest identification priority as a reference sub-segment;
and determining the occurrence frequency of the coding segments to be identified in the reference subsections as the identification number of the coding segments to be identified, and taking the coding segments to be identified, of which the identification number is greater than a preset identification number threshold value, as repeated coding segments.
Further, the performing the binary conversion on the computer storage data to generate a code to be processed includes:
converting the computer storage data into decimal code based on an ASCII code table, and taking the decimal code as the code to be processed.
Further, the frequency analysis is performed on the code to be processed in the code segment to be processed to obtain the repetition degree of the code to be processed, including:
and carrying out normalization processing on the frequency of the code to be processed in the code segment to be processed to obtain a frequency normalization value, and taking the frequency normalization value as the repetition degree of the code to be processed.
Further, the determining a target point cluster from the to-be-processed code point clusters according to the repetition degree of all to-be-processed codes in the to-be-processed code point cluster includes:
calculating the average value of the repetition degree of all the codes to be processed in the code point cluster to be processed, and determining the code point cluster to be processed with the maximum average value of the repetition degree as a target point cluster.
The invention has the following beneficial effects:
according to the method and the device, the repetition degree of the code to be processed can be accurately determined by carrying out frequency analysis on the code to be processed in the code segment to be processed, the code point cluster to be processed is obtained by carrying out clustering processing on the code to be processed according to the repetition degree, and the code to be processed can be effectively clustered. The target point cluster is determined from the code point clusters to be processed according to the repetition degree, and the target codes in the target point cluster are marked to obtain the marking information of the target codes, so that the target codes can be effectively marked according to the repetition degree, the time waste caused by marking all the codes is avoided, the marking time is reduced while the marking effect is ensured, and the marking efficiency is improved. The code identification sub-segment can be effectively analyzed and identified by determining the identification priority of the code identification sub-segment, so that the code segment to be identified can be conveniently identified according to the identification priority. The repeated coding segments can be accurately identified according to the identification priority of the coding identification sub-segments, meanwhile, the repeated coding segments are identified according to the identification priority, so that the identification process of the repeated coding segments is more reasonable, the data identification pressure of a computer is effectively reduced, the computer can more conveniently and rapidly execute the identification process of the repeated coding segments, the identification efficiency of the repeated coding segments is effectively improved, and the identification efficiency of repeated data is effectively improved. In summary, the invention can effectively improve the recognition efficiency of the repeated data.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for identifying duplicate data for computer data storage according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a frequency distribution histogram according to an embodiment of the present invention.
Detailed Description
In order to further describe the technical means and effects adopted by the present invention to achieve the preset purposes, the following detailed description refers to a specific implementation, structure, characteristics and effects of a repeated data identification method for computer data storage according to the present invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of a repeated data identification method for computer data storage provided by the invention with reference to the accompanying drawings.
Referring to FIG. 1, a flowchart of a method for identifying duplicate data for computer data storage according to one embodiment of the invention is shown, the method comprising:
S101: collecting computer storage data, carrying out binary conversion on the computer storage data to generate codes to be processed, the codes to be processed forming a code segment to be processed; and carrying out frequency analysis on the codes to be processed in the code segment to be processed to obtain the repetition degree of the codes to be processed, and carrying out clustering processing on the codes to be processed according to the repetition degree to generate at least one code point cluster to be processed.
Through the binary conversion, that is, a conversion of the stored binary values into another number base, the invention converts the computer storage data into corresponding codes to be processed. Further, in an embodiment of the present invention, performing the binary conversion on the computer storage data to generate a code to be processed includes: converting the computer storage data into decimal code based on an ASCII code table, and taking the decimal code as the code to be processed.
In the embodiment of the invention, a computer can only recognize binary code values, that is, computer storage data is usually stored in the computer as binary code. To simplify the statistics, the binary code corresponding to each character in the computer storage data, such as a number, letter, text character, or symbol, can be converted into the corresponding decimal code, and the decimal code is used as the code to be processed. The American Standard Code for Information Interchange (ASCII) is a technology known in the art and is not described further here.
After the code to be processed is obtained, the code to be processed can be arranged according to the data sequence corresponding to the data stored by the computer, so that the code segment to be processed is obtained, and the subsequent step of repeated data identification according to the code segment to be processed is facilitated.
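As a minimal sketch of this conversion step, assuming the stored data is available as a text string containing only ASCII characters, the code segment to be processed could be built as follows. The function name `to_pending_codes` is hypothetical and does not come from the source.

```python
def to_pending_codes(text: str) -> list[int]:
    """Map each character to its decimal ASCII code, preserving data order."""
    # Encoding as ASCII raises UnicodeEncodeError on non-ASCII input; how such
    # characters should be handled is not specified in the source.
    return list(text.encode("ascii"))


segment = to_pending_codes("abcabc")
print(segment)  # [97, 98, 99, 97, 98, 99]
```

The resulting list keeps the original data order, so each index in the list corresponds to a position in the stored data.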
Further, in the embodiment of the present invention, frequency analysis is performed on the code to be processed in the code segment to be processed to obtain the repetition degree of the code to be processed, including: and carrying out normalization processing on the frequency of the code to be processed in the code section to be processed to obtain a frequency normalization value, and taking the frequency normalization value as the repetition degree of the code to be processed.
The frequency of a code to be processed is the number of times that code occurs in the code segment to be processed; it can be obtained by traversing the code segment to be processed.
In the embodiment of the invention, the frequency distribution histogram can be established by taking the code to be processed as the horizontal axis and the frequency of the code to be processed in the code segment to be processed as the vertical axis, and the repetition degree of the code to be processed can be more intuitively embodied according to the frequency distribution histogram, as shown in fig. 2, and fig. 2 is a schematic diagram of the frequency distribution histogram provided by one embodiment of the invention.
In the embodiment of the invention, the frequency of the code to be processed in the code segment to be processed is normalized to obtain the frequency normalization value, wherein the frequency of the code to be processed in the code segment to be processed can be normalized by using a normalization formula, and the corresponding formula is as follows:
$$G_i = \frac{f_i - f_{\min}}{f_{\max} - f_{\min}}$$
where $G_i$ denotes the frequency normalization value of the $i$-th kind of code to be processed, $i$ denotes the index of the kind of code to be processed, $f_i$ denotes the frequency of the $i$-th kind of code to be processed, $f_{\min}$ denotes the minimum of the frequencies of all the codes to be processed, and $f_{\max}$ denotes the maximum of the frequencies of all the codes to be processed.
The frequency normalization value can be used directly as the repetition degree of the code to be processed in the code segment to be processed. That is, the larger the frequency normalization value, the higher the frequency of the corresponding code to be processed in the code segment to be processed and the larger its repetition degree; the smaller the frequency normalization value, the lower the frequency of the corresponding code to be processed and the smaller its repetition degree.
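The normalization described above can be sketched as follows. This is an illustration only: the function name `repetition_degrees` is hypothetical, and returning 0.0 when every code has the same frequency (where the formula would divide by zero) is an assumption that the source does not address.

```python
from collections import Counter


def repetition_degrees(segment: list[int]) -> dict[int, float]:
    """Min-max normalise the occurrence count of each distinct code."""
    if not segment:
        return {}
    counts = Counter(segment)
    f_min, f_max = min(counts.values()), max(counts.values())
    if f_max == f_min:
        # All codes are equally frequent, so the denominator would be zero.
        # Returning 0.0 here is an assumption, not part of the source.
        return {code: 0.0 for code in counts}
    return {code: (n - f_min) / (f_max - f_min) for code, n in counts.items()}


degrees = repetition_degrees([97, 97, 97, 98, 98, 99])
print(degrees)  # {97: 1.0, 98: 0.5, 99: 0.0}
```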
In the embodiment of the invention, an unsupervised clustering algorithm, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), can be used to cluster the codes to be processed in the code segment to be processed according to their repetition degree, generating at least one code point cluster to be processed. It can be understood that the codes to be processed in the same code point cluster have the same or similar repetition degrees, so dividing the codes to be processed by code point cluster ensures that the repetition degrees within each cluster are similar.
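The clustering step can be illustrated with a simplified stand-in rather than a full DBSCAN implementation. Because the repetition degree is a single number per code, grouping the sorted values wherever consecutive gaps do not exceed a radius `eps` yields the same clusters that DBSCAN would produce when its minimum-points setting is 1 (every point is then a core point and no point is treated as noise). The function name, the value of `eps`, and the choice of a minimum-points setting of 1 are all assumptions; the source does not give DBSCAN parameters.

```python
def cluster_by_repetition(degrees: dict[int, float], eps: float = 0.1) -> list[list[int]]:
    """Group codes whose sorted repetition degrees are chained by gaps <= eps."""
    ordered = sorted(degrees, key=degrees.get)
    clusters: list[list[int]] = []
    for code in ordered:
        # Extend the current cluster while the gap to its last member is small.
        if clusters and degrees[code] - degrees[clusters[-1][-1]] <= eps:
            clusters[-1].append(code)
        else:
            clusters.append([code])
    return clusters


example = {97: 1.0, 98: 0.95, 99: 0.1, 100: 0.0}
print(cluster_by_repetition(example))  # [[100, 99], [98, 97]]
```

A library implementation of DBSCAN would be needed for a larger minimum-points setting, since points in sparse regions would then be labelled as noise rather than forming their own clusters.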
S102: determining a target point cluster from the to-be-processed code point clusters according to the repetition degree of all to-be-processed codes in the to-be-processed code point clusters, taking the to-be-processed codes in the target point cluster as target codes, traversing the to-be-processed code segments, and marking the target codes in the to-be-processed code segments to obtain marking information of the target codes.
Optionally, in the embodiment of the present invention, determining, according to the repetition degree of all the to-be-processed codes in the to-be-processed code point cluster, the target point cluster from the to-be-processed code point cluster includes: and calculating the average value of the repetition degree of all the codes to be processed in the code point cluster to be processed, and determining the code point cluster to be processed with the maximum average value of the repetition degree as a target point cluster.
In the embodiment of the invention, the average value of the repetition degree of all the codes to be processed in the code point cluster to be processed can be calculated by using the average value formula of the repetition degree, wherein the average value formula of the repetition degree is shown as follows:
$$\bar{G}_j = \frac{1}{N_j}\sum_{i=1}^{N_j} G_{j,i}$$
where $\bar{G}_j$ denotes the average value of the repetition degree of the $j$-th code point cluster to be processed, $j$ denotes the index of the code point cluster to be processed, $N_j$ denotes the number of kinds of codes to be processed in the $j$-th code point cluster to be processed, $i$ denotes the index of the kind of code to be processed, and $G_{j,i}$ denotes the repetition degree of the $i$-th kind of code to be processed in the $j$-th code point cluster to be processed.
According to the formula of the average value of the repetition degree, when the repetition degree of the code to be processed in the code point cluster to be processed is larger, the average value of the repetition degree is larger, the code point cluster to be processed with the largest average value of the repetition degree can be selected as the target point cluster, and the code to be processed in the target point cluster is used as the target code.
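Selecting the target point cluster by the largest average repetition degree can be sketched as below. The function name `select_target_cluster` and the input layout (a list of clusters plus a mapping from code to repetition degree) are illustrative assumptions.

```python
from statistics import mean


def select_target_cluster(clusters: list[list[int]], degrees: dict[int, float]) -> list[int]:
    """Return the cluster whose member codes have the highest mean repetition degree."""
    return max(clusters, key=lambda cluster: mean(degrees[c] for c in cluster))


degrees = {97: 1.0, 98: 0.95, 99: 0.1, 100: 0.0}
clusters = [[100, 99], [98, 97]]
target_codes = select_target_cluster(clusters, degrees)
print(target_codes)  # [98, 97]
```

If two clusters share the same average, `max` returns the first one encountered; the source does not say how such ties should be resolved.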
Of course, the embodiment of the present invention also supports the use of various other arbitrary possible implementations, so as to determine the target point cluster from the to-be-processed code point clusters according to the repetition degree of all to-be-processed codes in the to-be-processed code point cluster, for example, determine the target point cluster according to the index of the size of the median of the repetition degree of the to-be-processed codes, the probability that the value of the repetition degree is greater than a certain set threshold, and the like, which is not limited.
In the embodiment of the invention, the target code in the code segment to be processed is marked to obtain the marking information of the target code, that is, the code segment to be processed can be traversed, and the target code in the code segment to be processed is marked to obtain the marking information.
The marking information may be, for example, the mark positions and the number of marks of the target code in the code segment to be processed. The mark positions and the number of marks may also be combined to obtain information about how the target code is distributed in the code segment to be processed, such as the number of marks within an interval, which is not limited here.
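One possible representation of the marking information, assuming it consists only of the mark positions and the number of marks, is sketched below. The dictionary layout and the function name `mark_targets` are hypothetical.

```python
def mark_targets(segment: list[int], target_codes: set[int]) -> dict:
    """Traverse the segment once and record where target codes occur."""
    positions = [i for i, code in enumerate(segment) if code in target_codes]
    return {"positions": positions, "count": len(positions)}


marks = mark_targets([97, 98, 97, 99], {97})
print(marks)  # {'positions': [0, 2], 'count': 2}
```

The number of marks within any interval can then be derived by counting the recorded positions that fall inside that interval.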
S103: dividing the coding section to be processed into at least two coding identification subsections according to the total number of the coding to be processed and the marking information, and determining the identification priority of the coding identification subsections according to the frequency of the target coding in the coding identification subsections.
The code identification sub-segments are sub-segments obtained by dividing the code segment to be processed equally. When the code segment to be processed is divided equally into two sub-segments, if the total number of codes to be processed in it is even, it can be divided equally directly; if the total number is odd, the one remaining code to be processed can, after the equal division, be placed at random into either of the sub-segments adjacent to its position, which is not limited here.
Further, in the embodiment of the present invention, dividing the code segment to be processed into at least two code identification sub-segments according to the total number of the codes to be processed and the marking information includes: dividing the code segment to be processed equally into two initial sub-segments according to the total number of the codes to be processed; judging whether the initial sub-segments meet a preset dividing condition according to the marking information, and if not, directly taking the initial sub-segments as code identification sub-segments; if so, taking each initial sub-segment as a new code segment to be processed and dividing it equally again, until the new initial sub-segments obtained after an equal division do not meet the preset dividing condition, then ending the equal division and taking the new initial sub-segments obtained when the division ends as the code identification sub-segments.
Further, in the embodiment of the present invention, the marking information includes a number of marks, and judging whether the initial sub-segments meet the preset dividing condition according to the marking information includes: determining the two initial sub-segments with the largest numbers of marks, and taking the number of marks in each of them as a sub-segment mark number; calculating the difference between the two sub-segment mark numbers as a coding number difference, and calculating the ratio of the coding number difference to the number of marks as a coding number difference ratio; when the coding number difference ratio is larger than a preset difference ratio threshold, determining that the initial sub-segments meet the preset dividing condition; and when the coding number difference ratio is smaller than or equal to the preset difference ratio threshold, determining that the initial sub-segments do not meet the preset dividing condition.
In the embodiment of the invention, two initial subsections with the maximum number of marks can be determined from a plurality of initial subsections, and the number of marks in the two initial subsections with the maximum number of marks is taken as the number of marks of the subsections.
That is, if the number of the initial sub-segments is two, the number of marks in the two initial sub-segments is directly used as the number of marks of the sub-segments, and if the number of the initial sub-segments is eight, the two initial sub-segments with the largest number of marks are searched, and the number of marks corresponding to the two initial sub-segments is used as the number of marks of the sub-segments.
After the number of the sub-segment marks is obtained, the difference of the number of the sub-segment marks is calculated to be used as the code number difference, and the ratio of the code number difference to the number of the marks is calculated to be used as the code number difference ratio.
Since there are two sub-segment mark numbers, the absolute value of their difference can be calculated as the coding number difference.
Wherein the code number difference ratio is the ratio of the code number difference in the initial sub-segment to the total number of marks.
In the embodiment of the invention, the coding quantity difference ratio can be calculated by using a coding quantity difference ratio formula, and the corresponding calculation formula is as follows:
$$R = \frac{|n_1 - n_2|}{N}$$
where $R$ denotes the coding number difference ratio of the sub-segment mark numbers, $N$ denotes the number of marks, $n_1$ and $n_2$ respectively denote the sub-segment mark numbers of the two initial sub-segments with the largest numbers of marks, and $|\cdot|$ denotes taking the absolute value.
As can be seen from the coding number difference ratio formula, the larger the coding number difference ratio, the more pronounced the difference in the number of target codes between the two initial sub-segments with the largest numbers of marks. A corresponding preset difference ratio threshold can be set; preferably, it is set to 0.3, that is, the preset dividing condition is that the coding number difference ratio is larger than 0.3.
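As a hypothetical worked example (the counts are illustrative and do not come from the source), write $R$ for the coding number difference ratio and suppose the two initial sub-segments contain 30 and 10 marked target codes, with the number of marks taken to be 40:

```latex
R = \frac{|30 - 10|}{40} = 0.5 > 0.3
```

Under the preferred threshold of 0.3, the preset dividing condition would be met in this case, so the initial sub-segments would be divided equally again.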
When the coding quantity difference ratio is larger than a preset difference ratio threshold, determining that the initial sub-section meets preset dividing conditions, taking the initial sub-section as a new coding section to be processed, and respectively carrying out average division on the new coding section to be processed to obtain a new initial sub-section, and judging whether the new initial sub-section meets the preset dividing conditions.
It can be understood that each initial sub-segment is taken as a new code segment to be processed, and the new code segments to be processed are each divided equally to obtain new initial sub-segments. That is, the first division divides the code segment to be processed into 2 initial sub-segments; the second division takes the initial sub-segments obtained in the first division as new code segments to be processed and divides each of them equally to obtain 4 new initial sub-segments; and so on. In the $k$-th division, the corresponding number of new initial sub-segments is $2^k$, where $k$ denotes the number of divisions.
For example, the code segment to be processed is divided equally into a first coding sub-segment and a second coding sub-segment, which are the initial sub-segments obtained by the first equal division. When the coding number difference ratio of the first and second coding sub-segments is larger than the preset difference ratio threshold, the first coding sub-segment is divided equally into a third coding sub-segment and a fourth coding sub-segment, and the second coding sub-segment is divided equally into a fifth coding sub-segment and a sixth coding sub-segment. The third, fourth, fifth, and sixth coding sub-segments are the initial sub-segments obtained by the second equal division. The two initial sub-segments with the largest numbers of marks are determined from among these four sub-segments, a new coding number difference ratio is calculated, and it is judged whether the new coding number difference ratio is larger than the preset difference ratio threshold. If it is larger, the equal division continues; if it is smaller than or equal to the preset difference ratio threshold, the third, fourth, fifth, and sixth coding sub-segments are taken as the code identification sub-segments.
In the embodiment of the invention, when the coding number difference ratio is smaller than or equal to the preset difference ratio threshold, it is determined that the initial sub-segments do not meet the preset dividing condition. The average division then stops, and the initial sub-segments finally obtained are used as the code identification sub-segments. For example, if the coding number difference ratio calculated after 3 divisions is smaller than or equal to the preset difference ratio threshold, the 8 initial sub-segments obtained by the third division are used as the code identification sub-segments.
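The division loop described above can be sketched as follows. This is only an illustrative reading of the text, not the patent's reference implementation: the function names, the default threshold of 0.2, and the guard that stops splitting once a sub-segment holds fewer than two codes are all assumptions introduced here. The difference ratio is computed as in claim 3: the difference between the two largest per-sub-segment mark counts, divided by the total number of marks.

```python
from typing import List, Sequence, Set


def split_evenly(segment: Sequence[int]) -> List[Sequence[int]]:
    """Split a segment into two halves (the first half is shorter on odd lengths)."""
    mid = len(segment) // 2
    return [segment[:mid], segment[mid:]]


def count_marks(segment: Sequence[int], targets: Set[int]) -> int:
    """Number of target codes (marks) inside a segment."""
    return sum(1 for code in segment if code in targets)


def divide_into_identification_subsegments(
    segment: Sequence[int],
    targets: Set[int],
    ratio_threshold: float = 0.2,  # hypothetical value; the preset threshold is a tunable parameter
) -> List[Sequence[int]]:
    total_marks = count_marks(segment, targets)
    subsegments = split_evenly(segment)  # first division: 2 initial sub-segments
    while total_marks > 0:
        # Mark counts of the two initial sub-segments with the most marks.
        top_two = sorted((count_marks(s, targets) for s in subsegments), reverse=True)[:2]
        ratio = (top_two[0] - top_two[1]) / total_marks
        if ratio <= ratio_threshold:
            break  # preset dividing condition not met: stop dividing
        if any(len(s) < 2 for s in subsegments):
            break  # assumption: cannot halve a sub-segment with fewer than 2 codes
        # n-th division: every sub-segment is halved, giving 2**n sub-segments.
        subsegments = [half for s in subsegments for half in split_evenly(s)]
    return subsegments
```

With the codes `[65, 65, 65, 65, 66, 67, 68, 69]` and target code 65, the first division yields mark counts 4 and 0 (ratio 1.0), so a second division is performed; the four resulting sub-segments have mark counts 2, 2, 0, 0 (ratio 0), and the division stops.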
After the code identification sub-segment is obtained, the embodiment of the invention can determine the identification priority of the code identification sub-segment according to the frequency of the target code in the code identification sub-segment.
In the embodiment of the invention, the identification priority of a code identification sub-segment is determined according to the frequency of the target code in that sub-segment: the higher the frequency of the target code in a code identification sub-segment, the higher the identification priority of that sub-segment.
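As a minimal sketch of this ranking, assuming that "frequency" means the raw count of target codes in each sub-segment and that ties keep their original order (neither detail is specified in the text), the sub-segments can be ordered like this. The function name is hypothetical.

```python
from typing import List, Sequence, Set


def rank_by_identification_priority(
    subsegments: Sequence[Sequence[int]], targets: Set[int]
) -> List[int]:
    """Return sub-segment indices ordered from highest to lowest priority."""
    # Frequency of target codes in each code identification sub-segment.
    counts = [sum(1 for code in s if code in targets) for s in subsegments]
    # Higher target-code frequency means higher identification priority.
    return sorted(range(len(subsegments)), key=lambda i: counts[i], reverse=True)
```

The first index in the returned list identifies the sub-segment with the highest identification priority.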
S104: performing coding combination on the target codes according to the marking information and the repetition degree to obtain code segments to be identified, identifying the code segments to be identified according to the identification priority of the code identification sub-segments to determine repeated code segments, and taking the computer storage data corresponding to the repeated code segments as repeated data.
Further, in the embodiment of the present invention, according to the marking information and the repetition degree, the encoding combination is performed on the target encoding to obtain the encoding segment to be identified, including: determining other codes adjacent to the target code as neighborhood codes according to the marking information; determining the difference between the repetition degree of the neighborhood code and the repetition degree of the target code as the repetition degree difference; judging whether the neighborhood code and the target code meet the preset combination condition according to the repetition degree difference, if not, directly taking the target code as a code segment to be identified, if so, combining the neighborhood code and the target code as a new target code, determining a new neighborhood code adjacent to the new target code, combining the new target code and the new neighborhood code until the new target code and the new neighborhood code do not meet the preset combination condition, and ending the combination, wherein the target code obtained after the combination is taken as the code segment to be identified.
Optionally, the preset combination condition includes: the repetition degree difference of the target code and the neighborhood code is larger than a preset repetition degree difference threshold. That is, according to the repetition degree difference, determining whether the neighborhood code and the target code meet a preset combination condition includes: when the repetition degree difference of the target code and the neighborhood code is larger than a preset repetition degree difference threshold value, determining that a preset combination condition is met; and when the repetition degree difference of the target code and the neighborhood code is smaller than or equal to a preset repetition degree difference threshold value, determining that the preset combination condition is not met.
The preset repetition degree difference threshold may be, for example, 0.1, which is not limited.
In the embodiment of the invention, when the repetition degree difference is smaller than or equal to the preset repetition degree difference threshold, the target code is directly used as the code segment to be identified; when the repetition degree difference is greater than the preset repetition degree difference threshold, the neighborhood code and the target code are combined into a new target code. It can be understood that a target code may have two neighborhood codes, one adjacent on each side. The repetition degree differences between the target code and the neighborhood codes on the two sides are determined and compared with the threshold separately. When the repetition degree difference between the target code and the neighborhood code on either side is greater than the preset repetition degree difference threshold, the target code and the neighborhood code on that side are combined into a new target code, and the new neighborhood codes adjacent to the two sides of the new target code are redetermined. This continues until the repetition degree difference between each new neighborhood code and the new target code is smaller than or equal to the preset repetition degree difference threshold, at which point the new target code is used as the code segment to be identified.
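The combination step can be sketched as below, following the condition exactly as stated in the text (a neighbor is merged when the repetition degree difference is greater than the threshold). The text does not say how the repetition degree of a combined target code is obtained, so this sketch assumes it stays equal to that of the original target code; that choice, the function name, and the dictionary layout are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence


def combine_target_code(
    codes: Sequence[int],
    repetition: Dict[int, float],
    index: int,
    diff_threshold: float = 0.1,  # example value given in the description
) -> List[int]:
    """Grow a code segment to be identified around the target code at codes[index]."""
    left = right = index
    # Assumption: the combined code keeps the repetition degree of the original target code.
    current = repetition[codes[index]]
    while True:
        merged = False
        # Left neighborhood code: merge when the difference exceeds the threshold.
        if left > 0 and abs(repetition[codes[left - 1]] - current) > diff_threshold:
            left -= 1
            merged = True
        # Right neighborhood code: same test on the other side.
        if right < len(codes) - 1 and abs(repetition[codes[right + 1]] - current) > diff_threshold:
            right += 1
            merged = True
        if not merged:
            break  # neither side meets the preset combination condition
    return list(codes[left:right + 1])
```

Because each merge consumes one more code from the segment, the loop is bounded by the segment length.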
Further, in the embodiment of the present invention, identifying the code segment to be identified according to the identification priority of the code identification sub-segment, and determining the repetition code segment includes: taking the coded identification sub-segment with the highest identification priority as a reference sub-segment; and determining the occurrence frequency of the coding segments to be identified in the reference sub-segment as the identification number of the coding segments to be identified, and taking the coding segments to be identified, the identification number of which is greater than a preset identification number threshold value, as repeated coding segments.
Of course, the present invention may also use any other possible implementation manner to identify the code segment to be identified according to the identification priority of the code identification sub-segment, for example, traversing the code identification sub-segment according to the identification priority to determine the repetition code segment, which is not limited.
Because the code identification sub-segment with the highest identification priority is used as the reference sub-segment, the repeated code segment can be determined directly according to the code identification sub-segment with the highest identification priority, the problem of low efficiency caused by directly determining the repeated code segment according to all the code identification sub-segments is avoided while the reliability of data identification is ensured, and the identification efficiency of the repeated code segment is effectively improved.
The preset identification number threshold is the threshold applied to the identification number. Optionally, the preset identification number threshold is 20; that is, when the identification number of a code segment to be identified is greater than 20, that code segment is taken as a repeated code segment. The value of the threshold can be adjusted according to the volume of computer storage data, which is not limited here.
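A minimal sketch of this identification step is shown below. It assumes that the "occurrence frequency" of a code segment in the reference sub-segment is the number of positions at which it matches, with overlapping matches counted; the text does not specify this, and the function names are hypothetical.

```python
from typing import List, Sequence


def count_occurrences(reference: Sequence[int], candidate: Sequence[int]) -> int:
    """Count positions in the reference sub-segment where the candidate matches."""
    n, m = len(reference), len(candidate)
    if m == 0 or m > n:
        return 0
    return sum(
        1 for i in range(n - m + 1) if list(reference[i:i + m]) == list(candidate)
    )


def find_repeated_segments(
    reference: Sequence[int],
    candidates: Sequence[Sequence[int]],
    count_threshold: int = 20,  # optional value given in the description
) -> List[Sequence[int]]:
    """Keep candidates whose identification number exceeds the threshold."""
    return [c for c in candidates if count_occurrences(reference, c) > count_threshold]
```

Only the reference sub-segment (the one with the highest identification priority) is scanned, which is the source of the efficiency gain the description claims over scanning every sub-segment.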
In the embodiment of the invention, the computer storage data corresponding to the repeated code segment is taken as the repeated data. Because the codes in the repeated code segment are decimal codes, each code in the repeated code segment can be converted back into the binary format data of the corresponding computer storage data, and the binary format data are concatenated in the order in which the codes appear in the repeated code segment to form the repeated data.
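The conversion back to binary can be illustrated as follows, assuming each decimal code is an ASCII value that fits in one byte and is written with a fixed width of 8 bits (the width is an assumption; the text only states that decimal codes are converted to binary format data). The function name is hypothetical.

```python
from typing import List, Sequence


def codes_to_binary(codes: Sequence[int]) -> List[str]:
    """Convert each decimal code to an 8-bit binary string, preserving order."""
    return [format(code, "08b") for code in codes]


repeated_segment = [65, 66]                    # decimal codes of a repeated code segment
bits = "".join(codes_to_binary(repeated_segment))
raw = bytes(repeated_segment)                  # the same data as raw bytes
```

Here `bits` is the concatenated binary representation in segment order, and `raw` holds the corresponding bytes.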
In the invention, frequency analysis of the codes to be processed within the code segment to be processed allows the repetition degree of each code to be determined accurately, and clustering the codes to be processed according to the repetition degree yields the code point clusters to be processed, so that the codes are clustered effectively. A target point cluster is determined from the code point clusters to be processed according to the repetition degree, and only the target codes in the target point cluster are marked to obtain their marking information. This avoids the time wasted by marking every code, so that marking time is reduced and marking efficiency is improved while the marking effect is preserved. Determining the identification priority of the code identification sub-segments allows the sub-segments to be analyzed effectively and makes it convenient to identify the code segments to be identified in priority order. Identifying according to the identification priority makes the identification of repeated code segments accurate and the identification process more reasonable, reduces the data identification load on the computer, and allows the computer to carry out the identification more quickly. In summary, the invention can effectively improve the identification efficiency of repeated code segments and therefore of repeated data.
It should be noted that the order in which the embodiments of the present invention are presented is for description only and does not indicate the relative merits of the embodiments. The processes depicted in the accompanying drawings do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner. For parts that are identical or similar, the embodiments may be referred to one another, and each embodiment mainly describes its differences from the other embodiments.

Claims (8)

1. A method for identifying duplicate data for computer data storage, the method comprising:
collecting computer storage data, carrying out binary conversion on the computer storage data, and generating a code to be processed, wherein the code to be processed forms a code section to be processed; performing frequency analysis on the code to be processed in the code segment to be processed to obtain the repetition degree of the code to be processed, and performing clustering processing on the code to be processed according to the repetition degree to generate at least one code point cluster to be processed;
determining a target point cluster from the to-be-processed code point cluster according to the repetition degree of all the to-be-processed codes in the to-be-processed code point cluster, taking the to-be-processed codes in the target point cluster as target codes, traversing the to-be-processed code segments, and carrying out marking processing on the target codes in the to-be-processed code segments to obtain marking information of the target codes, wherein the marking information is the number of marks of the target codes in the to-be-processed code segments;
dividing the coding section to be processed into at least two coding identification subsections according to the total number of the coding to be processed and the marking information, and determining the identification priority of the coding identification subsections according to the frequency of the target coding in the coding identification subsections;
according to the marking information and the repetition degree, carrying out coding combination on the target codes to obtain code segments to be identified, identifying the code segments to be identified according to the identification priority of the code identification sub-segments, determining repeated code segments, and taking the computer storage data corresponding to the repeated code segments as repeated data;
wherein performing coding combination on the target code according to the marking information and the repetition degree to obtain a code segment to be identified comprises:
determining other codes adjacent to the target code as neighborhood codes according to the marking information;
determining that the difference in the repetition level of the neighborhood code and the repetition level of the target code is a repetition level difference;
judging whether the neighborhood code and the target code meet a preset combination condition according to the repetition degree difference, if not, directly taking the target code as the code section to be identified, if so, taking the combination of the neighborhood code and the target code as a new target code, determining a new neighborhood code adjacent to the new target code, combining the new target code and the new neighborhood code until the new target code and the new neighborhood code do not meet the preset combination condition, and taking the target code obtained after the combination is ended as the code section to be identified.
2. The method of claim 1, wherein the dividing the code segment to be processed into at least two code identification subsections based on the total number of codes to be processed and the tag information comprises:
dividing the coding section to be processed into two initial subsections on average according to the total number of the coding sections to be processed;
judging whether the initial sub-segment meets a preset dividing condition according to the marking information, and if not, directly taking the initial sub-segment as the coding identification sub-segment; and if so, taking the initial sub-segment as a new coding segment to be processed for average division until the new initial sub-segment obtained after the average division does not meet the preset division condition, ending the average division, and taking the new initial sub-segment obtained after the ending of the division as the coding identification sub-segment.
3. The method of claim 2, wherein the determining whether the initial sub-segment satisfies a preset division condition according to the marking information comprises:
respectively determining the number of marks in the two initial subsections with the maximum number of marks as the number of subsections marks; calculating the difference of the number of the sub-segment marks as a coding number difference, and calculating the ratio of the coding number difference to the number of the marks as a coding number difference ratio;
when the coding quantity difference ratio is larger than a preset difference ratio threshold value, determining that the initial subsection meets the preset dividing condition;
and when the code quantity difference ratio is smaller than or equal to the preset difference ratio threshold, determining that the initial sub-segment does not meet the preset dividing condition.
4. The method of claim 1, wherein the determining whether the neighborhood code and the target code satisfy a preset combination condition according to the repetition level difference comprises:
determining that the preset combination condition is met when the repetition degree difference of the target code and the neighborhood code is larger than a preset repetition degree difference threshold;
and when the repetition degree difference of the target code and the neighborhood code is smaller than or equal to a preset repetition degree difference threshold, determining that the preset combination condition is not met.
5. The method of claim 1, wherein the identifying the code segment to be identified based on the identification priority of the code identification sub-segment, determining a repetition code segment, comprises:
taking the coded identification sub-segment with the highest identification priority as a reference sub-segment;
and determining the occurrence frequency of the coding segments to be identified in the reference subsections as the identification number of the coding segments to be identified, and taking the coding segments to be identified, of which the identification number is greater than a preset identification number threshold value, as repeated coding segments.
6. The method of claim 1, wherein said performing a binary conversion on said computer stored data to generate a pending code comprises:
converting the computer storage data into decimal code based on an ASCII code table, and taking the decimal code as the code to be processed.
7. The method of claim 1, wherein the frequency analysis of the code to be processed within the code segment to be processed to obtain the repetition level of the code to be processed comprises:
and carrying out normalization processing on the frequency of the code to be processed in the code segment to be processed to obtain a frequency normalization value, and taking the frequency normalization value as the repetition degree of the code to be processed.
8. The method of claim 1, wherein said determining a target point cluster from said clusters of code points to be processed based on said repetition level of all said code points to be processed within said clusters of code points to be processed comprises:
calculating the average value of the repetition degree of all the codes to be processed in the code point cluster to be processed, and determining the code point cluster to be processed with the maximum average value of the repetition degree as a target point cluster.
CN202310300571.6A 2023-03-27 2023-03-27 Repeated data identification method for computer data storage Active CN116011403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310300571.6A CN116011403B (en) 2023-03-27 2023-03-27 Repeated data identification method for computer data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310300571.6A CN116011403B (en) 2023-03-27 2023-03-27 Repeated data identification method for computer data storage

Publications (2)

Publication Number Publication Date
CN116011403A CN116011403A (en) 2023-04-25
CN116011403B true CN116011403B (en) 2023-10-03

Family

ID=86035812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310300571.6A Active CN116011403B (en) 2023-03-27 2023-03-27 Repeated data identification method for computer data storage

Country Status (1)

Country Link
CN (1) CN116011403B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272933B (en) * 2023-11-23 2024-02-02 山东华善建筑工程有限公司 Concrete pavement report data storage method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026397A (en) * 1996-05-22 2000-02-15 Electronic Data Systems Corporation Data analysis system and method
JP2015064856A (en) * 2013-08-30 2015-04-09 株式会社日立ソリューションズ Data analysis program, data analysis method, and data analyzer
CN110276764A (en) * 2019-05-29 2019-09-24 南京工程学院 K-Means underwater picture background segment innovatory algorithm based on the estimation of K value

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105900092B (en) * 2014-03-26 2019-05-14 株式会社日立制作所 Time series data management method and time series data management system
US10929042B2 (en) * 2016-10-20 2021-02-23 Hitachi, Ltd. Data storage system, process, and computer program for de-duplication of distributed data in a scalable cluster system

Also Published As

Publication number Publication date
CN116011403A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN107291842B (en) Track query method based on track coding
CN116011403B (en) Repeated data identification method for computer data storage
CN112953550B (en) Data compression method, electronic device and storage medium
CN108021650A (en) A kind of efficient storage of time series data and reading system
CN116303374B (en) Multi-dimensional report data optimization compression method based on SQL database
CN111325156B (en) Face recognition method, device, equipment and storage medium
CN115955513B (en) Data optimization transmission method for Internet of things
CN111291824B (en) Time series processing method, device, electronic equipment and computer readable medium
EP4167107A1 (en) Image retrieval system, method and apparatus
CN102622353A (en) Fixed audio retrieval method
CN112287657B (en) Information matching system based on text similarity
CN115882867B (en) Data compression storage method based on big data
CN115865099B (en) Huffman coding-based multi-type data segment compression method and system
CN115048682A (en) Safe storage method of land circulation information
CN115186138A (en) Comparison method and terminal for power distribution network data
CN113407576A (en) Data association method and system based on dimension reduction algorithm
CN113065597A (en) Clustering method, device, equipment and storage medium
CN104714953A (en) Time series data motif identification method and device
CN107992590B (en) Big data system beneficial to information comparison
CN116028500B (en) Range query indexing method based on high-dimensional data
CN115955250B (en) College scientific research data acquisition management system
CN117216022B (en) Digital engineering consultation data management system
Shi et al. Variable-Length Quantization Strategy for Hashing
CN114186964A (en) Active engineering file implementation collecting method based on information difference comparison
CN116489002A (en) Fault delimitation method based on machine learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant