CN112559257B - Data storage method based on data screening - Google Patents


Info

Publication number
CN112559257B
Authority
CN
China
Prior art keywords
data
coefficient
level
similarity
hierarchy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110189565.9A
Other languages
Chinese (zh)
Other versions
CN112559257A (en)
Inventor
金树柏
罗玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dcs Technology Co ltd
Original Assignee
Shenzhen Dcs Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dcs Technology Co ltd
Priority to CN202110189565.9A
Publication of CN112559257A
Application granted
Publication of CN112559257B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data storage method based on data screening. The method comprises obtaining data to be stored, wherein the data to be stored comprises a plurality of hierarchical data and each hierarchical data further comprises data information. A hierarchy matrix G (G1, G2, G3, G4, G5) and a data similarity coefficient matrix K (K1, K2, K3, K4, K5) are provided in a central processing unit. When data similarity is compared, if the similarity of two data of any level is greater than or equal to the coefficient corresponding to that level, the two data of that level are highly similar and are judged to be repeated data; otherwise, no data screening is performed. The screened hierarchical data and the corresponding data information under the hierarchical data are stored as off-site backup data of the current data hierarchy. The stored data are divided according to a specific format and a different similarity coefficient is set for each hierarchy, so that the similarity of the data can be judged accurately; lower levels, which contain less data, use higher similarity coefficients, which improves the accuracy of data comparison.

Description

Data storage method based on data screening
Technical Field
The invention relates to the field of data storage, in particular to a data storage method based on data screening.
Background
At present, as informatization develops, large amounts of information data are generated rapidly. From these data, people can analyze the actual operating conditions of enterprises and mine the high-value information they contain. How to manage these various types of data efficiently has therefore become a key research topic at the present stage. Even if enough storage devices are built to store the data, transmitting the data occupies a large amount of network bandwidth and causes network congestion.
Stored data contain many duplicates. Such duplicates are usually backup copies created to ensure data stability and avoid loss, while others arise when the same data is stored more than once because of an operating error or other factors. Under rapidly growing data volumes, current storage systems are challenged in many respects. To further increase storage speed, effective measures are needed to eliminate redundant information, which is also a key way of overcoming limits on storage capacity. A redundancy screening method can be introduced to eliminate the repeated data found in each file after analysis and processing, thereby reducing the amount of data and the storage space it requires.
However, data screening usually relies on blocking, and existing blocking rules divide the data according to its content. The resulting data blocks are often too long or too short, which is still not conducive to data storage, limits the speed of data storage, and affects the integrity of the data content.
Disclosure of Invention
Therefore, the invention provides a data storage method based on data screening, which can determine the similarity of data through the comparison of hierarchical data so as to screen the data and improve the screening efficiency.
In order to achieve the above object, the present invention provides a data storage method based on data screening, which includes:
acquiring data to be stored, wherein the data to be stored comprises a plurality of hierarchical data and data information under each hierarchical data;
a hierarchy matrix G (G1, G2, G3, G4, G5) is further provided within the central processor, wherein G1 represents a first hierarchy, G2 represents a second hierarchy, G3 represents a third hierarchy, G4 represents a fourth hierarchy, and G5 represents a fifth hierarchy, wherein data of the first hierarchy is larger than data of the second hierarchy, data of the second hierarchy is larger than data of the third hierarchy, data of the third hierarchy is larger than data of the fourth hierarchy, and data of the fourth hierarchy is larger than data of the fifth hierarchy;
the central processing unit is also internally provided with a first coefficient K1, a second coefficient K2, a third coefficient K3, a fourth coefficient K4 and a fifth coefficient K5, wherein K1 is smaller than K2, K2 is smaller than K3, K3 is smaller than K4, and K4 is smaller than K5;
when the first level contains the second level, the second level contains the third level, the third level contains the fourth level, and the fourth level contains the fifth level, the data of the first level and the data of the fourth level are screened at this point;
if the data is in the first level, selecting a first coefficient K1 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, and if the similarity of the two data in the first level is smaller than a first coefficient K1, indicating that the two data are different;
when a second level of data is compared, selecting a second coefficient K2 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the second level is greater than or equal to a second coefficient K2, indicating that the similarity of the two data of the second level is high, determining that the data are repeated, and performing data screening on the second level and the data contained in the second level;
if the similarity of the two data of the second hierarchy is smaller than a second coefficient K2, the two data are different;
when a third level of data is compared, selecting a third coefficient K3 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the third level is more than or equal to the third coefficient K3, indicating that the similarity of the two data of the third level is higher, determining that the data are repeated, and then performing data screening on the third level and the data contained in the third level;
if the similarity of the two data at the third level is less than a third coefficient K3, the two data are different;
when a fourth level of data is compared, selecting a fourth coefficient K4 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the same fourth level is greater than or equal to a fourth coefficient K4, indicating that the similarity of the two data of the fourth level is high, determining that the data are repeated, and performing data screening on the fourth level and the data contained in the fourth level;
if the similarity of the two data of the fourth level is smaller than a fourth coefficient K4, the two data are different;
when the fifth level of the data is compared, selecting a fifth coefficient K5 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the fifth level is more than or equal to a fifth coefficient K5, indicating that the similarity of the two data of the fifth level is higher, determining that the data are repeated, and performing data screening on the fifth level and the data contained in the fifth level;
if the similarity of the two data of the fifth level is smaller than a fifth coefficient K5, the two data are different;
storing the screened hierarchical data and corresponding data information under the hierarchical data into a mirror pool as off-site backup data of the current hierarchical data;
before any data is subjected to screening comparison, data to be compared with the data is determined, and a data correlation matrix D (R1, R2, R3 and R4) is further arranged in the central processing unit, wherein R1 represents a first correlation, R2 represents a second correlation, R3 represents a third correlation, R4 represents a fourth correlation, R1 is larger than R2, R2 is larger than R3, and R3 is larger than R4;
in the current database, determining the degree of correlation R between the current data and any one of the other data in the database;
setting a data similarity comprehensive evaluation coefficient K00,
K00=K1/K10+K2/K20+K3/K30+K4/K40+K5/K50+R/(R1+R2+R3+R4),
wherein K10 represents a standard level coefficient corresponding to a first level, K20 represents a standard level coefficient corresponding to a second level, K30 represents a standard level coefficient corresponding to a third level, K40 represents a standard level coefficient corresponding to a fourth level, and K50 represents a standard level coefficient corresponding to a fifth level.
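By way of illustration only, the comprehensive evaluation coefficient K00 defined above can be sketched in Python. The function name and all numeric values are assumptions chosen for the example; the source does not give concrete values for the coefficients, the standard level coefficients, or the correlation thresholds.

```python
from typing import Sequence


def comprehensive_coefficient(
    k: Sequence[float],
    k_std: Sequence[float],
    r: float,
    r_thresholds: Sequence[float],
) -> float:
    """Return K00 = sum(Ki / Ki0) + R / (R1 + R2 + R3 + R4)."""
    if len(k) != len(k_std):
        raise ValueError("k and k_std must have the same number of levels")
    # One ratio per level: level coefficient over its standard level coefficient.
    level_term = sum(ki / ki0 for ki, ki0 in zip(k, k_std))
    # Correlation of the candidate data, normalised by the sum of the thresholds.
    correlation_term = r / sum(r_thresholds)
    return level_term + correlation_term


# Hypothetical values, not taken from the patent.
K = [0.5, 0.6, 0.7, 0.8, 0.9]           # K1..K5, strictly increasing
K_STD = [0.5, 0.6, 0.7, 0.8, 0.9]       # K10..K50, here equal to K1..K5
R_THRESHOLDS = [0.8, 0.6, 0.4, 0.2]     # R1..R4, strictly decreasing
k00 = comprehensive_coefficient(K, K_STD, r=1.0, r_thresholds=R_THRESHOLDS)
```

With these assumed inputs every level ratio equals 1, so the level term contributes 5 and the correlation term contributes R divided by the sum of the four thresholds.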
Further, if the degree of correlation R with the current data is greater than or equal to the first degree of correlation R1, the priority of similarity comparison with the current data is the highest, and the data is first-priority data;
if the first degree of correlation R1 > the degree of correlation R with the current data ≥ the second degree of correlation R2, the data is second-priority data for similarity comparison with the current data;
if the second degree of correlation R2 > the degree of correlation R with the current data ≥ the third degree of correlation R3, the data is third-priority data;
if the third degree of correlation R3 > the degree of correlation R with the current data ≥ the fourth degree of correlation R4, the priority is the lowest, and the data is fourth-priority data;
if the degree of correlation R with the current data < the fourth degree of correlation R4, no similarity comparison with the current data is required.
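The priority bands above can be sketched as a small Python function. This is a minimal illustration under the stated ordering R1 > R2 > R3 > R4; the function name and the threshold values used in the usage lines are assumptions, not values from the patent.

```python
from typing import Optional


def comparison_priority(
    r: float, r1: float, r2: float, r3: float, r4: float
) -> Optional[int]:
    """Map a degree of correlation R to a comparison priority 1..4.

    Returns None when R is below R4, meaning no similarity comparison
    with the current data is required.
    """
    if not (r1 > r2 > r3 > r4):
        raise ValueError("thresholds must satisfy R1 > R2 > R3 > R4")
    if r >= r1:
        return 1
    if r >= r2:
        return 2
    if r >= r3:
        return 3
    if r >= r4:
        return 4
    return None


# Hypothetical thresholds for illustration.
THRESHOLDS = (0.9, 0.7, 0.5, 0.3)
first = comparison_priority(0.95, *THRESHOLDS)
skipped = comparison_priority(0.10, *THRESHOLDS)
```

Candidates can then be sorted by the returned priority so that first-priority data is compared before second-priority data, and so on.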
Further, the degree of correlation of two data is determined from their byte length and key information. If the byte lengths differ, the two data are not regarded as similar data; if the byte lengths are the same, the two data may be similar data, and it is then determined whether the key information of the two data is the same; if the key information is also the same, the levels in the data structure need to be further compared to determine the degree of correlation.
Further, data Gi of any hierarchy, i = 1, 2, 3, 4, 5, includes n data structures, each of which takes a value Lj, and the similarity of data Gi of any hierarchy is calculated as Si = Σ Lj × aj, where aj represents the weight coefficient corresponding to Lj, j = 1, 2, 3, …, n.
Further, a data similarity coefficient standard K0 is set in the central processing unit, and when the similarity coefficient of any two hierarchical data in the hierarchical data is greater than or equal to the data similarity coefficient standard K0, one of the hierarchical data and the corresponding data information under the hierarchical data are deleted;
the data similarity coefficient standard K0 is
K0=(K1/K2+K2/K3+K3/K4+K4/K5)/4+4R4/(R1+R2+R3+R4).
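As a hedged illustration, the standard K0 can be computed as follows. The function name and the sample numbers are assumptions for the example only. Because K1 < K2 < K3 < K4 < K5 and R4 is the smallest correlation threshold, each of the two terms is below 1 for positive inputs, so K0 stays below 2.

```python
from typing import Sequence


def similarity_standard(k: Sequence[float], r: Sequence[float]) -> float:
    """Return K0 = (K1/K2 + K2/K3 + K3/K4 + K4/K5) / 4 + 4*R4 / (R1+R2+R3+R4)."""
    if len(k) != 5 or len(r) != 4:
        raise ValueError("expected five level coefficients and four thresholds")
    # Mean of the ratios between consecutive level coefficients.
    ratio_mean = sum(k[i] / k[i + 1] for i in range(4)) / 4
    # Smallest correlation threshold relative to the mean of all four.
    correlation_term = 4 * r[3] / sum(r)
    return ratio_mean + correlation_term


# Hypothetical values: each K ratio is 0.5, and 4*R4 / sum(R) is 0.4.
k0 = similarity_standard([1.0, 2.0, 4.0, 8.0, 16.0], [4.0, 3.0, 2.0, 1.0])
```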
Further, when n is 3, L1 corresponds to a weight coefficient a1 = 0.75;
L2 corresponds to a weight coefficient a2 = 0.15;
L3 corresponds to a weight coefficient a3 = 0.1.
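The weighted sum Si = Σ Lj × aj with the n = 3 weights above can be sketched as follows. The function name is an assumption, and the source does not specify how the values Lj are obtained, so the inputs in the usage lines are illustrative numbers only.

```python
from typing import Sequence

# Weights a1, a2, a3 for n = 3 as stated in the text.
WEIGHTS_N3 = (0.75, 0.15, 0.10)


def level_similarity(
    values: Sequence[float], weights: Sequence[float] = WEIGHTS_N3
) -> float:
    """Return Si = sum(Lj * aj) for one hierarchy level."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per data structure value")
    return sum(l_j * a_j for l_j, a_j in zip(values, weights))


# Illustrative inputs: all three structures score 1, then only the last two.
full = level_similarity([1.0, 1.0, 1.0])
partial = level_similarity([0.0, 1.0, 1.0])
```

Since the three weights sum to 1, values Lj in the range 0 to 1 give a similarity in the same range, and the first structure dominates the result.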
Further, storing the screened hierarchical data and the corresponding data information under the hierarchical data into a mirror pool as the off-site backup data of the current data hierarchy includes:
setting a key for the off-site backup data, wherein the key is generated according to the byte length of the screened data.
Furthermore, a key matrix P (P1, P2, P3, P4, P5 …, Pn) is also arranged in the central processing unit, wherein P1 represents a first key corresponding to a first byte length, P2 represents a second key corresponding to a second byte length, P3 represents a third key corresponding to a third byte length, P4 represents a fourth key corresponding to a fourth byte length, P5 represents a fifth key corresponding to a fifth byte length, Pn represents an nth key corresponding to an nth byte length;
in a mirror pool, if the byte length of the off-site backup data belongs to a first byte length, a first key is selected from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data;
and if the byte length of the off-site backup data belongs to the nth byte length, selecting the nth key from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data.
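A minimal sketch of selecting a key from the key matrix P by byte length is given below. The source does not define the byte-length classes or the key format, so the class boundaries, the placeholder key values, and the function name are all assumptions; the encryption step itself is left out.

```python
import bisect

# Assumed inclusive upper bounds of the first four byte-length classes.
# Anything longer falls into a fifth class. These values are hypothetical.
LENGTH_UPPER_BOUNDS = [64, 256, 1024, 4096]

# Placeholder entries standing in for P1..P5; not real key material.
KEY_MATRIX = [b"key-P1", b"key-P2", b"key-P3", b"key-P4", b"key-P5"]


def select_key(byte_length: int) -> bytes:
    """Return the key whose byte-length class contains byte_length."""
    if byte_length < 0:
        raise ValueError("byte length cannot be negative")
    # bisect_left finds the first class whose upper bound is >= byte_length.
    index = bisect.bisect_left(LENGTH_UPPER_BOUNDS, byte_length)
    return KEY_MATRIX[index]


key_for_small_backup = select_key(48)
key_for_large_backup = select_key(10_000)
```

The selected key would then be passed to whatever cipher the mirror pool uses to encrypt the off-site backup data.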
Further, a key updating coefficient S (S1, S2, S3) is set in the central processor, wherein S1 represents a first updating coefficient, S2 represents a second updating coefficient, and S3 represents a third updating coefficient;
when data amounting to 1/3 of the total data volume in the database has been screened, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the first updating coefficient S1;
if data amounting to 1/4 of the total data volume has been screened, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the second updating coefficient S2;
if data amounting to 1/5 of the total data volume has been screened, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the third updating coefficient S3.
Further, the first update coefficient S1= R1/R2;
the second update coefficient S2= R2/R3;
the third update coefficient S3= R3/R4.
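The choice of update coefficient can be sketched as follows. The source names the fractions 1/3, 1/4 and 1/5 but does not say how shares between them are treated or how a coefficient is applied to the keys, so this sketch assumes "at least" semantics for each fraction and only returns the coefficient. The function name and sample values are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def update_coefficient(
    screened: int, total: int, r1: float, r2: float, r3: float, r4: float
) -> Optional[float]:
    """Pick S1 = R1/R2, S2 = R2/R3 or S3 = R3/R4 from the screened share.

    Assumption: a share of at least 1/3 selects S1, at least 1/4 selects S2,
    at least 1/5 selects S3, and a smaller share triggers no update (None).
    """
    if total <= 0 or not 0 <= screened <= total:
        raise ValueError("need 0 <= screened <= total and total > 0")
    share = Fraction(screened, total)  # exact comparison, no float rounding
    if share >= Fraction(1, 3):
        return r1 / r2
    if share >= Fraction(1, 4):
        return r2 / r3
    if share >= Fraction(1, 5):
        return r3 / r4
    return None


# Hypothetical thresholds R1..R4 = 24, 6, 2, 1.
s_large = update_coefficient(40, 100, 24.0, 6.0, 2.0, 1.0)
s_none = update_coefficient(10, 100, 24.0, 6.0, 2.0, 1.0)
```

Because R1 > R2 > R3 > R4, each of the three coefficients is greater than 1 for positive thresholds.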
Compared with the prior art, the invention has the advantage that the stored data are divided according to a specific format and a different similarity coefficient is set for each hierarchy, so that each hierarchy has its own similarity evaluation standard. A lower coefficient can therefore be used for higher hierarchies: because higher-level data contain more data, a similarity coefficient that is too high would impair the accuracy of data screening.
Particularly, the relevance of other data and the current data is determined to be sorted, and the higher the relevance is, the higher the priority level of similarity comparison is, so that the analysis efficiency of data screening is higher, and the screening efficiency is improved.
Particularly, before hierarchical comparison is carried out on the data, the byte length of the data and the key information are required to be compared, and a part of data which is obviously impossible to be repeated data can be screened out by roughly comparing the byte length with the key information, so that the comparison time is saved, the efficiency of data comparison and analysis is greatly improved, and the efficiency of data screening is further improved.
In particular, different weight coefficients are set for different data structures of data of any hierarchy, so that the screening and analysis of the data structures are facilitated, and the screening accuracy and the screening efficiency are effectively improved.
Particularly, a data similarity coefficient standard is set: if the similarity coefficient of any two hierarchical data is greater than or equal to the data similarity coefficient standard K0, the data is identified as data to be screened, which makes the judgment more intuitive and convenient. In practical application, the data similarity coefficient standard is expressed using the coefficient parameters of the similarity coefficient matrix and the parameters of the data correlation matrix.
In particular, in the data storage method based on data screening provided by the embodiment of the invention, the encrypted storage of the data is realized by setting the key for the backup data, the data tampering by a third party is prevented, the safety of the stored data is effectively improved, when the data is damaged, the damaged data can be repaired according to the data in the mirror image pool, and the safety and the stability of the database are improved.
Particularly, in the embodiment of the invention, data of different byte lengths are encrypted with different keys, which improves the security of the data. In actual operation, if a single key were used for all data, cracking that key would expose all of the data, which poses a great risk.
Particularly, the key matrix is updated through evaluation of the data amount to be screened in the database, and the dynamic key is used for storage, so that the storage of data is safer, and the update of the key matrix is performed according to the actual data amount of the database, so that association is established between data simplification and remote backup.
Drawings
Fig. 1 is a flowchart of a data storage method based on data screening according to an embodiment of the present invention.
Detailed Description
In order that the objects and advantages of the invention will be more clearly understood, the invention is further described below with reference to examples; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and do not limit the scope of the present invention.
It should be noted that in the description of the present invention, the terms of direction or positional relationship indicated by the terms "upper", "lower", "left", "right", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, which are only for convenience of description, and do not indicate or imply that the device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, it should be noted that, in the description of the present invention, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Referring to fig. 1, a data storage method based on data filtering according to an embodiment of the present invention includes:
S100, acquiring data to be stored, wherein the data to be stored comprises a plurality of hierarchical data, and data information is also included under each hierarchical data; the data information can be multimedia data such as images, video and audio, or data such as characters and/or numbers.
S200, a hierarchy matrix G (G1, G2, G3, G4 and G5) and a data similarity coefficient matrix K (K1, K2, K3, K4 and K5) are further arranged in the central processing unit, when data similarity comparison is carried out, if the similarity of two data of any hierarchy is larger than or equal to a coefficient corresponding to the hierarchy, the similarity of the two data of the hierarchy is high, the data is judged to be repeated data, a data similarity coefficient standard K0 is arranged in the central processing unit, and when the similarity coefficient of any two hierarchies in the hierarchy is larger than or equal to a data similarity coefficient standard K0, one hierarchy and corresponding data information under the hierarchy are deleted;
S300, storing the hierarchical data remaining after the deletion, together with the corresponding data information under the hierarchical data.
Specifically, in the embodiment of the invention, any two hierarchical data are compared against the data similarity coefficient standard K0, and if their similarity coefficient is greater than or equal to K0, one of the hierarchical data and the corresponding data information under it are deleted, so that the data is screened more conveniently, more accurately, and with high efficiency.
A hierarchy matrix G (G1, G2, G3, G4, G5) is further provided within the central processor, wherein G1 represents a first hierarchy, G2 represents a second hierarchy, G3 represents a third hierarchy, G4 represents a fourth hierarchy, and G5 represents a fifth hierarchy, wherein data of the first hierarchy is larger than data of the second hierarchy, data of the second hierarchy is larger than data of the third hierarchy, data of the third hierarchy is larger than data of the fourth hierarchy, and data of the fourth hierarchy is larger than data of the fifth hierarchy; and for any data, any one of the first hierarchy, the second hierarchy, the third hierarchy, the fourth hierarchy and the fifth hierarchy can be empty.
The central processing unit is also internally provided with a first coefficient K1, a second coefficient K2, a third coefficient K3, a fourth coefficient K4 and a fifth coefficient K5, wherein K1 is smaller than K2, K2 is smaller than K3, K3 is smaller than K4, and K4 is smaller than K5;
when the first level contains the second level, the second level contains the third level, the third level contains the fourth level, and the fourth level contains the fifth level, the data of the first level and the data of the fourth level are screened at this point;
if the data is in the first level, selecting a first coefficient K1 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, and if the similarity of the two data in the first level is smaller than a first coefficient K1, indicating that the two data are different;
when a second level of data is compared, selecting a second coefficient K2 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the second level is greater than or equal to a second coefficient K2, indicating that the similarity of the two data of the second level is high, determining that the data are repeated, and performing data screening on the second level and the data contained in the second level;
if the similarity of the two data of the second hierarchy is smaller than a second coefficient K2, the two data are different;
when a third level of data is compared, selecting a third coefficient K3 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the third level is more than or equal to the third coefficient K3, indicating that the similarity of the two data of the third level is higher, determining that the data are repeated, and then performing data screening on the third level and the data contained in the third level;
if the similarity of the two data at the third level is less than a third coefficient K3, the two data are different;
when a fourth level of data is compared, selecting a fourth coefficient K4 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the same fourth level is greater than or equal to a fourth coefficient K4, indicating that the similarity of the two data of the fourth level is high, determining that the data are repeated, and performing data screening on the fourth level and the data contained in the fourth level;
if the similarity of the two data of the fourth level is smaller than a fourth coefficient K4, the two data are different;
when the fifth level of the data is compared, selecting a fifth coefficient K5 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the fifth level is more than or equal to a fifth coefficient K5, indicating that the similarity of the two data of the fifth level is higher, determining that the data are repeated, and performing data screening on the fifth level and the data contained in the fifth level;
if the similarity of the two data of the fifth level is smaller than a fifth coefficient K5, the two data are different;
before any data is subjected to screening comparison, data to be compared with the data is determined, and a data correlation matrix D (R1, R2, R3 and R4) is further arranged in the central processing unit, wherein R1 represents a first correlation, R2 represents a second correlation, R3 represents a third correlation, R4 represents a fourth correlation, R1 is larger than R2, R2 is larger than R3, and R3 is larger than R4;
for any data among the other data in the current database, excluding the current data, the degree of correlation R between that data and the current data is determined, and a data similarity comprehensive evaluation coefficient K00 is set,
K00=K1/K10+K2/K20+K3/K30+K4/K40+K5/K50+R/(R1+R2+R3+R4),
wherein K10 represents a standard level coefficient corresponding to a first level, K20 represents a standard level coefficient corresponding to a second level, K30 represents a standard level coefficient corresponding to a third level, K40 represents a standard level coefficient corresponding to a fourth level, and K50 represents a standard level coefficient corresponding to a fifth level.
The data storage method based on data screening provided by the embodiment of the invention divides the stored data according to a specific format and sets a different similarity coefficient for each hierarchy, so that each hierarchy has its own similarity evaluation standard. Higher-level data contain more data, and a similarity coefficient that is too high would impair the accuracy of data screening, so higher-level data use a lower similarity coefficient. Lower-level data contain less data, which makes their similarity easier to judge accurately, so lower-level data use a higher similarity coefficient to improve the accuracy of data comparison.
Specifically, setting the data similarity comprehensive evaluation coefficient K00 makes the evaluation of the data more accurate. In addition to the hierarchical similarity of the data, the evaluation refers to the real-time degree of correlation of the data and the parameters of the data correlation matrix D (R1, R2, R3, R4), which makes the similarity evaluation more accurate, further improves the screening accuracy, and in turn improves the efficiency of data transmission and backup.
In the embodiment of the invention, when data are compared, if the similarity of two data of the first level exceeds the first coefficient, the subsequent data do not need to be compared, and the screening operation is directly executed, so that the time for comparing data of other levels is saved, and the efficiency for screening the data is greatly improved.
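The level-by-level comparison with early exit described above can be sketched as follows. This is one possible reading of the procedure, not a definitive implementation: levels are walked from the first to the fifth, an empty level is skipped, and the walk stops at the first level whose similarity reaches its coefficient. The function name and sample coefficients are assumptions.

```python
from typing import Optional, Sequence


def first_duplicate_level(
    similarities: Sequence[Optional[float]],
    coefficients: Sequence[float],
) -> Optional[int]:
    """Return the first level (1-based) judged to hold repeated data.

    similarities[i] is the similarity of the two data at level i + 1, or
    None when that level is empty. Returns None if no level reaches its
    coefficient, meaning the two data are treated as different.
    """
    if len(similarities) != len(coefficients):
        raise ValueError("need one similarity entry per coefficient")
    if any(a >= b for a, b in zip(coefficients, coefficients[1:])):
        raise ValueError("coefficients must satisfy K1 < K2 < ... < K5")
    for level, (s, k) in enumerate(zip(similarities, coefficients), start=1):
        if s is None:
            continue  # empty level, nothing to compare
        if s >= k:
            # The level and everything it contains can be screened,
            # so deeper levels need not be compared.
            return level
    return None


# Hypothetical coefficients K1..K5.
K = [0.5, 0.6, 0.7, 0.8, 0.9]
hit_top = first_duplicate_level([0.55, 0.1, 0.1, 0.1, 0.1], K)
hit_third = first_duplicate_level([0.4, None, 0.75, 0.95, 0.95], K)
```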
Specifically, if the degree of correlation R with the current data is greater than or equal to the first degree of correlation R1, the priority of similarity comparison with the current data is the highest, and the data is first-priority data;
if the first degree of correlation R1 > the degree of correlation R with the current data ≥ the second degree of correlation R2, the data is second-priority data for similarity comparison with the current data;
if the second degree of correlation R2 > the degree of correlation R with the current data ≥ the third degree of correlation R3, the data is third-priority data;
if the third degree of correlation R3 > the degree of correlation R with the current data ≥ the fourth degree of correlation R4, the priority is the lowest, and the data is fourth-priority data;
if the degree of correlation R with the current data is smaller than the fourth degree of correlation R4, no similarity comparison with the current data is needed. When data similarity is compared, data are compared with the current data in the order of first-priority data, second-priority data, third-priority data and fourth-priority data.
Specifically, according to the data storage method based on data screening provided by the embodiment of the present invention, the relevance between other data and the current data is determined and ranked, and the higher the relevance is, the higher the priority level of the similarity comparison is, so that the analysis efficiency of data screening is higher, and the screening efficiency is improved.
Specifically, the correlation of two data is determined from their byte length and key information. If the byte lengths are different, the two data are not regarded as similar data. If the byte lengths are the same, the two data may be similar data, and it is then determined whether the key information of the two data is the same; if the key information is also the same, the levels in the data structure need to be further compared to determine the correlation.
Specifically, according to the data storage method based on data screening provided in the embodiment of the present invention, before the hierarchical comparison is performed, the byte length and the key information of the data are compared first. This rough comparison removes the part of the data that obviously cannot be duplicate data, which saves comparison time, greatly improves the efficiency of data comparison and analysis, and further improves the efficiency of data screening.
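A minimal sketch of this coarse pre-filter is shown below. The patent does not define what "key information" is, so the extractor `key_info` (here, the bytes before the first colon) is purely a hypothetical stand-in, as are the record formats.

```python
def key_info(data: bytes) -> bytes:
    """Hypothetical key-information extractor: the bytes before the first colon."""
    return data.split(b":", 1)[0]


def may_be_duplicate(a: bytes, b: bytes) -> bool:
    """Cheap check that runs before any level-by-level comparison."""
    if len(a) != len(b):
        return False  # different byte length: ruled out immediately
    return key_info(a) == key_info(b)  # same length: compare key information


# Different byte lengths, as in the description's own example strings.
print(may_be_duplicate(b"I is Chinese", b"I is not Chinese"))
# Same length and same key information: needs the full hierarchical comparison.
print(may_be_duplicate(b"user:alice", b"user:carol"))
# Same length but different key information.
print(may_be_duplicate(b"user:alice", b"item:alice"))
```

Only pairs for which `may_be_duplicate` returns `True` would go on to the more expensive hierarchical comparison.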
Specifically, the data storage method based on data screening provided by the embodiment of the present invention is illustrated by the following example. Suppose the data to be stored in the database is 'I is Chinese' and the other data is 'I is not Chinese'. Byte length comparison directly rules these out as similar data, so no screening is performed and the data is stored in the database and in the mirror pool. If, however, the data to be stored are 'I is Chinese' and 'he is Chinese', data of this structure is divided into three segments: 'I' is the first segment L1, 'is' is the second segment L2, and 'Chinese' is the third segment L3; correspondingly, in the other data 'he' is the first segment, 'is' is the second segment, and 'Chinese' is the third segment. In this example the contents of the second and third segments are identical, while the contents of the first segments are completely different. In a specific implementation, the similarity S of the data is determined by assigning a weight to each segment of the data structure, where the weight of the first segment is a1, the weight of the second segment is a2, and the weight of the third segment is a3. The similarity of the data 'I is Chinese' is calculated as S1 = L1 × a1 + L2 × a2 + L3 × a3, and the similarity of the data 'he is Chinese' is calculated as S2 = L1' × a1 + L2 × a2 + L3 × a3. In the data structure, the weight a1 of the first segment is greater than the weight a2 of the second segment, which is greater than or equal to the weight a3 of the third segment; in the embodiment of the invention, a1 is far greater than a2 and a3, so that the similarity of the data can be judged accurately.
According to the embodiment of the invention, different weight coefficients are set for different data structures of data of any hierarchy, so that the screening analysis of the data structures is facilitated, and the screening accuracy and the screening efficiency are effectively improved.
Specifically, data Gi of any hierarchy (i = 1, 2, 3, 4, 5) includes n data structures, each of which takes a value Lj, and the similarity calculation formula for data Gi of any hierarchy is Si = Σ Lj × aj, where aj represents the weight coefficient corresponding to Lj, and j = 1, 2, 3, …, n.
Specifically, in the embodiment of the invention, every data structure in the data of any hierarchy participates in the similarity calculation, and each data structure is given a different weight coefficient, so the similarity of the data can be reflected flexibly, the screening is more accurate, and both the screening accuracy and the screening efficiency are improved.
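The weighted sum Si = Σ Lj × aj can be sketched as below. The patent does not say how each value Lj is obtained, so this sketch assumes Lj is a per-segment match score between two data (1 if the segments are equal, 0 otherwise); that reading, and the function name, are assumptions. The weights 0.75, 0.15 and 0.1 are the values the description gives for n = 3.

```python
WEIGHTS = (0.75, 0.15, 0.10)  # a1, a2, a3 for n = 3, as given in the description


def weighted_similarity(a, b, weights=WEIGHTS):
    """Return sum(Lj * aj), with Lj assumed to be 1.0 for equal segments, else 0.0."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("both data must have one segment per weight")
    return sum((1.0 if x == y else 0.0) * w for x, y, w in zip(a, b, weights))


# The description's own example, split into its three segments.
first = ("I", "is", "Chinese")
second = ("he", "is", "Chinese")
score = weighted_similarity(first, second)
print(score)
```

Under this assumption the two example strings differ only in the heavily weighted first segment, so the score is about 0.25, whereas identical data would score 1.0.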
Specifically, a data similarity coefficient standard K0 is set in the central processing unit, and when the similarity coefficient of any two hierarchical data in the hierarchical data is greater than or equal to the data similarity coefficient standard K0, one of the hierarchical data and the corresponding data information under that hierarchical data are deleted;
the data similarity coefficient standard K0 is
K0 = (K1/K2 + K2/K3 + K3/K4 + K4/K5)/4 + 4R4/(R1 + R2 + R3 + R4).
Specifically, with the data similarity coefficient standard set, a similarity coefficient of any two levels of data that is greater than or equal to the standard K0 indicates that the data is to be screened out, which makes the judgment more intuitive and convenient. In practical application, the data similarity coefficient standard is expressed using the coefficients in the similarity coefficient matrix and the parameters in the data correlation matrix.
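The K0 formula can be evaluated directly, as in the sketch below. The function name and the numeric inputs are hypothetical; they are chosen only so that each term is easy to check by hand.

```python
def similarity_standard(k, d):
    """Return K0 = (K1/K2 + K2/K3 + K3/K4 + K4/K5) / 4 + 4 * R4 / (R1 + R2 + R3 + R4)."""
    if len(k) != 5:
        raise ValueError("expected five coefficients K1..K5")
    if len(d) != 4:
        raise ValueError("expected four correlation degrees R1..R4")
    ratio_term = sum(k[i] / k[i + 1] for i in range(4)) / 4
    relevance_term = 4 * d[3] / sum(d)
    return ratio_term + relevance_term


K = (1.0, 2.0, 4.0, 8.0, 16.0)  # hypothetical, K1 < K2 < K3 < K4 < K5
D = (4.0, 3.0, 2.0, 1.0)        # hypothetical, R1 > R2 > R3 > R4
k0 = similarity_standard(K, D)
print(k0)
```

With these assumed inputs every ratio Ki/Ki+1 is 0.5 and the correlation term is 4 × 1 / 10 = 0.4, so K0 is about 0.9.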
Specifically, when n is 3, the weight coefficient corresponding to L1 is a1 = 0.75;
the weight coefficient corresponding to L2 is a2 = 0.15;
the weight coefficient corresponding to L3 is a3 = 0.1.
Specifically, the weight coefficient of a simple data structure is set, and the similarity of data is calculated by adopting the weight coefficient in the embodiment of the invention, so that the method is visual, convenient, quick and accurate, the convenience of calculation is improved, the calculation efficiency is improved, and the processing speed of data screening is further improved.
Specifically, storing the screened hierarchical data and the corresponding data information under the hierarchical data into a mirror pool as the off-site backup data of the current hierarchical data includes:
setting a key for the stored data, wherein the key is generated according to the byte length of the screened data.
Specifically, in the data storage method based on data screening provided by the embodiment of the present invention, by setting a key for backup data, encrypted storage of the data is realized, tampering of the data by a third party is prevented, security of the stored data is effectively improved, when the data is damaged, the damaged data can be repaired according to the data in the mirror image pool, and security and stability of the database are improved.
Specifically, a key matrix P (P1, P2, P3, P4, P5 …, Pn) is further arranged in the central processing unit, wherein P1 represents a first key corresponding to a first byte length, P2 represents a second key corresponding to a second byte length, P3 represents a third key corresponding to a third byte length, P4 represents a fourth key corresponding to a fourth byte length, P5 represents a fifth key corresponding to a fifth byte length, Pn represents an nth key corresponding to an nth byte length;
in the mirror pool, if the byte length of the data to be stored belongs to the first byte length, the first key is selected from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data;
and if the byte length of the data to be stored belongs to the nth byte length, the nth key is selected from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data.
Specifically, in the embodiment of the invention, different keys are used for encrypting data with different byte lengths, so that the safety of the data is improved.
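Key selection by byte length can be sketched as a lookup over byte-length classes. The patent does not define the boundaries of the "first byte length" to the "nth byte length", so the inclusive upper bounds below are assumptions, and the key values are placeholder labels rather than real key material. No encryption is performed in this sketch.

```python
import bisect

# Hypothetical inclusive upper bounds of the byte-length classes, in ascending order.
UPPER_BOUNDS = [64, 256, 1024]
# Placeholder entries standing in for the key matrix P (P1, P2, P3).
KEY_MATRIX = [b"P1-placeholder", b"P2-placeholder", b"P3-placeholder"]


def select_key(byte_length, upper_bounds=UPPER_BOUNDS, keys=KEY_MATRIX):
    """Return the key whose byte-length class contains byte_length."""
    if len(upper_bounds) != len(keys):
        raise ValueError("one key is required per byte-length class")
    # Index of the first class whose upper bound is >= byte_length.
    index = bisect.bisect_left(upper_bounds, byte_length)
    if index == len(keys):
        raise ValueError("byte length exceeds the largest configured class")
    return keys[index]


print(select_key(64))    # falls in the first class
print(select_key(65))    # falls in the second class
print(select_key(1024))  # falls in the third class
```

A sorted list of bounds with `bisect` keeps the lookup logarithmic in the number of classes, which matters little for a handful of keys but scales to a long key matrix.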
Specifically, a key update coefficient S (S1, S2, S3) is further provided in the central processor, where S1 denotes a first update coefficient, S2 denotes a second update coefficient, and S3 denotes a third update coefficient;
when 1/3 of the total amount of data in the database has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the first update coefficient S1;
if 1/4 of the total amount of data has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the second update coefficient S2;
if 1/5 of the total amount of data has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the third update coefficient S3.
In the embodiment of the invention, the key matrix is updated according to the amount of data screened out of the database, and a dynamic key is used for storage, which makes data storage more secure. Because the key matrix is updated according to the actual data volume of the database, data screening and data storage are linked. If more data has been screened out, less data remains stored in the database and the database holds little useful data, so a simpler key can be used for encryption. If little data has been screened out, more data is stored in the database, so the key of the database needs to be upgraded and a more complex key is used to protect the data effectively, which improves data security and reduces the risk involved in repairing data.
Specifically, the first update coefficient S1= R1/R2;
the second update coefficient S2= R2/R3;
the third update coefficient S3= R3/R4.
Specifically, the update coefficients in the embodiment of the present invention are expressed in terms of the correlation degrees of the data, each coefficient being the quotient of two adjacent correlation degrees; the higher the correlation of the data, the smaller the resulting quotient. The size of the update coefficient determines how much the key is changed: if the coefficient is small, the key changes little, and if the coefficient is large, the key changes a lot. In this way the screened data is stored in the mirror pool more conveniently and the encryption adopted is more effective, which improves the security of data storage. Those skilled in the art will appreciate that the location where the data is stored may be a mirror pool or another storage structure.
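Choosing an update coefficient from the screened fraction can be sketched as below. The description names the fractions 1/3, 1/4 and 1/5 without saying whether they are exact values or thresholds, so this sketch assumes "at least" thresholds checked from largest to smallest. The function names and correlation degrees are hypothetical, and how a coefficient is then applied to the key matrix is not specified in the source, so it is not modelled.

```python
from fractions import Fraction


def update_coefficients(d):
    """Return (S1, S2, S3) = (R1/R2, R2/R3, R3/R4)."""
    r1, r2, r3, r4 = d
    return (r1 / r2, r2 / r3, r3 / r4)


def pick_update_coefficient(screened, total, d):
    """Pick S1, S2 or S3 from the screened fraction, or None below 1/5."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = Fraction(screened, total)  # exact arithmetic at the boundaries
    s1, s2, s3 = update_coefficients(d)
    if fraction >= Fraction(1, 3):
        return s1
    if fraction >= Fraction(1, 4):
        return s2
    if fraction >= Fraction(1, 5):
        return s3
    return None  # assumed: no key update below one fifth


D = (24.0, 12.0, 4.0, 1.0)  # hypothetical R1 > R2 > R3 > R4
print(pick_update_coefficient(40, 100, D))
print(pick_update_coefficient(25, 100, D))
print(pick_update_coefficient(20, 100, D))
print(pick_update_coefficient(10, 100, D))
```

Using `Fraction` for the screened share avoids floating-point surprises exactly at the 1/3, 1/4 and 1/5 boundaries.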
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A data storage method based on data screening is characterized by comprising the following steps:
acquiring data to be stored, wherein the data to be stored comprises a plurality of hierarchical data and data information under each hierarchical data;
a hierarchy matrix G (G1, G2, G3, G4, G5) is further provided within the central processor, wherein G1 represents a first hierarchy, G2 represents a second hierarchy, G3 represents a third hierarchy, G4 represents a fourth hierarchy, and G5 represents a fifth hierarchy, wherein the data of the first hierarchy is larger than the data of the second hierarchy, the data of the second hierarchy is larger than the data of the third hierarchy, the data of the third hierarchy is larger than the data of the fourth hierarchy, and the data of the fourth hierarchy is larger than the data of the fifth hierarchy;
the central processing unit is also internally provided with a first coefficient K1, a second coefficient K2, a third coefficient K3, a fourth coefficient K4 and a fifth coefficient K5, wherein K1 is smaller than K2, K2 is smaller than K3, K3 is smaller than K4, and K4 is smaller than K5;
when data similarity comparison is carried out, if the similarity of two data of the first level is greater than or equal to a first coefficient K1, the similarity of the two data of the first level is high, and the data is judged to be repeated data;
if the data is in the first level, selecting a first coefficient K1 from a data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, and if the similarity of the two data in the first level is smaller than a first coefficient K1, indicating that the two data are different;
when a second level of data is compared, selecting a second coefficient K2 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the second level is greater than or equal to a second coefficient K2, indicating that the similarity of the two data of the second level is high, determining that the data are repeated, and performing data screening on the second level and the data contained in the second level;
if the similarity of the two data of the second hierarchy is smaller than a second coefficient K2, the two data are different;
when a third level of data is compared, selecting a third coefficient K3 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the third level is more than or equal to the third coefficient K3, indicating that the similarity of the two data of the third level is higher, determining that the data are repeated, and then performing data screening on the third level and the data contained in the third level;
if the similarity of the two data at the third level is less than a third coefficient K3, the two data are different;
when a fourth level of data is compared, selecting a fourth coefficient K4 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the fourth level is greater than or equal to a fourth coefficient K4, indicating that the similarity of the two data of the fourth level is high, determining that the data are repeated, and performing data screening on the fourth level and the data contained in the fourth level;
if the similarity of the two data of the fourth level is smaller than a fourth coefficient K4, the two data are different;
when the fifth level of the data is compared, selecting a fifth coefficient K5 from the data similarity coefficient matrix K (K1, K2, K3, K4 and K5) for comparison, if the similarity of two data of the fifth level is more than or equal to a fifth coefficient K5, indicating that the similarity of the two data of the fifth level is higher, determining that the data are repeated, and performing data screening on the fifth level and the data contained in the fifth level;
if the similarity of the two data of the fifth level is smaller than a fifth coefficient K5, the two data are different;
storing the screened hierarchical data and corresponding data information under the hierarchical data into a mirror pool as off-site backup data of the current hierarchical data;
before any data is subjected to screening comparison, data to be compared with the data is determined, and a data correlation matrix D (R1, R2, R3 and R4) is further arranged in the central processing unit, wherein R1 represents a first correlation, R2 represents a second correlation, R3 represents a third correlation, R4 represents a fourth correlation, R1 is larger than R2, R2 is larger than R3, and R3 is larger than R4;
in the current database, determining the relevance R of any data in other data except the current data and the current data;
setting a data similarity comprehensive evaluation coefficient K00,
K00=K1/K10+K2/K20+K3/K30+K4/K40+K5/K50+R/(R1+R2+R3+R4),
wherein K10 represents a standard level coefficient corresponding to a first level, K20 represents a standard level coefficient corresponding to a second level, K30 represents a standard level coefficient corresponding to a third level, K40 represents a standard level coefficient corresponding to a fourth level, and K50 represents a standard level coefficient corresponding to a fifth level;
storing the screened hierarchical data and the corresponding data information under the hierarchical data into a mirror pool as the off-site backup data of the current hierarchical data comprises the following steps:
setting a key for the off-site backup data, wherein the key is generated according to the byte length of the screened data;
a key matrix P (P1, P2, P3, P4, P5 … and Pn) is further arranged in the central processing unit, wherein P1 represents a first key corresponding to a first byte length, P2 represents a second key corresponding to a second byte length, P3 represents a third key corresponding to a third byte length, P4 represents a fourth key corresponding to a fourth byte length, P5 represents a fifth key corresponding to a fifth byte length, and Pn represents an nth key corresponding to an nth byte length;
in a mirror pool, if the byte length of the off-site backup data belongs to a first byte length, a first key is selected from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data;
and if the byte length of the off-site backup data belongs to the nth byte length, selecting the nth key from the key matrix P (P1, P2, P3, P4, P5 …, Pn) to encrypt the off-site backup data.
2. The data storage method based on data screening of claim 1, wherein
if the correlation degree R with the current data is greater than or equal to the first correlation degree R1, the priority of similarity comparison with the current data is the highest and the data is first-priority data;
if the first correlation degree R1 > the correlation degree R with the current data ≥ the second correlation degree R2, the data is second-priority data;
if the second correlation degree R2 > the correlation degree R with the current data ≥ the third correlation degree R3, the data is third-priority data;
if the third correlation degree R3 > the correlation degree R with the current data ≥ the fourth correlation degree R4, the priority is the lowest and the data is fourth-priority data;
if the correlation degree R with the current data < the fourth correlation degree R4, a similarity comparison with the current data is not required.
3. The data storage method based on data screening as claimed in claim 2, wherein the correlation of two data is determined from their byte length and key information: if the byte lengths are different, the two data are not regarded as similar data; if the byte lengths are the same, the two data may be similar data, and it is determined whether the key information of the two data is the same; if the key information is also the same, the respective levels in the data structure need to be further compared to determine the correlation.
4. The data storage method based on data screening as claimed in claim 3, wherein data Gi at any level, i = 1, 2, 3, 4, 5, includes n data structures, each data structure takes a value Lj, and the similarity calculation formula for data Gi at any level is Si = Σ Lj × aj, where aj represents the weight coefficient corresponding to Lj, and j = 1, 2, 3, …, n.
5. The data storage method based on data screening of claim 2, wherein
a data similarity coefficient standard K0 is set in the central processing unit, and when the similarity coefficient of any two hierarchical data in the hierarchical data is greater than or equal to the data similarity coefficient standard K0, one of the hierarchical data and the corresponding data information under that hierarchical data are deleted;
the data similarity coefficient standard K0 is
K0 = (K1/K2 + K2/K3 + K3/K4 + K4/K5)/4 + 4R4/(R1 + R2 + R3 + R4).
6. The data storage method based on data screening of claim 4, wherein when n is 3, the weight coefficient corresponding to L1 is a1 = 0.75;
the weight coefficient corresponding to L2 is a2 = 0.15;
the weight coefficient corresponding to L3 is a3 = 0.1.
7. The data storage method based on data screening as claimed in claim 1, wherein a key update coefficient S (S1, S2, S3) is further provided in the central processor, wherein S1 represents a first update coefficient, S2 represents a second update coefficient, and S3 represents a third update coefficient;
when 1/3 of the total amount of data in the database has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the first update coefficient S1;
if 1/4 of the total amount of data has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the second update coefficient S2;
if 1/5 of the total amount of data has been screened out, the key matrix P (P1, P2, P3, P4, P5 …, Pn) is updated using the third update coefficient S3.
8. The data storage method based on data screening of claim 7, wherein the first update coefficient S1 = R1/R2;
the second update coefficient S2= R2/R3;
the third update coefficient S3= R3/R4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110189565.9A CN112559257B (en) 2021-02-19 2021-02-19 Data storage method based on data screening

Publications (2)

Publication Number Publication Date
CN112559257A CN112559257A (en) 2021-03-26
CN112559257B true CN112559257B (en) 2021-07-13

Family

ID=75036007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110189565.9A Active CN112559257B (en) 2021-02-19 2021-02-19 Data storage method based on data screening

Country Status (1)

Country Link
CN (1) CN112559257B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254975B (en) * 2021-06-15 2021-09-28 湖南三湘银行股份有限公司 Digital financial data sharing method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6847897B1 (en) * 1998-12-23 2005-01-25 Rosetta Inpharmatics Llc Method and system for analyzing biological response signal data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7392356B1 (en) * 2005-09-06 2008-06-24 Symantec Corporation Promotion or demotion of backup data in a storage hierarchy based on significance and redundancy of the backup data
CN104899340B (en) * 2015-07-08 2018-01-23 哈尔滨工程大学船舶装备科技有限公司 A kind of IETM technical information fragment retrieval device and its search method based on fragment of most compacting
WO2020010503A1 (en) * 2018-07-10 2020-01-16 深圳花儿数据技术有限公司 Multi-layer consistent hashing-based distributed data storage method and system
CN111240652A (en) * 2018-11-28 2020-06-05 北京京东尚科信息技术有限公司 Data processing method and device, computer storage medium and electronic equipment
CN110471948B (en) * 2019-07-10 2021-01-15 北京交通大学 Intelligent customs clearance commodity classification method based on historical data mining
CN111949629B (en) * 2020-07-22 2024-03-22 金钱猫科技股份有限公司 File storage method and terminal oriented to edge cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant