CN112732726B - Data processing method and device, processor and computer storage medium - Google Patents

Data processing method and device, processor and computer storage medium

Publication number
CN112732726B
Authority
CN
China
Prior art keywords
data
value
values
preset
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110359130.4A
Other languages
Chinese (zh)
Other versions
CN112732726A (en)
Inventor
黄岩
熊军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunhe Enmo Beijing Information Technology Co ltd
Original Assignee
Yunhe Enmo Beijing Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunhe Enmo Beijing Information Technology Co ltd filed Critical Yunhe Enmo Beijing Information Technology Co ltd
Priority to CN202110359130.4A priority Critical patent/CN112732726B/en
Publication of CN112732726A publication Critical patent/CN112732726A/en
Application granted granted Critical
Publication of CN112732726B publication Critical patent/CN112732726B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method and device, a processor and a computer storage medium. The method comprises: determining values of multiple dimensions of data according to preset rules and attributes of the data; tagging the data according to the values of the multiple dimensions; and processing the data according to the multi-dimensional values with which the data is tagged. The invention solves the technical problem in the related art that data is processed only according to time or according to instructions actively issued by a user, which results in low data processing efficiency and poor data management.

Description

Data processing method and device, processor and computer storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a data processing method and apparatus, a processor, and a computer storage medium.
Background
When a user uses a mobile phone, a personal computer or other computer systems with information storage functions, some files (or other data) are deleted regularly. In many cases, the user deletes a file (or data) only to free up more storage space for storing new data, and if there is a lot of free storage space in the system, the user does not actively delete the file or data. The deleted files or data are only data of relatively low value, and are not completely worthless. Therefore, it is often the case that the user deletes the file or data by mistake, and it is very difficult to recover the deleted file or data.
File systems, disk arrays, databases and similar systems all provide a delete command that a user can use to free up disk space for new data.
To reduce the management burden, users usually do not execute delete commands very frequently and tend to execute them in batches. That is, a large amount of data is deleted at once and a large amount of storage space is released, so that for a long time afterwards the user does not need to identify and select further files that can be deleted.
Although batch deletion reduces the administrative burden on the user, it leaves a large amount of storage space unused. A better scheme is to keep the utilization of the storage space as high as possible and keep data in the system for as long as possible, deleting old data only when new data arrives and storage space must be allocated. In this way the system can still receive and store new data on time, and the chance that old data can be recovered is increased.
In the prior art, the "recycle bin" of the Windows system alleviates this problem but cannot completely solve it. The Windows "recycle bin" allows a user to park unwanted files in the recycle bin until storage runs short, and then actually delete the data by "emptying the recycle bin". The "recycle bin" solution has two problems:
1. The user needs to operate twice: the first operation moves the data into the recycle bin and the second empties it. The number of operations increases, and so does the management burden compared with direct deletion.
2. The recycle bin increases the chance that data can be recovered, but the result is far from optimal. Once storage space is insufficient, the system prompts the user to empty the recycle bin to release space, and the user still prefers to clear all the files in it at once rather than release only a portion. As a result, utilization of the storage space drops substantially, and the freed space is not used effectively to retain low-value data.
In addition, most storage systems in computer systems today are hierarchical. The memory closest to the computing components is the CPU's internal registers and cache, followed by the dynamic random access memory (DRAM) directly connected to the CPU, and then external storage devices (hard disk drives, HDD, or solid state drives, SSD). The closer a memory is to the CPU's computation units, the higher its performance, the higher its price, and the smaller its configured capacity; the farther it is from the CPU, the lower its performance, the lower its price, and the larger its configured capacity. Generally, a memory close to the CPU is used as a cache for a memory farther away. When the required data cannot be found in the cache inside the CPU, it is looked up in DRAM; when it cannot be found in DRAM, it is looked up in an external storage device (SSD or HDD). After the desired data is found in the external storage device, the CPU loads it into the inner memory levels, such as the cache and DRAM mentioned above, while evicting from those levels the data that has not been used recently. This is the LRU (least recently used) algorithm, the most commonly used algorithm for managing a cache.
The LRU algorithm is effective in most cases because computer programs always tend to use data that has been used recently, that is, the access pattern of the computer program to the memory has temporal locality.
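The eviction behaviour described above can be illustrated with a minimal sketch. The class name `LRUCache` and its methods are illustrative assumptions and do not come from the patent; the sketch only shows the general least-recently-used policy.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest entry last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # a hit makes the entry most recent
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.put("c", 3)   # capacity exceeded, so "b" is evicted
```

Because recently touched entries survive, the policy pays off only when recent use predicts future use, which is the temporal locality discussed above.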
However, the LRU algorithm is not effective for all data in all cases. For example, video data such as movies or TV dramas stored on a personal mobile phone or personal computer is not more likely to be played again soon after it has been played once. That is, the access pattern to such data has no temporal locality, and the LRU algorithm is therefore not effective for it. If the LRU algorithm is used to cache video data, other data that is more likely to be reused may be evicted from the cache, and the mobile phone or computer does not achieve optimal performance.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
Embodiments of the present invention provide a data processing method and apparatus, a processor, and a computer storage medium, so as to at least solve the technical problems of low data processing efficiency and poor data management effect caused by processing only according to time or according to an instruction actively issued by a user when data is processed by a data processing method in the related art.
According to an aspect of an embodiment of the present invention, there is provided a data processing method including: determining values of multiple dimensions of the data according to preset rules and attributes of the data; tagging the data according to values of multiple dimensions of the data; and performing different processing on the data according to the plurality of dimensional values marked by the data.
Optionally, the attribute includes a data format, a data access mode, a data storage mode, a data writing mode, data reliability, and a data security level; the values of the multiple dimensions include a storage value, a write value, a read value and a security value.
Optionally, determining the values of the multiple dimensions of the data according to the preset rule and the attribute of the data includes: determining that the storage value of the data is a high storage value when the preset rule is that the data requires reliable storage, the attribute of the data is its quantity, and the quantity is smaller than a preset value; determining that the security value of the data is a high security value when the preset rule is that the data has a confidentiality requirement, the attribute of the data is a security level, and the security level exceeds a preset level; determining that the read value of the data is a high read value when the preset rule is that the data has a read-performance requirement and the attribute of the data is that it is written once and read many times; and determining that the write value of the data is a high write value when the preset rule is that the data has a write-performance requirement and the attribute of the data is a high write frequency within a preset time period.
Optionally, performing different processing on the data according to the multiple dimensional values marked by the data includes: classifying the data according to a plurality of dimensional values marked by the data; and respectively carrying out different processing on different types of data according to the types of the data.
Optionally, classifying the data according to the multi-dimensional values with which the data is tagged includes: classifying the data according to the number of high values among those values. The data types include junk data and platinum data: junk data is data whose value is low in every dimension, and platinum data is data for which the number of high values among the multiple dimensions exceeds a preset number.
Optionally, performing different processing on different types of data according to the type of the data includes: when the type of the data is non-junk data, processing the data according to the high values among its multi-dimensional values; when the high value is a high storage value, storing the data in a multi-copy mode; when the high value is a high security value, encrypting the data with a preset high-strength encryption algorithm; when the high value is a high read value, caching the data in advance before it is read; and when the high value is a high write value, caching the data and writing it gradually.
Optionally, performing different processing on different types of data according to the type of the data further includes: when the type of the data is junk data, automatically deleting the data in order of its generation time.
According to another aspect of the embodiments of the present invention, there is also provided a data processing apparatus, including: the determining module is used for determining the values of multiple dimensions of the data according to preset rules and the attributes of the data; a tagging module to tag the data according to values of a plurality of dimensions of the data; and the processing module is used for processing the data according to the plurality of dimensional values marked by the data.
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes the data processing method described in any one of the above.
According to another aspect of the embodiments of the present invention, there is also provided a computer storage medium, where the computer storage medium includes a stored program, and when the program runs, the apparatus where the computer storage medium is located is controlled to execute any one of the above data processing methods.
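In the embodiment of the invention, the values of multiple dimensions of the data are determined according to the preset rules and the attributes of the data; the data is tagged according to the values of the multiple dimensions; and the data is processed differently according to the multi-dimensional values with which it is tagged. Processing data according to its multi-dimensional values achieves the technical effects of improving data processing efficiency and optimizing data management, and thereby solves the technical problem in the related art that data is processed only according to time or according to an instruction actively issued by a user, which results in low data processing efficiency and poor data management.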
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of data processing according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a data value based storage system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an example of a data value based process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, a method embodiment of a data processing method is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one presented here.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102: determining values of multiple dimensions of data according to preset rules and attributes of the data;
Step S104: tagging the data according to the values of the multiple dimensions of the data;
Step S106: performing different processing on the data according to the multi-dimensional values with which the data is tagged.
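Through the above steps, the values of multiple dimensions of the data are determined according to the preset rules and the attributes of the data; the data is tagged according to the values of the multiple dimensions; and the data is processed differently according to the multi-dimensional values with which it is tagged. Processing data according to its multi-dimensional values achieves the technical effects of improving data processing efficiency and optimizing data management, and thereby solves the technical problem in the related art that data is processed only according to time or according to an instruction actively issued by a user, which results in low data processing efficiency and poor data management.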
The preset rule can be preset by a user and is used for judging the values of multiple dimensions of the data according to the attributes of the data, and the attributes of the data can be a data format, a data access mode, a data storage mode, a data writing mode, data reliability and a data security level.
For example, to mark MySQL binlog files as having a high write-performance value, a rule such as "files whose name matches /var/lib/MySQL/binlog are marked as having a high write-performance value" can be set. When a new file is created with the name /var/lib/MySQL/binlog.000678, the file matches the rule and can therefore be marked as having a high write-performance value. The rules may be preset by the user or extracted automatically from the data by algorithms such as AI and machine learning.
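A minimal sketch of such file-name rule matching follows. The rule table layout, the glob syntax (`binlog.*`) and the function name `tags_for_path` are assumptions made for illustration; the patent does not specify how rules are encoded.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table: (glob pattern, value dimension, level).
# The pattern syntax is an assumption; the source only gives the path prefix.
RULES = [
    ("/var/lib/MySQL/binlog.*", "write", "high"),
]


def tags_for_path(path, rules=RULES):
    """Return {dimension: level} for every rule whose pattern matches the path."""
    tags = {}
    for pattern, dimension, level in rules:
        if fnmatchcase(path, pattern):
            tags[dimension] = level
    return tags


new_file_tags = tags_for_path("/var/lib/MySQL/binlog.000678")
```

Here the newly created binlog file matches the single rule and is tagged with a high write value, while a path that matches no rule receives no tags.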
The values of multiple dimensions of the data are determined from the preset rules and the attributes of the data. The values of the multiple dimensions include a storage value, a write value, a read value and a security value. Each value may be divided into a plurality of levels; in the present embodiment each value is divided into two levels, high and low, and the data is processed differently depending on whether a value is high or low.
For example, when the storage value is high, a more reliable storage mode is needed for the data, such as storing multiple backups; when the storage value is low, an ordinary storage mode can be adopted. When the write value is high, more system resources are allocated to writing the data to ensure that it is written efficiently. When the read value is high, a more reliable reading mode is needed, for example caching the data in advance before it is read, which improves the reading speed and reliability of the data. When the security value is high, the data needs to be protected in a more reliable way.
Therefore, the technical effects of improving the data processing efficiency and optimizing the data management effect are achieved, and the technical problems that the data processing efficiency is low and the data management effect is poor due to the fact that the data processing method in the related technology processes the data only according to time or processes the data according to an instruction actively sent by a user are solved.
Optionally, determining the values of multiple dimensions of the data according to the preset rule and the attribute of the data includes: determining that the storage value of the data is high when the preset rule is that the data requires reliable storage, the attribute of the data is its quantity, and the quantity is smaller than the preset value; determining that the security value of the data is high when the preset rule is that the data has a confidentiality requirement, the attribute of the data is a security level, and the security level exceeds the preset level; determining that the read value of the data is high when the preset rule is that the data has a read-performance requirement and the attribute of the data is that it is written once and read many times; and determining that the write value of the data is high when the preset rule is that the data has a write-performance requirement and the attribute of the data is a high write frequency within the preset time period.
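The four conditions above can be sketched as a single function. All field names and threshold defaults below are hypothetical: the patent speaks only of a "preset value", a "preset level" and a "high" write frequency, and reading "quantity" as the number of existing copies is an interpretation based on the examples in this description.

```python
from dataclasses import dataclass


@dataclass
class DataAttributes:
    copy_count: int             # "quantity", read here as number of existing copies
    security_level: int
    write_once_read_many: bool
    writes_in_window: int       # writes observed in the preset time period


@dataclass
class Rules:
    needs_reliable_storage: bool = True
    needs_confidentiality: bool = True
    needs_read_performance: bool = True
    needs_write_performance: bool = True
    copy_count_preset: int = 2        # hypothetical "preset value"
    security_level_preset: int = 3    # hypothetical "preset level"
    write_frequency_preset: int = 1000  # hypothetical cut-off for "high" frequency


def determine_values(attrs, rules):
    """Return True (high) or False (low) for each of the four value dimensions."""
    return {
        "storage": rules.needs_reliable_storage
        and attrs.copy_count < rules.copy_count_preset,
        "security": rules.needs_confidentiality
        and attrs.security_level > rules.security_level_preset,
        "read": rules.needs_read_performance and attrs.write_once_read_many,
        "write": rules.needs_write_performance
        and attrs.writes_in_window >= rules.write_frequency_preset,
    }


fresh_transaction = DataAttributes(
    copy_count=1, security_level=5, write_once_read_many=False, writes_in_window=5000
)
values = determine_values(fresh_transaction, Rules())
```

With these assumed thresholds, a freshly written transaction record with a single copy comes out high in storage, security and write value, and low in read value.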
When the preset rule is that the data requires reliable storage, the attribute of the data is its quantity, and the quantity is smaller than the preset value, the storage value of the data is determined to be high. For example, some data requires very high storage reliability, such as the data of a financial transaction database: at the moment a transaction has just occurred, only one copy of the data exists, and once it is lost the transaction details are lost, creating serious financial and legal risks. Other data does not require high reliability because many copies exist in different locations at the same time. For example, the source code of well-known open source software has many copies around the world, and losing one copy does not cause particularly serious adverse effects. As another example, data held on a CDN (content delivery network) node, which is used to accelerate the distribution of content on the network, does not originate at that node, so losing the data at one CDN node does not have a particularly serious impact. Data with a high reliability value can be stored with a higher-level data redundancy method, for example three copies. For data with a low reliability value, a method with a lower redundancy level is used, for example 12+2 erasure coding or fewer copies.
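To make the cost difference between the two redundancy methods concrete, the following sketch computes the raw capacity consumed per unit of user data. It assumes that "12+2" means 12 data fragments plus 2 parity fragments, which is the usual reading of that notation; the function names are illustrative.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data with full copies."""
    return float(copies)


def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data with k+m erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments


three_copies = replication_overhead(3)   # 3.0
ec_12_2 = erasure_overhead(12, 2)        # 14/12, roughly 1.17
```

Under this assumption, three copies consume about 2.6 times as much raw capacity as 12+2 erasure coding for the same user data, which is why the cheaper scheme suits data with a low reliability value.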
When the preset rule is that the data has a confidentiality requirement, the attribute of the data is a security level, and the security level exceeds the preset level, the security value of the data is determined to be high. For data with a high security value, a higher-strength encryption algorithm is used. For data with a low security value, a lower-strength but higher-performance encryption algorithm is used, or the data is not encrypted.
When the preset rule is that the data has a read-performance requirement and the attribute of the data is that it is written once and read many times, the read value of the data is determined to be high. For example, some hot data is not modified after being written once but needs to be read frequently, so its read-performance requirement is high. Video files of currently popular films and programs, for instance, are requested on demand at different start times and are therefore read many times over a period of time. The data must be read at the required rate for the video to play smoothly. Therefore, when playback starts, part of the data that has not yet been played can be read into memory (or a higher-performance storage medium) in advance, so that a read can be answered immediately once playback reaches that position and read performance is not affected.
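The read-ahead idea can be sketched as follows. The class `ReadAheadReader`, the chunk granularity and the window size are assumptions for illustration; the prefetch here runs synchronously for simplicity, whereas a real system would presumably fetch in the background.

```python
class ReadAheadReader:
    """Serves chunks in playback order and prefetches the next few chunks."""

    def __init__(self, fetch_chunk, total_chunks, window=2):
        self._fetch = fetch_chunk      # slow read from the backing store
        self._total = total_chunks
        self._window = window          # how many chunks to read ahead
        self._memory = {}              # stands in for the faster medium

    def _load(self, index):
        if index not in self._memory:
            self._memory[index] = self._fetch(index)

    def read(self, index):
        """Return (chunk, hit) where hit tells whether it was already prefetched."""
        if not 0 <= index < self._total:
            raise IndexError("chunk index out of range")
        hit = index in self._memory
        self._load(index)
        chunk = self._memory.pop(index)  # a played chunk is not kept
        # Read ahead so that the next positions are answered from memory.
        for nxt in range(index + 1, min(index + 1 + self._window, self._total)):
            self._load(nxt)
        return chunk, hit


chunks = [b"c0", b"c1", b"c2", b"c3"]
reader = ReadAheadReader(lambda i: chunks[i], total_chunks=len(chunks))
first = reader.read(0)   # cold read: not yet in memory
second = reader.read(1)  # already prefetched by the first read
```

Only the first read goes to the slow medium on demand; every later sequential read is served from the prefetched chunks.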
When the preset rule is that the data has a write-performance requirement and the attribute of the data is a high write frequency within the preset time period, the write value of the data is determined to be high. For example, much archived data is written once and rarely read or used afterwards. For this type of data, very high read performance has little value; if a special event requires a small amount of archived data to be retrieved, slightly lower read performance can be tolerated. However, normal business generates a large amount of new data to be archived every day, so write performance is important. If write performance is low, a large amount of archive data cannot be stored within the scheduled time. Many businesses impose an archive time window. For example, the archive window for bank transaction data runs from 12:00 at night to 8:00 the next morning, and archiving can no longer be performed outside this window: if archiving were performed during working hours, the large volume of reads on the transaction database would affect its normal transaction processing. For archived data that requires high write performance for a short period, the storage system can first cache the data on a high-performance storage medium, such as an NVMe SSD, and then, outside the archive time window, slowly move the data to a low-performance storage medium, such as a SATA HDD or tape, for permanent storage. Before being moved to low-performance storage, the data can also be compressed and then written to the low-performance medium, which takes full advantage of the low read-performance value of the data and further reduces its storage cost.
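A minimal sketch of this staging scheme is shown below. The two dictionaries stand in for the fast and slow media, the window bounds follow the bank example (midnight to 08:00), and the class and method names are hypothetical.

```python
import zlib
from datetime import time

ARCHIVE_WINDOW_START = time(0, 0)   # 12:00 at night, per the bank example
ARCHIVE_WINDOW_END = time(8, 0)     # 8:00 the next morning


def in_archive_window(now):
    return ARCHIVE_WINDOW_START <= now < ARCHIVE_WINDOW_END


class StagedArchive:
    def __init__(self):
        self.fast_tier = {}   # stands in for e.g. an NVMe SSD
        self.slow_tier = {}   # stands in for e.g. a SATA HDD or tape

    def write(self, name, payload):
        """Absorb archive writes on the fast tier."""
        self.fast_tier[name] = payload

    def migrate(self, now, batch=1):
        """Outside the window, move up to `batch` items, compressed, to the slow tier."""
        if in_archive_window(now):
            return 0
        moved = 0
        for name in list(self.fast_tier)[:batch]:
            self.slow_tier[name] = zlib.compress(self.fast_tier.pop(name))
            moved += 1
        return moved

    def read(self, name):
        """Rare reads pay the decompression cost if the item was migrated."""
        if name in self.fast_tier:
            return self.fast_tier[name]
        return zlib.decompress(self.slow_tier[name])


archive = StagedArchive()
archive.write("day-001", b"x" * 1000)
moved_at_night = archive.migrate(time(3, 0))   # inside the window: nothing moves
moved_by_day = archive.migrate(time(9, 0))     # outside the window: item moves
```

Moving in small batches models the "slow" migration, and compression trades read latency, which has low value for this data, for lower storage cost.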
Optionally, the performing different processing on the data according to the multiple dimensional values marked by the data includes: classifying the data according to the plurality of dimensional values marked by the data; and respectively processing different types of data according to the types of the data.
The data is classified according to the values of the multiple dimensions to obtain a plurality of classes. For example, according to the number of high values among the multi-dimensional values, the data can be classified as garbage data, wood data, iron data, copper data, silver data, gold data, platinum data, and so on. Different processing strategies are applied to different classes of data. The processing strategy for each class is preset, and during execution a user can reconfigure it as required.
For example, platinum data is stored on higher-performance storage media as far as possible, is allocated more cache space in internal memory, is kept in more copies, and is backed up more frequently. Wood data can be packaged and moved to cheap offline magnetic tape to release space when space is insufficient. Junk data is not displayed in folders; if space is insufficient, it can be discarded directly, in chronological order.
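The chronological discarding of junk data can be sketched as follows. The record layout (`name`, `category`, `created`, `size`) and the function name are hypothetical; the sketch discards the oldest junk items first and stops as soon as enough space has been released.

```python
def pick_junk_to_discard(items, bytes_needed):
    """Select junk items, oldest first, until at least bytes_needed is freed.

    Returns (names_to_delete, bytes_freed). Non-junk items are never selected.
    """
    junk = sorted(
        (item for item in items if item["category"] == "garbage"),
        key=lambda item: item["created"],
    )
    victims, freed = [], 0
    for item in junk:
        if freed >= bytes_needed:
            break
        victims.append(item["name"])
        freed += item["size"]
    return victims, freed


items = [
    {"name": "a", "category": "garbage", "created": 3, "size": 10},
    {"name": "b", "category": "garbage", "created": 1, "size": 10},
    {"name": "c", "category": "platinum", "created": 0, "size": 100},
    {"name": "d", "category": "garbage", "created": 2, "size": 10},
]
victims, freed = pick_junk_to_discard(items, bytes_needed=15)
```

Releasing only as much as is needed keeps the remaining junk data recoverable for as long as possible, in line with the goal stated in the Background.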
Optionally, classifying the data according to the multi-dimensional values with which the data is tagged includes: classifying the data according to the number of high values among those values. The data types include junk data and platinum data: junk data is data whose value is low in every dimension, and platinum data is data for which the number of high values among the multiple dimensions exceeds a preset number.
The data is classified according to the number of high values among the multi-dimensional values. For example: when the number of high values is zero, the data is determined to be garbage data; when the number of high values reaches a first number, which is not zero, the data is determined to be wood data; when it is between the first number and a second number, which is greater than the first number, the data is determined to be iron data; when it is between the second number and a third number, which is greater than the second number, the data is determined to be copper data; when it is between the third number and a fourth number, which is greater than the third number, the data is determined to be silver data; when it is between the fourth number and a fifth number, which is greater than the fourth number, the data is determined to be gold data; and when it is between the fifth number and a sixth number, which is greater than the fifth number and less than or equal to the total number of value dimensions, the data is determined to be platinum data.
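The threshold scheme can be sketched with a sorted list of bounds. The interpretation that each "number" is the inclusive upper bound of its tier, and the concrete bounds used in the example, are assumptions: the patent leaves the thresholds open, and with only four two-level dimensions fewer tiers would be reachable.

```python
from bisect import bisect_left

TIER_NAMES = ["wood", "iron", "copper", "silver", "gold", "platinum"]


def classify(high_count, upper_bounds, tier_names=TIER_NAMES):
    """Map the number of high values to a class name.

    upper_bounds[i] is the (assumed inclusive) upper bound of tier_names[i];
    the bounds must be strictly increasing and start at 1 or more.
    """
    if high_count < 0:
        raise ValueError("high_count must be non-negative")
    if len(upper_bounds) != len(tier_names):
        raise ValueError("one upper bound per tier is required")
    if upper_bounds[0] < 1 or any(
        a >= b for a, b in zip(upper_bounds, upper_bounds[1:])
    ):
        raise ValueError("upper bounds must be strictly increasing from 1")
    if high_count == 0:
        return "garbage"
    index = bisect_left(upper_bounds, high_count)
    # Counts above the last bound are treated as the top tier.
    return tier_names[min(index, len(tier_names) - 1)]


HYPOTHETICAL_BOUNDS = [1, 3, 5, 7, 9, 12]
labels = [classify(n, HYPOTHETICAL_BOUNDS) for n in (0, 1, 4, 12)]
```

Keeping the bounds as data rather than code matches the statement that the user can reconfigure the strategy as required.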
The class of the data can also be identified from the user's instructions: conventional commands are received from the user but are treated as value classification rules, that is, as value tags for the data. For example: 1. rm -f aaa -> mark aaa as "garbage data"; 2. cpio bbb; rm -f bbb -> mark bbb as "wood data"; 3. create snapshot ccc -> mark ccc as "copper data"; 4. back up ddd -> mark ddd as "silver data"; 5. op_xxx eee -> mark eee as "gold data"; 6. op_xxx fff -> mark fff as "platinum data"; 7. all other data is "iron data" and need not be specifically tagged.
Optionally, performing different processing on different types of data according to the type of the data includes: when the type of the data is non-junk data, performing different processing on the data according to the high values among its multi-dimensional values; when the high value is a high storage value, storing the data in a multi-copy mode; when the high value is a high security value, encrypting the data with a preset high-strength encryption algorithm; when the high value is a high read value, caching the data in advance before it is read; and when the high value is a high write value, caching the data and writing it gradually.
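The dispatch described above can be sketched as a lookup from high-value dimensions to processing actions. The action names are placeholders for whatever the storage system actually does, and the junk-data branch follows the chronological deletion described elsewhere in this description.

```python
# Hypothetical action names, one per value dimension.
ACTIONS = {
    "storage": "store_multiple_copies",
    "security": "encrypt_with_high_strength_algorithm",
    "read": "cache_before_read",
    "write": "cache_then_write_gradually",
}


def plan_processing(category, values):
    """Return the list of actions for one data item.

    category: class name of the data, e.g. "garbage" or "platinum".
    values:   {dimension: True if high, False if low}.
    """
    if category == "garbage":
        return ["delete_in_order_of_generation_time"]
    return [action for dim, action in ACTIONS.items() if values.get(dim)]


plan = plan_processing(
    "platinum",
    {"storage": True, "security": False, "read": True, "write": False},
)
```

Each high dimension contributes its own action independently, so an item that is high in several dimensions receives several kinds of treatment at once.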
When the high value of the multiple dimensional values is the high storage value, the data is stored in a multi-copy manner, the multiple dimensional values may include the high storage value, and for the data with the high storage value, a higher-level data redundancy method may be used for storing the data with the high storage value, for example: and (4) three copies. For data with low reliability value, a method with lower redundancy level, 12+2 erasure code, or copy, is adopted.
And under the condition that the high value in the multiple dimensional values is the high secret value, encrypting the data through a preset high-strength encryption algorithm, wherein the multiple dimensional values comprise the high secret value, and for the data with the high secret value, the data with the higher strength is encrypted. And for data with low security value, a low-strength but high-performance encryption algorithm is adopted, or the data is not encrypted.
When the high value among the multiple dimensional values is the high reading value, the data is cached in advance before being read, the multiple dimensional values can include the high reading value, and for the data with the high reading value, a part of data which is not read can be read into the memory (or the storage medium with higher performance) in advance, so that the data reading operation can be responded immediately once the position is read, and the reading performance is prevented from being influenced.
When the high value of the multiple dimensional values is the high write value, the data is cached and gradually written, the multiple dimensional values may include the high write value, and for the data requiring the high write value within the preset time, the storage system may cache the data on a high-performance storage medium, for example: NVMe SSD disk, then outside the preset time window, slowly move the data to the low performance storage medium for permanent storage, for example: a SATA HDD or a tape. Before moving to the low-performance storage directly, the data can be compressed, and then the low-performance medium is written, so that the characteristic of low data reading performance value can be fully utilized, and the storage cost of the data is further reduced.
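The four cases above amount to a lookup from each high-valued dimension to a handling policy. A minimal sketch, assuming the dimension names and the action strings below (both are illustrative, not names taken from the source):

```python
# Hypothetical mapping from a high-valued dimension to its handling policy.
POLICY_BY_HIGH_VALUE = {
    "storage": "store in multiple copies",
    "secret": "encrypt with the preset high-strength algorithm",
    "read": "cache ahead of reads",
    "write": "cache, then write gradually",
}


def plan_processing(values):
    """values maps dimension name -> 'high' or 'low'; returns the actions to apply."""
    return [
        action
        for dimension, action in POLICY_BY_HIGH_VALUE.items()
        if values.get(dimension) == "high"
    ]
```

Data with several high values simply collects several actions, and data with none collects an empty plan.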
Optionally, performing different processing on different types of data according to the type of the data further includes: when the type of the data is garbage data, automatically deleting the data in order of its generation time.
Garbage data is not displayed in folders; if space is insufficient, the data can be discarded in the chronological order in which it was marked. Thus, when storage space runs short, the space manager finds the earliest-marked file among the garbage data and directly reclaims the storage space it occupies; if space is still insufficient, it finds the next garbage data file and reclaims its space, and so on until the remaining space is enough to accommodate the new data.
If one day the user realizes that a file was deleted by mistake, the user can look through the list of files marked as "garbage data" and, if the space occupied by the file has not yet been reclaimed, choose to restore it.
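The reclaim-oldest-first behaviour and the restore path can be sketched with a small in-memory list. The class and method names are hypothetical, and sizes and timestamps are plain numbers supplied by the caller; a real space manager would work against on-disk metadata.

```python
from dataclasses import dataclass, field


@dataclass
class GarbageList:
    """Files tagged as garbage, kept until their space is actually needed."""

    # Entries are (marked_at, name, size), kept sorted so the oldest mark is first.
    entries: list = field(default_factory=list)

    def mark(self, name, size, marked_at):
        self.entries.append((marked_at, name, size))
        self.entries.sort()

    def restore(self, name):
        """Un-delete a file if its space has not been reclaimed yet."""
        for i, (_, entry_name, _) in enumerate(self.entries):
            if entry_name == name:
                del self.entries[i]
                return True
        return False

    def reclaim(self, free, needed):
        """Discard earliest-marked files until `needed` units of space are free."""
        discarded = []
        while free < needed and self.entries:
            _, name, size = self.entries.pop(0)
            free += size
            discarded.append(name)
        return free, discarded
```

Files that were marked most recently survive longest, which is what gives the user a window in which to restore them.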
It should be noted that the present application also provides an alternative implementation, and the details of the implementation are described below.
The embodiment aims to solve the problems that a data value support mechanism is lacked in the existing storage system, and data is not subjected to value classification and value identification, so that the data is difficult to distinguish and process according to different values of the data.
The present embodiment provides a data processing method that makes mass data easier to handle. For example: low-value garbage data can be kept on a persistent medium for longer; the cache management module can more accurately identify which data is worth caching; and the backup management module can more easily identify which data should be backed up more frequently and which data can have a longer backup interval.
The key points of the technical solution of this embodiment are as follows:
Data is classified according to its value, using either of two classification methods:
(a) the user identifies the value of the data and explicitly attaches a value classification label to the data through a command; (b) the user provides classification rules, the system automatically matches data against the rules, and a value classification label is attached to the data according to the matching result.
Both the data itself and its classification label are stored.
The classification labels are used to process data differently, so as to improve performance, save resources, prevent loss, prevent accidental deletion, and so on.
FIG. 2 is a schematic diagram of a data value based storage system according to an embodiment of the present invention. As shown in FIG. 2, the whole system is divided into two parts: 1. tagger & classifier; 2. data processor & data store. After passing through the "tagger & classifier", the data is classified into different value categories and labeled with different value labels, according to predefined rules or according to commands input by the user. The labels are stored in the data store along with the data and are fed to the processor along with the data.
The data processor adopts different strategies for the data according to its value label. The data store may also adopt different storage strategies according to the value label.
FIG. 3 is a diagram illustrating an example of data value based processing according to an embodiment of the present invention; FIG. 3 shows a Data Value file system (dvfs). For compatibility with existing file systems, dvfs accepts the conventional file management commands entered by the user, but these commands have the side effect of tagging the data with value labels; alternatively, the commands can be redesigned while keeping the same look and feel.
The following describes how a data value tag flows through the system, taking the file deletion command rm as an example. For the command rm -f aaa, dvfs converts it into the internal command "mark aaa as garbage data", and the storage side has a processing policy corresponding to "garbage data": "these data are not displayed in the folder; if space is insufficient, these data can be discarded in the chronological order in which they were marked". Thus, when storage space runs short, the space manager of dvfs finds the earliest-marked file among the garbage data and directly reclaims the storage space it occupies; if space is still insufficient, it finds the next garbage data file and reclaims its space, and so on until the remaining space is enough to accommodate the new data.
If one day the user realizes that a file was deleted by mistake, the user can look through the list of files marked as "garbage data" and, if the space occupied by the file has not yet been reclaimed, choose to restore it.
In the dvfs deletion example above, compared with rm, which deletes data directly and releases the space, dvfs makes full use of the remaining space to keep files on the hard disk longer, so that the user has more opportunities to restore deleted files. Compared with the Windows Recycle Bin, (1) dvfs does not need a command to empty the recycle bin, and (2) deleted files on dvfs have a better chance of being restored, because dvfs discards files one by one in chronological order rather than emptying the recycle bin in bulk.
For files marked as "garbage data", there can be more processing strategies in addition to the two strategies "do not display" and "may discard", for example: do not let them occupy expensive SSD storage space, and migrate them to cheap HDD storage space; store them compressed; and so on. Each additional processing strategy adds little management burden, because the number of data value categories is small compared with the number of files.
dvfs greatly reduces the user's burden of managing data, increases the flexibility of data manipulation, and reduces the probability of accidental deletion. These benefits come from dvfs support for identifying and tagging data value: different processing strategies can be applied according to the data value, and these strategies are allowed to change.
The role of the data value tag is explained below using a video file as an example. The user may set the rule: "if the file name ends in mp4 and the file's access pattern is sequential reading, mark the data of this file as 'high prefetch value' and 'low cache value'". For a video file, if data is read from the low-performance hard disk into memory before the player asks for that block, the in-memory data can be returned directly when the player needs it. This greatly reduces read latency and reduces the number of stalls during playback. Once a piece of video data has been read once, it is immediately evicted from the memory cache, and the freed memory can be used to cache other data with high cache value, which raises the cache hit rate.
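The combination of "high prefetch value" and "low cache value" can be illustrated with a read-ahead cache that drops each block as soon as it has been served. This is a simplified sketch under stated assumptions: the class name, the `read_block` callback, and the fixed read-ahead window are hypothetical, and the prefetch runs synchronously here, whereas a real system would presumably fetch ahead in the background.

```python
class SequentialPrefetchCache:
    """Read-ahead cache for data tagged 'high prefetch value, low cache value'."""

    def __init__(self, read_block, total_blocks, window=2):
        self.read_block = read_block      # slow backing read: block index -> bytes
        self.total_blocks = total_blocks
        self.window = window              # how many blocks to read ahead
        self.cache = {}

    def read(self, index):
        # pop() serves a prefetched block and evicts it in the same step,
        # because a block that has been played once has low cache value.
        data = self.cache.pop(index, None)
        if data is None:
            data = self.read_block(index)  # not prefetched: go to the slow medium
        # Read the next blocks ahead of the player.
        last = min(index + 1 + self.window, self.total_blocks)
        for nxt in range(index + 1, last):
            if nxt not in self.cache:
                self.cache[nxt] = self.read_block(nxt)
        return data
```

During a sequential pass every block is fetched from the slow medium exactly once, and the cache never holds more than `window` blocks.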
From the above examples, it can be seen that the "value" of the present embodiment can have various dimensions, such as: the value of prefetching, caching, high-reliability storage, high-performance access, etc., are not just one dimension. The granularity of the distinction of the value size can be fine, and the system can adopt 64-bit integers to express the value of a certain dimension. The granularity of the distinction of the value sizes may also be very coarse, e.g., there may be only two value classifications, "high" and "low".
Adding support for a "data value" mechanism to a data storage system plays a role similar to the invention of money in primitive society. Before money existed, everyone had to produce most of the things they needed to live, because barter was inconvenient and a suitable exchange partner was hard to find. As a result, each household had to grow its own grain and cotton, spin and weave its own cloth, and even make its own pottery. After money was invented, people could concentrate on producing a few products using their own skills, or simply sell their labor time to a larger organization, keep the money they received, and use it to buy what they needed at any time. With money as an intermediary, the production efficiency of the whole society improved.
The "data value tags" in the data storage system, like money in society, decouple the "value classification rules" and the "value tagging actions" from the "data policies". This allows the user to manage the data more efficiently with less management cost.
Data value can be divided into the following dimensions:
The high-reliability storage value of data.
Some data must be stored with high reliability. For example, data in a financial transaction database exists in only one copy at the moment a transaction has just occurred; once that data is lost, the transaction details are lost, which creates serious financial and legal risk. Other data does not require high reliability because many copies exist in different locations at the same time. For example, the source code of some well-known open source software has many copies around the world, and losing one copy has no particularly serious consequence. As another example, data on a CDN (content delivery network) node is used to accelerate the distribution of content over the network; the CDN node is not the source of the data, so losing the data on one CDN node has no particularly serious impact. Data with a high reliability value can be stored using a higher-level data redundancy method, for example, three copies. For data with a low reliability value, a method with a lower redundancy level is adopted, such as 12+2 erasure coding or fewer copies.
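To make the cost difference between the two redundancy methods concrete, the raw storage consumed per byte of user data can be computed directly. The sketch assumes that "12+2" means 12 data fragments plus 2 parity fragments, which is the usual reading but is not spelled out in the text; the function names are illustrative.

```python
def replication_overhead(copies):
    """Raw bytes stored per byte of user data when keeping n full copies."""
    return float(copies)


def erasure_overhead(data_fragments, parity_fragments):
    """Raw bytes stored per byte of user data with k+m erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments


print(replication_overhead(3))             # 3.0
print(round(erasure_overhead(12, 2), 3))   # 1.167
```

Under that assumption, three copies consume about 2.6 times as much raw capacity as 12+2 erasure coding for the same user data.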
The security value of the data.
Data with a high security value is encrypted with a higher-strength encryption algorithm. Data with a low security value is encrypted with a lower-strength but higher-performance algorithm, or is not encrypted at all.
High performance reading value of data.
Some hot data is not modified after being written once, but is read often and requires relatively high read performance. For example, the video files of currently popular movies are requested on demand by viewers who start at different times, so they are read many times over a period of time. The data must be read at the required rate for the video to play smoothly. Therefore, when playback starts, a portion of the data that has not yet been played can be read into memory (or a higher-performance storage medium) in advance, so that the read can be served immediately once playback reaches that position and read performance is not affected.
High performance write value of data.
Much archived data is written once and rarely read afterwards. For this type of data, high read performance is of little use; if a special event requires a small amount of archived data to be retrieved, somewhat lower read performance can be tolerated. However, normal business generates a large amount of new data to be archived every day, so write performance is important. If write performance is low, a large amount of archive data cannot be stored within the scheduled time. Many businesses impose an archive time window; for example, the archive window for bank transaction data runs from 12:00 at night to 8:00 the next morning, and archiving cannot be performed outside this window. If archiving were performed during working hours, it would require heavy reading of the transaction database and would affect the database's normal transaction processing.
For such archived data, which requires high write performance for a short period, the storage system may first cache the data on a high-performance storage medium, such as an NVMe SSD, and then, outside the archive time window, slowly move the data to a low-performance storage medium for permanent storage, such as a SATA HDD or tape. Before being moved to low-performance storage, the data can be compressed and then written to the low-performance medium, which takes full advantage of the data's low read-performance value and further reduces storage cost.
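The staging decision depends only on whether the current time falls inside the archive window. A minimal sketch follows, with hypothetical function names and action strings; the default window of 00:00 to 08:00 follows the bank example, and the helper also handles windows that cross midnight.

```python
from datetime import time


def in_window(now, start, end):
    """True if `now` falls in [start, end); supports windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def choose_action(now, start=time(0, 0), end=time(8, 0)):
    """Inside the archive window absorb writes on fast media; outside it, drain them."""
    if in_window(now, start, end):
        return "write to fast cache"
    return "migrate to slow storage"
```

Compression before migration, as described above, would slot into the "migrate to slow storage" branch.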
Methods for value tagging data fall into two categories: first, value markers triggered by human manipulation. Second, auto-triggered value markers.
Value tagging triggered by human operation can be a command entered directly by a data administrator to "mark the value of a certain class of data". The value of data may also be inferred from certain rules and from commands entered by an administrator. For example, when a user issues a command to delete certain data, we infer that the storage value of that data is low; when a user issues a command to back up certain data, we infer that its reliability value is high.
An automatically triggered tagging operation may scan the data at a scheduled point in time and then tag its value, or may tag the value of data as it is written or read. The tagging is based on preset rules and on characteristics of the data itself. For example, to mark MySQL binlog files as having high write performance value, a rule such as "files whose names begin with /var/lib/MySQL/binlog have high write performance value" can be set; when a new file is created with the name /var/lib/MySQL/binlog.000678, it matches this rule and can be marked as "high write performance value". The rules may be preset by an administrator, or may be extracted automatically from the data by algorithms such as AI and machine learning.
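The binlog example can be expressed as a prefix rule applied when a file is created. The rule table layout, the dimension name `write_performance`, and the function name are assumptions made for illustration; only the path prefix and the example file name come from the text.

```python
# Hypothetical rule table: (file-name prefix, value dimension, level).
RULES = [
    ("/var/lib/MySQL/binlog", "write_performance", "high"),
]


def tags_for_new_file(path, rules=RULES):
    """Return the value tags of every prefix rule that matches a newly created file."""
    return {
        dimension: level
        for prefix, dimension, level in rules
        if path.startswith(prefix)
    }
```

A file that matches no rule gets an empty tag set and would fall back to whatever default handling the system applies.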
The data characteristics described above can cover many aspects. Characteristics of the data content itself, such as the information entropy of the data, the size of the data, and so forth. Extrinsic properties of the data, such as file name, path name, time of last access, and so on.
Access and operation patterns of the data, summarized from history, for example: the average retention time of files in a directory, the degree of sequential access for files with a given file-name characteristic, the reuse distance of data, the average read/write block size of a file, the read/write ratio, and so on.
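A few of the history-derived characteristics can be computed from a simple access log. The log format (a list of `(kind, offset, length)` tuples for one file) and the definition of sequentiality used here (an access starts where the previous one ended) are assumptions chosen for the sketch, not definitions from the source.

```python
def access_features(ops):
    """Summarize a history of (kind, offset, length) operations on one file.

    kind is 'r' or 'w'. An access counts as sequential when it starts
    exactly where the previous access ended.
    """
    reads = sum(1 for kind, _, _ in ops if kind == "r")
    sequential = sum(
        1
        for (_, offset, _), (_, prev_offset, prev_length) in zip(ops[1:], ops)
        if offset == prev_offset + prev_length
    )
    return {
        "read_ratio": reads / len(ops) if ops else 0.0,
        "sequentiality": sequential / (len(ops) - 1) if len(ops) > 1 else 0.0,
        "avg_block_size": sum(length for _, _, length in ops) / len(ops) if ops else 0.0,
    }
```

A rule such as the mp4 example could then test `sequentiality` against a threshold instead of relying on a hand-entered access pattern.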
The "data value-based data storage system" of the present embodiment has the following key points: data in the system can be classified by value according to data processing commands entered by users or according to rules configured by users; the value tag of the data is stored in the system together with the data itself; different strategies can be adopted to store and process the data according to its value; and the two steps of classifying data by value and of storing and processing data under different strategies are clearly separated, coupled only through the value tag of the data.
This embodiment protects a data storage system that has some of the following features: data in the system can be classified by value according to data processing commands entered by users or according to rules configured by users; the value tag of the data is stored in the system together with the data itself; different strategies can be adopted to store and process the data according to its value; and the two steps of classifying data by value and of storing and processing data under different strategies are clearly separated, coupled only through the value tag of the data. This embodiment also protects a data processing and storage method that has the following characteristics: data in the system can be classified by value according to data processing commands entered by users or according to rules configured by users; the value tag of the data is stored in the system together with the data itself; different strategies can be adopted to store and process the data according to its value; and the two steps of classifying data by value and of storing and processing data under different strategies are clearly separated, coupled only through the value tag of the data.
Most existing storage systems provide no mechanism for storing and processing data according to its value. They neither classify data by value nor store value tags for the data in the system. Existing storage systems expose direct data processing commands to users, such as delete, copy, and backup, which users use to manipulate data directly; this places a heavy administrative burden on the user and lacks flexibility.
Fig. 4 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention, and as shown in fig. 4, according to another aspect of the embodiment of the present invention, there is also provided a data processing apparatus including: a determination module 42, a marking module 44 and a processing module 46, which are described in detail below.
A determining module 42, configured to determine values of multiple dimensions of the data according to preset rules and attributes of the data; a marking module 44, connected to the determining module 42, for marking the data according to the values of the multiple dimensions of the data; and the processing module 46 is connected with the marking module 44 and is used for performing different processing on the data according to the plurality of dimensional values marked by the data.
With this apparatus, the values of multiple dimensions of the data are determined according to the preset rules and the attributes of the data; the data is tagged according to the values of the multiple dimensions; and the data is processed differently according to the tagged dimension values. This addresses the problem in the related art that data is processed only according to time or according to instructions actively issued by the user, which makes data processing inefficient and data management poor.
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes a data processing method of any one of the above.
According to another aspect of the embodiments of the present invention, there is also provided a computer storage medium, which includes a stored program, wherein when the program runs, an apparatus in which the computer storage medium is located is controlled to execute the data processing method of any one of the above.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (7)

1. A data processing method, comprising:
determining values of multiple dimensions of the data according to preset rules and attributes of the data;
tagging the data according to values of multiple dimensions of the data;
performing different processing on the data according to the plurality of dimensional values marked by the data;
the preset rules are used for judging the values of multiple dimensions of the data according to the attributes of the data, wherein the attributes comprise a data format, a data access mode, a data storage mode, a data writing mode, data reliability and a data security level; the values of the multiple dimensions comprise a storage value, a writing value, a reading value and a secret value;
according to the marked multiple dimension values of the data, the different processing of the data comprises the following steps: classifying the data according to a plurality of dimensional values marked by the data; respectively carrying out different processing on different types of data according to the types of the data;
classifying the data according to the plurality of dimensional values marked by the data comprises: classifying the data according to a high value quantity of the plurality of dimensional values of the data label; the data types comprise junk data and platinum data, the junk data are data with low values in multiple dimensions, and the platinum data are data with high values exceeding a preset number in the values of the multiple dimensions.
2. The method of claim 1, wherein determining values for a plurality of dimensions of the data according to preset rules and attributes of the data comprises:
determining the storage value of the data to be a high storage value when the preset rule is that the data has a reliability requirement, the attribute of the data is a quantity, and the quantity is smaller than a preset value;
when the preset rule indicates that the data has a secrecy requirement, the attribute of the data is a security level, and the security level exceeds a preset level, determining that the secrecy value of the data is a high secrecy value;
determining the read value of the data to be a high read value under the condition that the preset rule is that the data has the requirement of reading performance, the attribute of the data is write-once, and the data is read for multiple times;
and determining the write value of the data to be a high write value under the condition that the preset rule is the requirement that the data has write performance and the attribute of the data is high write frequency in a preset time period.
3. The method of claim 1, wherein performing different processing on different types of data according to the type of data comprises:
processing the data according to a high value of a plurality of dimensional values of the data when the type of the data is non-junk data;
under the condition that the high value in the multiple dimension values is a high storage value, storing the data in a multi-copy mode;
encrypting the data through a preset high-strength encryption algorithm under the condition that the high value in the multiple dimensional values is a high secret value;
under the condition that the high value in the dimension values is a high reading value, caching the data before reading;
and caching the data and gradually writing the data under the condition that the high value in the dimension values is the high writing value.
4. The method of claim 3, wherein different types of data are processed differently according to types of data, further comprising:
and under the condition that the type of the data is junk data, automatically deleting the data according to the sequence of the generation time of the data.
5. A data processing apparatus, comprising:
the determining module is used for determining the values of multiple dimensions of the data according to preset rules and the attributes of the data;
a tagging module to tag the data according to values of a plurality of dimensions of the data;
the processing module is used for carrying out different processing on the data according to the plurality of dimensional values marked by the data;
the preset rules are used for judging the values of multiple dimensions of the data according to the attributes of the data, wherein the attributes comprise a data format, a data access mode, a data storage mode, a data writing mode, data reliability and a data security level; the values of the multiple dimensions comprise a storage value, a writing value, a reading value and a secret value;
according to the marked multiple dimension values of the data, the different processing of the data comprises the following steps: classifying the data according to a plurality of dimensional values marked by the data; respectively carrying out different processing on different types of data according to the types of the data;
classifying the data according to the plurality of dimensional values marked by the data comprises: classifying the data according to a high value quantity of the plurality of dimensional values of the data label; the data types comprise junk data and platinum data, the junk data are data with low values in multiple dimensions, and the platinum data are data with high values exceeding a preset number in the values of the multiple dimensions.
6. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the data processing method according to any one of claims 1 to 4 when running.
7. A computer storage medium, comprising a stored program, wherein the program, when executed, controls an apparatus in which the computer storage medium is located to perform the data processing method of any one of claims 1 to 4.
CN202110359130.4A 2021-04-02 2021-04-02 Data processing method and device, processor and computer storage medium Active CN112732726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110359130.4A CN112732726B (en) 2021-04-02 2021-04-02 Data processing method and device, processor and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110359130.4A CN112732726B (en) 2021-04-02 2021-04-02 Data processing method and device, processor and computer storage medium

Publications (2)

Publication Number Publication Date
CN112732726A CN112732726A (en) 2021-04-30
CN112732726B true CN112732726B (en) 2022-04-29

Family

ID=75596329

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110359130.4A Active CN112732726B (en) 2021-04-02 2021-04-02 Data processing method and device, processor and computer storage medium

Country Status (1)

Country Link
CN (1) CN112732726B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107315968A (en) * 2017-06-29 2017-11-03 国信优易数据有限公司 A kind of data processing method and equipment
CN108764995A (en) * 2018-05-24 2018-11-06 国信优易数据有限公司 A kind of data value determines system and method
CN110727406A (en) * 2019-10-10 2020-01-24 深圳力维智联技术有限公司 Data storage scheduling method and device
WO2020231753A1 (en) * 2019-05-14 2020-11-19 Oracle International Corporation Efficient space management for high performance writable snapshots

Also Published As

Publication number Publication date
CN112732726A (en) 2021-04-30

Similar Documents

Publication Title
US8683228B2 (en) System and method for WORM data storage
US10360182B2 (en) Recovering data lost in data de-duplication system
US8010505B2 (en) Efficient backup data retrieval
CN100583096C (en) Methods for managing deletion of data
US8838530B2 (en) Method and system for directory management
CN109542358A (en) A kind of cold and hot data separation method of solid state hard disk, device and equipment
CN110647497A (en) HDFS-based high-performance file storage and management system
GB2459494A (en) A method of managing a cache
US20070061359A1 (en) Organizing managed content for efficient storage and management
US20040139127A1 (en) Backup system and method of generating a checkpoint for a database
CN109947363A (en) A kind of data cache method of distributed memory system
CN100507919C (en) FAT file system and its processing method
US7216207B1 (en) System and method for fast, secure removal of objects from disk storage
CN106874399B (en) Networking backup system and backup method
CN106155596A (en) Method for writing data and device
CN107168651A (en) A kind of small documents polymerize storage processing method
JP2003316774A (en) Document control system, document accumulation method and program executing the method
Marupudi Solid State Drive: New Challenge for Forensic Investigation
US20030074376A1 (en) File manager for storing several versions of a file
CN112732726B (en) Data processing method and device, processor and computer storage medium
TW477932B (en) Memory defragmentation in chipcards
CN103257928A (en) Method and system for data management of flash memory equipment
CN112597102B (en) High-efficiency mirror image file system implementation method
CN110262758B (en) Data storage management method, system and related equipment
CN107846327A (en) A kind of processing method and processing device of network management performance data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant