CN115437997A - Intelligent identification optimization system for data life cycle - Google Patents

Intelligent identification optimization system for data life cycle

Info

Publication number: CN115437997A
Application number: CN202210879571.1A
Authority: CN (China)
Prior art keywords: storage, data, analysis, strategy, disk
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 傅思雨, 甘云锋, 江敏, 高雁冰, 范图强
Current Assignee / Original Assignee: Hangzhou Dtwave Technology Co., Ltd.
Application filed by Hangzhou Dtwave Technology Co., Ltd.
Priority to CN202210879571.1A
Publication of CN115437997A

Classifications

    • G06F 16/122 — File system administration, e.g. details of archiving or snapshots, using management policies
    • G06F 16/1724 — Details of de-fragmentation performed by the file system
    • G06F 16/1727 — Details of free space management performed by the file system
    • G06F 16/182 — Distributed file systems

Abstract

The invention discloses an intelligent identification optimization system for the data life cycle, comprising a storage management module and a strategy management module. Storage management comprises an analysis module and a management module: the analysis module evaluates the system's storage health score by analyzing the number of small files and the cold-data capacity of the file system and the health of the storage nodes, while the management module assigns corresponding storage strategies according to the health score, realizes optimized storage through a migration tool, and gives a comprehensive view of storage and governance status through statistical charts. The strategy management module supports management of the hierarchical storage strategy, the analysis strategy, and the compression strategy: a user sets hierarchical storage and compression strategies on a directory to optimize file storage, and sets analysis strategies for small files and cold data to support data analysis. The system can report the health of every directory and even individual files and optimize their storage.

Description

Intelligent identification optimization system for data life cycle
Technical Field
The invention relates to the field of computer storage, in particular to an intelligent identification optimization system for a data life cycle.
Background
In enterprise big-data applications, the storage footprint of data in systems such as HDFS keeps growing, which lowers operating efficiency; at the same time, enterprises cannot get a comprehensive view of how all their data is used, making data optimization difficult.
As an enterprise big-data cluster is used for longer, more and more data is generated. This not only increases storage occupation and read-write latency, but also hinders cluster expansion and reduces operating efficiency. It is therefore important to understand the overall state of the data files, accurately locate the directories and files that need optimization, and govern the data accordingly.
Disclosure of Invention
The invention overcomes the defects of the prior art and provides an intelligent identification and optimization system for the data life cycle that can report the health of every directory and even individual files and optimize their storage.
The technical scheme of the invention is as follows:
an intelligent identification optimization system for a data life cycle comprises a storage management module and a policy management module;
the storage management comprises an analysis module and a management module, wherein the analysis module evaluates the system's storage health score by analyzing the number of small files and the cold-data capacity of the file system and the health of the storage nodes; the management module assigns corresponding storage strategies according to the health score, realizes optimized storage through a migration tool, and gives a comprehensive view of storage and governance status through statistical charts;
the strategy management module supports management of the hierarchical storage strategy, the analysis strategy, and the compression strategy; a user sets hierarchical storage and compression strategies on a directory to optimize file storage, and sets analysis strategies for small files and cold data to support data analysis;
the bottom layer of the overall technical framework comprises MySQL, Hive, and HDFS; the Hive Client connects to Hive, while WebHDFS and dfsadmin access HDFS, so as to obtain Hive and HDFS data, and MyBatis interacts with MySQL to store the data;
the method comprises the following specific steps:
101 Metadata acquisition step: acquiring HDFS metadata by adopting a fsimage analysis mode;
102 Metadata indexing step: analyzing the metadata file obtained in the step 101) to construct a multi-branch tree structure;
103) Data analysis step: counting the number and size of all files and of each data type under a directory, and performing total-quantity statistics, ranking analysis, and proportion analysis to obtain a storage health score;
104) Data strategy configuration step: comprising the hierarchical storage strategy, the analysis strategy, and the compression strategy; the hierarchical storage strategy, namely the heterogeneous storage strategy, places data on different storage media according to access heat, so that HDFS storage can flexibly and efficiently handle various application scenarios; the analysis strategy lets the user set the definition of a small file and a threshold on the number of small files, the definition of cold data and a threshold on total cold-data volume, a disk-capacity threshold, and the schedule on which the system runs its analysis; and the compression strategy configures erasure codes, so that all currently selectable erasure codes can be viewed, data migration is ensured, and migration logs are viewed and recorded.
Furthermore, the intermediate layer of the whole technical framework adopts Schedule to realize periodic scheduling, and a multi-branch tree is constructed to facilitate data analysis; the upper layer of the whole technical framework provides an external API call interface and a visual UI operation interface.
Further, the metadata includes: Path (directory path), Replication (number of replicas), ModificationTime (last modification time), AccessTime (last access time), PreferredBlockSize (preferred block size), BlocksCount (number of blocks), FileSize (file size), NSQUOTA (name quota), DSQUOTA (space quota), Permission (permissions), UserName (user), and GroupName (user group);
specifically, the fsimage is obtained, parsed into metadata in a specified format, and the result is output as an oiv dump file.
Further, the data analysis comprises small-file analysis, cold-data analysis, hot-data analysis, table analysis, corrupted-block analysis, and disk capacity analysis:
The small-file analysis counts the number and size of small files according to the strategy settings.
The cold-data analysis counts the number and size of cold data according to the strategy settings.
The hot-data analysis counts the number and size of hot data according to the strategy settings.
The table analysis counts the number and size of all small files belonging to tables in the database according to the strategy settings.
The corrupted-block analysis counts the number of corrupted file blocks.
The disk capacity analysis counts the total disk capacity and its usage.
Further, the scoring rules for the storage health score comprise a disk score, a small-file score, a cold-data score, and a file-block score;
the disk score totals 30 points; if the number of nodes is n, each node is worth 30/n points, and when the disk usage of w1 nodes exceeds the threshold, w1 × (30/n) points are deducted; assuming each node has m disks, each disk is worth 30/(n × m) points, and when the total disk storage does not exceed the threshold but w2 individual disks do, w2 × (30/(n × m)) points are deducted;
the small-file score totals 30 points; with the small-file count threshold set to t, 1 point is deducted when the small-file count x exceeds the threshold by 1-10%, another 1 point when it exceeds by 11-20%, and so on until the score is exhausted;
the cold-data score totals 30 points; if y GB of cold data are unprocessed, i.e. neither a hierarchical storage strategy nor an erasure-code strategy has been set for them, 1 point is deducted per 100 GB until the score is exhausted;
the file-block score totals 10 points; if z file blocks are corrupted, 1 point is deducted per 10 corrupted blocks (1-10 corrupted blocks deduct 1 point, 11-20 deduct another 1 point) until the score is exhausted;
therefore, the storage health score S is calculated as:
S = (30 - w1 × (30/n) - w2 × (30/(n × m))) + (30 - ceil((x - t)/(0.1 × t))) + (30 - ceil(y/100)) + (10 - ceil(z/10))
wherein each deduction cannot exceed its corresponding subtotal.
Further, HDFS supports a variety of common storage types, including:
ARCHIVE: a storage medium with high storage density but low power consumption, used for storing cold data;
DISK: disk media, the default storage medium for HDFS;
SSD: solid-state-disk storage media;
RAM_DISK: data is written into memory while a replica is asynchronously written to the storage medium.
Further, the hierarchical storage policies include PROVIDED, COLD, WARM, HOT, ONE_SSD, ALL_SSD, and LAZY_PERSIST;
PROVIDED is used for storage external to HDFS, and the storage medium is DISK;
COLD keeps all replicas on archival storage, and the storage medium is ARCHIVE;
WARM keeps one replica on DISK and the remaining replicas on archival storage, and the storage media are DISK and ARCHIVE;
HOT keeps all replicas on DISK and is the default storage policy, and the storage medium is DISK;
ONE_SSD keeps one replica on SSD and the remaining replicas on DISK, and the storage media are SSD and DISK;
ALL_SSD keeps all replicas on SSD, and the storage medium is SSD;
LAZY_PERSIST first writes one replica to RAM_DISK and then lazily persists it to DISK, and the storage media are RAM_DISK and DISK.
The invention has the advantages that:
the invention can not only obtain the whole health condition of the system, or the specific condition of a certain directory, but also know the health condition of each directory and even files. The invention can accurately know the distribution position of the specific small files and count the largest file directories of the small files. The invention can manage data and realize optimized storage through a migration tool. According to the data analysis strategy, cold and hot data can be intelligently analyzed and displayed in a statistical manner. The invention supports statistical analysis and treatment of Hive base tables.
Drawings
FIG. 1 is a diagram of the product architecture of the present invention;
FIG. 2 is a flow chart of the operation of the present invention;
FIG. 3 is a technical framework diagram of the present invention;
FIG. 4 is a technical flow chart of the present invention;
FIG. 5 is a metadata acquisition flow diagram of the present invention;
FIG. 6 is a metadata index layout of the present invention;
fig. 7 is a diagram of a stored health score structure of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention, and that elements not explicitly described in the present disclosure may be implemented using conventional techniques.
Possible terms are explained below:
HDFS is the Hadoop distributed file system. Hive is a Hadoop-based data-warehouse tool used for data extraction, transformation, and loading. A small file is a file significantly smaller than the HDFS block size (128 MB by default in Hadoop 2.x; 64 MB in earlier versions). Cold data is data that is rarely or never accessed but must still be retained long-term. Hot data is frequently accessed data. The fsimage is a file stored in HDFS that contains the information of all directories and files of the entire HDFS file system; it is loaded when HDFS starts. oiv (Offline Image Viewer) is the HDFS tool that dumps the fsimage into a readable format. Erasure coding (EC) is a data-protection technology: instead of keeping multiple replicas, it uses less storage to guarantee the same level of fault tolerance.
As shown in figs. 1 to 7, an intelligent recognition optimization system for the data life cycle includes a storage management module and a policy management module, which mainly support the storage, governance, analysis, and optimization of files (HDFS) and tables (Hive).
The storage management comprises an analysis module and a governance module. The analysis module evaluates the system's storage health score by analyzing the number of small files and the cold-data capacity of the file system and the health of the storage nodes. The governance module assigns corresponding storage strategies according to the health score, realizes optimized storage through a migration tool, and gives a comprehensive view of storage and governance status through statistical charts.
The strategy management module supports management of the hierarchical storage strategy, the analysis strategy, and the compression strategy. A user sets hierarchical storage and compression strategies on a directory to optimize file storage, and sets analysis strategies for small files and cold data to support data analysis.
Overall, a local cache is built by obtaining the fsimage; the client issues an analysis request, strategy configuration is issued according to the analysis result, and finally the issued strategy takes effect through data migration.
The bottom layer of the overall technical framework comprises MySQL, Hive, and HDFS; the Hive Client connects to Hive, while WebHDFS and dfsadmin access HDFS, so as to obtain Hive and HDFS data, and MyBatis interacts with MySQL to store the data. The middle layer uses Schedule to realize periodic scheduling and builds a multi-branch tree to facilitate data analysis. The upper layer provides an external API call interface and a visual UI operation interface.
The method comprises the following specific steps:
101) Metadata acquisition step: the HDFS metadata is obtained by parsing the fsimage. The metadata includes: Path (directory path), Replication (number of replicas), ModificationTime (last modification time), AccessTime (last access time), PreferredBlockSize (preferred block size, in bytes), BlocksCount (number of blocks), FileSize (file size, in bytes), NSQUOTA (name quota, limiting the number of files and directories allowed under a directory), DSQUOTA (space quota, limiting the number of bytes allowed under a directory), Permission (permissions), UserName (user), and GroupName (user group).
Specifically, the fsimage is obtained, parsed into metadata in a specified format, and the result is output as an oiv dump file.
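In practice such a dump can be produced with HDFS's Offline Image Viewer, e.g. `hdfs oiv -p Delimited -i fsimage -o dump.txt`, which emits one tab-separated row per file or directory. A minimal parsing sketch follows; the field order is an assumption for illustration and should be checked against the header row of a real dump:

```python
# Sketch: parse one row of a tab-separated fsimage dump produced by
# `hdfs oiv -p Delimited`.  The field order below is assumed for
# illustration; real dumps emit a header row that should be checked.
FIELDS = ["path", "replication", "modificationTime", "accessTime",
          "preferredBlockSize", "blocksCount", "fileSize",
          "nsQuota", "dsQuota", "permission", "userName", "groupName"]

def parse_row(line: str) -> dict:
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    # Numeric fields used later by the analysis step.
    for key in ("replication", "preferredBlockSize", "blocksCount", "fileSize"):
        record[key] = int(record[key])
    return record

row = ("/warehouse/db/t1/part-0\t3\t2022-07-25 10:00\t2022-07-25 11:00"
       "\t134217728\t1\t4096\t-1\t-1\trw-r--r--\thdfs\tsupergroup")
rec = parse_row(row)
```

Parsing the dump offline in this way avoids loading the NameNode with per-file RPC queries.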
102) Metadata indexing step: the metadata file obtained in step 101) is parsed to construct a multi-branch tree structure.
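The multi-branch tree can be sketched as a path trie in which every directory node accumulates the file count and total size of its subtree, so the directory-level statistics of step 103) become simple lookups. Class and function names here are illustrative, not from the patent:

```python
# Sketch of the multi-branch tree built from parsed metadata records:
# each directory node aggregates statistics over its whole subtree.
class Node:
    def __init__(self):
        self.children = {}     # directory name -> Node
        self.file_count = 0    # files anywhere in this subtree
        self.total_size = 0    # bytes anywhere in this subtree

def insert_file(root, path, size):
    node = root
    node.file_count += 1
    node.total_size += size
    for part in path.strip("/").split("/")[:-1]:   # directory components only
        node = node.children.setdefault(part, Node())
        node.file_count += 1
        node.total_size += size

root = Node()
insert_file(root, "/warehouse/db/t1/part-0", 4096)
insert_file(root, "/warehouse/db/t1/part-1", 1024)
insert_file(root, "/warehouse/db/t2/part-0", 2048)
t1 = root.children["warehouse"].children["db"].children["t1"]
```

Ranking "the directories with the most small files" then reduces to sorting nodes by their aggregated counters.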
103) Data analysis step: the number and size of all files and of each data type under a directory are counted, and total-quantity statistics, ranking analysis, and proportion analysis are performed to obtain the storage health score. The specific data analysis comprises small-file analysis, cold-data analysis, hot-data analysis, table analysis, corrupted-block analysis, and disk capacity analysis:
the small file analysis is used for counting the number and the scale of the small files according to the strategy setting.
The cold data analysis is used for counting the number and scale of the cold data according to the strategy setting.
The thermal data analysis is used for counting the number and scale of the thermal data according to the strategy setting.
And the table analysis is used for counting the number and the scale of all table small files in the database according to the strategy setting.
The corrupted block analysis is used to count the number of corrupted file blocks.
The disk memory analysis is used for counting the total amount and the use condition of the disk.
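The small-file and cold-data analyses can be sketched over the parsed metadata records. The thresholds below are illustrative stand-ins for the values a user would set in the analysis strategy, not values from the patent:

```python
# Sketch: classify records as small files / cold data using thresholds
# that would normally come from the analysis strategy (assumed here).
from datetime import datetime, timedelta

SMALL_FILE_BYTES = 16 * 1024 * 1024   # assumed "small file" definition
COLD_AFTER = timedelta(days=90)       # assumed "cold data" definition

def analyze(records, now):
    small = [r for r in records if r["fileSize"] < SMALL_FILE_BYTES]
    cold = [r for r in records if now - r["accessTime"] > COLD_AFTER]
    return {"small_count": len(small),
            "small_bytes": sum(r["fileSize"] for r in small),
            "cold_count": len(cold),
            "cold_bytes": sum(r["fileSize"] for r in cold)}

records = [
    {"fileSize": 4096, "accessTime": datetime(2022, 1, 1)},       # small and cold
    {"fileSize": 64 * 1024 * 1024, "accessTime": datetime(2022, 7, 20)},
]
stats = analyze(records, now=datetime(2022, 7, 25))
```

The resulting counts and sizes feed directly into the small-file and cold-data components of the health score below.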
The scoring rules for the storage health score include a disk score, a small-file score, a cold-data score, and a file-block score.
The disk score totals 30 points. If the number of nodes is n, each node is worth 30/n points; when the disk usage of w1 nodes exceeds the threshold, w1 × (30/n) points are deducted. Assuming each node has m disks, each disk is worth 30/(n × m) points; when the total disk storage does not exceed the threshold but w2 individual disks do, w2 × (30/(n × m)) points are deducted.
The small-file score totals 30 points. With the small-file count threshold set to t, 1 point is deducted when the small-file count x exceeds the threshold by 1-10%, another 1 point when it exceeds by 11-20%, and so on until the score is exhausted.
The cold-data score totals 30 points. If y GB of cold data are unprocessed, i.e. neither a hierarchical storage strategy nor an erasure-code strategy has been set for them, 1 point is deducted per 100 GB until the score is exhausted.
The file-block score totals 10 points. If z file blocks are corrupted, 1 point is deducted per 10 corrupted blocks (1-10 corrupted blocks deduct 1 point, 11-20 deduct another 1 point) until the score is exhausted.
The storage health score S is therefore calculated as:
S = (30 - w1 × (30/n) - w2 × (30/(n × m))) + (30 - ceil((x - t)/(0.1 × t))) + (30 - ceil(y/100)) + (10 - ceil(z/10))
where each deduction cannot exceed its corresponding subtotal.
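The scoring rules above can be written out directly; this is an illustrative sketch (the function and parameter names are not from the patent, and each deduction is capped at its subtotal as the rules require):

```python
import math

def health_score(n, m, w1, w2, x, t, y_gb, z):
    """Storage health score per the scoring rules above.

    n nodes with m disks each; w1 nodes / w2 single disks over their
    usage thresholds; x small files against threshold t (t > 0);
    y_gb GB of unprocessed cold data; z corrupted file blocks.
    Each deduction is capped at its subtotal (30/30/30/10).
    """
    disk = 30 - min(30, w1 * (30 / n) + w2 * (30 / (n * m)))
    over = max(x - t, 0)                      # small files beyond the threshold
    small = 30 - min(30, math.ceil(over / (0.1 * t)))   # 1 point per 10% overshoot
    cold = 30 - min(30, math.ceil(y_gb / 100))          # 1 point per 100 GB
    blocks = 10 - min(10, math.ceil(z / 10))            # 1 point per 10 bad blocks
    return disk + small + cold + blocks
```

A healthy cluster scores 100; for example, one node over its disk threshold out of ten, 5% too many small files, 150 GB of unprocessed cold data, and 5 corrupted blocks would score 27 + 29 + 28 + 9 = 93.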
104 Data policy configuration step: including tiered storage policies, analysis policies, and compression policies.
The hierarchical storage strategy, namely the heterogeneous storage strategy, places data on different storage media according to access heat, so that HDFS storage can flexibly and efficiently handle various application scenarios. Data tiering rests on HDFS's support for heterogeneous storage and on configuring heterogeneous storage policies.
HDFS supports a variety of common storage types, including:
ARCHIVE: a storage medium with high storage density but low power consumption, used for storing cold data.
DISK: disk media, the default storage medium for HDFS.
SSD: solid-state-disk storage media.
RAM_DISK: data is written into memory while a replica is asynchronously written to the storage medium.
Further, the hierarchical storage policies include PROVIDED, COLD, WARM, HOT, ONE_SSD, ALL_SSD, and LAZY_PERSIST.
PROVIDED is used for storage external to HDFS; the storage medium is DISK.
COLD keeps all replicas on archival storage; the storage medium is ARCHIVE.
WARM keeps one replica on DISK and the remaining replicas on archival storage; the storage media are DISK and ARCHIVE.
HOT keeps all replicas on DISK and is the default storage policy; the storage medium is DISK.
ONE_SSD keeps one replica on SSD and the remaining replicas on DISK; the storage media are SSD and DISK.
ALL_SSD keeps all replicas on SSD; the storage medium is SSD.
LAZY_PERSIST first writes one replica to RAM_DISK and then lazily persists it to DISK; the storage media are RAM_DISK and DISK.
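The replica placement implied by each policy, plus an illustrative chooser that maps access recency to a tier, can be sketched as follows. The mapping table follows the policy list above; the recency thresholds are examples, not values from the patent:

```python
# Replica placement implied by each HDFS storage policy (per the list
# above; for PROVIDED the pairing with DISK follows that description).
STORAGE_POLICY_MEDIA = {
    "PROVIDED":     ("PROVIDED", "DISK"),
    "COLD":         ("ARCHIVE",),
    "WARM":         ("DISK", "ARCHIVE"),
    "HOT":          ("DISK",),            # HDFS default
    "ONE_SSD":      ("SSD", "DISK"),
    "ALL_SSD":      ("SSD",),
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
}

def policy_for(days_since_access, warm_after=30, cold_after=90):
    """Pick a tier from access recency; thresholds are illustrative
    stand-ins for the user-configured analysis strategy."""
    if days_since_access >= cold_after:
        return "COLD"
    if days_since_access >= warm_after:
        return "WARM"
    return "HOT"
```

A chosen policy is then applied with the standard HDFS command, e.g. `hdfs storagepolicies -setStoragePolicy -path /warehouse/db/t1 -policy COLD`, after which a run of `hdfs mover` migrates existing blocks to the media the policy prescribes.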
The analysis strategy lets the user set the definition of a small file and a threshold on the number of small files, the definition of cold data and a threshold on total cold-data volume, a disk-capacity threshold, and the schedule on which the system runs its analysis.
The compression strategy configures erasure codes: all currently selectable erasure codes can be viewed, data migration is ensured, and migration logs are viewed and recorded. The available erasure codes include: RS-10-4-1024k, RS-3-2-1024k, RS-6-3-1024k, RS-LEGACY-6-3-1024k, and XOR-2-1-1024k.
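Each erasure-code name encodes the data-block count, parity-block count, and cell size — e.g. RS-6-3-1024k is Reed-Solomon with 6 data blocks, 3 parity blocks, and 1024 KB cells. A small helper (illustrative, not from the patent) makes the storage saving explicit:

```python
def ec_overhead(policy_name: str) -> float:
    """Storage overhead of an HDFS erasure-coding policy name such as
    'RS-6-3-1024k': blocks stored per block of payload data."""
    parts = policy_name.split("-")
    data, parity = int(parts[-3]), int(parts[-2])
    return (data + parity) / data

# RS-6-3 stores 1.5 bytes per byte of data, versus 3.0 for HDFS's
# default triple replication -- half the space at comparable durability.
```

This is why the system applies erasure coding to cold data: rarely-read files pay the reconstruction cost seldom but enjoy the space saving permanently.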
In summary, the present invention intelligently identifies the life cycle of small files and cold data. It evaluates the storage health score from data statistics and analysis, and intelligently governs, compresses, and migrates the data. The visual operation interface helps users manage and govern data clearly and intuitively, and API call interfaces are provided for cluster statistics, data analysis, data governance, data migration, file management, and more.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and refinements without departing from the spirit of the present invention, and such modifications and refinements should also be regarded as falling within the scope of the present invention.

Claims (7)

1. An intelligent recognition optimization system for data life cycle, characterized by: the system comprises a storage management module and a policy management module;
the storage management comprises an analysis module and a management module, wherein the analysis module evaluates the storage health score of the system by analyzing the number of small files and the cold data capacity of the file system and the health degree of the storage nodes; the management module assigns corresponding storage strategies according to the health scores, realizes optimized storage through a migration tool, and comprehensively masters storage and management conditions through a statistical chart;
the strategy management module supports management of a layered storage strategy, an analysis strategy and a compression strategy, and a user sets the layered storage strategy and the compression strategy for a directory to optimize file storage; setting an analysis strategy for the small files and the cold data to help data analysis;
the bottom layer of the whole technical framework comprises MySQL, hive and HDFS, the Hive Client is connected with the Hive, webHDFS and dfsadmin to access the HDFS so as to obtain data of the Hive and the HDFS, and MyBatis is used for interacting with MySQL to store the data;
the method comprises the following specific steps:
101 Metadata acquisition step: acquiring HDFS metadata by adopting a fsimage analysis mode;
102 Metadata indexing step: analyzing the metadata file obtained in the step 101) to construct a multi-branch tree structure;
103) Data analysis step: counting the number and size of all files and of each data type under a directory, and performing total-quantity statistics, ranking analysis, and proportion analysis to obtain a storage health score;
104) Data strategy configuration step: comprising the hierarchical storage strategy, the analysis strategy, and the compression strategy; the hierarchical storage strategy, namely the heterogeneous storage strategy, places data on different storage media according to access heat, so that HDFS storage can flexibly and efficiently handle various application scenarios; the analysis strategy lets the user set the definition of a small file and a threshold on the number of small files, the definition of cold data and a threshold on total cold-data volume, a disk-capacity threshold, and the schedule on which the system runs its analysis; and the compression strategy configures erasure codes, so that all currently selectable erasure codes can be viewed, data migration is ensured, and migration logs are viewed and recorded.
2. The intelligent recognition optimization system for data lifecycle of claim 1, characterized by: the middle layer of the whole technical framework adopts Schedule to realize periodic scheduling, and a multi-branch tree is constructed to facilitate data analysis; the upper layer of the whole technical framework provides an external API call interface and a visual UI operation interface.
3. The intelligent recognition optimization system for data lifecycle of claim 1, characterized by: the metadata includes: Path (directory path), Replication (number of replicas), ModificationTime (last modification time), AccessTime (last access time), PreferredBlockSize (preferred block size), BlocksCount (number of blocks), FileSize (file size), NSQUOTA (name quota), DSQUOTA (space quota), Permission (permissions), UserName (user), and GroupName (user group);
specifically, the fsimage is obtained, parsed into metadata in a specified format, and the result is output as an oiv dump file.
4. The intelligent recognition optimization system for data lifecycle of claim 1, characterized by: the data analysis comprises small-file analysis, cold-data analysis, hot-data analysis, table analysis, corrupted-block analysis, and disk capacity analysis:
the small-file analysis counts the number and size of small files according to the strategy settings;
the cold-data analysis counts the number and size of cold data according to the strategy settings;
the hot-data analysis counts the number and size of hot data according to the strategy settings;
the table analysis counts the number and size of all small files belonging to tables in the database according to the strategy settings;
the corrupted-block analysis counts the number of corrupted file blocks;
the disk capacity analysis counts the total disk capacity and its usage.
5. The intelligent recognition optimization system for data lifecycle of claim 1, characterized by: the scoring rules for the storage health score comprise a disk score, a small-file score, a cold-data score, and a file-block score;
the disk score totals 30 points; assuming the number of nodes is n, each node is worth 30/n points, and when the disk usage of w1 nodes exceeds the threshold, w1 × (30/n) points are deducted; assuming each node has m disks, each disk is worth 30/(n × m) points, and when the total disk storage does not exceed the threshold but w2 individual disks do, w2 × (30/(n × m)) points are deducted;
the small-file score totals 30 points; with the small-file count threshold set to t, 1 point is deducted when the small-file count x exceeds the threshold by 1-10%, another 1 point when it exceeds by 11-20%, and so on until the score is exhausted;
the cold-data score totals 30 points; if y GB of cold data are unprocessed, i.e. neither a hierarchical storage strategy nor an erasure-code strategy has been set for them, 1 point is deducted per 100 GB until the score is exhausted;
the file-block score totals 10 points; if z file blocks are corrupted, 1 point is deducted per 10 corrupted blocks (1-10 corrupted blocks deduct 1 point, 11-20 deduct another 1 point) until the score is exhausted;
therefore, the storage health score S is calculated as:
S = (30 - w1 × (30/n) - w2 × (30/(n × m))) + (30 - ceil((x - t)/(0.1 × t))) + (30 - ceil(y/100)) + (10 - ceil(z/10))
wherein each deduction cannot exceed its corresponding subtotal.
6. The intelligent recognition optimization system for data lifecycle of claim 1, characterized by: HDFS supports a variety of common storage types, including:
ARCHIVE: a storage medium with high storage density but low power consumption, used for storing cold data;
DISK: disk media, the default storage medium for HDFS;
SSD: solid-state-disk storage media;
RAM_DISK: data is written into memory while a replica is asynchronously written to the storage medium.
7. The intelligent recognition optimization system for data lifecycle of claim 6, characterized in that the tiered-storage policies comprise PROVIDED, COLD, WARM, HOT, ONE_SSD, ALL_SSD, and LAZY_PERSIST;
PROVIDED is used for data stored outside HDFS, and the storage medium is DISK;
COLD stores all replicas on ARCHIVE storage, and the storage medium is ARCHIVE;
WARM stores one replica on DISK and the remaining replicas on ARCHIVE storage, and the storage media are DISK and ARCHIVE;
HOT stores all replicas on DISK and is the default storage policy, and the storage medium is DISK;
ONE_SSD stores one replica on SSD and the remaining replicas on DISK, and the storage media are SSD and DISK;
ALL_SSD stores all replicas on SSD, and the storage medium is SSD;
LAZY_PERSIST writes one replica to RAM_DISK and lazily persists the remaining replicas to DISK, and the storage medium is DISK.
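For reference, the policy-to-media mapping of claim 7 can be written out as a small table. The policy names match Hadoop's built-in storage policies; the table layout and the `media_for` helper below are only an illustrative sketch, not part of the patent.

```python
# Tiered-storage policies from claim 7, mapped to the medium holding one
# replica and the medium holding the remaining replicas.
HDFS_POLICIES = {
    "PROVIDED":     ("DISK", "DISK"),       # data itself kept outside HDFS
    "COLD":         ("ARCHIVE", "ARCHIVE"),
    "WARM":         ("DISK", "ARCHIVE"),
    "HOT":          ("DISK", "DISK"),       # default policy
    "ONE_SSD":      ("SSD", "DISK"),
    "ALL_SSD":      ("SSD", "SSD"),
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),   # lazily persisted to DISK
}

def media_for(policy: str) -> set:
    """Return the set of storage media a policy uses."""
    one, rest = HDFS_POLICIES[policy]
    return {one, rest}
```

On a live cluster the policy itself would be applied with the `hdfs storagepolicies` tooling; the mapping here only restates what the claim text says about which media each policy touches.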
CN202210879571.1A 2022-07-25 2022-07-25 Intelligent identification optimization system for data life cycle Pending CN115437997A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210879571.1A CN115437997A (en) 2022-07-25 2022-07-25 Intelligent identification optimization system for data life cycle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210879571.1A CN115437997A (en) 2022-07-25 2022-07-25 Intelligent identification optimization system for data life cycle

Publications (1)

Publication Number Publication Date
CN115437997A true CN115437997A (en) 2022-12-06

Family

ID=84241433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210879571.1A Pending CN115437997A (en) 2022-07-25 2022-07-25 Intelligent identification optimization system for data life cycle

Country Status (1)

Country Link
CN (1) CN115437997A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115640370A (en) * 2022-12-08 2023-01-24 深圳市智多兴投控科技有限公司 Data analysis method and related equipment


Similar Documents

Publication Publication Date Title
US10664453B1 (en) Time-based data partitioning
US8352429B1 (en) Systems and methods for managing portions of files in multi-tier storage systems
CN106662981B (en) Storage device, program, and information processing method
US9052832B2 (en) System and method for providing long-term storage for data
US8732217B2 (en) Using a per file activity ratio to optimally relocate data between volumes
US9916258B2 (en) Resource efficient scale-out file systems
US9110919B2 (en) Method for quickly identifying data residing on a volume in a multivolume file system
US8578096B2 (en) Policy for storing data objects in a multi-tier storage system
CN102576321B (en) Performance storage system in fast photographic system for capacity optimizing memory system performance improvement
CN103019887B (en) Data back up method and device
US20110145528A1 (en) Storage apparatus and its control method
US20060212495A1 (en) Method and system for storing data into a database
US8201001B2 (en) Method for optimizing performance and power usage in an archival storage system by utilizing massive array of independent disks (MAID) techniques and controlled replication under scalable hashing (CRUSH)
CN103914516A (en) Method and system for layer-management of storage system
CN107291889A (en) A kind of date storage method and system
CN104462389A (en) Method for implementing distributed file systems on basis of hierarchical storage
CN113568582B (en) Data management method, device and storage equipment
CN106326384A (en) File storage method suitable for high-speed mass storage based on FPGA (Field Programmable Gate Array)
CN115437997A (en) Intelligent identification optimization system for data life cycle
CN111741107A (en) Layering method and device based on file storage system and electronic equipment
US20220404987A1 (en) Storage system, storage control device, and storage control method
CN105630689B (en) Accelerate the method for data reconstruction in a kind of distributed memory system
Alatorre et al. Intelligent information lifecycle management in virtualized storage environments
JP4079244B2 (en) Reorganization processing method for write-once type storage media volume
CN110727406B (en) Data storage scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination