CN115934670A - Copy placement strategy verification method and device for multiple HDFS (Hadoop Distributed File System) machine rooms

Info

Publication number: CN115934670A (application CN202310219585.5A)
Authority: CN (China)
Prior art keywords: file, copy, analysis, mapping table, distribution
Legal status: Granted
Application number: CN202310219585.5A
Other languages: Chinese (zh)
Other versions: CN115934670B (en)
Inventors: 胡梦宇, 贾承昆, 张俊杰, 陈曦, 赵兵, 李大海
Current Assignee: Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee: Zhizhe Sihai Beijing Technology Co Ltd
Priority date / Filing date: 2023-03-09
Application filed by Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202310219585.5A
Publication of CN115934670A: 2023-04-07
Application granted; publication of CN115934670B: 2023-05-05
Legal status: Active

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a method and a device for verifying the copy placement policy of HDFS (Hadoop Distributed File System) multiple machine rooms. The method includes: analyzing the image file of the metadata node with an optimized analysis tool to obtain the file information of all files, and constructing a first mapping table based on the file information; acquiring the file block list of each data node, and constructing a second mapping table based on those lists; computing a copy distribution table from the first mapping table and the second mapping table; and, for a file to be verified, verifying based on the copy distribution table whether its copy placement policy conforms to a preset distribution. Analyzing the image file with the optimized tool speeds up analysis, enriches the data format, reduces the analysis output, and supplements missing fields; and by performing offline analysis on the copy distribution table, the copy placement policy of every file can be verified even when the number of file copies is large.

Description

Copy placement strategy verification method and device for multiple HDFS (Hadoop distributed File System) machine rooms
Technical Field
The application relates to the field of computer technology, and in particular to a method and a device for verifying the copy placement policy of HDFS (Hadoop Distributed File System) multiple machine rooms.
Background
The Federation architecture of the Hadoop Distributed File System (HDFS) solves the problem of storing metadata information on the metadata nodes, giving the metadata nodes nearly unlimited capacity for horizontal expansion. On this basis, because a single machine room has an upper limit on capacity, many internet companies have developed HDFS multi-machine-room schemes: data nodes are deployed across machine rooms and file copies are placed on those cross-room data nodes, so that every machine room can provide an available copy and excessive traffic over cross-room leased lines is avoided.
When copies are placed across multiple machine rooms, it must be strictly ensured that they match the multi-machine-room placement policy; otherwise cross-room traffic is generated, and cross-room traffic is very expensive. At present, the only way to verify whether the copy placement policy meets the requirements in an HDFS multi-machine-room scheme is to check the file copy distribution by using Remote Procedure Calls (RPC) to invoke getBlockLocations on the metadata node. The RPC approach is practical only for a small number of file copies: checking the distribution of file copies at the million level or above takes a very long time, and resorting to means such as concurrent calls increases the performance pressure on the metadata node and may even block it. Therefore, for the HDFS multi-machine-room scheme, how to check the distribution of file copies in time and adjust their cross-room placement policy when the number of file copies is large is a problem that urgently needs to be solved.
Disclosure of Invention
The application provides a method and a device for verifying the copy placement policy of HDFS multiple machine rooms, so that the copy placement policy of each file can be checked, analyzed and adjusted in time even when the number of file copies is large.
In a first aspect, the present application provides a method for verifying a copy placement policy of multiple HDFS machine rooms, where the method includes:
analyzing the image file of the metadata node with an optimized analysis tool to obtain the file information of all files, and constructing a first mapping table based on the file information; the optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data; the file information comprises the file block information corresponding to each file;
acquiring a file block list of each data node, and constructing a second mapping table based on the file block list of each data node;
computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and, for a file to be verified, verifying based on the copy distribution table whether the copy placement policy of the file conforms to a preset distribution.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data includes: analyzing the path information of the image file with a native string analysis method; supplementing the analysis content of the image file with a block_ids field and an ec_id field; and outputting the analysis data with multiple threads in parallel and compressing it using columnar storage.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, the block_ids field is used to represent the file block identifiers corresponding to each file, and the ec_id field is used to represent the erasure code file identifier corresponding to each file.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, the first mapping table is used for querying the mapping relationships between all files and all file blocks, and includes: path, file size, number of file blocks, file block size, file block identifier, erasure code file identifier, date, and cluster.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, the second mapping table is used for querying the mapping relationships between all file blocks and each data node, and includes: file block identifier, data node identifier, machine room identifier, date, and cluster.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, obtaining the file block list of each data node includes: screening out the files that conform to the file block naming rule from the data directory of each data node, and obtaining the file block list of each data node from those files; and/or directly acquiring the file blocks on each data node from a communication protocol interface of the metadata node to obtain the file block list of each data node.
According to the copy placement policy verification method for HDFS multiple machine rooms provided by the application, verifying based on the copy distribution table whether the copy placement policy of the file to be verified conforms to the preset distribution includes: querying the current copy distribution of the file to be verified based on the copy distribution table; judging whether the current copy distribution of the file to be verified conforms to its preset distribution; and if not, performing a copy migration operation on the file to be verified according to its preset distribution, so that the current copy distribution of the file comes to conform to the preset distribution.
In a second aspect, the present application further provides a device for verifying the copy placement policy of HDFS multiple machine rooms, where the device includes:
the first construction module is used for analyzing the image file of the metadata node with an optimized analysis tool to obtain the file information of all files and constructing a first mapping table based on the file information; the optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data; the file information comprises the file block information corresponding to each file;
the second construction module is used for acquiring a file block list of each data node and constructing a second mapping table based on the file block list of each data node;
the third construction module is used for computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and the verification analysis module is used for verifying, for a file to be verified, whether the copy placement policy of the file conforms to a preset distribution based on the copy distribution table.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor, when running the computer program, executes the steps of any implementation of the above method for verifying the copy placement policy of HDFS multiple machine rooms.
In a fourth aspect, an embodiment of the present application further provides a readable storage medium storing a computer program which, when run on a processor, executes the steps of any implementation of the above method for verifying the copy placement policy of HDFS multiple machine rooms.
In summary, in the method and device for verifying the copy placement policy of HDFS multiple machine rooms provided by the application, the analysis tool's path analysis method, analysis content, and analysis-data output and storage format are optimized, which speeds up analysis, enriches the analysis data format, reduces the analysis output, and supplements missing fields; the optimized analysis tool analyzes the image file of the metadata node to construct the first mapping table. The file block information of each data node is pulled from the data directory of each data node or directly from the metadata node, and the second mapping table is constructed from it. The first mapping table, the second mapping table, and the copy distribution table computed from them are all stored in Hive tables, which facilitates subsequent data query, analysis and computation. Offline analysis is performed on the copy distribution table, so that the copy placement policy of every file can be verified even when the number of file copies is large.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of the copy placement policy verification method for HDFS multiple machine rooms provided by the present application;
Fig. 2 is a schematic structural diagram of the copy placement policy verification apparatus for HDFS multiple machine rooms provided by the present application;
Fig. 3 is a schematic structural diagram of the electronic device provided in an embodiment of the present application.
Reference numerals: 200 - copy placement policy verification apparatus; 210 - first construction module; 220 - second construction module; 230 - third construction module; 240 - verification analysis module; 300 - electronic device; 310 - memory; 320 - processor; 330 - bus.
Detailed Description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of the copy placement policy verification method for HDFS multiple machine rooms provided by the present application. "HDFS multiple machine rooms" refers to a single HDFS cluster comprising clients, metadata nodes (NameNode) and data nodes (DataNode), where the metadata node is deployed in a designated main machine room of the cluster and the data nodes are deployed across multiple machine rooms. The metadata node is the master node that receives metadata requests initiated by client users, such as listing, creating or deleting directories; it does not serve actual data reads and writes. The data nodes are the nodes that actually store data and receive users' data read and write requests, such as reading and writing files. In addition, there are generally multiple metadata nodes in an HDFS multi-machine-room deployment, operated in a master/standby mode, that is, divided into one active metadata node (Active NameNode) and several standby metadata nodes (Standby NameNode) to provide a highly available service.
It should be noted that when the HDFS stores a file, the file is divided into several file blocks of a fixed size (e.g., 64 MB or 128 MB); each file block is replicated into several (generally three) identical copies, and the copies of each file block are stored on different data nodes according to the file's copy placement policy, with the file blocks independent of each other. For example, suppose file 1 is divided into file block 1 and file block 2 at 128 MB, each file block having three copies, and the copy placement policy of file 1 is "az1, az1, az2", i.e., two copies are placed in machine room 1 and one copy in machine room 2. Then the copies of file block 1 and file block 2 must each satisfy the placement policy "az1, az1, az2"; assume the three copies of file block 1 are distributed on data node 1 of machine room 1, data node 2 of machine room 1, and data node 5 of machine room 2. The three copies of file block 2 may then be distributed on the same data nodes as the copies of file block 1, i.e., data node 1 of machine room 1, data node 2 of machine room 1, and data node 5 of machine room 2; or on different data nodes, as long as the placement policy "az1, az1, az2" is satisfied, for example data node 3 of machine room 1, data node 4 of machine room 1, and data node 6 of machine room 2.
As shown in fig. 1, the method for verifying the copy placement policy of HDFS multiple machine rooms includes:
s1, analyzing the mirror image files of the metadata nodes by adopting an optimized analysis tool to obtain file information of all files, and constructing a first mapping table based on the file information.
The optimized analysis tool is obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data.
Specifically, analyzing the image file of the metadata node with the optimized analysis tool to obtain the file information of all files and constructing the first mapping table based on the file information includes:
and step S11, acquiring the mirror image file of the metadata node.
Specifically, the metadata node periodically generates an image file, the fsimage file. The fsimage file is a binary file that contains detailed information on all directories and files in the HDFS multiple machine rooms, for example the metadata of directories and files, the number of file copies, file modification time, file access time, and the number and size of the file blocks that make up each file; however, the fsimage file does not record on which data node each file block is actually located. It should be noted that the number of file copies recorded in the fsimage file is at the file level: for example, if a file contains 2 file blocks and each file block has 4 copies, the fsimage file records a file copy number of 4 at the file level, not the total of 8 file block copies.
Step S12, analyzing the image file with the optimized analysis tool to obtain the file information of all files.
Specifically, the optimization of the path analysis method, the analysis content, and the output and storage format of the analysis data includes:
(1) Analyzing the path information of the image file with a native string analysis method.
In some embodiments, the path information of the image file is analyzed as native strings, i.e., with String-based analysis rather than Path-based analysis, which can increase the analysis speed by a factor of 4-6.
(2) Supplementing the analysis content of the image file with a block_ids field and an ec_id field. The block_ids field represents the file block identifiers corresponding to each file, and the ec_id field represents the erasure code file identifier corresponding to each file.
(3) Outputting the analysis data with multiple threads in parallel, and compressing the analysis data using columnar storage.
In some embodiments, multi-threaded writes increase downstream computing speed. In addition, whereas conventional row storage yields analysis data in formats such as csv and txt, compressing the analysis data with columnar storage yields data in further formats, such as the ORC (Optimized Row Columnar) format, which improves the compression ratio and query performance and facilitates downstream computation.
It should be noted that the existing fsimage pulling and analysis tool provided by the HDFS has at least the following problems. On the one hand, the fsimage analysis tool provided by the HDFS generally parses paths with the Path class, but during analysis, initializing each Path object additionally creates several URI (Uniform Resource Identifier) objects, causing the program to run unnecessary logic and garbage collection (GC), which severely slows down the analysis. On the other hand, the data produced by the fsimage analysis tool provided by the HDFS has missing fields: it lacks both the file block information corresponding to each file and the Erasure Code (EC) file identifier corresponding to each file. Moreover, the analyzed data is too large, mostly in TSV (tab-separated values) format, and a single output file can reach hundreds of gigabytes, which is inconvenient for downstream computation. It is therefore necessary to optimize the existing fsimage analysis tool so as to analyze the fsimage file efficiently and accurately.
Step S13, constructing a first mapping table based on the file information.
The file information includes file block information corresponding to each file. The first mapping table is used for querying mapping relationships between all files and all file blocks, and the first mapping table includes: path, file size, number of file blocks, file block size, file block identification, erasure code file identification, date, cluster.
In some embodiments, a new table is first created in Hive in the HDFS multiple machine rooms and named, for example, "hadoop_admin.fsimage"; the correspondence between all files and all file blocks is obtained from the file information, the file information is written into the hadoop_admin.fsimage table, and each field of the table is given its corresponding field value, yielding the first mapping table for querying the mapping relationships between all files and all file blocks. The Hive table is a Hadoop-based data warehouse tool used for data extraction, transformation and loading; it can map a structured data file onto a database table and provides an SQL (Structured Query Language) query capability.
In some embodiments, the file information includes, in addition to the file block information corresponding to each file, information such as the file copy number, file modification time, file access time, file size and date. The first mapping table then further includes information such as the file copy number, file modification time, file access time and file permission.
As shown in table 1, it should be noted that the permission field indicates the file permission and can be used to distinguish files from directories: if the field value of the permission field starts with "-", it denotes a file, and if it starts with "d", it denotes a directory. The ec_id field represents the erasure code file identifier; if its value is 0, the file has no corresponding erasure code file. The value of the p_date field is typically in year-month-day format, such as 2023-01-10.
Table 1: First mapping table
[Table 1 appears as an image in the original publication; its fields are described in the surrounding text.]
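To make the construction of the first mapping table concrete, the following HiveQL sketch shows what the hadoop_admin.fsimage table could look like. Only the path, permission, block_ids, ec_id, name_service and p_date columns are named in this description; the remaining column names (replication, file_size, block_count) are hypothetical placeholders for the file copy number, file size and number of file blocks mentioned above, and the actual schema may differ.

-- Illustrative sketch of the first mapping table; column names marked
-- 'hypothetical' are assumptions, not taken from the patent.
CREATE TABLE IF NOT EXISTS hadoop_admin.fsimage (
    path         STRING        COMMENT 'full file or directory path',
    permission   STRING        COMMENT 'starts with - for a file, d for a directory',
    replication  INT           COMMENT 'file-level copy number (hypothetical name)',
    file_size    BIGINT        COMMENT 'file size in bytes (hypothetical name)',
    block_count  INT           COMMENT 'number of file blocks (hypothetical name)',
    block_ids    ARRAY<BIGINT> COMMENT 'identifiers of the blocks making up the file',
    ec_id        BIGINT        COMMENT 'erasure code file identifier, 0 if none',
    name_service STRING        COMMENT 'cluster (name service)'
)
PARTITIONED BY (p_date STRING)   -- date, e.g. 2023-01-10
STORED AS ORC;                   -- columnar storage, as described in step S12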
In the implementation of the above method, the image file of the metadata node is analyzed with the optimized analysis tool to obtain the file information of all files, which solves problems such as slow analysis, a single analysis format, overly large analysis output, and missing fields in the analysis content; and the first mapping table is built in Hive from the file information, so that the mapping relationships between all files and all file blocks can be conveniently queried later.
S2, obtaining a file block list of each data node, and constructing a second mapping table based on the file block list of each data node.
The second mapping table is used for querying mapping relationships between all file blocks and each data node, and the second mapping table includes: file block identification, data node identification, machine room identification, date and cluster.
Specifically, it can be understood that the obtaining of the file block list of each data node includes: screening out files meeting file block naming rules from data directories of all data nodes, and obtaining a file block list of all data nodes based on the files meeting the file block naming rules; and/or directly acquiring the file blocks on each data node from a communication protocol interface of the metadata node to obtain a file block list of each data node.
In some embodiments, since the file block copies of each data node are stored on its local disk and are named according to a file block naming rule, the rule may be that the file name of a file block copy carries the prefix "blk_". The file block list of each data node can therefore be obtained by screening, from the data directory of each data node, the files that conform to the file block naming rule. Although traversing the data directory of each data node puts some pressure on the data node's performance and increases the operation and maintenance burden, this approach yields more detailed file block information: in addition to the correspondence between file blocks and data nodes, details such as each file block's creation time and its specific contents can be obtained.
In some embodiments, the metadata node stores the correspondence between the file blocks and each data node and also provides a communication protocol interface, so the file block list of each data node can be obtained directly through the NamenodeProtocol#getBlocks call of that interface, which greatly reduces the operation and maintenance pressure. When a data node holds a large number of file blocks and a single pull would exceed the maximum request size, the file block list can be fetched from each data node in several batches.
Specifically, a new table is likewise first created in Hive in the HDFS multiple machine rooms and named, for example, "hadoop_admin.blocks"; the correspondence between all file blocks and each data node is obtained from the file block lists of the data nodes and written into the hadoop_admin.blocks table, and each field of the table is given its corresponding field value, yielding the second mapping table for querying the mapping relationships between all file blocks and each data node. Examples of the fields and field definitions of the second mapping table are shown in table 2.
Table 2: Second mapping table
[Table 2 appears as an image in the original publication; its fields are described in the surrounding text.]
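Analogously, a minimal HiveQL sketch of the second mapping table, using the field names that appear later in this description (block_id, hostname, az, name_service, p_date); the concrete types are assumptions:

-- Illustrative sketch of the second mapping table (one row per block copy).
CREATE TABLE IF NOT EXISTS hadoop_admin.blocks (
    block_id     BIGINT COMMENT 'file block identification',
    hostname     STRING COMMENT 'data node identification',
    az           STRING COMMENT 'machine room identification',
    name_service STRING COMMENT 'cluster'
)
PARTITIONED BY (p_date STRING)  -- date
STORED AS ORC;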
In the implementation of the above method, the file block list of each data node can be obtained in two ways. In the first, the data directory of each data node is traversed and the files conforming to the file block naming rule are screened out, which also yields more detailed information about each file block. In the second, the file block information on each data node is obtained directly from the communication protocol interface of the metadata node, which reduces the operation and maintenance pressure compared with the first way. In addition, the second mapping table is built in Hive from the file block lists of the data nodes, so that the mapping relationships between all file blocks and each data node can be conveniently queried later.
S3, computing a copy distribution table based on the first mapping table and the second mapping table.
The copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node.
Specifically, a new table is created in Hive in the HDFS multiple machine rooms and named, for example, "hadoop_admin.block_analysis"; the copy distribution table includes the path, file block identifier, erasure code file identifier, data node list, machine room list, cluster and date. Using the SQL computing capability of Hive, the correspondence between all files and each data node is obtained from the first mapping table and the second mapping table, yielding the copy distribution table.
In some implementations, the fields and field definitions of the copy distribution table are shown in table 3. The copy distribution table may be built once a day, generally by a simple ETL (Extract-Transform-Load) job over the first and second mapping tables. Specifically, the first mapping table and second mapping table of the current day are first selected by the p_date field; then, for rows whose name_service fields are equal, field values are taken from the two tables to determine the path, ec_id, name_service, block_id, hostname, az and p_date fields; finally, the datanode_list and az_list fields are derived by aggregating the hostname and az fields.
Table 3: Copy distribution table
[Table 3 appears as an image in the original publication; its fields are described in the surrounding text.]
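The daily ETL just described can be written as a single HiveQL statement. The following is a minimal sketch under the table layouts sketched above; in particular it assumes that block_ids is stored as an array (if it were a delimited string, a split() would be needed first):

-- Join the two mapping tables for one day and aggregate the per-copy rows
-- into the datanode_list and az_list fields of the copy distribution table.
INSERT OVERWRITE TABLE hadoop_admin.block_analysis PARTITION (p_date = '2023-01-10')
SELECT
    f.path,
    f.block_id,
    f.ec_id,
    collect_list(b.hostname) AS datanode_list,  -- data nodes holding a copy
    collect_list(b.az)       AS az_list,        -- machine room of each copy
    f.name_service
FROM (
    -- one row per (file, block) from the first mapping table
    SELECT path, ec_id, name_service, blk AS block_id
    FROM hadoop_admin.fsimage
    LATERAL VIEW explode(block_ids) t AS blk
    WHERE p_date = '2023-01-10'
) f
JOIN hadoop_admin.blocks b
  ON b.block_id = f.block_id
 AND b.name_service = f.name_service  -- same name_service, as required above
WHERE b.p_date = '2023-01-10'
GROUP BY f.path, f.block_id, f.ec_id, f.name_service;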
In the implementation of the above method, the copy distribution table is built in Hive from the first mapping table and the second mapping table, so that the copy distribution of each file and of each data node can be conveniently queried later; and since the first mapping table, the second mapping table and the copy distribution table are all stored in Hive, any analysis engine that supports Hive table computation can be used for querying, which facilitates subsequent analysis and computation.
S4, for a file to be verified, verifying based on the copy distribution table whether the copy placement policy of the file conforms to a preset distribution.
Specifically, it can be understood that the verifying whether the copy placement policy of the file to be verified conforms to the preset distribution based on the copy distribution table includes:
and S41, inquiring the current copy distribution of the file to be verified based on the copy distribution table.
Specifically, based on the file name of the file to be verified, the file blocks of the file and the data nodes over which the copies of those file blocks are distributed are queried from the copy distribution table, and the current copy distribution of the file is determined from those data nodes. For example, for a file to be verified named test.txt, the data nodes over which the copies of its file blocks are distributed can be queried with the following SQL statement.
SELECT *
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10' AND path = '/test.txt';
Suppose the query shows that the file test.txt has 2 file blocks, each with 3 copies; the copies of file block 1 are distributed on data node 1 of machine room 1, data node 2 of machine room 1 and data node 5 of machine room 2, and the copies of file block 2 are distributed on data node 3 of machine room 1, data node 4 of machine room 1 and data node 6 of machine room 2. It can then be determined that the current distribution of the file test.txt is "two copies in machine room 1 and one copy in machine room 2".
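Instead of reading the machine rooms off row by row, the per-room copy counts can be aggregated directly. A sketch under the same assumed schema (one az_list element per copy):

-- Count the copies of each block of /test.txt per machine room.
SELECT block_id, az, COUNT(*) AS copies_in_az
FROM hadoop_admin.block_analysis
LATERAL VIEW explode(az_list) t AS az
WHERE p_date = '2023-01-10' AND path = '/test.txt'
GROUP BY block_id, az;

For the example above, this query would return az1 -> 2 and az2 -> 1 for each of the two blocks.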
Step S42, judging whether the current copy distribution of the file to be verified conforms to its preset distribution; and if not, performing a copy migration operation on the file according to its preset distribution, so that the current copy distribution of the file comes to conform to the preset distribution.
The preset distribution of a file to be verified can be the file's copy placement policy, or a copy distribution requirement set for the file.
Specifically, if the file to be verified needs to be readable from multiple machine rooms and its preset distribution is cross-room deployment, but the current copy distribution queried from the copy distribution table shows all copies deployed in the same machine room, then the current copy distribution of the file does not conform to its preset distribution. Likewise, if the current copy distribution is deployed across machine rooms but does not conform to the file's copy placement policy, it is also considered not to conform to the preset distribution.
Continuing the example above with the file test.txt: if the copy placement policy of test.txt is "az1, az1, az2", i.e., two copies in machine room 1 and one copy in machine room 2, then the current distribution of test.txt conforms to the policy. If instead the policy is "az1, az1, az2, az2", i.e., two copies in machine room 1 and two copies in machine room 2, then the current distribution does not conform, and a copy migration operation is needed to adjust the distribution of test.txt so that it becomes "two copies in machine room 1 and two copies in machine room 2".
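If the expected placement is expressed as a sorted list of machine rooms, the check in step S42 can itself be pushed into a query. The following sketch is an illustration, not a statement from the patent; it lists the blocks of test.txt whose current distribution deviates from the policy "az1, az1, az2":

-- Blocks whose sorted machine-room list differs from the expected policy.
SELECT path, block_id, az_list
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10'
  AND path = '/test.txt'
  AND concat_ws(',', sort_array(az_list)) <> 'az1,az1,az2';

Any rows returned are candidates for the copy migration operation described above.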
It should be noted that after the copy distribution table is obtained through steps S1 to S3, besides the copy placement policy verification of step S4, the copy distribution of each data node can also be queried from the table, for example to find which data node holds the most copies. For instance, the distribution of file block copies on the data node named data01 can be queried with the following SQL statement.
SELECT *
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10' AND array_contains(datanode_list, 'data01');
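The question of which data node holds the most copies can likewise be answered by exploding datanode_list and counting; a sketch under the same assumed schema:

-- The ten data nodes carrying the most file block copies.
SELECT dn, COUNT(*) AS copy_count
FROM hadoop_admin.block_analysis
LATERAL VIEW explode(datanode_list) t AS dn
WHERE p_date = '2023-01-10'
GROUP BY dn
ORDER BY copy_count DESC
LIMIT 10;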
It should be noted that queries over the copy distribution table may use SQL statements like those in the examples above, or any analysis engine that supports Hive table computation; this application does not limit the choice.
In the implementation of the above method, for the HDFS multi-machine-room scheme, obtaining the copy distribution table and analyzing it offline has two benefits. On the one hand, the copy placement policy of each file can be verified from the copy distribution table even when the number of file copies is large; the offline analysis neither requests nor scans the metadata of the metadata node, so compared with the prior art of querying the metadata node for verification, it adds no operation and maintenance pressure on the metadata node and takes little time, allowing the distribution of file copies to be checked and their cross-room placement policy to be adjusted in time. On the other hand, both the copy distribution of each file and the copy distribution of each data node can be queried from the table, and since querying and analysis are entirely offline, the normal operation of the HDFS multiple machine rooms is not affected.
According to the method for verifying the copy placement policy of HDFS multiple machine rooms described above, the analysis tool's path analysis method, analysis content, and analysis-data output and storage format are optimized, which speeds up analysis, enriches the analysis data format, reduces the analysis output, and supplements missing fields; the optimized analysis tool analyzes the image file of the metadata node to construct the first mapping table. The file block information of each data node is pulled from the data directory of each data node or directly from the metadata node, and the second mapping table is constructed from it. The first mapping table, the second mapping table, and the copy distribution table computed from them are all stored in Hive tables, facilitating subsequent data query, analysis and computation. Offline analysis is performed on the copy distribution table, so that the copy placement policy of every file can be verified even when the number of file copies is large.
Fig. 2 is a schematic structural diagram of the copy placement policy verification apparatus for HDFS multiple machine rooms, which may be used to implement the method described in the foregoing embodiments. As shown in fig. 2, the apparatus includes:
the first construction module 210 is configured to analyze the image file of the metadata node with an optimized analysis tool to obtain the file information of all files, and to construct a first mapping table based on the file information; the optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data; the file information comprises the file block information corresponding to each file;
the second building module 220 is configured to obtain a file block list of each data node, and build a second mapping table based on the file block list of each data node;
a third construction module 230, configured to compute a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and a verification analysis module 240, configured to verify, for a file to be verified, whether the copy placement policy of the file conforms to a preset distribution based on the copy distribution table.
For detailed description of the copy placement policy verification apparatus for the HDFS multi-machine room, please refer to the description of the related method steps in the above embodiments, and the repeated parts are not described again. The above-described apparatus embodiments are merely illustrative, and "modules" used herein as separate components may be a combination of software and/or hardware for implementing predetermined functions, and may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Fig. 3 is a schematic structural diagram of the electronic device provided by the present application. As shown in fig. 3, the electronic device includes a memory 310 and a processor 320, where the memory 310 stores a computer program, and the processor 320, when executing the computer program, performs the steps of the above method for verifying the copy placement policy of HDFS multiple machine rooms.
An embodiment of the present application further provides a readable storage medium storing a computer program which, when run on a processor, performs the steps of the above method for verifying the copy placement policy of HDFS multiple machine rooms.
It should be understood that the electronic device may be a personal computer, a tablet computer, a smartphone, or another electronic device with logic computing capability; the readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, an optical disk, or the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not substantially depart from the spirit and scope of the present invention as set forth in the appended claims.

Claims (10)

1. A method for verifying the copy placement strategy of HDFS (Hadoop Distributed File System) multiple machine rooms, characterized by comprising the following steps:
analyzing the image file of the metadata node with an optimized analysis tool to obtain the file information of all files, and constructing a first mapping table based on the file information; the optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data; the file information comprises the file block information corresponding to each file;
acquiring a file block list of each data node, and constructing a second mapping table based on the file block list of each data node;
computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and, for a file to be verified, verifying based on the copy distribution table whether the copy placement strategy of the file conforms to a preset distribution.
2. The method of claim 1, wherein the optimization of the path analysis method, the analysis content, and the output and storage format of the analysis data comprises:
analyzing the path information of the image file with a native string analysis method;
supplementing the analysis content of the image file with a block_ids field and an ec_id field;
and outputting the analysis data with multiple threads in parallel, and compressing the analysis data using columnar storage.
3. The method of claim 2, wherein the block_ids field is used to represent the file block identifiers corresponding to each file, and the ec_id field is used to represent the erasure code file identifier corresponding to each file.
4. The method of claim 1, wherein the first mapping table is used for querying mapping relationships between all files and all file blocks, and the first mapping table comprises: path, file size, number of file blocks, file block size, file block identification, erasure code file identification, date, cluster.
5. The method according to claim 1, wherein the second mapping table is used for querying mapping relationships between all file blocks and data nodes, and the second mapping table includes: file block identification, data node identification, machine room identification, date and cluster.
6. The method of claim 1, wherein the obtaining the file block list of each data node comprises:
screening out files meeting file block naming rules from data directories of all data nodes, and obtaining a file block list of all data nodes based on the files meeting the file block naming rules; and/or the presence of a gas in the gas,
and directly acquiring the file blocks on each data node from a communication protocol interface of the metadata node to obtain a file block list of each data node.
7. The method according to claim 1, wherein the verifying whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table comprises:
inquiring the current copy distribution of the file to be verified based on the copy distribution table;
judging whether the current copy distribution of the file to be verified conforms to the preset distribution of the file to be verified; and if not, performing copy migration operation on the file to be verified according to the preset distribution of the file to be verified to enable the current copy distribution of the file to be verified to be in accordance with the preset distribution of the file to be verified.
8. A device for verifying the copy placement strategy of HDFS (Hadoop Distributed File System) multiple machine rooms, characterized by comprising:
a first construction module, used for analyzing the image file of the metadata node with an optimized analysis tool to obtain the file information of all files, and constructing a first mapping table based on the file information; the optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis content, and the output and storage format of the analysis data; the file information comprises the file block information corresponding to each file;
the second construction module is used for acquiring a file block list of each data node and constructing a second mapping table based on the file block list of each data node;
a third construction module, used for computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and a verification analysis module, used for verifying, for a file to be verified, whether the copy placement strategy of the file conforms to a preset distribution based on the copy distribution table.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor runs the computer program to execute the method for verifying the copy placement strategy of HDFS multiple machine rooms according to any one of claims 1 to 7.
10. A readable storage medium, in which a computer program is stored, which, when run on a processor, performs the method for verifying the copy placement strategy of HDFS multiple machine rooms according to any one of claims 1 to 7.
CN202310219585.5A (priority date 2023-03-09, filing date 2023-03-09): Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room. Status: Active; granted as CN115934670B.

Priority Applications (1)

CN202310219585.5A (priority date 2023-03-09, filing date 2023-03-09): Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Publications (2)

CN115934670A (publication of application): 2023-04-07
CN115934670B (publication of grant): 2023-05-05

Family

Family ID: 85827715

Family Applications (1)

CN202310219585.5A (Active; granted as CN115934670B): Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Country Status (1)

CN: CN115934670B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103257970A (en) * 2012-02-17 2013-08-21 百度在线网络技术(北京)有限公司 Method and device for testing primary node of HDFS (Hadoop Distributed File System)
CN104615606A (en) * 2013-11-05 2015-05-13 阿里巴巴集团控股有限公司 Hadoop distributed file system and management method thereof
CN108769171A (en) * 2018-05-18 2018-11-06 百度在线网络技术(北京)有限公司 The copy of distributed storage keeps verification method, device, equipment and storage medium
WO2019189962A1 (en) * 2018-03-27 2019-10-03 주식회사 리얼타임테크 Query parallelizing method for data having copy existing in distribution database
CN114385561A (en) * 2022-01-10 2022-04-22 北京沃东天骏信息技术有限公司 File management method and device and HDFS system
CN115048254A (en) * 2022-07-11 2022-09-13 北京志凌海纳科技有限公司 Simulation test method, system, equipment and readable medium of data distribution strategy


Also Published As

CN115934670B (en): 2023-05-05


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant