CN115934670B - Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room - Google Patents

Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Info

Publication number
CN115934670B
Authority
CN
China
Prior art keywords
file
copy
analysis
distribution
verified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310219585.5A
Other languages
Chinese (zh)
Other versions
CN115934670A (en)
Inventor
胡梦宇
贾承昆
张俊杰
陈曦
赵兵
李大海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202310219585.5A
Publication of CN115934670A
Application granted
Publication of CN115934670B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a method and device for verifying the copy placement policy of an HDFS multi-machine room, wherein the method comprises the following steps: parsing the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and constructing a first mapping table based on the file information; obtaining the file block list of each data node, and constructing a second mapping table based on the file block list of each data node; computing a copy distribution table based on the first mapping table and the second mapping table; and, for a file to be verified, verifying whether the copy placement policy of the file conforms to a preset distribution based on the copy distribution table. Parsing the image file with the optimized parsing tool speeds up parsing, enriches the output data formats, reduces the parsing output size, and supplements missing fields; performing offline analysis based on the copy distribution table makes it possible to verify the copy placement policy of every file even when the number of file copies is large.

Description

Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room
Technical Field
The application relates to the technical field of computers, in particular to a method and a device for verifying copy placement policies of an HDFS multi-machine room.
Background
The Federation architecture of the Hadoop Distributed File System (HDFS) solves the storage problem of metadata node meta-information, giving metadata nodes almost unlimited horizontal scalability. On this basis, and considering the capacity ceiling of a single machine room, many Internet companies have developed HDFS multi-machine room schemes: data nodes are deployed across machine rooms and file copies are placed on those cross-room data nodes, so that every machine room can serve an available copy and excessive traffic over dedicated inter-room lines is avoided.
When copies are placed across multiple machine rooms, the placement must strictly match the multi-machine room copy placement policy; otherwise cross-room copy traffic is generated, and such traffic is very expensive. At present, the only way to verify whether a copy placement policy meets the requirements in an HDFS multi-machine room scheme is to call the getBlockLocations method of the metadata node over RPC (Remote Procedure Call) and inspect the file copy distribution. This RPC approach is practical only for a small number of file copies: checking the distribution of millions or even tens of millions of file copies takes a very long time, and resorting to concurrent calls increases the performance pressure on the metadata node and may even block it. Therefore, for the HDFS multi-machine room scheme, how to check the file copy distribution in time and adjust the cross-room placement policy of file copies when the number of file copies is large is a problem that currently needs to be solved.
Disclosure of Invention
The application provides a method and device for verifying the copy placement policy of an HDFS multi-machine room, so that the copy placement policy of each file can be checked, analyzed and adjusted in time even when the number of file copies is large.
In a first aspect, the present application provides a method for verifying a copy placement policy of an HDFS multi-machine room, where the method includes:
parsing the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and constructing a first mapping table based on the file information; the optimized parsing tool is a parsing tool obtained by optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data; the file information includes the file block information corresponding to each file;
obtaining the file block list of each data node, and constructing a second mapping table based on the file block list of each data node;
computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
and, for a file to be verified, verifying whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table.
According to the method for verifying the copy placement policy of the HDFS multi-machine room provided by the application, optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data includes the following: parsing the path information of the image file based on a native string parsing method; supplementing the parsed content of the image file with a block_ids field and an ec_id field; and outputting the parsed data in a multithreaded parallel manner, and compressing the parsed data with columnar storage.
According to the method for verifying the copy placement policy of the HDFS multi-machine room provided by the application, the block_ids field represents the file block identifiers corresponding to each file, and the ec_id field represents the erasure code file identifier corresponding to each file.
According to the copy placement strategy verification method of the HDFS multi-machine room provided by the application, the first mapping table is used for querying mapping relations between all files and all file blocks, and the first mapping table includes: path, file size, number of file blocks, file block size, file block identification, erasure code file identification, date, cluster.
According to the copy placement policy verification method of the HDFS multi-machine room provided by the present application, the second mapping table is used for querying mapping relations between all file blocks and each data node, and the second mapping table includes: file block identification, data node identification, machine room identification, date and cluster.
According to the method for verifying the copy placement policy of the HDFS multi-machine room provided by the present application, obtaining the file block list of each data node includes: screening the files that conform to the file block naming rule from the data directory of each data node, and obtaining the file block list of each data node based on those files; and/or obtaining the file blocks on each data node directly from the communication protocol interface of the metadata node to obtain the file block list of each data node.
According to the method for verifying the copy placement policy of the HDFS multi-machine room provided by the application, verifying whether the copy placement policy of the file to be verified conforms to the preset distribution based on the copy distribution table includes: querying the current copy distribution of the file to be verified based on the copy distribution table; judging whether the current copy distribution of the file to be verified conforms to the preset distribution of the file to be verified; and, if it does not, performing a copy migration operation on the file to be verified so that its current copy distribution conforms to the preset distribution of the file to be verified.
In a second aspect, the present application further provides a device for verifying a copy placement policy of an HDFS multi-machine room, where the device includes:
a first construction module, configured to parse the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and construct a first mapping table based on the file information; the optimized parsing tool is a parsing tool obtained by optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data; the file information includes the file block information corresponding to each file;
a second construction module, configured to obtain the file block list of each data node and construct a second mapping table based on the file block list of each data node;
a third construction module, configured to compute a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
a verification analysis module, configured to verify, for a file to be verified, whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor, when running the computer program, executes the steps in any implementation of the above method for verifying the copy placement policy of an HDFS multi-machine room.
In a fourth aspect, an embodiment of the present application further provides a readable storage medium storing a computer program that, when run on a processor, executes the steps in any implementation of the above method for verifying the copy placement policy of an HDFS multi-machine room.
In summary, in the method and device for verifying the copy placement policy of an HDFS multi-machine room, the path parsing method, the parsing content, and the output and storage format of the parsing tool are optimized so as to speed up parsing, enrich the parsed data formats, reduce the parsing output size and supplement missing fields, and the optimized parsing tool is used to parse the image file of the metadata node and construct the first mapping table; the file block information of each data node is pulled by traversing the data directory of each data node or directly from the metadata node, and the second mapping table is constructed from it; the first mapping table, the second mapping table and the copy distribution table computed from them are stored in Hive tables, which facilitates subsequent data querying, analysis and computation; and offline analysis based on the copy distribution table makes it possible to verify the copy placement policy of every file even when the number of file copies is large.
Drawings
For a clearer description of the present application or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present application, and that a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for verifying copy placement policies of an HDFS multi-machine room provided in the present application;
fig. 2 is a schematic structural diagram of a copy placement policy verification device of an HDFS multi-machine room provided in the present application;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: 200 - copy placement policy verification device; 210 - first construction module; 220 - second construction module; 230 - third construction module; 240 - verification analysis module; 300 - electronic device; 310 - memory; 320 - processor; 330 - bus.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the technical solutions in the present application are described below clearly and completely with reference to the drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the protection scope of the present application.
Fig. 1 is a flowchart of the method for verifying the copy placement policy of an HDFS multi-machine room provided in the present application. The HDFS multi-machine room refers to a single HDFS cluster that includes clients, metadata nodes (NameNodes) and data nodes (DataNodes), where the metadata nodes are deployed in the main machine room of the cluster and the data nodes are deployed across a plurality of machine rooms. A metadata node is a master node of the HDFS multi-machine room; it receives the metadata requests initiated by the users of each client, such as listing a directory, creating a new directory or deleting a directory, and does not serve actual data reads and writes. A data node is a node that actually stores data in the HDFS multi-machine room; it receives users' data read and write requests, such as reading or writing a file. In addition, the HDFS multi-machine room contains a plurality of metadata nodes in a primary-backup arrangement, i.e. the metadata nodes are divided into one primary metadata node (Active NameNode) and a plurality of backup metadata nodes (Standby NameNodes) to provide a highly available service.
It should be noted that when the HDFS stores a file, the file is split into a number of file blocks of a fixed size (e.g. 64 MB or 128 MB), each file block is replicated into a plurality of (typically three) identical copies, and the copies of each file block are stored on different data nodes according to the copy placement policy of the file; the file blocks are independent of one another. For example, suppose file 1 is split at 128 MB into file block 1 and file block 2, each replicated into three copies, and the copy placement policy of file 1 is "az1, az1, az2", i.e. machine room 1 holds two copies and machine room 2 holds one copy. Then the copies of both file block 1 and file block 2 must satisfy the "az1, az1, az2" policy; assume the three copies of file block 1 are placed on data node 1 of machine room 1, data node 2 of machine room 1 and data node 5 of machine room 2. The three copies of file block 2 may then be placed on the same data nodes as those of file block 1, i.e. data node 1 of machine room 1, data node 2 of machine room 1 and data node 5 of machine room 2; or they may be placed on different data nodes, as long as the "az1, az1, az2" policy is satisfied, for example on data nodes 3 and 4 of machine room 1 and data node 6 of machine room 2.
As shown in fig. 1, the method for verifying the copy placement policy of the HDFS multi-machine room includes:
s1, analyzing mirror image files of metadata nodes by adopting an optimization analysis tool to obtain file information of all files, and constructing a first mapping table based on the file information.
The optimized analysis tool is an analysis tool obtained by optimizing the path analysis method, the analysis data output and storage format and the analysis content.
Specifically, it may be understood that the analyzing the mirror image file of the metadata node by using the optimizing analysis tool to obtain file information of all files, and constructing a first mapping table based on the file information includes:
step S11, obtaining an image file of the metadata node.
Specifically, the metadata node will periodically generate an image file "Fsimage file", where the Fsimage file is a binary file containing detailed information of all directories and files in the HDFS multi-machine room, for example, metadata information of the directories and files, the number of file copies, file modification time, file access time, the number of file blocks constituting each file, and the size of the file blocks, and the like, but the Fsimage file does not record the data node where the file block corresponding to each file is specifically located. It is noted that, the number of file copies recorded in the Fsimage file is at the file level, for example, if a certain file contains 2 file blocks and each file block is replicated with 4 copies, the number of file copies recorded in the Fsimage file is 4 at the file level, instead of 8 at the total file block copy.
Step S12, parsing the image file with the optimized parsing tool to obtain the file information of all files.
Specifically, optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data includes the following:
(1) Parsing the path information of the image file based on a native string parsing method.
In some embodiments, the native string parsing method parses the path information of the image file directly with the String class, which can make path parsing 4-6 times faster than Path-class-based parsing.
(2) Supplementing the parsed content of the image file with a block_ids field and an ec_id field. The block_ids field represents the file block identifiers corresponding to each file, and the ec_id field represents the erasure code file identifier corresponding to each file.
(3) Outputting the parsed data in a multithreaded parallel manner, and compressing the parsed data with columnar storage.
In some embodiments, downstream computation is accelerated by multithreaded writing. In addition, besides conventional row storage, which produces parsed data in formats such as csv and txt, columnar storage can be used to compress the parsed data into further formats, such as the ORC (Optimized Row Columnar) and Parquet formats, improving the compression ratio and query performance and making downstream computation more convenient.
It should be noted that the tools provided by the existing HDFS for pulling and parsing Fsimage files have at least the following problems. On the one hand, the parsing tool provided by the HDFS generally uses the Path class, but initializing Path objects additionally creates many URI (Uniform Resource Identifier) objects during parsing, introducing unnecessary program logic and GC (Garbage Collection) and severely slowing parsing down. On the other hand, the data parsed by the HDFS-provided tool has missing fields: it lacks both the file block information corresponding to each file and the erasure coding (Erasure Coding, EC) file identifiers corresponding to each file; moreover, the parsed output is too large and is in TSV (Tab-Separated Values) format, so a single output file can reach hundreds of GB, which is inconvenient for downstream computation. It is therefore necessary to optimize the existing Fsimage parsing tool so that Fsimage files can be parsed efficiently and accurately.
Step S13, a first mapping table is constructed based on the file information.
The file information includes the file block information corresponding to each file. The first mapping table is used for querying the mapping relations between all files and all file blocks, and includes: path, file size, number of file blocks, file block size, file block identification, erasure code file identification, date, and cluster.
In some embodiments, a new table is first created in the Hive warehouse of the HDFS multi-machine room and named, for example, "hadoop_admin.fsimage"; the correspondence between all files and all file blocks is obtained from the file information, the file information is written into the hadoop_admin.fsimage table, and the first mapping table is obtained by assigning the corresponding field value to each field of that table, after which the mapping relations between all files and all file blocks can be queried. Hive is a Hadoop-based data warehouse tool for extracting, transforming and loading data; it maps structured data files onto database tables and provides SQL (Structured Query Language) query capabilities.
In some embodiments, besides the file block information corresponding to each file, the file information further includes: the number of file copies, file modification time, file access time, file size, date, and so on. The first mapping table then further includes information such as the number of file copies, file modification time, file access time and file permissions.
Examples of the fields and field definitions of the first mapping table are shown in Table 1. Note that the permission field represents the file permissions and distinguishes files from directories: a field value beginning with "-" indicates a file, while one beginning with "d" indicates a directory. The ec_id field represents the erasure code file identifier; a field value of 0 means the file has no corresponding erasure code file. The field value of the p_date field is typically in year-month-day format, such as 2023-01-10.
TABLE 1 first mapping Table
[Table 1 is provided as an image in the original document.]
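To make the structure of the first mapping table concrete, the following is a minimal HiveQL sketch of a possible DDL for the hadoop_admin.fsimage table. Apart from block_ids, ec_id, name_service and p_date, the column names, and all column types, do not come from the original text and are illustrative assumptions.
-- Hedged sketch of a possible DDL for the first mapping table.
-- Only block_ids, ec_id, name_service and p_date are field names taken
-- from the text; the remaining names and all types are assumptions.
CREATE TABLE IF NOT EXISTS hadoop_admin.fsimage (
    path              STRING COMMENT 'full file path',
    permission        STRING COMMENT 'starts with "-" for a file, "d" for a directory',
    replication       INT    COMMENT 'file-level number of copies',
    file_size         BIGINT COMMENT 'file size in bytes',
    blocks_count      INT    COMMENT 'number of file blocks',
    block_size        BIGINT COMMENT 'file block size in bytes',
    block_ids         ARRAY<BIGINT> COMMENT 'identifiers of the file blocks',
    ec_id             INT    COMMENT '0 means no corresponding erasure code file',
    modification_time STRING COMMENT 'file modification time',
    access_time       STRING COMMENT 'file access time',
    name_service      STRING COMMENT 'cluster'
)
PARTITIONED BY (p_date STRING COMMENT 'e.g. 2023-01-10')
STORED AS ORC;  -- columnar storage, matching the compression discussed in step S12
Partitioning by p_date matches the daily construction described later, and ORC matches the columnar compression discussed in step S12.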
In the implementation process described above, the image file of the metadata node is parsed with the optimized parsing tool to obtain the file information of all files, which overcomes problems such as slow parsing, a single parsing format, oversized parsing output and missing fields in the parsed content; and the first mapping table built in the Hive table from the file information makes it convenient to query the mapping relations between all files and all file blocks subsequently.
S2, obtaining a file block list of each data node, and constructing a second mapping table based on the file block list of each data node.
The second mapping table is used for querying mapping relations between all file blocks and each data node, and the second mapping table comprises: file block identification, data node identification, machine room identification, date and cluster.
Specifically, obtaining the file block list of each data node includes: screening the files that conform to the file block naming rule from the data directory of each data node, and obtaining the file block list of each data node based on those files; and/or obtaining the file blocks on each data node directly from the communication protocol interface of the metadata node to obtain the file block list of each data node.
In some embodiments, since the file block copies on each data node are stored on the local disk and named according to a file block naming rule (for example, file names of file block copies carrying the prefix "blk_"), the file block list of each data node is obtained by screening the files that conform to the naming rule from the data directory of that data node. Although traversing the data directory of every data node puts some pressure on data node performance and increases operation and maintenance difficulty, it yields more detailed file block information: besides the correspondence between file blocks and data nodes, details such as the creation time and the specific content of each file block can also be obtained.
In some embodiments, since the correspondence between file blocks and data nodes is stored on the metadata node, which also provides a communication protocol interface, the file block list of each data node can be obtained directly through the NamenodeProtocol#getBlocks class of that interface, which greatly reduces the operation and maintenance pressure. If the number of file blocks on a data node is large and a single pull would exceed the maximum request limit, the file block list can be pulled from each data node in several batches.
Specifically, a new table is first created in the Hive warehouse of the HDFS multi-machine room and named, for example, "hadoop_admin.blocks"; the correspondence between all file blocks and each data node is obtained from the file block lists and written into the hadoop_admin.blocks table, and the second mapping table is obtained by assigning the corresponding field value to each field of that table, after which the mapping relations between all file blocks and each data node can be queried. Examples of the fields and field definitions of the second mapping table are shown in Table 2.
TABLE 2 second mapping Table
[Table 2 is provided as an image in the original document.]
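Analogously, a minimal HiveQL sketch of a possible DDL for the hadoop_admin.blocks table is given below. The column names follow the fields mentioned in the text (block_id, hostname, az, name_service, p_date), while the types and storage clauses are assumptions.
-- Hedged sketch of a possible DDL for the second mapping table.
CREATE TABLE IF NOT EXISTS hadoop_admin.blocks (
    block_id     BIGINT COMMENT 'file block identifier',
    hostname     STRING COMMENT 'data node identifier',
    az           STRING COMMENT 'machine room identifier, e.g. az1',
    name_service STRING COMMENT 'cluster'
)
PARTITIONED BY (p_date STRING)
STORED AS ORC;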
In the implementation process described above, either of the following two methods can be used to obtain the file block list of each data node. In the first method, the data directory of each data node is traversed and the files conforming to the file block naming rule are screened out, yielding more detailed information about the file blocks and hence the file block list of each data node. In the second method, the file block information on each data node is obtained directly from the communication protocol interface of the metadata node, which reduces the operation and maintenance pressure compared with the first method. In addition, the second mapping table built in the Hive table from the file block lists makes it convenient to query the mapping relations between all file blocks and each data node subsequently.
S3, computing a copy distribution table based on the first mapping table and the second mapping table.
The copy distribution table is used for inquiring copy distribution of each file and copy distribution of each data node.
Specifically, a new table is created in the Hive warehouse of the HDFS multi-machine room and named, for example, "hadoop_admin.block_analysis"; the copy distribution table includes the path, the file block identifier, the erasure code file identifier, the data node list, the machine room list, the cluster and the date. Using the SQL computing capability of Hive, the correspondence between all files and each data node is derived from the first mapping table and the second mapping table, which yields the copy distribution table.
In some embodiments, examples of the fields and field definitions of the copy distribution table are shown in Table 3. The copy distribution table may be built once a day, typically with a simple ETL (Extract-Transform-Load) job over the first and second mapping tables. Specifically, the first mapping table and the second mapping table of the current day are first selected by the p_date field; then, for records whose name_service fields match, the value of each field is taken from the first and second mapping tables to determine the path, ec_id, name_service, block_id, hostname, az and p_date fields; finally, the datanode_list and az_list fields are derived from the hostname and az fields.
Table 3 copy distribution table
[Table 3 is provided as an image in the original document.]
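To make this ETL concrete, the following is a hedged HiveQL sketch that derives hadoop_admin.block_analysis from the two mapping tables. It is written against the illustrative DDLs sketched above (in particular the assumed block_ids array column and the assumed column types), not against the patent's actual schema.
-- Hedged sketch of the daily ETL deriving the copy distribution table:
-- explode each file's block_ids array, join every block to the data nodes
-- holding its copies, then aggregate the node and machine room lists.
INSERT OVERWRITE TABLE hadoop_admin.block_analysis PARTITION (p_date = '2023-01-10')
SELECT
    f.path,
    f.block_id,
    f.ec_id,
    f.name_service,
    collect_list(b.hostname) AS datanode_list,  -- data nodes holding a copy
    collect_list(b.az)       AS az_list         -- machine rooms holding a copy
FROM (
    SELECT path, ec_id, name_service, blk AS block_id
    FROM hadoop_admin.fsimage
    LATERAL VIEW explode(block_ids) t AS blk
    WHERE p_date = '2023-01-10'
) f
JOIN hadoop_admin.blocks b
  ON b.block_id = f.block_id AND b.name_service = f.name_service
WHERE b.p_date = '2023-01-10'
GROUP BY f.path, f.block_id, f.ec_id, f.name_service;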
In the implementation process described above, the copy distribution table built in the Hive warehouse from the first mapping table and the second mapping table makes it convenient to query the copy distribution of each file and of each data node subsequently; and since the first mapping table, the second mapping table and the copy distribution table are all stored as Hive tables, any analysis engine that supports Hive computation can be used to query them, facilitating subsequent analysis and computation.
S4, for a file to be verified, verifying whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table.
Specifically, verifying whether the copy placement policy of the file to be verified conforms to the preset distribution based on the copy distribution table includes:
Step S41, querying the current copy distribution of the file to be verified based on the copy distribution table.
Specifically, based on the file name of the file to be verified, the corresponding file blocks of the file and the data nodes over which the copies of those file blocks are distributed are queried from the copy distribution table, and the current copy distribution of the file to be verified is determined from the result. For example, if the file name of the file to be verified is test.txt, the data nodes holding the copies of each of its file blocks can be queried with the following SQL statement.
SELECT *
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10' AND path = '/test.txt';
Assuming that the query returns 2 file blocks, each with 3 copies, and that the copies of file block 1 are distributed over data node 1 of machine room 1, data node 2 of machine room 1 and data node 5 of machine room 2, while the copies of file block 2 are distributed over data node 3 of machine room 1, data node 4 of machine room 1 and data node 6 of machine room 2, it can be determined that the current copy distribution of the file test.txt places, for each file block, two copies in machine room 1 and one copy in machine room 2.
Step S42, judging whether the current copy distribution of the file to be verified conforms to the preset distribution of the file to be verified; if it does not, performing a copy migration operation on the file to be verified so that its current copy distribution conforms to the preset distribution of the file to be verified.
The preset distribution of the file to be verified may be the copy placement policy of the file, or a copy distribution requirement set for it.
Specifically, if the file to be verified needs to be read from multiple machine rooms and its preset distribution is cross-room, but the current copy distribution queried from the copy distribution table lies entirely within one machine room, the current copy distribution of the file is considered not to conform to its preset distribution. Likewise, if the current copy distribution is deployed across machine rooms but does not match the copy placement policy of the file, it may also be considered not to conform to the preset distribution.
Based on the above example, if the file to be verified is test.txt and the copy placement policy of test.txt is "az1, az1, az2", i.e. two copies in machine room 1 and one copy in machine room 2, the current distribution of test.txt conforms to the policy. If instead the policy is "az1, az1, az2, az2", i.e. two copies in machine room 1 and two copies in machine room 2, the current distribution does not conform to the policy; in that case a copy migration operation needs to be performed to adjust the copy distribution of test.txt so that each file block has two copies in machine room 1 and two copies in machine room 2.
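The distribution check of step S42 can likewise be expressed offline in SQL. The following hedged sketch flags the file blocks of /test.txt whose machine room distribution deviates from the expected policy "az1, az1, az2"; it assumes the illustrative az_list column from the ETL sketch above and normalizes the list by sorting before comparison.
-- Hedged sketch: list blocks of /test.txt violating the expected policy
-- "az1, az1, az2" (two copies in machine room 1, one in machine room 2).
SELECT path, block_id, az_list
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10'
  AND path = '/test.txt'
  AND concat_ws(',', sort_array(az_list)) != 'az1,az1,az2';
An empty result means every block of the file already matches the policy; each returned row identifies a block whose copies need to be migrated.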
It should be noted that after the copy distribution table is obtained through steps S1-S3, besides the copy placement policy verification of step S4, the copy distribution information of each data node can also be queried from the copy distribution table, for example which of all the data nodes holds the most copies, or, as in the following SQL statement, the distribution of file block copies on the data node named data01.
SELECT *
FROM hadoop_admin.block_analysis
WHERE p_date = '2023-01-10' AND array_contains(datanode_list, 'data01');
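As a further illustration, the "which data node holds the most copies" query mentioned above can be sketched as follows; this is a hedged example that assumes the illustrative datanode_list column from the ETL sketch, not a statement of the patent's actual schema.
-- Hedged sketch: rank data nodes by the number of block copies they hold.
SELECT dn AS datanode, COUNT(*) AS copy_count
FROM hadoop_admin.block_analysis
LATERAL VIEW explode(datanode_list) t AS dn
WHERE p_date = '2023-01-10'
GROUP BY dn
ORDER BY copy_count DESC
LIMIT 10;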
When querying based on the copy distribution table, the SQL statements in the above examples may be used, or any of the various analysis engines supporting Hive table computation; the present application places no particular limitation on this.
In the implementation process described above, for the HDFS multi-machine room scheme, obtaining the copy distribution table and analyzing it offline has two benefits. On the one hand, the copy placement policy of every file can be verified against the copy distribution table without requesting or scanning the metadata of the metadata nodes, even when the number of file copies is large; compared with the prior art of verifying by querying metadata from the metadata nodes, this adds no operation and maintenance pressure on the metadata nodes and takes little time, so the file copy distribution can be checked and the cross-room placement policy of file copies adjusted in time. On the other hand, based on the copy distribution table, not only the copy distribution of each file but also that of each data node can be queried, and both the querying and the analysis are offline processes that do not affect the normal operation of the HDFS multi-machine room.
According to the method for verifying the copy placement policy of the HDFS multi-machine room provided above, the path parsing method, the parsing content, and the output and storage format of the parsing tool are optimized so as to speed up parsing, enrich the parsed data formats, reduce the parsing output size and supplement missing fields, and the optimized parsing tool is used to parse the image file of the metadata node and construct the first mapping table; the file block information of each data node is pulled by traversing its data directory or directly from the metadata node, and the second mapping table is constructed from it; the first mapping table, the second mapping table and the copy distribution table computed from them are stored in Hive tables, which facilitates subsequent data querying, analysis and computation; and offline analysis based on the copy distribution table makes it possible to verify the copy placement policy of every file even when the number of file copies is large.
Fig. 2 is a schematic structural diagram of a copy placement policy verification device of an HDFS multi-machine room provided in the present application, which may be used to implement the method described in the foregoing embodiments. As shown in fig. 2, the apparatus includes:
the first construction module 210 is configured to parse the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and construct a first mapping table based on the file information; the optimized parsing tool is a parsing tool obtained by optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data; the file information includes the file block information corresponding to each file;
a second construction module 220, configured to obtain a file block list of each data node, and construct a second mapping table based on the file block list of each data node;
a third construction module 230, configured to calculate a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for inquiring copy distribution of each file and copy distribution of each data node;
the verification analysis module 240 is configured to verify, for a file to be verified, whether a copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table.
For a detailed description of the above device for verifying the copy placement policy of the HDFS multi-machine room, refer to the description of the corresponding method steps in the above embodiments; repeated content is omitted. The device embodiments described above are merely illustrative: the modules shown as separate components may or may not be physically separate, and may be implemented by software, hardware or a combination of both to realize the intended functions. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and a person of ordinary skill in the art can understand and implement it without inventive effort.
Fig. 3 is a schematic structural diagram of an electronic device provided in the present application. As shown in fig. 3, the electronic device includes a memory 310 and a processor 320, where the memory 310 stores a computer program, and the processor 320 executes the steps of the method for verifying the copy placement policy of an HDFS multi-machine room when running the computer program.
The embodiment of the application also provides a readable storage medium, wherein the readable storage medium stores a computer program, and the computer program executes the steps in the copy placement strategy verification method of the HDFS multi-machine room when running on a processor.
It should be understood that the electronic device may be any electronic device with logic computing capability, such as a personal computer, a tablet computer or a smartphone; the readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a magnetic disk, an optical disc, or the like.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and they shall all fall within the protection scope of the present application.

Claims (9)

1. A method for verifying a copy placement policy of an HDFS multi-machine room, characterized by comprising the following steps:
parsing the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and constructing a first mapping table based on the file information; the optimized parsing tool is a parsing tool obtained by optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data; the file information includes the file block information corresponding to each file;
obtaining the file block list of each data node, and constructing a second mapping table based on the file block list of each data node;
computing a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
for a file to be verified, verifying whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table;
wherein optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data comprises the following steps:
parsing the path information of the image file based on a native string parsing method;
supplementing the parsed content of the image file with a block_ids field and an ec_id field;
and outputting the parsed data in a multithreaded parallel manner, and compressing the parsed data with columnar storage.
2. The method of claim 1, wherein the block_ids field represents the file block identifiers corresponding to each file, and the ec_id field represents the erasure code file identifier corresponding to each file.
3. The method of claim 1, wherein the first mapping table is used for querying mapping relationships between all files and all file blocks, and the first mapping table includes: path, file size, number of file blocks, file block size, file block identification, erasure code file identification, date, cluster.
4. The method of claim 1, wherein the second mapping table is used for querying mapping relationships between all file blocks and each data node, and the second mapping table includes: file block identification, data node identification, machine room identification, date and cluster.
5. The method of claim 1, wherein obtaining the file block list of each data node comprises:
screening the files that conform to the file block naming rule from the data directory of each data node, and obtaining the file block list of each data node based on those files; and/or,
obtaining the file blocks on each data node directly from the communication protocol interface of the metadata node to obtain the file block list of each data node.
6. The method of claim 1, wherein verifying whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table comprises:
querying the current copy distribution of the file to be verified based on the copy distribution table;
judging whether the current copy distribution of the file to be verified conforms to the preset distribution of the file to be verified; and, if it does not, performing a copy migration operation on the file to be verified so that its current copy distribution conforms to the preset distribution of the file to be verified.
7. A device for verifying a copy placement policy of an HDFS multi-machine room, characterized in that the device comprises:
a first construction module, configured to parse the image file of the metadata node with an optimized parsing tool to obtain the file information of all files, and construct a first mapping table based on the file information; the optimized parsing tool is a parsing tool obtained by optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data; the file information includes the file block information corresponding to each file;
a second construction module, configured to obtain the file block list of each data node and construct a second mapping table based on the file block list of each data node;
a third construction module, configured to compute a copy distribution table based on the first mapping table and the second mapping table; the copy distribution table is used for querying the copy distribution of each file and the copy distribution of each data node;
a verification analysis module, configured to verify, for a file to be verified, whether the copy placement policy of the file to be verified conforms to a preset distribution based on the copy distribution table;
wherein optimizing the path parsing method, the parsing content, and the output and storage format of the parsed data comprises:
parsing the path information of the image file based on a native string parsing method;
supplementing the parsed content of the image file with a block_ids field and an ec_id field;
and outputting the parsed data in a multithreaded parallel manner, and compressing the parsed data with columnar storage.
8. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory stores a computer program, and the processor, when running the computer program, executes the method for verifying the copy placement policy of the HDFS multi-machine room of any one of claims 1 to 6.
9. A readable storage medium, characterized in that a computer program is stored in the readable storage medium, and the computer program, when run on a processor, executes the method for verifying the copy placement policy of the HDFS multi-machine room of any one of claims 1 to 6.
CN202310219585.5A 2023-03-09 2023-03-09 Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room Active CN115934670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310219585.5A CN115934670B (en) 2023-03-09 2023-03-09 Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310219585.5A CN115934670B (en) 2023-03-09 2023-03-09 Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Publications (2)

Publication Number Publication Date
CN115934670A (en) 2023-04-07
CN115934670B (en) 2023-05-05

Family

ID=85827715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310219585.5A Active CN115934670B (en) 2023-03-09 2023-03-09 Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room

Country Status (1)

Country Link
CN (1) CN115934670B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103257970B (en) * 2012-02-17 2016-06-15 百度在线网络技术(北京)有限公司 Method of testing and device for HDFS host node
CN104615606B (en) * 2013-11-05 2018-04-06 阿里巴巴集团控股有限公司 A kind of Hadoop distributed file systems and its management method
KR102049420B1 (en) * 2018-03-27 2019-11-27 주식회사 리얼타임테크 Method for parallel query processing of data comprising a replica in distributed database
CN108769171B (en) * 2018-05-18 2021-09-17 百度在线网络技术(北京)有限公司 Copy keeping verification method, device, equipment and storage medium for distributed storage
CN114385561A (en) * 2022-01-10 2022-04-22 北京沃东天骏信息技术有限公司 File management method and device and HDFS system
CN115048254B (en) * 2022-07-11 2022-12-09 北京志凌海纳科技有限公司 Simulation test method, system, equipment and readable medium for data distribution strategy

Also Published As

Publication number Publication date
CN115934670A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
US11544623B2 (en) Consistent filtering of machine learning data
US11422982B2 (en) Scaling stateful clusters while maintaining access
US10713589B1 (en) Consistent sort-based record-level shuffling of machine learning data
US10366053B1 (en) Consistent randomized record-level splitting of machine learning data
US8762353B2 (en) Elimination of duplicate objects in storage clusters
US7447839B2 (en) System for a distributed column chunk data store
US8543596B1 (en) Assigning blocks of a file of a distributed file system to processing units of a parallel database management system
US8321487B1 (en) Recovery of directory information
CN110413595B (en) Data migration method applied to distributed database and related device
CN102129469A (en) Virtual experiment-oriented unstructured data accessing method
KR20130049111A (en) Forensic index method and apparatus by distributed processing
CN106649676A (en) Duplication eliminating method and device based on HDFS storage file
US9600486B2 (en) File system directory attribute correction
US10262024B1 (en) Providing consistent access to data objects transcending storage limitations in a non-relational data store
US7657585B2 (en) Automated process for identifying and delivering domain specific unstructured content for advanced business analysis
US20180060341A1 (en) Querying Data Records Stored On A Distributed File System
CN111680030A (en) Data fusion method and device, and data processing method and device based on meta information
CN103778223B (en) Pervasive word-reciting system based on cloud platform and construction method thereof
US11966489B2 (en) Data certification process for cloud database platform
CN115934670B (en) Method and device for verifying copy placement strategy of HDFS (Hadoop distributed File System) multi-machine room
US11392587B1 (en) Rule generation and data certification onboarding process for cloud database platform
US11829367B2 (en) Data certification process for updates to data in cloud database platform
US20240232421A1 (en) Data Certification Process for Cloud Database Platform
CN116739397B (en) Dynamic management method for new energy indexes
US20230376451A1 (en) Client support of multiple fingerprint formats for data file segments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant