CN113688115B - Archive big data distributed storage system based on Hadoop - Google Patents

Archive big data distributed storage system based on Hadoop Download PDF

Info

Publication number
CN113688115B
CN113688115B CN202111000510.5A
Authority
CN
China
Prior art keywords
node
slave
data
storage system
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111000510.5A
Other languages
Chinese (zh)
Other versions
CN113688115A (en)
Inventor
王佩
李帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdun Innovative Digital Technology Beijing Co ltd
Original Assignee
Zhongdun Innovative Digital Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdun Innovative Digital Technology Beijing Co ltd filed Critical Zhongdun Innovative Digital Technology Beijing Co ltd
Priority to CN202111000510.5A priority Critical patent/CN113688115B/en
Publication of CN113688115A publication Critical patent/CN113688115A/en
Application granted granted Critical
Publication of CN113688115B publication Critical patent/CN113688115B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/18 - File system types
    • G06F 16/182 - Distributed file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/0703 - Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F 11/0706 - Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F 11/0727 - Error or fault processing not based on redundancy, in a storage system, e.g. in a DASD or network based storage system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2246 - Trees, e.g. B+trees
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2255 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 - Allocation of resources to service a request
    • G06F 9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 - Techniques for rebalancing the load in a distributed system

Abstract

The invention discloses a Hadoop-based archive big data distributed storage system, which comprises a client, a control node in a master-slave structure, and a plurality of slave nodes. The control node controls the slave nodes, the slave nodes perform data storage and data processing, and the control node and the slave nodes exchange information over the TCP/IP protocol. The system uses the HDFS distributed storage system to reduce enterprise storage costs. By adopting a hash algorithm and improving the random data-placement strategy of HDFS block storage, it effectively solves the problem of data skew in the storage system under capacity expansion and node failure, and improves the stability of the storage system.

Description

Archive big data distributed storage system based on Hadoop
Technical field:
The invention belongs to the field of distributed storage systems, and particularly relates to a Hadoop-based archive big data distributed storage system.
Background art:
At present, many domestic companies use databases such as MySQL, SQL Server, and Oracle. These databases adopt a parallel storage scheme, but the number of database servers is limited, and database capacity can only be expanded vertically, that is, through hardware upgrades; it cannot be expanded horizontally by adding independent database instances.
Compared with a single traditional database, connecting multiple databases in parallel can expand storage capacity and computing and analysis capability on the original architecture. However, parallel databases separate computation from storage, so stored data cannot be computed and analyzed in place, and because of bandwidth limits, multiple parallel databases compete for bandwidth during data access, creating a bandwidth bottleneck. Meanwhile, capacity expansion requires a static shutdown, the addition of storage, and a redistribution of the data. If data surges in a certain period so that the growth rate of information data exceeds the speed of database hardware upgrades, expansion cannot seamlessly absorb the surging storage demand, database capacity easily runs short, and service quality and customer timeliness requirements are affected.
To guarantee the load balancing and processing speed of the storage system in different scenarios, the data storage strategy of the HDFS distributed file system is generally adopted. The default storage strategy of HDFS under the Hadoop big data platform was designed with storage efficiency, distribution that is as balanced as possible, and data reliability in mind, but the way HDFS randomly selects storage nodes easily causes two problems: the default placement strategy does not account for hardware differences between data nodes, and it easily leads to unbalanced data distribution.
When the HDFS distributed system stores data, it adopts a strategy of randomly placing data blocks by default in order to distribute data evenly across nodes. When storage nodes are added to or removed from the system, the data distribution cannot be adjusted automatically, so the system's load-balancing requirement cannot be met.
The HDFS distributed system also stores data on the premise of a homogeneous cluster; hardware differences between data nodes in the cluster were not optimized for at design time, so the disk space of some data nodes cannot be fully utilized, which affects the efficiency of the whole storage system.
Disclosure of Invention
Aiming at the problems that the existing distributed storage system distributes data unevenly and that hardware differences between its data nodes affect the efficiency of the whole storage system, the invention provides a Hadoop-based archive big data distributed storage system, which uses the HDFS distributed storage system to reduce enterprise storage costs. By adopting a hash algorithm and improving the random data-placement strategy of HDFS block storage, it effectively solves the problem of data skew in the storage system under capacity expansion and node failure, improves the stability of the storage system, and achieves load balancing under homogeneous conditions. The data storage strategy can further be optimized with a fused weighted polling algorithm, which effectively solves the load-balancing problem of the storage system under heterogeneous conditions and makes full use of the system's hardware resources.
The technical scheme adopted by the invention for solving the technical problems is as follows:
the file big data distributed storage system based on Hadoop comprises a client, a control node of a master-slave structure and a plurality of slave nodes; the control node is used for controlling the plurality of slave nodes, the slave nodes are used for data storage and data processing, and the control node carry out information transmission through a TCP/IP protocol; the client is used for receiving a map function and a reduce function configured by a user, the map function is used for operation management of key/value pairs, a large-scale calculation task is decomposed into a plurality of subtasks, the subtasks are distributed to each slave node, calculation results are obtained by using calculation resources of the slave nodes, the reduce function is used for carrying out merging processing on all value values of the same key value, and the output key value is a final result;
when data to be processed is stored on a slave node, the control node uses a hash model to determine the designated slave node; the hash model is:
target=getHashCode(request.IPNum)&nodeNum
where getHashCode () represents a string operation hash function, request.ipnum represents a request IP address, nodeNum represents the total number of slave nodes, and the corresponding numbered slave nodes can be allocated according to the value of target.
Further, a plurality of virtual layers is arranged between the slave nodes and the data to be processed; a first virtual layer maps to a first slave node, the first virtual layer comprises a plurality of virtual nodes, and the data to be processed is stored onto the slave nodes through the virtual nodes.
Further, the mapping model between the virtual nodes and the slave nodes is:
where P is the initial hash position of the slave node, P' is the hash position of the mapped virtual node, N is the number of slave nodes, and k = 0, 1, 2, ….
Further, the mapping model between the virtual nodes and the slave nodes may also be:
where h denotes the hash result of the feature string, l denotes the string length, and w denotes the weight of a character in the hash operation.
Further, the control node may serve as a load-balancing server: after processing the data storage request sent by the client, it collects the load state fed back by each slave node in real time and distributes the request to the slave nodes according to their weights; each slave node distributes data according to the control node's request and sends the storage result to the control node; finally, the control node sends the client a result indicating whether data distribution is complete.
Further, the load state is the disk usage rate, and the disk usage rate model is:
U = (U_total - U_free) / U_total
where U_total is the total disk space and U_free is the remaining disk space.
Further, after receiving a data storage request, the control node starts collecting the current disk usage rate of each slave node. The control node computes the average disk usage rate of the slave nodes; if the disk usage rate of a second slave node is greater than or equal to the average, the weight of that slave node is set to 0; if it is smaller than the average, the weight of that slave node is computed from its disk usage rate, and the polling allocation task is executed according to the weights.
Further, the client may execute a parallel query command through the control node.
Further, the parallel query flow is as follows:
Step (1), the client sends an SQL command to the control node through a script or command interface;
Step (2), the SQL interface parses the command according to the query keywords, generates a virtual root node, takes the output fields as leaf nodes, and constructs a query task tree; for each relation-table node on the query task tree it creates an attribute value used to mark the query target, reads the file mapping table, and adds the file information to the attribute value; the task tree is then output to the optimizer;
Step (3), the optimizer retrieves metadata according to the file information recorded in the attribute values, obtains the corresponding position information of the data blocks, and adds the position information to the attribute values; it converts the task tree into operations on the data blocks and adjusts the operation order according to factors such as filter conditions, data volume, and field size; it then merges the operation units per slave node, pushes them into a queue in order, and generates an operation list;
Step (4), the operation lists of all slave nodes are combined into a query plan and output to the distributor; the distributor dispatches each operation list to the executor on the corresponding slave node according to the node's IP address;
Step (5), each executor reads its operation list, executes all operation commands in order, obtains the query result from local storage, and uploads it to the distributor; the distributor gathers the computation results and returns the collected query result to the client.
The beneficial effects of the invention are as follows:
the system uses an HDFS distributed storage system to mitigate the storage costs of an enterprise. By adopting a hash algorithm, the problem of data inclination of a storage system under the conditions of capacity expansion and downtime is effectively solved by improving a random placement data strategy stored by an HDFS block, the stability of the storage system is improved, and the system can achieve load balancing under isomorphic conditions; the data storage strategy can be optimized by adopting the fusion weighted polling algorithm, so that the problem of load balancing of the storage system under heterogeneous conditions is effectively solved, and the hardware resources of the system are fully utilized.
The foregoing description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented according to the content of the specification, and that the above and other objects, features, and advantages of the present invention may become more readily apparent, preferred embodiments are described in detail below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a block diagram of the Hadoop-based archive big data distributed storage system of the present invention.
FIG. 2 is a schematic diagram of the virtual-node-based distributed storage system of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the description of the present invention, unless explicitly stated and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may mean, for example, fixedly connected, detachably connected, or integrally formed; mechanically or electrically connected; directly connected or indirectly connected through an intermediate medium; or communication between the interiors of two elements or an interaction relationship between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
As shown in FIG. 1, the system comprises a client, a control node in a master-slave structure, and a plurality of slave nodes. The control node controls the slave nodes, the slave nodes perform data storage and data processing, and the control node and the slave nodes exchange information over the TCP/IP protocol. The client receives a map function and a reduce function configured by the user; the map function manages operations on key/value pairs, decomposes a large-scale computing task into a plurality of subtasks, and distributes the subtasks to each slave node, obtaining computation results with the slave nodes' computing resources; the reduce function merges all values belonging to the same key, and the output key/value pair is the final result;
when data to be processed is stored on a slave node, the control node uses a hash model to determine the designated slave node; the hash model is:
target=getHashCode(request.IPNum)&nodeNum
where getHashCode () represents a string operation hash function, request.ipnum represents a request IP address, nodeNum represents the total number of slave nodes, and the corresponding numbered slave nodes can be allocated according to the value of target.
Further, a plurality of virtual layers is arranged between the slave nodes and the data to be processed; a first virtual layer maps to a first slave node, the first virtual layer comprises a plurality of virtual nodes, and the data to be processed is stored onto the slave nodes through the virtual nodes.
Further, as shown in FIG. 2, the mapping model between the virtual nodes and the slave nodes is:
where P is the initial hash position of the slave node, P' is the hash position of the mapped virtual node, N is the number of slave nodes, and k = 0, 1, 2, …. The preceding hash model may be replaced with this mapping model between virtual nodes and slave nodes to achieve balanced storage.
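The virtual-node mapping can be illustrated with a consistent-hash ring. This is a hedged reconstruction under assumptions (MD5 ring positions, four virtual nodes per slave, illustrative node names), not the patent's exact model:

```python
# Sketch: each physical slave node is represented by several virtual nodes
# (k = 0, 1, 2, ...) on a hash ring, so adding or removing a node only
# remaps a fraction of the stored keys.
import bisect
import hashlib

class VirtualNodeRing:
    def __init__(self, nodes, replicas=4):
        self.replicas = replicas
        self.ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def _pos(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        for k in range(self.replicas):      # one ring position per virtual node
            self.ring.append((self._pos(f"{node}#{k}"), node))
        self.ring.sort()

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's position.
        pos = self._pos(key)
        i = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]

ring = VirtualNodeRing(["slave-1", "slave-2", "slave-3"])
print(ring.lookup("archive-file-0001"))  # one of the slave nodes
```

With replicas > 1, each slave owns several scattered arcs of the ring, which is what evens out the distribution compared with one position per node.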
Further, the mapping model between the virtual nodes and the slave nodes may also be:
where h denotes the hash result of the feature string, l denotes the string length, and w denotes the weight of a character in the hash operation.
Further, the control node may serve as a load-balancing server: after processing the data storage request sent by the client, it collects the load state fed back by each slave node in real time and distributes the request to the slave nodes according to their weights; each slave node distributes data according to the control node's request and sends the storage result to the control node; finally, the control node sends the client a result indicating whether data distribution is complete.
Further, the load state is the disk usage rate, and the disk usage rate model is:
U = (U_total - U_free) / U_total
where U_total is the total disk space and U_free is the remaining disk space.
Further, after receiving a data storage request, the control node starts collecting the current disk usage rate of each slave node. The control node computes the average disk usage rate of the slave nodes; if the disk usage rate of a second slave node is greater than or equal to the average, the weight of that slave node is set to 0; if it is smaller than the average, the weight of that slave node is computed from its disk usage rate, and the polling allocation task is executed according to the weights.
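The weighting rule just described can be sketched as follows. This is an illustration under assumptions, not the claimed algorithm: the source gives neither the exact weight formula nor the polling variant, so headroom-below-average weights and a smooth weighted round-robin are assumed.

```python
# Sketch: nodes at or above the average disk usage get weight 0; the rest
# are weighted by how far below the average they sit, and storage requests
# are dealt out in proportion to those weights.

def disk_usage(u_total: float, u_free: float) -> float:
    # U = (U_total - U_free) / U_total
    return (u_total - u_free) / u_total

def compute_weights(usages):
    avg = sum(usages) / len(usages)
    # Weight 0 at or above the average; otherwise proportional to headroom.
    return [0.0 if u >= avg else (avg - u) for u in usages]

def weighted_polling(weights, n_requests):
    """Deal n_requests to node indices in proportion to their weights
    (smooth weighted round-robin)."""
    credits = [0.0] * len(weights)
    order = []
    for _ in range(n_requests):
        credits = [c + w for c, w in zip(credits, weights)]
        i = max(range(len(credits)), key=credits.__getitem__)
        credits[i] -= sum(weights)
        order.append(i)
    return order

usages = [disk_usage(100, 80), disk_usage(100, 40), disk_usage(100, 10)]
# usages = [0.2, 0.6, 0.9]; the average is about 0.57, so nodes 1 and 2
# fall at or above it and receive weight 0.
print(weighted_polling(compute_weights(usages), 4))  # [0, 0, 0, 0]
```

In the example run, every request lands on the emptiest node until its usage catches up with the average, which is the skew-avoidance behavior described above.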
Further, the client may execute a parallel query command through the control node.
Further, the parallel query flow is as follows:
Step (1), the client sends an SQL command to the control node through a script or command interface;
Step (2), the SQL interface parses the command according to the query keywords, generates a virtual root node, takes the output fields as leaf nodes, and constructs a query task tree; for each relation-table node on the query task tree it creates an attribute value used to mark the query target, reads the file mapping table, and adds the file information to the attribute value; the task tree is then output to the optimizer;
Step (3), the optimizer retrieves metadata according to the file information recorded in the attribute values, obtains the corresponding position information of the data blocks, and adds the position information to the attribute values; it converts the task tree into operations on the data blocks and adjusts the operation order according to factors such as filter conditions, data volume, and field size; it then merges the operation units per slave node, pushes them into a queue in order, and generates an operation list;
Step (4), the operation lists of all slave nodes are combined into a query plan and output to the distributor; the distributor dispatches each operation list to the executor on the corresponding slave node according to the node's IP address;
Step (5), each executor reads its operation list, executes all operation commands in order, obtains the query result from local storage, and uploads it to the distributor; the distributor gathers the computation results and returns the collected query result to the client.
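Steps (1) to (5) above can be condensed into a toy sketch. The operation format, node names, and the scan-only executor are illustrative assumptions, not the patented components:

```python
# Sketch: the "optimizer" groups block-level operations into one list per
# slave node, the "distributor" fans the lists out to per-node executors,
# and the per-node results are merged for the client.
from collections import defaultdict

def build_query_plan(block_locations, operation):
    """Group one operation per data block into a per-node operation list."""
    plan = defaultdict(list)           # node -> ordered operation list
    for block, node in block_locations:
        plan[node].append((operation, block))
    return dict(plan)

def execute_on_node(op_list, local_blocks):
    # Stand-in executor: "scan" just reads the block's rows from local storage.
    rows = []
    for op, block in op_list:
        if op == "scan":
            rows.extend(local_blocks.get(block, []))
    return rows

def run_query(block_locations, storage, operation="scan"):
    plan = build_query_plan(block_locations, operation)
    results = []                       # the distributor gathers per-node results
    for node, ops in plan.items():
        results.extend(execute_on_node(ops, storage[node]))
    return results

storage = {"slave-1": {"blk_1": ["row-a"]}, "slave-2": {"blk_2": ["row-b"]}}
locations = [("blk_1", "slave-1"), ("blk_2", "slave-2")]
print(run_query(locations, storage))  # ['row-a', 'row-b']
```

The point of the per-node grouping is that each executor touches only local blocks, so the query runs in parallel without moving raw data between slave nodes.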
The advantages of the invention are as follows: the system uses the HDFS distributed storage system to reduce enterprise storage costs. By adopting a hash algorithm and improving the random data-placement strategy of HDFS block storage, it effectively solves the problem of data skew in the storage system under capacity expansion and node failure, improves the stability of the storage system, and achieves load balancing under homogeneous conditions. The data storage strategy can further be optimized with a fused weighted polling algorithm, which effectively solves the load-balancing problem of the storage system under heterogeneous conditions and makes full use of the system's hardware resources.
The present invention is not limited to the above-mentioned embodiments; any changes or substitutions readily conceived by those skilled in the art within the technical scope disclosed by the present invention are intended to be covered by it. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A Hadoop-based archive big data distributed storage system, characterized in that: the system comprises a client, a control node in a master-slave structure, and a plurality of slave nodes; the control node controls the slave nodes, the slave nodes perform data storage and data processing, and the control node and the slave nodes exchange information over the TCP/IP protocol; the client receives a map function and a reduce function configured by the user, the map function manages operations on key/value pairs, decomposes a large-scale computing task into a plurality of subtasks, and distributes the subtasks to each slave node, obtaining computation results with the slave nodes' computing resources; the reduce function merges all values belonging to the same key, and the output key/value pair is the final result;
the control node can serve as a load-balancing server: after processing the data storage request sent by the client, it collects the load state fed back by each slave node in real time and distributes the request to the slave nodes according to their weights; each slave node distributes data according to the control node's request and sends the storage result to the control node; finally, the control node sends the client a result indicating whether data distribution is complete;
when data to be processed is stored on a slave node, the control node uses a hash model to determine the designated slave node; the hash model is:
target=getHashCode(request.IPNum)&nodeNum
wherein getHashCode () represents a string operation hash function, request.ipnum represents a request IP address, nodeNum represents the total number of slave nodes, and slave nodes of corresponding numbers are allocated according to the value of target.
2. The Hadoop-based archive big data distributed storage system as claimed in claim 1, wherein: a plurality of virtual layers is arranged between the slave nodes and the data to be processed, a first virtual layer maps to a first slave node, the first virtual layer comprises a plurality of virtual nodes, the data to be processed is stored onto the slave nodes through the virtual nodes, and the hash model is replaced with a mapping model between the virtual nodes and the slave nodes.
3. The Hadoop-based archive big data distributed storage system as claimed in claim 2, wherein the mapping model between the virtual nodes and the slave nodes is:
where P is the initial hash position of the slave node, P' is the hash position of the mapped virtual node, N is the number of slave nodes, and k = 0, 1, 2, ….
4. The Hadoop-based archive big data distributed storage system as claimed in claim 2, wherein the mapping model between the virtual nodes and the slave nodes may also be:
where h denotes the hash result of the feature string, l denotes the string length, and w denotes the weight of a character in the hash operation.
5. The Hadoop-based archive big data distributed storage system as claimed in claim 1, wherein the load state is the disk usage rate, and the disk usage rate model is:
U = (U_total - U_free) / U_total
where U_total is the total disk space and U_free is the remaining disk space.
6. The Hadoop-based archive big data distributed storage system as claimed in claim 1, wherein: after receiving a data storage request, the control node starts collecting the current disk usage rate of each slave node; the control node computes the average disk usage rate of the slave nodes; if the disk usage rate of a second slave node is greater than or equal to the average, the weight of that slave node is set to 0; if it is smaller than the average, the weight is computed from its disk usage rate, and the polling allocation task is executed according to the weights.
7. The Hadoop-based archive big data distributed storage system as claimed in claim 1, wherein: the client executes parallel query commands through the control node.
8. The Hadoop-based archive big data distributed storage system as claimed in claim 7, wherein the parallel query flow is as follows:
Step (1), the client sends an SQL command to the control node through a script or command interface;
Step (2), the SQL interface parses the command according to the query keywords, generates a virtual root node, takes the output fields as leaf nodes, and constructs a query task tree; for each relation-table node on the query task tree it creates an attribute value used to mark the query target, reads the file mapping table, and adds the file information to the attribute value; the task tree is then output to the optimizer;
Step (3), the optimizer retrieves metadata according to the file information recorded in the attribute values, obtains the corresponding position information of the data blocks, and adds the position information to the attribute values; it converts the task tree into operations on the data blocks and adjusts the operation order according to filter conditions, data volume, and field size; it then merges the operation units per slave node, pushes them into a queue in order, and generates an operation list;
Step (4), the operation lists of all slave nodes are combined into a query plan and output to the distributor; the distributor dispatches each operation list to the executor on the corresponding slave node according to the node's IP address;
Step (5), each executor reads its operation list, executes all operation commands in order, obtains the query result from local storage, and uploads it to the distributor; the distributor gathers the computation results and returns the collected query result to the client.
CN202111000510.5A 2021-08-29 2021-08-29 Archive big data distributed storage system based on Hadoop Active CN113688115B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000510.5A CN113688115B (en) 2021-08-29 2021-08-29 Archive big data distributed storage system based on Hadoop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000510.5A CN113688115B (en) 2021-08-29 2021-08-29 Archive big data distributed storage system based on Hadoop

Publications (2)

Publication Number Publication Date
CN113688115A CN113688115A (en) 2021-11-23
CN113688115B true CN113688115B (en) 2024-02-20

Family

ID=78583716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000510.5A Active CN113688115B (en) 2021-08-29 2021-08-29 Archive big data distributed storage system based on Hadoop

Country Status (1)

Country Link
CN (1) CN113688115B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116955306B (en) * 2023-06-21 2024-04-12 东莞市铁石文档科技有限公司 File management system based on distributed storage

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063486A (en) * 2010-12-28 2011-05-18 东北大学 Multi-dimensional data management-oriented cloud computing query processing method
CN103929500A (en) * 2014-05-06 2014-07-16 刘跃 Method for data fragmentation of distributed storage system
CN104065716A (en) * 2014-06-18 2014-09-24 江苏物联网研究发展中心 OpenStack based Hadoop service providing method
CN105824618A (en) * 2016-03-10 2016-08-03 浪潮软件集团有限公司 Real-time message processing method for Storm
CN108449376A (en) * 2018-01-31 2018-08-24 合肥和钧正策信息技术有限公司 A kind of load-balancing method of big data calculate node that serving enterprise
CN109740038A (en) * 2019-01-02 2019-05-10 安徽芃睿科技有限公司 Network data distributed parallel computing environment and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10545914B2 (en) * 2017-01-17 2020-01-28 Cisco Technology, Inc. Distributed object storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of an improved shared storage device in Hadoop; Qin Weirong (覃伟荣); Computer Engineering and Design (《计算机工程与设计》); Vol. 39, No. 5, pp. 1319-1325 *

Also Published As

Publication number Publication date
CN113688115A (en) 2021-11-23

Similar Documents

Publication Publication Date Title
CN101370030B (en) Resource load stabilization method based on contents duplication
Zhu et al. Efficient, proximity-aware load balancing for DHT-based P2P systems
CN102831120B (en) A kind of data processing method and system
CN110213352B (en) Method for aggregating dispersed autonomous storage resources with uniform name space
CN104023088B (en) Storage server selection method applied to distributed file system
CN103516807A (en) Cloud computing platform server load balancing system and method
CN102609446B (en) Distributed Bloom filter system and application method thereof
CN104391930A (en) Distributed file storage device and method
CN104539730B (en) Towards the load-balancing method of video in a kind of HDFS
CN107450855B (en) Model-variable data distribution method and system for distributed storage
CN107330056A (en) Wind power plant SCADA system and its operation method based on big data cloud computing platform
CN103152393A (en) Charging method and charging system for cloud computing
WO2004063928A1 (en) Database load reducing system and load reducing program
CN104572505A (en) System and method for ensuring eventual consistency of mass data caches
CN113655969B (en) Data balanced storage method based on streaming distributed storage system
CN106960011A (en) Metadata of distributed type file system management system and method
Rajalakshmi et al. An improved dynamic data replica selection and placement in cloud
CN113688115B (en) Archive big data distributed storage system based on Hadoop
CN102331948A (en) Resource state-based virtual machine structure adjustment method and adjustment system
JP6201438B2 (en) Content distribution method, content distribution server, and thumbnail collection program
CN109597903A (en) Image file processing apparatus and method, document storage system and storage medium
CN104281980A (en) Remote diagnosis method and system for thermal generator set based on distributed calculation
CN107277144A (en) A kind of distributed high concurrent cloud storage Database Systems and its load equalization method
Zhao et al. Dynamic replica creation strategy based on file heat and node load in hybrid cloud
CN108920282A (en) A kind of copy of content generation, placement and the update method of holding load equilibrium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 769, building 2, East Ring Road, Yanqing Park, Zhongguancun, Yanqing District, Beijing 102101

Applicant after: ZHONGDUN innovative digital technology (Beijing) Co.,Ltd.

Address before: Room 769, building 2, East Ring Road, Yanqing Park, Zhongguancun, Yanqing District, Beijing 102101

Applicant before: ZHONGDUN innovation archives management (Beijing) Co.,Ltd.

GR01 Patent grant