CN107463582B - Distributed Hadoop cluster deployment method and device - Google Patents

Publication number
CN107463582B
Authority
CN
China
Prior art keywords: information, task, host, deployment, Hadoop cluster
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610395969.2A
Other languages
Chinese (zh)
Other versions
CN107463582A (en)
Inventor
高林林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by ZTE Corp
Priority to CN201610395969.2A
Priority to PCT/CN2017/083207 (WO2017206667A1)
Publication of CN107463582A
Application granted
Publication of CN107463582B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 - Arrangements for executing specific programs
    • G06F9/445 - Program loading or initiating
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/18 - File system types
    • G06F16/182 - Distributed file systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 - Indexing; Data structures therefor; Storage structures
    • G06F16/2219 - Large Object storage; Management thereof
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Multi Processors (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and a device for distributed deployment of a Hadoop cluster. The method comprises the following steps: receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks that need to be completed by the Hadoop cluster; acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks; and deploying tasks for the one or more components according to the task information and the parameter information. The invention solves the problems of complex operation and long deployment time caused by manual deployment of Hadoop clusters in the related art.

Description

Distributed Hadoop cluster deployment method and device
Technical Field
The invention relates to the field of communication, in particular to a distributed Hadoop cluster deployment method and device.
Background
In the related art, Hadoop is a distributed system infrastructure developed by the Apache Foundation. The name is not an abbreviation but a coined word with no practical meaning, reportedly related to the name of a toy belonging to the creator's child. Hadoop is a software platform and an open-source software framework for developing and running large-scale data processing. It performs distributed computation over massive data on a cluster formed by a large number of computers, so that users can develop distributed programs without knowing the details of the underlying distributed layer and can make full use of the cluster for high-speed computation and storage.
In the related art, the personnel who deploy a distributed Hadoop cluster need to understand the Hadoop ecosystem and the hardware resources of every host in the cluster. This places high demands on them and makes errors likely. Manually configuring a Hadoop cluster involves complicated steps and is inefficient; in large-scale Hadoop cluster environments in particular, elastic management such as dynamic scale-out and scale-in is difficult.
However, current systems for automated Hadoop deployment have the following problems:
before deploying a Hadoop cluster, the cluster network topology must be designed according to the software and hardware information of the cluster environment and the components to be deployed; this places high demands on cluster administrators, who must be familiar with both the software and hardware of the environment and the Hadoop ecosystem; without administrator intervention, the automated deployment system allocates nodes such as Master and Slave at random and cannot allocate them sensibly on the basis of cluster hardware and system load information;
the Hadoop cluster component version packages have a single download source, so the deployment time of the Hadoop cluster cannot be controlled.
In summary: 1. Hadoop cluster deployment places high demands on operation and maintenance personnel, who must be familiar with the Hadoop ecosystem, know the resource information of every node in the cluster, and design the Hadoop cluster network topology; 2. Hadoop cluster component nodes are allocated at random; 3. Hadoop cluster deployment takes a long time.
In view of the above problems in the related art, no effective solution has been found at present.
Disclosure of Invention
The embodiments of the invention provide a method and a device for distributed deployment of a Hadoop cluster, so as to at least solve the problems of complex operation and long deployment time caused by manual Hadoop cluster deployment in the related art.
According to an embodiment of the invention, a method for distributed deployment of a Hadoop cluster is provided, which includes: receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster; acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks; and deploying tasks for one or more components according to the task information and the parameter information.
Optionally, the parameter information includes at least one of: host operating system information, host network information, host CPU information, host memory information, host CPU utilization rate, host memory utilization rate, host disk IO utilization rate, host network delay, host average IO operation waiting time, host disk information, and process information of components in the host.
Optionally, deploying tasks for one or more components in the Hadoop cluster according to the task information and the parameter information includes: generating a deployment task list according to the task information and the parameter information, wherein the deployment task list comprises the task information, the parameter information required by the task execution and the priority of the task; and selecting the task with the highest priority from the deployment task list and sending the task to the corresponding component.
Optionally, the priority is related to an attribute of the task and/or the parameter information for executing the task.
Optionally, after deploying a task to one or more of the components according to the template information and the parameter information, the method further includes: monitoring task execution progress and/or log information of the one or more components.
Optionally, the template information includes at least one of: the number of Hadoop cluster hosts, information on the Hadoop cluster components to be deployed, the number of Hadoop Distributed File System (HDFS) replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, host user names and passwords, log storage disk information, data storage disk information, and metadata storage disk information.
Optionally, after receiving template information for deploying a Hadoop cluster, the method further includes: parsing the template information and verifying its validity.
According to another embodiment of the present invention, an apparatus for distributed deployment of a Hadoop cluster is provided, including: a receiving module, configured to receive template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks that need to be completed by the Hadoop cluster; an acquisition module, configured to acquire parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host comprises one or more components, and the components are deployed by an agent and used for executing corresponding tasks; and a deployment module, configured to deploy tasks for one or more components according to the task information and the parameter information.
Optionally, the deployment module further comprises: the generating unit is used for generating a deployment task list according to the task information and the parameter information, wherein the deployment task list comprises the task information, the parameter information required by the task execution and the priority of the task; and the selection unit is used for selecting the task with the highest priority from the deployment task list and issuing the task to the corresponding component.
Optionally, the apparatus further comprises: and the monitoring module is used for monitoring the task execution progress and/or log information of one or more components after the deployment module deploys tasks on one or more components according to the template information and the parameter information.
According to still another embodiment of the present invention, there is also provided a storage medium. The storage medium is configured to store program code for performing the steps of:
receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster;
acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host comprises one or more components, and the components are used for executing corresponding tasks;
and deploying tasks for one or more components according to the task information and the parameter information.
According to the invention, template information for deploying a Hadoop cluster is received, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks that need to be completed by the Hadoop cluster; parameter information of one or more hosts of the Hadoop cluster is acquired according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks; and tasks are deployed for one or more components according to the task information and the parameter information. Because the task information and the host information are received, and the load of the hosts and components is obtained by acquiring the parameter information, tasks can be deployed sensibly for the hosts and components of the Hadoop cluster, which solves the problems of complex operation and long deployment time caused by manual deployment of Hadoop clusters in the related art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a general structural framework diagram of a distributed deployment Hadoop cluster according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of distributed deployment of a Hadoop cluster according to an embodiment of the invention;
FIG. 3 is a block diagram of an apparatus for distributed deployment of Hadoop clusters according to an embodiment of the present invention;
FIG. 4 is a block diagram of an alternative architecture of a distributed Hadoop cluster deployment apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram of an alternative architecture of a distributed Hadoop cluster deployment apparatus according to an embodiment of the present invention;
FIG. 6 is a structural framework diagram of an agent in the distributed Hadoop deployment cluster system according to the embodiment;
FIG. 7 is a deployment flow of the agent in the initial state of the present embodiment;
FIG. 8 is a flowchart of a Hadoop cluster deployment method according to this embodiment;
FIG. 9 is a timing diagram of the Hadoop cluster deployment method of the embodiment.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The embodiment of the present application may run on the network architecture shown in FIG. 1, which is a general structural framework diagram of a distributed deployment Hadoop cluster according to the embodiment of the present invention. As shown in FIG. 1, the network architecture includes a management system for deploying the Hadoop cluster and the Hadoop cluster itself. The management system comprises all of the functional modules and the executing agent nodes, the Hadoop cluster further comprises a plurality of distributed agent nodes for executing tasks, and the deployment system is communicatively connected to the Hadoop cluster.
In this embodiment, a distributed Hadoop cluster deployment method that operates in the management system for deploying Hadoop clusters is provided, and fig. 2 is a flowchart of the distributed Hadoop cluster deployment method according to the embodiment of the present invention, and as shown in fig. 2, the flowchart includes the following steps:
step S202, receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster;
step S204, acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks; optionally, the deployment tasks are executed by the agent;
step S206, deploying tasks for one or more components according to the task information and the parameter information.
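Steps S202 to S206 can be pictured as a three-stage pipeline. The following is a minimal Python sketch under stated assumptions, not the patented implementation: the `Template` shape, the `fake_probe` stand-in for an agent report, the `cpu_util_pct` key, and the "least-loaded host first, then add 15 points of notional load" policy are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Template:
    """Hypothetical in-memory form of the deployment template."""
    tasks: list[str]   # task information, e.g. components to install
    hosts: list[str]   # host information, e.g. network addresses


def receive_template(raw: dict) -> Template:
    """S202: accept the template submitted by the user."""
    return Template(tasks=list(raw["tasks"]), hosts=list(raw["hosts"]))


def collect_parameters(template: Template,
                       probe: Callable[[str], dict]) -> dict[str, dict]:
    """S204: gather parameter information for every host in the template."""
    return {host: probe(host) for host in template.hosts}


def deploy(template: Template, params: dict[str, dict]) -> list[tuple[str, str]]:
    """S206: place each task using the task and parameter information.

    Assumed policy: pick the host with the lowest CPU utilization, then
    add a notional 15 points to its load so that work spreads out.
    """
    load = {host: info["cpu_util_pct"] for host, info in params.items()}
    plan = []
    for task in template.tasks:
        target = min(load, key=load.get)
        plan.append((task, target))
        load[target] += 15
    return plan


def fake_probe(host: str) -> dict:
    """Stand-in for an agent report; the values are made up for the demo."""
    return {"cpu_util_pct": {"10.0.0.1": 30, "10.0.0.2": 10}[host]}


template = receive_template({"tasks": ["hdfs", "yarn", "hbase"],
                             "hosts": ["10.0.0.1", "10.0.0.2"]})
plan = deploy(template, collect_parameters(template, fake_probe))
print(plan)
```

In this demo the first two tasks go to the lightly loaded host and the third goes to the other host once the notional load crosses over, which illustrates how the collected parameter information can influence placement.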
Through the above steps, template information for deploying a Hadoop cluster is received, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks that need to be completed by the Hadoop cluster; parameter information of one or more hosts of the Hadoop cluster is acquired according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks; and tasks are deployed for one or more components according to the task information and the parameter information. Because the task information and the host information are received, and the load of the hosts and components is obtained by acquiring the parameter information, tasks can be deployed sensibly for the hosts and components of the Hadoop cluster, which solves the problems of complex operation and long deployment time caused by manual deployment of Hadoop clusters in the related art.
Optionally, the execution subject of the above steps may be a control end, a client end, and the like of the Hadoop cluster, but is not limited thereto.
Optionally, the parameter information may be, but is not limited to: host operating system information, host network information, host CPU information (such as the number of cores and the clock frequency), host memory information, host CPU utilization rate, host memory utilization rate, host disk IO utilization rate, host network delay, host average IO operation waiting time, host disk information, and process information of components in the host.
Optionally, the template information may be, but is not limited to: the number of Hadoop cluster hosts, information on the Hadoop cluster components to be deployed, the number of Hadoop Distributed File System (HDFS) replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, host user names and passwords, log storage disk information, data storage disk information, and metadata storage disk information.
In an optional implementation manner according to this embodiment, deploying a task to one or more components in the Hadoop cluster according to the task information and the parameter information includes:
S11, generating a deployment task list according to the task information and the parameter information, wherein the deployment task list comprises the task information, the parameter information required for task execution, and the priority of the task;
S12, selecting the task with the highest priority from the deployment task list and sending it to the corresponding component. Optionally, the priority is related to attributes of the task and/or the parameter information for executing the task.
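Steps S11 and S12 amount to building a prioritized list and repeatedly taking its top entry. The sketch below uses Python's standard `heapq` module for this. The patent does not specify how priority is computed, so the rule here (a base priority taken from the task, plus a one-point penalty when the target host's CPU utilization exceeds 80%) is an assumption, as are the class and field names.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class DeployTask:
    sort_key: int                                   # lower value = higher priority
    name: str = field(compare=False)
    host: str = field(compare=False)
    params: dict = field(compare=False, default_factory=dict)


def build_task_list(task_info: list[dict],
                    host_params: dict[str, dict]) -> list[DeployTask]:
    """S11: combine task information and parameter information into a heap."""
    heap: list[DeployTask] = []
    for task in task_info:
        params = host_params[task["host"]]
        # Assumed rule: tasks aimed at a busy host are pushed back slightly.
        penalty = 1 if params["cpu_util_pct"] > 80 else 0
        heapq.heappush(heap, DeployTask(task["priority"] + penalty,
                                        task["name"], task["host"], params))
    return heap


def next_task(heap: list[DeployTask]):
    """S12: take the highest-priority task, or None when the list is empty."""
    return heapq.heappop(heap) if heap else None


host_params = {"h1": {"cpu_util_pct": 90}, "h2": {"cpu_util_pct": 20}}
task_info = [
    {"name": "namenode", "host": "h1", "priority": 0},
    {"name": "datanode", "host": "h2", "priority": 2},
    {"name": "zookeeper", "host": "h2", "priority": 0},
]
heap = build_task_list(task_info, host_params)
order = []
while (task := next_task(heap)) is not None:
    order.append(task.name)
print(order)  # ['zookeeper', 'namenode', 'datanode']
```

A heap keeps selection cheap even for long task lists, and each entry carries the parameter information needed to execute it, matching the three items the deployment task list is said to contain.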
Optionally, after deploying the task to the one or more components according to the template information and the parameter information, the method further includes:
task execution progress and/or log information of one or more components is monitored.
Optionally, after receiving the template information for deploying a Hadoop cluster, the method further includes: parsing the template information and verifying its validity. The subsequent steps are performed only if the template information is valid. A valid deployment template includes at least, but is not limited to, the following: the number of Hadoop cluster nodes, information on the Hadoop cluster components to be deployed, the number of HDFS replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, user names and passwords, log storage disks, data storage disks, metadata storage disks, and the like.
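A validity check of this kind might look like the following Python sketch. The patent lists what a valid template contains but not its concrete format, so the dictionary keys, the type rules, and the individual checks (host count matches node count, at least one HDFS replica, parseable IP addresses, non-empty credentials) are all assumptions for illustration.

```python
import ipaddress

# Hypothetical field names; the source lists the contents but not the keys.
REQUIRED_FIELDS = {
    "node_count": int,          # number of Hadoop cluster nodes
    "components": list,         # Hadoop components to deploy
    "hdfs_replicas": int,       # number of HDFS replicas
    "client_connections": int,  # client connection count per component
    "client_timeout_s": int,    # client timeout
    "hosts": list,              # one dict per host: address, user, password
    "log_disks": list,
    "data_disks": list,
    "metadata_disks": list,
}


def validate_template(template: dict) -> list[str]:
    """Return a list of problems; an empty list means the template is valid."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in template:
            errors.append(f"missing field: {key}")
        elif not isinstance(template[key], expected):
            errors.append(f"wrong type for field: {key}")
    if errors:
        return errors  # the checks below rely on the fields being present

    if template["node_count"] != len(template["hosts"]):
        errors.append("node_count does not match number of hosts")
    if template["hdfs_replicas"] < 1:
        errors.append("hdfs_replicas must be at least 1")
    for host in template["hosts"]:
        try:
            ipaddress.ip_address(host.get("address", ""))
        except ValueError:
            errors.append(f"invalid host address: {host.get('address')!r}")
        if not host.get("user") or not host.get("password"):
            errors.append(f"missing credentials for host: {host.get('address')!r}")
    return errors


good = {
    "node_count": 2, "components": ["hdfs", "yarn"], "hdfs_replicas": 2,
    "client_connections": 100, "client_timeout_s": 30,
    "hosts": [{"address": "10.0.0.1", "user": "hadoop", "password": "secret"},
              {"address": "10.0.0.2", "user": "hadoop", "password": "secret"}],
    "log_disks": ["/data/log"], "data_disks": ["/data/1"],
    "metadata_disks": ["/data/meta"],
}
bad = dict(good, hdfs_replicas=0,
           hosts=[{"address": "not-an-ip", "user": "hadoop", "password": "x"}])
print(validate_template(good))
print(validate_template(bad))
```

Returning a list of problems rather than stopping at the first one lets the user correct the whole template in a single pass before resubmitting it.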
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
This embodiment also provides an apparatus for distributed deployment of a Hadoop cluster. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a block diagram of an apparatus for distributed deployment of Hadoop clusters according to an embodiment of the present invention, and as shown in fig. 3, the apparatus includes:
the receiving module 30 is configured to receive template information for deploying a Hadoop cluster, where the template information is used to indicate task information and host information of the Hadoop cluster, and the task information is used to describe a task that needs to be completed by the Hadoop cluster;
the acquisition module 32 is used for acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host comprises one or more components, and the components are deployed by an agent and used for executing corresponding tasks;
and the deployment module 34 is used for deploying the tasks for one or more components according to the task information and the parameter information.
Optionally, the parameter information may be, but is not limited to: host operating system information, host network information, host CPU information (such as the number of cores and the clock frequency), host memory information, host CPU utilization rate, host memory utilization rate, host disk IO utilization rate, host network delay, host average IO operation waiting time, host disk information, and process information of components in the host.
Optionally, the template information may be, but is not limited to: the number of Hadoop cluster hosts, information on the Hadoop cluster components to be deployed, the number of Hadoop Distributed File System (HDFS) replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, host user names and passwords, log storage disk information, data storage disk information, and metadata storage disk information.
FIG. 4 is a block diagram of an alternative structure of a distributed Hadoop cluster deployment apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes all the modules shown in FIG. 3, and the deployment module 34 further includes:
the generating unit 40 is configured to generate a deployment task list according to the task information and the parameter information, where the deployment task list includes the task information, the parameter information required for executing the task, and a priority of the task;
and the selecting unit 42 is configured to select a task with the highest priority from the deployment task list and send the task to the corresponding component.
Fig. 5 is a block diagram of an alternative structure of a distributed Hadoop cluster deployment apparatus according to an embodiment of the present invention, and as shown in fig. 5, the apparatus includes, in addition to all modules shown in fig. 3: and the monitoring module 50 is used for monitoring the task execution progress and/or log information of one or more components after the deployment module deploys tasks on the one or more components according to the template information and the parameter information.
It should be noted that the above modules may be implemented in software or hardware. In the latter case, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Example 3
This embodiment is an optional embodiment of the present invention and is used to explain the present application in detail:
This embodiment provides a method and a system for distributed deployment of a Hadoop cluster. It overcomes the following drawbacks: high demands on the personnel who deploy and manage the Hadoop cluster, random allocation of Hadoop cluster component nodes, and a single download source for installation packages. By making full use of the hardware resources in the cluster and the load of each host, the invention achieves one-click distributed deployment of a Hadoop cluster.
The distributed Hadoop cluster deployment system of this embodiment has the framework shown in FIG. 1 and includes the following components:
A template analyzer: the deployment template includes, but is not limited to, the following: host network address, user name, password, Hadoop component information, node number information, and mounted disk information. The template analyzer parses the template information entered by the user and verifies its validity.
A monitor: the monitor is responsible for receiving the execution status of Hadoop component deployment tasks sent by the agent and for log processing.
A collector: the collector is responsible for receiving and persisting host information (including but not limited to operating system information, CPU information, memory information, network information, CPU utilization rate, memory utilization rate, disk IO utilization rate, network delay and the like) sent by the agent.
A task generator: the task generator generates a Hadoop component deployment task list according to the host information acquired by the collector and the deployment template information.
A task scheduler: the task scheduler selects a high-priority deployment task and issues it to the agent according to the host information and host load acquired by the collector and the deployment task list.
An agent: the agent comprises a collector, a deployer, a parameter configurator, a monitor, and the like. The collector periodically collects host information and sends it to the collector of the system; the deployer receives and executes the tasks issued by the task scheduler; the parameter configurator configures the configuration files of all Hadoop components; the monitor monitors the execution of deployment tasks and collects logs. FIG. 6 is a structural framework diagram of the agent in the distributed Hadoop cluster deployment system of this embodiment.
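To make the agent-side collector more concrete, the following Python sketch gathers a small subset of the host information using only the standard library and serializes it for sending. It is an illustration under assumptions: the report keys are hypothetical, the transport to the system's collector is omitted, and figures such as memory utilization, disk IO utilization and network delay would need platform-specific sources that are not shown here.

```python
import json
import os
import platform
import shutil
import socket


def collect_host_info(disk_path: str = os.sep) -> dict:
    """Gather a subset of the host information an agent might report."""
    usage = shutil.disk_usage(disk_path)
    try:
        load_1m = os.getloadavg()[0]     # 1-minute load average (Unix only)
    except (AttributeError, OSError):
        load_1m = None                   # not available on this platform
    return {
        "hostname": socket.gethostname(),
        "os": f"{platform.system()} {platform.release()}",
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
        "load_avg_1m": load_1m,
    }


report = collect_host_info()
payload = json.dumps(report)   # what the agent would send to the collector
print(payload)
```

An agent would call `collect_host_info` on a timer and send each payload to the management system, where the collector persists it for the task generator and task scheduler to use.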
FIG. 7 shows the deployment flow of the agent in the initial state of this embodiment. The distributed Hadoop cluster deployment method of this embodiment includes the following steps:
Initializing the deployment system
When the system is started, the monitor, the collector and the agent in the distributed Hadoop cluster deployment system are initialized, and the system prepares to receive the deployment template submitted by the user.
Deploying agents
The agent deployment tasks are generated by the task generator and scheduled for execution by the task scheduler. After an agent is deployed, its collector periodically collects node resource information and feeds it back to the management system.
User submitting the Hadoop cluster deployment template
The user fills in the information of the Hadoop cluster to be deployed as required by the deployment template and submits the template.
Parsing the Hadoop cluster deployment template
The monitor of the distributed Hadoop cluster deployment system receives the deployment template submitted by the user, and the analyzer parses the Hadoop cluster deployment template and verifies its validity.
The topology generator generates a Hadoop cluster network topology graph according to the deployment template submitted by the user and the resource information.
Generating Hadoop cluster component deployment tasks
The task generator generates component deployment tasks according to the structure of the Hadoop cluster network topology graph.
Task scheduler executing deployment tasks
The task scheduler takes the deployment tasks to be executed from the task list, together with the resource information of each node, and generates a sequence of tasks to be executed; it then takes out the high-priority deployment tasks in turn and issues them to the corresponding agents.
Performing deployment tasks
After a host agent receives a deployment task, its deployer executes the task; the agent's monitor feeds the execution progress of the deployment task back to the monitor of the deployment system in real time, and that monitor notifies the task scheduler to continue scheduling tasks. The step of the task scheduler executing deployment tasks is repeated until all tasks to be deployed have been executed.
Based on the characteristics of each Hadoop cluster component and the available cluster resources, the nodes for the Hadoop cluster components are allocated sensibly; during deployment, deployment tasks are distributed dynamically according to the collected host load, which achieves one-click distributed deployment of the Hadoop cluster. The invention thereby addresses the complexity of deploying large-scale Hadoop clusters, the long deployment time, and the heavy load placed on the deployment system.
Fig. 8 is a flowchart of the Hadoop cluster deployment method of this embodiment, and Fig. 9 is its timing diagram. With reference to Figs. 8 and 9, the embodiment includes:
System initialization: when the distributed Hadoop cluster deployment system starts, it must be initialized; this includes initializing the monitor, the collector, agent A1, and so on.
Agent deployment: in the first round, agent A1 executes the task of deploying agent A2; once agent A2 is deployed, it is initialized and started. Agents A1 and A2 then execute the tasks of deploying agents A3 and A4, and so on, until the agents on all hosts in the cluster are deployed (see Fig. 7).
101. The user submits a deployment template: after the distributed Hadoop cluster deployment system is initialized, the user can submit a qualifying deployment template to the system. A valid deployment template includes, but is not limited to, the following: the number of Hadoop cluster nodes, information on the Hadoop cluster components to be deployed, the number of HDFS replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, user names and passwords, the log storage disk, the data storage disk, the metadata storage disk, and so on.
102. After receiving the deployment template information, the template parser first checks the validity of the template. If the template does not meet the agreed requirements, the deployment ends; if the template parses correctly under the template rules, the topology graph generator generates the Hadoop cluster networking topology graph.
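As an illustration of the validity check in step 102, the sketch below validates a template held as a Python dictionary. The field names (`node_count`, `hdfs_replicas`, `log_disk`, and so on) and the specific rules are assumptions chosen to mirror the template contents listed in step 101; the patent does not define a concrete template format.

```python
# Hypothetical top-level fields and their expected types.
REQUIRED_FIELDS = {
    "node_count": int,
    "components": list,
    "hdfs_replicas": int,
    "hosts": list,
}

# Hypothetical per-host fields, following the items named in step 101.
HOST_FIELDS = ("address", "user", "password", "log_disk", "data_disk", "metadata_disk")


def validate_template(template):
    """Return a list of problems; an empty list means the template is accepted."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in template:
            errors.append(f"missing field: {key}")
        elif not isinstance(template[key], expected):
            errors.append(f"{key} must be {expected.__name__}")
    if errors:
        return errors  # structural problems make the remaining checks meaningless

    if template["node_count"] != len(template["hosts"]):
        errors.append("node_count does not match number of hosts")
    if not 1 <= template["hdfs_replicas"] <= template["node_count"]:
        errors.append("hdfs_replicas must be between 1 and node_count")
    for host in template["hosts"]:
        for key in HOST_FIELDS:
            if key not in host:
                errors.append(f"host entry missing {key}")
    return errors
```

Returning a list of problems instead of raising on the first one lets the system report every defect in the template before ending the deployment.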
103. The topology graph generator generates the Hadoop cluster networking topology graph according to the node resources, the deployment principles of each Hadoop cluster component, and the deployment template information (e.g., S1). The Hadoop cluster component deployment principles include, but are not limited to, the following: 1. allocate the Master and Slave nodes of each Hadoop component according to hardware resources and host load; 2. calculate and allocate the number of ZooKeeper nodes according to the number of nodes in the cluster; 3. calculate and allocate the number of JournalNodes according to the number of HDFS nodes. A Hadoop component deployment task includes, but is not limited to, the following information: component name (e.g., HDFS), node name (e.g., NameNode), host network address, task priority, etc.
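The three deployment principles can be made concrete with a small sketch. The patent does not give formulas, so the thresholds below are assumptions: ZooKeeper ensembles are kept odd-sized and capped at 5, JournalNodes are set to 3 (the usual minimum for an HDFS quorum journal) when enough HDFS nodes exist, and the master role goes to the least-loaded host with memory as a tie-breaker. All function names are hypothetical.

```python
def zookeeper_count(cluster_nodes):
    """Assumed rule: an odd-sized ensemble that grows with the cluster, capped at 5."""
    if cluster_nodes < 3:
        return 1
    return 3 if cluster_nodes < 20 else 5


def journalnode_count(hdfs_nodes):
    """Assumed rule: 3 JournalNodes when at least 3 HDFS nodes exist, otherwise none."""
    return 3 if hdfs_nodes >= 3 else 0


def pick_master(hosts):
    """Choose the host with the lowest load; prefer more memory when loads are equal.

    hosts maps a host name to {"load": float, "memory_gb": int} (hypothetical shape).
    """
    return min(hosts, key=lambda h: (hosts[h]["load"], -hosts[h]["memory_gb"]))
```

For example, `pick_master({"h1": {"load": 0.9, "memory_gb": 64}, "h2": {"load": 0.2, "memory_gb": 32}})` selects `"h2"` because load is compared before memory.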
104. The topology map generated by the topology map generator is stored.
105. The deployment task generator generates the deployment tasks according to the Hadoop cluster networking topology graph.
106. The deployment task list generated by the deployment task generator is stored.
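One way to picture steps 105 and 106 is a function that flattens a topology into task records carrying the fields named in step 103 (component name, node name, host network address, task priority). The topology shape, the component ordering, and the master-before-slave rule below are assumptions made for the sketch, not details taken from the patent.

```python
# Assumed dependency order between components: earlier entries deploy first.
COMPONENT_ORDER = ["ZOOKEEPER", "HDFS", "YARN", "HBASE"]
# Assumed set of master roles, which are deployed before other roles of a component.
MASTER_ROLES = {"NameNode", "ResourceManager", "HMaster"}


def generate_tasks(topology):
    """Turn {component: {node_name: [host_address, ...]}} into a sorted task list."""
    tasks = []
    for component, roles in topology.items():
        if component in COMPONENT_ORDER:
            base = COMPONENT_ORDER.index(component) * 10
        else:
            base = 90  # unknown components are deployed last
        for node_name, addresses in roles.items():
            offset = 0 if node_name in MASTER_ROLES else 1
            for address in addresses:
                tasks.append({
                    "component": component,
                    "node": node_name,
                    "host": address,
                    "priority": base + offset,  # lower value runs earlier
                    "status": "pending",
                })
    # sorted() is stable, so tasks with equal priority keep their topology order.
    return sorted(tasks, key=lambda t: t["priority"])
```

The returned list plays the role of the stored deployment task list that the scheduler later scans.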
107. The task scheduler scans the deployment task list, takes out the deployment tasks that have not yet been executed, calculates the load of the hosts in the cluster from the node resource information (mainly the average load, memory utilization, disk I/O utilization, and network latency), and generates a deployment task sequence ordered by priority (e.g., S4).
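The load calculation in step 107 could, for instance, be a weighted sum of the four indicators the text names. The weights, the normalization constants, and the record layout below are all assumptions; the patent only states which indicators are examined.

```python
# Assumed weights for the four indicators named in step 107 (they sum to 1.0).
WEIGHTS = {"load_avg": 0.4, "mem_util": 0.3, "disk_io_util": 0.2, "net_delay_ms": 0.1}


def host_load(metrics, cpu_cores):
    """Combine the indicators into one score in [0, 1]; higher means busier."""
    normalized = {
        "load_avg": min(metrics["load_avg"] / cpu_cores, 1.0),   # load per core
        "mem_util": metrics["mem_util"],                         # already a ratio
        "disk_io_util": metrics["disk_io_util"],                 # already a ratio
        "net_delay_ms": min(metrics["net_delay_ms"] / 100.0, 1.0),  # 100 ms counts as saturated
    }
    return sum(WEIGHTS[k] * normalized[k] for k in WEIGHTS)


def build_sequence(task_list, node_resources):
    """Order pending tasks by priority, then by the load of the target host."""
    pending = [t for t in task_list if t["status"] == "pending"]

    def sort_key(task):
        res = node_resources[task["host"]]
        return (task["priority"], host_load(res["metrics"], res["cpu_cores"]))

    return sorted(pending, key=sort_key)
```

Using the load only as a tie-breaker within a priority level keeps component ordering intact while still steering work toward idle hosts.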
108. The task scheduler selects the deployment tasks in order of priority, highest first, and issues them to the agents of the corresponding hosts. When the task of deploying the Hadoop components is executed for the first time, agent A1 delivers the Hadoop cluster component deployment task to agent A2, and the monitor of agent A1 monitors the execution of the deployment task and reports it to the monitor of the deployment system (e.g., S10). After the monitor receives the completion status of the deployment task, the task scheduler regenerates the task sequence from the task list and the resource information (e.g., S5) and selects the high-priority tasks T3 and T4; agents A1 and A2 then deliver these tasks to agents A3 and A4, and so on (e.g., S11 and S14). Ideally, in the t-th round (t > 0), 2^(t-1) agents in the whole cluster are executing Hadoop component deployment tasks. Each agent can also start multiple threads and deliver Hadoop component deployment tasks to several (for example, 2) agents at once; in that case, ideally, 3^(t-1) agents in the whole Hadoop cluster are executing Hadoop component deployment tasks in the t-th round (t > 0).
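The ideal-case counts in step 108 follow from a simple cascade: in each round, every agent that already holds the deployment package serves a fixed number of new hosts. The sketch below simulates that cascade under the idealized assumption that all deliveries in a round finish together; `agents_executing` and `rounds_needed` are hypothetical helper names, and `fan_out` is the number of agents each agent serves per round.

```python
def agents_executing(round_number, fan_out=1):
    """Number of agents delivering tasks in round t (t >= 1), starting from agent A1 alone."""
    if round_number < 1:
        raise ValueError("round_number must be at least 1")
    ready = 1  # only agent A1 is ready before the first round
    for _ in range(round_number - 1):
        ready += ready * fan_out  # every ready agent brings fan_out new agents online
    return ready


def rounds_needed(total_hosts, fan_out=1):
    """Rounds required until at least total_hosts agents have received the task."""
    ready, rounds = 1, 0
    while ready < total_hosts:
        ready += ready * fan_out
        rounds += 1
    return rounds
```

With `fan_out=1` the count doubles each round, and with `fan_out=2` it triples, so under these assumptions a 1000-host cluster is covered in 10 rounds or 7 rounds respectively, compared with 999 sequential deliveries from a single deployment server.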
109. Agent A1 is deployed together with the distributed Hadoop cluster deployment management system.
110. An agent is deployed on each host node in the Hadoop cluster.
Configuration generation: the parameter configuration tasks generate the configuration of each Hadoop cluster component. The scheduler collects the deployment information of every component of the whole Hadoop cluster (for example, the host names of the nodes where the Master and Slave roles run, the log storage disks, the data storage disks, and the metadata storage disks) and issues it, together with the parameter configuration task, to the parameter configurator in each host's agent component. Once all parameter configuration tasks in the cluster have been executed, deployment of all components of the whole Hadoop cluster is complete.
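A parameter configurator of this kind might translate the collected deployment information into Hadoop configuration files. The sketch below is an assumption about how that could look: the input dictionary and function names are hypothetical, port 8020 is assumed for the NameNode RPC address, and only four standard properties (`fs.defaultFS`, `dfs.replication`, `dfs.namenode.name.dir`, `dfs.datanode.data.dir`) are covered.

```python
import xml.etree.ElementTree as ET


def build_hdfs_config(info):
    """Map collected deployment info to properties, grouped by target file."""
    return {
        "core-site.xml": {
            # Assumption: the NameNode listens on RPC port 8020.
            "fs.defaultFS": f"hdfs://{info['namenode_host']}:8020",
        },
        "hdfs-site.xml": {
            "dfs.replication": str(info["hdfs_replicas"]),
            "dfs.namenode.name.dir": info["metadata_disk"],
            "dfs.datanode.data.dir": info["data_disk"],
        },
    }


def to_hadoop_xml(props):
    """Render properties in Hadoop's <configuration><property> XML layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Each agent would write the rendered XML for its own host, which is why the scheduler has to send cluster-wide information (such as the Master host name) along with the task.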
201. The collector in the agent component periodically collects the hardware resources and running state of its host and reports them to the collector in the deployment system, which stores them as node resource information. The hardware resource and running state information includes, but is not limited to, operating system information, host name, CPU information, memory information, disk information, process information, CPU utilization, memory utilization, disk I/O utilization, network information, average I/O wait time, and so on.
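A periodic collector along the lines of step 201 can be sketched with the Python standard library alone. This covers only a subset of the listed items (host name, operating system, CPU count, load average, disk usage); memory and disk I/O utilization would need platform-specific sources and are omitted here. The function names and the `report` callback are hypothetical.

```python
import os
import platform
import shutil
import socket
import time


def collect_host_info(data_path=os.sep):
    """Take one snapshot of host resources using only the standard library."""
    disk = shutil.disk_usage(data_path)
    try:
        load_1m = os.getloadavg()[0]   # not available on every platform
    except (AttributeError, OSError):
        load_1m = None
    return {
        "timestamp": time.time(),
        "host_name": socket.gethostname(),
        "os": platform.platform(),
        "cpu_count": os.cpu_count(),
        "load_avg_1m": load_1m,
        "disk_total_bytes": disk.total,
        "disk_used_ratio": disk.used / disk.total if disk.total else 0.0,
    }


def run_collector(report, interval_s, rounds):
    """Collect a snapshot every interval_s seconds and pass it to report()."""
    for i in range(rounds):
        report(collect_host_info())
        if i < rounds - 1:
            time.sleep(interval_s)
```

Here `report` stands in for sending the snapshot to the collector of the deployment system; for example, `run_collector(samples.append, 5, 3)` gathers three snapshots five seconds apart into a list.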
202. The node resource information (covering both the hosts and the Hadoop components) gathered by the monitor and the collector is stored.
Example 4
An embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be configured to store program code for performing the following steps:
S1, receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster;
S2, acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks;
S3, deploying tasks for one or more of the components according to the task information and the parameter information.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
Optionally, in this embodiment, the processor, according to the program code stored in the storage medium, receives template information for deploying a Hadoop cluster, where the template information is used to indicate task information and host information of the Hadoop cluster, and the task information is used to describe the tasks that need to be completed by the Hadoop cluster.
Optionally, in this embodiment, the processor, according to the program code stored in the storage medium, acquires parameter information of one or more hosts of the Hadoop cluster according to the host information, where each host is used to deploy one or more components, and the components are deployed by the agent and used to execute corresponding tasks.
Optionally, in this embodiment, the processor, according to the program code stored in the storage medium, deploys tasks for the one or more components according to the task information and the parameter information.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for distributed deployment of Hadoop clusters is characterized by comprising the following steps:
receiving template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster;
acquiring parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein the parameter information comprises load information of the hosts, each host is used for deploying one or more components, and the components are deployed by an agent and used for executing corresponding tasks;
deploying tasks for one or more of the components according to the task information and the parameter information, including:
allocating the Master and Slave nodes of the Hadoop components according to hardware resources and host load conditions;
calculating and allocating the number of ZooKeeper nodes according to the number of nodes in the cluster;
and calculating and allocating the number of JournalNodes according to the number of HDFS nodes.
2. The method of claim 1, wherein the parameter information comprises at least one of: host operating system information, host network information, host CPU information, host memory information, host CPU utilization, host memory utilization, host disk I/O utilization, host network latency, host average I/O operation wait time, host disk information, and process information of the components on the host.
3. The method of claim 1, wherein deploying tasks for one or more components within the Hadoop cluster according to the task information and the parameter information comprises:
generating a deployment task list according to the task information and the parameter information, wherein the deployment task list comprises the task information, the parameter information required by the task execution and the priority of the task;
and selecting the task with the highest priority from the deployment task list and sending the task to the corresponding component.
4. A method according to claim 3, characterized in that said priority is related to properties of said task and/or said parameter information for executing said task.
5. The method of claim 1, wherein after deploying tasks for one or more of the components based on the template information and the parameter information, the method further comprises:
monitoring task execution progress and/or log information of the one or more components.
6. The method of claim 1, wherein the template information comprises at least one of: the number of Hadoop cluster hosts, information on the Hadoop cluster components to be deployed, the number of Hadoop Distributed File System (HDFS) replicas, the number of client connections and the timeout for each Hadoop cluster component, host network addresses, host user names and passwords, log storage disk information, data storage disk information, and metadata storage disk information.
7. The method of claim 1, wherein after receiving template information for deploying a Hadoop cluster, the method further comprises:
parsing the template information and verifying the validity of the template information.
8. A distributed Hadoop cluster deployment device, comprising:
a receiving module, configured to receive template information for deploying a Hadoop cluster, wherein the template information is used for indicating task information and host information of the Hadoop cluster, and the task information is used for describing tasks needing to be completed by the Hadoop cluster;
an acquisition module, configured to acquire parameter information of one or more hosts of the Hadoop cluster according to the host information, wherein the parameter information comprises load information of the hosts, each host comprises one or more components, and the components are deployed by an agent and used for executing corresponding tasks;
a deployment module, configured to deploy tasks to one or more of the components according to the task information and the parameter information, including:
allocating the Master and Slave nodes of the Hadoop components according to hardware resources and host load conditions;
calculating and allocating the number of ZooKeeper nodes according to the number of nodes in the cluster;
and calculating and allocating the number of JournalNodes according to the number of HDFS nodes.
9. The apparatus of claim 8, wherein the deployment module further comprises:
a generating unit, configured to generate a deployment task list according to the task information and the parameter information, wherein the deployment task list comprises the task information, the parameter information required for executing the task, and the priority of the task;
and a selection unit, configured to select the task with the highest priority from the deployment task list and issue the task to the corresponding component.
10. The apparatus of claim 9, further comprising:
a monitoring module, configured to monitor the task execution progress and/or log information of one or more of the components after the deployment module deploys tasks on one or more of the components according to the template information and the parameter information.
CN201610395969.2A 2016-06-03 2016-06-03 Distributed Hadoop cluster deployment method and device Active CN107463582B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610395969.2A CN107463582B (en) 2016-06-03 2016-06-03 Distributed Hadoop cluster deployment method and device
PCT/CN2017/083207 WO2017206667A1 (en) 2016-06-03 2017-05-05 Method and device for distributively deploying hadoop cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610395969.2A CN107463582B (en) 2016-06-03 2016-06-03 Distributed Hadoop cluster deployment method and device

Publications (2)

Publication Number Publication Date
CN107463582A CN107463582A (en) 2017-12-12
CN107463582B true CN107463582B (en) 2021-11-12

Family

ID=60479660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610395969.2A Active CN107463582B (en) 2016-06-03 2016-06-03 Distributed Hadoop cluster deployment method and device

Country Status (2)

Country Link
CN (1) CN107463582B (en)
WO (1) WO2017206667A1 (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108228796A (en) * 2017-12-29 2018-06-29 百度在线网络技术(北京)有限公司 Management method, device, system, server and the medium of MPP databases
CN109284272A (en) * 2018-09-07 2019-01-29 郑州云海信息技术有限公司 A kind of dispositions method of distributed file system, device and equipment
CN109508196A (en) * 2018-10-15 2019-03-22 广州云新信息技术有限公司 Automatic deployment system and method based on X86 server
CN111061503B (en) * 2018-10-16 2023-08-18 航天信息股份有限公司 Cluster system configuration method and cluster system
CN111581042B (en) * 2019-02-15 2023-09-12 网宿科技股份有限公司 Cluster deployment method, deployment platform and server to be deployed
CN110262807B (en) * 2019-06-20 2023-12-26 北京百度网讯科技有限公司 Cluster creation progress log acquisition system, method and device
CN110389766B (en) * 2019-06-21 2022-12-27 深圳市汇川技术股份有限公司 HBase container cluster deployment method, system, equipment and computer readable storage medium
CN110457114B (en) * 2019-07-24 2020-11-27 杭州数梦工场科技有限公司 Application cluster deployment method and device
CN111754191A (en) * 2020-06-08 2020-10-09 中国建设银行股份有限公司 Automatic change method based on cloud platform and related equipment
CN111866013B (en) * 2020-07-29 2023-04-18 杭州安恒信息技术股份有限公司 Cloud security product management platform deployment method, device, equipment and medium
CN112363818A (en) * 2020-11-30 2021-02-12 杭州玳数科技有限公司 Method for realizing Hadoop MR task cluster independence under Yarn scheduling
CN112732410B (en) * 2021-01-21 2023-03-28 青岛海尔科技有限公司 Service node management method and device, storage medium and electronic device
CN114816444A (en) * 2021-01-28 2022-07-29 网联清算有限公司 Method and device for deploying monitoring program, electronic equipment and storage medium
CN113132383B (en) * 2021-04-19 2022-03-25 烟台中科网络技术研究所 Network data acquisition method and system
CN113127016A (en) * 2021-04-30 2021-07-16 平安国际智慧城市科技股份有限公司 Automatic deployment method, device, equipment and medium for Hdp big data platform
CN113886036B (en) * 2021-09-13 2024-04-19 天翼数字生活科技有限公司 Method and system for optimizing distributed system cluster configuration
CN115499304B (en) * 2022-07-29 2024-03-08 天翼云科技有限公司 Automatic deployment method, device, equipment and product for distributed storage
CN117742931A (en) * 2022-09-15 2024-03-22 华为云计算技术有限公司 Method and device for determining big data cluster deployment scheme, clusters and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103064742A (en) * 2012-12-25 2013-04-24 中国科学院深圳先进技术研究院 Automatic deployment system and method of hadoop cluster
CN103152393A (en) * 2013-02-05 2013-06-12 北京邮电大学 Charging method and charging system for cloud computing

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8954568B2 (en) * 2011-07-21 2015-02-10 Yahoo! Inc. Method and system for building an elastic cloud web server farm
US8756595B2 (en) * 2011-07-28 2014-06-17 Yahoo! Inc. Method and system for distributed application stack deployment
US9612812B2 (en) * 2011-12-21 2017-04-04 Excalibur Ip, Llc Method and system for distributed application stack test certification
CN105302641B (en) * 2014-06-04 2019-03-22 杭州海康威视数字技术股份有限公司 The method and device of node scheduling is carried out in virtual cluster
CN104317610B (en) * 2014-10-11 2017-05-03 福建新大陆软件工程有限公司 Method and device for automatic installation and deployment of hadoop platform
CN104734892A (en) * 2015-04-02 2015-06-24 江苏物联网研究发展中心 Automatic deployment system for big data processing system Hadoop on cloud platform OpenStack

Also Published As

Publication number Publication date
WO2017206667A1 (en) 2017-12-07
CN107463582A (en) 2017-12-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant