CN108572826B - Method for automatically deploying Hadoop and Spark clusters based on script - Google Patents

Method for automatically deploying Hadoop and Spark clusters based on script

Info

Publication number
CN108572826B
CN108572826B
Authority
CN
China
Prior art keywords
file
script
hadoop
template
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810350880.3A
Other languages
Chinese (zh)
Other versions
CN108572826A (en)
Inventor
周毅 (Zhou Yi)
高艳涛 (Gao Yantao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201810350880.3A
Publication of CN108572826A
Application granted
Publication of CN108572826B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/60: Software deployment
    • G06F8/61: Installation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a method for automatically deploying Hadoop and Spark clusters based on scripts, which comprises the following steps: S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder; S2, a node is selected, and a Hadoop user executes the generate-configuration script, which writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files; S3, a node is selected, and the Hadoop user executes the installation script, which installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments; S4, the user tests whether the Hadoop and Spark environments are built successfully.

Description

Method for automatically deploying Hadoop and Spark clusters based on script
Technical Field
The invention relates to the technical field of big data, in particular to a method for automatically deploying Hadoop and Spark clusters based on a script.
Background
Hadoop is a distributed system infrastructure developed by the Apache Foundation. A user can develop distributed programs without knowing the underlying details of the distributed system, and the power of a cluster can be fully used for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System, HDFS for short. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive (low-cost) hardware; it provides high-throughput access to application data and is suited to applications with very large data sets. The core of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over massive data.
Spark was developed in the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and is used to build large-scale, low-latency data analysis applications. Spark has the advantages of Hadoop MapReduce; but unlike MapReduce, the intermediate output of a job can be kept in memory, so reading and writing HDFS is no longer necessary, and Spark is therefore better suited to MapReduce algorithms that require iteration, such as data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop; in practice Spark complements Hadoop and can run in parallel within a Hadoop file system through the third-party cluster framework Mesos.
Disclosure of Invention
The invention mainly aims to provide a method for automatically deploying Hadoop and Spark clusters based on scripts, so as to solve the problem that configuring the environment manually is time-consuming and labor-intensive.
In order to achieve this purpose, the technical scheme is as follows:
a method for automatically deploying Hadoop and Spark clusters based on scripts comprises the following steps:
S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder. There are 4 configuration file templates, namely the hdfs-site.xml template, the core-site.xml template, the mapred-site.xml template and the yarn-site.xml template. The parameter configuration file comprises several lines of <template, parameter, value> triples, several lines of <ip, hostname, role> triples, the source file installation path and the Hadoop account password. A <template, parameter, value> triple indicates that, in the configuration file template named template, the parameter parameter is set to value; the template value is one of hdfs-site.xml, core-site.xml, mapred-site.xml and yarn-site.xml. An <ip, hostname, role> triple describes a computer node with the given ip, whose host name is hostname and whose role in the cluster is one of NameNode, DataNode and Secondary NameNode. There are two scripts in total: the generate-configuration script and the installation script. (A hypothetical example of the parameter configuration file is given after this paragraph.)
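For illustration, a hypothetical parameter configuration file might look as follows; the file name params.conf, the whitespace-separated layout and all concrete values are assumptions made for this sketch, not details prescribed by the method:

    # params.conf (hypothetical layout)
    # <template, parameter, value> lines
    hdfs-site.xml    dfs.replication                 3
    core-site.xml    fs.defaultFS                    hdfs://master:9000
    mapred-site.xml  mapreduce.framework.name        yarn
    yarn-site.xml    yarn.resourcemanager.hostname   master
    # <ip, hostname, role> lines
    192.168.1.101    master   NameNode
    192.168.1.102    slave1   DataNode
    192.168.1.103    slave2   DataNode
    # source file installation path and Hadoop account password
    /usr/lib
    hadoop123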
S2, a node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files.
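A minimal sketch of such a generate-configuration step is shown below; the file names (params.conf, *.template) and the sed-based substitution are illustrative assumptions, not the patent's literal script:

    #!/bin/bash
    # Read <template, parameter, value> triples and inject each one as a
    # <property> element into a copy of the corresponding template.
    while read -r template parameter value; do
        # skip blank lines, comments and the <ip, hostname, role> lines
        case "$template" in ''|'#'*) continue ;; esac
        [ -f "${template}.template" ] || continue
        out="$template"                                  # e.g. hdfs-site.xml
        [ -f "$out" ] || cp "${template}.template" "$out"
        # insert the property just before the closing </configuration> tag (GNU sed)
        sed -i "s|</configuration>|  <property>\n    <name>${parameter}</name>\n    <value>${value}</value>\n  </property>\n</configuration>|" "$out"
    done < params.conf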
S3, a node is selected, and the Hadoop user executes the installation script; the script installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments.
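A minimal sketch of driving the installation across the cluster is shown below; the node-list file nodes.conf (one "ip hostname role" line per node, taken from the parameter configuration file) and the script name install.sh are assumptions for illustration:

    #!/bin/bash
    # Run the installation script locally, then push it to every other node
    # over SSH and run it there as well.
    bash ./install.sh                                    # configure the local node first
    local_ip=$(hostname -I | awk '{print $1}')
    while read -r ip host role; do
        [ -z "$ip" ] && continue
        [ "$ip" = "$local_ip" ] && continue              # skip the node we are on
        scp ./install.sh "hadoop@${ip}:/home/hadoop/"    # the source archives would be copied the same way
        ssh "hadoop@${ip}" "bash /home/hadoop/install.sh"
    done < nodes.conf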
S4, the user tests whether the Hadoop and Spark environments are built successfully.
The installation script is responsible for installing and configuring JDK, Hadoop and Spark on the local computer and the other nodes, and is also responsible for configuring SSH password-free login between the nodes. The flow of configuring SSH password-free login between nodes is shown in FIG. 2; a minimal sketch of this step is given after this paragraph.
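A minimal sketch of the password-free login flow of FIG. 2 (generate a key pair on every node, then add every other node's public key to each node's authorization list); the host names and the use of ssh-copy-id are assumptions used only for illustration:

    # Run as the hadoop user on every node.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa             # generate the node's key pair
    for host in master slave1 slave2 slave3; do          # hypothetical host names
        ssh-copy-id "hadoop@${host}"                     # appends the public key to ~/.ssh/authorized_keys on $host
    done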
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method for automatically deploying Hadoop and Spark clusters by using scripts under a linux environment. And on the configuration management node, placing the script program, the installation source file, the Hadoop and Spark cluster configuration file template and the user-defined configuration under the same folder. Hadoop and Spark cluster automatic configuration is completed by executing a script program on each linux computer in the cluster.
Drawings
FIG. 1 is a schematic flow diagram of a method.
FIG. 2 is a schematic diagram illustrating the password-free login process between configuration nodes.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
the invention is further illustrated below with reference to the figures and examples.
Example 1
1. As shown in FIG. 1, the invention provides a method for deploying Hadoop and Spark clusters based on scripts, in which automatic configuration is performed through Linux shell scripts. The embodiment of the invention uses 4 independent computers; their IP addresses, operating systems, node roles and host names are configured as follows.
[Table shown as an image in the original: IP address, operating system, node role and host name of the four computers.]
2. The installation source program versions and installation paths used are as follows:
Program   Version and file name           Installation method    Installation path
JDK       jdk-8u141-linux-x64.tar         Offline installation   /usr/lib/jdk
Hadoop    Hadoop-2.7.3.tar.gz             Offline installation   /usr/lib/Hadoop
Spark     Spark-2.2.0-bin-Hadoop2.7.tgz   Offline installation   /usr/lib/Spark
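A minimal sketch of the offline installation implied by the table above; the archive names and target paths are taken from the table, while the use of --strip-components is an assumption about the archive layout:

    sudo mkdir -p /usr/lib/jdk /usr/lib/Hadoop /usr/lib/Spark
    sudo tar -xf  jdk-8u141-linux-x64.tar        -C /usr/lib/jdk    --strip-components=1
    sudo tar -xzf Hadoop-2.7.3.tar.gz            -C /usr/lib/Hadoop --strip-components=1
    sudo tar -xzf Spark-2.2.0-bin-Hadoop2.7.tgz  -C /usr/lib/Spark  --strip-components=1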
3. The parameters to be set in the configuration file template are as follows:
[Table shown as images in the original: the parameters and values to be set in each configuration file template.]
4. A node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files.
5. A node is selected, and the Hadoop user executes the installation script; the script first runs on the local computer, configures SSH password-free login among the nodes, and finally runs itself on the other nodes over SSH. During execution the installation script automatically adds the environment variables to /etc/profile and to the Hadoop user's .bashrc file, automatically reloads the configuration, and determines whether the installation and configuration succeeded. A sketch of the environment-variable step follows.
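The sketch below illustrates the environment-variable step; the installation paths follow the table in item 2, but the exact set of variables written by the patented script is not stated, so this selection is an assumption:

    # Append the environment variables to the global profile and to the
    # hadoop user's .bashrc, then reload them.
    HADOOP_ENV='export JAVA_HOME=/usr/lib/jdk
    export HADOOP_HOME=/usr/lib/Hadoop
    export SPARK_HOME=/usr/lib/Spark
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin'
    echo "$HADOOP_ENV" | sudo tee -a /etc/profile > /dev/null
    echo "$HADOOP_ENV" >> /home/hadoop/.bashrc
    source /etc/profile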
6. The user tests whether the Hadoop and Spark environments are built successfully; the test includes the following steps (a minimal command sketch follows the list):
testing whether Hadoop is built successfully;
testing whether Spark is built successfully.
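These are assumed, typical checks rather than the patent's exact test procedure:

    jps                                           # NameNode/DataNode/ResourceManager/NodeManager should be running
    hdfs dfsadmin -report                         # all DataNodes should be reported as live
    hdfs dfs -mkdir -p /tmp/smoke && hdfs dfs -ls /tmp    # simple HDFS write/read check
    /usr/lib/Spark/bin/run-example SparkPi 10     # Spark sanity check: prints an estimate of Pi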
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (4)

1. A method for automatically deploying Hadoop and Spark clusters based on scripts, characterized by comprising the following steps:
S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder; there are 4 configuration file templates, namely the hdfs-site.xml template, the core-site.xml template, the mapred-site.xml template and the yarn-site.xml template; the parameter configuration file comprises several lines of <template, parameter, value> triples, several lines of <ip, hostname, role> triples, the source file installation path and the Hadoop account password; a <template, parameter, value> triple indicates that, in the configuration file template named template, the parameter parameter is set to value; the template value is one of hdfs-site.xml, core-site.xml, mapred-site.xml and yarn-site.xml; an <ip, hostname, role> triple describes a computer node with the given ip, whose host name is hostname and which serves the function role in the cluster, role being one of NameNode, DataNode and Secondary NameNode;
there are two scripts in total: a generate-configuration script and an installation script;
the generate-configuration script automatically checks the correctness of the <template, parameter, value> and <ip, hostname, role> entries in the parameter configuration file; if a configuration error occurs in a <template, parameter, value> or <ip, hostname, role> entry, the script gives a prompt;
S2, a node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files; according to the parameter configuration file and the 4 configuration file templates, 5 actually usable configuration files are generated, namely the hdfs-site.xml file, the core-site.xml file, the mapred-site.xml file, the yarn-site.xml file and the slaves file; the slaves file is the set of hostnames whose <ip, hostname, role> role in the parameter configuration file is DataNode;
S3, a node is selected, and the Hadoop user executes the installation script; the script installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments; the process of configuring SSH password-free login between nodes is as follows: each node is logged into with the SSH password and a key pair is generated; then, for any node, the public keys of the remaining nodes are copied and added to that node's authorization list;
during execution, the script automatically judges whether each executed task has completed successfully; if a task is not completed successfully, an error message is prompted;
S4, the user tests whether the Hadoop and Spark environments are built successfully.
2. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, wherein: all computers used to build Hadoop and Spark clustered computing environments must already have a "Hadoop" account, and the "Hadoop" account must have administrator privileges.
3. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, wherein: all computers used to build the Hadoop and Spark cluster computing environment must have the openssh-client and openssh-server installed.
4. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, characterized in that: the script automatically modifies the environment variable configuration during execution; the script modifies the /etc/profile global configuration and the Hadoop user's .bashrc configuration during execution and automatically refreshes the configuration.
CN201810350880.3A 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script Active CN108572826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810350880.3A CN108572826B (en) 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script

Publications (2)

Publication Number Publication Date
CN108572826A CN108572826A (en) 2018-09-25
CN108572826B (en) 2022-08-16

Family

ID=63575208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810350880.3A Active CN108572826B (en) 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script

Country Status (1)

Country Link
CN (1) CN108572826B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445800B (en) * 2018-11-02 2022-05-03 中国人民银行清算总中心 Version automatic deployment method and system based on distributed system
CN112445746A (en) * 2019-09-04 2021-03-05 中国科学院深圳先进技术研究院 Cluster configuration automatic optimization method and system based on machine learning
CN112612514B (en) * 2020-12-31 2023-11-28 青岛海尔科技有限公司 Program development method and device, storage medium and electronic device
CN113282663A (en) * 2021-06-04 2021-08-20 杭州复杂美科技有限公司 Block chain rapid customization method, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412768A (en) * 2013-07-19 2013-11-27 蓝盾信息安全技术股份有限公司 Zookeeper cluster automatic-deployment method based on script program
CN104734892A (en) * 2015-04-02 2015-06-24 江苏物联网研究发展中心 Automatic deployment system for big data processing system Hadoop on cloud platform OpenStack
CN105893545A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 Efficient Hadoop cluster deployment method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10146519B2 (en) * 2016-09-20 2018-12-04 Bluedata Software, Inc. Generation and deployment of scripts for large scale processing framework services


Also Published As

Publication number Publication date
CN108572826A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN108572826B (en) Method for automatically deploying Hadoop and Spark clusters based on script
Furda et al. Migrating enterprise legacy source code to microservices: on multitenancy, statefulness, and data consistency
CA3055823C (en) Selectively generating word vector and paragraph vector representations of fields for machine learning
CN103838672A (en) Automated testing method and device for all-purpose financial statements
CN103729169B (en) Method and apparatus for determining file extent to be migrated
WO2017165399A1 (en) Automated assessement and granding of computerized algorithms
CN108958744B (en) Deployment method, device, medium and electronic equipment of big data distributed cluster
US20200371902A1 (en) Systems and methods for software regression detection
CN113434158B (en) Custom management method, device, equipment and medium for big data component
Shang et al. Mapreduce as a general framework to support research in mining software repositories (MSR)
US10996934B2 (en) Systems and methods for comparing computer scripts
CN107122238A (en) Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame
CN105678118B (en) A kind of software version generation method and device containing digital certificate
US11615016B2 (en) System and method for executing a test case
Duan et al. State feedback stabilization of stochastic nonlinear systems with SiISS inverse dynamics
Samples et al. Parameter sweeps for exploring GP parameters
US10963227B2 (en) Technique for transforming a standard messaging component to a customized component
Singh et al. A novel approach for pre-validation, auto resiliency & alert notification for SVN to git migration using Iot devices
CN108377198A (en) A kind of unified batch maintenance method of node configuration based on cloud platform
Fernández et al. Automated spark clusters deployment for big data with standalone applications integration
US10310959B2 (en) Pre-deployment validation system using intelligent databases
CN113467890B (en) Distributed college virtual laboratory management method, system and storage device
Ning Network Log Big Data Analysis Processing Based on Hadoop Cluster
Kamdi Big Data Quality Assurance and Testing Framework
Gogada et al. An Extensible Framework for Implementing and Validating Byzantine Fault-tolerant Protocols

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant