CN108572826B - Method for automatically deploying Hadoop and Spark clusters based on script - Google Patents

Method for automatically deploying Hadoop and Spark clusters based on script

Info

Publication number
CN108572826B
CN108572826B
Authority
CN
China
Prior art keywords
file
script
hadoop
template
configuration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810350880.3A
Other languages
Chinese (zh)
Other versions
CN108572826A (en)
Inventor
周毅 (Zhou Yi)
高艳涛 (Gao Yantao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN201810350880.3A
Publication of CN108572826A
Application granted
Publication of CN108572826B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/60: Software deployment
    • G06F8/61: Installation
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a method for automatically deploying Hadoop and Spark clusters based on scripts, which comprises the following steps: S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder; S2, a node is selected, and a Hadoop user executes the generate-configuration script, which writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files; S3, a node is selected, and the Hadoop user executes the installation script, which installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments; S4, the user tests whether the Hadoop and Spark environments are built successfully.

Description

Method for automatically deploying Hadoop and Spark clusters based on script
Technical Field
The invention relates to the technical field of big data, in particular to a method for automatically deploying Hadoop and Spark clusters based on a script.
Background
Hadoop is a distributed system infrastructure developed by the Apache Foundation. A user can develop distributed programs without knowing the underlying details of the distributed system, and the power of a cluster can be fully used for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System, HDFS for short. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive (low-cost) hardware; it provides high-throughput access to application data and is suited to applications with very large data sets. The core of the Hadoop framework consists of HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over massive data.
Spark was developed in the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley, and is used to build large-scale, low-latency data analysis applications. Spark has the advantages of Hadoop MapReduce; but unlike MapReduce, the intermediate output of a job can be kept in memory, so reading and writing HDFS is no longer necessary, and Spark is therefore better suited to MapReduce algorithms that require iteration, such as data mining and machine learning.
Spark is an open-source cluster computing environment similar to Hadoop; in practice Spark complements Hadoop and can run in parallel within a Hadoop file system through the third-party cluster framework Mesos.
Disclosure of Invention
The invention mainly aims to provide a method for automatically deploying Hadoop and Spark clusters based on scripts, so as to solve the problem that configuring the environment manually is time-consuming and labor-intensive.
In order to achieve this purpose, the technical scheme is as follows:
a method for automatically deploying Hadoop and Spark clusters based on scripts comprises the following steps:
S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder. There are 4 configuration file templates, namely the hdfs-site.xml template, the core-site.xml template, the mapred-site.xml template and the yarn-site.xml template. The parameter configuration file comprises several lines of <template, parameter, value> triples, several lines of <ip, hostname, role> triples, the source file installation path and the Hadoop account password. A <template, parameter, value> triple indicates that, in the configuration file template named template, the parameter parameter is set to value; the template value is one of hdfs-site.xml, core-site.xml, mapred-site.xml and yarn-site.xml. An <ip, hostname, role> triple describes a computer node with the given ip, whose host name is hostname and whose role in the cluster is one of NameNode, DataNode and Secondary NameNode. There are two scripts in total: the generate-configuration script and the installation script. (A hypothetical example of the parameter configuration file is given after this paragraph.)
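For illustration, a hypothetical parameter configuration file might look as follows; the file name params.conf, the whitespace-separated layout and all concrete values are assumptions made for this sketch, not details prescribed by the method:

    # params.conf (hypothetical layout)
    # <template, parameter, value> lines
    hdfs-site.xml    dfs.replication                 3
    core-site.xml    fs.defaultFS                    hdfs://master:9000
    mapred-site.xml  mapreduce.framework.name        yarn
    yarn-site.xml    yarn.resourcemanager.hostname   master
    # <ip, hostname, role> lines
    192.168.1.101    master   NameNode
    192.168.1.102    slave1   DataNode
    192.168.1.103    slave2   DataNode
    # source file installation path and Hadoop account password
    /usr/lib
    hadoop123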
S2, a node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files.
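A minimal sketch of such a generate-configuration step is shown below; the file names (params.conf, *.template) and the sed-based substitution are illustrative assumptions, not the patent's literal script:

    #!/bin/bash
    # Read <template, parameter, value> triples and inject each one as a
    # <property> element into a copy of the corresponding template.
    while read -r template parameter value; do
        # skip blank lines, comments and the <ip, hostname, role> lines
        case "$template" in ''|'#'*) continue ;; esac
        [ -f "${template}.template" ] || continue
        out="$template"                                  # e.g. hdfs-site.xml
        [ -f "$out" ] || cp "${template}.template" "$out"
        # insert the property just before the closing </configuration> tag (GNU sed)
        sed -i "s|</configuration>|  <property>\n    <name>${parameter}</name>\n    <value>${value}</value>\n  </property>\n</configuration>|" "$out"
    done < params.conf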
S3, a node is selected, and the Hadoop user executes the installation script; the script installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments.
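A minimal sketch of driving the installation across the cluster is shown below; the node-list file nodes.conf (one "ip hostname role" line per node, taken from the parameter configuration file) and the script name install.sh are assumptions for illustration:

    #!/bin/bash
    # Run the installation script locally, then push it to every other node
    # over SSH and run it there as well.
    bash ./install.sh                                    # configure the local node first
    local_ip=$(hostname -I | awk '{print $1}')
    while read -r ip host role; do
        [ -z "$ip" ] && continue
        [ "$ip" = "$local_ip" ] && continue              # skip the node we are on
        scp ./install.sh "hadoop@${ip}:/home/hadoop/"    # the source archives would be copied the same way
        ssh "hadoop@${ip}" "bash /home/hadoop/install.sh"
    done < nodes.conf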
S4, the user tests whether the Hadoop and Spark environments are built successfully.
The installation script is responsible for installing and configuring JDK, Hadoop and Spark on the local computer and the other nodes, and is also responsible for configuring SSH password-free login between the nodes. The flow of configuring SSH password-free login between nodes is shown in FIG. 2; a minimal sketch of this step is given after this paragraph.
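A minimal sketch of the password-free login flow of FIG. 2 (generate a key pair on every node, then add every other node's public key to each node's authorization list); the host names and the use of ssh-copy-id are assumptions used only for illustration:

    # Run as the hadoop user on every node.
    ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa             # generate the node's key pair
    for host in master slave1 slave2 slave3; do          # hypothetical host names
        ssh-copy-id "hadoop@${host}"                     # appends the public key to ~/.ssh/authorized_keys on $host
    done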
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a method for automatically deploying Hadoop and Spark clusters by using scripts under a linux environment. And on the configuration management node, placing the script program, the installation source file, the Hadoop and Spark cluster configuration file template and the user-defined configuration under the same folder. Hadoop and Spark cluster automatic configuration is completed by executing a script program on each linux computer in the cluster.
Drawings
FIG. 1 is a schematic flow diagram of a method.
FIG. 2 is a schematic diagram illustrating the password-free login process between configuration nodes.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent.
the invention is further illustrated below with reference to the figures and examples.
Example 1
1. As shown in FIG. 1, the invention provides a method for deploying Hadoop and Spark clusters based on scripts, in which automatic configuration is performed through Linux shell scripts. The embodiment of the invention uses 4 independent computers; their IP addresses, operating systems, node roles and host names are configured as follows.
[Table shown as an image in the original: IP address, operating system, node role and host name of the four computers.]
2. The installation source program versions and installation paths used are as follows:
Program   Version and file name           Installation method    Installation path
JDK       jdk-8u141-linux-x64.tar         Offline installation   /usr/lib/jdk
Hadoop    Hadoop-2.7.3.tar.gz             Offline installation   /usr/lib/Hadoop
Spark     Spark-2.2.0-bin-Hadoop2.7.tgz   Offline installation   /usr/lib/Spark
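A minimal sketch of the offline installation implied by the table above; the archive names and target paths are taken from the table, while the use of --strip-components is an assumption about the archive layout:

    sudo mkdir -p /usr/lib/jdk /usr/lib/Hadoop /usr/lib/Spark
    sudo tar -xf  jdk-8u141-linux-x64.tar        -C /usr/lib/jdk    --strip-components=1
    sudo tar -xzf Hadoop-2.7.3.tar.gz            -C /usr/lib/Hadoop --strip-components=1
    sudo tar -xzf Spark-2.2.0-bin-Hadoop2.7.tgz  -C /usr/lib/Spark  --strip-components=1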
3. The parameters to be set in the configuration file template are as follows:
[Table shown as images in the original: the parameters and values to be set in each configuration file template.]
4. A node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files.
5. A node is selected, and the Hadoop user executes the installation script; the script first runs on the local computer, configures SSH password-free login among the nodes, and finally runs itself on the other nodes over SSH. During execution the installation script automatically adds the environment variables to /etc/profile and to the Hadoop user's .bashrc file, automatically reloads the configuration, and determines whether the installation and configuration succeeded. A sketch of the environment-variable step follows.
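The sketch below illustrates the environment-variable step; the installation paths follow the table in item 2, but the exact set of variables written by the patented script is not stated, so this selection is an assumption:

    # Append the environment variables to the global profile and to the
    # hadoop user's .bashrc, then reload them.
    HADOOP_ENV='export JAVA_HOME=/usr/lib/jdk
    export HADOOP_HOME=/usr/lib/Hadoop
    export SPARK_HOME=/usr/lib/Spark
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin'
    echo "$HADOOP_ENV" | sudo tee -a /etc/profile > /dev/null
    echo "$HADOOP_ENV" >> /home/hadoop/.bashrc
    source /etc/profile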
6. The user tests whether the Hadoop and Spark environments are built successfully; the test includes the following steps (a minimal command sketch follows the list):
testing whether Hadoop is built successfully;
testing whether Spark is built successfully.
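These are assumed, typical checks rather than the patent's exact test procedure:

    jps                                           # NameNode/DataNode/ResourceManager/NodeManager should be running
    hdfs dfsadmin -report                         # all DataNodes should be reported as live
    hdfs dfs -mkdir -p /tmp/smoke && hdfs dfs -ls /tmp    # simple HDFS write/read check
    /usr/lib/Spark/bin/run-example SparkPi 10     # Spark sanity check: prints an estimate of Pi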
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (4)

1. A method for automatically deploying Hadoop and Spark clusters based on scripts, characterized by comprising the following steps:
S1, a user provides configuration file templates, a parameter configuration file, source files and script files and places them in the same folder; there are 4 configuration file templates, namely the hdfs-site.xml template, the core-site.xml template, the mapred-site.xml template and the yarn-site.xml template; the parameter configuration file comprises several lines of <template, parameter, value> triples, several lines of <ip, hostname, role> triples, the source file installation path and the Hadoop account password; a <template, parameter, value> triple indicates that, in the configuration file template named template, the parameter parameter is set to value; the template value is one of hdfs-site.xml, core-site.xml, mapred-site.xml and yarn-site.xml; an <ip, hostname, role> triple describes a computer node with the given ip, whose host name is hostname and which serves the function role in the cluster, role being one of NameNode, DataNode and Secondary NameNode;
there are two scripts in total: a generate-configuration script and an installation script;
the generate-configuration script automatically checks the correctness of the <template, parameter, value> and <ip, hostname, role> entries in the parameter configuration file; if a configuration error occurs in a <template, parameter, value> or <ip, hostname, role> entry, the script gives a prompt;
S2, a node is selected, and the Hadoop user executes the generate-configuration script; the script writes the parameters in the parameter configuration file into the configuration file templates and generates the actually usable configuration files; according to the parameter configuration file and the 4 configuration file templates, 5 actually usable configuration files are generated, namely the hdfs-site.xml file, the core-site.xml file, the mapred-site.xml file, the yarn-site.xml file and the slaves file; the slaves file is the set of hostnames whose <ip, hostname, role> role in the parameter configuration file is DataNode;
S3, a node is selected, and the Hadoop user executes the installation script; the script installs and configures the environment on the local computer and then executes itself on the other nodes over SSH and configures their environments; the process of configuring SSH password-free login between nodes is as follows: each node is logged into with the SSH password and a key pair is generated; then, for any node, the public keys of the remaining nodes are copied and added to that node's authorization list;
during execution, the script automatically judges whether each executed task has completed successfully; if a task is not completed successfully, an error message is prompted;
S4, the user tests whether the Hadoop and Spark environments are built successfully.
2. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, wherein: all computers used to build Hadoop and Spark clustered computing environments must already have a "Hadoop" account, and the "Hadoop" account must have administrator privileges.
3. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, wherein: all computers used to build the Hadoop and Spark cluster computing environment must have the openssh-client and openssh-server installed.
4. The method for automated deployment of Hadoop and Spark clusters based on scripts according to claim 1, characterized in that: the script automatically modifies the environment variable configuration during execution; the script modifies the /etc/profile global configuration and the Hadoop user's .bashrc configuration during execution and automatically refreshes the configuration.
CN201810350880.3A 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script Active CN108572826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810350880.3A CN108572826B (en) 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script

Publications (2)

Publication Number Publication Date
CN108572826A CN108572826A (en) 2018-09-25
CN108572826B (en) 2022-08-16

Family

ID=63575208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810350880.3A Active CN108572826B (en) 2018-04-18 2018-04-18 Method for automatically deploying Hadoop and Spark clusters based on script

Country Status (1)

Country Link
CN (1) CN108572826B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445800B (en) * 2018-11-02 2022-05-03 中国人民银行清算总中心 Version automatic deployment method and system based on distributed system
CN112445746A (en) * 2019-09-04 2021-03-05 中国科学院深圳先进技术研究院 Cluster configuration automatic optimization method and system based on machine learning
CN112612514B (en) * 2020-12-31 2023-11-28 青岛海尔科技有限公司 Program development method and device, storage medium and electronic device
CN113282663A (en) * 2021-06-04 2021-08-20 杭州复杂美科技有限公司 Block chain rapid customization method, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412768A (en) * 2013-07-19 2013-11-27 蓝盾信息安全技术股份有限公司 Zookeeper cluster automatic-deployment method based on script program
CN104734892A (en) * 2015-04-02 2015-06-24 江苏物联网研究发展中心 Automatic deployment system for big data processing system Hadoop on cloud platform OpenStack
CN105893545A (en) * 2016-04-01 2016-08-24 浪潮电子信息产业股份有限公司 Efficient Hadoop cluster deployment method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10146519B2 (en) * 2016-09-20 2018-12-04 Bluedata Software, Inc. Generation and deployment of scripts for large scale processing framework services


Also Published As

Publication number Publication date
CN108572826A (en) 2018-09-25

Similar Documents

Publication Publication Date Title
CN108572826B (en) Method for automatically deploying Hadoop and Spark clusters based on script
Furda et al. Migrating enterprise legacy source code to microservices: on multitenancy, statefulness, and data consistency
CA3055823C (en) Selectively generating word vector and paragraph vector representations of fields for machine learning
CN103838672A (en) Automated testing method and device for all-purpose financial statements
CN103729169B (en) Method and apparatus for determining file extent to be migrated
WO2017165399A1 (en) Automated assessement and granding of computerized algorithms
CN108958744B (en) Deployment method, device, medium and electronic equipment of big data distributed cluster
US20200371902A1 (en) Systems and methods for software regression detection
CN113434158B (en) Custom management method, device, equipment and medium for big data component
Shang et al. Mapreduce as a general framework to support research in mining software repositories (MSR)
US10996934B2 (en) Systems and methods for comparing computer scripts
CN107122238A (en) Efficient iterative Mechanism Design method based on Hadoop cloud Computational frame
CN105678118B (en) A kind of software version generation method and device containing digital certificate
US11615016B2 (en) System and method for executing a test case
Duan et al. State feedback stabilization of stochastic nonlinear systems with SiISS inverse dynamics
Samples et al. Parameter sweeps for exploring GP parameters
US10963227B2 (en) Technique for transforming a standard messaging component to a customized component
Singh et al. A novel approach for pre-validation, auto resiliency & alert notification for SVN to git migration using Iot devices
CN108377198A (en) A kind of unified batch maintenance method of node configuration based on cloud platform
Fernández et al. Automated spark clusters deployment for big data with standalone applications integration
US10310959B2 (en) Pre-deployment validation system using intelligent databases
CN113467890B (en) Distributed college virtual laboratory management method, system and storage device
Ning Network Log Big Data Analysis Processing Based on Hadoop Cluster
Kamdi Big Data Quality Assurance and Testing Framework
Gogada et al. An Extensible Framework for Implementing and Validating Byzantine Fault-tolerant Protocols

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant