CN107682209A - SDP big data automatic deployment monitoring platform - Google Patents

SDP big data automatic deployment monitoring platform

Info

Publication number
CN107682209A
CN107682209A (application number CN201711105672.9A)
Authority
CN
China
Prior art keywords
cluster environment
sdp
module
node
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711105672.9A
Other languages
Chinese (zh)
Inventor
张�林
武保权
马培娜
王成锐
韩克强
连杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Sarntah Inteligent Technology Co Ltd
Original Assignee
Qingdao Sarntah Inteligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Sarntah Inteligent Technology Co Ltd filed Critical Qingdao Sarntah Inteligent Technology Co Ltd
Priority to CN201711105672.9A priority Critical patent/CN107682209A/en
Publication of CN107682209A publication Critical patent/CN107682209A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L 67/025: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP], for remote control or remote monitoring of applications
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L 41/069: Management of faults, events, alarms or notifications using logs of notifications; post-processing of notifications
    • H04L 41/08: Configuration management of networks or network elements
    • H04L 41/0803: Configuration setting
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H04L 67/2866: Architectures; Arrangements
    • H04L 67/30: Profiles
    • H04L 67/50: Network services
    • H04L 67/75: Indicating network or usage conditions on the user display

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an SDP big data automatic deployment monitoring platform comprising a cluster environment creation module, a cluster environment monitoring module, a cluster environment operation module, a cluster environment alarm module, a cluster environment log analysis module, a cluster environment security control module, and a cluster environment user role and authority module. The cluster environment creation module covers the creation of an environment preparation script and the configuration and installation of the cluster environment. The disclosed SDP big data automatic deployment monitoring platform supports rapid deployment, unified cluster and service management, and intelligent monitoring and alarm management, and offers higher security.

Description

SDP big data automatic deployment monitoring platform
Technical Field
The invention relates to the technical field of computer information storage and processing, in particular to an SDP big data automatic deployment monitoring platform.
Background
Modern society develops at high speed, with advanced science and technology and rapid information circulation; people communicate ever more closely and daily life becomes more convenient, while the amount of data to be processed keeps growing and the demand for mass data processing rises in every field. Against the background that the storage space and computing power of a single machine can no longer meet these demands, distributed computing and parallel computing began to develop and be applied rapidly, eventually evolving into grid computing.
Monitoring information in a large-scale distributed system is massive, and the monitored resources are multi-level and multi-source; the dynamics and complexity of a big data platform create many difficulties for its monitoring system. How to effectively monitor the software and hardware resources in a big data platform, predict resource bottlenecks in time, and take corresponding measures before a fault occurs is key to improving the service quality of the platform, and is also a focus of current research.
Big data is a product of this high-technology era. With the advent of the cloud era, big data has attracted more and more attention. The first problems big data must solve are storage and computation. The Hadoop framework emerged around 2008, when Hadoop was split out of Nutch as an independent project; it has drawn continuous attention, and the framework itself keeps evolving, with native implementations of compression algorithms, optimization of the checksum mechanism, and short-circuit reads (reading local file blocks directly), all of which improve read performance.
With this continuous optimization, Hadoop keeps improving and the ecosystem around it keeps maturing, but deploying a Hadoop system is still very cumbersome. From 2008 to the present, countless large and small clusters have been built by hand inside companies, each involving scripts, permission settings and directory permission configuration; installing one system takes roughly half a day to a full day, and even a very experienced engineer still needs about half a day.
Big data technology is constantly evolving, but it still has some inherent shortcomings. First, the technologies proliferate: the components of the ecosystem, including Hadoop, Hive, Spark and the like, are all improving continuously, so selecting among them is difficult, and how to apply each technology is a hard problem.
In addition, integration within big data technology itself is insufficient. Every open source tool tends to build an ecosystem around itself and to emphasize how good its own performance is, while how to use these technologies together reasonably remains an open question. For example, if an existing system is built around one technology and a new technology is later introduced, how to integrate the two becomes a significant problem.
Disclosure of Invention
In order to solve the above technical problems, the invention provides an SDP big data automatic deployment monitoring platform, so as to achieve rapid deployment, unified cluster and service management, intelligent monitoring and alarm management, and higher security.
In order to achieve this purpose, the technical solution of the invention is as follows:
an SDP big data automatic deployment monitoring platform comprises a cluster environment creation module, a cluster environment monitoring module, a cluster environment operation module, a cluster environment alarm module, a cluster environment log analysis module, a cluster environment security control module and a cluster environment user role and authority module.
In the above solution, the cluster environment creation module comprises the creation of an environment preparation script and the configuration and installation of the cluster environment.
In the above scheme, the cluster environment includes components, services, interfaces, models, databases, and tools.
In the above scheme, the monitoring module of the cluster environment includes hardware usage monitoring, component service condition monitoring, data storage condition monitoring, alarm condition monitoring, task execution condition monitoring, configuration condition monitoring, and node operation condition monitoring.
In the above solution, the operation module of the cluster environment includes operations of nodes, services, background management, alarms, and data.
In the above solution, the alarm module of the cluster environment includes WEB alarm, PORT alarm, METRIC alarm, AGGREGATE alarm, SCRIPT alarm, SERVER alarm and RECOVERY alarm.
In the above scheme, the log analysis module of the cluster environment includes user behavior analysis and error log analysis.
In the above solution, the security control module of the cluster environment includes high availability of nodes and high availability of services.
In the above solution, the creation of the environment preparation script includes:
(1) closing the firewall on each node;
(2) configuring the hosts file on each node so that host names map to IP addresses;
(3) setting up passwordless SSH login between the two master nodes and each worker node;
(4) raising the maximum number of open files on each node;
(5) configuring a file server;
(6) configuring a local yum repository;
(7) configuring a time server and its clients;
(8) installing the JDK on each node and configuring its environment variables;
(9) configuring HugePages on each node;
(10) installing a MySQL database on a designated node;
(11) installing and configuring the sdp host service.
In the above scheme, the components include HADOOP, HDFS, HBASE, ZOOKEEPER, MAPREDUCE, REDIS, ELASTICSEARCH, SPARK, and STORM.
Through the technical scheme, the SDP big data automatic deployment monitoring platform provided by the invention has the following advantages:
1. Rapid deployment:
The system integrates the most common service components of the Hadoop ecosystem and provides a concise, visual installation wizard; platform deployment is completed in one pass, and the whole deployment process can be finished within one hour.
2. Unified cluster and service management:
SDP provides visual cluster management support together with rich development components and service integration; HDFS, HIVE, HBASE and the like can be installed quickly, and running resources and tasks are monitored, helping administrators improve operation and maintenance efficiency, guarantee service quality, optimize cluster performance and reduce management cost.
3. Intelligent monitoring and alarm management:
SDP applies a set of seven predefined alert types to every cluster node and service; these alerts monitor the cluster and can raise warnings that help users identify and troubleshoot problems. Users can also create custom alerts and set notification targets to receive the monitoring alerts they care about.
4. High-availability construction:
SDP ensures high availability by establishing primary and secondary components. High availability can be configured for components in many stack services, and once it is configured for a service, SDP can also manage and disable (roll back) the high availability of those components. In addition, SDP supports master-slave hot standby, which protects SDP metadata and keeps the platform safe to use.
5. Data security:
SDP can install a secure (Kerberos-based) Hadoop cluster, thereby providing Hadoop security support, role-based user authentication, authorization and auditing, and integration with LDAP and Active Directory for user management.
6. Product stack:
Most Hadoop ecosystem services, including Spark and Storm, can be installed on the platform.
Configuration dependencies and version dependencies among services are shielded at the bottom layer, so users only need to care about using the service components, which improves production efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a functional framework diagram of an SDP big data automation deployment monitoring platform disclosed in the embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
The invention provides an SDP big data automatic deployment monitoring platform which, as shown in Fig. 1, can be deployed rapidly, provides richer visual operations and more types of alarm management, and keeps data and services safer.
An SDP big data automatic deployment monitoring platform comprises a cluster environment creation module, a cluster environment monitoring module, a cluster environment operation module, a cluster environment alarm module, a cluster environment log analysis module, a cluster environment security control module and a cluster environment user role and authority module.
First, creation module of cluster environment
In the above solution, the cluster environment creation module comprises the creation of an environment preparation script and the configuration and installation of the cluster environment.
In a further technical solution, the creation of the environment preparation script includes the following steps (a minimal automation sketch follows the list):
(1) closing the firewall on each node;
(2) configuring the hosts file on each node so that host names map to IP addresses;
(3) setting up passwordless SSH login between the two master nodes and each worker node;
(4) raising the maximum number of open files on each node;
(5) configuring a file server;
(6) configuring a local yum repository;
(7) configuring a time server and its clients;
(8) installing the JDK on each node and configuring its environment variables;
(9) configuring HugePages on each node;
(10) installing a MySQL database on a designated node;
(11) installing and configuring the sdp host service.
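As an illustration only, a few of these preparation steps could be scripted as below with Python's standard subprocess module; the commands (systemctl/firewalld), file paths and the time-server host name are assumptions for a CentOS/RHEL-style node run as root, and are not prescribed by this disclosure.

    # Minimal sketch, assuming a CentOS/RHEL-style node run as root; command
    # names, paths and the NTP host are illustrative, not part of the platform.
    import subprocess

    def run(cmd):
        """Run a shell command and fail loudly if it returns a non-zero code."""
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    def prepare_node(hosts_entries, open_files=65535):
        # step (1): close the firewall on this node
        run("systemctl stop firewalld && systemctl disable firewalld")
        # step (2): map host names to IP addresses in /etc/hosts
        with open("/etc/hosts", "a") as f:
            for ip, name in hosts_entries:
                f.write(f"{ip} {name}\n")
        # step (4): raise the maximum number of open files
        with open("/etc/security/limits.conf", "a") as f:
            f.write(f"* soft nofile {open_files}\n* hard nofile {open_files}\n")
        # step (7): sync this node against a (hypothetical) internal time server
        run("ntpdate time.cluster.local")

    if __name__ == "__main__":
        prepare_node([("192.168.1.10", "master1"), ("192.168.1.11", "master2")])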
In the above scheme, the cluster environment includes components, services, interfaces, models, databases, and tools.
Wherein,
Components: HADOOP, HDFS, HBASE, ZOOKEEPER, MAPREDUCE, REDIS, ELASTICSEARCH, SPARK, STORM, Kafka, etc.
1. Hadoop has the following properties:
Convenient: Hadoop runs on large clusters of commodity machines or on cloud computing services.
Robust: Hadoop is intended to run on ordinary commercial hardware; its architecture assumes that hardware will fail frequently, and Hadoop can gracefully handle most such failures.
Scalable: Hadoop scales linearly to handle larger data sets by adding cluster nodes.
2. HBase:
HBase is a highly reliable, high-performance, column-oriented and scalable distributed storage system; using HBase, a large-scale structured storage cluster can be built on low-cost PC servers. HBase is an open-source implementation of Google Bigtable: where Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS; where Google runs MapReduce to process the mass data in Bigtable, HBase uses Hadoop MapReduce to process the mass data in HBase; and where Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper. As for the relationship between HBase and HDFS: HBase stores its data on HDFS much as MySQL stores data on a disk; MySQL is the application, and the disk is the concrete storage medium. Because of its own characteristics, HDFS is not suited to random lookups and is unfriendly to update operations; for example, a network disk built on HDFS (such as Baidu Netdisk) supports uploading and deleting files but does not let users directly modify the content of a file on the disk.
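To make the HBase-on-HDFS idea concrete, here is a minimal sketch using the happybase Thrift client for HBase; the host, table and column names are hypothetical and assume an HBase Thrift server is running.

    # Minimal sketch, assuming an HBase Thrift server at the hypothetical host
    # 'hbase-master'; table name and column family are illustrative only.
    import happybase

    connection = happybase.Connection("hbase-master")
    connection.create_table("metrics", {"cf": dict()})   # one column family 'cf'
    table = connection.table("metrics")

    # Random writes and reads by row key are what HBase adds on top of HDFS,
    # which by itself is poor at random lookup and update.
    table.put(b"node-01#cpu", {b"cf:value": b"0.73"})
    print(table.row(b"node-01#cpu"))
    connection.close()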
3. Characteristics of Kafka
Kafka is distributed: all of its components, namely the broker cluster, the producers and the consumers, can be distributed.
Messages are distinguished at production time by a topic identifier and can be partitioned; each partition is an ordered, immutable message queue to which messages are continuously appended.
Kafka provides high throughput for both publishing and subscribing; roughly, it can produce about 250,000 messages per second (50 MB) and process about 550,000 messages per second (110 MB).
The processing state of a message is maintained on the consumer side rather than by the server side, and automatic balancing can be achieved when failures occur.
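For illustration, a producer/consumer pair written with the kafka-python client shows the topic model just described; the broker address and topic name are assumptions.

    # Minimal sketch using the kafka-python package; the broker address
    # 'kafka-01:9092' and the topic 'cluster-alerts' are illustrative assumptions.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="kafka-01:9092")
    producer.send("cluster-alerts", b"datanode disk usage above 90%")
    producer.flush()

    # The consumer tracks its own position (offset); the broker keeps no
    # per-message "processed" flag, as noted above.
    consumer = KafkaConsumer("cluster-alerts",
                             bootstrap_servers="kafka-01:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for message in consumer:
        print(message.partition, message.offset, message.value)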
4. Storm is a free, open-source, distributed, highly fault-tolerant real-time computing system. Storm makes continuous stream computation easy and fills the real-time gap that Hadoop batch processing cannot cover. Storm is often used for real-time analytics, online machine learning, continuous computation, distributed RPC and ETL. Storm is very simple to deploy and manage, and its performance stands out among comparable streaming tools.
Storm consists mainly of two components, Nimbus and Supervisor. Both are fail-fast and stateless: task states, heartbeat information and the like are stored on ZooKeeper, and the submitted code resources are kept on the local machine's disk.
Nimbus is responsible for distributing code within the cluster, assigning work to machines and monitoring their status. There is only one Nimbus in the whole cluster.
A Supervisor listens for the work assigned to its machine and starts or shuts down worker processes (Workers) as required. One Supervisor is deployed on every machine that runs Storm, and the number of slots it offers is set according to the machine's configuration.
ZooKeeper is an external resource on which Storm depends heavily: Nimbus, the Supervisors and even the running Workers store their heartbeats on ZooKeeper, and Nimbus performs scheduling and task allocation according to the heartbeats and task states recorded there.
ElasticSearch: ElasticSearch (ES for short) is a distributed, RESTful search and analysis server designed for distributed computing; it achieves near-real-time search and is stable, reliable and fast. Like Apache Solr, it is a Lucene-based index server, but ElasticSearch has several advantages over Solr:
Lightweight: installation and startup are easy; a single command starts the server once the package has been downloaded.
Schema free: ElasticSearch does not require a predefined schema, whereas Solr specifies the index structure with a schema.xml file.
Multi-index support: a new index can be created with different index parameters, which in Solr requires separate configuration.
Distributed: ElasticSearch is distributed out of the box, whereas SolrCloud configuration is relatively complex.
At the beginning of 2013, GitHub abandoned Solr and adopted ElasticSearch for PB-scale search.
ElasticSearch has developed rapidly in recent years and has outgrown the role of a pure search engine, adding data aggregation analysis (aggregation) and visualization features. If you have millions of documents that need to be located by keywords, ElasticSearch is certainly the best choice; and if your documents are JSON, you can also treat ElasticSearch as a kind of "NoSQL database" and apply its aggregation capability to multidimensional analysis of the data.
Some notable ElasticSearch cases at home and abroad:
GitHub: "GitHub uses ElasticSearch to search 20 TB of data, including 1.3 billion files and 130 billion lines of code."
SoundCloud: "SoundCloud uses ElasticSearch to provide instant and precise music search service to 180 million users."
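A short sketch with the official elasticsearch Python client (8.x-style API assumed) illustrates indexing and keyword search; the host, index and field names are hypothetical.

    # Minimal sketch using the official 'elasticsearch' Python client (8.x-style
    # API); host, index name and field names are illustrative assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://es-01:9200")
    es.index(index="error-logs", document={"service": "hdfs",
                                           "level": "ERROR",
                                           "message": "DataNode heartbeat lost"})
    es.indices.refresh(index="error-logs")

    # Keyword search over the indexed documents.
    result = es.search(index="error-logs",
                       query={"match": {"message": "heartbeat"}})
    for hit in result["hits"]["hits"]:
        print(hit["_score"], hit["_source"])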
5. ZooKeeper
As the number of compute nodes increases, cluster members need to synchronize with one another and to learn where to access services and how they are configured, and that is exactly what ZooKeeper does. ZooKeeper is, as its name suggests, a zoo keeper: the administrator that manages the elephant (Hadoop), the bee (Hive) and the pig (Pig); projects such as Apache HBase, Apache Solr and LinkedIn's Sensei all adopt ZooKeeper. ZooKeeper is an open-source distributed application coordination service which, based on the Fast Paxos algorithm, provides synchronization, configuration maintenance, naming and similar services for distributed applications.
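As an illustration, the kazoo Python client can express this kind of coordination; the ensemble address and znode paths below are hypothetical.

    # Minimal sketch using the kazoo client; the ZooKeeper ensemble address and
    # the znode paths are illustrative assumptions.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk-01:2181,zk-02:2181,zk-03:2181")
    zk.start()

    # Publish this node's service address so other cluster members can find it;
    # an ephemeral znode disappears automatically if the owning process dies.
    zk.ensure_path("/sdp/services/hdfs")
    zk.create("/sdp/services/hdfs/namenode", b"192.168.1.10:8020", ephemeral=True)

    # Read it back, as another cluster member would.
    data, stat = zk.get("/sdp/services/hdfs/namenode")
    print("namenode registered at", data.decode())
    zk.stop()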
6. Spark is an Apache project billed as "lightning-fast cluster computing". It has a thriving open-source community and is currently the most active Apache project.
Spark provides a faster and more general data processing platform. Compared with Hadoop, Spark can run programs up to 100 times faster in memory, or 10 times faster on disk. In the past year, Spark surpassed Hadoop in the 100 TB Daytona GraySort contest, using only one tenth of the machines while running about three times faster.
Spark Core is a basic engine for massively parallel and distributed data processing. It is mainly responsible for:
memory management and fault recovery;
scheduling, distributing, and monitoring jobs across the cluster;
interacting with the storage system.
Spark introduces a concept called the resilient distributed dataset (RDD), an immutable, fault-tolerant collection of distributed objects that can be operated on in parallel. An RDD can contain objects of any type and is created by loading an external data set or by distributing a collection from the driver application.
RDDs support two types of operations:
A transformation is an operation (such as map, filter, join or union) that is performed on an RDD and creates a new RDD holding the result.
An action is an operation (such as reduce, count or first) that performs some computation on an RDD and returns the result.
In Spark, transformations are "lazy", meaning they do not compute their results immediately. Instead, they simply "remember" the operations to be performed and the data sets (e.g. files) they are to be performed on. The transformations are only actually computed when an action is invoked, and the result is then returned to the driver program. This design lets Spark run more efficiently; for example, if a large file is transformed in various ways and passed to its first action, Spark only processes the first line of the file and returns that result instead of processing the entire file.
By default, an RDD may be recomputed each time an action is run on it. However, an RDD can also be persisted in memory with the persist or cache method, in which case Spark keeps its elements on the cluster and subsequent queries on it are much faster.
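The lazy-evaluation behaviour described above can be seen in a few lines of PySpark; the input path below is hypothetical.

    # PySpark sketch of lazy transformations vs. actions; the HDFS path is a
    # hypothetical example.
    from pyspark import SparkContext

    sc = SparkContext(appName="rdd-demo")
    lines = sc.textFile("hdfs:///logs/events.log")       # nothing is read yet
    errors = lines.filter(lambda l: "ERROR" in l)        # transformation: still lazy

    errors.persist()                                     # keep this RDD in memory
    print(errors.count())                                # action: triggers computation
    print(errors.first())                                # action: reuses the cached RDD
    sc.stop()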
SparkSQL
Spark SQL is a component of Spark that supports querying data in SQL or the Hive query language. It originated from the Apache Hive project as a port to run on Spark (instead of MapReduce) and is now integrated into the Spark stack.
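A minimal Spark SQL sketch shows this SQL interface; the Hive table name is hypothetical, and Hive support is assumed to be enabled.

    # Minimal Spark SQL sketch; the Hive table 'ops.task_log' is a hypothetical
    # name, and Spark is assumed to be built with Hive support.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("sql-demo")
             .enableHiveSupport()
             .getOrCreate())
    df = spark.sql("SELECT status, COUNT(*) AS n FROM ops.task_log GROUP BY status")
    df.show()
    spark.stop()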
Service: the thread-level services through which a component provides a programming interface or runs background tasks.
Interface: a number of interfaces are opened for external applications to access, such as data docking interfaces; various data forms and data sources are supported, including Excel, JSON, files, databases, data warehouses and the like.
Model: several basic machine learning models are provided, such as a clustering model, a Pupperwell model and a linear regression model.
Database: databases of various forms are created, including the cache database Redis, the relational databases Oracle and MySQL, and the data warehouse Hive.
Tool: the data transfer tool Sqoop, the ETL cleaning tool Kettle, and so on.
Second, monitoring module of cluster environment
In the above scheme, the monitoring module of the cluster environment includes hardware usage monitoring, component service condition monitoring, data storage condition monitoring, alarm condition monitoring, task execution condition monitoring, configuration condition monitoring, and node operation condition monitoring.
Hardware usage monitoring covers memory, hard disk, network, CPU, HDFS and the like.
Component service condition monitoring covers basic information about a component and its resource usage, such as NameNode and DataNode status, JournalNodes, disk usage (DFS used, non-DFS used, remaining space), NameNode GC count, NameNode GC time, NameNode connection load, NameNode heap, NameNode host load, NameNode RPC, and so on.
Data storage condition monitoring covers the distribution of data on the distributed file system HDFS and its resource occupation.
Alarm condition monitoring covers the number, type and nature of alarms.
Task execution condition monitoring covers the completion percentage, elapsed time, errors and results of tasks.
Configuration condition monitoring refers to the state of the configured parameters.
Node operation condition monitoring covers a node's running state, IP, memory, disk usage, average load and the like.
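As one possible way to collect the node-level figures just listed (memory, disk usage, average load), an agent could sample them with the psutil package; the dictionary keys below are an assumption, not the platform's actual reporting format.

    # Sketch of a node-level metrics sample using psutil; the field names are
    # illustrative, not SDP's actual wire format.
    import os, socket, psutil

    def sample_node_metrics():
        disk = psutil.disk_usage("/")
        return {
            "host": socket.gethostname(),
            "cpu_percent": psutil.cpu_percent(interval=1),
            "mem_percent": psutil.virtual_memory().percent,
            "disk_used_gb": round(disk.used / 2**30, 1),
            "disk_percent": disk.percent,
            "load_avg_1m": os.getloadavg()[0],   # one-minute load average
        }

    if __name__ == "__main__":
        print(sample_node_metrics())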
Third, operation module of cluster environment
In the above solution, the operation module of the cluster environment includes operations of nodes, services, background management, alarms, and data.
The node operations mainly include adding nodes and deleting nodes.
The operations available for each node mainly include starting all components, stopping all components, restarting all components, reinstalling failed components, turning maintenance mode on, turning maintenance mode off, setting the rack, downloading the client configuration files, and so on.
The service operations include starting and stopping all services, displaying a service operation summary, adding services, executing service actions, restarting after installation, monitoring background operations, deleting services, auditing services, using quick links, refreshing the YARN capacity scheduler and managing HDFS.
The background management operations include adding users, setting roles and setting permissions.
The alarm operations include deleting alarms, adding custom alarms, managing early warnings, managing notifications and managing reminder settings. The reminder setting specifies the number of checks: set the number of checks before configuring a notification; if the status changes during an alarm check, the system retries the configured number of times before sending the notification. If a temporary problem in the environment would cause false alarms, increase this value.
Data operations: visual SQL operation is supported, so data stored in components or services such as Hive, HBase and HDFS can be queried directly with simple SQL, reducing the need for command-line tools.
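As an illustration of querying Hive with plain SQL rather than shell commands, a client such as PyHive could be used as below; the HiveServer2 host, port, user and table name are assumptions.

    # Sketch of the "query stored data with simple SQL" idea using the PyHive
    # client; host, port, username and table name are illustrative assumptions.
    from pyhive import hive

    conn = hive.Connection(host="hive-server", port=10000, username="sdp")
    cursor = conn.cursor()
    cursor.execute("SELECT level, COUNT(*) FROM error_logs GROUP BY level")
    for level, count in cursor.fetchall():
        print(level, count)
    conn.close()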
Fourth, alarm module of cluster environment
In the above solution, the alarm module of the cluster environment includes WEB alarm, PORT alarm, METRIC alarm, AGGREGATE alarm, SCRIPT alarm, SERVER alarm and RECOVERY alarm.
1. WEB alerts
A WEB alert monitors a web URL on a given component; the alert status is determined from the HTTP response code, so the response codes that determine the status of a web alert cannot be changed. You can, however, customize each threshold, the response text and the connection timeout of the whole web check. A connection timeout is treated as a critical alert, and the threshold unit is seconds. The response codes and corresponding statuses of a WEB alert are as follows:
Normal status: the web URL response code is below 400.
Warning status: the web URL response code is 400 or greater.
Error status: SDP cannot connect to the web URL.
2. PORT alert
A PORT alert checks the response time of a connection to a given port; the threshold unit is seconds.
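A simplified check in the spirit of the WEB and PORT alert types might look as follows; the URL, port and thresholds are hypothetical, and this is not the platform's actual alert engine.

    # Simplified WEB and PORT checks in the spirit of the alert types above;
    # the URL, port and thresholds are hypothetical examples.
    import socket, time, requests

    def web_alert(url, timeout=5.0):
        try:
            code = requests.get(url, timeout=timeout).status_code
        except requests.RequestException:
            return "ERROR"                        # cannot connect to the web URL
        return "OK" if code < 400 else "WARNING"  # response code threshold is 400

    def port_alert(host, port, warn_after=1.5, crit_after=5.0):
        start = time.time()
        try:
            with socket.create_connection((host, port), timeout=crit_after):
                elapsed = time.time() - start
        except OSError:
            return "ERROR"
        return "OK" if elapsed < warn_after else "WARNING"

    print(web_alert("http://namenode:50070"), port_alert("namenode", 8020))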
3. METRIC alert
A METRIC alert checks one or more monitored metric values (computed if necessary); the metrics are available from a URL endpoint on the given component. A connection timeout is treated as a false alarm. The thresholds are adjustable, and the unit of each threshold depends on what is being monitored; for example, for a CPU utilization alert the unit is a percentage, while for an RPC latency alert the unit is milliseconds.
4. AGGREGATE alert
An AGGREGATE alert aggregates the states of an instance-level alert and expresses the degree of impact as a percentage. For example, the DataNode percentage reflects how strongly the DataNode alert is affecting the cluster.
5. SCRIPT alert
A SCRIPT alert executes a script, and the result of the script determines the alert state: normal, warning or error. The response text, attribute values and thresholds of a script alert can be customized.
6. SERVER alert
A SERVER alert executes a server-side runnable class that determines the alert state, such as normal, warning or error.
7. RECOVERY alert
RECOVERY alerts are handled by the SDP components that monitor and automatically restart processes; the alert state of normal, warning or error is based on the number of times a process has been automatically restarted. This is useful for knowing when a process has terminated and SDP has restarted it automatically.
Fifth, log analysis module of cluster environment
In the above scheme, the log analysis module of the cluster environment includes user behavior analysis and error log analysis.
1. User behavior analysis
All operation records of cluster users are recorded, including login ID, time, duration and the like, together with the various operations a user performs in the system, such as which buttons were clicked at a given time, when a node was added, when a configuration was changed, and so on.
2. Error log analysis
The running logs of every component and service are collected in real time and analyzed for errors; the error points likely to cause problems are reported, so that errors are found in time and operation and maintenance personnel can correct them promptly.
Sixth, security control module of cluster environment
In the above solution, the security control module of the cluster environment includes high availability of nodes and high availability of services.
The SDP web interface provides a wizard-driven user experience that allows you to configure high availability for the components of many Hortonworks Data Platform (HDP) stack services. High availability is ensured by establishing primary and secondary components: if the primary component fails or becomes unavailable, the secondary component takes over. After high availability has been configured for a service, SDP lets you manage and disable (roll back) the high availability of its components.
High availability of nodes: HA prevents single-node failure; if one node fails, another can be switched in and started automatically to take over the function of the failed node.
High availability of services: a service provides access and functionality to other parties. Single-service failure is prevented: if one service fails, another can be switched in and started automatically to take over the function of the failed service.
Seventh, user role and authority module of cluster environment
A strict access verification mechanism is provided, so that permissions can be refined down to individual buttons and even to different data, enabling multi-dimensional permission control.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An SDP big data automatic deployment monitoring platform is characterized by comprising a cluster environment creating module, a cluster environment monitoring module, a cluster environment operating module, a cluster environment warning module, a cluster environment log analyzing module, a cluster environment safety control module and a cluster environment user role and authority module.
2. The SDP big data automation deployment monitoring platform of claim 1, wherein the creation module of the cluster environment comprises creation of an environment preparation script and configuration installation of the cluster environment.
3. The SDP big data automation deployment monitoring platform of claim 1, wherein the cluster environment comprises components, services, interfaces, models, databases, and tools.
4. The SDP big data automation deployment monitoring platform of claim 1, wherein the monitoring modules of the cluster environment comprise hardware usage monitoring, component service condition monitoring, data storage condition monitoring, alarm condition monitoring, task execution condition monitoring, configuration condition monitoring and node operation condition monitoring.
5. The SDP big data automation deployment monitoring platform of claim 1, wherein the operation modules of the cluster environment comprise operations of nodes, services, background management operations, alarms and data.
6. The SDP big data automation deployment monitoring platform of claim 1, wherein the alarm modules of the cluster environment comprise a WEB alert, a PORT alert, a METRIC alert, an AGGREGATE alert, a SCRIPT alert, a SERVER alert, and a RECOVERY alert.
7. The SDP big data automation deployment monitoring platform of claim 1, wherein the log analysis module of the cluster environment comprises user behavior analysis and error log analysis.
8. The SDP big data automation deployment monitoring platform of claim 1, wherein a security control module of the cluster environment comprises a high availability of nodes and a high availability of services.
9. The SDP big data automation deployment monitoring platform of claim 2, wherein the creation of the environment preparation script comprises:
(1) closing the firewall on each node;
(2) configuring the hosts file on each node so that host names map to IP addresses;
(3) setting up passwordless SSH login between the two master nodes and each worker node;
(4) raising the maximum number of open files on each node;
(5) configuring a file server;
(6) configuring a local yum repository;
(7) configuring a time server and its clients;
(8) installing the JDK on each node and configuring its environment variables;
(9) configuring HugePages on each node;
(10) installing a MySQL database on a designated node;
(11) installing and configuring the sdp host service.
10. The SDP big data automation deployment monitoring platform of claim 3, wherein the components comprise HADOOP, HDFS, HBASE, ZOOKEEPER, MAPREDUCE, REDIS, ELASTICSEARCH, SPARK, STORM.
CN201711105672.9A 2017-11-10 2017-11-10 A kind of SDP big datas automatically dispose monitor supervision platform Pending CN107682209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711105672.9A CN107682209A (en) 2017-11-10 2017-11-10 A kind of SDP big datas automatically dispose monitor supervision platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711105672.9A CN107682209A (en) 2017-11-10 2017-11-10 A kind of SDP big datas automatically dispose monitor supervision platform

Publications (1)

Publication Number Publication Date
CN107682209A true CN107682209A (en) 2018-02-09

Family

ID=61146539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711105672.9A Pending CN107682209A (en) 2017-11-10 2017-11-10 A kind of SDP big datas automatically dispose monitor supervision platform

Country Status (1)

Country Link
CN (1) CN107682209A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549717A (en) * 2018-04-23 2018-09-18 泰华智慧产业集团股份有限公司 The method and system of automatically dispose O&M Hadoop ecology coil assemblies
CN109101811A (en) * 2018-08-10 2018-12-28 成都安恒信息技术有限公司 A kind of O&M and auditing method of the controllable Oracle session based on the tunnel SSH
CN110266800A (en) * 2019-06-24 2019-09-20 合肥盈川信息技术有限公司 A kind of wisdom text Luda data supervising platform
CN110286921A (en) * 2019-06-27 2019-09-27 四川中电启明星信息技术有限公司 A kind of distributed big data platform CDH method of automation installation
CN110580203A (en) * 2019-08-19 2019-12-17 武汉长江通信智联技术有限公司 Data processing method, device and system based on elastic distributed data set
CN110764788A (en) * 2019-09-10 2020-02-07 武汉联影医疗科技有限公司 Cloud storage deployment method and device, computer equipment and readable storage medium
CN111490990A (en) * 2020-04-10 2020-08-04 吴萌萌 Network security analysis method based on big data platform and big data platform server
CN111901158A (en) * 2020-07-14 2020-11-06 广东科徕尼智能科技有限公司 Intelligent home distribution network fault data analysis method, equipment and storage medium
CN112241269A (en) * 2019-07-16 2021-01-19 深圳兆日科技股份有限公司 Zookeeper cluster control system, equipment and storage medium
CN113704069A (en) * 2021-07-20 2021-11-26 北京直真科技股份有限公司 Alarm system fault positioning method based on flash log collection technology
CN113885387A (en) * 2021-10-11 2022-01-04 青岛萨纳斯智能科技股份有限公司 SDP-based big data monitoring method and device, terminal equipment and platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103877A1 (en) * 2014-10-10 2016-04-14 International Business Machines Corporation Joining data across a parallel database and a distributed processing system
US20160253340A1 (en) * 2015-02-27 2016-09-01 Podium Data, Inc. Data management platform using metadata repository
CN106209821A (en) * 2016-07-07 2016-12-07 何钟柱 The big data management system of information security based on credible cloud computing
CN106326006A (en) * 2016-08-23 2017-01-11 成都卡莱博尔信息技术股份有限公司 Task management system aiming at task flow of data platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103877A1 (en) * 2014-10-10 2016-04-14 International Business Machines Corporation Joining data across a parallel database and a distributed processing system
US20160253340A1 (en) * 2015-02-27 2016-09-01 Podium Data, Inc. Data management platform using metadata repository
CN106209821A (en) * 2016-07-07 2016-12-07 何钟柱 The big data management system of information security based on credible cloud computing
CN106326006A (en) * 2016-08-23 2017-01-11 成都卡莱博尔信息技术股份有限公司 Task management system aiming at task flow of data platform

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
江樱 et al., "Design and Practice of a Big Data Management Visualization Platform" (大数据管理可视化平台设计与实践), 《大众用电》 *
那超, "Design and Implementation of an Automatic Deployment and Monitoring System for a Big Data Platform" (大数据平台的自动化部署与监控系统设计与实现), China Master's Theses Full-text Database, Information Science and Technology (《中国优秀硕士学位论文全文数据库信息科技辑》) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549717B (en) * 2018-04-23 2021-06-29 泰华智慧产业集团股份有限公司 Method and system for automatically deploying operation and maintenance Hadoop ecological circle component
CN108549717A (en) * 2018-04-23 2018-09-18 泰华智慧产业集团股份有限公司 The method and system of automatically dispose O&M Hadoop ecology coil assemblies
CN109101811A (en) * 2018-08-10 2018-12-28 成都安恒信息技术有限公司 A kind of O&M and auditing method of the controllable Oracle session based on the tunnel SSH
CN109101811B (en) * 2018-08-10 2021-10-15 成都安恒信息技术有限公司 Operation, maintenance and audit method of controllable Oracle session based on SSH tunnel
CN110266800A (en) * 2019-06-24 2019-09-20 合肥盈川信息技术有限公司 A kind of wisdom text Luda data supervising platform
CN110286921A (en) * 2019-06-27 2019-09-27 四川中电启明星信息技术有限公司 A kind of distributed big data platform CDH method of automation installation
CN110286921B (en) * 2019-06-27 2023-11-10 四川中电启明星信息技术有限公司 CDH method for automatically installing distributed big data platform
CN112241269B (en) * 2019-07-16 2024-05-10 深圳兆日科技股份有限公司 Zookeeper cluster control system, device and storage medium
CN112241269A (en) * 2019-07-16 2021-01-19 深圳兆日科技股份有限公司 Zookeeper cluster control system, equipment and storage medium
CN110580203A (en) * 2019-08-19 2019-12-17 武汉长江通信智联技术有限公司 Data processing method, device and system based on elastic distributed data set
CN110764788A (en) * 2019-09-10 2020-02-07 武汉联影医疗科技有限公司 Cloud storage deployment method and device, computer equipment and readable storage medium
CN111490990A (en) * 2020-04-10 2020-08-04 吴萌萌 Network security analysis method based on big data platform and big data platform server
CN111901158A (en) * 2020-07-14 2020-11-06 广东科徕尼智能科技有限公司 Intelligent home distribution network fault data analysis method, equipment and storage medium
CN111901158B (en) * 2020-07-14 2023-07-25 广东好太太智能家居有限公司 Intelligent household distribution network fault data analysis method, equipment and storage medium
CN113704069A (en) * 2021-07-20 2021-11-26 北京直真科技股份有限公司 Alarm system fault positioning method based on flash log collection technology
CN113885387A (en) * 2021-10-11 2022-01-04 青岛萨纳斯智能科技股份有限公司 SDP-based big data monitoring method and device, terminal equipment and platform

Similar Documents

Publication Publication Date Title
CN107682209A (en) A kind of SDP big datas automatically dispose monitor supervision platform
US20200242129A1 (en) System and method to improve data synchronization and integration of heterogeneous databases distributed across enterprise and cloud using bi-directional transactional bus of asynchronous change data system
US10089307B2 (en) Scalable distributed data store
US10509696B1 (en) Error detection and mitigation during data migrations
US8312037B1 (en) Dynamic tree determination for data processing
US9336288B2 (en) Workflow controller compatibility
WO2023142054A1 (en) Container microservice-oriented performance monitoring and alarm method and alarm system
US12014248B2 (en) Machine learning performance and workload management
CN111241078A (en) Data analysis system, data analysis method and device
US11727004B2 (en) Context dependent execution time prediction for redirecting queries
CN112162821B (en) Container cluster resource monitoring method, device and system
US11488082B2 (en) Monitoring and verification system for end-to-end distribution of messages
WO2022165168A1 (en) Configuring an instance of a software program using machine learning
CN112148578A (en) IT fault defect prediction method based on machine learning
KR20170053013A (en) Data Virtualization System for Bigdata Analysis
CN107203639A (en) Parallel file system based on High Performance Computing
Hu et al. ScalaRDF: a distributed, elastic and scalable in-memory RDF triple store
CN117056303B (en) Data storage method and device suitable for military operation big data
Xiao et al. RETRACTED ARTICLE: Cloud platform wireless sensor network detection system based on data sharing
US11500874B2 (en) Systems and methods for linking metric data to resources
US11757703B1 (en) Access requests processing and failover handling across multiple fault tolerance zones
Chullipparambil Big data analytics using Hadoop tools
Chen et al. Big data storage architecture design in cloud computing
Li Design of real-time data analysis system based on Impala
Zburivsky Hadoop cluster deployment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180209)