CN111327681A - Cloud computing data platform construction method based on Kubernetes

Info

Publication number: CN111327681A
Application number: CN202010068966.4A
Authority: CN
Inventors: 王凌霄; 张建
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2020-01-21
Filing date: 2020-01-21
Publication date: 2020-06-23

Abstract

The invention discloses a Kubernetes-based method for constructing a cloud computing data platform. Resource isolation and control are realized through Docker, and containers are managed and orchestrated by Kubernetes; data acquisition and the transmission and management of data streams are realized through Flume + Kafka; the acquired data is processed with the Spark computing framework and stored in HDFS (Hadoop Distributed File System), a distributed storage system that allows storage nodes to be expanded dynamically; query and analysis functions are realized by combining ElasticSearch with HBase + Phoenix. The invention realizes data acquisition, storage, analysis and monitoring for the whole platform, efficiently completes the processing, circulation and storage of data, acquires resources on demand through a micro-service model, avoids unnecessary resource waste, and realizes load balancing, disaster recovery and elastic scaling of the cluster.

Description

Cloud computing data platform construction method based on Kubernetes

Technical Field

The invention belongs to the field of cloud computing big data, and particularly relates to a cloud computing platform and a storage framework.

Background

IT technology is updated continuously and new technologies keep emerging. Cloud computing is a new computing model that, after more than ten years of rapid development, is widely applied to design and research and development in many fields. The big data era built on cloud computing has emerged with it: the data volume of many industries has reached the terabyte (TB) scale, and data grows explosively with the development of technical industries such as AI, IoT and big data. Traditional data architectures can no longer meet current big data processing requirements. Adopting cloud computing can greatly reduce the complexity of a data platform, simplify operation and maintenance work, improve resource utilization, and achieve server load balancing.

Existing cloud computing data platforms realize distributed computing on top of a distributed storage system, which greatly improves computing capacity. They establish a data platform backed by mass data services, can bear the access pressure of tens of millions of page views (PV) per day, achieve second-level latency in real-time processing, process user data in real time, analyze user behavior, mine the internal relationships in the data, and provide data support for a company's decision-makers. However, as data grows at the TB level every day, the pressure on the data platform also grows daily and the platform cannot cope with real-time load. During peak periods the platform cannot scale out horizontally to thousands of nodes and cannot be deployed automatically; cluster expansion and deployment require manual management, and the complex operating procedures introduce additional problems. A platform is therefore needed that copes with real-time load, reduces human error, improves the utilization of cluster resources, and improves the working efficiency of enterprises.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a cloud computing data platform construction method based on Kubernetes. The technical scheme adopted by the invention is as follows:

1. Flume is used as the log collection system. It provides highly available, highly reliable, distributed collection, aggregation and transmission of massive logs, collects log data from various log sources, and stores the log data on HDFS for centralized statistical analysis and processing. The overall architecture is divided into three layers: an Agent layer, a Collector layer and a Store layer. In the Agent layer, each machine runs one process that is responsible for log collection on that machine; the Collector layer is deployed on central servers, receives the logs sent by the Agent layer, and writes them to the corresponding Store layer according to routing rules; the Store layer provides permanent or temporary log storage services, or directs log streams to other servers. Agents use a LoadBalance strategy to send logs evenly to all Collectors, which achieves load balancing and lets the Collectors scale linearly. The Collector layer has three main targets: SinkHdfs, SinkKafka and SinkBypass, which provide offline data to HDFS and real-time log streams to Kafka and Bypass, respectively;

2. Sqoop is used as the batch data migration tool. Sqoop imports data from relational databases such as MySQL and Oracle into the HDFS of Hadoop, and exports data from the Hadoop file system back to a relational database;

3. Kafka is used as a multi-type data pipeline and messaging system. Its architecture is divided into three layers: producer, broker and consumer. The producer can be server logs, service data, PV generated by the page front end, and the like; the broker stores the published messages and can be scaled horizontally, which ensures the high throughput of Kafka; the consumer pulls data from the broker to provide a data source for subsequent consumption by the Spark cluster. Kafka can guarantee the real-time delivery and ordering of the data. In addition, Kafka uses ZooKeeper to realize dynamic cluster expansion and a dynamic load balancing mechanism;

4. HDFS is used as the data storage system. Hadoop 2.x introduces a dual-NameNode architecture in which HA configures the two NameNodes in Active/Passive states: the NameNode in the Active state is called the Active NameNode, and the NameNode in the Passive state is called the Standby NameNode. The Standby NameNode serves as the hot backup of the Active NameNode and can take over as the Active NameNode automatically and gracefully when the Active NameNode fails or must be restarted for routine server maintenance. The NameNode acts as the Master: it manages the HDFS namespace, configures the replication policy and the mapping information of data blocks, and processes read and write requests from clients. The DataNode acts as the Slave and is where data is actually stored. The data collected by Flume, the data migrated by Sqoop and the data collected by Kafka in real time are stored on HDFS;

5. Spark is used as the core computing engine of the cloud computing data platform, and real-time processing of data is realized with Spark SQL and Spark Streaming. The core of Spark is the RDD (resilient distributed dataset). Each RDD is divided into multiple partitions, and different partitions run on different nodes of the cluster; partitions are the computing units of Spark. A group of RDDs forms an executable directed acyclic graph (the RDD Graph), and data processing is realized by defining a series of transformation and action operations on RDDs. The DAG Scheduler builds a Stage-based DAG for each Job and submits the Stages to the Task Scheduler, which then distributes the Tasks to Executors to execute the final tasks;

6. ElasticSearch is used as the real-time search and analysis engine, which can perform and combine multiple types of searches. ElasticSearch is built on Lucene and provides full indexing and search functionality. Multiple ES process instances are started on multiple machines to form an ES cluster; a master node is elected within the cluster to manage the nodes of the whole cluster; the nodes store all of the data, every node belongs to the cluster, and the cluster distributes the data evenly;

7. HBase and Phoenix are used to realize real-time query of big data. HBase has a distributed architecture composed of a Master and Region Servers. The Master coordinates the Region Servers, monitors their states, and balances the load among them; clients connect directly to the Region Servers and obtain the data in HBase through a communication mechanism. Phoenix provides OLTP-related functions and a standard JDBC API, so that an application originally built on JDBC can access HBase directly through Phoenix;

8. Docker containers are used to realize resource isolation and resource control. Images of the different service modules are generated with Docker. Lightweight virtualization is realized through Namespaces: different containers have independent resources, and Network, PID, UID, IPC, UTS, User, Mount and the like are isolated from one another. CGroups are used to limit and account for resources such as container memory, CPU and disk I/O, providing a unified framework for resource management across the whole system. A unified interface is provided for the different applications: all dependencies and libraries required to run an application are packaged in its container, and the services are called through an interface-layer API. In addition, because containers are isolated in communication and in processes, members of the same Namespace can communicate with each other while different Namespaces cannot; communication and data forwarding are realized by attaching the network devices to a network bridge;

9. Kubernetes is used to realize resource scheduling, automatic deployment, and elastic scale-out and scale-in of containerized applications. It is a master-slave distributed architecture consisting of a Master and Nodes. The Master runs four components: etcd, APIServer, Controller Manager and Scheduler. etcd persists the resource objects in the cluster, and the other three schedule and manage cluster resources. Each Node runs three components: Kubelet, Proxy and Docker Daemon, which manage the Pods on the Node and realize load balancing and service proxy functions. In Kubernetes, Flannel is used to re-plan the container cluster network so that every container in the cluster obtains a unique intranet IP, and containers on different Nodes communicate with each other through these intranet IPs.

10. Kerberos is used to realize the authentication service in the cluster. Kerberos uses a traditional shared-key approach to enable communication between a Client and a Server when the network environment cannot be guaranteed to be secure. Kerberos is a third-party authentication mechanism: users and services rely on a third party (the Kerberos server) to verify each other's identity. The Kerberos server itself is called the key distribution center, or KDC.
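
As an illustration of item 1, the Agent-to-Collector LoadBalance strategy and the Collector's routing to its sinks might be sketched as follows. This is a toy model in plain Python rather than Flume configuration: the class names, the round-robin policy and the "realtime" routing rule are assumptions made for the example, and only the sink names SinkHdfs and SinkKafka come from the text.

```python
from collections import defaultdict
from itertools import cycle


class Collector:
    """Receives logs from Agents and routes them to Store-layer sinks."""

    def __init__(self, name, sinks):
        self.name = name
        self.sinks = sinks  # shared mapping: sink name -> list of stored records
        self.received = 0

    def receive(self, record):
        self.received += 1
        # Hypothetical routing rule: every record goes to HDFS for offline use;
        # records flagged "realtime" are also streamed to Kafka.
        self.sinks["SinkHdfs"].append(record)
        if record.get("realtime"):
            self.sinks["SinkKafka"].append(record)


class Agent:
    """Sends each log record to the next Collector in round-robin order."""

    def __init__(self, collectors):
        self._next_collector = cycle(collectors)

    def send(self, record):
        next(self._next_collector).receive(record)


def run_demo():
    sinks = defaultdict(list)
    collectors = [Collector(f"collector-{i}", sinks) for i in range(3)]
    agent = Agent(collectors)
    for i in range(9):
        agent.send({"line": f"log {i}", "realtime": i % 3 == 0})
    return collectors, sinks
```

Because the Agent only cycles over a list of Collectors, adding a Collector to that list spreads the same traffic over more receivers, which is the linear expansion the item describes.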

Drawings

FIG. 1 is a diagram of the cloud computing data platform based on Kubernetes.

FIG. 2 is an architecture diagram of the data platform.

FIG. 3 is a Spark framework diagram.

FIG. 4 is a diagram of the Flume three-layer structure.

FIG. 5 is a diagram of the Kafka three-layer structure.

FIG. 6 is a diagram of the highly available HDFS architecture.

FIG. 7 is an HBase framework diagram.

Detailed Description

The specific implementation is shown in FIG. 1. The invention provides a cloud computing data platform construction method based on Kubernetes, which uses Kubernetes as the container orchestration tool and uses Docker containers to realize isolation, control and scheduling of the IaaS resource layer. IaaS (Infrastructure as a Service) is a resource pool in which IT infrastructure resources such as computing, network and storage are aggregated dynamically through virtualization. End users obtain computing resources from the resource pool as a service to deploy their own systems and applications; they do not need to care how this layer is implemented and only pay for the resources they use.

etcd is an important component of Kubernetes and stores all network configuration and the state information of objects in the cluster, such as the Flannel network information. Kube-DNS is the DNS-based service discovery module of Kubernetes; by registering each Service in DNS it provides simple service registration, discovery and load balancing. Kubernetes is a master-slave distributed architecture. Four components run on the Master: etcd, APIServer, Controller Manager and Scheduler. The APIServer provides the sole entry point for resource operations and provides mechanisms such as authentication, authorization, access control, and API registration and discovery; the Controller Manager maintains the state of the cluster, for example through fault detection and automatic scaling; the Scheduler schedules each Pod to a suitable Node according to a preset scheduling policy.

Each Node runs three components: Kubelet, Proxy and Docker Daemon. Kubelet is responsible for lifecycle management of the Pods on its Node, such as creation, modification, monitoring and deletion. Proxy obtains the configuration of Services and Endpoints from etcd, starts a proxy process on the Node according to that configuration, listens on the corresponding service port, and distributes incoming external requests evenly across the backing containers. The Docker Daemon is responsible for image management and for actually running Pods and containers.

A data platform is built on the cloud computing platform. It includes HDFS, HBase, Spark, ES, Flume, Kafka and the like; the applications, services, configuration and images of the platform are packaged, services are provided through a unified external interface, and specific business requirements are met. The whole platform also needs a monitoring system: Heapster, InfluxDB and Grafana monitor the resource usage of the whole cloud computing platform. Metrics and event data in the cluster are written into InfluxDB and queried by Grafana, which displays them in a graphical interface so that operation and maintenance personnel can keep track of resource usage across the cluster.
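
The Scheduler's role described above, placing each Pod on a Node according to a preset scheduling policy, might be sketched as follows. The text does not specify the policy, so the example assumes one: filter out Nodes without enough free CPU, then pick the Node with the most free CPU. The `Node` class, the `schedule` function and the millicore figures are hypothetical and are not the Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_capacity: int              # millicores available on the node
    cpu_used: int = 0
    pods: list = field(default_factory=list)

    @property
    def cpu_free(self):
        return self.cpu_capacity - self.cpu_used


def schedule(pod_name, cpu_request, nodes):
    """Bind the pod to the feasible node with the most free CPU.

    Returns the chosen node name, or None when no node can fit the pod.
    """
    feasible = [n for n in nodes if n.cpu_free >= cpu_request]
    if not feasible:
        return None  # the pod stays pending
    target = max(feasible, key=lambda n: n.cpu_free)
    target.cpu_used += cpu_request
    target.pods.append(pod_name)
    return target.name


nodes = [Node("node-a", 1000), Node("node-b", 2000)]
placements = [schedule(f"pod-{i}", 800, nodes) for i in range(4)]
```

With these numbers the first two Pods land on node-b, the third on node-a, and the fourth cannot be placed, which is the situation in which a cluster would need to add a Node.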

The architecture of the cloud computing data platform is shown in FIG. 2. Flume collects log data from the App and Web ends, and the collected data is transmitted through Kafka to the data platform for processing; Sqoop synchronizes the data in the databases; the final results are stored in HDFS for later offline computation. The computing platform is divided into offline computing and real-time computing. For real-time computing, the online data that a user requests is computed in real time with Spark Streaming and Storm and the result is returned to the user; the Spark RDD mechanism makes this computation fast, and a real-time data warehouse can be built on it. For offline computing, the historical data stored in HDFS is used to construct an offline data warehouse through extraction, integration, mining and similar methods; the logic is then reorganized with Hive, and HiveSQL is executed by converting it into the corresponding MapReduce tasks. This provides data analysis for business requirements and data support for decision-making.

The Spark architecture is shown in FIG. 3. The Client submits the application. The Driver runs the application submitted by the Client and splits it up by creating a SparkContext, which contains the RDD, DAGScheduler, TaskScheduler, SparkEnv and the like. The application is converted into RDDs, the RDDs are organized into a directed acyclic graph that is divided according to wide and narrow dependencies, tasks are generated, and the tasks are sent to Executors for execution. The ClusterManager manages the scheduling of resources across the cluster and monitors the running state of each node through a heartbeat mechanism.
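
The division of the graph by wide and narrow dependencies might be illustrated with the sketch below. It is plain Python, not Spark code, and it assumes a simple linear lineage in which each operation is tagged as "narrow" or "wide"; an operation with a wide dependency needs a shuffle and therefore starts a new stage. The function name and the tagging scheme are hypothetical; the operation names mirror common RDD operations.

```python
def split_into_stages(lineage):
    """Cut a linear lineage into stages at every wide dependency.

    lineage: list of (operation_name, dependency) in execution order,
    where dependency is "narrow" or "wide".
    """
    stages = []
    current = []
    for operation, dependency in lineage:
        # A wide dependency requires a shuffle, so close the current stage first.
        if dependency == "wide" and current:
            stages.append(current)
            current = []
        current.append(operation)
    if current:
        stages.append(current)
    return stages


lineage = [
    ("textFile", "narrow"),
    ("flatMap", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),
    ("mapValues", "narrow"),
    ("sortByKey", "wide"),
]
stages = split_into_stages(lineage)
```

Narrow operations are kept in the same stage because each output partition depends on a single input partition, so they can be pipelined inside one task.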

The Flume architecture is shown in FIG. 4. Flume realizes data input and output through an agent, which consists of a source, a channel and a sink. The source collects the various data of a WebService, the channel buffers and stages the collected data temporarily, and the sink sends the collected data to the destination that needs it, such as HDFS or Spark.
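
The source, channel and sink roles might be sketched as below. This is a toy in-memory model, not Flume's API: the `Channel` and `ToyAgent` classes, the capacity of 3 and the batch size of 2 are assumptions chosen for the example.

```python
from collections import deque


class Channel:
    """Bounded temporary buffer between the source and the sink."""

    def __init__(self, capacity):
        self._events = deque()
        self.capacity = capacity

    def put(self, event):
        if len(self._events) >= self.capacity:
            return False  # channel full: the source must retry later
        self._events.append(event)
        return True

    def take(self, batch_size):
        batch = []
        while self._events and len(batch) < batch_size:
            batch.append(self._events.popleft())
        return batch


class ToyAgent:
    def __init__(self, channel, sink):
        self.channel = channel
        self.sink = sink  # callable that delivers a batch to the destination

    def source_receive(self, event):
        return self.channel.put(event)

    def sink_drain(self, batch_size=2):
        batch = self.channel.take(batch_size)
        if batch:
            self.sink(batch)
        return len(batch)


delivered = []  # stands in for a destination such as HDFS
agent = ToyAgent(Channel(capacity=3), delivered.extend)
accepted = [agent.source_receive(f"event-{i}") for i in range(4)]
while agent.sink_drain():
    pass
```

The fourth event is rejected because the channel is full, which shows why the channel decouples the rate of collection from the rate of delivery.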

The Kafka architecture is shown in FIG. 5. Like Flume, Kafka has a three-layer structure: the Producer produces data, the Broker acts as a basket that holds data, and the Consumer consumes data. The Broker can be scaled horizontally and is divided into multiple partitions. Different data is appended in order to its corresponding partition, each message is identified by an offset, and each partition guarantees the order of its data. The high reliability and consistency of Kafka ensure the validity of the data.
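
Partitions, offsets and per-partition ordering might be illustrated as follows. The sketch is an in-memory toy and not the Kafka client API: `ToyBroker` is hypothetical, and CRC32 is used only as a stand-in for a stable key hash.

```python
import zlib


class ToyBroker:
    """Holds several append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash sends every message with the same key to the same partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log

    def fetch(self, partition, offset, max_messages=10):
        # Consumers pull messages starting from an offset they track themselves.
        return self.partitions[partition][offset:offset + max_messages]


broker = ToyBroker(num_partitions=3)
p1, o1 = broker.append("user-1", "login")
p2, o2 = broker.append("user-1", "click")
p3, o3 = broker.append("user-1", "logout")
replayed = broker.fetch(p1, 1)
```

Ordering holds only within one partition, so messages that must stay in sequence need to share a key.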

The HDFS architecture is shown in FIG. 6. HDFS introduces HA: the Active NameNode handles the read and write requests of clients, while the Standby NameNode acts as the standby of the Active NameNode and keeps its state synchronized with the Active NameNode as closely as possible, so that it can take over quickly when the Active NameNode fails. Both NameNodes communicate with a set of JournalNodes. The Active NameNode persists its modification log to the JournalNodes. The Standby NameNode continuously monitors the JournalNodes; when it finds that the modification log has changed, it applies the modifications to its own namespace, which keeps it consistent with the namespace metadata of the Active NameNode.
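
The interaction between the two NameNodes and the JournalNodes might be sketched as below. This is a simplified in-memory model, not Hadoop code: the class and function names are hypothetical, the namespace is reduced to a set of paths, and the Standby reads from the first JournalNode only.

```python
class JournalNode:
    """Holds an ordered copy of the shared modification log."""

    def __init__(self):
        self.edits = []  # list of (txid, operation)


class NameNode:
    def __init__(self, name, journals, active=False):
        self.name = name
        self.journals = journals
        self.active = active
        self.namespace = set()
        self.applied_txid = 0

    def _apply(self, operation):
        action, path = operation
        if action == "mkdir":
            self.namespace.add(path)
        elif action == "delete":
            self.namespace.discard(path)

    def write(self, operation):
        if not self.active:
            raise RuntimeError(f"{self.name} is standby and cannot accept writes")
        txid = self.applied_txid + 1
        for journal in self.journals:  # persist the edit before applying it
            journal.edits.append((txid, operation))
        self._apply(operation)
        self.applied_txid = txid

    def tail_edits(self):
        # Simplification: replay unseen edits from the first JournalNode.
        for txid, operation in self.journals[0].edits:
            if txid > self.applied_txid:
                self._apply(operation)
                self.applied_txid = txid


def failover(old_active, standby):
    standby.tail_edits()  # catch up before taking over
    old_active.active = False
    standby.active = True


journals = [JournalNode() for _ in range(3)]
nn1 = NameNode("nn1", journals, active=True)
nn2 = NameNode("nn2", journals)
for operation in [("mkdir", "/logs"), ("mkdir", "/tmp"), ("delete", "/tmp")]:
    nn1.write(operation)
failover(nn1, nn2)
nn2.write(("mkdir", "/data"))
```

Because the Standby replays the same ordered log, its namespace matches the Active's at the moment of failover and it can continue numbering edits where the Active stopped.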

The HBase architecture is shown in FIG. 7. HBase is a column-oriented distributed database with efficient real-time read and write performance. It has a master-slave structure: the HMaster is elected through ZooKeeper, and the HMaster and the HRegionServers report heartbeats to ZooKeeper.

The HMaster load-balances the Regions and assigns them to suitable HRegionServers. An HRegionServer contains components such as HRegion, HLog, HFile, MemStore and StoreFile. HBase divides a table into multiple HRegions, and each HRegion stores a contiguous range of the table's data; when an HRegion reaches a size threshold, it is split evenly into two new HRegions. The HLog file records attribution information for the written data. An HRegion is composed of multiple Stores, and each Store contains one MemStore in memory and StoreFiles on disk. When the data in the MemStore reaches a threshold, the HRegionServer starts a flush process that writes it to a StoreFile; when the number of StoreFiles grows to a threshold, the system merges them.
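
The MemStore flush and StoreFile merge described above might be sketched as follows. The model is an in-memory toy, not HBase code: `ToyStore`, the flush threshold of 3 rows and the merge threshold of 2 files are assumptions chosen to keep the example short.

```python
class ToyStore:
    """One Store: a MemStore in memory plus a list of immutable StoreFiles."""

    def __init__(self, flush_threshold=3, compact_threshold=2):
        self.memstore = {}
        self.storefiles = []  # each StoreFile is a sorted list of (row, value)
        self.flush_threshold = flush_threshold
        self.compact_threshold = compact_threshold

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Write the MemStore out as a new sorted StoreFile and start a fresh MemStore.
        self.storefiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        if len(self.storefiles) >= self.compact_threshold:
            self._compact()

    def _compact(self):
        merged = {}
        for storefile in self.storefiles:  # oldest first, so newer values win
            merged.update(storefile)
        self.storefiles = [sorted(merged.items())]

    def get(self, row):
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):  # newest file first
            for key, value in storefile:
                if key == row:
                    return value
        return None


store = ToyStore()
writes = [("r1", "v1"), ("r2", "v2"), ("r3", "v3"),
          ("r1", "v1-new"), ("r4", "v4"), ("r5", "v5"), ("r6", "v6")]
for row, value in writes:
    store.put(row, value)
```

After these writes the two flushed files have been merged into one, the newer value of `r1` has replaced the older one, and `r6` is still waiting in the MemStore.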

Kerberos is used to realize the authentication service in the cluster. Kerberos uses a traditional shared-key approach to enable communication between a Client and a Server when the network environment cannot be guaranteed to be secure. Kerberos is a third-party authentication mechanism: users and services rely on a third party (the Kerberos server) to verify each other's identity. The Kerberos server itself is called the key distribution center, or KDC. The client first requests a Ticket-Granting Ticket (TGT) from the KDC; the KDC creates the TGT and sends it back to the client in encrypted form. The client then presents its TGT to the KDC as proof of identity and requests a ticket for a specific service, and the KDC sends that service ticket to the client. Finally, the client sends the service ticket to the server, and the server allows the client access.
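
The three-step ticket exchange might be illustrated with the toy model below. It is not the Kerberos protocol: real tickets are encrypted and carry session keys and timestamps, whereas this sketch only seals each ticket with an HMAC under a shared key so that the holder of that key can verify it. The class names, keys and the "alice" and "hdfs" principals are all hypothetical.

```python
import hashlib
import hmac
import json


def seal(key, payload):
    """Attach an HMAC tag so that holders of the key can verify the ticket."""
    body = json.dumps(payload, sort_keys=True)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def unseal(key, ticket):
    expected = hmac.new(key, ticket["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, ticket["tag"]):
        raise ValueError("ticket rejected: integrity check failed")
    return json.loads(ticket["body"])


class ToyKDC:
    def __init__(self, kdc_key, service_keys, known_clients):
        self.kdc_key = kdc_key
        self.service_keys = service_keys  # key shared between the KDC and each service
        self.known_clients = known_clients

    def issue_tgt(self, client):
        if client not in self.known_clients:
            raise ValueError("unknown client")
        return seal(self.kdc_key, {"client": client, "type": "TGT"})

    def issue_service_ticket(self, tgt, service):
        claims = unseal(self.kdc_key, tgt)  # only the KDC can validate a TGT
        if claims.get("type") != "TGT":
            raise ValueError("not a TGT")
        return seal(self.service_keys[service],
                    {"client": claims["client"], "service": service})


class ToyServer:
    def __init__(self, name, key):
        self.name = name
        self.key = key

    def accept(self, ticket):
        claims = unseal(self.key, ticket)
        return claims["service"] == self.name


kdc = ToyKDC(b"kdc-secret", {"hdfs": b"hdfs-secret"}, {"alice"})
server = ToyServer("hdfs", b"hdfs-secret")
tgt = kdc.issue_tgt("alice")                            # step 1: obtain a TGT
service_ticket = kdc.issue_service_ticket(tgt, "hdfs")  # step 2: exchange it
granted = server.accept(service_ticket)                 # step 3: present it
```

The server never contacts the KDC in step 3; it trusts the ticket because only the KDC and the server hold the key that sealed it.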

Claims

1. A cloud computing data platform construction method based on Kubernetes is characterized by comprising the following contents:

using Flume as a log collection system, wherein the system provides functions of log collection, aggregation and transmission, collects log data from various log sources, stores the log data on HDFS for centralized statistical analysis and processing, and provides offline data to HDFS and real-time log streams to Kafka and Bypass, respectively;

using Sqoop as a tool for mutual data transfer between the relational database and the HDFS;

using Kafka as a data pipeline and a message publish-subscribe system, wherein the producer is log data generated by a server and service data generated by a back end; the broker in the middle serves as a storage array that stores the messages published by the producer; and the consumer pulls data from the broker to provide a data source for the Spark cluster;

using HDFS as a data storage system, wherein Hadoop 2.x introduces a dual-NameNode architecture, HA configures the two NameNodes in Active/Passive states, and the Standby NameNode serves as the hot backup of the Active NameNode and can be switched automatically to become the Active NameNode when the Active NameNode fails or needs to be restarted for routine server maintenance; the NameNode serves as the Master, manages the HDFS namespace, configures the replication policy and the mapping information of data blocks, and processes read and write requests from clients; and the DataNode serves as the Slave and stores the data collected by Flume, the data migrated by Sqoop, and the data collected by Kafka in real time;

using Spark as the core computing engine of the cloud computing data platform, and realizing real-time processing of data with Spark SQL and Spark Streaming; the core of Spark is the RDD, namely the resilient distributed dataset, which is an immutable distributed collection of objects; each RDD is divided into multiple partitions, and different partitions run on different nodes of the cluster, which forms the distributed computing model of Spark; the data of an RDD is processed through transformation and action operations on the RDD; the execution model of the Spark framework is abstracted as a DAG, the DAG Scheduler divides the DAG into Stages according to the wide and narrow dependencies of the RDDs and submits the Stages to the Task Scheduler, which then submits them to Executors to execute the final tasks;

using ElasticSearch as a real-time search and analysis engine to perform and combine multiple types of searches; ElasticSearch is built on Lucene and provides full indexing and search functionality; multiple ES process instances are started on a number of machines to form an ES cluster, a master node is elected within the cluster to manage the nodes of the whole cluster, the slave nodes store all of the data, every node belongs to the cluster, and the cluster distributes the data evenly;

using HBase and Phoenix to realize real-time query of big data, wherein HBase adopts a distributed architecture consisting of a Master and Region Servers, the Master coordinates the Region Servers, monitors their states and balances the load among them, and clients connect directly to the Region Servers and obtain the data in HBase through a communication mechanism; Phoenix provides OLTP-related functions and a standard JDBC API, supports ACID, SQL and secondary indexes, and enables an application originally built on JDBC to access HBase directly through Phoenix;

using Docker containers to realize resource isolation and resource control, wherein images of the different service modules are generated with Docker, lightweight virtualization is realized through Namespaces, and different containers have independent resources so that the resources are isolated from one another; container resources are managed through CGroups, which provides a unified framework for resource management across the whole system; a unified interface is provided for the different applications, all dependencies and libraries required to run an application are contained in its container, and the services are packaged and then called through an interface-layer API; in addition, members of the same Namespace can communicate with each other while different Namespaces cannot, and the network devices are attached to a network bridge to realize communication and data forwarding;

using Kubernetes to realize resource scheduling, automatic deployment, and elastic scale-out and scale-in of containerized applications, wherein Kubernetes is a master-slave distributed architecture consisting of a Master and Nodes; four components, etcd, APIServer, Controller Manager and Scheduler, run on the Master, etcd persists the resource objects in the cluster, and the other three components schedule and manage cluster resources; three components, Kubelet, Proxy and Docker Daemon, run on each Node, manage the Pods on the Node, and realize load balancing and service proxy functions; in Kubernetes, Flannel is used to re-plan the container cluster network so that every container in the cluster obtains a unique intranet IP, and containers on different Nodes communicate with each other through these intranet IPs.

2. The method for constructing the cloud computing data platform based on Kubernetes as claimed in claim 1, wherein: ZooKeeper is used as a distributed coordination service framework.

3. The method for constructing the cloud computing data platform based on Kubernetes as claimed in claim 1, wherein: Heapster + InfluxDB + Grafana are used for data collection and aggregation, providing detailed resource usage at every layer to facilitate resource management and scheduling.

4. The method for constructing the cloud computing data platform based on Kubernetes as claimed in claim 1, wherein: YARN is used as the resource scheduling platform.

5. The method for constructing the cloud computing data platform based on Kubernetes as claimed in claim 1, wherein: Kerberos is used to provide authentication services for the cluster.