CN113377500A - Resource scheduling method, device, equipment and medium - Google Patents


Info

Publication number
CN113377500A
Authority
CN
China
Prior art keywords
resource
preset
application
data
occupancy rate
Prior art date
Legal status
Granted
Application number
CN202110925704.XA
Other languages
Chinese (zh)
Other versions
CN113377500B (en)
Inventor
钱浩东
周明伟
陈文灿
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202110925704.XA
Publication of CN113377500A
Application granted
Publication of CN113377500B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/4557Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a resource scheduling method, apparatus, device and medium. The method comprises: deploying each application of a Hadoop system in a Kubernetes system and acquiring resource occupancy rate data of each application; and when it is determined, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met and the current time is not within a preset cooling period, performing resource scheduling according to a preset resource scheduling policy and re-entering the cooling period. The scheme provided by the embodiments of the invention allocates Hadoop system resources on demand, solves the prior-art problems of Hadoop system resources being wasted when idle and insufficient when busy, and, by scheduling only when the triggering condition is met outside the cooling period, avoids repeatedly scheduling resources immediately after a scheduling operation.

Description

Resource scheduling method, device, equipment and medium
Technical Field
The present invention relates to the field of resource scheduling technologies, and in particular, to a method, an apparatus, a device, and a medium for resource scheduling.
Background
The Kubernetes system is an open-source container orchestration tool from Google for managing containerized applications on multiple hosts in a cloud platform. It pools hardware resources such as CPU, memory and GPU, and provides application-oriented container orchestration and deployment. Kubernetes has become the de facto standard for container cloud platforms. The Hadoop system is an open-source distributed system infrastructure from the Apache Foundation that implements a distributed file system: HDFS provides storage capacity for massive data, and MapReduce/Yarn provides computing capacity for massive data. Hadoop was designed from the outset to be deployed on physical machines and to uniformly schedule the resources on those machines. At present, Hadoop is an essential piece of infrastructure for the big-data services of companies across the industry, supporting services such as personalized recommendation, search recommendation and data analysis.
Because their service characteristics differ, the operating pressures of Kubernetes and Hadoop peak at different times. For example, in the early morning, when the Kubernetes system is idle, Hadoop needs to analyze and aggregate the previous day's data to generate the basic data required by each service system; the Hadoop system is then under great operating pressure while the Kubernetes system is not fully utilized. Conversely, when the operating pressure of the Hadoop system is small, the utilization rate of Hadoop system resources is low. Therefore, the prior art suffers from Hadoop system resources being wasted when idle and insufficient when busy.
Disclosure of Invention
The embodiments of the invention provide a resource scheduling method, apparatus, device and medium, which are used to solve the prior-art problems of Hadoop system resources being wasted when idle and insufficient when busy.
The embodiment of the invention provides a resource scheduling method, which comprises the following steps:
deploying each application of a Hadoop system in a Kubernetes system, and acquiring resource occupancy rate data of each application;
and when it is determined, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met and the current time is not within a preset cooling period, performing resource scheduling according to a preset resource scheduling policy and re-entering the cooling period.
Further, the deploying of each application of the Hadoop system in the Kubernetes system includes:
acquiring deployment configuration files of the applications of the Hadoop system, and determining the image file version information and copy number of each application according to the deployment configuration files of the applications;
and acquiring the corresponding image file from an image repository according to the image file version information of each application, and deploying each application in the Kubernetes system according to the copy number and image file of each application.
Further, the acquiring of the resource occupancy rate data of each application and the determining, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met comprise:
acquiring the storage resource occupancy rate data of each application, determining average storage resource occupancy rate data according to the storage resource occupancy rate data of each application, and determining that the resource scheduling triggering condition is met when the average storage resource occupancy rate data is greater than a preset first occupancy rate threshold or less than a preset second occupancy rate threshold.
Further, the acquiring of the resource occupancy rate data of each application and the determining, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met comprise:
acquiring the computing resource occupancy rate data of each application, determining average computing resource occupancy rate data according to the computing resource occupancy rate data of each application, and determining that the resource scheduling triggering condition is met when the average computing resource occupancy rate data is greater than a preset third occupancy rate threshold or less than a preset fourth occupancy rate threshold.
Further, the performing resource scheduling according to a preset resource scheduling policy includes:
when the average storage resource occupancy rate data is greater than the preset first occupancy rate threshold, adding copy instances of the Hadoop system resources; and when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold, reducing copy instances of the Hadoop system resources.
Further, when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold, the reducing copy instances of the Hadoop system resources includes:
when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold and it is determined that no file in the current HDFS has an insufficient number of copies, deleting the DataNode copy instance with the largest copy ID.
Further, the performing resource scheduling according to a preset resource scheduling policy includes:
when the average computing resource occupancy rate data is greater than the preset third occupancy rate threshold, adding copy instances of the Hadoop system resources; and when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold, reducing copy instances of the Hadoop system resources.
Further, when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold, the reducing copy instances of the Hadoop system resources includes:
when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold and it is determined that the NodeManager with the largest copy ID is not executing a task, deleting the NodeManager copy instance with the largest copy ID; and if the NodeManager with the largest copy ID is executing a task, waiting until it finishes the task and then deleting the NodeManager copy instance with the largest copy ID.
In another aspect, an embodiment of the present invention provides a resource scheduling apparatus, where the apparatus includes:
the deployment module is used for deploying each application of the Hadoop system in the Kubernetes system and acquiring resource occupancy rate data of each application;
and the scheduling module is used for performing resource scheduling according to a preset resource scheduling policy and re-entering the cooling period when it is determined, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met and the current time is not within the preset cooling period.
Further, the deployment module is specifically configured to acquire deployment configuration files of the applications of the Hadoop system, determine the image file version information and copy number of each application according to the deployment configuration files, acquire the corresponding image file from an image repository according to the image file version information of each application, and deploy each application in the Kubernetes system according to the copy number and image file of each application.
Further, the deployment module is specifically configured to obtain the storage resource occupancy rate data of each application, and determine average storage resource occupancy rate data according to the storage resource occupancy rate data of each application;
the scheduling module is specifically configured to determine that the resource scheduling triggering condition is met when the average storage resource occupancy rate data is greater than a preset first occupancy rate threshold, or the average storage resource occupancy rate data is less than a preset second occupancy rate threshold.
Further, the deployment module is specifically configured to obtain the computing resource occupancy rate data of each application, and determine average computing resource occupancy rate data according to the computing resource occupancy rate data of each application;
the scheduling module is specifically configured to determine that the resource scheduling triggering condition is met when the average computing resource occupancy rate data is greater than a preset third occupancy rate threshold, or the average computing resource occupancy rate data is less than a preset fourth occupancy rate threshold.
Further, the scheduling module is specifically configured to add copy instances of the Hadoop system resources when the average storage resource occupancy rate data is greater than the preset first occupancy rate threshold, and to reduce copy instances of the Hadoop system resources when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold.
Further, the scheduling module is specifically configured to delete the DataNode copy instance with the largest copy ID when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold and it is determined that no file in the current HDFS has an insufficient number of copies.
Further, the scheduling module is specifically configured to add copy instances of the Hadoop system resources when the average computing resource occupancy rate data is greater than the preset third occupancy rate threshold, and to reduce copy instances of the Hadoop system resources when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold.
Further, the scheduling module is specifically configured to: when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold and it is determined that the NodeManager with the largest copy ID is not executing a task, delete the NodeManager copy instance with the largest copy ID; and when the NodeManager with the largest copy ID is executing a task, wait until it finishes the task and then delete the NodeManager copy instance with the largest copy ID.
In another aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the above method steps when executing a program stored in the memory.
In yet another aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of any one of the above.
The embodiments of the invention provide a resource scheduling method, apparatus, device and medium, wherein the method comprises: deploying each application of a Hadoop system in a Kubernetes system and acquiring resource occupancy rate data of each application; and when it is determined, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met and the current time is not within a preset cooling period, performing resource scheduling according to a preset resource scheduling policy and re-entering the cooling period.
The technical scheme has the following advantages or beneficial effects:
In the embodiments of the invention, the resource scheduling triggering condition and the resource scheduling policy are preset. Each application of the Hadoop system is first deployed in the Kubernetes system; then, after the resource occupancy rate data of each application of the Hadoop system is acquired, whether the resource scheduling triggering condition is met is judged according to the resource occupancy rate data of each application; if it is met, whether the current time is within the preset cooling period is further judged, and if not, resource scheduling is performed according to the preset resource scheduling policy. The scheme provided by the embodiments of the invention allocates Hadoop system resources on demand, solves the prior-art problems of Hadoop system resources being wasted when idle and insufficient when busy, and, by performing scheduling only when the triggering condition is met outside the cooling period, avoids repeatedly scheduling resources immediately after a scheduling operation.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of a Hadoop system resource expansion and contraction process according to embodiment 1 of the present invention;
fig. 2 is a flowchart, provided by embodiment 2 of the present invention, of deploying the applications of the Hadoop system in the Kubernetes system;
fig. 3 is a flow chart of automatic capacity expansion and capacity reduction of a Hadoop storage resource according to embodiment 3 of the present invention;
fig. 4 is a flow chart of automatic capacity expansion and capacity reduction of Hadoop computing resources according to embodiment 4 of the present invention;
fig. 5 is a schematic structural diagram of a Hadoop system expansion and contraction device according to embodiment 5 of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to embodiment 6 of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the attached drawings, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The English terms involved in the embodiments of the invention are explained as follows:
Kubernetes: an open-source system for automatically deploying, scaling and managing containerized applications. It groups the containers that make up an application into logical units to facilitate management and service discovery. Kubernetes draws on Google's 15 years of experience running production workloads, combined with the best ideas and practices from the community.
Hadoop: originated in the Apache Nutch project, which started in 2002 as a sub-project of Apache Lucene. It became an independent, complete software suite in 2006 and was named Hadoop.
HDFS (Hadoop Distributed File System): a core sub-project of the Hadoop project and the basis of data storage management in distributed computing. It consists of NameNodes and DataNodes, where the NameNode depends on JournalNode and ZooKeeper.
JournalNode: an independent service in Hadoop used to store the Editlog (the log of operations performed on the HDFS).
NameNode: an independent service in Hadoop responsible for managing the file system namespace of the HDFS and client access.
DataNode: an independent service in Hadoop that provides storage of the actual file data.
YARN (Yet Another Resource Negotiator): the resource management and job scheduling technology in the Hadoop distributed processing framework. It consists of the ResourceManager, JobHistoryServer and NodeManager, where the ResourceManager depends on ZooKeeper.
ResourceManager: an independent service in Hadoop responsible for global resource management and task scheduling, treating the whole cluster as a computing resource pool.
JobHistoryServer: an independent service in Hadoop responsible for recording the logs of historical distributed tasks.
NodeManager: an independent service in Hadoop, the agent of a single node in YARN, managing a single computing node in the Hadoop cluster.
ZooKeeper: a project of the Apache Software Foundation that provides open-source distributed configuration services, synchronization services and naming registration for large distributed computing.
Apache: a non-profit organization dedicated to supporting the communities that run open-source software projects.
Cloudera: a commercial company that provides a distribution based on open-source Hadoop, with revisions and improvements, together with support and implementation services.
Hortonworks: a commercial company similar to Cloudera; it merged with Cloudera in 2018.
Image: an underlying file system containing an application and its dependencies. Using images, the runtime environments of applications can be quickly created in batches.
Example 1:
fig. 1 is a schematic diagram of a resource scheduling process provided in an embodiment of the present invention, where the process includes the following steps:
s101: deploying each application of the Hadoop system in the Kubernets system, and acquiring resource occupancy rate data of each application.
S102: and when the triggering resource scheduling conditions are determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period, performing resource scheduling according to a preset resource scheduling strategy and entering the cooling period again.
The resource scheduling method provided by the embodiments of the invention is applied to an electronic device, which may be a PC (personal computer), a tablet computer, a server or similar equipment. Resource scheduling in the embodiments of the invention comprises the scheduling of Hadoop system resources, specifically the capacity expansion or capacity reduction of Hadoop system resources.
In order to implement resource scheduling for the Hadoop system, the applications of the Hadoop system first need to be deployed in the Kubernetes system. In the embodiments of the invention, images of the applications of the Hadoop system are built, the image files of the applications are loaded into the Kubernetes system, and the image files are run in the Kubernetes system to deploy the applications of the Hadoop system in the Kubernetes system.
After the applications of the Hadoop system are deployed in the Kubernetes system, the electronic device acquires the resource occupancy rate data of each application and judges, according to the resource occupancy rate data of each application of the Hadoop system, whether resource scheduling is needed, i.e. whether the resources need to be expanded or reduced. A resource scheduling triggering condition and a resource scheduling policy are preset in the electronic device; if the resource occupancy rate data of the applications of the Hadoop system meets the triggering condition, resources are scheduled according to the preset resource scheduling policy. Specifically, the electronic device sets a resource expansion condition and a resource expansion policy, and separately sets a resource reduction condition and a resource reduction policy. If the resource occupancy rate data of the applications of the Hadoop system meets the resource expansion triggering condition, the resource expansion policy is used for resource scheduling; if it meets the resource reduction triggering condition, the resource reduction policy is used.
The resource scheduling condition may be set according to a resource occupancy rate threshold, and the resource scheduling policy includes adding copy instances of the system resources or removing copy instances of the system resources.
In addition, in the embodiments of the invention, the average occupancy of the overall resources changes after capacity expansion or reduction. If the scaling thresholds are set unreasonably, capacity reduction may be triggered as soon as expansion finishes, and expansion triggered again as soon as reduction finishes. A cooling period therefore needs to be set to avoid scaling caused by fluctuations in resource occupancy.
After it is determined, according to the resource occupancy rate data of the applications of the Hadoop system, that the resource scheduling triggering condition is met, whether the current time is within the preset cooling period is judged; if not, resources are scheduled according to the preset resource scheduling policy. Of course, if the current time is within the preset cooling period, no resource scheduling is performed.
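As an aid to reading steps S101-S102, the following Python sketch shows how the threshold check and the cooling period may interact. It is only an illustrative assumption: the function names, threshold values and cooling time below are not taken from the embodiment.

```python
import time

# Illustrative values only; in the embodiment these come from the preset
# triggering condition and resource scheduling policy.
SCALE_OUT_THRESHOLD = 0.90      # e.g. the preset first/third occupancy threshold
SCALE_IN_THRESHOLD = 0.60       # e.g. the preset second/fourth occupancy threshold
COOLING_PERIOD_SECONDS = 600    # hypothetical cooling period length

last_scheduling_time = 0.0      # time of the most recent scheduling operation

def average_occupancy(per_copy_occupancy):
    """Average resource occupancy rate over all copies of an application."""
    return sum(per_copy_occupancy) / len(per_copy_occupancy)

def maybe_schedule(per_copy_occupancy, scale_out, scale_in):
    """Perform scheduling only if the trigger fires outside the cooling period."""
    global last_scheduling_time
    now = time.time()
    if now - last_scheduling_time < COOLING_PERIOD_SECONDS:
        return "in cooling period, no scheduling"
    avg = average_occupancy(per_copy_occupancy)
    if avg > SCALE_OUT_THRESHOLD:
        scale_out()                      # add copy instances
    elif avg < SCALE_IN_THRESHOLD:
        scale_in()                       # remove copy instances
    else:
        return "triggering condition not met"
    last_scheduling_time = now           # re-enter the cooling period
    return "scheduled"
```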
In the embodiments of the invention, the resource scheduling triggering condition and the resource scheduling policy are preset. Each application of the Hadoop system is first deployed in the Kubernetes system; then, after the resource occupancy rate data of each application of the Hadoop system is acquired, whether the resource scheduling triggering condition is met is judged according to the resource occupancy rate data of each application; if it is met, whether the current time is within the preset cooling period is further judged, and if not, resource scheduling is performed according to the preset resource scheduling policy. The scheme provided by the embodiments of the invention allocates Hadoop system resources on demand, solves the prior-art problems of Hadoop system resources being wasted when idle and insufficient when busy, and, by performing scheduling only when the triggering condition is met outside the cooling period, avoids repeatedly scheduling resources immediately after a scheduling operation.
Example 2:
On the basis of the above embodiments, in the embodiments of the present invention, the deploying of each application of the Hadoop system in the Kubernetes system includes:
acquiring deployment configuration files of the applications of the Hadoop system, and determining the image file version information and copy number of each application according to the deployment configuration files of the applications;
and acquiring the corresponding image file from an image repository according to the image file version information of each application, and deploying each application in the Kubernetes system according to the copy number and image file of each application.
The image files of the applications of the Hadoop system include the image files of the JournalNode, NameNode, ResourceManager, JobHistoryServer, DataNode, NodeManager and ZooKeeper applications of the Hadoop system. The built image files are pushed to an image repository, deployment configuration files for ZooKeeper and Hadoop are written, the image files are acquired from the image repository, and the ZooKeeper, JournalNode, NameNode, ResourceManager, JobHistoryServer, DataNode and NodeManager applications of the Hadoop system are deployed in the Kubernetes system according to the ZooKeeper and Hadoop deployment configuration files.
One purpose of the embodiments of the invention is to solve the problem that hardware resources cannot be shared when a Kubernetes cluster and a Hadoop cluster are deployed separately. The simplest and most intuitive way to solve this problem is to deploy the Hadoop cluster on top of the Kubernetes cluster. Since deployment on Kubernetes was not considered when Hadoop was originally designed, part of the embodiments of the present invention addresses how to deploy a Hadoop cluster on a Kubernetes cluster.
Another purpose of the embodiments of the present invention is how to expand and reduce the capacity of Hadoop storage and computing resources using Kubernetes' ability to automatically and elastically scale applications. This part mainly involves collecting container scaling metrics (such as computing resource CPU occupancy rate and storage resource occupancy rate). After capacity expansion or reduction, the average occupancy of the overall resources changes; if the scaling thresholds are set unreasonably, reduction may be triggered as soon as expansion finishes and expansion triggered again as soon as reduction finishes. A cooling period therefore needs to be set to avoid scaling caused by fluctuations in resource occupancy.
Fig. 2 is a flowchart of deploying the applications of the Hadoop system in the Kubernetes system according to an embodiment of the present invention.
The procedure for deploying the prerequisite Kubernetes cluster is omitted here; it is assumed that a Kubernetes cluster already exists.
Native Hadoop, whether the version offered officially by Apache or the releases offered by Cloudera and Hortonworks, is designed to be deployed on physical machines. In the embodiments of the present invention, in order to unify the resource management of the physical machines, each application of Hadoop needs to be deployed on the Kubernetes cluster. The first step is therefore to build images of the Hadoop applications.
The Hadoop cluster is composed of multiple applications, including JournalNode, NameNode, ResourceManager, JobHistoryServer, DataNode and NodeManager. In addition, the Hadoop cluster relies on a ZooKeeper cluster, so an image of the ZooKeeper application also needs to be built. The produced images need to contain appropriate configuration files and support automatically starting the corresponding application and forming the cluster.
The built images are pushed to an image repository; when the Hadoop cluster is subsequently deployed on Kubernetes, the image of the corresponding version of each application is pulled from the image repository. Once an image has been pulled locally, the locally cached image can be reused without pulling it again, unless the image version changes due to a subsequent Hadoop upgrade.
After the images are built and pushed, the deployment configuration files of the ZooKeeper cluster and the Hadoop cluster need to be written. A deployment configuration file includes the image version of each application, the CPU and memory allocation, the copy number (for DataNode and NodeManager, the minimum and maximum copy numbers), the persistent disk information, and the capacity expansion and reduction policies. Because actual deployment environments differ, the deployment configuration files also vary, for example in CPU and memory allocation, copy numbers and persistent disk information. The deployment configuration file is adjusted by the implementer at deployment time and is not described in detail here; a sketch of the kind of information it carries is given below.
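The sketch below illustrates, as a Python dictionary, the kind of information such a deployment configuration file may carry. In practice this would typically take the form of Kubernetes manifests (for example a StatefulSet together with a HorizontalPodAutoscaler); all field names and values here are assumptions for illustration, not the format used by the embodiment.

```python
# Illustrative deployment configuration for the DataNode application; every
# name and value is an assumption, mirroring the items listed in the text
# (image version, CPU/memory, min/max copies, persistent disk, scaling policy).
datanode_deployment = {
    "application": "datanode",
    "image": "registry.example.com/hadoop-datanode:3.3.1",   # image version to pull
    "resources": {"cpu": "4", "memory": "16Gi"},             # CPU / memory allocation
    "replicas": {"min": 3, "max": 12},                       # minimum / maximum copy numbers
    "persistent_volume": {"size": "1Ti", "storage_class": "local-disk"},
    "autoscaling": {                                          # expansion / reduction policy
        "metric": "hdfs_storage_occupancy",
        "scale_out_threshold": 0.90,
        "scale_in_threshold": 0.60,
        "cooling_period_seconds": 600,
    },
}
```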
The ZooKeeper cluster is deployed on the Kubernetes cluster using the corresponding ZooKeeper deployment configuration file; it is assumed here that a 3-copy ZooKeeper cluster is deployed in the field. The Kubernetes cluster first tries to find the specified ZooKeeper image locally, and if it does not exist, automatically pulls (downloads) the image from the image repository. The Kubernetes cluster ensures that, once the specified version of the ZooKeeper image exists locally, 3 ZooKeeper copy containers are created according to the deployment configuration file. After each ZooKeeper container starts, it reads the copy number from an environment variable (the copy number is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container); the startup script then fills the domain name information of the copy instances into the configuration file according to the copy number and the service domain name prefix, and starts the ZooKeeper application (a sketch of this configuration-filling step is given below). The ZooKeeper containers then form a cluster based on the configuration information. At this point, the ZooKeeper cluster has been deployed and successfully built on Kubernetes.
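The configuration-filling step of such a startup script could look roughly like the Python sketch below (a real image would more likely use a shell script). The environment variable names REPLICAS and SERVICE_PREFIX, the domain name pattern and the ZooKeeper port numbers are assumptions; the embodiment only states that the copy number and the service domain name prefix are injected into the container by Kubernetes.

```python
import os

def peer_addresses():
    """Build the per-copy domain names that a ZooKeeper copy writes into its config.

    REPLICAS and SERVICE_PREFIX are hypothetical environment variable names
    standing in for the copy number and service domain name prefix injected
    by Kubernetes.
    """
    replicas = int(os.environ["REPLICAS"])       # e.g. 3
    prefix = os.environ["SERVICE_PREFIX"]        # e.g. "zookeeper"
    # Headless-service style names, one per copy ID.
    return [f"{prefix}-{i}.{prefix}" for i in range(replicas)]

def render_zoo_cfg(path="zoo.cfg"):
    """Append one 'server.N=host:2888:3888' line per copy; starting the
    ZooKeeper application itself is omitted here."""
    with open(path, "a") as cfg:
        for i, host in enumerate(peer_addresses()):
            cfg.write(f"server.{i}={host}:2888:3888\n")
```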
The JournalNode, NameNode, JobHistoryServer and ResourceManager applications of Hadoop are deployed on the Kubernetes cluster using the Hadoop deployment configuration file; it is assumed here that 3 copies of JournalNode, 2 copies of NameNode, 1 copy of JobHistoryServer and 2 copies of ResourceManager are deployed in the field. The Kubernetes cluster tries to find the specified service image locally, and if it does not exist, automatically pulls (downloads) the image from the image repository.
The Kubernetes cluster ensures that, once the specified version of the JournalNode image exists locally, 3 JournalNode copy containers are created according to the deployment configuration file. After each JournalNode container starts, it reads the copy number from environment variables (the copy number is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container); the startup script then fills the domain name information of the copy instances into the configuration file according to the copy number and the service domain name prefix, and starts the JournalNode application. The JournalNode containers then form a cluster based on the configuration information.
The Kubernetes cluster ensures that, once the specified version of the NameNode image exists locally, 2 NameNode copy containers are created according to the deployment configuration file. After each NameNode container starts, it reads the NameNode copy number, the service domain name prefix and copy number of JournalNode, and the service domain name prefix and copy number of ZooKeeper from environment variables (this information is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container; it is needed because the NameNode application depends on JournalNode and ZooKeeper). The startup script then fills the copy instance domain name information of JournalNode and ZooKeeper into the configuration file according to their service domain name prefixes and copy numbers, and starts the NameNode application and the ZKFC application (NameNode and ZKFC must be deployed in the same container, ZKFC being an auxiliary process of the NameNode). The NameNode with copy ID 0 is then responsible for initializing the cluster, including formatting ZooKeeper, formatting the JournalNodes and generating the cluster ID; a sketch of this copy-ID-dependent initialization is given below. After the initialization steps are completed, the NameNode with copy ID 1 synchronizes the cluster information from the NameNode with copy ID 0. The 2 NameNode containers then form a master-slave relationship.
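The copy-ID-dependent branch of the NameNode startup can be sketched as follows. The environment variable name COPY_ID is an assumption, and the commands shown mirror standard HDFS administration commands rather than the embodiment's actual startup script.

```python
import os

def run(cmd):
    """Placeholder for executing a command inside the container."""
    print("exec:", cmd)

def namenode_bootstrap():
    copy_id = int(os.environ.get("COPY_ID", "0"))    # hypothetical env variable
    if copy_id == 0:
        # The first copy initializes the cluster: format the ZKFC state in
        # ZooKeeper, format the JournalNodes and generate the cluster ID.
        run("hdfs zkfc -formatZK -nonInteractive")
        run("hdfs namenode -format -nonInteractive")
    else:
        # Other copies synchronize the cluster metadata from the initialized NameNode.
        run("hdfs namenode -bootstrapStandby -nonInteractive")
    # NameNode and ZKFC run in the same container; ZKFC is an auxiliary process.
    run("hdfs --daemon start namenode")
    run("hdfs --daemon start zkfc")
```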
The Kubernetes cluster ensures that, once the specified version of the JobHistoryServer image exists locally, 1 JobHistoryServer container is created according to the deployment configuration file. After the JobHistoryServer container starts, it reads the service domain name prefix and copy number of the NameNode from environment variables (this information is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container); the startup script then fills the copy instance domain name information of the NameNode into the configuration file according to the NameNode service domain name prefix and copy number, and starts the JobHistoryServer application.
The Kubernetes cluster ensures that, once the specified version of the ResourceManager image exists locally, 2 ResourceManager copy containers are created according to the deployment configuration file. After each ResourceManager container starts, it reads the ResourceManager copy number, the service domain name prefix and copy number of the NameNode, the service domain name prefix and copy number of ZooKeeper, and the service domain name prefix and copy number of the JobHistoryServer from environment variables (this information is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container; it is needed because the ResourceManager application depends on these services). The startup script then fills the corresponding copy instance domain name information into the configuration file and starts the ResourceManager application. The 2 ResourceManager containers then form a master-slave relationship.
The DataNode application of Hadoop is deployed on the Kubernetes cluster using the Hadoop deployment configuration file; the DataNode application is mainly responsible for the distributed data storage part of Hadoop HDFS. The DataNode application needs a configured minimum copy number and maximum copy number, as well as capacity expansion and reduction rules. When Kubernetes detects that the resource occupancy of the DataNode copy containers meets the expansion or reduction criteria and the actual copy number is between the minimum and maximum copy numbers, the expansion or reduction operation is triggered. The Kubernetes cluster ensures that, once the specified version of the DataNode image exists locally, DataNode containers with the minimum copy number are created according to the deployment configuration file. After a DataNode container starts, it reads the service domain name prefix and copy number of the NameNode from environment variables (this information is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container; it is needed because the DataNode service depends on the NameNode); the startup script then fills the copy instance domain name information of the NameNode into the configuration file according to the NameNode service domain name prefix and copy number, and starts the DataNode application. The DataNode application then registers with the NameNode based on the configured NameNode information. At this point, the HDFS part of Hadoop has been deployed.
The NodeManager application of Hadoop is deployed on the Kubernetes cluster using the Hadoop deployment configuration file; the NodeManager application is mainly responsible for the distributed task execution part of Hadoop YARN. The Kubernetes cluster ensures that, once the specified version of the NodeManager image exists locally, NodeManager containers with the minimum copy number are created according to the deployment configuration file. After a NodeManager container starts, it reads the service domain name prefix and copy number of the ResourceManager from environment variables (this information is generated by Kubernetes from the configuration information in the deployment configuration file and injected into the container; it is needed because the NodeManager application depends on the ResourceManager); the startup script then fills the copy instance domain name information of the ResourceManager into the configuration file according to the ResourceManager service domain name prefix and copy number, and starts the NodeManager application. The NodeManager then registers with the ResourceManager based on the configured ResourceManager information. At this point, the YARN part of Hadoop has been deployed.
A Hadoop cluster with the minimum resources has thus been deployed on the Kubernetes cluster, and during subsequent use Kubernetes can automatically expand and reduce it according to the resource occupancy rate.
Example 3:
On the basis of the foregoing embodiments, in the embodiments of the present invention, the acquiring of the resource occupancy rate data of each application and the determining, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met comprise:
acquiring the storage resource occupancy rate data of each application, determining average storage resource occupancy rate data according to the storage resource occupancy rate data of each application, and determining that the resource scheduling triggering condition is met when the average storage resource occupancy rate data is greater than a preset first occupancy rate threshold or less than a preset second occupancy rate threshold.
The resource scheduling scheme provided by the embodiment of the invention relates to the scheduling of system storage resources. When the storage resources are scheduled, the storage resource occupancy rate data of each application in the Hadoop system needs to be acquired, and whether the triggering resource scheduling condition is met or not is judged according to the storage resource occupancy rate data of each application.
Specifically, the storage resource occupancy rate data of each application is acquired, and the average storage resource occupancy rate data is determined according to the storage resource occupancy rate data of each application; when the average storage resource occupancy rate data is greater than the preset first occupancy rate threshold or less than the preset second occupancy rate threshold, it is determined that the resource scheduling triggering condition is met.
The preset first occupancy rate threshold is greater than the preset second occupancy rate threshold.
The resource scheduling according to the preset resource scheduling policy comprises:
when the average storage resource occupancy rate data is greater than the preset first occupancy rate threshold, adding copy instances of the Hadoop system resources; and when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold, reducing copy instances of the Hadoop system resources.
When the average storage resource occupancy rate data is less than the preset second occupancy rate threshold, the reducing copy instances of the Hadoop system resources comprises:
when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold and it is determined that no file in the current HDFS has an insufficient number of copies, deleting the DataNode copy instance with the largest copy ID.
Fig. 3 is a flow chart of automatic capacity expansion and capacity reduction of a Hadoop storage resource according to an embodiment of the present invention.
The Hadoop-HDFS-adapter periodically collects the used storage space of the HDFS, collects the storage resource occupancy rate data of each application, and synchronizes the data to the Metrics Server of the Kubernetes cluster. If the HDFS storage resource occupancy rate in the Hadoop system is greater than the preset first occupancy rate threshold, it is determined that the capacity expansion condition is met, and DataNode copy instances of the Hadoop system resources are added; if the HDFS storage resource occupancy rate in the Hadoop system is less than the preset second occupancy rate threshold, it is determined that the capacity reduction condition is met, and DataNode copy instances of the Hadoop system resources are removed. The preset second occupancy rate threshold is smaller than the preset first occupancy rate threshold.
Specifically, the HPA controller (Horizontal Pod Autoscaler) in the Kubernetes cluster obtains these data from the Metrics Server and uses them, together with the capacity expansion and reduction rules, to calculate the copy number of the corresponding DataNode application. For example, suppose the current DataNode application copy number is 4, the storage space of each copy is 1 TB, and the actually used storage spaces are 550 GB, 500 GB, 600 GB and 550 GB respectively, so the average space occupancy is 55%; the capacity reduction condition configured in the HPA rule is a space occupancy of 60% and the capacity expansion condition is a space occupancy of 90%. After the HPA controller calculates that the reasonable copy number is 3, capacity reduction is triggered, the redundant DataNode service copy is deleted, and the DataNode service copy number is adjusted to 3. It should be noted that when a DataNode service copy is deleted, the data on that copy is recovered on the other DataNode service copies, so the total storage space actually used is unchanged. The expected copy number is calculated from the total storage space and the actually used space; the calculation formula can be simply understood as: 60% (capacity reduction threshold) < actually used storage space / (copy number × per-copy storage space) < 90% (capacity expansion threshold). When the expected DataNode service copy number calculated by the HPA controller differs from the current actual copy number, the HPA controller initiates a Scale operation to the replica controller of the DataNode service, adjusts the DataNode copy number, and completes the capacity expansion or reduction operation; the sketch after this paragraph reproduces the copy number calculation.
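The expected copy number in the example can be reproduced with the small Python sketch below. It is a simplified reading of the formula above; the minimum/maximum copy numbers and the way ties are broken are assumptions, not the HPA controller's actual algorithm.

```python
def desired_datanode_replicas(used_gb, per_copy_gb, current, minimum, maximum,
                              scale_in=0.60, scale_out=0.90):
    """Smallest copy number whose average space occupancy falls inside the
    (scale_in, scale_out) band, i.e. scale_in < used / (copies * per-copy) < scale_out."""
    occupancy = used_gb / (current * per_copy_gb)
    if scale_in <= occupancy <= scale_out:
        return current                    # triggering condition not met
    for n in range(minimum, maximum + 1): # smallest copy number inside the band
        if scale_in < used_gb / (n * per_copy_gb) < scale_out:
            return n
    return current                        # no copy number fits; leave unchanged

# Figures from the example: 550 + 500 + 600 + 550 GB used over 4 copies of 1 TB
# (average occupancy 55%, below the 60% reduction threshold).
print(desired_datanode_replicas(used_gb=2200, per_copy_gb=1000,
                                current=4, minimum=3, maximum=12))   # -> 3
```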
After the copy instances of the Hadoop system resources are added or removed, the cooling period is re-entered.
When the expected DataNode service copy number calculated by the HPA controller differs from the current actual copy number and the service is not in the cooling period following the previous scaling operation, the HPA controller initiates a Scale operation to the replica controller of the DataNode service, adjusts the DataNode copy number, and completes the capacity expansion or reduction operation.
According to the calculation and subsequent judgment logic of the HPA controller, the following cases can be distinguished:
the expansion condition is reached and the cooling period is not reached. The HPA controller may initiate an add-copy operation to the copy controller of the DataNode application, and the Kubernetes cluster may automatically create a new DataNode copy and inject dependent environment variables. The startup script of the DataNode can acquire the dependent NameNode information from the environment variable, fill the information into the configuration file, and start the DataNode application. Subsequently, the DataNode application automatically registers to the NameNode application, and the automatic capacity expansion of the storage space of the HDFS is completed. The new DataNode will take on the write task of the partially distributed data. In an actual production environment, the quantity of expansion at each time can be a multiple of 3, and data imbalance after expansion is avoided.
The expansion condition is reached, but in the cooling period. The HPA controller may not perform a flash operation because it is currently in the cool-down period.
The shrinkage condition is reached and is not in the cooling period. The HPA controller may initiate a copy reduction operation to the replica controller of the DataNode application, and the Kubernetes cluster may automatically delete the DataNode replica with the largest replica ID. Before the DataNode copy is deleted, the DataNode needs to delete itself from the NameNode, so as to avoid the NameNode considering that the DataNode is offline, but not deleted. Thus, the storage space of the HDFS is automatically reduced. The remaining DataNodes will undertake the write task for all distributed data and restore the data on the deleted DataNodes on the remaining DataNodes. Since data on a DataNode is stored in multiple copies on multiple DataNodes, the data can be recovered on other DataNodes even if one DataNode service copy is deleted. In an actual production environment, the capacity reduction means triggering very time-consuming data recovery, and occupying the bandwidth of a cluster, so the capacity reduction can be very cautious, and the capacity reduction operation is not triggered generally even if the space is remained very large.
The reduction condition is met, but the service is in the cooling period. The HPA controller does not perform the reduction operation because it is currently in the cooling period.
Neither the expansion nor the reduction condition is met. The controller waits for the next round of checking.
In Kubernetes, the suffix of a copy name is the copy ID, which increases from 0; when the capacity is expanded, the newly created copy has the largest copy ID, and when the capacity is reduced, deletion starts from the copy with the largest copy ID. The DataNode application has description information specifying the required CPU, memory and storage space; the individual DataNode service instances that Kubernetes creates from the same description information are called DataNode service copies (because they are created from the same description information). Containers are instances created by Kubernetes and may be copies of the same service or of different services; a copy usually refers to one container in a group of containers created from the same description information.
After the capacity expansion or reduction operation completes, Kubernetes sets a cooling period for the DataNode application to avoid frequent expansion and reduction operations, especially reduction operations, caused by fluctuations in the scaling metrics. A reduction operation means that the number of data copies on the distributed storage decreases, which triggers the file recovery flow; if another reduction is triggered before the file recovery flow completes, data may be lost. Therefore the reduction trigger condition can be set more strictly and the cooling time set longer, so that the impact of reduction on the service remains smooth. Meanwhile, a DataNode in the Hadoop cluster may be offline; in that case the number of HDFS data copies is insufficient, and the reduction process is not triggered until the offline DataNode returns to normal. A sketch of this pre-reduction check is given below.
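The pre-reduction check can be sketched as follows. The parameter `under_replicated_blocks` stands in for the check that no HDFS file currently has an insufficient number of copies (in practice this could, for example, be read from NameNode metrics); it is an assumed interface, not the embodiment's exact one.

```python
def try_scale_in_datanode(copy_ids, in_cooling_period, under_replicated_blocks):
    """Return the copy ID that may be removed, or None if scale-in must wait."""
    if in_cooling_period:
        return None          # still cooling down after the previous scaling
    if under_replicated_blocks > 0:
        return None          # wait until offline DataNodes return to normal
    # Deletion always starts from the copy with the largest copy ID; that copy
    # deregisters from the NameNode before it is deleted, and the remaining
    # copies re-replicate the data it held.
    return max(copy_ids)

print(try_scale_in_datanode([0, 1, 2, 3], in_cooling_period=False,
                            under_replicated_blocks=0))   # -> 3
```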
Example 4:
On the basis of the foregoing embodiments, in the embodiments of the present invention, the acquiring of the resource occupancy rate data of each application and the determining, according to the resource occupancy rate data of each application, that the resource scheduling triggering condition is met comprise:
acquiring the computing resource occupancy rate data of each application, determining average computing resource occupancy rate data according to the computing resource occupancy rate data of each application, and determining that the resource scheduling triggering condition is met when the average computing resource occupancy rate data is greater than a preset third occupancy rate threshold or less than a preset fourth occupancy rate threshold.
The resource scheduling scheme provided by the embodiment of the invention relates to the scheduling of system computing resources. When the computing resources are scheduled, computing resource occupancy rate data of each application in the Hadoop system needs to be acquired, and whether triggering resource scheduling conditions are met or not is judged according to the computing resource occupancy rate data of each application.
Specifically, the computing resource occupancy rate data of each application is acquired, and the average computing resource occupancy rate data is determined according to the computing resource occupancy rate data of each application; when the average computing resource occupancy rate data is greater than the preset third occupancy rate threshold or less than the preset fourth occupancy rate threshold, it is determined that the resource scheduling triggering condition is met.
The preset third occupancy rate threshold is greater than the preset fourth occupancy rate threshold. The relationship between the preset first and third occupancy rate thresholds is not limited, nor is the relationship between the preset second and fourth occupancy rate thresholds.
The resource scheduling according to the preset resource scheduling policy comprises:
when the average computing resource occupancy rate data is greater than the preset third occupancy rate threshold, adding copy instances of the Hadoop system resources; and when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold, reducing copy instances of the Hadoop system resources.
When the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold, the reducing copy instances of the Hadoop system resources comprises:
when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold and it is determined that the NodeManager with the largest copy ID is not executing a task, deleting the NodeManager copy instance with the largest copy ID; and if the NodeManager with the largest copy ID is executing a task, waiting until it finishes the task and then deleting the NodeManager copy instance with the largest copy ID.
Fig. 4 is a flow chart of automatic capacity expansion and capacity reduction of a Hadoop computing resource according to an embodiment of the present invention.
If the average CPU occupancy rate across the NodeManager copies in the Hadoop system is greater than the preset third occupancy rate threshold, it is determined that the capacity expansion condition is met, and NodeManager copy instances of the Hadoop system resources are added; if the average CPU occupancy rate across the NodeManager copies in the Hadoop system is less than the preset fourth occupancy rate threshold, it is determined that the capacity reduction condition is met, and NodeManager copy instances of the Hadoop system resources are removed.
Specifically, the Hadoop-Yarn-adapter periodically collects CPU occupancy rate data from each NodeManager copy and synchronizes the data to the Metrics Server of the Kubernetes cluster. The HPA controller (Horizontal Pod Autoscaler) in the Kubernetes cluster obtains these data from the Metrics Server and evaluates the capacity expansion and capacity reduction rules to obtain the number of copies the NodeManager service should have. For example, suppose the current number of NodeManager service copies is 4 and the actual CPU utilization rates are 60%, 65%, 60% and 45%, so the average CPU occupancy rate is 57.5%. Suppose further that the capacity reduction condition configured in the HPA rule is a CPU occupancy rate of 60% and the capacity expansion condition is a CPU occupancy rate of 80%. After the HPA controller calculates that the reasonable copy number is 3, capacity reduction is triggered, the redundant NodeManager service copy is deleted, and the number of NodeManager service copies is adjusted to 3. It should be noted that deleting a NodeManager service copy is not performed immediately; it is also necessary to determine whether there is an unfinished task on the NodeManager copy instance, which is described in detail below. The expected copy number is calculated from the average CPU occupancy rate, and the rule can be simply understood as keeping 60% (capacity reduction threshold) < average CPU occupancy rate < 80% (capacity expansion threshold). Note that the NodeManager capacity reduction is performed only after its tasks have finished, so only the average CPU occupancy rate of the remaining NodeManager copies needs to be considered.
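The replica calculation in this example can be sketched as follows. The sizing rule shown (choose the smallest copy count whose projected average CPU stays under the expansion threshold) is one reading of the rule above, with the threshold values taken from the example; function and parameter names are illustrative.

import math

def desired_replicas(cpu_per_copy, scale_up_pct=80.0):
    # Total CPU demand across all copies, e.g. 60 + 65 + 60 + 45 = 230.
    total = sum(cpu_per_copy)
    # Smallest copy count whose projected average stays under the expansion threshold.
    return max(1, math.ceil(total / scale_up_pct))

# Example from the text: four NodeManager copies at 60%, 65%, 60% and 45% CPU.
# The average of 57.5% is below the 60% reduction threshold, so scaling is
# triggered and the new copy count is computed.
print(desired_replicas([60, 65, 60, 45]))   # prints 3, so one copy is removed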
When the number of NodeManager service copies calculated by the HPA controller differs from the current actual number of copies, and the service is not in the cooling period following the previous capacity expansion or reduction, the HPA controller initiates a Scale operation to the copy controller of the NodeManager, adjusts the number of NodeManager copies, and completes the capacity expansion or reduction. According to the calculation and subsequent judgment logic of the HPA controller, the following cases can be distinguished:
the capacity expansion condition is reached and the service is not in the cooling period. The HPA controller initiates an add-copy operation to the copy controller of the NodeManager, and the Kubernetes cluster automatically creates a new NodeManager copy and injects the dependent environment variables. The startup script of the NodeManager obtains the information of the ResourceManager it depends on from the environment variables, fills the information into the configuration file, and starts the NodeManager. Since the NodeManager then automatically registers with the ResourceManager, the automatic capacity expansion of Yarn computing resources is completed, and the new NodeManager takes on part of the distributed computing tasks. In an actual production environment, only one copy is added in each expansion, which avoids insufficient utilization caused by occupying too many resources at once.
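The startup step can be sketched as below. This is a hedged illustration: the injected variable name RESOURCEMANAGER_HOSTNAME and the output path are assumptions that depend on the image, while yarn.resourcemanager.hostname is the standard YARN property being filled in.

import os

rm_host = os.environ.get("RESOURCEMANAGER_HOSTNAME", "resourcemanager")

yarn_site = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>{}</value>
  </property>
</configuration>
""".format(rm_host)

with open("/tmp/yarn-site.xml", "w") as f:   # path chosen for the sketch
    f.write(yarn_site)
# The real NodeManager process would then be started against this configuration.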
The capacity expansion condition is reached, but the service is in the cooling period. The HPA controller does not perform the capacity expansion operation because it is currently in the cooling period.
The capacity reduction condition is reached and the service is not in the cooling period. The HPA controller initiates a reduce-copy operation to the copy controller of the NodeManager, and the Kubernetes cluster automatically deletes the NodeManager copy with the largest copy ID. Before the NodeManager copy is deleted, the NodeManager needs to remove itself from the ResourceManager; otherwise the ResourceManager would consider the NodeManager offline rather than deleted. In this way, the automatic capacity reduction of Yarn computing resources is completed, and the remaining NodeManagers take on all distributed computing tasks. In an actual production environment, task execution times and periods are not uniform, and because of the scheduling policy the NodeManager with the largest copy ID is not necessarily idle when the whole cluster is idle; a legacy task may still be executing on it. In that case the NodeManager needs to be marked as unhealthy, and a NodeManager marked as unhealthy is no longer scheduled to execute new tasks. The capacity reduction operation is triggered only after all of its tasks have finished.
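One common way to implement the "remove itself from the ResourceManager before deletion" step is YARN's exclude-file mechanism; the sketch below assumes yarn.resourcemanager.nodes.exclude-path is configured to point at the file shown, and the file path and function name are illustrative rather than part of the claimed scheme.

import subprocess

def decommission_node(node_hostname, exclude_file="/etc/hadoop/conf/yarn.exclude"):
    # 1. List the node in the exclude file so no new containers are assigned to it.
    with open(exclude_file, "a") as f:
        f.write(node_hostname + "\n")
    # 2. Ask the ResourceManager to re-read the node lists and decommission it.
    subprocess.run(["yarn", "rmadmin", "-refreshNodes"], check=True)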
The capacity reduction condition is reached, but the service is in the cooling period. The HPA controller does not perform the capacity reduction operation because it is currently in the cooling period.
Neither the capacity expansion nor the capacity reduction condition is met. The controller waits for the next round of inspection.
After the capacity expansion or reduction operation is completed, Kubernetes sets a cooling period for the NodeManager service to avoid frequent capacity expansion and reduction operations, especially reduction operations, caused by fluctuations of the measured metrics. Because the capacity reduction operation must wait for the tasks on the NodeManager copy to be deleted to finish, that NodeManager copy is marked as unhealthy until its tasks complete. Therefore, an actual capacity reduction may delete multiple NodeManager copies at once (after some long-running tasks finish, the resources they occupied are released at the same time).
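The cooling-period check can be sketched as a simple time gate. The 300-second window and the class name are assumptions for the sketch; Kubernetes HPA exposes a comparable, configurable stabilization window for the same purpose.

import time

class CooldownGate:
    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._last_scaled = None

    def allow_scaling(self):
        # Scaling is allowed only once the cooling window has elapsed.
        if self._last_scaled is None:
            return True
        return time.time() - self._last_scaled >= self.window_seconds

    def record_scaling(self):
        # Re-enter the cooling period after each expansion or reduction.
        self._last_scaled = time.time()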
Example 5:
fig. 5 is a schematic structural diagram of a resource scheduling apparatus according to an embodiment of the present invention, where the apparatus includes:
the deployment module 51 is configured to deploy each application of the Hadoop system in the Kubernetes system, and acquire resource occupancy rate data of each application;
and the scheduling module 52 is configured to, when it is determined according to the resource occupancy rate data of each application that the triggering resource scheduling condition is met and it is currently not within the preset cooling period, perform resource scheduling according to a preset resource scheduling policy and reenter the cooling period.
The deployment module 51 is specifically configured to obtain deployment configuration files of each application of the Hadoop system, and determine version information and the number of copies of the image file of each application according to the deployment configuration files of each application; and acquiring a corresponding image file from an image warehouse according to the image file version information of each application, and deploying each application in the Kubernetes system according to the copy number and the image file of each application.
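A sketch of this deployment step using the official Kubernetes Python client is shown below. The configuration file name nodemanager.yaml and its field names are invented for illustration, and the sketch creates a single Deployment rather than the full set of Hadoop applications.

import yaml
from kubernetes import client, config

with open("nodemanager.yaml") as f:
    conf = yaml.safe_load(f)   # e.g. {"image": "registry/nodemanager:3.2.1", "replicas": 2}

config.load_kube_config()
labels = {"app": "nodemanager"}
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="nodemanager"),
    spec=client.V1DeploymentSpec(
        replicas=conf["replicas"],
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="nodemanager", image=conf["image"])]
            ),
        ),
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)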
The deployment module 51 is specifically configured to obtain the storage resource occupancy rate data of each application, and determine average storage resource occupancy rate data according to the storage resource occupancy rate data of each application;
the scheduling module 52 is specifically configured to determine that the triggering resource scheduling condition is met when the average storage resource occupancy data is greater than a preset first occupancy threshold, or the average storage resource occupancy data is less than a preset second occupancy threshold.
The deployment module 51 is specifically configured to obtain the computing resource occupancy rate data of each application, and determine average computing resource occupancy rate data according to the computing resource occupancy rate data of each application;
the scheduling module 52 is specifically configured to determine that the triggering resource scheduling condition is met when the average computing resource occupancy data is greater than a preset third occupancy threshold, or the average computing resource occupancy data is less than a preset fourth occupancy threshold.
The scheduling module 52 is specifically configured to increase a copy instance of the Hadoop system resource when the average storage resource occupancy data is greater than a preset first occupancy threshold; and when the average occupancy rate data of the storage resources is smaller than a preset second occupancy rate threshold, reducing the duplicate instances of the Hadoop system resources.
The scheduling module 52 is specifically configured to delete the DataNode copy instance with the largest copy ID when the average storage resource occupancy data is smaller than a preset second occupancy threshold and it is determined that there is no file with an insufficient number of copies in the current HDFS.
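The "no file with an insufficient number of copies" check can be approximated by parsing the summary of hdfs fsck, as sketched below. The parsing details and the placeholder delete_datanode_replica call are assumptions that depend on the Hadoop version and the surrounding code.

import re
import subprocess

def hdfs_has_under_replicated_blocks():
    # Run a full filesystem check and read the under-replicated count
    # from its summary section.
    report = subprocess.run(["hdfs", "fsck", "/"],
                            capture_output=True, text=True, check=True).stdout
    match = re.search(r"Under-replicated blocks:\s*(\d+)", report)
    return match is not None and int(match.group(1)) > 0

# Only remove the DataNode copy with the largest copy ID when nothing is
# under-replicated (delete_datanode_replica is a hypothetical placeholder):
# if not hdfs_has_under_replicated_blocks():
#     delete_datanode_replica(largest_copy_id)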
The scheduling module 52 is specifically configured to increase a copy instance of the Hadoop system resource when the average computing resource occupancy data is greater than a preset third occupancy threshold; and when the average occupancy rate data of the computing resources is smaller than a preset fourth occupancy rate threshold, reducing the duplicate instances of the Hadoop system resources.
The scheduling module 52 is specifically configured to delete the NodeManager copy instance with the largest copy ID when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold and it is determined that the NodeManager with the largest copy ID is not executing any task, and to delete the NodeManager copy instance with the largest copy ID after the task finishes when the NodeManager with the largest copy ID is executing a task.
Example 6:
on the basis of the foregoing embodiments, an embodiment of the present invention further provides an electronic device, as shown in Fig. 6, comprising a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with one another through the communication bus 304;
the memory 303 has stored therein a computer program which, when executed by the processor 301, causes the processor 301 to perform the steps of:
deploying each application of a Hadoop system in a Kubernetes system, and acquiring resource occupancy rate data of each application;
and when the triggering resource scheduling conditions are determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period, performing resource scheduling according to a preset resource scheduling strategy and entering the cooling period again.
Based on the same inventive concept, the embodiment of the present invention further provides an electronic device, and because the principle of the electronic device for solving the problem is similar to the resource scheduling method, the implementation of the electronic device may refer to the implementation of the method, and repeated parts are not described again.
The electronic device provided by the embodiment of the invention can be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a network side device and the like.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 302 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
When the processor executes the program stored in the memory in the embodiment of the invention, the deployment of each application of the Hadoop system in the Kubernetes system is realized, and the resource occupancy rate data of each application is obtained; and when the triggering resource scheduling condition is determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period, resource scheduling is performed according to a preset resource scheduling strategy and the cooling period is entered again.
In the embodiment of the invention, the triggering resource scheduling condition and the resource scheduling strategy are preset. Each application of the Hadoop system is first deployed in the Kubernetes system; then, after the resource occupancy rate data of each application of the Hadoop system is acquired, whether the triggering resource scheduling condition is met is judged according to the resource occupancy rate data of each application. If it is met, it is further judged whether the current time is within the preset cooling period, and if not, resource scheduling is performed according to the preset resource scheduling strategy. The scheme provided by the embodiment of the invention allocates Hadoop system resources as required, solves the problems in the prior art that Hadoop system resources are wasted when the system is idle and insufficient when it is busy, and, by setting the cooling period, avoids repeatedly scheduling resources immediately after resources have just been scheduled.
Example 7:
on the basis of the foregoing embodiments, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program executable by an electronic device is stored, and when the program is run on the electronic device, the electronic device is caused to execute the following steps:
deploying each application of a Hadoop system in a Kubernetes system, and acquiring resource occupancy rate data of each application;
and when the triggering resource scheduling conditions are determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period, performing resource scheduling according to a preset resource scheduling strategy and entering the cooling period again.
Based on the same inventive concept, embodiments of the present invention further provide a computer-readable storage medium, and since a principle of solving a problem when a processor executes a computer program stored in the computer-readable storage medium is similar to a resource scheduling method, implementation of the computer program stored in the computer-readable storage medium by the processor may refer to implementation of the method, and repeated details are not repeated.
The computer readable storage medium may be any available medium or data storage device that can be accessed by a processor in an electronic device, including but not limited to magnetic memory such as floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc., optical memory such as CDs, DVDs, BDs, HVDs, etc., and semiconductor memory such as ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs), etc.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (18)

1. A method for scheduling resources, the method comprising:
deploying each application of a Hadoop system in a Kubernetes system, and acquiring resource occupancy rate data of each application;
and when the triggering resource scheduling conditions are determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period, performing resource scheduling according to a preset resource scheduling strategy and entering the cooling period again.
2. The method of claim 1, wherein the deploying applications of the Hadoop system in a Kubernetes system comprises:
acquiring deployment configuration files of all applications of the Hadoop system, and determining the version information and the number of the copies of the image files of all the applications according to the deployment configuration files of all the applications;
and acquiring a corresponding image file from an image warehouse according to the image file version information of each application, and deploying each application in the Kubernetes system according to the copy number and the image file of each application.
3. The method of claim 1, wherein the acquiring resource occupancy rate data of each application and the determining that the triggering resource scheduling condition is met according to the resource occupancy rate data of each application comprise:
acquiring the storage resource occupancy rate data of each application, determining average storage resource occupancy rate data according to the storage resource occupancy rate data of each application, and determining that the triggering resource scheduling condition is met when the average storage resource occupancy rate data is greater than a preset first occupancy rate threshold or the average storage resource occupancy rate data is less than a preset second occupancy rate threshold.
4. The method of claim 1, wherein the acquiring resource occupancy rate data of each application and the determining that the triggering resource scheduling condition is met according to the resource occupancy rate data of each application comprise:
acquiring the computing resource occupancy rate data of each application, determining average computing resource occupancy rate data according to the computing resource occupancy rate data of each application, and determining that the triggering resource scheduling condition is met when the average computing resource occupancy rate data is greater than a preset third occupancy rate threshold or the average computing resource occupancy rate data is less than a preset fourth occupancy rate threshold.
5. The method of claim 3, wherein the scheduling resources according to the predetermined resource scheduling policy comprises:
when the average storage resource occupancy rate data is greater than the preset first occupancy rate threshold, adding a copy instance of the Hadoop system resources; and when the average storage resource occupancy rate data is less than the preset second occupancy rate threshold, reducing the copy instances of the Hadoop system resources.
6. The method of claim 5, wherein reducing the duplicate instances of the Hadoop system resource when the average storage resource occupancy data is less than a preset second occupancy threshold comprises:
and deleting the DataNode copy instance with the largest copy ID when the average storage resource occupancy data is smaller than a preset second occupancy threshold and the current HDFS does not have the file with the insufficient copy number.
7. The method of claim 4, wherein the scheduling resources according to the predetermined resource scheduling policy comprises:
when the average computing resource occupancy rate data is greater than the preset third occupancy rate threshold, adding a copy instance of the Hadoop system resources; and when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold, reducing the copy instances of the Hadoop system resources.
8. The method of claim 7, wherein reducing duplicate instances of the Hadoop system resource when the average computing resource occupancy data is less than a preset fourth occupancy threshold comprises:
and when the average computing resource occupancy rate data is less than the preset fourth occupancy rate threshold and it is determined that the NodeManager with the largest copy ID is not executing any task, deleting the NodeManager copy instance with the largest copy ID; if the NodeManager with the largest copy ID is executing a task, waiting until that task finishes and then deleting the NodeManager copy instance with the largest copy ID.
9. An apparatus for scheduling resources, the apparatus comprising:
the deployment module is used for deploying each application of the Hadoop system in the Kubernetes system and acquiring resource occupancy rate data of each application;
and the scheduling module is used for scheduling resources according to a preset resource scheduling strategy and reentering the cooling period when the triggering resource scheduling conditions are determined to be met according to the resource occupancy rate data of each application and the current time is not in the preset cooling period.
10. The apparatus according to claim 9, wherein the deployment module is specifically configured to obtain a deployment configuration file of each application of the Hadoop system, and determine, according to the deployment configuration file of each application, image file version information and a number of copies of each application; and acquiring a corresponding image file from an image warehouse according to the image file version information of each application, and deploying each application in the Kubernetes system according to the copy number and the image file of each application.
11. The apparatus according to claim 9, wherein the deployment module is specifically configured to obtain the storage resource occupancy data of each application, and determine average storage resource occupancy data according to the storage resource occupancy data of each application;
the scheduling module is specifically configured to determine that the triggering resource scheduling condition is met when the average storage resource occupancy data is greater than a preset first occupancy threshold, or the average storage resource occupancy data is less than a preset second occupancy threshold.
12. The apparatus according to claim 9, wherein the deployment module is specifically configured to obtain the computing resource occupancy data of each application, and determine average computing resource occupancy data according to the computing resource occupancy data of each application;
the scheduling module is specifically configured to determine that the triggering resource scheduling condition is met when the average computing resource occupancy data is greater than a preset third occupancy threshold, or the average computing resource occupancy data is less than a preset fourth occupancy threshold.
13. The apparatus according to claim 11, wherein the scheduling module is specifically configured to, when the average storage resource occupancy data is greater than a preset first occupancy threshold, increase duplicate instances of the Hadoop system resource; and when the average occupancy rate data of the storage resources is smaller than a preset second occupancy rate threshold, reducing the duplicate instances of the Hadoop system resources.
14. The apparatus according to claim 13, wherein the scheduling module is specifically configured to delete the DataNode replica instance with the largest replica ID when the average storage resource occupancy data is smaller than a preset second occupancy threshold and it is determined that there is no file with an insufficient number of replicas in the current HDFS.
15. The apparatus according to claim 12, wherein the scheduling module is specifically configured to increase duplicate instances of the Hadoop system resource when the average computing resource occupancy data is greater than a preset third occupancy threshold; and when the average occupancy rate data of the computing resources is smaller than a preset fourth occupancy rate threshold, reducing the duplicate instances of the Hadoop system resources.
16. The apparatus of claim 15, wherein the scheduling module is specifically configured to delete the NodeManager copy instance with the largest copy ID when the average computing resource occupancy data is less than a preset fourth occupancy threshold and it is determined that the NodeManager with the largest copy ID is not executing any task, and to delete the NodeManager copy instance with the largest copy ID after the task finishes when the NodeManager with the largest copy ID is executing a task.
17. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 8 when executing a program stored in the memory.
18. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 8.
CN202110925704.XA 2021-08-12 2021-08-12 Resource scheduling method, device, equipment and medium Active CN113377500B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110925704.XA CN113377500B (en) 2021-08-12 2021-08-12 Resource scheduling method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110925704.XA CN113377500B (en) 2021-08-12 2021-08-12 Resource scheduling method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113377500A true CN113377500A (en) 2021-09-10
CN113377500B CN113377500B (en) 2021-12-14

Family

ID=77576948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110925704.XA Active CN113377500B (en) 2021-08-12 2021-08-12 Resource scheduling method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113377500B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160127206A1 (en) * 2012-02-29 2016-05-05 Vmware, Inc. Rack awareness data storage in a cluster of host computing devices
CN104035836A (en) * 2013-03-06 2014-09-10 阿里巴巴集团控股有限公司 Automatic disaster tolerance recovery method and system in cluster retrieval platform
US10191778B1 (en) * 2015-11-16 2019-01-29 Turbonomic, Inc. Systems, apparatus and methods for management of software containers
CN107590001A (en) * 2017-09-08 2018-01-16 北京京东尚科信息技术有限公司 Load-balancing method and device, storage medium, electronic equipment
CN109086135A (en) * 2018-07-26 2018-12-25 北京百度网讯科技有限公司 Resource telescopic method, device, computer equipment and storage medium
CN112199194A (en) * 2020-10-14 2021-01-08 广州虎牙科技有限公司 Container cluster-based resource scheduling method, device, equipment and storage medium
CN112988398A (en) * 2021-04-26 2021-06-18 北京邮电大学 Micro-service dynamic scaling and migration method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
EUNSOOK KIM等: ""On the Resource Management of Kubernetes"", 《2021 INTERNATIONAL CONFERENCE ON INFORMATION NETWORKING (ICOIN)》 *
吴振: "Design and Implementation of an Energy Consumption Optimization System for Container-Based Hadoop Clusters", China Masters' Theses Full-text Database, Information Science and Technology
宋霖: "Design and Implementation of a Kubernetes-Based Resource Scheduling and Monitoring System", China Masters' Theses Full-text Database, Information Science and Technology

Also Published As

Publication number Publication date
CN113377500B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
US11425194B1 (en) Dynamically modifying a cluster of computing nodes used for distributed execution of a program
Mvondo et al. OFC: an opportunistic caching system for FaaS platforms
US9276987B1 (en) Identifying nodes already storing indicated input data to perform distributed execution of an indicated program in a node cluster
US9411814B2 (en) Predictive caching and fetch priority
US8321558B1 (en) Dynamically monitoring and modifying distributed execution of programs
US11323514B2 (en) Data tiering for edge computers, hubs and central systems
US20220083363A1 (en) Virtual Containers Configured to Support Multiple Machine Learning Models
CN112948450B (en) Method and device for Flink streaming processing engine for real-time recommendation and computer equipment
CN111381928B (en) Virtual machine migration method, cloud computing management platform and storage medium
US11469943B2 (en) Pre-scheduling for cloud resource provisioning
CN112579692B (en) Data synchronization method, device, system, equipment and storage medium
US11132259B2 (en) Patch reconciliation of storage nodes within a storage cluster
CN116483546B (en) Distributed training task scheduling method, device, equipment and storage medium
CN113608838A (en) Deployment method and device of application image file, computer equipment and storage medium
CN115617468A (en) Tenant resource management method and tenant management system
CN115336237A (en) Predictive provisioning of remotely stored files
CN113377500B (en) Resource scheduling method, device, equipment and medium
CN112463305A (en) Management method, system and related device of cloud virtualization GPU
CN110298031B (en) Dictionary service system and model version consistency distribution method
CN115328608A (en) Kubernetes container vertical expansion adjusting method and device
CN108376104B (en) Node scheduling method and device and computer readable storage medium
Truyen et al. Flexible migration in blue-green deployments within a fixed cost
CN110287004B (en) Basic environment mirror image preheating method and device based on docker container technology
US11256607B1 (en) Adaptive resource management for instantly provisioning test environments via a sandbox service
CN112596741B (en) Video monitoring service deployment method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant