CN106790636A - Load balancing system and method for a cloud computing server cluster - Google Patents

Load balancing system and method for a cloud computing server cluster

Info

Publication number
CN106790636A
Authority
CN
China
Prior art keywords
user
module
gpu
gpu calculation
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710013263.XA
Other languages
Chinese (zh)
Inventor
姜意
李永军
张义
周邦宇
谭苗苗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Blue Polytron Technologies Inc
Original Assignee
Shanghai Blue Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Blue Polytron Technologies Inc filed Critical Shanghai Blue Polytron Technologies Inc
Priority to CN201710013263.XA priority Critical patent/CN106790636A/en
Publication of CN106790636A publication Critical patent/CN106790636A/en
Pending legal-status Critical Current


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1008Server selection for load balancing based on parameters of servers, e.g. available memory or workload

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a load balancing system and method for a cloud computing server cluster that uses a two-layer resource allocation policy. Specifically: when a user logs in, the login module, monitoring module, and allocation module in the management server cooperate to assign the user to a particular GPU computing server. When the user submits a job on that GPU computing server, the resource estimation module of the management server estimates the resources the job will occupy; combined with the real-time resource usage of the GPU computing servers reported by the monitoring module, if the allocated resources are found insufficient to complete the task, the allocation module performs a second-layer allocation. The invention is based on an event-driven mechanism and realizes cluster load balancing that is automated, intelligent, and performed online, with user experience as the primary consideration, making load balancing simpler, more efficient, and transparent, and ensuring high performance, high quality of service, and elastic scaling of the cluster system.

Description

Load balancing system and method for a cloud computing server cluster
Technical field
The present invention relates to the technical field of cluster servers, and in particular to a load balancing system and method for a cloud computing server cluster.
Background technology
With the growing demand for computation, cloud computing is developing ever faster. Cloud computing clusters generally use large-scale computing modules to perform computation; to guarantee the security and stability of the cluster and the user experience, a load-balanced cluster is necessary.
Conventional cluster load balancing systems can typically only allocate resources once. The drawback of this mechanism is that when, after allocation, the user actually submits a task and starts computing, the server resources allocated according to the rule do not necessarily meet the user's computing demand; and because the allocation is single-layer, one-way, and static, it easily causes server resource overload and a poor user experience.
In view of problems such as cost-effectiveness and user experience, making cluster load balancing automated, intelligent, and online is therefore a problem to be solved urgently.
Content of the invention
The object of the invention is to provide a load balancing system and method for a cloud computing server cluster, to solve the problems that existing cluster load balancing systems are insufficiently automated, intelligent, and online, and offer poor cost-effectiveness and user experience.
To achieve the above object, the present invention provides a load balancing system and method for a cloud computing server cluster. Specifically, a load balancing system for a cloud computing server cluster includes: a storage server for storing user data and user computation data, a group of GPU computing servers for executing users' computing tasks, and a management server for allocating users to and monitoring the running state of the GPU computing servers;
Wherein, the management server includes:
Login module: for obtaining and saving user login information and forwarding it to the monitoring module and the allocation module;
Monitoring module: for monitoring the number of users, GPU usage, and memory usage of each GPU computing server;
Allocation module: for allocating a logged-in user to a specific computing server;
Resource estimation module: for estimating the resources that will be occupied after a user submits a job.
Further, the maximum number of concurrent users that each GPU computing server can accommodate is set by the administrator.
Further, the storage server is provided with a public directory and user directories; the user directories include a directory for every user, and after logging in a user can access the information and data in his or her own directory.
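By way of illustration only, the following is a minimal Python sketch of how the monitored state of a GPU computing server and the cooperation of the login, monitoring, and allocation modules might be represented; all names, fields, and defaults are assumptions made for the sketch and are not specified by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class GpuServerStatus:
    """State of one GPU computing server as reported to the monitoring module."""
    name: str
    user_count: int = 0     # current number of logged-in users
    gpu_usage: float = 0.0  # GPU occupancy, 0.0 .. 1.0
    mem_usage: float = 0.0  # memory occupancy, 0.0 .. 1.0
    max_users: int = 8      # maximum concurrent users, set by the administrator

@dataclass
class ManagementServer:
    """Holds the per-server state and the first-layer allocation logic."""
    servers: dict = field(default_factory=dict)  # name -> GpuServerStatus

    def report(self, status: GpuServerStatus) -> None:
        """Monitoring module: record a periodic status report from a server."""
        self.servers[status.name] = status

    def login(self, user_id: str) -> str:
        """Login module + allocation module: assign a new user to the GPU
        computing server with the fewest current users (first layer)."""
        target = min(self.servers.values(), key=lambda s: s.user_count)
        target.user_count += 1
        return target.name
```

For example, once `report()` has been called for every server, a call such as `login("alice")` would return the name of the currently least-loaded server.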
A load balancing method for a cloud computing server cluster based on the above system includes the following steps:
a. When a user logs in, the login module in the management server obtains the user's login information and then triggers a login event. After the login event is triggered, the monitoring module immediately sends the number of users, GPU usage, and memory usage of all GPU computing servers to the allocation module, and the allocation module assigns the user to the GPU computing server with the fewest current users;
b. The user accesses data on the storage server and runs tasks on the GPU computing server assigned by the management server. When a task is submitted, a resource estimation event is triggered, and the resource estimation module in the management server estimates the resources the task will occupy according to the description and the parameters selected when the task was submitted. The estimation method is: according to the data type, size, and the software used in the task workflow, look up the expected GPU usage and the required memory, IO, and network capacity in a database built from experience accumulated from previous job tasks; based on the estimation result and the real-time resource usage reported by the monitoring module, if the allocated resources are found insufficient to complete the task, the allocation module performs re-allocation;
c. After the task finishes running, the user logs off, and the process ends.
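A minimal sketch of the second-layer allocation in step b, reusing the GpuServerStatus/ManagementServer sketch above; the estimation lookup, the headroom check, and the tie-breaking rule here are illustrative assumptions rather than the patent's concrete rules.

```python
def estimate_demand(data_type: str, software: str, history_db: dict) -> dict:
    """Resource estimation module: look up the expected GPU and memory demand of a
    task in a database built from previous jobs (keying scheme is assumed)."""
    return history_db.get((software, data_type), {"gpu": 0.5, "mem": 0.5})

def submit_job(mgmt, user_id: str, server_name: str,
               data_type: str, software: str, history_db: dict) -> str:
    """Second-layer allocation: keep the first-layer server if it has enough
    headroom for the estimated demand, otherwise re-allocate to the server that
    best satisfies the demand according to the monitored real-time usage."""
    demand = estimate_demand(data_type, software, history_db)
    fits = lambda s: (s.gpu_usage + demand["gpu"] <= 1.0 and
                      s.mem_usage + demand["mem"] <= 1.0)
    current = mgmt.servers[server_name]
    if fits(current):
        return server_name                      # allocated resources suffice
    candidates = [s for s in mgmt.servers.values() if fits(s)]
    if not candidates:
        return server_name                      # nothing better; keep assignment
    best = min(candidates, key=lambda s: s.gpu_usage + s.mem_usage)
    return best.name                            # re-allocated by the allocation module
```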
Further, in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if more than one GPU computing server has the minimum number of current users, the user is assigned to those GPU computing servers in sequence.
Further, in step a, after the user logs in, the data in the public directory and the user's directory on the storage server are accessed.
Further, in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if the monitoring module detects that all GPU computing servers have reached the maximum number of concurrent users, it sends information to the login module, and the login module sends a prompt to the user.
Further, in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if the monitoring module detects that all GPU computing servers have reached 80% of the maximum number of concurrent users, a prompt is sent to the administrator.
Further, the administrator has the authority to view the number of users, GPU usage, and memory usage of the GPU computing servers in the monitoring module, and the authority to stop a user's process; the authority to view a user's personal data on the storage server belongs to that user alone.
Further, the login module in the management server stores the user's login information in a database. In step c, when logging off, the user can choose to go temporarily offline or to end the process. If the user chooses to go temporarily offline, the login information is not deleted from the database, the system still regards the user as logged in, and when the user logs in again from a web page or client, he or she is automatically matched to the previous GPU computing server and the previous process continues. If the user chooses to end the process, the login information is deleted from the login module's database, and a GPU computing server will be re-allocated at the next login.
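The two logout options described above (temporarily offline versus ending the process) can be pictured with a small sketch; the record layout and function names are assumptions used only to make the two branches concrete.

```python
def logout(login_db: dict, user_id: str, end_process: bool) -> None:
    """'End process': delete the login record, so the next login is re-allocated.
    'Temporarily offline': keep the record (including the assigned server), so
    the next login resumes on the same GPU computing server."""
    if end_process:
        login_db.pop(user_id, None)
    else:
        record = login_db.get(user_id, {})
        record["online"] = False
        login_db[user_id] = record

def login_again(login_db: dict, user_id: str, mgmt) -> str:
    """If a kept record exists, resume on the previously assigned server;
    otherwise perform a fresh first-layer allocation."""
    record = login_db.get(user_id)
    if record and "server" in record:
        record["online"] = True
        return record["server"]
    server = mgmt.login(user_id)
    login_db[user_id] = {"server": server, "online": True}
    return server
```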
The invention has the following advantages: 1. A cloud computing platform often serves a specific field, which means that user tasks are concentrated within a limited range; this makes it possible to estimate resources when a user submits a task. This scheme uses resource pre-estimation to perform two-layer resource allocation, striving for dynamic, full, and effective resource utilization. 2. The invention realizes cluster load balancing through an event-driven two-layer allocation mechanism that is automated, intelligent, and performed online; user experience is the primary consideration when balancing load, making load balancing simpler, more efficient, and transparent, and ensuring high performance, high quality of service, and elastic scaling of the cluster system.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the present invention.
Fig. 2 is the flow chart of the method of the present invention.
Specific embodiment
The following examples are used to illustrate the present invention but do not limit its scope.
The technical scheme is deployed first, including the physical structure of the system; for details, refer to Fig. 1. The physical structure of the load balancing system for a cloud computing server cluster includes:
a management server, for allocating users and monitoring the running state of the computing servers,
GPU computing servers, for executing users' computing tasks,
a storage server, for storing user data and computation data.
The above management server includes:
a login module, for obtaining and saving user login information and forwarding it to the monitoring module and the allocation module;
a monitoring module, for monitoring the number of users, GPU usage, and memory usage of each computing server;
an allocation module, for allocating a logged-in user to a specific computing server;
a resource estimation module, for estimating the resources that may be occupied after a user submits a job.
There may be multiple GPU computing servers in the present invention, and the maximum number of concurrent users each GPU computing server can accommodate is defined by the administrator.
Further, the storage server of the present invention has a public directory and user directories, and the user directories include a directory for every user.
Referring to Fig. 2, the flow in which a user uses the load balancing system for a cloud computing server cluster of the present invention is as follows:
a) the user initiates a login from a web page or client;
b) the login module in the management server obtains the user's login information and then triggers a login event;
c) after the login event is triggered, the monitoring module in the management server sends the number of users, GPU usage, and memory usage of all GPU computing servers to the allocation module in the management server;
d) the allocation module performs the first-layer allocation and assigns the user to the GPU computing server with the fewest current users; if more than one GPU computing server has the minimum number of current users, the user is assigned to the GPU computing servers in sequence;
e) the user submits a task on the assigned GPU computing server, triggering a resource estimation event;
f) the resource estimation module estimates the resource requirements according to the description and the parameters selected when the user submitted the task; the estimation method is: according to the data type, size, and the software used in the task workflow, look up the expected GPU usage and the required memory, IO, and network capacity in a database built from experience accumulated from previous job tasks (a sketch of this lookup follows this list);
g) the allocation module performs the second-layer allocation according to the estimated resource occupancy and the current usage of each GPU computing server, and assigns the user to the GPU computing server that can best satisfy the user's demand;
h) the user runs the task;
i) the user can access data in the public directory and his or her own directory on the storage server, and can store data in his or her own directory;
j) every time the user submits a task, a resource estimation event is triggered and the two-layer allocation is performed;
k) when the user exits, he or she can choose to go temporarily offline or to end the process; if the user chooses to go temporarily offline, the login information is not deleted from the database, the system still regards the user as logged in, and when the user logs in again from a web page or client, he or she is automatically matched to the previous GPU computing server and the previous process continues; if the user chooses to end the process, the login information is deleted from the login module's database, and a GPU computing server will be re-allocated at the next login.
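Step f) above relies on a database accumulated from past jobs. The following is a hedged sketch of such a lookup, with entirely illustrative keys, values, and scaling rule; the patent does not specify the schema of this database.

```python
# Hypothetical experience database: (software, data_type) -> expected demand.
EXPERIENCE_DB = {
    ("render", "mesh"):   {"gpu": 0.6, "mem_gb": 16, "io_mbps": 200, "net_mbps": 100},
    ("train",  "images"): {"gpu": 0.9, "mem_gb": 32, "io_mbps": 500, "net_mbps": 300},
}

def lookup_demand(software: str, data_type: str, data_size_gb: float) -> dict:
    """Return the expected GPU usage and the required memory, IO, and network
    capacity for a task; memory is scaled with data size as an assumed heuristic."""
    default = {"gpu": 0.5, "mem_gb": 8, "io_mbps": 100, "net_mbps": 50}
    estimate = dict(EXPERIENCE_DB.get((software, data_type), default))
    estimate["mem_gb"] = max(estimate["mem_gb"], 1.5 * data_size_gb)
    return estimate
```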
In this method flow, when a login event is triggered, if the monitoring module detects that all GPU computing servers have reached the maximum number of concurrent users, it sends information to the login module, and the login module prompts the user: "The number of logged-in users has reached the upper limit; please contact the administrator."
If the monitoring module detects that all GPU computing servers have reached 80% of the maximum number of concurrent users, the system prompts the administrator: "There are many logged-in users; please add GPU computing servers."
The administrator can view the monitoring module and obtain the number of users, GPU usage, and memory usage of all GPU computing servers, but has no authority to view users' personal data on the storage server; the administrator can also stop the process of a given user.
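Both prompts reduce to simple threshold checks over the monitored state. Here is a sketch assuming the GpuServerStatus fields introduced earlier; the 80% threshold and the prompt wording follow the description, while the function shape is illustrative.

```python
from typing import Iterable, Optional

def capacity_prompt(servers: Iterable) -> Optional[str]:
    """Return a prompt when every GPU computing server is full (to the user) or
    near full (to the administrator); otherwise return None."""
    servers = list(servers)
    if all(s.user_count >= s.max_users for s in servers):
        return ("to user: the number of logged-in users has reached the upper "
                "limit, please contact the administrator")
    if all(s.user_count >= 0.8 * s.max_users for s in servers):
        return ("to administrator: there are many logged-in users, please add "
                "GPU computing servers")
    return None
```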
Embodiment 1
Step 1: the monitoring module in the management server monitors the GPU computing servers by receiving the number of users, GPU usage, memory usage, and other information periodically sent by the GPU computing servers. If the number of users on all GPU computing servers reaches 80% of the maximum number of concurrent users, step 2 is performed; if the number of users on all GPU computing servers reaches the maximum number of concurrent users, step 3 is performed; otherwise, step 4 is performed.
Step 2: the system prompts the administrator "There are many logged-in users; please add GPU computing servers." The administrator brings one additional GPU computing server Sn+1 online; go to step 4.
Step 3: the user initiates a login from a web page or client; the login module prompts the user "The number of logged-in users has reached the upper limit; please contact the administrator." Go to step 14.
Step 4: the user initiates a login from a web page or client, which triggers a login event; the login module sends the user's login information to the monitoring module and saves the login information in the database.
Step 5: the monitoring module sends the number of users, GPU usage, and memory usage of all GPU computing servers to the allocation module in the management server.
Step 6: if exactly one GPU computing server has the minimum number of current users, perform step 7; if more than one GPU computing server has the minimum number of current users, perform step 8.
Step 7: the allocation module assigns the user to the GPU computing server with the fewest users; if the administrator performed the operation of adding GPU computing server Sn+1 in step 2, users are first assigned to Sn+1 until its number of users matches that of the other GPU computing servers. Go to step 9.
Step 8: the user is assigned to a GPU computing server in the order S1, S2, S3, ..., Sn, Sn+1 (a sketch of this selection rule follows the embodiment).
Step 9: the user submits a task, triggering a resource estimation event; the resource estimation module estimates the resources according to the description and parameters of the submitted task. If the currently assigned GPU computing server is estimated to be able to satisfy the computing demand, perform step 11; otherwise perform step 10.
Step 10: the allocation module performs the second-layer allocation and scheduling according to the resource occupancy estimated by the resource estimation module and the current resource usage of all GPU computing servers observed by the monitoring module.
Step 11: the user computes on the assigned GPU computing server; the user can access data in the public directory and his or her own directory on the storage server and can store data in his or her own directory.
Step 12: the user exits and can choose to go temporarily offline or to end the process; if the user chooses to go temporarily offline, perform step 13; if the user chooses to end the process, perform step 14.
Step 13: the login information is not deleted from the database, and the system still regards the user as logged in; when the user logs in again from a web page or client, he or she is automatically matched to the previous GPU computing server and the previous process continues.
Step 14: the login information is deleted from the login module's database; the next time the user logs in, allocation of a GPU computing server starts again from step 1.
Step 15: the user logs off, and the process ends.
In the present invention, step 1 is always executing. The present invention realizes cluster load balancing through an event-driven two-layer allocation mechanism that is automated, intelligent, and performed online; user experience is the primary consideration when balancing load, making load balancing simpler, more efficient, and transparent, and ensuring high performance, high quality of service, and elastic scaling of the cluster system.
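Steps 6 to 8 of the embodiment can be read as the following selection rule. This is a sketch only: the ordering S1, S2, ..., Sn, Sn+1 and the "fill the newly added server first" behaviour follow steps 7 and 8, while the function shape and names are assumptions.

```python
def pick_server(servers: list, newly_added=None):
    """servers: GpuServerStatus objects in the order S1, S2, ..., Sn (, Sn+1).
    If the administrator has just added a server, fill it first until its user
    count catches up with the least-loaded of the others (step 7); otherwise take
    the first server, in order, with the minimum current user count (steps 6, 8)."""
    if newly_added is not None:
        others = [s for s in servers if s is not newly_added]
        if others and newly_added.user_count < min(s.user_count for s in others):
            return newly_added
    least = min(s.user_count for s in servers)
    return next(s for s in servers if s.user_count == least)
```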
Although the present invention has been described in detail above using a general description and specific embodiments, some modifications or improvements can be made on the basis of the present invention, as will be apparent to those skilled in the art. Therefore, such modifications or improvements made without departing from the spirit of the present invention all belong to the scope of protection claimed by the present invention.

Claims (10)

1. A load balancing system for a cloud computing server cluster, characterized in that the system includes: a storage server for storing user data and user computation data, a group of GPU computing servers for executing users' computing tasks, and a management server for allocating users to and monitoring the running state of the GPU computing servers;
Wherein, the management server includes:
Login module: for obtaining and saving user login information and forwarding it to the monitoring module and the allocation module;
Monitoring module: for monitoring the number of users, GPU usage, memory usage, IO, and network capacity of each GPU computing server;
Allocation module: for allocating a logged-in user to a specific computing server;
Resource estimation module: for estimating the resources that will be occupied after a user submits a job.
2. The load balancing system for a cloud computing server cluster according to claim 1, characterized in that the maximum number of concurrent users that each GPU computing server can accommodate is set by the administrator.
3. The load balancing system for a cloud computing server cluster according to claim 1, characterized in that the storage server is provided with a public directory and user directories, the user directories include a directory for every user, and after logging in a user can access the information and data in his or her own directory.
4. A load balancing method for a cloud computing server cluster using the system according to claim 1, characterized in that the method comprises the following steps:
a. When a user logs in, the login module in the management server obtains the user's login information and then triggers a login event; after the login event is triggered, the monitoring module immediately sends the number of users, GPU usage, and memory usage of all GPU computing servers to the allocation module, and the allocation module assigns the user to the GPU computing server with the fewest current users;
b. The user accesses data on the storage server and runs tasks on the GPU computing server assigned by the management server; when a task is submitted, a resource estimation event is triggered, and the resource estimation module in the management server estimates the resources the task will occupy according to the description and the parameters selected at submission; the estimation method is: according to the data type, size, and the software used in the task workflow, look up the expected GPU usage and the required memory, IO, and network capacity in a database built from experience accumulated from previous job tasks; based on the estimation result and the real-time resource usage reported by the monitoring module, if the allocated resources are found insufficient to complete the task, the allocation module performs re-allocation;
c. After the task finishes running, the user logs off, and the process ends.
5. The load balancing method for a cloud computing server cluster according to claim 4, characterized in that in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if more than one GPU computing server has the minimum number of current users, the user is assigned to those GPU computing servers in sequence.
6. The load balancing method for a cloud computing server cluster according to claim 4, characterized in that in step a, after the user logs in, the data in the public directory and the user's directory on the storage server are accessed.
7. The load balancing method for a cloud computing server cluster according to claim 4, characterized in that in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if the monitoring module detects that all GPU computing servers have reached the maximum number of concurrent users, it sends information to the login module, and the login module sends a prompt to the user.
8. The load balancing method for a cloud computing server cluster according to claim 4, characterized in that in step a, when the allocation module assigns the user to the GPU computing server with the fewest current users, if the monitoring module detects that all GPU computing servers have reached 80% of the maximum number of concurrent users, a prompt is sent to the administrator.
9. The load balancing method for a cloud computing server cluster according to claim 8, characterized in that the administrator has the authority to view the number of users, GPU usage, and memory usage of the GPU computing servers in the monitoring module, and the authority to stop a user's process; the authority to view a user's personal data on the storage server belongs to that user alone.
10. The load balancing method for a cloud computing server cluster according to claim 4, characterized in that the login module in the management server stores the user's login information in a database; in step c, when logging off, the user can choose to go temporarily offline or to end the process; if the user chooses to go temporarily offline, the login information is not deleted from the database, the system still regards the user as logged in, and when the user logs in again from a web page or client, he or she is automatically matched to the previous GPU computing server and the previous process continues; if the user chooses to end the process, the login information is deleted from the login module's database, and a GPU computing server will be re-allocated at the next login.
CN201710013263.XA 2017-01-09 2017-01-09 Load balancing system and method for a cloud computing server cluster Pending CN106790636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710013263.XA CN106790636A (en) 2017-01-09 2017-01-09 Load balancing system and method for a cloud computing server cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710013263.XA CN106790636A (en) 2017-01-09 2017-01-09 Load balancing system and method for a cloud computing server cluster

Publications (1)

Publication Number Publication Date
CN106790636A true CN106790636A (en) 2017-05-31

Family

ID=58950733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710013263.XA Pending CN106790636A (en) 2017-01-09 2017-01-09 Load balancing system and method for a cloud computing server cluster

Country Status (1)

Country Link
CN (1) CN106790636A (en)



Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101753832A (en) * 2008-12-04 2010-06-23 北京中星微电子有限公司 Cloud mirror control method in video monitoring system, system and central platform server
CN102333106A (en) * 2010-07-13 2012-01-25 中国移动通信集团公司 Peer-to-peer (P2P) system resource scheduling method and device and system thereof
CN102185779A (en) * 2011-05-11 2011-09-14 田文洪 Method and device for realizing data center resource load balance in proportion to comprehensive allocation capability
CN102223419A (en) * 2011-07-05 2011-10-19 北京邮电大学 Virtual resource dynamic feedback balanced allocation mechanism for network operation system
CN102594869A (en) * 2011-12-30 2012-07-18 深圳市同洲视讯传媒有限公司 Method and device for dynamically distributing resources under cloud computing environment
CN103257896A (en) * 2013-01-31 2013-08-21 南京理工大学连云港研究院 Max-D job scheduling method under cloud environment
CN103729246A (en) * 2013-12-31 2014-04-16 浪潮(北京)电子信息产业有限公司 Method and device for dispatching tasks
CN103763378A (en) * 2014-01-24 2014-04-30 中国联合网络通信集团有限公司 Task processing method and system and nodes based on distributive type calculation system
CN103823718A (en) * 2014-02-24 2014-05-28 南京邮电大学 Resource allocation method oriented to green cloud computing
CN104202388A (en) * 2014-08-27 2014-12-10 福建富士通信息软件有限公司 Automatic load balancing system based on cloud platform
CN104794194A (en) * 2015-04-17 2015-07-22 同济大学 Distributed heterogeneous parallel computing system facing large-scale multimedia retrieval
US20160323377A1 (en) * 2015-05-01 2016-11-03 Amazon Technologies, Inc. Automatic scaling of resource instance groups within compute clusters
CN104951372A (en) * 2015-06-16 2015-09-30 北京工业大学 Method for dynamic allocation of Map/Reduce data processing platform memory resources based on prediction
CN106027318A (en) * 2016-07-24 2016-10-12 成都育芽科技有限公司 Cloud computing-based two-level optimal scheduling management platform for virtual machine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HIEN NGUYEN VAN, et al.: "SLA-aware Virtual Resource Management for Cloud Infrastructures", 2009 Ninth IEEE International Conference on Computer and Information Technology *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391257B (en) * 2017-06-30 2020-10-13 北京奇虎科技有限公司 Method, device and server for estimating memory capacity required by service
CN107391627A (en) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 EMS memory occupation analysis method, device and the server of data
CN107391257A (en) * 2017-06-30 2017-11-24 北京奇虎科技有限公司 Predictor method, device and the server of memory size needed for business
CN107391627B (en) * 2017-06-30 2020-11-03 北京奇虎科技有限公司 Data memory occupation analysis method and device and server
CN107463642A (en) * 2017-07-19 2017-12-12 北京京东尚科信息技术有限公司 The method and apparatus for lifting Tool for Data Warehouse resource utilization
CN110019110B (en) * 2017-07-28 2022-11-18 腾讯科技(深圳)有限公司 Capacity management method, device and equipment of service system and service system
CN110019110A (en) * 2017-07-28 2019-07-16 腾讯科技(深圳)有限公司 A kind of capacity management methods of operation system, device, equipment and operation system
CN109561054A (en) * 2017-09-26 2019-04-02 华为技术有限公司 A kind of data transmission method, controller and access device
CN108595367A (en) * 2018-04-25 2018-09-28 银川华联达科技有限公司 A kind of server system based on computer cluster in LAN
CN108595367B (en) * 2018-04-25 2021-12-10 广州高专资讯科技有限公司 Server system based on computer cluster in local area network
CN109165093B (en) * 2018-07-31 2022-07-19 宁波积幂信息科技有限公司 System and method for flexibly distributing computing node cluster
CN109165093A (en) * 2018-07-31 2019-01-08 宁波积幂信息科技有限公司 A kind of calculate node cluster elasticity distribution system and method
CN109271251A (en) * 2018-08-09 2019-01-25 深圳市瑞云科技有限公司 A method of by constraining come scheduling node machine
CN112685170A (en) * 2019-10-18 2021-04-20 伊姆西Ip控股有限责任公司 Dynamic optimization of backup strategies
CN112685170B (en) * 2019-10-18 2023-12-08 伊姆西Ip控股有限责任公司 Dynamic optimization of backup strategies
CN111552550A (en) * 2020-04-26 2020-08-18 星环信息科技(上海)有限公司 Task scheduling method, device and medium based on GPU (graphics processing Unit) resources
CN111679911A (en) * 2020-06-04 2020-09-18 中国建设银行股份有限公司 Management method, device, equipment and medium for GPU (graphics processing Unit) card in cloud environment
CN111679911B (en) * 2020-06-04 2024-01-16 建信金融科技有限责任公司 Management method, device, equipment and medium of GPU card in cloud environment
CN113806011A (en) * 2021-08-17 2021-12-17 曙光信息产业股份有限公司 Cluster resource control method and device, cluster and computer readable storage medium
CN113806011B (en) * 2021-08-17 2023-12-19 曙光信息产业股份有限公司 Cluster resource control method and device, cluster and computer readable storage medium
CN116112497A (en) * 2022-12-29 2023-05-12 天翼云科技有限公司 Node scheduling method, device, equipment and medium of cloud host cluster

Similar Documents

Publication Publication Date Title
CN106790636A (en) A kind of equally loaded system and method for cloud computing server cluster
US9015227B2 (en) Distributed data processing system
US7627618B2 (en) System for managing data collection processes
CN108845874B (en) Dynamic resource allocation method and server
CN109062658A (en) Realize dispatching method, device, medium, equipment and the system of computing resource serviceization
CN107465708A (en) A kind of CDN bandwidth scheduling systems and method
US20100131324A1 (en) Systems and methods for service level backup using re-cloud network
Bhatia et al. Htv dynamic load balancing algorithm for virtual machine instances in cloud
CN110071965B (en) Data center management system based on cloud platform
TW201424305A (en) CDN load balancing in the cloud
CN106294472A (en) The querying method of a kind of Hadoop data base HBase and device
CN104503832B (en) A kind of scheduling virtual machine system and method for fair and efficiency balance
CN109861850B (en) SLA-based stateless cloud workflow load balancing scheduling method
US20170339069A1 (en) Allocating Cloud Computing Resources In A Cloud Computing Environment
CN105491150A (en) Load balance processing method based on time sequence and system
CN107111520A (en) Method and system for the real time resources consumption control in DCE
Nazar et al. Modified shortest job first for load balancing in cloud-fog computing
CN107967175A (en) A kind of resource scheduling system and method based on multiple-objection optimization
Choi et al. An improvement on the weighted least-connection scheduling algorithm for load balancing in web cluster systems
Petrovska et al. Features of the distribution of computing resources in cloud systems
Wei et al. Adaptive resource management for service workflows in cloud environments
CN112749008A (en) Cloud resource distribution system based on OpenStack and construction method thereof
Chatterjee et al. A new clustered load balancing approach for distributed systems
CN108234617A (en) A kind of resource dynamic dispatching method under the mixing cloud mode towards electric system
CN107426109A (en) A kind of traffic scheduling method, VNF modules and flow scheduling server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170531)