CN109062700A - Resource management method and server based on a distributed system - Google Patents

Resource management method and server based on a distributed system

Info

Publication number
CN109062700A
CN109062700A (application CN201810953290.XA)
Authority
CN
China
Prior art keywords
server
resource
node server
deep learning
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810953290.XA
Other languages
Chinese (zh)
Inventor
赵仁明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810953290.XA priority Critical patent/CN109062700A/en
Publication of CN109062700A publication Critical patent/CN109062700A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/54Interprogram communication
    • G06F9/542Event management; Broadcasting; Multicasting; Notifications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/508Monitor

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a resource management method based on a distributed system. The distributed system includes a first server that executes a Spark task, a second server that runs the ResourceManager process, and node servers on which worker processes are deployed. The method is executed by the first server and includes: when the Spark task starts, estimating the required resources from the volume of the preprocessed training data and the computational cost of a preset deep-learning neural network model, and applying to the second server for those resources; after receiving from the second server the information of the node servers with sufficient resources, broadcasting the model file of the previously exported deep-learning neural network model to each of those node servers. The scheme can automatically shard the input data and thereby complete data-parallel model training efficiently.

Description

Resource management method and server based on a distributed system
Technical field
The present invention relates to communication technology, and in particular to a resource management method and server based on a distributed system.
Background technique
YARN (Yet Another Resource Negotiator) is a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications. Its introduction brings major benefits in cluster utilization, unified resource management, and related aspects. YARN can also monitor the state of every subtask. The ApplicationMaster of YARN is a replaceable component: users can write their own AppMaster for different models, which allows many kinds of models to run under the unified YARN framework.
The concept of deep learning originates from research on artificial neural networks; a multilayer perceptron with several hidden layers is one kind of deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. TensorFlow is an artificial intelligence system developed by Google on the basis of DistBelief (Google's earlier deep learning system); it can be applied to many machine learning and deep learning fields such as speech recognition and image recognition, and it runs on devices ranging from a single smartphone to thousands of data-center servers. Spark is a fast, general-purpose computing engine designed for large-scale data processing, a universal parallel framework analogous to MapReduce on Hadoop (a distributed system platform).
When applying deep learning to practical problems with existing technology, users must manage computing, storage, and other resources themselves, and must build the deep learning framework (for example TensorFlow) on their own. Data preprocessing, dataset splitting, feature engineering, model training, model validation and evaluation, and model deployment must all be carried out manually. When there are multiple tasks, the tasks are complex, and the input data volume is large, resources cannot be matched and scheduled for the tasks automatically, and the state of the tasks cannot be monitored effectively.
Summary of the invention
To solve the above technical problems, the present invention provides a resource management method and server based on a distributed system, which automatically shard the input data so that data-parallel model training can be completed efficiently.
To achieve the object of the invention, the present invention provides a resource management method based on a distributed system, wherein the distributed system includes a first server for executing a Spark task, a second server for executing the ResourceManager process, and node servers on which worker processes are deployed. The method is executed by the first server and includes:
when the Spark task starts, estimating the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and applying to the second server for the resources;
after receiving from the second server the information of the node servers with sufficient resources, broadcasting the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
Further, the training data is stored in advance in the Hadoop distributed file system.
A server for executing a Spark task, including:
an application module, configured to, when the Spark task starts, estimate the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and apply to the second server for the resources;
a broadcast module, configured to, after receiving the information of the node servers with sufficient resources returned by the second server, broadcast the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
Further, the training data is stored in advance in the Hadoop distributed file system.
A resource management method based on a distributed system, wherein the distributed system includes a first server for executing a Spark task, a second server for executing the ResourceManager process, and node servers on which worker processes are deployed. The method is executed by the second server and includes:
after receiving the resource application of the first server, sharding the training data stored in the Hadoop distributed file system, each shard of the training data corresponding to one worker process;
creating worker processes on the node servers whose resource utilization is below a threshold, and sending the information of those node servers to the first server.
A server for executing the ResourceManager process, including:
a sharding module, configured to, after receiving the resource application of the first server, shard the training data stored in the Hadoop distributed file system, each shard corresponding to one worker process;
a creation module, configured to create worker processes on the node servers whose resource utilization is below a threshold, and to send the information of those node servers to the first server.
A resource management method based on a distributed system, wherein the distributed system includes a first server for executing a Spark task, a second server for executing the ResourceManager process, and node servers on which worker processes are deployed. The method is executed by a node server and includes:
receiving the model file of the deep-learning neural network model broadcast by the first server;
reading the corresponding training data from the Hadoop distributed file system and training the model file.
Further, after training the model file, the method also includes:
exporting the trained deep-learning neural network model and the relevant parameters and storing them in the Hadoop distributed file system.
A node server on which a worker process is deployed, including:
a receiving module, configured to receive the model file of the deep-learning neural network model broadcast by the first server;
a training module, configured to read the corresponding training data from the Hadoop distributed file system and train the model file.
Further, the node server may also include:
an export module, configured to export the trained deep-learning neural network model and the relevant parameters and store them in the Hadoop distributed file system.
A resource management method based on a distributed system, including:
when the first server for executing a Spark task starts the Spark task, estimating the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and applying for the resources to the second server for executing the ResourceManager process;
after the second server receives the resource application of the first server, sharding the training data stored in the Hadoop distributed file system, each shard corresponding to one worker process; creating worker processes on the node servers whose resource utilization is below a threshold, and sending the information of those node servers to the first server;
after the first server receives the information of the node servers with sufficient resources returned by the second server, broadcasting the model file of the previously exported deep-learning neural network model to each node server with sufficient resources;
the node server receiving the model file of the deep-learning neural network model broadcast by the first server, reading the corresponding training data from the Hadoop distributed file system, and training the model file.
A distributed system, including the above servers and node servers.
In summary, the resource management method, server, and node server based on a distributed system can build a DAG through a workflow engine and link together actions such as data preprocessing, training, model export, and model saving for deep learning, which makes it convenient to orchestrate multiple tasks. The evaluation, application, and allocation of resources are completed automatically. Through the framework's automated hyperparameter distribution and model deployment, the input data can be sharded automatically so that data-parallel model training is completed efficiently.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by implementing the invention. The objects and other advantages of the invention can be realized and obtained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Detailed description of the invention
The accompanying drawings are provided for a further understanding of the technical solution of the present invention and constitute part of the specification. Together with the embodiments of the application, they serve to explain the technical solution of the present invention and do not limit it.
Fig. 1 is a flowchart of a resource management method based on a distributed system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the server executing the Spark task according to an embodiment of the present invention;
Fig. 3 is a flowchart of the resource management method based on a distributed system on the ResourceManager process side according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the server executing the ResourceManager process according to an embodiment of the present invention;
Fig. 5 is a flowchart of the resource management method based on a distributed system on the node server side according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of the node server according to an embodiment of the present invention;
Fig. 7 is a flowchart of an application example of the resource management method based on a distributed system.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted that, in the absence of conflict, the embodiments in the application and the features in the embodiments can be combined with one another arbitrarily.
The steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.
Fig. 1 is a flowchart of a resource management method based on a distributed system according to an embodiment of the present invention. The distributed system includes a first server for executing a Spark task, a second server for executing the ResourceManager process, and node servers on which worker processes are deployed. The method is executed by the first server and, as shown in Fig. 1, includes:
Step 11: when the Spark task starts, estimating the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and applying to the second server for the resources;
Step 12: after receiving the information of the node servers with sufficient resources returned by the second server, broadcasting the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
The embodiment of the present invention proposes using the broadcast mechanism of Spark to distribute common elements such as data and model descriptions. Hadoop-related ETL (Extraction, Transformation, Loading) tools complete preprocessing work such as cleaning and converting the training data, and the data is loaded onto HDFS (Hadoop Distributed File System), so that Spark can read the data directly for model training.
Correspondingly, this embodiment provides a server 200. As shown in Fig. 2, the server 200 of this embodiment is for executing a Spark task and includes:
an application module 201, configured to, when the Spark task starts, estimate the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and apply to the second server for the resources;
a broadcast module 202, configured to, after receiving the information of the node servers with sufficient resources returned by the second server, broadcast the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
In one embodiment, the broadcast module 202 broadcasts the model file of the previously exported deep-learning neural network model to each node server with sufficient resources through the context process of Spark.
Fig. 3 is a flowchart of the resource management method based on a distributed system on the ResourceManager process side according to an embodiment of the present invention. As shown in Fig. 3, the method of this embodiment includes:
Step 31: after receiving the resource application of the first server, sharding the training data stored in the Hadoop distributed file system according to the volume of the training data and the number of worker processes;
Step 32: creating worker processes on the node servers whose resource utilization is below a threshold, and sending the information of those node servers to the first server.
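The node-selection rule of step 32 can be sketched as a simple filter over per-node utilization metrics. This is a minimal illustration under assumptions, not the patented implementation: the node names, the CPU/memory fields, and the 0.8 threshold are all invented for the example.

```python
# Hypothetical sketch of step 32: pick the nodes whose resource
# utilization is below a threshold and create a worker on each.

def select_nodes(nodes, threshold=0.8):
    """Return the names of nodes whose CPU and memory utilization
    are both below the threshold (assumed selection rule)."""
    return [name for name, usage in nodes.items()
            if usage["cpu"] < threshold and usage["mem"] < threshold]

def create_workers(nodes, threshold=0.8):
    eligible = select_nodes(nodes, threshold)
    # In the real system a worker process would be started on each
    # eligible node; here we only record the assignment.
    return {name: f"worker-on-{name}" for name in eligible}

if __name__ == "__main__":
    cluster = {
        "node-a": {"cpu": 0.35, "mem": 0.50},
        "node-b": {"cpu": 0.95, "mem": 0.40},  # CPU saturated: skipped
        "node-c": {"cpu": 0.20, "mem": 0.85},  # memory saturated: skipped
    }
    print(create_workers(cluster))  # only node-a is eligible
```

A real ResourceManager would also weigh GPU availability and queued requests; the single-threshold filter here only mirrors the "utilization below a threshold" criterion the text describes.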
With the scheme of this embodiment, the evaluation, application, and allocation of resources can be completed automatically according to the type and input of the task, which makes it convenient to monitor and manage the task state and simplifies the submission and management of tasks.
Correspondingly, this embodiment also provides a server 400 for executing the ResourceManager process. As shown in Fig. 4, the server 400 of this embodiment may include:
a sharding module 401, configured to, after receiving the resource application of the first server, shard the training data stored in the Hadoop distributed file system according to the volume of the training data and the number of worker processes;
a creation module 402, configured to create worker processes on the node servers whose resource utilization is below a threshold, and to send the information of those node servers to the first server.
Fig. 5 is a flowchart of the resource management method based on a distributed system on the node server side according to an embodiment of the present invention. The node server in this embodiment is deployed with a worker process. As shown in Fig. 5, the method of this embodiment may include:
Step 51: receiving the model file of the deep-learning neural network model broadcast by the first server;
Step 52: reading the corresponding training data from the Hadoop distributed file system and training the model file.
The method of this embodiment automates the preparation of deep learning data and the construction of the training environment, and completes the application and allocation of resources according to the task type and the data, so that data-parallel training can be carried out conveniently, improving the degree of automation of job submission and the efficiency of deep learning model training.
Correspondingly, this embodiment provides a node server 600 on which a worker process is deployed. As shown in Fig. 6, the node server 600 of this embodiment may include:
a receiving module 601, configured to receive the model file of the deep-learning neural network model broadcast by the first server;
a training module 602, configured to read the corresponding training data from the Hadoop distributed file system and train the model file.
In one embodiment, the node server 600 may also include:
an export module 603, configured to export the trained deep-learning neural network model and the relevant parameters and store them in the Hadoop distributed file system.
The embodiment of the present invention covers the automated construction of the deep learning environment, the scheduling of tasks, and the monitoring and error handling of tasks. At the same time, through the combination of deep learning with Spark, each worker conveniently holds a complete model, and parallel data processing is realized by running different data on different workers.
A deep learning library will automatically create training algorithms for neural networks of various shapes and sizes. The actual process of building a neural network, however, is more complex than merely running an algorithm on a dataset. There are usually hyperparameters to set, such as the number of neurons per layer and the learning rate. Choosing the right parameters yields a well-performing model, while bad parameters lead to prolonged training and poor inference performance in production. In practice, machine learning practitioners rerun the same model many times with different hyperparameters in order to find the optimal set.
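The rerun-with-different-hyperparameters procedure described above is, in its simplest form, a grid search: train once per combination and keep the best. A toy sketch under assumptions — the quadratic "loss" stands in for a real training run, and all names and values are illustrative:

```python
import itertools

def train_and_evaluate(lr, neurons):
    """Stand-in for a full training run: returns a fake validation
    loss that happens to be minimized at lr=0.1, neurons=64."""
    return (lr - 0.1) ** 2 + ((neurons - 64) / 64) ** 2

def grid_search(learning_rates, layer_sizes):
    # Try every combination of hyperparameters, keep the best one.
    best, best_loss = None, float("inf")
    for lr, n in itertools.product(learning_rates, layer_sizes):
        loss = train_and_evaluate(lr, n)
        if loss < best_loss:
            best, best_loss = (lr, n), loss
    return best

if __name__ == "__main__":
    print(grid_search([0.01, 0.1, 1.0], [32, 64, 128]))  # (0.1, 64)
```

In the distributed setting the patent describes, each combination could be dispatched to a different worker, which is what makes automated hyperparameter distribution attractive.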
The following is the flow of a specific embodiment of the resource management method based on a distributed system. As shown in Fig. 7, it may include:
Step 101: storing the preprocessed training data in the Hadoop distributed file system.
In this embodiment, data processing tools such as Kettle (an open-source ETL tool) or Sqoop (a tool for transferring data between Hadoop and relational databases) complete the parallel extraction, conversion, and cleaning of the training data, and the data is stored on HDFS.
Step 102: completing the design and construction of the deep-learning neural network model, and exporting the designed model as a model file.
The deep-learning neural network model can be obtained by modifying an existing mature model (such as VGG or AlexNet), for example by changing the number of layers or the number of nodes in a certain layer. A brand-new neural network model can also be designed according to the characteristics of one's own business. The design method differs from one deep learning framework to another.
Step 103: when the Spark task starts, the input data source is specified as the training data that finished processing on HDFS in step 101. The system estimates the required resources from the volume of the training data and the computational cost of the deep-learning neural network model of step 102, and applies for resources to the second server executing the ResourceManager process.
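The estimate in step 103 can be approximated from the data volume and a rough per-sample compute cost. A hypothetical heuristic — the patent does not give a formula, and every constant and field name below is an assumption made for illustration:

```python
def estimate_resources(data_bytes, model_flops_per_sample, sample_bytes,
                       flops_per_cpu=2e9, mem_overhead=2.0):
    """Rough resource estimate: number of samples times per-sample
    FLOPs gives total compute; memory is data size times an assumed
    overhead factor. Constants are illustrative, not the patent's."""
    n_samples = data_bytes // sample_bytes
    total_flops = n_samples * model_flops_per_sample
    return {
        "cpu_seconds": total_flops / flops_per_cpu,
        "memory_bytes": int(data_bytes * mem_overhead),
    }

if __name__ == "__main__":
    req = estimate_resources(data_bytes=10 * 1024**3,   # 10 GiB of data
                             model_flops_per_sample=5e6,
                             sample_bytes=4096)
    print(req)
```

The application sent to the ResourceManager would then request CPU time and memory of at least these estimated amounts.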
Training a model requires complex operations such as matrix computation, which consume large amounts of CPU, memory, storage, and other resources.
After Spark applies to the ResourceManager for resources, the ResourceManager evaluates the currently used resources against the demand. If resources are sufficient, it directly creates the corresponding resources and feeds the created resource situation back to Spark. If resources are insufficient, the request queues until resources are released, and the resources are then created.
Step 104: the ResourceManager process automatically shards the training data stored on HDFS according to the volume of the training data and the number of worker processes, with a one-to-one correspondence between the shards and the worker processes.
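The one-to-one sharding of step 104 amounts to splitting the input file list into as many contiguous shards as there are workers. A minimal sketch under assumptions (the HDFS interaction is omitted, and the `part-*` file names are made up):

```python
def shard(files, n_workers):
    """Split the list of input files into n_workers contiguous shards,
    one per worker process, with shard sizes differing by at most one."""
    base, extra = divmod(len(files), n_workers)
    shards, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # spread the remainder
        shards.append(files[start:start + size])
        start += size
    return shards

if __name__ == "__main__":
    files = [f"part-{i:05d}" for i in range(10)]
    for worker_id, s in enumerate(shard(files, 3)):
        print(worker_id, s)
```

Balancing shard sizes this way keeps the data-parallel workers roughly equally loaded, which matters because the slowest worker bounds the overall training time.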
Step 105: the ResourceManager process allocates the relevant resources according to the current resource usage of each node server, and creates worker processes on the corresponding nodes (a deep learning framework such as TensorFlow is integrated in the worker process).
The ResourceManager process evaluates the resource usage of each node: if the memory of a node is fully occupied, that node is not used; if a node still has spare resources, it can be chosen and a worker process created on it.
The ResourceManager process is responsible for managing resources. It communicates with every node and manages the resource usage and release of each node. When Spark starts a task, it must apply to the ResourceManager process for resources: if resources are sufficient the task starts, and if not it waits for resources to be released.
Step 106: after Spark receives the resources returned by the ResourceManager process, the ApplicationMaster of Spark reads the model file saved in step 102 and automatically broadcasts the model file into each worker process through the Context (context process) of Spark.
Step 107: each worker process uses the received model file and reads its own data shard on HDFS to carry out parallel model training.
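Steps 106 and 107 — broadcast the model once, then have each worker train on its own shard — can be simulated with a thread pool. This is a plain-Python sketch under assumptions: the "model" and "training" are placeholders, and in the real system the broadcast goes through Spark's context and the workers run on separate node servers, not threads.

```python
from concurrent.futures import ThreadPoolExecutor

def train_on_shard(model_file, shard):
    """Placeholder for one worker's training step: 'trains' by summing
    its shard, tagged with the broadcast model identifier."""
    return {"model": model_file, "samples_seen": len(shard),
            "checksum": sum(shard)}

def run_workers(model_file, shards):
    # Every worker receives the same broadcast model file plus its own
    # data shard, mirroring steps 106-107.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(train_on_shard, model_file, s)
                   for s in shards]
        return [f.result() for f in futures]

if __name__ == "__main__":
    shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for result in run_workers("model.pb", shards):
        print(result)
```

The key property mirrored here is that the model is distributed once and each worker touches only its own data, which is what makes the training data-parallel.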
Step 108: the worker processes export the trained model and the relevant parameters and store them on HDFS.
With the scheme of this embodiment, a DAG (Directed Acyclic Graph) can be constructed through a workflow engine, and actions such as data preprocessing, training, model export, and model saving for deep learning are linked together, which makes it convenient to orchestrate multiple tasks and improves the overall degree of automation.
With the scheme of this embodiment, the evaluation, application, and allocation of resources are completed automatically according to the type and input of the task, which makes it convenient to monitor and manage the task state and simplifies the submission and management of tasks. Through the framework's automated hyperparameter distribution and model deployment, the input data can be sharded automatically so that data-parallel model training is completed efficiently, improving the degree of automation of job submission and the efficiency of deep learning model training.
The embodiment of the present invention also provides an apparatus including a processor and a computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed by the processor, the above resource management method based on a distributed system is implemented.
The embodiment of the invention also provides a computer-readable storage medium storing computer-executable instructions which, when executed, implement the above resource management method based on a distributed system.
Those skilled in the art will appreciate that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and apparatuses, may be implemented as software, firmware, hardware, and appropriate combinations thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims (12)

1. A resource management method based on a distributed system, characterized in that the distributed system includes a first server for executing a Spark task, a second server for executing the ResourceManager process, and node servers on which worker processes are deployed, the method being executed by the first server and comprising:
when the Spark task starts, estimating the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and applying to the second server for the resources;
after receiving the information of the node servers with sufficient resources returned by the second server, broadcasting the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
2. The resource management method based on a distributed system according to claim 1, characterized in that
the training data is stored in advance in the Hadoop distributed file system.
3. A server for executing a Spark task, characterized by comprising:
an application module, configured to, when the Spark task starts, estimate the required resources from the preprocessed training data and the computational cost of a preset deep-learning neural network model, and apply to the second server for the resources;
a broadcast module, configured to, after receiving the information of the node servers with sufficient resources returned by the second server, broadcast the model file of the previously exported deep-learning neural network model to each node server with sufficient resources.
4. The server according to claim 3, characterized in that
the training data is stored in advance in the Hadoop distributed file system.
5. A resource management method based on a distributed system, wherein the distributed system comprises a first server for executing a Spark task, a second server for executing a ResourceManager process, and node servers on which worker processes are deployed, the method being executed by the second server and comprising:
after receiving a resource application from the first server, sharding the training data stored in the Hadoop distributed file system, each shard of the training data corresponding to one worker process;
creating the worker processes on the node servers whose resource utilization is below a threshold, and sending information on the node servers whose resource utilization is below the threshold to the first server.
6. A server for executing a ResourceManager process, comprising:
a sharding module, configured to, after a resource application from a first server is received, shard the training data stored in the Hadoop distributed file system, each shard of the training data corresponding to one worker process;
a creation module, configured to create the worker processes on the node servers whose resource utilization is below a threshold, and send information on the node servers whose resource utilization is below the threshold to the first server.
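Claims 5 and 6 describe the ResourceManager side: shard the HDFS training data across worker processes, and select node servers whose resource utilization is below a threshold. A self-contained sketch of that selection logic, with plain Python lists standing in for HDFS files and all names hypothetical:

```python
def shard_training_data(records, num_workers):
    """Round-robin split of the training records, one shard per worker process."""
    return [records[i::num_workers] for i in range(num_workers)]

def select_node_servers(utilization, threshold):
    """Return the node servers whose resource utilization is below the threshold."""
    return sorted(name for name, load in utilization.items() if load < threshold)

shards = shard_training_data(list(range(10)), 3)
nodes = select_node_servers({"node-a": 0.35, "node-b": 0.92, "node-c": 0.60}, 0.8)
print(shards)  # -> [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(nodes)   # -> ['node-a', 'node-c']
```

In the claimed system the selected node names would be returned to the first server; the patent does not specify the sharding scheme, so the round-robin split above is only one possibility.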
7. A resource management method based on a distributed system, wherein the distributed system comprises a first server for executing a Spark task, a second server for executing a ResourceManager process, and node servers on which worker processes are deployed, the method being executed by a node server and comprising:
receiving the model file of the deep learning neural network model broadcast by the first server;
reading the corresponding training data from the Hadoop distributed file system, and training the model file.
8. The resource management method based on a distributed system according to claim 7, further comprising, after the training of the model file:
exporting the trained deep learning neural network model and its relevant parameters, and storing them in the Hadoop distributed file system.
9. A node server on which a worker process is deployed, comprising:
a receiving module, configured to receive the model file of the deep learning neural network model broadcast by a first server;
a training module, configured to read the corresponding training data from the Hadoop distributed file system and train the model file.
10. The node server according to claim 9, further comprising:
an export module, configured to export the trained deep learning neural network model and its relevant parameters, and store them in the Hadoop distributed file system.
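Claims 7-10 describe the node-server lifecycle: receive the broadcast model file, train it on the local data shard, and export the trained model and parameters back to HDFS. A toy sketch of that lifecycle, where the "model" is a single parameter, the "training" is one averaging step, and a dict stands in for HDFS; every class and method name is hypothetical:

```python
class WorkerNode:
    """Toy stand-in for a node server running a worker process."""

    def __init__(self, shard):
        self.shard = shard   # this node's slice of the training data
        self.weight = None   # model parameter received by broadcast

    def receive_model(self, weight):
        # Claim 7: receive the broadcast model file (here a single parameter).
        self.weight = weight

    def train(self, lr=0.5):
        # Claim 7: one toy update step, moving the weight toward the shard mean.
        target = sum(self.shard) / len(self.shard)
        self.weight += lr * (target - self.weight)

    def export_model(self, store):
        # Claim 8: export the trained parameter into a dict standing in for HDFS.
        store["model"] = self.weight

hdfs = {}
node = WorkerNode(shard=[1.0, 3.0])
node.receive_model(0.0)
node.train()            # shard mean is 2.0, weight moves 0.0 -> 1.0
node.export_model(hdfs)
print(hdfs)             # -> {'model': 1.0}
```

A real node server would of course run a full deep learning training loop over its shard rather than this single scalar update.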
11. A resource management method based on a distributed system, comprising:
when a first server for executing a Spark task starts the Spark task, estimating the required resources according to the preprocessed training data and the computation amount of a preset deep learning neural network model, and applying for the resources to a second server for executing a ResourceManager process;
after the second server receives the resource application from the first server, sharding the training data stored in the Hadoop distributed file system, each shard of the training data corresponding to one worker process; creating the worker processes on the node servers whose resource utilization is below a threshold, and sending information on the node servers whose resource utilization is below the threshold to the first server;
after the first server receives the information on the node servers with sufficient resources returned by the second server, broadcasting the model file of the pre-exported deep learning neural network model to each node server with sufficient resources;
receiving, by each node server, the model file of the deep learning neural network model broadcast by the first server, reading the corresponding training data from the Hadoop distributed file system, and training the model file.
12. A distributed system, comprising: the server according to claim 3 or 4, the server according to claim 6, and the node server according to claim 9 or 10.
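Claim 11 ties the three roles together end to end. One way to picture that flow in plain Python, with trivial stand-ins for Spark, the ResourceManager, and HDFS, and every name (including the model file) hypothetical:

```python
def run_job(records, utilization, threshold, model_file):
    """Hypothetical end-to-end flow of claim 11."""
    # Second server: pick node servers under the utilization threshold.
    nodes = sorted(n for n, load in utilization.items() if load < threshold)
    # Second server: one training-data shard per worker process.
    shards = {node: records[i::len(nodes)] for i, node in enumerate(nodes)}
    # First server: broadcast the pre-exported model file to each chosen node.
    received = {node: model_file for node in nodes}
    # Node servers: each would train the received model on its own shard
    # (the training step itself is elided here).
    return {node: (received[node], shards[node]) for node in nodes}

result = run_job(list(range(6)), {"a": 0.2, "b": 0.95, "c": 0.4}, 0.8, "model.pb")
print(result)  # -> {'a': ('model.pb', [0, 2, 4]), 'c': ('model.pb', [1, 3, 5])}
```

Node "b" is skipped because its utilization (0.95) exceeds the threshold; in the claimed system the first server would only learn of nodes "a" and "c" from the second server before broadcasting.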
CN201810953290.XA 2018-08-21 2018-08-21 A kind of method for managing resource and server based on distributed system Pending CN109062700A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810953290.XA CN109062700A (en) 2018-08-21 2018-08-21 A kind of method for managing resource and server based on distributed system


Publications (1)

Publication Number Publication Date
CN109062700A true CN109062700A (en) 2018-12-21

Family

ID=64686540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810953290.XA Pending CN109062700A (en) 2018-08-21 2018-08-21 A kind of method for managing resource and server based on distributed system

Country Status (1)

Country Link
CN (1) CN109062700A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685160A (en) * 2019-01-18 2019-04-26 创新奇智(合肥)科技有限公司 Online model automatic training and deployment method and system
CN111753997A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
CN112035261A (en) * 2020-09-11 2020-12-04 杭州海康威视数字技术股份有限公司 Data processing method and system
CN112394944A (en) * 2019-08-13 2021-02-23 阿里巴巴集团控股有限公司 Distributed development method, device, storage medium and computer equipment
CN113127163A (en) * 2019-12-31 2021-07-16 杭州海康威视数字技术股份有限公司 Model verification method and device and electronic equipment
CN113157252A (en) * 2021-04-13 2021-07-23 中国航天科工集团八五一一研究所 Electromagnetic signal general distributed intelligent processing and analyzing platform and method
CN113824650A (en) * 2021-08-13 2021-12-21 上海光华智创网络科技有限公司 Parameter transmission scheduling algorithm and system in distributed deep learning system
CN114169427A (en) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 Distributed training method, device and equipment based on end-to-end self-adaptation
CN114675965A (en) * 2022-03-10 2022-06-28 北京百度网讯科技有限公司 Federal learning method, apparatus, device and medium
US11847500B2 (en) 2019-12-11 2023-12-19 Cisco Technology, Inc. Systems and methods for providing management of machine learning components

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080222646A1 (en) * 2007-03-06 2008-09-11 Lev Sigal Preemptive neural network database load balancer
CN103699440A (en) * 2012-09-27 2014-04-02 北京搜狐新媒体信息技术有限公司 Method and device for cloud computing platform system to distribute resources to task
CN107145395A (en) * 2017-07-04 2017-09-08 北京百度网讯科技有限公司 Method and apparatus for handling task
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 Heterogeneous cluster and task processing method and apparatus
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 Large-scale resource scheduling system and method based on deep learning neural network
CN107918560A (en) * 2016-10-14 2018-04-17 郑州云海信息技术有限公司 A kind of server apparatus management method and device
WO2018121282A1 (en) * 2016-12-26 2018-07-05 华为技术有限公司 Data processing method, end device, cloud device, and end-cloud collaboration system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080222646A1 (en) * 2007-03-06 2008-09-11 Lev Sigal Preemptive neural network database load balancer
CN103699440A (en) * 2012-09-27 2014-04-02 北京搜狐新媒体信息技术有限公司 Method and device for cloud computing platform system to distribute resources to task
CN107918560A (en) * 2016-10-14 2018-04-17 郑州云海信息技术有限公司 A kind of server apparatus management method and device
WO2018121282A1 (en) * 2016-12-26 2018-07-05 华为技术有限公司 Data processing method, end device, cloud device, and end-cloud collaboration system
CN107145395A (en) * 2017-07-04 2017-09-08 北京百度网讯科技有限公司 Method and apparatus for handling task
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 Heterogeneous cluster and task processing method and apparatus
CN107888669A (en) * 2017-10-31 2018-04-06 武汉理工大学 Large-scale resource scheduling system and method based on deep learning neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Ronghui, "Big Data Architecture Technology and Case Analysis", Northeast Normal University Press, 31 January 2018 *
Tao Wan, "Cloud Computing and Big Data", Xidian University Press, 31 January 2017 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685160B (en) * 2019-01-18 2020-11-27 创新奇智(合肥)科技有限公司 Online model automatic training and deploying method and system
CN109685160A (en) * 2019-01-18 2019-04-26 创新奇智(合肥)科技有限公司 Online model automatic training and deployment method and system
CN112394944A (en) * 2019-08-13 2021-02-23 阿里巴巴集团控股有限公司 Distributed development method, device, storage medium and computer equipment
US11847500B2 (en) 2019-12-11 2023-12-19 Cisco Technology, Inc. Systems and methods for providing management of machine learning components
CN113127163A (en) * 2019-12-31 2021-07-16 杭州海康威视数字技术股份有限公司 Model verification method and device and electronic equipment
CN111753997A (en) * 2020-06-28 2020-10-09 北京百度网讯科技有限公司 Distributed training method, system, device and storage medium
CN112035261A (en) * 2020-09-11 2020-12-04 杭州海康威视数字技术股份有限公司 Data processing method and system
CN113157252A (en) * 2021-04-13 2021-07-23 中国航天科工集团八五一一研究所 Electromagnetic signal general distributed intelligent processing and analyzing platform and method
CN113157252B (en) * 2021-04-13 2022-11-25 中国航天科工集团八五一一研究所 Electromagnetic signal general distributed intelligent processing and analyzing platform and method
CN113824650B (en) * 2021-08-13 2023-10-20 上海光华智创网络科技有限公司 Parameter transmission scheduling algorithm and system in distributed deep learning system
CN113824650A (en) * 2021-08-13 2021-12-21 上海光华智创网络科技有限公司 Parameter transmission scheduling algorithm and system in distributed deep learning system
CN114169427A (en) * 2021-12-06 2022-03-11 北京百度网讯科技有限公司 Distributed training method, device and equipment based on end-to-end self-adaptation
CN114169427B (en) * 2021-12-06 2022-10-04 北京百度网讯科技有限公司 Distributed training method, device and equipment based on end-to-end self-adaptation
CN114675965A (en) * 2022-03-10 2022-06-28 北京百度网讯科技有限公司 Federal learning method, apparatus, device and medium

Similar Documents

Publication Publication Date Title
CN109062700A (en) A kind of method for managing resource and server based on distributed system
US11429433B2 (en) Process discovery and automatic robotic scripts generation for distributed computing resources
Li et al. A scientific workflow management system architecture and its scheduling based on cloud service platform for manufacturing big data analytics
EP3047376B1 (en) Type-to-type analysis for cloud computing technical components
US11074107B1 (en) Data processing system and method for managing AI solutions development lifecycle
US20200097847A1 (en) Hyperparameter tuning using visual analytics in a data science platform
CN103092683B Heuristics-based scheduling for data analysis
CN106067080B (en) Configurable workflow capabilities are provided
EP3495951A1 (en) Hybrid cloud migration delay risk prediction engine
CN111861020A (en) Model deployment method, device, equipment and storage medium
US10453165B1 (en) Computer vision machine learning model execution service
US20110313966A1 (en) Activity schemes for support of knowledge-intensive tasks
CN112287015B (en) Image generation system, image generation method, electronic device, and storage medium
CN108243012B (en) Charging application processing system, method and device in OCS (online charging System)
CN109918184A (en) Picture processing system, method and relevant apparatus and equipment
WO2022048648A1 (en) Method and apparatus for achieving automatic model construction, electronic device, and storage medium
Bhattacharjee et al. Stratum: A bigdata-as-a-service for lifecycle management of iot analytics applications
CN114254033A (en) Data processing method and system based on BS architecture
CN107168795B (en) Codon deviation factor model method based on CPU-GPU isomery combined type parallel computation frame
WO2020147601A1 (en) Graph learning system
CN113762514A (en) Data processing method, device, equipment and computer readable storage medium
Gu et al. Characterizing job-task dependency in cloud workloads using graph learning
CN110442753A (en) A kind of chart database auto-creating method and device based on OPC UA
US11775264B2 (en) Efficient deployment of machine learning and deep learning model's pipeline for serving service level agreement
CN114862098A (en) Resource allocation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221