CN110032449A - Method and device for optimizing the performance of a GPU server - Google Patents
Method and device for optimizing the performance of a GPU server
- Publication number
- CN110032449A (application number CN201910303999.XA)
- Authority
- CN
- China
- Prior art keywords
- gpu
- server
- gpu server
- deep learning
- response
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
Abstract
The invention discloses a method for optimizing the performance of a GPU server, comprising: building a deep learning framework on the GPU server and training with the deep learning framework to obtain a deep learning model; monitoring performance data of the GPU server during training; judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value; in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads. The invention also discloses a computer device and a readable storage medium for optimizing the performance of a GPU server. The proposed method and device monitor the state of each subsystem of the server, find and eliminate bottlenecks, and improve the computing performance of the GPU.
Description
Technical field
The present invention relates to the field of GPU servers and, more particularly, to a method and device for optimizing the performance of a GPU server.
Background art
In recent years, AI technology has made rapid progress in fields such as image recognition, natural language processing, and recommender systems, opening up broad possibilities for commercial deployment. An AI model must first be trained on a large amount of data to reach sufficient accuracy before it can play a role in actual production. Beyond the algorithms themselves, the most important driver of these breakthroughs has been the rapid growth of computing power, in which GPU accelerator cards play a vital role.
Deep learning frameworks are commonly used to help developers quickly implement the development and training of AI models, and may also be deployed in production environments for inference. Caffe (Convolutional Architecture for Fast Feature Embedding) is one widely used deep learning framework. Caffe can run on CPUs or GPUs; in the model training phase, the GPU is currently the most powerful computing unit. To extract the maximum computing performance from the GPU, the CPU, the memory system, the PCIe system, the cooling system, and the other I/O systems must cooperate to keep the GPU in its optimal working state.
Summary of the invention
In view of this, an object of the embodiments of the present invention is to propose a method and device for optimizing the performance of a GPU server that monitor the state of each subsystem of the server, find and eliminate bottlenecks, and improve the computing performance of the GPU.
Based on the above objective, one aspect of the embodiments of the present invention provides a method for optimizing the performance of a GPU server, comprising the following steps: building a deep learning framework on the GPU server and training with the deep learning framework to obtain a deep learning model; monitoring performance data of the GPU server during training; judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value; in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads.
In some embodiments, monitoring performance data of the GPU server during training includes: monitoring the temperature and utilization of the CPU and GPU, disk input/output activity, and the size of the memory cache.
In some embodiments, judging whether the GPU server is operating abnormally includes: detecting whether the CPU or GPU temperature exceeds a second predetermined value.
In some embodiments, changing the configuration of the GPU server or the deep learning model in response to abnormal operation includes: increasing the fan speed in response to detecting that the CPU or GPU temperature exceeds the second predetermined value.
In some embodiments, judging whether the GPU server is operating abnormally includes: detecting whether the test data of the deep learning framework has been fully cached in memory.
In some embodiments, changing the configuration of the GPU server or the deep learning model in response to abnormal operation includes: extending the training time in response to detecting that the test data of the deep learning framework has not been fully cached in memory.
In some embodiments, judging whether the GPU server is operating abnormally includes: detecting whether CPU utilization exceeds a third predetermined value.
In some embodiments, changing the configuration of the GPU server or the deep learning model in response to abnormal operation includes: replacing the CPU in response to detecting that CPU utilization exceeds the third predetermined value.
Another aspect of the embodiments of the present invention provides a computer device, comprising: at least one processor; and a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the following steps: building a deep learning framework on the GPU server and training with the deep learning framework; monitoring performance data of the GPU server during training; judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value; in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads.
A further aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, performs the method described above.
The present invention has the following beneficial effects: it monitors the state of each subsystem of the server, finds and eliminates bottlenecks, and improves the computing performance of the GPU.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain further embodiments from these drawings without creative effort.
Fig. 1 is a flow diagram of an embodiment of the method for optimizing the performance of a GPU server provided by the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all uses of "first" and "second" serve to distinguish two non-identical entities or parameters that share a name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments; subsequent embodiments will not repeat this note.
Based on the above objective, a first aspect of the embodiments of the present invention proposes an embodiment of a method for optimizing the performance of a GPU server. Fig. 1 shows a flow diagram of this embodiment. As shown in Fig. 1, the embodiment comprises the following steps:
S1. Build a deep learning framework on the GPU server and train with the deep learning framework to obtain a deep learning model;
S2. Monitor performance data of the GPU server during training;
S3. From the monitored performance data, judge whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value;
S4. In response to the GPU server operating abnormally, change the configuration of the GPU server or of the deep learning model; and, in response to GPU utilization being below the first predetermined value, increase the size of the transmitted data blocks and increase the number of data-transfer threads.
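For illustration only (the function name, the metrics dictionary, and the concrete threshold values are assumptions of this sketch, not part of the claims), the decision logic of steps S3 and S4 may be sketched as follows:

```python
# Illustrative constants named after the patent's "predetermined values";
# the concrete numbers are the examples given later in the description.
FIRST_PREDETERMINED_PCT = 95   # GPU utilization floor
SECOND_PREDETERMINED_C = 80    # CPU/GPU temperature ceiling
THIRD_PREDETERMINED_PCT = 90   # per-core CPU utilization ceiling

def plan_actions(metrics: dict) -> list:
    """Map one round of monitored performance data (step S3) to the
    corrective actions of step S4."""
    actions = []
    if max(metrics["cpu_temp_c"], metrics["gpu_temp_c"]) > SECOND_PREDETERMINED_C:
        actions.append("increase fan speed")
    if not metrics["data_fully_cached"]:
        actions.append("extend training time")
    if metrics["cpu_util_pct"] > THIRD_PREDETERMINED_PCT:
        actions.append("replace CPU with higher-spec part")
    if metrics["gpu_util_pct"] < FIRST_PREDETERMINED_PCT:
        actions.append("increase batch size and data-transfer threads")
    return actions
```

Each branch corresponds to one of the embodiments described below.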
A deep learning framework runs on the GPU server. This embodiment uses Caffe as an example; other embodiments may train on the GPU server with other deep learning frameworks.
The performance data of the GPU server monitored during training include: the temperature, utilization, and running frequency of the CPU and GPU; the memory-cache (page cache) status of the training data; disk I/O activity; and real-time memory bandwidth.
In this embodiment, the turbostat tool can be used to monitor the utilization and running frequency of the CPU cores, and an IPMI tool to obtain the CPU temperature; the management tool provided by the GPU vendor monitors GPU utilization and temperature; the free command reports memory usage, including the size of the page cache; the iostat command reports real-time disk I/O, showing whether a process is reading disk data into memory; and the intel-cmt-cat tool observes memory bandwidth in real time.
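As one hedged illustration of the collection step (the parsing function is an assumption of this sketch; in practice the named tools would be invoked periodically and their other fields read the same way), the buff/cache column reported by the Linux `free -b` command can be extracted as follows:

```python
def parse_free_bytes(free_output: str) -> dict:
    """Parse the output of `free -b` and return total, used, and
    buff/cache sizes (in bytes) from the Mem: line."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            fields = line.split()
            # free -b columns: total, used, free, shared, buff/cache, available
            return {
                "total": int(fields[1]),
                "used": int(fields[2]),
                "buff_cache": int(fields[5]),
            }
    raise ValueError("no Mem: line found in free output")
```

Sampling this value over time shows whether the page cache is still growing while the training data is being read from disk.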
In response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model includes: increasing the fan speed in response to detecting that the CPU or GPU temperature exceeds a second predetermined value. The temperature data of the CPU and GPU indicate whether the cooling system meets the cooling requirements. For example, when the CPU or GPU temperature exceeds the second predetermined value, the fan speed can be increased to improve heat dissipation and keep the CPU and GPU operating normally. The second predetermined value can be set manually, for example to 80 degrees Celsius, or to another value according to actual conditions.
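A minimal form of this check may be sketched as follows (the duty-cycle representation and the step size are assumptions of this sketch; a real server would apply the new speed through its fan controller, e.g. via an IPMI command):

```python
SECOND_PREDETERMINED_C = 80  # example threshold from the description

def adjust_fan(cpu_temp_c: float, gpu_temp_c: float,
               current_duty_pct: int, step_pct: int = 10) -> int:
    """Return the new fan duty cycle: raise it when either the CPU or
    the GPU temperature exceeds the threshold, capped at 100%."""
    if max(cpu_temp_c, gpu_temp_c) > SECOND_PREDETERMINED_C:
        return min(100, current_duty_pct + step_pct)
    return current_duty_pct
```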
In response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model includes: extending the training time in response to detecting that the test data of the deep learning framework has not been fully cached in memory. When the deep learning framework starts for the first time, whether the test data has been fully cached in memory is judged from disk input/output (I/O) activity and from the size of the memory cache relative to the size of the training data set. Because the CPU fetches data from memory far faster than from disk, caching the training data greatly improves CPU performance and, in turn, GPU performance. When disk I/O has ceased and the cache has stopped growing at a size larger than the training data, the data has been fully cached in memory. If the data has not been fully cached, the training time is extended.
In response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model includes: replacing the CPU in response to detecting that CPU utilization exceeds a third predetermined value. Whether the CPU specification is adequate is judged from the per-core CPU utilization. If the per-core utilization exceeds the third predetermined value, the CPU is a bottleneck and should be replaced with a higher-specification part. The third predetermined value can also be set manually, for example to 90%, or to another value in other embodiments.
Whether the GPU is a bottleneck is judged from GPU utilization. If GPU utilization is unstable, jumping widely and rarely reaching saturation, that is, GPU utilization is below the first predetermined value, then the GPU is a bottleneck: training data is not being delivered to it in time. GPU performance can then be optimized by increasing the data transmission speed, for example by suitably increasing the size of the transmitted data blocks (the batch size) or by increasing the number of data-transfer threads, which reduces the probability of the GPU idling and improves overall performance. The first predetermined value can also be set manually, for example to 95%.
It is important to note that the steps in the above embodiments of the method for optimizing the performance of a GPU server can be interleaved, replaced, added, or deleted; such reasonable permutations and combinations also belong to the scope of protection of the present invention, which should not be confined to the embodiments described.
Based on the above objective, a second aspect of the embodiments of the present invention proposes a computer device comprising at least one processor and a memory. The memory stores computer instructions executable on the processor; when executed by the processor, the instructions implement the following steps: building a deep learning framework on the GPU server and training with the deep learning framework; monitoring performance data of the GPU server during training; judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value; in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads.
The present invention also provides a computer-readable storage medium for optimizing the performance of a GPU server, storing a computer program which, when executed by a processor, performs the method described above.
Finally, it should be noted that those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program of the method for optimizing the performance of a GPU server can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium of the program may be a magnetic disk, an optical disc, a read-only memory (ROM), a random-access memory (RAM), or the like. The embodiments of the computer program can achieve the same or similar effects as any of the foregoing method embodiments.
In addition, the methods disclosed according to the embodiments of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. When executed by the processor, the computer program performs the above functions defined in the methods disclosed by the embodiments of the present invention.
Furthermore, the above method steps and system units may also be realized with a controller and a computer-readable storage medium storing a computer program that causes the controller to realize the above steps or unit functions.
In addition, it should be appreciated that the computer-readable storage medium (for example, a memory) herein may be volatile memory or non-volatile memory, or may include both. By way of example and not limitation, non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random-access memory (RAM), which can serve as external cache memory. By way of example and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions herein: a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber-optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber-optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
Described above are exemplary embodiments of the present disclosure. It should be noted that many modifications and variations may be made without departing from the scope of the embodiments as defined by the claims. The functions, steps, and/or actions of the method claims in accordance with the disclosed embodiments need not be performed in any particular order. Furthermore, although elements disclosed in the embodiments may be described or claimed in the singular, the plural is also contemplated unless limitation to the singular is explicitly stated. As used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports otherwise. It should also be understood that "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
The serial numbers of the disclosed embodiments are for description only and do not represent the relative merits of the embodiments. Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments may be completed by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the disclosure (including the claims) is limited to these examples. Under the idea of the embodiments of the present invention, the technical features of the above embodiments, or of different embodiments, may also be combined, and many other variations of different aspects of the embodiments exist as described above; they are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the embodiments of the present invention should be included within their scope of protection.
Claims (10)
1. A method for optimizing the performance of a GPU server, comprising:
building a deep learning framework on the GPU server and training with the deep learning framework to obtain a deep learning model;
monitoring performance data of the GPU server during training;
judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value;
in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and
in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads.
2. The method according to claim 1, wherein monitoring performance data of the GPU server during training comprises: monitoring the temperature and utilization of the CPU and GPU, disk input/output activity, and the size of the memory cache.
3. The method according to claim 2, wherein judging whether the GPU server is operating abnormally comprises: detecting whether the CPU or GPU temperature exceeds a second predetermined value.
4. The method according to claim 3, wherein changing the configuration of the GPU server or the deep learning model in response to the GPU server operating abnormally comprises: increasing the fan speed in response to detecting that the CPU or GPU temperature exceeds the second predetermined value.
5. The method according to claim 2, wherein judging whether the GPU server is operating abnormally comprises: detecting whether the test data of the deep learning framework has been fully cached in memory.
6. The method according to claim 5, wherein changing the configuration of the GPU server or the deep learning model in response to the GPU server operating abnormally comprises: extending the training time in response to detecting that the test data of the deep learning framework has not been fully cached in memory.
7. The method according to claim 2, wherein judging whether the GPU server is operating abnormally comprises: detecting whether CPU utilization exceeds a third predetermined value.
8. The method according to claim 7, wherein changing the configuration of the GPU server or the deep learning model in response to the GPU server operating abnormally comprises: replacing the CPU in response to detecting that CPU utilization exceeds the third predetermined value.
9. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, the instructions, when executed by the processor, implementing the following steps:
building a deep learning framework on the GPU server and training with the deep learning framework;
monitoring performance data of the GPU server during training;
judging, from the monitored performance data, whether the GPU server is operating abnormally and whether GPU utilization is below a first predetermined value;
in response to the GPU server operating abnormally, changing the configuration of the GPU server or of the deep learning model; and
in response to GPU utilization being below the first predetermined value, increasing the size of the transmitted data blocks and increasing the number of data-transfer threads.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, performs the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910303999.XA CN110032449A (en) | 2019-04-16 | 2019-04-16 | Method and device for optimizing the performance of a GPU server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110032449A true CN110032449A (en) | 2019-07-19 |
Family
ID=67238562
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910303999.XA Pending CN110032449A (en) | 2019-04-16 | 2019-04-16 | Method and device for optimizing the performance of a GPU server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110032449A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107659609A (en) * | 2017-07-26 | 2018-02-02 | 北京天云融创软件技术有限公司 | A kind of deep learning support platform and deep learning training method based on cloud computing |
CN108415776A (en) * | 2018-03-06 | 2018-08-17 | 华中科技大学 | A kind of memory in distributed data processing system estimates the method with configuration optimization |
CN108446200A (en) * | 2018-02-07 | 2018-08-24 | 福建星瑞格软件有限公司 | Server intelligence O&M method based on big data machine learning and computer equipment |
CN108881446A (en) * | 2018-06-22 | 2018-11-23 | 深源恒际科技有限公司 | A kind of artificial intelligence plateform system based on deep learning |
US20180341856A1 (en) * | 2017-05-24 | 2018-11-29 | International Business Machines Corporation | Balancing memory consumption of multiple graphics processing units in deep learning |
CN109062692A (en) * | 2018-07-24 | 2018-12-21 | 郑州云海信息技术有限公司 | A kind of optimization method and system of recognition of face deep learning training platform |
Non-Patent Citations (5)
Title |
---|
RAMA VELPURI, ANAND ADKOLI: "Oracle8i Backup and Recovery Handbook", 30 September 2001 *
ZHANG YU: "Computer Organization and Architecture", 31 August 2011, China Railway Publishing House *
PAN YI, HE KEKE, YE HUI, LIU HUAFU: "Knowledge Discovery in Data Streams", 31 December 2016, Huazhong University of Science and Technology Press *
DONG SHI: "Research on Key Technologies of Network Traffic Identification Based on Flow Records", 30 October 2017, Scientific and Technical Documentation Press *
CHEN MIN: "Introduction to Cognitive Computing", 30 May 2017, Huazhong University of Science and Technology Press *
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112306623A (en) * | 2019-07-31 | 2021-02-02 | 株式会社理光 | Processing method and device for deep learning task and computer readable storage medium |
CN110647423A (en) * | 2019-08-15 | 2020-01-03 | 苏州浪潮智能科技有限公司 | Method, device and readable medium for creating storage volume mirror image based on application |
CN110597465B (en) * | 2019-08-30 | 2023-01-06 | 苏州浪潮智能科技有限公司 | Method and device for improving performance of GPU server and readable medium |
CN110597465A (en) * | 2019-08-30 | 2019-12-20 | 苏州浪潮智能科技有限公司 | Method and device for improving performance of GPU server and readable medium |
CN111090504A (en) * | 2019-10-12 | 2020-05-01 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for realizing timing task based on placemaker |
CN111090504B (en) * | 2019-10-12 | 2022-07-19 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for realizing timing task based on placemaker |
CN111104238A (en) * | 2019-10-30 | 2020-05-05 | 苏州浪潮智能科技有限公司 | CE-based memory diagnosis method, device and medium |
CN111124722A (en) * | 2019-10-30 | 2020-05-08 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for isolating fault memory |
CN111124722B (en) * | 2019-10-30 | 2022-11-29 | 苏州浪潮智能科技有限公司 | Method, equipment and medium for isolating fault memory |
CN111104238B (en) * | 2019-10-30 | 2022-06-03 | 苏州浪潮智能科技有限公司 | CE-based memory diagnosis method, device and medium |
CN112988527A (en) * | 2019-12-13 | 2021-06-18 | 中国电信股份有限公司 | GPU management platform anomaly detection method and device and storage medium |
CN111143148B (en) * | 2019-12-30 | 2023-09-12 | 北京奇艺世纪科技有限公司 | Model parameter determining method, device and storage medium |
CN111143148A (en) * | 2019-12-30 | 2020-05-12 | 北京奇艺世纪科技有限公司 | Model parameter determination method, device and storage medium |
CN111222636A (en) * | 2020-01-07 | 2020-06-02 | 深圳鲲云信息科技有限公司 | Deep learning model conversion method and device, server and storage medium |
CN111222636B (en) * | 2020-01-07 | 2023-06-06 | 深圳鲲云信息科技有限公司 | Deep learning model conversion method, device, server and storage medium |
CN111259939A (en) * | 2020-01-10 | 2020-06-09 | 苏州浪潮智能科技有限公司 | Tuning management method, device, equipment and medium for deep learning model |
CN111259939B (en) * | 2020-01-10 | 2022-06-07 | 苏州浪潮智能科技有限公司 | Tuning management method, device, equipment and medium for deep learning model |
CN111367878A (en) * | 2020-03-16 | 2020-07-03 | 中国银行股份有限公司 | IPFS node monitoring method and device |
CN111367878B (en) * | 2020-03-16 | 2023-08-18 | 中国银行股份有限公司 | IPFS node monitoring method and device |
WO2021208558A1 (en) * | 2020-04-16 | 2021-10-21 | 苏州浪潮智能科技有限公司 | Large deep learning model training method and system, device, and medium |
CN111488987B (en) * | 2020-04-16 | 2022-12-06 | 苏州浪潮智能科技有限公司 | Method, system, equipment and medium for deep learning large model training |
CN111488987A (en) * | 2020-04-16 | 2020-08-04 | 苏州浪潮智能科技有限公司 | Deep learning large model training method, system, equipment and medium |
CN111736463B (en) * | 2020-05-09 | 2023-03-03 | 刘炜 | Adaptive deep learning control method based on operation platform |
CN111736463A (en) * | 2020-05-09 | 2020-10-02 | 刘炜 | Adaptive deep learning control method based on operation platform |
CN112257856A (en) * | 2020-12-18 | 2021-01-22 | 鹏城实验室 | Deep learning framework determination method and device and readable storage medium |
CN112732591A (en) * | 2021-01-15 | 2021-04-30 | 杭州中科先进技术研究院有限公司 | Edge computing framework for cache deep learning |
CN114138449A (en) * | 2021-12-14 | 2022-03-04 | 河南省儿童医院郑州儿童医院 | Rehabilitation training system based on virtual reality |
CN115113714B (en) * | 2022-06-30 | 2023-07-21 | 苏州浪潮智能科技有限公司 | Method, device, equipment and storage medium for controlling dynamic current of high-power supply |
CN115113714A (en) * | 2022-06-30 | 2022-09-27 | 苏州浪潮智能科技有限公司 | High-power supply dynamic current control method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110032449A (en) | A kind of method and device for the performance optimizing GPU server | |
US11310313B2 (en) | Multi-threaded processing of search responses returned by search peers | |
US11184467B2 (en) | Multi-thread processing of messages | |
Chen et al. | Fedgraph: Federated graph learning with intelligent sampling | |
US11882198B2 (en) | Methods and systems for communicating relevant content | |
TWI738721B (en) | Task scheduling method and device | |
US11093496B1 (en) | Performance-based query plan caching | |
US20110138270A1 (en) | System of Enabling Efficient XML Compression with Streaming Support | |
CN105138679A (en) | Data processing system and method based on distributed caching | |
CN104461929B (en) | Distributed data cache method based on blocker | |
US11237813B1 (en) | Model driven state machine transitions to configure an installation of a software program | |
US10540360B2 (en) | Identifying relationship instances between entities | |
US11297147B2 (en) | Managed data export to a remote network from edge devices | |
CN111104198A (en) | Method, equipment and medium for improving operation efficiency of scanning system plug-in | |
CN111221715A (en) | Method, system, device and medium for dynamically optimizing Caffe performance | |
CN110990148A (en) | Method, device and medium for optimizing storage performance | |
CN116232971A (en) | Communication method and network system based on structured P2P relay network | |
US11836095B2 (en) | Forwarding incoming IO to SCM namespaces | |
US10268375B2 (en) | Methods for proactive prediction of disk failure in the disk maintenance pipeline and devices thereof | |
CN113934767A (en) | Data processing method and device, computer equipment and storage medium | |
WO2022070278A1 (en) | Anomaly determination system, anomaly determination method, and program | |
WO2012124295A1 (en) | Computer system, control system, control method and control program | |
US11146663B2 (en) | Facilitating improved overall performance of remote data facility replication systems | |
WO2021208238A1 (en) | K-truss graph-based storage system cache prefetching method, system, and medium | |
CN116909853A (en) | Monitoring method and system of storage system, electronic equipment and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190719 ||