CN109062692A - Optimization method and system for a face recognition deep learning training platform - Google Patents

Optimization method and system for a face recognition deep learning training platform

Info

Publication number
CN109062692A
CN109062692A
Authority
CN
China
Prior art keywords
gpu
platform
test
utilization rate
disk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810819416.4A
Other languages
Chinese (zh)
Inventor
Wang Pengfei (王鹏飞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201810819416.4A priority Critical patent/CN109062692A/en
Publication of CN109062692A publication Critical patent/CN109062692A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification

Abstract

The invention discloses an optimization method for a face recognition deep learning training platform, comprising the following steps: running a predetermined face recognition model; testing CPU usage; testing disk data read/write and IOPS; testing GPU memory bandwidth utilization and GPU core utilization; testing GPU usage captured according to the number of GPU cards in operation; and optimizing the configuration, component parameters, and component interconnections of a predetermined platform according to the results of all tests. The invention also discloses an optimization system for a deep learning training platform. The invention solves prior-art problems such as the poor cost-effectiveness and wasted cost of simply stacking hardware or simply changing the configuration of a general-purpose server, and the resulting limited performance gains, ultimately saving cost and effectively improving training speed, training stability, and production efficiency.

Description

Optimization method and system for a face recognition deep learning training platform
Technical field
The present invention relates to the field of deep learning, and more particularly to an optimization method and system for a face recognition deep learning training platform.
Background technique
In the 1950s, "artificial intelligence (AI)" first entered the public view; in the decades since, the field has experienced multiple peaks and troughs. One key factor influencing the development of artificial intelligence is the performance of computing platforms: as the volume of data to be processed and the complexity of network algorithms keep growing, platform performance determines whether a technology can be put into practical operation. At the present stage, with the continuous performance improvement of computing devices such as CPU+GPU, CPU+FPGA, and TPU, artificial intelligence technology and its related industries have seen explosive growth.
At present, artificial intelligence has penetrated all walks of life, and a large number of deep-learning-based AI applications have appeared in industries such as finance and security, among which face recognition deep learning applications are the most widely studied and used. Face recognition technology operates on facial features in an input image or video stream: it first determines whether a face is present; if so, it further provides the position and size of each face and the location of each major facial feature. Based on this information, it then extracts the identity features contained in each face and compares them against known faces to identify each individual. The development of deep learning algorithms has sparked an industry-wide research boom in face recognition, and numerous vendors have launched related products one after another, such as structured face big-data surveillance systems for the public security industry, face recognition gates, and security face recognition. This wave of AI face recognition technology has developed rapidly thanks to the combined impetus of data, computing power, and algorithms. However, as data scale grows and algorithm complexity rises, the demands on platform computing capability become ever more stringent; for example, the currently popular 3D face recognition and face liveness detection require huge data volumes and more complex algorithms. How to build a deep learning training platform that improves computing capability and efficiency has therefore become a key factor in guaranteeing technological development and progress.
Current face recognition deep learning training platforms generally pair a general-purpose server with computing devices. In the early stage, when models were simple and computation light, this could satisfy the needs of model training; but with the emergence of deep learning models and rising model complexity, general-purpose servers can no longer meet training needs. Many solutions simply stack hardware such as central processing units (CPUs), memory, and graphics processing units (GPUs) to improve performance, but the result is often more cost spent for a very limited gain.
Improving server performance is a systematic process involving the coupling of all components; especially for an application as specific as face recognition, performance optimization of the platform is only possible through in-depth, targeted analysis of the application's characteristics. At present, the most common way to improve an offline face recognition training platform is to spend a great deal of money stacking hardware, yet simply changing the configuration of a general-purpose server does not yield good cost-effectiveness. Optimizing a platform in this way not only wastes a large amount of cost but may even cause performance to fall rather than rise.
The prior art has not yet disclosed a hardware configuration scheme that improves the performance of offline face recognition training applications, nor a method that derives platform optimizations by analyzing the characteristics of the face recognition application so as to ultimately form a basic hardware configuration scheme.
Summary of the invention
In view of this, an object of the present invention is to propose an optimization method and system for a face recognition deep learning training platform, which optimizes the configuration, component parameters, and component interconnections of a predetermined platform, thereby saving cost and effectively improving training speed and production efficiency.
Based on the above object, one aspect of an embodiment of the present invention provides an optimization method for a face recognition deep learning training platform, comprising the following steps:
running a predetermined face recognition model;
testing CPU usage;
testing disk data read/write and IOPS;
testing GPU memory bandwidth utilization and GPU core utilization;
testing GPU usage captured according to the number of GPU cards in operation;
optimizing the configuration, component parameters, and component interconnections of a predetermined platform according to the results of all tests.
In some embodiments, testing CPU usage comprises: testing and analyzing the CPU's core utilization, system load, idle percentage, and I/O-wait percentage over time, and determining whether core utilization and system load are very low, the idle percentage is very high, and the I/O-wait percentage is very small.
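A minimal sketch of the CPU-side judgement described above, assuming utilization samples (user, sys, idle, iowait percentages) have already been collected by a monitor such as Teye or mpstat; the threshold values are illustrative assumptions, not taken from the patent:

```python
def cpu_is_bottleneck(samples, busy_thresh=70.0, iowait_thresh=10.0):
    """Decide whether the CPU looks like a bottleneck.

    samples: list of dicts with 'user', 'sys', 'idle', 'iowait' percentages.
    Returns True when sustained user+sys load or I/O wait is high.
    """
    if not samples:
        raise ValueError("no samples")
    avg = lambda k: sum(s[k] for s in samples) / len(samples)
    busy = avg("user") + avg("sys")
    return busy >= busy_thresh or avg("iowait") >= iowait_thresh

# The pattern observed in the patent's Fig. 2: low user/sys, high idle,
# tiny iowait -> the CPU is not the bottleneck.
fig2_like = [
    {"user": 8.0, "sys": 3.0, "idle": 88.0, "iowait": 1.0},
    {"user": 10.0, "sys": 4.0, "idle": 85.0, "iowait": 1.0},
]
print(cpu_is_bottleneck(fig2_like))  # False
```

The same predicate applied to a heavily loaded trace would return True, which is the signal to investigate CPU configuration rather than disk or GPU.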
In some embodiments, testing disk data read/write and IOPS comprises: analyzing the read/write speed and IOPS of the disk, and determining whether the storage and reading of data on disk affect the efficiency of offline training on the platform.
In some embodiments, testing GPU memory bandwidth utilization and GPU core utilization comprises: monitoring and analyzing GPU metrics to obtain the factors affecting GPU memory bandwidth and GPU core usage, wherein the factors include at least one of the following: system memory bandwidth, topology, and GPU core clock.
In some embodiments, testing GPU usage captured according to the number of GPU cards in operation comprises: verifying memory bandwidth, memory capacity, and GPU usage through fixed-card tests.
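A fixed-card test is commonly driven by restricting which devices a training process may see. A sketch under the assumption that the platform follows the CUDA convention of the `CUDA_VISIBLE_DEVICES` environment variable; the three-group split mirrors the embodiment's test plan:

```python
import os

def bind_gpus(card_ids):
    """Return an environment dict limiting a process to the given GPU cards."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in card_ids)
    return env

# Three groups as in the embodiment: first 4 cards, last 4 cards, all 8 cards.
groups = [list(range(0, 4)), list(range(4, 8)), list(range(0, 8))]
envs = [bind_gpus(g) for g in groups]
print([e["CUDA_VISIBLE_DEVICES"] for e in envs])
# ['0,1,2,3', '4,5,6,7', '0,1,2,3,4,5,6,7']
```

Each environment would be passed to a separate training run on identical data, so that memory bandwidth and GPU usage can be compared across group sizes.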
In some embodiments, optimizing the configuration, component parameters, and component interconnections of the predetermined platform comprises optimizing the I/O performance of the disks, which further comprises:
adjusting the data layout so that I/O is distributed evenly across all physical disks;
building a disk array (RAID) and choosing a suitable RAID scheme so that application I/O sizes are, as far as possible, equal to the stripe size or a multiple of the stripe size;
increasing the queue depth of the disk drives;
applying caching techniques to reduce the number of disk accesses by the application, which may be applied at the file-system level or at the application level;
wherein the metrics for disk I/O performance monitoring include at least one of the following: I/O operations per second (IOPS or tps), throughput, average I/O size, percentage of time the disk is active, service time, I/O queue length, and wait time.
In some embodiments, optimizing the configuration, component parameters, and component interconnections of the predetermined platform comprises optimizing memory bandwidth, which further comprises:
reducing the number of memory DIMMs on the new platform to 12 DIMMs for the 12 channels, using DDR4-2666 memory with a capacity of 32 GB per DIMM.
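The choice of one DIMM per channel at the full 2666 MT/s rate can be sanity-checked with the standard DDR4 peak-bandwidth formula (channels × transfer rate × 8 bytes per 64-bit transfer); this is a back-of-envelope illustration, not a figure from the patent:

```python
def ddr4_peak_bandwidth_gb_s(channels, mt_per_s):
    """Theoretical peak bandwidth of a DDR4 configuration.

    Each DDR4 channel is 64 bits (8 bytes) wide, so peak bytes/s is
    channels * transfers_per_second * 8.
    """
    return channels * mt_per_s * 1e6 * 8 / 1e9

# 12 channels, one DDR4-2666 DIMM each (32 GB per DIMM -> 384 GB total capacity).
peak = ddr4_peak_bandwidth_gb_s(channels=12, mt_per_s=2666)
print(round(peak, 1))  # 255.9 GB/s theoretical peak
```

Populating more than one DIMM per channel can force the memory controller to down-clock, which is exactly the down-clocking the embodiment avoids by using 12 DIMMs.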
In some embodiments, optimizing the configuration, component parameters, and component interconnections of the predetermined platform comprises optimizing the topology and hyper-threading, which further comprises:
modifying the topology so that two PCIe switches are each attached under CPU1, with 4 GPUs attached under each switch, to guarantee full peer-to-peer (P2P) communication among the GPUs and reduce the latency of data transfers between GPU cards;
enabling hyper-threading to increase the multi-process data reading capability.
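The topology change can be modelled as a mapping from GPU to CPU root complex: full P2P within the group requires that no pair of GPUs must cross the inter-CPU link. A simplified model under that assumption (device names are illustrative):

```python
def full_p2p(gpu_to_cpu):
    """True when every GPU hangs under the same CPU root complex, so
    peer-to-peer transfers avoid the inter-CPU (e.g. QPI/UPI) link."""
    return len(set(gpu_to_cpu.values())) == 1

# Original layout: 4 GPUs under CPU0 and 4 under CPU1 -> cross-CPU pairs exist.
original = {g: ("CPU0" if g < 4 else "CPU1") for g in range(8)}
# Modified layout in the embodiment: two PCIe switches, both under CPU1.
modified = {g: "CPU1" for g in range(8)}

print(full_p2p(original), full_p2p(modified))  # False True
```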
In some embodiments, optimizing the configuration, component parameters, and component interconnections of the predetermined platform comprises optimizing GPU parameters, which further comprises:
enabling GPU Boost and setting the running frequency to its maximum, and disabling error checking and correction (ECC), to guarantee training speed and training stability;
changing the computing device to the V100 and increasing the batch size to raise the training speed of the face recognition application.
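On NVIDIA GPUs these settings are typically applied with the `nvidia-smi` utility (`-e 0` to disable ECC, `-ac` to pin application clocks). The sketch below only composes the command lines rather than executing them; the clock pair is an illustrative V100 value, not a figure from the patent:

```python
def gpu_tuning_commands(gpu_id, mem_clock_mhz, graphics_clock_mhz):
    """Compose nvidia-smi commands that disable ECC and pin the
    application clocks at a chosen (e.g. maximum) frequency."""
    return [
        f"nvidia-smi -i {gpu_id} -e 0",  # disable ECC (needs a reboot to apply)
        f"nvidia-smi -i {gpu_id} -ac {mem_clock_mhz},{graphics_clock_mhz}",
    ]

cmds = gpu_tuning_commands(0, 877, 1380)  # illustrative memory/graphics clocks
print(cmds)
# ['nvidia-smi -i 0 -e 0', 'nvidia-smi -i 0 -ac 877,1380']
```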
In some preferred embodiments, optimizing the configuration, component parameters, and component interconnections of the predetermined platform comprises at least one of the following:
converting the data storage medium from mechanical hard disks to solid-state drives, to accelerate data I/O;
reducing the number of memory DIMMs from 16 to 12 and changing the frequency from 2200 MHz to 2666 MHz, avoiding down-clocking and improving memory bandwidth;
replacing the computing devices, converting from P100 to V100, to increase core processing capability;
setting the boost frequency of the GPUs to the maximum;
enabling hyper-threading of the CPUs, to enhance data reading capability; and
modifying the topology so that the GPUs can communicate point-to-point, to reduce latency.
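The preferred-embodiment checklist amounts to a before/after configuration diff, which can be expressed as data. The values come from the surrounding text; the field names are invented for illustration:

```python
before = {"storage": "HDD", "dimms": 16, "mem_mhz": 2200,
          "gpu": "P100", "boost": "default", "hyperthreading": False,
          "topology": "GPUs split across CPU0/CPU1"}
after = {"storage": "SSD", "dimms": 12, "mem_mhz": 2666,
         "gpu": "V100", "boost": "max", "hyperthreading": True,
         "topology": "both switches under one CPU (full P2P)"}

# Compute which fields the optimization touches.
changed = {k: (before[k], after[k]) for k in before if before[k] != after[k]}
print(sorted(changed))  # every field in the checklist changes
```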
Another aspect of an embodiment of the present invention further provides an optimization system for a face recognition deep learning training platform, comprising:
a processor; and
a memory storing instructions executable by the processor, the processor implementing the above method when executing the instructions.
The present invention has the following beneficial technical effects: the optimization method and system for a face recognition deep learning training platform provided by embodiments of the present invention optimize the configuration, component parameters, and component interconnections of a predetermined platform and adjust the server hardware configuration scheme, solving prior-art problems such as the poor cost-effectiveness and wasted cost of simply stacking hardware or simply changing the configuration of a general-purpose server and the resulting limited performance gains, ultimately saving cost and effectively improving training speed, training stability, and production efficiency.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an embodiment of the optimization method for a face recognition deep learning training platform provided by the present invention;
Fig. 2 is a schematic diagram of an embodiment of the CPU usage in the eight-card yc-Res152 test provided by the present invention;
Fig. 3A is a schematic diagram of an embodiment of the disk read/write volume over time provided by the present invention;
Fig. 3B is a schematic diagram of an embodiment of the IOPS (number of read/write (I/O) operations per second) provided by the present invention;
Fig. 4 is a schematic diagram of an embodiment of the GPU memory bandwidth utilization and GPU core utilization provided by the present invention;
Fig. 5A is a schematic diagram of an embodiment comparing the memory capacity and memory bandwidth of the first 4 GPU cards, the last 4 GPU cards, and all 8 GPU cards provided by the present invention;
Fig. 5B is a schematic diagram of an embodiment of the GPU1 usage among the first 4 GPU cards provided by the present invention;
Fig. 5C is a schematic diagram of an embodiment of the GPU5 usage among the last 4 GPU cards provided by the present invention;
Fig. 6 is a histogram of an embodiment of the optimization of the yc-Res152 face recognition model on the AGX-2 platform provided by the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are further described below with reference to specific embodiments and the accompanying drawings.
It should be noted that in the embodiments of the present invention, all uses of "first" and "second" serve to distinguish two entities with the same name or two unequal parameters. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments will not explain this point again one by one.
Based on the above object, a first aspect of an embodiment of the present invention proposes an embodiment of the optimization method for a face recognition deep learning training platform. Fig. 1 shows a schematic flowchart of an embodiment of the optimization method for a face recognition deep learning training platform provided by the present invention.
An optimization method for a face recognition deep learning training platform optionally comprises the following steps:
step S100: running a predetermined face recognition model;
step S101: testing CPU usage;
step S102: testing the storage and reading of disk data;
step S103: testing GPU memory bandwidth utilization and GPU core utilization;
step S104: testing GPU usage captured according to the number of GPU cards in operation;
step S105: optimizing the configuration, component parameters, and component interconnections of a predetermined platform according to the results of all tests.
Here, the predetermined face recognition model is the yc-Res152 face recognition model, and the predetermined platform is the AGX-2 platform. In order to form a set of deep learning training platform optimization schemes for face recognition, performance tests are carried out on a face recognition model based on ResNet-152 (ResNet: Residual Network, a deep residual network). By obtaining performance data from the platform while the yc-ResNet152 model computes, the platform is optimized in a targeted manner, forming a set of optimization schemes for face recognition deep learning training platforms. The yc-ResNet152 model is refined from the ResNet-152 model and belongs to the family of deep learning convolutional neural networks.
In order to improve the deep learning training performance of the face recognition application while improving the computing capability of the platform, the yc-Res152 face recognition model is used as the workload, and the application's performance characteristics on the Inspur AGX-2 platform (NF5288M5, an Inspur 2U 8-GPU high-density server) are used as the reference for optimizing the platform, starting from the AGX-2 platform's basic, unadjusted configuration.
Configuration and tuning of the deep learning platform and its basic hardware parameters are carried out according to the characteristics of offline face recognition deep learning training tasks.
The topology between the CPUs and GPUs of the AGX-2 platform is as follows: 4 GPUs are attached under CPU0 through a PCIe switch, and another 4 GPUs are attached under CPU1 through a switch. The training speed of the yc-Res152 model was measured on the AGX-2 platform using the deep learning framework Caffe (Convolutional Architecture for Fast Feature Embedding).
Basic benchmark performance was tested on the AGX-2 platform: the CPU floating-point peak is 479 GFLOPS (10^9 floating-point operations per second), the memory bandwidth is 130 GB/s, the GPU bandwidth is 500 GB/s, and the CPU+GPU floating-point peak is 23.7 TFLOPS (10^12 floating-point operations per second). We ultimately improve both this basic benchmark performance and the yc-Res152 face recognition model training speed by adjusting the hardware configuration scheme.
The CPU+GPU deep learning platform has become the standard configuration in the current AI field. Using a CPU+GPU AI server to improve offline training efficiency for face recognition applications and to speed up model release can accelerate the development of the AI business; through optimization of the AI training platform, input cost can be reduced while higher training efficiency and better training stability are obtained.
Fig. 2 shows a schematic diagram of an embodiment of the CPU usage in the eight-card yc-Res152 test provided by the present invention. In a preferred embodiment, testing the CPU usage of the platform while the face recognition application is running comprises: testing and analyzing the CPU's core utilization, system load, idle percentage, and I/O-wait percentage over time, and determining whether core utilization and system load are very low, the idle percentage is very high, and the I/O-wait percentage is very small.
As shown in Fig. 2, the CPU usage of the above AGX-2 platform was tested while the face recognition application was running. CPU performance may to some extent affect the speed of model processing; in offline deep learning training applications, the main functions of the CPU are controlling program logic, processing model parameters returned by the GPUs, and controlling data transfers in memory. The Teye analysis tool was used to capture the machine's CPU usage while the yc-Res152 application was tested. It is evident from the usage chart that core utilization (cpu_user) 201 and system load (cpu_sys) 202 are very low, the idle percentage (cpu_idle) 203 is very high, and the I/O-wait (cpu_iowait) percentage 204 is very small. From the above, CPU performance is not the bottleneck for improving machine performance. Teye is an application-characteristics monitoring and analysis tool independently developed by Inspur's high-performance computing team; it can monitor the machine's performance metrics in real time while a deep learning application runs. By monitoring and analyzing these performance metrics, it can ensure stable system operation and full resource utilization, analyze bandwidth and computing resource utilization, and find application bottlenecks so the application and its algorithms can be optimized. Equipping a deep learning cluster with the Teye analysis tool can save cost and continuously improve cluster and algorithm efficiency.
Combining the performance of the yc-Res152 model on the deep learning training platform with the analysis of the Inspur Teye tool, the following conclusion can be drawn: from the CPU usage analysis, the platform's CPU core utilization and system load are very low, so CPU performance is not the bottleneck for improving machine performance.
Fig. 3A is a schematic diagram of an embodiment of the disk read/write volume over time provided by the present invention. Fig. 3B is a schematic diagram of an embodiment of the IOPS (number of read/write (I/O) operations per second) provided by the present invention. In a preferred embodiment, testing the read/write behavior and IOPS of the platform's disk data while the face recognition application is running comprises: analyzing the read/write speed and IOPS of the disk, and determining whether the storage and reading of data on disk affect the efficiency of offline training on the platform.
As shown in Figs. 3A and 3B, the read/write behavior of disk data on the above AGX-2 platform was tested while the face recognition application was running, where disk_read_MB 301 is the volume of data read from disk, disk_write_MB 302 is the volume of data written to disk, and disk_IOPS is the number of read/write (I/O) operations per second. The yc-Res152 model reaches about 17 MB/s on the disk_read_MB metric, but fluctuations occurred at around 120 seconds and 1542 seconds, when the disk read/write speed reached 110 MB/s and 70 MB/s respectively (as shown in Fig. 3A). Consulting the training log shows that the number of images processed per second fluctuated during these periods. The IOPS of the hard disk was also monitored, and the jumps in disk1_IOPS (as shown in Fig. 3B) at around 120 s and 1543 s likewise explain the fluctuation in image processing speed. Disk read/write and IOPS are disk I/O problems; beyond optimizing the application, one can use multiple disks in a disk array (RAID) to aggregate I/O, or switch to SSDs. The above analysis shows that the storage and reading of data affect the efficiency of offline yc-Res152 training, and improving the data read/write capability of the training platform will help improve offline training efficiency.
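The fluctuations at roughly 120 s and 1542 s can be flagged automatically by marking samples that deviate far from the steady-state rate. A sketch with synthetic data shaped like Fig. 3A (a ~17 MB/s baseline with two spikes); the detection factor is an illustrative assumption:

```python
def find_spikes(series, baseline, factor=3.0):
    """Return the indices (here: seconds) where throughput exceeds
    factor * baseline — crude, but enough to flag Fig. 3A-style events."""
    return [t for t, v in enumerate(series) if v > factor * baseline]

# Synthetic disk_read_MB trace: 17 MB/s steady, spikes at t=120 and t=1542.
trace = [17.0] * 1600
trace[120], trace[1542] = 110.0, 70.0
print(find_spikes(trace, baseline=17.0))  # [120, 1542]
```

In practice the flagged timestamps would then be cross-referenced against the training log, as the embodiment does with the images-per-second record.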
Combining the performance of the yc-Res152 model on the deep learning training platform with the analysis of the Inspur Teye tool, the following conclusion can be drawn: from the analysis of disk IOPS and read/write speed, the storage and reading of data affect the efficiency of offline training on the platform, and improving the data read/write capability of the training platform will help improve offline training efficiency.
Fig. 4 shows a schematic diagram of an embodiment of the GPU memory bandwidth utilization and GPU core utilization provided by the present invention. In a preferred embodiment, testing the GPU memory bandwidth utilization and GPU core utilization of the platform while the face recognition application is running comprises: monitoring and analyzing GPU metrics to obtain the factors affecting GPU memory bandwidth and GPU core usage, wherein the factors include at least one of the following: system memory bandwidth, topology, and GPU core clock.
As shown in Fig. 4, the GPU memory bandwidth and core utilization of the above AGX-2 platform were tested while the face recognition application was running, where GPU1_Rate 401 denotes the core utilization of GPU1 and GPU1_Mem_Rate 402 denotes the memory bandwidth utilization of GPU1. The memory bandwidth utilization varies greatly, fluctuating between 0% and 100%, although it sits at 100% most of the time. The data captured by Teye show that GPU utilization and memory bandwidth utilization differ from card to card, and that the instability of bandwidth usage affects GPU core usage. Therefore, the platform needs to improve GPU memory bandwidth utilization by adjusting the topology, disk bandwidth, and so on.
Combining the performance of the yc-Res152 model on the deep learning training platform with the analysis of the Inspur Teye tool, the following conclusion can be drawn: from the monitoring and analysis of GPU metrics, the usage of GPU memory bandwidth and cores directly characterizes the processing efficiency and capability of the platform, and every factor affecting memory bandwidth and core usage (system memory bandwidth, topology, GPU core clock, etc.) affects processing efficiency.
Fig. 5 A shows memory size, the memory bandwidth of preceding 4GPU card provided by the invention, rear 4GPU card and 8GPU card The schematic diagram of one embodiment of comparative situation.Fig. 5 B shows the one of the GPU1 service condition of preceding 4GPU card provided by the invention The schematic diagram of a embodiment.Fig. 5 C shows showing for one embodiment of the GPU5 service condition of rear 4GPU card provided by the invention It is intended to.In a preferred embodiment, platform is tested to be grabbed in face recognition application operation according to the quantity of the GPU card of operation The GPU service condition taken, including, it is tested by fixed card, the service condition of verifying memory bandwidth, memory size and GPU.
As shown in Figs. 5A-5C, the AGX-2 platform described above captures GPU usage according to the number of GPU cards in operation while the face recognition application runs. In the figures, front 4 cards_MemUsed_KB 501 denotes the memory bandwidth and memory capacity usage of the front 4 cards; rear 4 cards_MemUsed_KB 502 denotes that of the rear 4 cards; 8 cards_MemUsed_KB 503 denotes that of all 8 cards. GPU1_Rate 504 denotes the core utilization of GPU1 and GPU1_Mem_Rate 505 its memory bandwidth utilization; GPU5_Rate 506 denotes the core utilization of GPU5 and GPU5_Mem_Rate 507 its memory bandwidth utilization. A fixed-card binding test is run on the same data in three groups: binding GPU cards 0, 1, 2 and 3 is the first group; binding GPU cards 4, 5, 6 and 7 is the second group; all eight cards together form the third group. The main purpose of the test is to verify memory bandwidth and memory capacity (Fig. 5A) and GPU usage (Figs. 5B and 5C).
Comparing the memory capacity used by the 4-card and 8-card runs, the consumed capacity is nearly identical, so memory capacity is not a bottleneck. Comparing memory bandwidth between the 4-card and 8-card runs, the bandwidth utilization does not increase linearly, so memory bandwidth becomes the bottleneck limiting the application. It can also be observed that the GPU memory bandwidth utilization in the front-4-card test is slightly more stable than in the rear-4-card test. Comparing GPU usage in the 4-card and 8-card tests, the 4-card run uses the GPUs more fully; once memory bandwidth is saturated, GPU utilization drops, so what the platform needs to improve is its data transfer capability, i.e. memory bandwidth. The test results show that with a small data processing volume the platform's GPU bandwidth and core usage are good; for eight-card model training, the platform's data transfer capability along the data path also needs improvement.
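The fixed-card binding described above (front 4 cards, rear 4 cards, all 8 cards) is typically done by restricting GPU visibility per test group. A minimal Python sketch; the group labels, the training command, and the linearity tolerance are illustrative assumptions, not part of the original test setup:

```python
import os
import subprocess

# Fixed-card binding groups for the test above; the labels are hypothetical names.
GPU_GROUPS = {
    "front_4": "0,1,2,3",
    "rear_4": "4,5,6,7",
    "all_8": "0,1,2,3,4,5,6,7",
}

def run_fixed_card_test(group: str, train_cmd: list) -> None:
    """Launch the training command with GPU visibility restricted to one group."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=GPU_GROUPS[group])
    subprocess.run(train_cmd, env=env, check=True)

def bandwidth_scales_linearly(bw_4card: float, bw_8card: float, tol: float = 0.2) -> bool:
    """True if the 8-card memory bandwidth is within `tol` of twice the
    4-card bandwidth; False suggests memory bandwidth is the bottleneck,
    matching the observation in the text."""
    return abs(bw_8card - 2.0 * bw_4card) <= tol * 2.0 * bw_4card
```

For example, `run_fixed_card_test("front_4", ["python", "train.py"])` would run the first group (with a hypothetical `train.py`); the bandwidth check is then applied to the captured MemUsed figures from the 4-card and 8-card runs.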
Combining the performance of the yc-Res152 model on the deep learning training platform with the analysis from the Inspur Teye tool, the following conclusion can be drawn: the fixed-card analysis shows that reducing the memory bandwidth load, i.e. increasing the available memory bandwidth, improves the usage efficiency of GPU memory bandwidth and GPU cores.
In a preferred embodiment, optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the I/O performance of the disk, which further comprises:
adjusting the data layout so that I/O is distributed evenly across all physical disks;
for a disk array (RAID), choosing a suitable RAID scheme and making the application I/O size equal, as far as possible, to the stripe size or a multiple of the stripe size;
increasing the queue depth of the disk driver;
applying caching technology, at the file-system level or the application level, to reduce the number of disk accesses by the application;
wherein the monitored disk I/O performance indicators include at least one of: I/O operations per second (IOPS or tps), throughput, average I/O size, percentage of time the disk is active, service time, I/O queue length, and wait time.
When optimizing disk I/O performance, note that the main role of the disk is to store data for the application to read and write, so disk I/O indicators are the main aspect of disk performance. The monitored disk I/O performance indicators mainly include: I/O operations per second (IOPS or tps), throughput, average I/O size, percentage of time the disk is active, service time, I/O queue length, and wait time. Common optimization methods for the platform include: (1) adjusting the data layout so that I/O is distributed evenly across all physical disks; (2) for RAID, choosing a suitable RAID scheme and making the application I/O size equal, as far as possible, to the stripe size or a multiple of it; (3) increasing the queue depth of the disk driver; and (4) applying caching technology, at the file-system level or the application level, to reduce the number of disk accesses.
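The monitored indicators above (IOPS, throughput, average I/O size) can be derived from two samples of the kernel's per-disk counters (e.g. `/proc/diskstats` on Linux, which is also what `iostat` reads). A hedged sketch with hypothetical counter values; the raw file is not parsed here so the arithmetic stays explicit:

```python
def disk_io_metrics(reads0: int, writes0: int, sectors0: int,
                    reads1: int, writes1: int, sectors1: int,
                    interval_s: float, sector_bytes: int = 512):
    """Derive IOPS, throughput (bytes/s) and average I/O size (bytes) from
    two samples of per-disk counters (completed reads, completed writes,
    total sectors transferred), taken `interval_s` seconds apart."""
    iops = ((reads1 - reads0) + (writes1 - writes0)) / interval_s
    throughput = (sectors1 - sectors0) * sector_bytes / interval_s
    avg_io_size = throughput / iops if iops else 0.0
    return iops, throughput, avg_io_size

# Hypothetical counter samples taken 2 seconds apart:
iops, tput, avg = disk_io_metrics(0, 0, 0, 100, 100, 800, 2.0)
# -> 100.0 I/Os per second, 204800.0 bytes/s, 2048.0 bytes per I/O
```

The same two-sample technique extends to the busy-time and queue-length fields of `/proc/diskstats` for the disk-active percentage and queue-length indicators.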
In a preferred embodiment, optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the memory bandwidth, which further comprises:
reducing the number of memory modules on the new platform to 12 modules matching the 12 memory channels, using DDR4-2666 (2666 MHz) memory, and selecting 32 GB modules for capacity.
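As a rough sanity check of host memory bandwidth before and after the module change, a simple copy benchmark can be timed. This is a crude single-threaded sketch, not a substitute for a proper multi-channel benchmark such as STREAM; the buffer size and repeat count are arbitrary choices:

```python
import time

def measure_copy_bandwidth(n_bytes: int = 64 * 1024 * 1024, repeats: int = 5) -> float:
    """Rough host memory copy bandwidth in GB/s: time a full buffer copy,
    counting one read pass plus one write pass over the data."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # copies n_bytes: one read pass + one write pass
        best = min(best, time.perf_counter() - t0)
    return 2 * n_bytes / best / 1e9
```

Comparing the figure reported before and after the 16-to-12-module adjustment gives a quick indication of whether the down-clocking was actually avoided.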
Memory bandwidth is optimized by reducing the number of memory modules on the new platform to 12 modules matching the 12 channels, using DDR4-2666 memory and selecting 32 GB modules. After the memory bandwidth adjustment, the processing speed of the yc-Res152 face recognition application on the platform is as follows:
In a preferred embodiment, optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the topology and hyper-threading, which further comprises:
modifying the topology so that two switches are attached under CPU1, with 4 GPUs attached under each switch, to guarantee full P2P communication among the GPUs and reduce the latency of data transfers between GPU cards;
enabling hyper-threading, which increases the multi-process data-reading capability and improves application test performance.
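Whether the modified topology actually allows full P2P can be checked against the link matrix reported by `nvidia-smi topo -m`. A sketch that flags any GPU pair connected only across the CPU interconnect (link type `SYS`), which generally rules out P2P; the two-GPU matrices below are illustrative, not taken from the patent:

```python
import itertools

def all_pairs_p2p(link_matrix) -> bool:
    """link_matrix[i][j] holds the link type between GPU i and GPU j as
    reported by `nvidia-smi topo -m` (e.g. 'NV1', 'PIX', 'PXB', 'SYS').
    A 'SYS' link crosses the CPU interconnect and generally cannot use P2P."""
    n = len(link_matrix)
    return all(link_matrix[i][j] != "SYS"
               for i, j in itertools.combinations(range(n), 2))

# Two GPUs under the same switch (illustrative): P2P possible.
same_switch = [["X", "PIX"], ["PIX", "X"]]
# Two GPUs on different CPU sockets: P2P blocked.
cross_socket = [["X", "SYS"], ["SYS", "X"]]
```

In the 2-switch-under-CPU1 layout described above, all eight GPUs would sit on the same side of the CPU interconnect, so the check passes for the full matrix.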
For the topology and hyper-threading optimization: enabling hyper-threading increases the multi-process data-reading capability and can improve application test performance to a certain extent; modifying the topology so that two switches hang under CPU1, each with 4 GPUs, guarantees full P2P communication among the GPUs and reduces inter-card transfer latency. The following table shows application test performance after adjusting hyper-threading and the topology:
In a preferred embodiment, optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing GPU parameters, which further comprises:
enabling GPU overclocking (Boost) and setting the running frequency to the maximum value, and disabling Error Checking and Correcting (ECC), to guarantee training speed and training stability;
changing the compute device model to V100 and increasing the batch size (batchsize) to obtain the training speed of the face recognition application.
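The GPU parameter changes above map onto standard `nvidia-smi` switches (persistence mode, locked application clocks, ECC off). The sketch below only builds the command lines rather than executing them; the clock values are illustrative for a Tesla V100 and should be taken from `nvidia-smi -q -d SUPPORTED_CLOCKS` on the actual card, and toggling ECC requires a GPU reset to take effect:

```python
def gpu_tuning_commands(gpu_id: int = 0, graphics_clock: int = 1530,
                        mem_clock: int = 877) -> list:
    """Build nvidia-smi command lines for the tuning described above.
    Clock values are illustrative assumptions, not taken from the patent."""
    gid = str(gpu_id)
    return [
        ["nvidia-smi", "-i", gid, "-pm", "1"],   # enable persistence mode
        # lock application clocks at the maximum (memory,graphics), i.e. "Boost"
        ["nvidia-smi", "-i", gid, "-ac", f"{mem_clock},{graphics_clock}"],
        ["nvidia-smi", "-i", gid, "-e", "0"],    # disable ECC (takes effect after reset)
    ]
```

Each list can be passed to `subprocess.run` on a host with the NVIDIA driver installed; run them per GPU index 0-7 for an 8-card platform.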
For the GPU parameter tuning: GPU Boost is enabled with the running frequency set to maximum and ECC error checking is disabled to guarantee training speed and stability; then the compute device is changed to V100 and the batch size (batchsize) is increased, finally yielding the following training speed for the face recognition application:
Fig. 6 is a histogram of one embodiment of the optimization of the yc-Res152 face recognition model on the AGX-2 platform according to the present invention. In a preferred embodiment, optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes at least one of the following:
converting the storage medium from mechanical hard disk to solid-state disk, to accelerate data I/O;
reducing the number of memory modules from 16 to 12 and changing the frequency from 2200 MHz to 2666 MHz, to avoid down-clocking and improve memory bandwidth;
replacing the compute device, converting from P100 to V100, to increase core processing capability;
setting the boost frequency of the GPU to the maximum;
enabling hyper-threading of the CPU, to enhance data-reading capability; and
converting the topology so that each GPU can communicate point-to-point, to reduce latency.
Here P100 is the NVIDIA Tesla P100 graphics card (_NV_24G_Tesla-P100_4096b_S_CAC) and V100 is the NVIDIA Tesla V100 graphics card (_NV_16G_Tesla-V100_S_CAC). With the series of optimization strategies applied to the M5 platform (as shown in Fig. 6), the initial training speed of the yc-Res152 model is 1300 samples/s. After the disk bandwidth optimization, i.e. moving the data to an SSD, it rises to about 1350 samples/s. After the memory bandwidth optimization, the data processing speed reaches 1450 samples/s. Comparing different topologies shows that the topology with full point-to-point communication among the GPU cards gives the most stable training speed; with hyper-threading enabled, the processing speed of the yc-Res152 model on the M5 platform is 1500 samples/s. After enabling GPU Boost with the running frequency set to maximum and disabling ECC error checking, the processing speed is 1600 samples/s. After replacing the compute device with the Volta-architecture V100, the processing speed reaches 3400 samples/s; the optimized performance is 2.6x the initial performance.
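The speedup chain reported above (1300 to 3400 samples/s) can be tabulated step by step. The figures are taken from the text; the step labels are paraphrased for brevity:

```python
# Cumulative training speeds (samples/s) after each optimization step.
OPTIMIZATION_STEPS = [
    ("baseline", 1300),
    ("SSD for data (disk I/O)", 1350),
    ("memory bandwidth", 1450),
    ("topology + hyper-threading", 1500),
    ("GPU Boost + ECC off", 1600),
    ("P100 -> V100", 3400),
]

def speedups(steps):
    """Cumulative speedup of each step relative to the baseline."""
    base = steps[0][1]
    return [(name, round(rate / base, 1)) for name, rate in steps]

# speedups(OPTIMIZATION_STEPS)[-1] -> ("P100 -> V100", 2.6),
# matching the 2.6x figure in the text.
```

The tabulation makes clear that the device swap dominates the gain, while the disk, memory, topology and GPU-parameter steps each contribute incrementally.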
This optimization is based on the Inspur AGX-2 AI face recognition deep learning training platform construction and optimization scheme. Through this set of method flows, the Inspur AI server platform AGX-2 can better meet customers' deep learning training needs, improve training efficiency, and improve the AI business resolution capability and market competitiveness of the AI server. The CPU+GPU deep learning platform has become the standard configuration in the current AI field; using a CPU+GPU-based AI server to improve the training efficiency of the face recognition application and accelerate model release can accelerate offline AI business development. By optimizing the AI training platform, input cost can be reduced, higher training efficiency obtained, and training stability improved.
The yc-Res152 model is representative in the field of offline face recognition model training. Based on the conclusions obtained from testing the yc-Res152 model and the post-optimization performance, the basic configuration of the platform can then be assembled according to the requirements of the face recognition deep learning application. The table below gives the basic parameters of the deep learning training platform for the face recognition application. After the basic parameters of the platform are determined, the topology attaches 4 cards each via 2x16 links under CPU1, so that all 8 cards can communicate via P2P. For offline training, hyper-threading is enabled, GPU Boost is enabled, and the running frequency is set to maximum.
As can be seen from the above embodiments, the optimization method for a deep learning training platform provided by an embodiment of the present invention analyzes the characteristics of the face recognition application based on a face recognition model, targets the bottleneck points that appear on the server platform during offline training, optimizes the configuration, component parameters and component connection relationships of the platform, and adjusts the server hardware configuration scheme. It solves the prior-art problems of low cost-effectiveness, wasted cost and limited performance caused by simply stacking hardware or simply changing the configuration of a generic server, finally saving cost and effectively improving training speed, training stability and production efficiency.
It is important to note that the steps in the embodiments of the above optimization method for a deep learning training platform can be interleaved, replaced, added and deleted; therefore, such reasonable permutations, combinations and transformations of the optimization method for a deep learning training platform should also fall within the protection scope of the present invention, and the protection scope of the present invention should not be confined to the described embodiments.
The above are exemplary embodiments of the present disclosure; the order in which the embodiments are disclosed is for description only and does not represent the merits of the embodiments. It should be noted that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosed embodiments (including the claims) is limited to these examples; many modifications and variations are possible without departing from the scope defined by the claims. The functions, steps and/or actions of the method claims of the disclosed embodiments need not be performed in any particular order. In addition, although elements of the disclosed embodiments may be described or claimed in the singular, they should be understood as plural unless explicitly limited to the singular.
Based on the above object, a second aspect of an embodiment of the present invention proposes an optimization system for a face recognition deep learning training platform, comprising: a processor; and a memory storing instructions executable by the processor, the processor implementing the above method when executing the instructions.
The optimization system for a deep learning training platform provided by an embodiment of the present invention analyzes the characteristics of the face recognition application based on a face recognition model, targets the bottleneck points that appear on the server platform during offline training, optimizes the configuration, component parameters and component connection relationships of the platform, and adjusts the server hardware configuration scheme. It solves the prior-art problems of low cost-effectiveness, wasted cost and limited performance caused by simply stacking hardware or simply changing the configuration of a generic server, finally saving cost and effectively improving training speed, training stability and production efficiency.
It is important to note that the embodiment of the above optimization system for a deep learning training platform uses the embodiments of the optimization method to illustrate the working process of each module, and those skilled in the art can readily apply these modules to other embodiments of the optimization method for the deep learning training platform. Of course, the steps in the embodiments of the optimization method can be interleaved, replaced, added and deleted; therefore, such reasonable permutations, combinations and transformations of the optimization system for the deep learning training platform should also fall within the protection scope of the present invention, which should not be confined to the described embodiments.
Offline training platforms for AI applications in every industry currently suffer, to a greater or lesser extent, from performance bottlenecks, and most address them by simply stacking hardware. With the great development of artificial intelligence and the changing scale of deep learning data, optimizing the training platform methodically and with a clear target can improve training speed and production efficiency. This scheme is not only suitable for offline face recognition training platforms, but can also be transferred without difficulty to offline speech recognition and natural language processing training platforms, improving the specificity of platform optimization.
It should be understood by those of ordinary skill in the art that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the disclosed embodiments (including the claims) is limited to these examples. Under the idea of the embodiments of the present invention, the technical features in the above embodiments or in different embodiments may also be combined, and there are many other variations of different aspects of the embodiments of the present invention as described above, which are not provided in detail for simplicity. Therefore, any omission, modification, equivalent replacement or improvement made within the spirit and principles of the embodiments of the present invention shall be included in the protection scope of the embodiments of the present invention.

Claims (11)

1. An optimization method for a face recognition deep learning training platform, characterized by comprising the following steps:
running a predetermined face recognition model;
testing the usage of a CPU;
testing the read/write behavior and IOPS behavior of disk data;
testing a GPU memory bandwidth utilization rate and a GPU core utilization rate;
testing the GPU usage captured according to the number of GPU cards in operation;
optimizing, according to the results of all the tests, the configuration, component parameters and component connection relationships of a predetermined platform.
2. The method according to claim 1, characterized in that testing the usage of the CPU comprises: analyzing, over time, the core utilization rate, system load, idle rate and I/O wait percentage of the CPU, and determining whether the core utilization rate and system load are very low, the idle rate is very high, and the I/O wait percentage is very small.
3. The method according to claim 1, characterized in that testing the read/write behavior and IOPS behavior of disk data comprises: analyzing the read/write speed and IOPS of the disk, and determining whether the storage and reading of data on the disk affects the offline training efficiency of the platform.
4. The method according to claim 1, characterized in that testing the GPU memory bandwidth utilization rate and the GPU core utilization rate comprises: monitoring and analyzing GPU indicators to obtain the factors affecting the usage of the GPU memory bandwidth and the GPU cores, wherein the factors include at least one of: memory bandwidth, topology, and GPU clock frequency.
5. The method according to claim 1, characterized in that testing the GPU usage captured according to the number of GPU cards in operation comprises: verifying, through fixed-card binding tests, the usage of memory bandwidth, memory capacity and the GPUs.
6. The method according to claim 1 or 3, characterized in that optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the I/O performance of the disk, which further comprises:
adjusting the data layout so that I/O is distributed evenly across all physical disks;
for a disk array (RAID), choosing a suitable RAID scheme and making the application I/O size equal, as far as possible, to the stripe size or a multiple of the stripe size;
increasing the queue depth of the disk driver;
applying caching technology, at the file-system level or the application level, to reduce the number of disk accesses by the application;
wherein the monitored disk I/O performance indicators include at least one of: I/O operations per second, throughput, average I/O size, percentage of time the disk is active, service time, I/O queue length, and wait time.
7. The method according to claim 1, characterized in that optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the memory bandwidth, which further comprises:
reducing the number of memory modules on the new platform to 12 modules matching the 12 memory channels, using DDR4-2666 (2666 MHz) memory, and selecting 32 GB modules for capacity.
8. The method according to claim 1, characterized in that optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing the topology and hyper-threading, which further comprises:
modifying the topology so that two switches are attached under CPU1, with 4 GPUs attached under each switch, to guarantee full P2P communication among the GPUs and reduce the latency of data transfers between GPU cards;
enabling hyper-threading, to increase the multi-process data-reading capability.
9. The method according to claim 1 or 5, characterized in that optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes optimizing GPU parameters, which further comprises:
enabling GPU overclocking, setting the running frequency to the maximum value, and disabling Error Checking and Correcting, to guarantee training speed and training stability;
changing the compute device model to V100 and increasing the batch size to obtain the training speed of the face recognition application.
10. The method according to claim 1, characterized in that optimizing the configuration, component parameters and component connection relationships of the predetermined platform includes at least one of the following:
converting the storage medium from mechanical hard disk to solid-state disk, to accelerate data I/O;
reducing the number of memory modules from 16 to 12 and changing the frequency from 2200 MHz to 2666 MHz, to avoid down-clocking and improve memory bandwidth;
replacing the compute device, converting from P100 to V100, to increase core processing capability;
setting the boost frequency of the GPU to the maximum;
enabling hyper-threading of the CPU, to enhance data-reading capability; and
converting the topology so that each GPU can communicate point-to-point, to reduce latency.
11. An optimization system for a face recognition deep learning training platform, characterized by comprising:
a processor; and
a memory storing instructions executable by the processor, the processor implementing the method according to any one of claims 1-10 when executing the instructions.
CN201810819416.4A 2018-07-24 2018-07-24 A kind of optimization method and system of recognition of face deep learning training platform Pending CN109062692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810819416.4A CN109062692A (en) 2018-07-24 2018-07-24 A kind of optimization method and system of recognition of face deep learning training platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810819416.4A CN109062692A (en) 2018-07-24 2018-07-24 A kind of optimization method and system of recognition of face deep learning training platform

Publications (1)

Publication Number Publication Date
CN109062692A true CN109062692A (en) 2018-12-21

Family

ID=64836173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810819416.4A Pending CN109062692A (en) 2018-07-24 2018-07-24 A kind of optimization method and system of recognition of face deep learning training platform

Country Status (1)

Country Link
CN (1) CN109062692A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032450A (en) * 2019-04-17 2019-07-19 中山大学 A kind of extensive deep learning method and system based on solid-state disk exented memory
CN110032449A (en) * 2019-04-16 2019-07-19 苏州浪潮智能科技有限公司 A kind of method and device for the performance optimizing GPU server
CN110427300A (en) * 2019-07-19 2019-11-08 广东浪潮大数据研究有限公司 Server GPU performance regulates and controls method, apparatus, equipment and readable storage medium storing program for executing
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform
CN113792875A (en) * 2021-09-09 2021-12-14 曙光信息产业(北京)有限公司 Performance test method, device, equipment and medium of distributed communication library
CN116776926A (en) * 2023-08-15 2023-09-19 上海燧原科技有限公司 Optimized deployment method, device, equipment and medium for dialogue model
CN117687802A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929667A (en) * 2012-10-24 2013-02-13 曙光信息产业(北京)有限公司 Method for optimizing hadoop cluster performance
CN103294579A (en) * 2013-06-09 2013-09-11 浪潮电子信息产业股份有限公司 Method for testing high-performance computing cluster application performance

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929667A (en) * 2012-10-24 2013-02-13 曙光信息产业(北京)有限公司 Method for optimizing hadoop cluster performance
CN103294579A (en) * 2013-06-09 2013-09-11 浪潮电子信息产业股份有限公司 Method for testing high-performance computing cluster application performance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WEIXIN_34050519: "Disk I/O performance monitoring indicators and tuning methods", CSDN blog, https://blog.csdn.net/weixin_34050519/article/details/93496540 *
LI Sikun et al.: "Visualization of Large-Scale Flow Field Scientific Computing" (《大规模流场科学计算可视化》), 31 August 2013 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032449A (en) * 2019-04-16 2019-07-19 苏州浪潮智能科技有限公司 A kind of method and device for the performance optimizing GPU server
CN110032450A (en) * 2019-04-17 2019-07-19 中山大学 A kind of extensive deep learning method and system based on solid-state disk exented memory
CN110032450B (en) * 2019-04-17 2021-04-20 中山大学 Large-scale deep learning method and system based on solid-state disk extended memory
CN110427300A (en) * 2019-07-19 2019-11-08 广东浪潮大数据研究有限公司 Server GPU performance regulates and controls method, apparatus, equipment and readable storage medium storing program for executing
CN110427300B (en) * 2019-07-19 2023-07-14 广东浪潮大数据研究有限公司 Method, device and equipment for regulating and controlling server GPU performance and readable storage medium
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform
CN111736463B (en) * 2020-05-09 2023-03-03 刘炜 Adaptive deep learning control method based on operation platform
CN113792875A (en) * 2021-09-09 2021-12-14 曙光信息产业(北京)有限公司 Performance test method, device, equipment and medium of distributed communication library
CN116776926A (en) * 2023-08-15 2023-09-19 上海燧原科技有限公司 Optimized deployment method, device, equipment and medium for dialogue model
CN116776926B (en) * 2023-08-15 2023-11-07 上海燧原科技有限公司 Optimized deployment method, device, equipment and medium for dialogue model
CN117687802A (en) * 2024-02-02 2024-03-12 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform
CN117687802B (en) * 2024-02-02 2024-04-30 湖南马栏山视频先进技术研究院有限公司 Deep learning parallel scheduling method and device based on cloud platform and cloud platform

Similar Documents

Publication Publication Date Title
CN109062692A (en) A kind of optimization method and system of recognition of face deep learning training platform
US11947829B2 (en) Data writing method, device, storage server, and computer readable storage medium
EP3612947B1 (en) Processing discontiguous memory as contiguous memory to improve performance of a neural network environment
Ando et al. Deep over-sampling framework for classifying imbalanced data
CN107437110B (en) Block convolution optimization method and device of convolutional neural network
US20220051025A1 (en) Video classification method and apparatus, model training method and apparatus, device, and storage medium
CN110070181A (en) A kind of optimization method of the deep learning for edge calculations equipment
CN105242871B (en) A kind of method for writing data and device
KR102490908B1 (en) Resource scheduling method and terminal device
CN102708067B (en) Combining memory pages having identical content
CN105184367B (en) The model parameter training method and system of deep neural network
CN110308875A (en) Data read-write method, device, equipment and computer readable storage medium
CN111325664B (en) Style migration method and device, storage medium and electronic equipment
CN103593452A (en) Data intensive computing cost optimization method based on MapReduce mechanism
CN110413776B (en) High-performance calculation method for LDA (text-based extension) of text topic model based on CPU-GPU (Central processing Unit-graphics processing Unit) collaborative parallel
KR102502569B1 (en) Method and apparuts for system resource managemnet
CN110109628A (en) Data re-establishing method, device, equipment and the storage medium of distributed memory system
CN106951301B (en) Pre-reading method of files and device
CN105159602B (en) Data processing method and storage equipment
CN106406976A (en) Method and apparatus for identifying IO intensive application in cloud computing environment
CN112200310B (en) Intelligent processor, data processing method and storage medium
CN104063230B (en) The parallel reduction method of rough set based on MapReduce, apparatus and system
CN101251789A (en) Cheap magnetic disc redundant array RAID5 roll rapid capacitance enlarging method
DE102020112066B4 (en) SHARED RESOURCES OPERATIONAL METRIC
CN113505861B (en) Image classification method and system based on meta-learning and memory network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181221