CN106951926A - Deep learning system method and device with a hybrid architecture - Google Patents
Deep learning system method and device with a hybrid architecture
- Publication number
- CN106951926A CN106951926A CN201710196532.0A CN201710196532A CN106951926A CN 106951926 A CN106951926 A CN 106951926A CN 201710196532 A CN201710196532 A CN 201710196532A CN 106951926 A CN106951926 A CN 106951926A
- Authority
- CN
- China
- Prior art keywords
- module
- reasoning
- training
- deep learning
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/94—Hardware or software architectures specially adapted for image or video understanding
- G06V10/955—Hardware or software architectures specially adapted for image or video understanding using specific electronic processors
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Stored Programmes (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a deep learning system method and device with a hybrid architecture, characterized by the following steps: when the training data set is updated, the training module retrains the deep learning network model and stores the weight and bias parameters; a server-side monitoring process detects the change in the parameter file, encapsulates the parameter information into a preset data structure, and notifies the inference module; the inference module suspends the inference service, reads the weight and bias file contents from the server side, and updates its network model; meanwhile, the server-side monitoring process processes input files that require inference and notifies the inference module. The system and device comprise a server module, a training module, an inference module, and a bus interface. By mixing training and inference on a single CPU+GPU+CAPI heterogeneous deep learning system, the present invention makes full use of resources, achieves a higher energy-efficiency ratio, enables CAPI to directly access server memory, and supports real-time online iterative updating of inference-model parameters such as weights.
Description
Technical field
The present invention relates to the technical fields of circuit design and machine learning, and in particular to a deep learning system method and device with a hybrid architecture.
Background technology
The IT industry has developed rapidly in the 21st century, bringing people enormous benefits and convenience. Deep learning applications are divided into two parts, training and inference. Taking the ImageNet evaluation as an example, training the AlexNet model requires 8 million images in 1,000 categories: the AlexNet model extracts features and computes the loss, then updates the weight parameters by back-propagation methods such as SGD, so that the model converges step by step and finally yields a good network model. Inference is the process of passing an input through the network model in a single forward computation to obtain the final classification accuracy (Top-5 is typically chosen). The training process of a deep learning application requires large amounts of computing resources and training data; current training platforms generally use NVIDIA high-performance GPUs such as the Tesla P100, Titan X, or GTX 1080 to accelerate training. Once a usable model is obtained, it is deployed to another platform to provide inference services externally. Because inference performs only a single forward computation, its computational requirements are lower, and the main requirement is low latency. Platforms currently used for inference include CPU-based cloud service platforms, server clusters based on low-power GPUs, and FPGA or dedicated ASIC clusters. In terms of low latency and energy efficiency, FPGAs and dedicated ASICs can perform even better; and compared with ASICs, FPGAs have greater architectural flexibility and are attracting increasing attention. CAPI, the Coherent Accelerator Processor Interface, is a high-speed bus interface protocol released by IBM for POWER processors; its physical interface is PCI-E or IBM's BlueLink. CAPI implements a PSL layer internally, which guarantees memory-access coherence with the server, so the accelerator can directly access CPU memory through virtual addresses, greatly reducing access latency. The SNAP Framework programming environment released by IBM allows algorithm models to be implemented conveniently in C/C++.
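The training loop outlined above (forward pass, loss, back-propagated SGD weight update) can be sketched in a few lines. This is a minimal single-neuron illustration of the principle, not code from the patent:

```python
def sgd_step(w, b, x, y, lr=0.01):
    """One SGD update for a single linear neuron with squared-error loss."""
    y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b   # forward computation
    err = y_hat - y                                    # gradient of 0.5*err^2 w.r.t. y_hat
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]   # back-propagate to the weights
    b = b - lr * err                                   # and to the bias
    return w, b

# Repeated updates shrink the error, i.e. the model gradually converges.
w, b = [0.0, 0.0], 0.0
for _ in range(200):
    w, b = sgd_step(w, b, x=[1.0, 2.0], y=3.0)
```

Real training replaces the single neuron with the full network and iterates over the whole labeled data set, but the converge-by-repeated-updates structure is the same.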
People have researched and developed various deep learning methods and devices for this purpose. For example, Chinese patent publication CN106022472A describes an embedded deep learning processor; that invention belongs to the technical field of integrated circuits and is specifically an FPGA-based embedded deep learning processor. The processor includes: a central processing unit (CPU), which completes the logical operation, control, and storage work necessary while the processor learns and runs; and a deep learning unit, the hardware implementation unit of the deep learning algorithm and the core component for deep learning processing. The processor combines a conventional CPU with the deep learning unit, where a deep learning combination unit can be assembled from multiple deep learning units; it is scalable and can serve as the core processor of artificial-intelligence applications at different computing scales. As shown in Fig. 5, Chinese patent publication CN106156851A describes an accelerator and method oriented toward deep learning services, for performing deep learning computation on pending data in a server. It includes a computation control module arranged on the network card at the server end and connected to the server by a bus, a first memory, and a second memory; the computation control module is a programmable logic device comprising a control unit, a data storage unit, a logic storage unit, a bus interface communicating respectively with the network card, the first memory, and the second memory, a first communication interface, and a second communication interface; the logic storage unit stores the deep learning control logic; the first memory stores the weight data and bias data of each network layer. That invention can effectively improve computational efficiency and enhance the performance-per-watt ratio.
The prior art has the following disadvantages: 1) conventional approaches separate training from inference, so two sets of platform environments must be maintained and resources are not fully utilized; 2) performing all deep learning computation on an FPGA/CPLD provides insufficient computing power and is currently unsuitable for large-scale training scenarios; 3) the FPGA/CPLD generally communicates with the server by DMA, so the latency of data interaction with the CPU server is large. It is therefore necessary to propose a new deep learning system method and device.
Summary of the invention
To address the above deficiencies of the prior art, the present invention provides a deep learning system method and device with a hybrid architecture that exploits the advantages and features of each module, achieves a higher energy-efficiency ratio, and makes full use of resources; CAPI provides direct access to server memory, reducing latency and programming complexity. The technical scheme by which the present invention solves its technical problem is as follows:
A deep learning system method with a hybrid architecture, for performing deep learning training and inference, comprises the following steps:
S1: when the training data set is updated, the training module retrains the deep learning network model; after training ends, the weight and bias parameters of the network model are stored to a preset file;
S2: the server-side monitoring process detects the change in the parameter file, encapsulates the virtual address and length of the weight and bias parameter storage into a preset data structure, and notifies the inference module;
S3: the inference module suspends the inference service, reads the weight and bias file contents from the server side through the bus interface, and updates its network model;
S4: the server-side monitoring process also processes input files that require inference and notifies the inference module; after completing inference, the inference module returns the result to the server-side monitoring process.
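Step S2 can be sketched as follows: the monitoring process notices a change to the parameter file and packs the buffer's virtual address and byte length into a fixed binary structure for the inference module. The field layout and the `notify_inference` stub are illustrative assumptions; the patent does not specify the actual data structure:

```python
import os
import struct
import tempfile

# Assumed layout of the preset data structure: two unsigned 64-bit fields,
# the virtual address and the byte length of the parameter buffer.
PARAM_DESC = struct.Struct("<QQ")

def notify_inference(message: bytes) -> bytes:
    """Stub for the driver call that forwards the descriptor to the inference module."""
    return message

def on_param_file_change(path: str) -> bytes:
    """S2: load the updated parameter file and describe it by (address, length)."""
    data = open(path, "rb").read()   # weights and biases written in S1
    addr = id(data)                  # stand-in for the buffer's virtual address (CPython detail)
    desc = PARAM_DESC.pack(addr, len(data))
    return notify_inference(desc)

# One monitoring iteration against a freshly written parameter file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)          # fake weight/bias payload
desc = on_param_file_change(f.name)
addr, length = PARAM_DESC.unpack(desc)
os.unlink(f.name)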
Step S1 specifically includes the following sub-steps:
S11: when the training data set changes, the network model itself is not changed, but retraining is required to obtain the updated network weights and bias parameters;
S12: after training completes, the weights and bias parameters of each network layer must be stored to the preset file in a format agreed upon with the inference module.
Step S2 specifically includes the following sub-steps:
S21: the server side runs a monitoring process that controls the running, stopping, and parameter updating of the inference module by calling the server kernel library function interface and driver;
S22: the server-side monitoring process continuously monitors whether the weight and bias parameters need updating, and obtains the latest parameter information;
S23: when an update occurs, the monitoring process sends a stop command and the updated parameter file information to the inference module.
Step S3 specifically includes the following sub-steps:
S31: the inference module reads the corresponding weight and bias information directly from the server side into its internal RAM through the virtual address;
S32: after the read completes, the inference module notifies the monitoring process, and the monitoring process sends a run command;
S33: the inference module updates the network model parameters and resumes the inference service.
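Sub-step S31's "read directly through the virtual address" can be illustrated in ordinary user space with `ctypes`: knowing only an address and a length, the reader recovers the parameter bytes without going back through the file. On the real hardware this read is performed by the CAPI card via the PSL layer; the in-process `ctypes` read below is only an analogy:

```python
import ctypes

# Server side: weight parameters living in server memory (written in S1/S2).
weights = (ctypes.c_float * 4)(0.5, -1.25, 3.0, 0.0)
addr = ctypes.addressof(weights)     # the virtual address of the buffer
length = ctypes.sizeof(weights)      # the byte length, as packed in S2

# Inference side (S31): knowing only (addr, length), copy the raw bytes
# into local "internal RAM" and reinterpret them as the weight values.
raw = ctypes.string_at(addr, length)
local_ram = (ctypes.c_float * 4).from_buffer_copy(raw)
recovered = list(local_ram)
```

Because the access is by virtual address with coherent memory, no file I/O or DMA descriptor setup stands between the producer and the consumer — which is the latency advantage the invention claims for CAPI.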
The deep learning network model of the hybrid architecture uses a deep learning network model for image classification.
A deep learning system device with a hybrid architecture, for performing parallelized training and inference of deep learning, comprises a server module, a training module, an inference module, and a bus interface. The server module includes a CPU processor, DDR memory, and a network. The training module and the inference module are connected to the server module through the bus interface and can communicate with it.
The server module provides control, data processing, network interaction, and parameter storage functions for deep learning.
The CPU processor is a POWER processor; the training module is a GPU training-acceleration module for accelerating the deep learning model training process; the inference module is a CAPI inference module that can be preloaded with a preset deep learning network model and is used for the deep learning inference process.
Compared with the prior art, the beneficial effects of the present invention are as follows. The deep learning system method with a hybrid architecture comprises the following steps: when the training data set is updated, the training module retrains the deep learning network model, and after training ends the weight and bias parameters of the network model are stored to a preset file; the server-side monitoring process detects the change in the parameter file, encapsulates the virtual address and length of the weight and bias parameter storage into a preset data structure, and notifies the inference module; the inference module suspends the inference service, reads the weight and bias file contents from the server side through the bus interface, and updates its network model; the server-side monitoring process also processes input files that require inference and notifies the inference module, which returns the result to the server-side monitoring process after completion. The deep learning system device with the hybrid architecture includes a server module, a training module, an inference module, and a bus interface; the server module includes a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server module through the bus interface and can communicate with it. By mixing training and inference on a single CPU+GPU+CAPI heterogeneous deep learning platform, the present invention exploits the advantages and features of each module, achieves a higher energy-efficiency ratio, and makes full use of resources; CAPI provides direct access to server memory, reducing latency and programming complexity; parameters such as the weights of the inference model can be updated iteratively online in real time.
Brief description of the drawings
Fig. 1 is a flow diagram of the deep learning system method with the hybrid architecture of the present invention.
Fig. 2 is an architecture diagram of the deep learning device with the hybrid architecture of the present invention.
Fig. 3 is an architecture diagram of the deep learning device with the hybrid architecture of an embodiment of the present invention.
Fig. 4 is a working-principle diagram of the present invention, taking the AlexNet deep learning network model as an example.
Fig. 5 is a structural block diagram of a prior-art accelerator oriented toward deep learning services.
Embodiments
The present invention is described in further detail below with reference to Figs. 1 to 5, so that the public can better grasp its implementation. The specific embodiments of the present invention are as follows.
As shown in Fig. 1, the deep learning system method with a hybrid architecture of the present invention, for performing deep learning training and inference, comprises the following steps:
S1: when the training data set is updated, the training module retrains the deep learning network model; after training ends, the weight and bias parameters of the network model are stored to a preset file;
S2: the server-side monitoring process detects the change in the parameter file, encapsulates the virtual address and length of the weight and bias parameter storage into a preset data structure, and notifies the inference module;
S3: the inference module suspends the inference service, reads the weight and bias file contents from the server side through the bus interface, and updates its network model;
S4: the server-side monitoring process also processes input files that require inference and notifies the inference module; after completing inference, the inference module returns the result to the server-side monitoring process.
Step S1 specifically includes the following sub-steps:
S11: when the training data set changes, the network model itself is not changed, but retraining is required to obtain the updated network weights and bias parameters;
S12: after training completes, the weights and bias parameters of each network layer must be stored to the preset file in a format agreed upon with the inference module.
Step S2 specifically includes the following sub-steps:
S21: the server side runs a monitoring process that controls the running, stopping, and parameter updating of the inference module by calling the server kernel library function interface and driver;
S22: the server-side monitoring process continuously monitors whether the weight and bias parameters need updating, and obtains the latest parameter information;
S23: when an update occurs, the monitoring process sends a stop command and the updated parameter file information to the inference module.
Step S3 specifically includes the following sub-steps:
S31: the inference module reads the corresponding weight and bias information directly from the server side into its internal RAM through the virtual address;
S32: after the read completes, the inference module notifies the monitoring process, and the monitoring daemon sends a run command;
S33: the inference module updates the network model parameters and resumes the inference service.
As shown in Fig. 2, the deep learning system device with the hybrid architecture, for performing parallelized training and inference of deep learning, is characterized in that: the device includes a server module, a training module, an inference module, and a bus interface; the server module includes a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server module through the bus interface and can communicate with it; the server module provides control, data processing, network interaction, and parameter storage functions for deep learning; the CPU processor is a POWER processor; the training module is a GPU training-acceleration module for accelerating the deep learning model training process; the inference module is a CAPI inference module that can be preloaded with a preset deep learning network model and is used for the deep learning inference process; the bus interface between the server module and the training module is a PCI-E or NVLink bus; the hardware interface between the server module and the inference module is PCI-E or BlueLink, and the bus protocol is CAPI.
Preferably, as shown in Fig. 4, the deep learning network model of the hybrid architecture uses the AlexNet deep learning network model for image classification. To ease understanding of the present scheme, the working principle of the present invention is briefly explained below, taking the AlexNet deep learning network model as an example. The AlexNet deep learning network model consists of 5 convolutional layers and 3 fully connected layers; some convolutional layers additionally apply ReLU, pooling, and normalization operations, and the last fully connected layer feeds a Softmax layer that outputs 1,000 categories. The AlexNet model can be used for large-scale image classification; depending on the training data set, it can be trained as a classifier for different scenarios and provide an image classification service.
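The 5-conv + 3-FC AlexNet topology described above can be checked with simple shape arithmetic, with no framework required. The layer hyper-parameters (kernel sizes, strides, padding, 227x227 input) follow the standard published AlexNet and are not taken from the patent:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv/pool layer: (n - k + 2p) // s + 1."""
    return (size - kernel + 2 * pad) // stride + 1

size = 227                              # standard AlexNet input resolution
size = conv_out(size, 11, stride=4)     # conv1, 96 filters  -> 55x55
size = conv_out(size, 3, stride=2)      # max-pool           -> 27x27
size = conv_out(size, 5, pad=2)         # conv2, 256 filters -> 27x27
size = conv_out(size, 3, stride=2)      # max-pool           -> 13x13
size = conv_out(size, 3, pad=1)         # conv3, 384 filters -> 13x13
size = conv_out(size, 3, pad=1)         # conv4, 384 filters -> 13x13
size = conv_out(size, 3, pad=1)         # conv5, 256 filters -> 13x13
size = conv_out(size, 3, stride=2)      # max-pool           -> 6x6
flat = 256 * size * size                # flattened input to fc6
fc_layers = [flat, 4096, 4096, 1000]    # fc6, fc7, fc8 -> Softmax over 1000 classes
```

The final `1000` matches the 1,000-category Softmax output the description mentions; retraining for a different data set (as in Embodiment 1) would change only that last dimension.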
Embodiment 1
As shown in Fig. 3, as a preferred implementation, consider realizing an AlexNet image classification task. The deep learning device with the hybrid architecture, for performing parallelized training and inference of deep learning, includes a server module composed of a POWER8 processor, DDR memory, a network, and so on; a GPU training-acceleration module (a GTX 1080) connected to the server by a bus; and a CAPI inference module (an ADM-PCIE-KU3 accelerator card) connected to the server by a bus. The GPU training module is used to accelerate the training process of the deep learning model; the inference module is preloaded with the AlexNet network model and used for the deep learning inference process; the server module handles the control, data processing, network interaction, and parameter storage of deep learning; the bus interface between the server module and the training module is a PCI-E or NVLink bus; the hardware interface between the server module and the inference module is PCI-E or BlueLink, and the bus protocol is CAPI.
The deep learning system method of this hybrid-architecture device is implemented in the following steps:
S1: implement the 8-layer AlexNet network model with the SNAP Framework tool (a tool for implementing, in C/C++, algorithm models that run on CAPI cards) and flash it into the CAPI inference module;
S2: based on the TensorFlow deep learning framework, obtain 3 million labeled images of, for example, 300 kinds of birds as TFRecords files, and supply them as the training data set to two GTX 1080 GPUs for distributed training;
S3: the monitoring process obtains the latest training result (a .pb file), parses the weights and bias parameters therein into file A, and obtains the virtual address and length information of the parameter storage;
S4: the monitoring program calls the CAPI kernel library function interface and driver, and sends the data structure encapsulating the parameter information to the ADM-PCIE-KU3 CAPI module;
S5: the CAPI card parses the parameter addresses from the structure, obtains the parameter information, and updates the stored weight and bias variables of the corresponding network model;
S6: the CAPI card receives image-inference requests sent by the monitoring program and returns the Top-5 results output by the network, so that it can externally provide an image recognition service for these categories;
S7: while the CAPI card provides the service, the training network can continuously train on newly added categories and synchronize the trained parameters into the CAPI card, thereby achieving synchronized updating and iteration of training and inference.
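The Top-5 selection in step S6 amounts to picking the five highest-scoring class indices from the network's output vector. A minimal sketch (the class count and score values below are made-up placeholders; a deployed AlexNet would produce 1,000 Softmax scores):

```python
def top_k(scores, k=5):
    """Return the k class indices with the highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up Softmax output for an 8-class toy network.
scores = [0.01, 0.30, 0.05, 0.20, 0.02, 0.25, 0.10, 0.07]
top5 = top_k(scores, k=5)
```

On the real device this selection would run on the CAPI card after the final Softmax layer, and only the five indices (and optionally their scores) would be returned to the monitoring program.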
The foregoing are merely preferred embodiments of the present invention, but the protection scope of the present invention is not limited to these embodiments. Any modification, equivalent replacement, improvement, or adaptation made within the spirit, principles, and disclosed technical scope of the present invention shall be included within the protection scope of the present invention.
Claims (9)
1. A deep learning system method with a hybrid architecture, for performing deep learning training and inference, characterized by comprising the following steps:
S1: when the training data set is updated, the training module retrains the deep learning network model; after training ends, the weight and bias parameters of the network model are stored to a preset file;
S2: the server-side monitoring process detects the change in the parameter file, encapsulates the virtual address and length of the weight and bias parameter storage into a preset data structure, and notifies the inference module;
S3: the inference module suspends the inference service, reads the weight and bias file contents from the server side through the bus interface, and updates its network model;
S4: the server-side monitoring process also processes input files that require inference and notifies the inference module; after completing inference, the inference module returns the result to the server-side monitoring process.
2. The method according to claim 1, characterized in that step S1 specifically includes the following sub-steps:
S11: when the training data set changes, the network model itself is not changed, but retraining is required to obtain the updated network weights and bias parameters;
S12: after training completes, the weights and bias parameters of each network layer must be stored to the preset file in a format agreed upon with the inference module.
3. The method according to claim 1, characterized in that step S2 specifically includes the following sub-steps:
S21: the server side runs a monitoring process that controls the running, stopping, and parameter updating of the inference module by calling the server kernel library function interface and driver;
S22: the server-side monitoring process continuously monitors whether the weight and bias parameters need updating, and obtains the latest parameter information;
S23: when an update occurs, the monitoring process sends a stop command and the updated parameter file information to the inference module.
4. The method according to claim 1, characterized in that step S3 specifically includes the following sub-steps:
S31: the inference module reads the corresponding weight and bias information directly from the server side into its internal RAM through the virtual address;
S32: after the read completes, the inference module notifies the monitoring process, and the monitoring process sends a run command;
S33: the inference module updates the network model parameters and resumes the inference service.
5. The method according to claim 1, characterized in that the network model uses a deep learning model for image classification.
6. A device of the deep learning system with the hybrid architecture according to any one of claims 1 to 5, for performing parallelized training and inference of deep learning, characterized in that: the device includes a server module, a training module, an inference module, and a bus interface; the server module includes a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server module through the bus interface and can communicate with it.
7. The device according to claim 6, characterized in that the server module provides control, data processing, network interaction, and parameter storage functions for deep learning.
8. The device according to claim 6, characterized in that the CPU processor is a POWER processor, and the training module is a GPU training-acceleration module for accelerating the deep learning model training process.
9. The device according to claim 6, characterized in that the inference module is a CAPI inference module that can be preloaded with a deep learning network model and is used for the deep learning inference process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710196532.0A CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710196532.0A CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106951926A true CN106951926A (en) | 2017-07-14 |
CN106951926B CN106951926B (en) | 2020-11-24 |
Family
ID=59474087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710196532.0A Active CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106951926B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160098633A1 (en) * | 2014-10-02 | 2016-04-07 | Nec Laboratories America, Inc. | Deep learning model for structured outputs with high-order interaction |
CN104463324A (en) * | 2014-11-21 | 2015-03-25 | Changsha Masha Electronic Technology Co., Ltd. | Convolutional neural network parallel processing method based on a large-scale high-performance cluster |
US20160267380A1 (en) * | 2015-03-13 | 2016-09-15 | Nuance Communications, Inc. | Method and System for Training a Neural Network |
CN104714852A (en) * | 2015-03-17 | 2015-06-17 | Huazhong University of Science and Technology | Parameter synchronization optimization method and system suitable for distributed machine learning |
CN105825235A (en) * | 2016-03-16 | 2016-08-03 | Bocom Intelligent Network Technologies Co., Ltd. | Image recognition method based on deep learning with multiple feature maps |
Non-Patent Citations (1)
Title |
---|
YU ZIJIAN et al.: "FPGA-based Convolutional Neural Network Accelerator", Computer Engineering *
Cited By (38)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107563512A (en) * | 2017-08-24 | 2018-01-09 | Tencent Technology (Shanghai) Co., Ltd. | Data processing method, device and storage medium |
CN107563512B (en) * | 2017-08-24 | 2023-10-17 | Tencent Technology (Shanghai) Co., Ltd. | Data processing method, device and storage medium |
CN111052155A (en) * | 2017-09-04 | 2020-04-21 | Huawei Technologies Co., Ltd. | Distributed stochastic gradient descent method for asynchronous gradient averaging |
CN111052155B (en) * | 2017-09-04 | 2024-04-16 | Huawei Technologies Co., Ltd. | Distributed stochastic gradient descent method for asynchronous gradient averaging |
CN107729268A (en) * | 2017-09-20 | 2018-02-23 | Shandong Yingteli Data Technology Co., Ltd. | Memory expansion apparatus and method based on the CAPI interface |
CN107729268B (en) * | 2017-09-20 | 2019-11-12 | Shandong Yingteli Data Technology Co., Ltd. | Memory expansion apparatus and method based on the CAPI interface |
CN109726159A (en) * | 2017-10-30 | 2019-05-07 | Wistron Corporation | Connection module |
TWI658365B (en) * | 2017-10-30 | 2019-05-01 | Wistron Corporation | Connecting module |
CN109726159B (en) * | 2017-10-30 | 2020-12-04 | Wistron Corporation | Connection module |
CN109064382A (en) * | 2018-06-21 | 2018-12-21 | Beijing Moshanghua Technology Co., Ltd. | Image information processing method and server |
CN109064382B (en) * | 2018-06-21 | 2023-06-23 | Beijing Moshanghua Technology Co., Ltd. | Image information processing method and server |
CN109460826A (en) * | 2018-10-31 | 2019-03-12 | Beijing ByteDance Network Technology Co., Ltd. | Method, apparatus and model updating system for distributing data |
CN109726170A (en) * | 2018-12-26 | 2019-05-07 | Shanghai Xinchu Integrated Circuit Co., Ltd. | Artificial intelligence system-on-chip |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Deep learning method and device |
CN109947682A (en) * | 2019-03-21 | 2019-06-28 | Inspur Business Machines Co., Ltd. | Server mainboard and server |
CN109947682B (en) * | 2019-03-21 | 2021-03-09 | Inspur Business Machines Co., Ltd. | Server mainboard and server |
TWI741416B (en) * | 2019-04-29 | 2021-10-01 | Google LLC | Virtualizing external memory as local to a machine learning accelerator |
TWI777775B (en) * | 2019-04-29 | 2022-09-11 | Google LLC | Virtualizing external memory as local to a machine learning accelerator |
US11176493B2 (en) | 2019-04-29 | 2021-11-16 | Google Llc | Virtualizing external memory as local to a machine learning accelerator |
CN112148470B (en) * | 2019-06-28 | 2022-11-04 | Fulian Precision Electronics (Tianjin) Co., Ltd. | Parameter synchronization method, computer device and readable storage medium |
CN112148470A (en) * | 2019-06-28 | 2020-12-29 | Hongfujin Precision Electronics (Tianjin) Co., Ltd. | Parameter synchronization method, computer device and readable storage medium |
CN110399234A (en) * | 2019-07-10 | 2019-11-01 | Inspur Suzhou Intelligent Technology Co., Ltd. | Task accelerated processing method, apparatus, device and readable storage medium |
CN110533181A (en) * | 2019-07-25 | 2019-12-03 | Shenzhen Comtop Information Technology Co., Ltd. | Rapid training method and system for a deep learning model |
CN110533181B (en) * | 2019-07-25 | 2023-07-18 | China Southern Power Grid Digital Platform Technology (Guangdong) Co., Ltd. | Rapid training method and system for a deep learning model |
CN112541513A (en) * | 2019-09-20 | 2021-03-23 | Baidu Online Network Technology (Beijing) Co., Ltd. | Model training method, device, equipment and storage medium |
CN110598855A (en) * | 2019-09-23 | 2019-12-20 | Guangdong OPPO Mobile Telecommunications Corp., Ltd. | Deep learning model generation method, device, equipment and storage medium |
CN111147603A (en) * | 2019-09-30 | 2020-05-12 | Huawei Technologies Co., Ltd. | Method and device for inference service networking |
CN112925533A (en) * | 2019-12-05 | 2021-06-08 | Nuvoton Technology Corporation | Microcontroller update system and method |
CN113298222A (en) * | 2020-02-21 | 2021-08-24 | Shenzhen Zhixing Technology Co., Ltd. | Neural-network-based parameter updating method and distributed training platform system |
CN111860260A (en) * | 2020-07-10 | 2020-10-30 | Fengyi Technology (Shanghai) Co., Ltd. | High-precision low-computation object detection network system based on FPGA |
CN111860260B (en) * | 2020-07-10 | 2024-01-26 | Fengyi Technology (Shanghai) Co., Ltd. | High-precision low-computation object detection network system based on FPGA |
CN112465112A (en) * | 2020-11-19 | 2021-03-09 | Inspur Suzhou Intelligent Technology Co., Ltd. | nGraph-based GPU backend distributed training method and system |
CN112465112B (en) * | 2020-11-19 | 2022-06-07 | Inspur Suzhou Intelligent Technology Co., Ltd. | nGraph-based GPU backend distributed training method and system |
US12001960B2 (en) | 2020-11-19 | 2024-06-04 | Inspur Suzhou Intelligent Technology Co., Ltd. | NGraph-based GPU backend distributed training method and system |
CN112581353A (en) * | 2020-12-29 | 2021-03-30 | Inspur Cloud Information Technology Co., Ltd. | End-to-end image inference system for deep learning models |
CN112949427A (en) * | 2021-02-09 | 2021-06-11 | Beijing QIYI Century Science & Technology Co., Ltd. | Person identification method, electronic device, storage medium, and apparatus |
CN113537284B (en) * | 2021-06-04 | 2023-01-24 | Information Engineering University, PLA Strategic Support Force | Deep learning implementation method and system based on a mimicry mechanism |
CN113537284A (en) * | 2021-06-04 | 2021-10-22 | Information Engineering University, PLA Strategic Support Force | Deep learning implementation method and system based on a mimicry mechanism |
Also Published As
Publication number | Publication date |
---|---|
CN106951926B (en) | 2020-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106951926A (en) | Deep learning system method and device of a hybrid architecture | |
Guo et al. | Cloud resource scheduling with deep reinforcement learning and imitation learning | |
CN107103113B (en) | Automated design method, apparatus and optimization method for neural network processors | |
CN108460457A (en) | Multi-machine multi-card hybrid-parallel asynchronous training method for convolutional neural networks | |
CN107704922A (en) | Artificial neural network processing unit | |
CN109376843A (en) | FPGA-based rapid EEG signal classification method, implementation method and device | |
WO2022068663A1 (en) | Memory allocation method, related device, and computer readable storage medium | |
CN109472356A (en) | Accelerator and method for a reconfigurable neural network algorithm | |
CN108268425A (en) | Programmable matrix processing engine | |
CN108416436A (en) | Method and system for partitioning a neural network using multi-core processing modules | |
CN108829515A (en) | Cloud platform computing system and application method thereof | |
CN108416433A (en) | Heterogeneous neural network acceleration method and system based on asynchronous events | |
CN105718996B (en) | Cellular array computing system and communication method therein | |
CN110163353A (en) | Computing device and method | |
CN113642734A (en) | Distributed training method and device for deep learning model and computing equipment | |
CN113449839A (en) | Distributed training method, gradient communication device and computing equipment | |
CN115828831B (en) | Multi-core-chip operator placement strategy generation method based on deep reinforcement learning | |
CN209231976U (en) | Accelerator for a reconfigurable neural network algorithm | |
CN112686379B (en) | Integrated circuit device, electronic apparatus, board and computing method | |
CN110163350A (en) | Computing device and method | |
Banerjee et al. | Re-designing CNTK deep learning framework on modern GPU enabled clusters | |
CN106776466A (en) | FPGA heterogeneous accelerated computing apparatus and system | |
CN115345285A (en) | GPU-based temporal graph neural network training method and system, and electronic device | |
CN112835844B (en) | Communication sparsification method for spiking neural network computing loads | |
CN109359542A (en) | Neural-network-based vehicle damage level determination method and terminal device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||