CN106951926B - Deep learning method and device of hybrid architecture
- Publication number: CN106951926B
- Application number: CN201710196532.0A
- Authority: CN (China)
- Prior art keywords: deep learning, module, training, server, capi
- Legal status: Active
Classifications
- G06F18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/08 — Neural networks: learning methods
- G06N5/04 — Knowledge-based models: inference or reasoning models
- G06V10/955 — Image or video recognition: hardware or software architectures using specific electronic processors
Abstract
The invention discloses a deep learning method and device with a hybrid architecture, comprising the following steps: when the training data set is updated, the training module retrains the deep learning network model and saves the weight and bias parameters; the server-side monitoring process detects the change to the parameter file, packages the parameter information into a preset data structure, and notifies the inference module; the inference module suspends the inference service, reads the weight and bias file contents from the server, and updates the network model; the server-side monitoring process concurrently processes the input files awaiting inference and notifies the inference module. The device comprises a server, a training module, an inference module, and a bus interface. This CPU + GPU + CAPI heterogeneous deep learning system, which combines training and inference, makes full use of resources, achieves a higher energy-efficiency ratio, enables CAPI to access server memory directly, and supports real-time online iterative updating of parameters such as the inference model weights.
Description
Technical field:
The present invention relates to the technical fields of circuit design and machine learning, and in particular to a deep learning method and apparatus with a hybrid architecture.
Background art:
The rapid development of the information technology industry in the 21st century has brought great benefit and convenience to people. Deep learning applications divide into two parts, training and inference. Taking ImageNet evaluation as an example, training the AlexNet model requires 800,000 images across 1,000 categories: features are extracted through the AlexNet model and a loss is computed, then the weight parameters are updated through back-propagation methods such as SGD so that the model continuously converges, finally yielding a usable network model. The inference process performs a forward pass on the input through the network model to obtain the final classification accuracy (Top-5 is generally used). The training process of a deep learning application requires a large amount of computing resources and training data, and current training platforms generally use high-performance NVIDIA GPUs such as the Tesla P100, Titan X, and GTX 1080 to accelerate training. After a usable model is obtained, it is deployed to another platform to perform inference and provide external services. Since inference requires only a single forward pass, its computational demand is lower and the requirements center instead on latency; current inference platforms include CPU-based cloud service platforms, low-power GPU server clusters, and FPGA or dedicated-ASIC clusters. FPGAs and dedicated ASICs are more advantageous in terms of low latency and high performance, and compared with ASICs, FPGAs offer greater architectural flexibility and are receiving more and more attention. CAPI, the Coherent Accelerator Processor Interface, is a high-speed bus interface protocol developed by IBM for POWER processors; its physical interface is PCI-E or IBM's BlueLink. The PSL layer implemented inside CAPI guarantees memory-access coherency between the accelerator and the server, meaning the CPU memory can be accessed directly through virtual addresses, greatly reducing access latency. Moreover, the SNAP Framework programming environment provided by IBM allows algorithm models to be written conveniently in C/C++.
Consequently, various deep learning methods and devices have been developed. For example, Chinese patent publication CN106022472A discloses an embedded deep learning processor; that invention belongs to the technical field of integrated circuits and in particular relates to an embedded deep learning processor based on an FPGA (field-programmable gate array). The processor includes a central processing unit (CPU) that performs the necessary logic, control, and storage operations during learning and operation, and a deep learning unit, a hardware implementation of a deep learning algorithm and the core component of deep learning processing. It combines a traditional CPU with a deep learning combination unit in which any number of deep learning units can be combined; it is expandable and can serve as a core processor for artificial intelligence applications at different computing scales. Chinese patent publication CN106156851A, shown in fig. 5, discloses an acceleration apparatus and method for deep learning services, used to perform deep learning computation on data to be processed in a server. It comprises a network card disposed at the server end, a computation control module connected to the server through a bus, a first memory, and a second memory. The computation control module is a programmable logic device comprising a control unit, a data storage unit, a logic storage unit, a bus interface, and first and second communication interfaces, which communicate with the network card, the first memory, and the second memory respectively; the logic storage unit stores the deep learning control logic, and the first memory stores the weight and bias data of each network layer. The apparatus and method effectively improve computational efficiency and the performance-to-power ratio.
The prior art has the following defects: 1) training and inference are generally separated, so two sets of platform environments must be maintained and resources cannot be fully utilized; 2) deep learning computation implemented entirely on FPGA/CPLD devices lacks sufficient computing power and is currently unsuitable for large-scale training scenarios; 3) communication between the FPGA/CPLD and the server is generally implemented via DMA, so the latency of data interaction with the CPU server is large. A new deep learning system method and apparatus is therefore needed.
Summary of the invention:
To overcome the defects of the prior art, the present invention provides a deep learning method and device with a hybrid architecture that exploit the strengths of each module, achieve a higher energy-efficiency ratio, and make full use of resources; CAPI provides direct access to server memory, reducing latency and programming complexity. The technical solution of the invention is as follows:
A deep learning method for a hybrid architecture, used to implement deep learning training and inference, comprises the following steps:
S1, when the training data set is updated, the training module retrains the deep learning network model and, upon completion, saves the weight and bias parameters of the network model to a preset file;
S2, the server-side monitoring process detects the change to the parameter file, packages the virtual address and length of the weight and bias parameter storage space into a preset data structure, and notifies the inference module;
S3, the inference module suspends the inference service, reads the weight and bias file contents from the server through the bus interface, and updates the network model;
S4, the server-side monitoring process concurrently processes the input files awaiting inference and notifies the inference module, which returns the results to the server-side monitoring process upon completion.
Step S1 specifically comprises the following sub-steps:
S11, when the training data set is updated but the network model is unchanged, retraining is required to obtain updated network weight and bias parameters;
S12, after training, the weight and bias parameters of each network layer are saved to a preset file in a format agreed upon with the inference module.
Step S2 specifically comprises the following sub-steps (a sketch of this monitoring flow follows the sub-steps below):
S21, the server runs a monitoring process, which controls the running, stopping, and parameter updating of the inference module by calling the inference module's function interfaces and drivers in the server kernel library;
S22, the server constantly monitors whether the weight and bias parameters need updating and obtains the latest parameter information;
S23, when an update occurs, a stop command and the updated parameter file information are sent to the inference module.
Step S3 specifically comprises the following sub-steps:
S31, the inference module reads the corresponding weight and bias information from the server directly into its internal RAM via the virtual address;
S32, after reading completes, the inference module notifies the monitoring process, which then sends a run command to the inference module;
S33, the inference module updates the network model parameters and resumes the inference service.
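As an illustration of the S2 flow, the following minimal Python sketch polls the parameter file and packages the virtual address and length of the weight and bias buffers into a data structure. It assumes the training module writes a NumPy `.npz` file, and it uses a stand-in `notify_inference_module` callback in place of the CAPI kernel-library call; all field and file names here are hypothetical, not taken from the patent.

```python
import ctypes
import os
import time

import numpy as np

# Hypothetical layout of the "preset data structure" of step S2: virtual
# addresses and byte lengths of the weight and bias storage spaces.
class ParamDescriptor(ctypes.Structure):
    _fields_ = [
        ("weight_addr", ctypes.c_uint64),  # virtual address of weight buffer
        ("weight_len", ctypes.c_uint64),   # length of weight buffer in bytes
        ("bias_addr", ctypes.c_uint64),    # virtual address of bias buffer
        ("bias_len", ctypes.c_uint64),     # length of bias buffer in bytes
    ]

PARAM_FILE = "params.npz"  # preset file written by the training module (S12)

def load_params(path):
    """Load the trained parameters into host memory and describe them."""
    data = np.load(path)
    weights, biases = data["weights"], data["biases"]
    desc = ParamDescriptor(
        weight_addr=weights.ctypes.data, weight_len=weights.nbytes,
        bias_addr=biases.ctypes.data, bias_len=biases.nbytes,
    )
    # The arrays must be kept alive so their virtual addresses stay valid.
    return desc, (weights, biases)

def monitor_loop(notify_inference_module):
    """S21-S23: watch the parameter file and notify the inference module."""
    last_mtime = 0.0
    keepalive = None
    while True:
        mtime = os.stat(PARAM_FILE).st_mtime
        if mtime != last_mtime:  # S2: the parameter file has changed
            last_mtime = mtime
            desc, keepalive = load_params(PARAM_FILE)
            notify_inference_module("stop")  # S23: send stop command
            notify_inference_module(desc)    # then the updated parameter info
        time.sleep(1.0)
```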
The deep learning network model of the hybrid architecture is a deep learning network model for image classification.
A hybrid-architecture deep learning device, used to implement parallel operation of deep learning training and inference, comprises a server, a training module, an inference module, and a bus interface; the server comprises a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server through bus interfaces and communicate over them.
The server provides the control, data processing, network interaction, and parameter storage functions for deep learning.
The CPU processor is a POWER processor; the training module is a GPU-accelerated training module used to accelerate the deep learning model training process; the inference module is a CAPI inference module that can be preloaded with a preset deep learning network model and is used for the deep learning inference process.
Compared with the prior art, the invention has the following beneficial effects. The disclosed deep learning method of a hybrid architecture comprises: when the training data set is updated, the training module retrains the deep learning network model and, upon completion, saves the weight and bias parameters of the network model to a preset file; the server-side monitoring process detects the change to the parameter file, packages the virtual address and length of the weight and bias parameter storage space into a preset data structure, and notifies the inference module; the inference module suspends the inference service, reads the weight and bias file contents from the server through the bus interface, and updates the network model; the server-side monitoring process concurrently processes the input files awaiting inference and notifies the inference module, which returns the results to the monitoring process upon completion. The hybrid-architecture deep learning device comprises a server, a training module, an inference module, and a bus interface; the server comprises a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server through bus interfaces and communicate over them. The invention adopts a single CPU + GPU + CAPI heterogeneous deep learning system combining training and inference, which exploits the strengths of each module, achieves a higher energy-efficiency ratio, and makes full use of resources; CAPI provides direct access to server memory, reducing latency and programming complexity; parameters such as the inference model weights can be updated online and iteratively in real time.
Drawings
Fig. 1 is a flowchart illustrating a deep learning method of a hybrid architecture according to the present invention.
FIG. 2 is an architecture diagram of a hybrid architecture deep learning device according to the present invention.
Fig. 3 is an architecture diagram of a hybrid architecture deep learning device according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the working principle of the present invention, using the AlexNet deep learning network model as an example.
Fig. 5 is a block diagram illustrating a structure of an acceleration apparatus for deep learning service according to an embodiment of the prior art.
Detailed Description
The present invention is described below in further detail with reference to figs. 1 to 5 so that its implementation can be better understood. Specific embodiments of the invention are as follows:
As shown in fig. 1, the deep learning method of the hybrid architecture according to the present invention, used to implement deep learning training and inference, comprises the following steps:
S1, when the training data set is updated, the training module retrains the deep learning network model and, upon completion, saves the weight and bias parameters of the network model to a preset file;
S2, the server-side monitoring process detects the change to the parameter file, packages the virtual address and length of the weight and bias parameter storage space into a preset data structure, and notifies the inference module;
S3, the inference module suspends the inference service, reads the weight and bias file contents from the server through the bus interface, and updates the network model;
S4, the server-side monitoring process concurrently processes the input files awaiting inference and notifies the inference module, which returns the results to the server-side monitoring process upon completion.
Step S1 specifically comprises the following sub-steps:
S11, when the training data set is updated but the network model is unchanged, retraining is required to obtain updated network weight and bias parameters;
S12, after training, the weight and bias parameters of each network layer are saved to a preset file in a format agreed upon with the inference module.
Step S2 specifically comprises the following sub-steps:
S21, the server runs a monitoring process, which controls the running, stopping, and parameter updating of the inference module by calling the inference module's function interfaces and drivers in the server kernel library;
S22, the server constantly monitors whether the weight and bias parameters need updating and obtains the latest parameter information;
S23, when an update occurs, a stop command and the updated parameter file information are sent to the inference module.
Step S3 specifically comprises the following sub-steps (a sketch of the inference-side update flow follows the sub-steps below):
S31, the inference module reads the corresponding weight and bias information from the server directly into its internal RAM via the virtual address;
S32, after reading completes, the inference module notifies the monitoring process, which then sends a run command to the inference module;
S33, the inference module updates the network model parameters and resumes the inference service.
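The S3 side can be sketched in the same spirit. On the real device the CAPI card reads the buffers coherently through the PSL; here a host-side `ctypes` read through the virtual address stands in for that access, and `ParamDescriptor` is the hypothetical structure from the monitoring sketch above — an illustration of the control flow, not the FPGA implementation.

```python
import ctypes

import numpy as np

def read_params(desc):
    """S31 in miniature: fetch weights and biases directly via virtual address.

    desc is the hypothetical ParamDescriptor packaged by the monitoring
    process; the data is copied out, mimicking the transfer to internal RAM.
    """
    n_w = desc.weight_len // ctypes.sizeof(ctypes.c_float)
    n_b = desc.bias_len // ctypes.sizeof(ctypes.c_float)
    weights = np.ctypeslib.as_array(
        (ctypes.c_float * n_w).from_address(desc.weight_addr)).copy()
    biases = np.ctypeslib.as_array(
        (ctypes.c_float * n_b).from_address(desc.bias_addr)).copy()
    return weights, biases

class InferenceModule:
    """Hypothetical host-side model of the S3 stop/read/notify/resume cycle."""

    def __init__(self):
        self.running = True
        self.weights = None
        self.biases = None

    def on_update(self, desc, notify_monitor):
        self.running = False  # interrupt the inference service
        self.weights, self.biases = read_params(desc)  # S31
        notify_monitor("read-complete")  # S32: notify the monitoring process
        # ... the monitoring process sends the run command, after which:
        self.running = True  # S33: resume inference with updated parameters
```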
As shown in fig. 2, the hybrid-architecture deep learning apparatus is configured to implement parallel operation of deep learning training and inference. The apparatus comprises a server, a training module, an inference module, and a bus interface; the server comprises a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server through bus interfaces and communicate over them. The server provides the deep learning control, data processing, network interaction, and parameter storage functions. The CPU processor is a POWER processor; the training module is a GPU-accelerated training module used to accelerate the deep learning model training process; the inference module is a CAPI inference module that can be preloaded with a preset deep learning network model and is used for the deep learning inference process. The bus interface between the server and the training module is a PCI-E or NVLink bus; the hardware interface between the server and the inference module is PCI-E or BlueLink, with CAPI as the bus protocol.
Preferably, as shown in fig. 4, the deep learning system network model of the hybrid architecture adopts the AlexNet deep learning network model for picture classification. To facilitate understanding of the scheme, the working principle of the invention is briefly described using the AlexNet model as an example (a model sketch follows this paragraph): the AlexNet network consists of 5 convolutional layers and 3 fully connected layers; ReLU, pooling, and normalization operations are applied after some of the convolutional layers, and the final fully connected layer feeds a 1000-class Softmax output. The AlexNet model is applicable to a wide range of picture classification tasks and can be trained for different situations with different training data sets to provide a picture classification service.
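For reference, a minimal `tf.keras` sketch of this 5-convolution + 3-fully-connected layout is shown below. It is not the patent's SNAP/C implementation; `BatchNormalization` stands in for the original local response normalization, and the hyperparameters follow the commonly published AlexNet configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def alexnet(num_classes=1000):
    """AlexNet-style model: 5 convolutional + 3 fully connected layers."""
    return tf.keras.Sequential([
        layers.Input(shape=(227, 227, 3)),
        layers.Conv2D(96, 11, strides=4, activation="relu"),        # conv1
        layers.BatchNormalization(),                                # norm
        layers.MaxPooling2D(3, strides=2),                          # pool
        layers.Conv2D(256, 5, padding="same", activation="relu"),   # conv2
        layers.BatchNormalization(),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),   # conv3
        layers.Conv2D(384, 3, padding="same", activation="relu"),   # conv4
        layers.Conv2D(256, 3, padding="same", activation="relu"),   # conv5
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),                      # fc6
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),                      # fc7
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),            # fc8
    ])
```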
Example 1
As shown in fig. 3, a preferred embodiment implements an AlexNet picture classification task:
The hybrid-architecture deep learning device, used to implement parallel operation of deep learning training and inference, comprises a server including a POWER8 processor, DDR memory, and a network; a GPU-accelerated training module, a GTX 1080, connected to the server through a bus; and a CAPI inference module, an ADM-PCIE-KU3 accelerator card, connected to the server through a bus. The GPU training module accelerates the training process of the deep learning model; the inference module is preloaded with the AlexNet network model and used for the deep learning inference process; the server handles deep learning control, data processing, network interaction, and parameter storage. The bus interface between the server and the training module is a PCI-E or NVLink bus; the hardware interface between the server and the inference module is PCI-E or BlueLink, with CAPI as the bus protocol.
The deep learning method of the hybrid architecture on this device comprises the following implementation steps (a pb-file parsing sketch follows these steps):
SS1, implement the 8-layer AlexNet network model using the SNAP Framework tool (an environment for running C/C++ algorithm models on the CAPI card) and write it to the CAPI inference module;
SS2, based on the TensorFlow deep learning framework, obtain a TFRecords picture set, for example 300 million pictures of 300 labeled bird species, and provide it as the training data set to two GTX 1080 GPUs for distributed training;
SS3, the monitoring process obtains the latest training-result pb file, parses the weight and bias parameters from the pb file into a file A, and obtains the virtual address and length of the parameter storage;
SS4, the monitoring program calls the CAPI kernel library function interface and driver to send the data structure packaging the parameter information to the ADM-PCIE-KU3 CAPI module;
SS5, the CAPI card parses the parameter addresses from the structure, obtains the parameter information, and correspondingly updates the stored network model weight and bias parameter variables;
SS6, the CAPI card receives picture inference requests sent by the monitoring program and returns the Top-5 results output by the network, providing an external picture recognition service for these categories;
SS7, while the CAPI card provides this service, new categories can be trained continuously and the trained parameters synchronously updated into the CAPI card, realizing synchronized updating and iteration of training and inference.
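The parsing step of SS3 can be illustrated with the following sketch, which assumes the training result is a frozen TensorFlow GraphDef whose parameters are stored as Const nodes; the output file name (`file_A.npz`) and the assumption that every constant is a parameter are illustrative.

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import tensor_util

def extract_params(pb_path, out_path="file_A.npz"):
    """SS3 sketch: parse weight and bias tensors out of a frozen pb file."""
    graph_def = tf.compat.v1.GraphDef()
    with open(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    # Collect every constant tensor; in a frozen graph the trained weights
    # and biases appear as Const nodes.
    params = {
        node.name: tensor_util.MakeNdarray(node.attr["value"].tensor)
        for node in graph_def.node
        if node.op == "Const"
    }

    # Persist the parameters to "file A"; the monitoring process would then
    # record the virtual address and length of each in-memory buffer.
    np.savez(out_path, **{k.replace("/", "_"): v for k, v in params.items()})
    return params
```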
The above description covers only preferred embodiments of the present invention, and the scope of the invention is not limited to these specific embodiments; any modification, equivalent replacement, or improvement within the spirit, principles, and disclosed technical scope of the present invention shall fall within the scope of the invention.
Claims (9)
1. A deep learning method of a hybrid architecture, implementing deep learning training and inference based on a deep learning system, characterized in that the deep learning system is a CPU + GPU + CAPI heterogeneous deep learning system combining training and inference, and the method comprises the following steps:
S1, when the training data set is updated, the training module retrains the deep learning network model and, upon completion, saves the weight and bias parameters of the network model to a preset file;
S2, the server-side monitoring process detects the change to the parameter file, packages the virtual address and length of the weight and bias parameter storage space into a preset data structure, and notifies the inference module;
S3, the inference module suspends the inference service, reads the weight and bias file contents from the server through the bus interface, and updates the network model;
S4, the server-side monitoring process concurrently processes the input files awaiting inference and notifies the inference module, which returns the results to the server-side monitoring process upon completion;
the CPU + GPU + CAPI heterogeneous deep learning system comprises:
a server including a POWER8 processor, DDR memory, and a network; a GPU-accelerated training module, a GTX 1080, connected to the server through a bus; and a CAPI inference module, an ADM-PCIE-KU3 accelerator card, connected to the server through a bus; the GPU-accelerated training module GTX 1080 is used to accelerate the deep learning model training process; the inference module is preloaded with the AlexNet network model and used for the deep learning inference process; the server is used for deep learning control, data processing, network interaction, and parameter storage; the bus interface between the server and the training module is a PCI-E or NVLink bus; the hardware interface between the server and the inference module is PCI-E or BlueLink, with CAPI as the bus protocol; the deep learning method of the hybrid architecture comprises the following implementation steps:
SS1, implementing the 8-layer AlexNet network model using the SNAP Framework tool and writing it to the CAPI inference module;
SS2, acquiring picture data based on the TensorFlow deep learning framework and providing it as the training data set to two GTX 1080 GPUs for distributed training;
SS3, the monitoring process obtaining the latest training-result pb file, parsing the weight and bias parameters from the pb file into a file A, and obtaining the virtual address and length of the parameter storage;
SS4, the monitoring program calling the CAPI kernel library function interface and driver to send the data structure packaging the parameter information to the ADM-PCIE-KU3 CAPI module;
SS5, the CAPI card parsing the parameter addresses from the structure, thereby obtaining the parameter information and correspondingly updating the stored network model weight and bias parameter variables;
SS6, the CAPI card receiving picture inference requests sent by the monitoring program and returning the Top-5 results output by the network, so that a picture recognition service for the corresponding categories can be provided externally;
SS7, while the CAPI card provides this service, continuing to train new categories and synchronously updating the trained parameters into the CAPI card.
2. The method of claim 1, wherein step S1 specifically comprises the following sub-steps:
S11, when the training data set is updated but the network model is unchanged, retraining is required to obtain updated network weight and bias parameters;
S12, after training, the weight and bias parameters of each network layer are saved to a preset file in a format agreed upon with the inference module.
3. The method of claim 1, wherein step S2 specifically comprises the following sub-steps:
S21, the server runs a monitoring process, which controls the running, stopping, and parameter updating of the inference module by calling the inference module's function interfaces and drivers in the server kernel library;
S22, the server constantly monitors whether the weight and bias parameters need updating and obtains the latest parameter information;
S23, when an update occurs, a stop command and the updated parameter file information are sent to the inference module.
4. The method of claim 1, wherein step S3 specifically comprises the following sub-steps:
S31, the inference module reads the corresponding weight and bias information from the server directly into its internal RAM via the virtual address;
S32, after reading completes, the inference module notifies the monitoring process, which then sends a run command to the inference module;
S33, the inference module updates the network model parameters and resumes the inference service.
5. The method of claim 1, wherein the network model is a deep learning model for image classification.
6. A hybrid-architecture deep learning apparatus using the method of any one of claims 1-5 to implement parallel operation of deep learning training and inference, characterized in that: the apparatus comprises a server, a training module, an inference module, and a bus interface; the server comprises a CPU processor, DDR memory, and a network; the training module and the inference module are connected to the server through bus interfaces and communicate over them; the CAPI directly accesses the memory of the server.
7. The apparatus of claim 6, wherein the server provides the control, data processing, network interaction, and parameter storage functions for deep learning.
8. The apparatus of claim 6, wherein the CPU processor is a POWER processor and the training module is a GPU-accelerated training module used to accelerate the deep learning model training process.
9. The apparatus of claim 6, wherein the inference module is a CAPI inference module that can be preloaded with a deep learning network model and is used for the deep learning inference process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710196532.0A CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710196532.0A CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106951926A CN106951926A (en) | 2017-07-14 |
CN106951926B true CN106951926B (en) | 2020-11-24 |
Family
ID=59474087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710196532.0A Active CN106951926B (en) | 2017-03-29 | 2017-03-29 | Deep learning method and device of hybrid architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106951926B (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107563512B (en) * | 2017-08-24 | 2023-10-17 | 腾讯科技(上海)有限公司 | Data processing method, device and storage medium |
WO2019042571A1 (en) * | 2017-09-04 | 2019-03-07 | Huawei Technologies Co., Ltd. | Asynchronous gradient averaging distributed stochastic gradient descent |
CN107729268B (en) * | 2017-09-20 | 2019-11-12 | 山东英特力数据技术有限公司 | A kind of memory expansion apparatus and method based on CAPI interface |
TWI658365B (en) * | 2017-10-30 | 2019-05-01 | 緯創資通股份有限公司 | Connecting module |
CN109064382B (en) * | 2018-06-21 | 2023-06-23 | 北京陌上花科技有限公司 | Image information processing method and server |
CN109460826A (en) * | 2018-10-31 | 2019-03-12 | 北京字节跳动网络技术有限公司 | For distributing the method, apparatus and model modification system of data |
CN109726170A (en) * | 2018-12-26 | 2019-05-07 | 上海新储集成电路有限公司 | A kind of on-chip system chip of artificial intelligence |
CN109886408A (en) * | 2019-02-28 | 2019-06-14 | 北京百度网讯科技有限公司 | A kind of deep learning method and device |
CN109947682B (en) * | 2019-03-21 | 2021-03-09 | 浪潮商用机器有限公司 | Server mainboard and server |
US11176493B2 (en) | 2019-04-29 | 2021-11-16 | Google Llc | Virtualizing external memory as local to a machine learning accelerator |
CN112148470B (en) * | 2019-06-28 | 2022-11-04 | 富联精密电子(天津)有限公司 | Parameter synchronization method, computer device and readable storage medium |
CN110399234A (en) * | 2019-07-10 | 2019-11-01 | 苏州浪潮智能科技有限公司 | A kind of task accelerated processing method, device, equipment and readable storage medium storing program for executing |
CN110533181B (en) * | 2019-07-25 | 2023-07-18 | 南方电网数字平台科技(广东)有限公司 | Rapid training method and system for deep learning model |
CN112541513B (en) * | 2019-09-20 | 2023-06-27 | 百度在线网络技术(北京)有限公司 | Model training method, device, equipment and storage medium |
CN110598855B (en) * | 2019-09-23 | 2023-06-09 | Oppo广东移动通信有限公司 | Deep learning model generation method, device, equipment and storage medium |
CN111147603A (en) * | 2019-09-30 | 2020-05-12 | 华为技术有限公司 | Method and device for networking reasoning service |
TWI780382B (en) * | 2019-12-05 | 2022-10-11 | 新唐科技股份有限公司 | Microcontroller updating system and method |
CN113298222A (en) * | 2020-02-21 | 2021-08-24 | 深圳致星科技有限公司 | Parameter updating method based on neural network and distributed training platform system |
CN111860260B (en) * | 2020-07-10 | 2024-01-26 | 逢亿科技(上海)有限公司 | High-precision low-calculation target detection network system based on FPGA |
CN112465112B (en) * | 2020-11-19 | 2022-06-07 | 苏州浪潮智能科技有限公司 | nGraph-based GPU (graphics processing Unit) rear-end distributed training method and system |
CN112581353A (en) * | 2020-12-29 | 2021-03-30 | 浪潮云信息技术股份公司 | End-to-end picture reasoning system facing deep learning model |
CN112949427A (en) * | 2021-02-09 | 2021-06-11 | 北京奇艺世纪科技有限公司 | Person identification method, electronic device, storage medium, and apparatus |
CN113537284B (en) * | 2021-06-04 | 2023-01-24 | 中国人民解放军战略支援部队信息工程大学 | Deep learning implementation method and system based on mimicry mechanism |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160098633A1 (en) * | 2014-10-02 | 2016-04-07 | Nec Laboratories America, Inc. | Deep learning model for structured outputs with high-order interaction |
CN104463324A (en) * | 2014-11-21 | 2015-03-25 | 长沙马沙电子科技有限公司 | Convolution neural network parallel processing method based on large-scale high-performance cluster |
US20160267380A1 (en) * | 2015-03-13 | 2016-09-15 | Nuance Communications, Inc. | Method and System for Training a Neural Network |
CN104714852B (en) * | 2015-03-17 | 2018-05-22 | 华中科技大学 | A kind of parameter synchronization optimization method and its system suitable for distributed machines study |
CN105825235B (en) * | 2016-03-16 | 2018-12-25 | 新智认知数据服务有限公司 | A kind of image-recognizing method based on multi-characteristic deep learning |
- 2017-03-29: Application CN201710196532.0A filed; patent CN106951926B granted (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN106951926A (en) | 2017-07-14 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |
2024-09-19 | TR01 | Transfer of patent right | Patentee changed from SHANDONG ITL DATA TECHNIQUE CO.,LTD. (Yingteli Industrial Park, 431 Chongwen Avenue, High-tech Zone, Jining City, Shandong Province, 272000, China) to SHANDONG INTELLIGENT OPTICAL COMMUNICATION DEVELOPMENT Co.,Ltd. (No. 431, Chongwen Avenue, High-tech Zone, Jining City, Shandong Province, 272000, China)