CN104463324A - Convolution neural network parallel processing method based on large-scale high-performance cluster - Google Patents


Info

Publication number
CN104463324A
Authority
CN
China
Prior art keywords
node
training
model
parameter
model parameter
Legal status
Pending
Application number
CN201410674860.3A
Other languages
Chinese (zh)
Inventor
王馨
Current Assignee
CHANGSHA MASHA ELECTRONIC TECHNOLOGY Co Ltd
Original Assignee
CHANGSHA MASHA ELECTRONIC TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by CHANGSHA MASHA ELECTRONIC TECHNOLOGY Co Ltd filed Critical CHANGSHA MASHA ELECTRONIC TECHNOLOGY Co Ltd
Priority to CN201410674860.3A
Publication of CN104463324A


Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a convolutional neural network parallel processing method based on a large-scale high-performance cluster. The method comprises the steps that: (1) multiple copies of the network model to be trained are constructed, the model parameters of all copies are identical, the number of copies equals the number of nodes in the high-performance cluster, each node holds one model copy, one node is selected as the master node, and the master node is responsible for broadcasting and collecting the model parameters; (2) the training set is divided into a plurality of subsets; in each round the training subsets are distributed to the child nodes other than the master node, which jointly compute the parameter gradients; the gradient values are accumulated, the accumulated value is used to update the model parameters on the master node, and the updated model parameters are broadcast to all child nodes, until model training ends. The method has the advantages of enabling parallelization, improving the efficiency of model training, and shortening the training time.

Description

A convolutional neural network parallel processing method based on a large-scale high-performance computing cluster
Technical field
The present invention relates generally to the field of high-performance computing cluster (HPCC) design, and in particular to a convolutional neural network parallel processing method based on a large-scale high-performance computing cluster.
Background art
A high-performance computer is a computer cluster: multiple computer systems are linked together by high-speed interconnect technology, and the combined computing power of all connected systems is used to handle large-scale computational problems, which is why such a system is also commonly called a "high-performance computing cluster" (HPCC). High-performance computing clusters are mainly used for complex computational problems and are applied in environments that require large-scale scientific computing, such as weather forecasting, petroleum exploration and reservoir simulation, molecular simulation, and gene sequencing. The application programs run on a high-performance computing cluster generally use parallel algorithms: a large problem is divided according to certain rules into many small subproblems that are computed on different nodes of the cluster, and the results of these subproblems are processed and merged into the final result of the original problem. Because these subproblems can generally be computed in parallel, the processing time of the overall problem can be shortened.
During computation, the nodes of a high-performance computing cluster work cooperatively: each node processes a part of the large problem, exchanges data with the other nodes as required, and contributes a part of the final result. The processing power of a high-performance computing cluster is proportional to the scale of the cluster and equals the sum of the processing power of the nodes in the cluster. With the development and porting of a large number of applications, the cluster architecture delivers outstanding performance at lower cost and has therefore become the mainstream of high-performance computing; following this trend, cluster architectures have been widely adopted in high-performance computer systems. As the computing power of both CPUs and GPUs keeps rising, how to integrate the computational resources of the two is bound to become a research hotspot.
A convolutional neural network is a special kind of deep neural network model. Convolutional networks were originally inspired by the mechanism of the visual nervous system; they are multilayer perceptrons designed for recognizing two-dimensional shapes, and this network structure is highly invariant to translation, scaling, tilting, and other forms of distortion. In 1962, Hubel and Wiesel proposed the concept of the receptive field based on their research on the visual cortex cells of cats. In 1984, the Japanese scholar Fukushima proposed the neocognitron model based on the receptive-field concept; it can be regarded as the first implementation of a convolutional neural network and was also the first application of the receptive-field concept in the field of artificial neural networks.
Generally, the basic structure of a convolutional neural network comprises two kinds of layers. The first is the feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local feature is extracted; once the local feature has been extracted, its positional relationship to the other features is also determined. The second is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in the plane share equal weights. The feature mapping structure uses the sigmoid function as the activation function of the convolutional network, so that the feature maps are shift-invariant. In addition, because the neurons on one mapping plane share weights, the number of free network parameters is reduced. Each convolutional layer in a convolutional neural network is followed by a computational layer for local averaging and secondary extraction, and this characteristic two-stage feature extraction structure reduces the feature resolution.
Convolutional neural networks are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Because the feature detection layers of a convolutional neural network learn from training data, explicit feature extraction is avoided when using a convolutional neural network; features are learned implicitly from the training data. Moreover, because the neurons on the same feature mapping plane share the same weights, the network can learn in parallel, which is a major advantage of convolutional networks over networks in which neurons are fully interconnected. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing; its layout is closer to a real biological neural network, weight sharing reduces the complexity of the network, and in particular the fact that images with multidimensional input vectors can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification.
Convolutional neural networks have become a research hotspot in speech analysis and image recognition. However, because the network has many layers and an enormous number of weight parameters, training a network model usually takes tens of days or even several months, and the long training time limits the wider adoption of convolutional neural networks. Thanks to the advantage of weight sharing, parallel learning of convolutional neural networks offers a way to address this problem; especially in the current era of ever-increasing GPU computing power, how to integrate parallel computing resources to accelerate the training of convolutional neural networks has also become a research focus.
Current international research on accelerating neural networks concentrates on two directions. First, parallel acceleration on a single server with multiple GPUs: a single server involves no data transfer between multiple nodes, so parallel acceleration is easy to realize, but the size of the network model is limited by the configuration of that single server. Second, using large-scale clusters to accelerate neural network training, for which the DistBelief model has been proposed; however, it has not been applied to convolutional neural networks and is used mainly for restricted Boltzmann machines and deep belief networks. Therefore, combining the computational advantages of a large-scale high-performance computing cluster to realize parallel learning of convolutional neural networks and improve the efficiency of model training is a technical challenge in this field, and is also an important aspect of lowering the learning threshold of convolutional neural networks and broadening their application.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the technical problems existing in the prior art, the present invention provides a convolutional neural network parallel processing method based on a large-scale high-performance computing cluster that can achieve parallelization, improve the efficiency of model training, and reduce the training time.
To solve the above technical problem, the present invention adopts the following technical solution:
A convolutional neural network parallel processing method based on a large-scale high-performance computing cluster, comprising the steps of:
(1) constructing multiple copies of the network model to be trained, wherein the model parameters of all copies are identical, the number of copies equals the number of nodes of the high-performance computing cluster, and each node is assigned one model copy; one node is selected as the master node, responsible for broadcasting and collecting the model parameters;
(2) dividing the training set into several subsets; in each round, distributing a training subset to each of the child nodes other than the master node, jointly computing the parameter gradients, and accumulating the gradient values; the accumulated value is used to update the model parameters on the master node, and the updated model parameters are broadcast to each child node, until model training stops.
As a further improvement of the present invention: in step (1), before each iteration the parameters of the network model are first randomly initialized; the initialized model parameters comprise the weight parameters W and the bias units b. Initialization is first carried out according to the input network configuration, and then the network weight parameters and bias units are initialized layer by layer.
As a further improvement of the present invention: the initialization adopts a rands-style random scheme, so that the parameters take random values between -1 and 1.
As a further improvement of the present invention: the method further comprises a step (3) of updating the model parameters, namely: after a certain number of iterations, each child node sends its accumulated parameter gradients back to the master node, which performs the reduction operation and the model parameter update in a unified manner; the updated model parameters are then distributed to each child node, and each child node computes gradients again, until model training stops.
As a further improvement of the present invention: in step (2), the host process opens a separate thread to prefetch the training set, and a single process performs the data reading and distributes the data set to the other nodes; that is, process 0 is set as the host process and is responsible for reading and sending the data, the remaining compute processes are responsible for receiving the data, and the sending and receiving are realized with MPI_Send and MPI_Recv.
As a further improvement of the present invention: in step (2), each node trains the model parameters in a parallel fashion; that is, each compute node trains the network model parameters on the training data set assigned to that node.
As a further improvement of the present invention: the basic layer structure used in the model training comprises convolutional layers, down-sampling layers and fully connected layers, and each level involves two kinds of computation, forward propagation and backward feedback; the convolutional layer is a feature extraction layer, in which the input of each neuron is connected to the local receptive field of the previous layer and the local feature is extracted; the down-sampling layer is a feature mapping layer, in which each feature map is a plane and all neurons in the plane share equal weights; the fully connected layer integrates the extracted features into a one-dimensional vector, which is finally connected to a classifier to complete the classification function of the whole network; through the forward propagation computation, the computed result is compared with the training label, the error is back-propagated, partial derivatives are computed according to the stochastic gradient descent algorithm (SGD) to obtain the gradient Δw of each model parameter in each level, and Δw is accumulated; the above forward-backward process is repeated and the model parameter gradient Δw is continuously accumulated, and when the number of iterations on each compute node reaches a certain threshold, synchronous communication is carried out to complete the update of the model parameters.
As a further improvement of the present invention: in step (3), when the overall iterative computation reaches a certain number of iterations, all compute nodes send the accumulated parameter gradients Δw back to the host process, and the host process performs a reduction operation on the Δw values returned by each process and updates the model parameters w:
$$\Delta w = \mu \Delta w + \varepsilon \left( \left\langle \frac{\partial E}{\partial w} \right\rangle_i - \omega w \right) \qquad (1)$$
$$w = w + \Delta w \qquad (2)$$
Compared with the prior art, the advantages of the present invention are:
(1) The convolutional neural network algorithm is extended to multiple servers and even to large-scale clusters; the structure of the convolutional neural network algorithm is merged with the DistBelief model, and a new algorithm structure suitable for large-scale clusters is proposed, broadening the scope of application of the algorithm.
(2) The weight-sharing advantage of the convolutional neural network algorithm is exploited more fully: the computation of the convolutional layers is parallelized over the data, so that more computational resources can be used to improve computing efficiency, greatly reducing the tediously long model training time of the original network.
(3) Through the improvement of the algorithm model, the convolutional neural network becomes a new application of the high-performance computing field, so the existing computational resources and computing techniques of that field can be exploited more fully and computing efficiency is greatly optimized. As computing efficiency improves, the scale of applications such as computer vision, speech processing and natural language processing can be further expanded, giving better play to the advantages of convolutional neural networks in these applications.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall system architecture of a heterogeneous high-performance computing cluster.
Fig. 2 is a schematic diagram of the basic neural network unit adopted by the present invention in a specific application.
Fig. 3 is a schematic flow chart of the present invention.
Fig. 4 is a schematic diagram of the basic layer composition of the convolutional neural network adopted by the present invention in a specific application.
Fig. 5 is a schematic diagram of the basic computing operations of the convolutional layer and down-sampling in the convolutional neural network in a specific application of the present invention.
Fig. 6 is a data flow graph of the algorithm of the present invention combined with the HPCC architecture in a specific application.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
In a specific application of the present invention, a large-scale high-performance computing cluster environment first needs to be built. The high-performance computing cluster environment is divided into a software environment and a hardware environment. The hardware is a set of 1 to N independent computers (nodes) with identical configurations, connected by a high-performance internal network; each node can serve as a single computational resource for interactive users, and the nodes can also work together and present themselves as a single, concentrated computational resource for parallel computing tasks.
The tasks of a high-performance computing cluster are mainly concentrated on scientific computing, so the requirements on hardware computing power are high. Besides choosing CPUs with higher frequencies and more cores, GPUs are also indispensable as important computing acceleration devices on a heterogeneous platform; for example, GPUs based on the new-generation Kepler architecture have been adopted on a large scale. At the same time, the biggest potential problem in a high-performance computing cluster is communication, and the time cost of data exchange often becomes the bottleneck of program performance. High-speed InfiniBand fiber interconnection is therefore adopted; it has high-speed, low-latency transmission characteristics and uses a trust-based, flow-controlled mechanism to guarantee the integrity of connections, so packets are rarely lost. InfiniBand has developed rapidly, from SDR mode through DDR and QDR modes to today's FDR mode, and from a single lane, to 4 lanes, up to the 12 lanes supported today, with an end-to-end transfer rate of up to 240 Gbps.
The software environment can use a Linux operating system; commonly used distributions include CentOS, Red Hat and Ubuntu. The compilation environment uses the GNU compilers by default, while the Intel compilers are more efficient if available. The Message Passing Interface (MPI) is currently the most popular distributed-memory parallel programming environment, so a specific implementation is required and the software environment must support running MPI programs; commonly used implementations include MPICH2, Open MPI and Intel MPI. If GPU acceleration is used, compatible CUDA driver, toolkit and SDK versions also need to be selected and installed according to the GPU model.
As shown in Fig. 1, the overall system architecture of the heterogeneous high-performance computing cluster in this application example is as follows. On the left are the heterogeneous compute nodes, the main computational resource; each node is an independent server with a CPU+GPU architecture, which is used more and more widely in the high-performance field. The nodes are interconnected through an InfiniBand switch and a gigabit management switch. InfiniBand is a long-cable connection with high-speed, low-latency transmission characteristics; it uses a trust-based, flow-controlled mechanism to guarantee the integrity of connections, so packets are rarely lost, and a network node using InfiniBand generally needs an HCA card installed. InfiniBand has developed rapidly, from SDR mode through DDR and QDR modes to today's FDR mode, and from a single lane, to 4 lanes, up to the 12 lanes supported today, with an end-to-end transfer rate of up to 240 Gbps. The gigabit management switch is mainly responsible for transmitting commands, not the actual computation data.
The switch on the right is directly interconnected with the I/O nodes, which in turn directly access the disk array; this is suitable for large-scale centralized data storage. A disk array combines disks into a group and, together with a scattered data layout, improves data security. On the one hand, a disk array can have multiple read/write ports and be accessed by multiple nodes simultaneously, improving transmission speed; on the other hand, the redundant array greatly improves data security. The management node at the lower right is provided for users: a user accesses the computational resources indirectly by logging in to the management node, which makes the computational resources easy to manage and convenient to use.
As shown in Fig. 2, the basic neural network unit adopted by the present invention in a specific application is as follows. A neural network models and connects the elementary units of the human brain, the neurons, to explore models that simulate the nervous functions of the human brain, and develops an artificial system with intelligent information-processing functions such as learning, association, memory and pattern recognition. Learning in a neural network is a process: Fig. 2 shows a single neuron; under the stimulation of its environment, sample patterns X1, X2, X3 are fed into the network one after another, and the neuron responds by combining the inputs with the weight matrix W and adding the bias b. According to the value of the result, the weight matrix W of each layer of the network is adjusted with a gradient descent algorithm; when the weights of every layer of the network have converged to certain values, the learning process ends. The generated neural network can then be used to classify real data.
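In conventional notation (a standard formulation, not an equation quoted from the original text), the unit in Fig. 2 computes a weighted sum of its inputs plus the bias and passes it through an activation function f, for example the sigmoid:
$$y = f\Big(\sum_{k} W_k X_k + b\Big), \qquad f(z) = \frac{1}{1 + e^{-z}}$$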
As shown in Fig. 3, the convolutional neural network parallel processing method based on a large-scale high-performance computing cluster of the present invention comprises the steps of:
(1) constructing multiple copies of the network model to be trained, wherein the model parameters of all copies are identical, the number of copies equals the number of nodes of the high-performance computing cluster, and each node is assigned one model copy; one node is selected as the master node, responsible for broadcasting and collecting the model parameters.
(2) dividing the training set into several subsets; in each round, distributing a training subset to the child nodes other than the master node, jointly computing the parameter gradients, and accumulating the gradient values, until model training stops.
In a preferred example, the method also comprises step (3), the update procedure of the model parameters: to guarantee the update of the model parameters, after a certain number of iterations each child node sends its accumulated parameter gradients back to the master node, which performs the reduction operation and the model parameter update in a unified manner; the updated model parameters are then distributed to each child node, and each child node computes gradients again according to the above steps, until model training stops.
In step (1), the model parameters need to be initialized and broadcast. That is, before each iteration the parameters of the network model are first randomly initialized; the main initialized model parameters are the weight parameters W and the bias units b.
Depending on the function and scale of each layer of the network model, the weight parameters W and the bias units b of each layer differ. Initialization is first carried out according to the input network configuration, e.g. net_.reset(new Net<Dtype>(train_net_param)), and then the network weight parameters and bias units are initialized layer by layer.
The concrete initialization can adopt a rands-style random scheme, which has a certain degree of randomness and makes the parameters take random values between -1 and 1.
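A minimal sketch of such a rands-style initialization, assuming the layer's weights and biases are held in a flat float buffer (the buffer layout and the function name are illustrative, not taken from the original implementation):

#include <random>
#include <vector>

// Fill a parameter buffer with values drawn uniformly from [-1, 1],
// mirroring the rands-style random initialization described above.
void rands_init(std::vector<float>& params, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
  for (float& p : params) {
    p = dist(gen);  // each weight / bias gets an independent random value
  }
}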
To ensure that each node starts the computation from the same model, the initial values of the model parameters need to be broadcast to each compute node; that is:
MPI_Bcast(net_params[param_id]->mutable_cpu_data(), net_params[param_id]->count(),
          ((sizeof(Dtype) == 8) ? MPI_DOUBLE : MPI_FLOAT), 0, MPI_COMM_WORLD);
In step (2), once the initialized model parameters are available, each node has the basis for carrying out the training computation. Because the model training data set is rather large, in order to reduce the data set reading time the present invention further has the host process open a separate thread to prefetch the training set; by overlapping the data prefetch time with the computation time, execution efficiency is improved.
Because of the particular organization of the training data set database, only one process at a time is allowed read access, so a single process is used to read the data and distribute the data set to the other compute nodes. In the specific implementation, process 0 of the program is set as the host process and is responsible for reading and sending the data; the remaining compute processes are responsible for receiving the data, and the sending and receiving are realized with MPI_Send and MPI_Recv.
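A minimal sketch of this rank-0 read-and-distribute pattern with MPI_Send and MPI_Recv; the batch layout, the message tag and the read_batch() helper are illustrative assumptions rather than the original code:

#include <algorithm>
#include <mpi.h>
#include <vector>

// Hypothetical stand-in for reading one batch from the training database;
// in the described scheme only the host process (rank 0) performs this read.
static void read_batch(std::vector<float>& batch) {
  std::fill(batch.begin(), batch.end(), 0.0f);  // placeholder data
}

// Rank 0 reads one batch per worker and ships it with MPI_Send; every other
// rank receives its batch with MPI_Recv and then trains on it locally.
void distribute_batches(int rank, int world_size, int batch_elems) {
  std::vector<float> batch(batch_elems);
  if (rank == 0) {
    for (int dst = 1; dst < world_size; ++dst) {
      read_batch(batch);
      MPI_Send(batch.data(), batch_elems, MPI_FLOAT, dst, 0, MPI_COMM_WORLD);
    }
  } else {
    MPI_Recv(batch.data(), batch_elems, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    // ... the forward/backward computation on this batch follows here ...
  }
}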
In step (2), each node trains the model parameters in a parallel fashion; that is, each compute node trains the network model parameters on the training data set assigned to that node. Because each compute node keeps a copy of the network model, the computation process on every compute node is identical; only the training data sets differ.
The training process of the model is the same as the general convolutional neural network training process; according to the network configuration, the basic hierarchy comprises convolutional layers, down-sampling layers and fully connected layers, and each level involves two kinds of computation, forward propagation and backward feedback.
The convolutional layer is a feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local feature is extracted.
The down-sampling layer is a feature mapping layer (subsampling layer): each feature map is a plane, and all neurons in the plane share equal weights.
The fully connected layer integrates the extracted features into a one-dimensional vector, which is finally connected to a classifier to complete the classification function of the whole network.
Through the forward propagation computation, the computed result is compared with the training label, the error is back-propagated, partial derivatives are computed according to the stochastic gradient descent algorithm (SGD) to obtain the gradient Δw of each model parameter in each level, and Δw is accumulated. The above forward-backward process is repeated and the model parameter gradient Δw is continuously accumulated; when the number of iterations on each compute node reaches a certain threshold, synchronous communication needs to be carried out to complete the update of the model parameters.
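A sketch of this per-node accumulation loop; forward_backward() is a hypothetical placeholder for one forward plus backward pass, and the synchronization itself is only indicated by a comment, not performed, in this fragment:

#include <cstddef>
#include <vector>

// Hypothetical stand-in for one forward + backward pass; it would return the
// gradient of the loss with respect to every model parameter for one batch.
static std::vector<float> forward_backward(std::size_t n_params) {
  return std::vector<float>(n_params, 0.0f);  // placeholder gradient
}

// Worker-side loop: keep accumulating the parameter gradient Δw locally and
// only synchronize with the master node after a fixed number of iterations.
void accumulate_gradients(std::vector<float>& grad_accum, int sync_interval) {
  for (int iter = 0; iter < sync_interval; ++iter) {
    std::vector<float> grad = forward_backward(grad_accum.size());
    for (std::size_t i = 0; i < grad_accum.size(); ++i) {
      grad_accum[i] += grad[i];  // accumulate Δw across iterations
    }
  }
  // At this point grad_accum would be sent back to the host process.
}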
In step (3), when the overall iterative computation reaches a certain number of iterations, all compute nodes send the accumulated parameter gradients Δw back to the host process, and the host process performs a reduction operation on the Δw values returned by each process and updates the model parameters w:
$$\Delta w = \mu \Delta w + \varepsilon \left( \left\langle \frac{\partial E}{\partial w} \right\rangle_i - \omega w \right) \qquad (1)$$
$$w = w + \Delta w \qquad (2)$$
where μ is the momentum factor, ω is the weight decay coefficient, ε is the learning rate, and the subscript i refers to one batch (data packet) of training data.
Because the distributed computation is carried out on a large-scale high-performance computing cluster, the model parameters are still updated after the same number of iterations, but the amount of training data processed per iteration is increased by a factor of N (where N is the number of nodes in the cluster), which is equivalent to enlarging the single-machine training set N times. The parameters in the above formula therefore need to be modified to adapt to this change in scale; the present invention multiplies by a scaling factor N, and the final formulas are:
$$\Delta w = \mu \Delta w + N \varepsilon \left( \left\langle \frac{\partial E}{\partial w} \right\rangle_i - \omega w \right) \qquad (1)$$
$$w = w + \Delta w \qquad (2)$$
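A direct transcription of formulas (1) and (2) as executed on the master node might look like the following sketch; μ, ε, ω and N follow the definitions above, and avg_grad is assumed to be the reduced gradient ⟨∂E/∂w⟩_i already averaged over one batch (an illustrative assumption, since the original update code is not reproduced here):

#include <cstddef>
#include <vector>

// Master-side update following formulas (1) and (2):
//   dw = mu*dw + N*eps*(avg_grad - omega*w)
//   w  = w + dw
void update_parameters(std::vector<float>& w, std::vector<float>& dw,
                       const std::vector<float>& avg_grad,
                       float mu, float eps, float omega, int N) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    dw[i] = mu * dw[i] + N * eps * (avg_grad[i] - omega * w[i]);
    w[i] += dw[i];  // formula (2): apply the momentum-smoothed step
  }
}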
As shown in Fig. 3, the flow of this application example of the present invention is as follows. At the start of the algorithm, the network model configuration file is read first, and the network structure and the model parameters of each layer are initialized. To ensure that the network model copies on all nodes subsequently have identical parameters, the host process uses MPI_Bcast to broadcast the initialized model parameters, so that the copies on all nodes are identical.
When accessing the training data set stored on the disk array, only a single process is supported because of the particular organization of the training data set database, so the approach taken here is that the host process is responsible for reading the training data set and distributing it to the other compute nodes. At the same time, because the training data set is generally tens or even hundreds of gigabytes in size and each iteration needs to access one image package of batchsize size, data prefetching is used to hide the data reading time. Each time, the host process opens a separate thread responsible for prefetching and sending the data: it reads batchsize items and sends them to a compute node, while the host process itself also handles the later reduction of the gradient values and the parameter update. The computation time is thus used to hide the data transmission time, improving execution efficiency.
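A minimal sketch of this prefetching idea, assuming a simple double-buffering scheme with one background thread; load_next_batch() and the buffer handling are illustrative, not the original implementation:

#include <algorithm>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for reading one batchsize-sized batch from storage.
static void load_next_batch(std::vector<float>& buf) {
  std::fill(buf.begin(), buf.end(), 0.0f);  // placeholder for the real read
}

// While the current batch is dispatched to a compute node, a separate thread
// of the host process already reads the next batch, hiding the I/O time.
void prefetch_and_dispatch(int num_batches, int batch_elems) {
  std::vector<float> current(batch_elems), next(batch_elems);
  load_next_batch(current);
  for (int b = 0; b < num_batches; ++b) {
    std::thread prefetcher(load_next_batch, std::ref(next));  // read ahead
    // ... MPI_Send of `current` to a compute node would happen here ...
    prefetcher.join();
    std::swap(current, next);  // the prefetched batch becomes the current one
  }
}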
After receiving its data set, each compute node can carry out model training, forward computation and back-propagation, continuously accumulating Δw. When the overall number of iterations reaches a certain threshold, each child node sends Δw back to the host process; the host process reduces the Δw values, updates the weight matrix w with the reduced Δw, and broadcasts the updated weight matrix to each child node again, and the child nodes resume the training of the model parameters. The above computation process is repeated until the required number of iterations is reached or the model parameters finally converge, and the algorithm is complete.
As shown in Fig. 4, the basic layer composition of the convolutional neural network in this application example of the present invention is as follows. A convolutional neural network is a multilayer neural network; each layer consists of multiple two-dimensional planes, and each plane consists of multiple independent neurons. As in Fig. 4, the input image is convolved with three trainable filters and biases, producing three feature maps at layer C1; then every group of four pixels in each feature map is summed, weighted and biased, and passed through a sigmoid activation function to obtain the three feature maps of layer S2. These maps are filtered again to obtain layer C3, and in the same way as for S2 this hierarchy then produces S4. Finally, these pixel values are rasterized and connected into a vector that is fed into a traditional neural network to produce the output.
Generally, a C layer is a feature extraction layer: the input of each neuron is connected to the local receptive field of the previous layer, and the local feature is extracted; once the local feature has been extracted, its positional relationship to the other features is also determined. An S layer is a feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in the plane share equal weights. The feature mapping structure uses the sigmoid function, whose influence function kernel is small, as the activation function of the convolutional network, so that the feature maps are shift-invariant.
In addition, because the neurons on one mapping plane share weights, the number of free network parameters is reduced, which reduces the complexity of parameter selection. Each feature extraction layer (C layer) in the convolutional neural network is followed by a computational layer (S layer) for local averaging and secondary extraction, and this characteristic two-stage feature extraction structure gives the network a higher tolerance to distortion of the input samples during recognition.
Referring to Fig. 5, which shows the basic computing operations of the convolutional layer and down-sampling in the convolutional neural network in a specific application of the present invention. The convolution process comprises: convolving an input image (in the first stage this is the original input image; in later stages it is a convolutional feature map) with a trainable filter f_x and adding a bias b_x to obtain the convolutional layer C_x. The sub-sampling process comprises: summing the four pixels of each neighborhood into one pixel, weighting it by a scalar W_{x+1}, adding a bias b_{x+1}, and passing the result through a sigmoid activation function, producing a feature map S_{x+1} reduced roughly by a factor of four. The main operation is convolution, whose computing behavior can be expressed with a general formula.
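The standard form of this convolutional-layer computation (given here as a conventional textbook expression rather than as the patent's own equation) is:
$$x_j^{\ell} = f\Big(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\Big)$$
where $x_j^{\ell}$ is the j-th feature map of layer ℓ, $k_{ij}^{\ell}$ is a trainable convolution kernel, $M_j$ is the set of input maps connected to output map j, $b_j^{\ell}$ is the bias, * denotes convolution, and f is the activation function (for example the sigmoid).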
As shown in Fig. 6, the data flow of the algorithm of the present invention combined with the HPCC architecture, when applied to a high-performance computing cluster, is as follows. Stochastic gradient descent (SGD) is probably the most commonly used optimization method for training deep neural networks. Unfortunately, the traditional SGD method is inherently sequential, which makes it unsuitable for large data sets, because the machine-to-machine data movement required by this fully serial mode is very time-consuming. To apply SGD to large data sets, a variant of stochastic gradient descent that uses multiple distributed model copies is adopted, as follows: the training set is divided into several subsets, and a different training subset is run on each independent model copy. All communication between model copies exchanges data over InfiniBand; the host process is responsible for maintaining and updating the model parameters and for broadcasting them to the other child nodes. Before each batch is processed, every model copy receives the latest model parameters from the host process. After obtaining the updated model parameters, a copy runs batchsize samples to compute the parameter gradients and pushes them to the host process for updating the current model parameter values. The host process reduces the gradient values, updates the model parameters, and broadcasts the new model parameters to each node, after which each child node can start the computation for the next batchsize samples.
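One synchronization round of the data flow in Fig. 6 can be sketched with MPI as follows, assuming rank 0 is the host process; the parameter update is deliberately simplified (the full momentum update of formulas (1) and (2) would go in its place), and the function and variable names are illustrative:

#include <cstddef>
#include <mpi.h>
#include <vector>

// One synchronization round: sum the locally accumulated gradients of all
// model copies onto the master, update the parameters there, and broadcast
// the new parameters so every copy starts the next round from the same model.
void synchronization_round(std::vector<float>& params,
                           std::vector<float>& local_dw) {
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  // Reduction step: sum the accumulated Δw of every model copy onto rank 0.
  std::vector<float> global_dw(local_dw.size(), 0.0f);
  MPI_Reduce(local_dw.data(), global_dw.data(),
             static_cast<int>(local_dw.size()), MPI_FLOAT, MPI_SUM,
             0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (std::size_t i = 0; i < params.size(); ++i) {
      params[i] += global_dw[i];  // simplified stand-in for formulas (1)-(2)
    }
  }

  // Broadcast the updated model parameters back to every model copy.
  MPI_Bcast(params.data(), static_cast<int>(params.size()), MPI_FLOAT,
            0, MPI_COMM_WORLD);
}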
The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions that fall under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that several improvements and modifications made by those skilled in the art without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A convolutional neural network parallel processing method based on a large-scale high-performance computing cluster, characterized in that the steps are:
(1) constructing multiple copies of the network model to be trained, wherein the model parameters of all copies are identical, the number of copies equals the number of nodes of the high-performance computing cluster, and each node is assigned one model copy; one node is selected as the master node, responsible for broadcasting and collecting the model parameters;
(2) dividing the training set into several subsets; in each round, distributing a training subset to each of the child nodes other than the master node, jointly computing the parameter gradients, and accumulating the gradient values; the accumulated value is used to update the model parameters on the master node, and the updated model parameters are broadcast to each child node, until model training stops.
2. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 1, characterized in that in step (1), before each iteration the parameters of the network model are first randomly initialized; the initialized model parameters comprise the weight parameters W and the bias units b; initialization is first carried out according to the input network configuration, and then the network weight parameters and bias units are initialized layer by layer.
3. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 2, characterized in that the initialization adopts a rands-style random scheme, so that the parameters take random values between -1 and 1.
4. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 1, 2 or 3, characterized in that the method further comprises a step (3) of updating the model parameters, namely: after a certain number of iterations, each child node sends its accumulated parameter gradients back to the master node, which performs the reduction operation and the model parameter update in a unified manner; the updated model parameters are then distributed to each child node, and each child node computes gradients again, until model training stops.
5. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 1, 2 or 3, characterized in that in step (2), the host process opens a separate thread to prefetch the training set, and a single process performs the data reading and distributes the data set to the other nodes; that is, process 0 is set as the host process and is responsible for reading and sending the data, the remaining compute processes are responsible for receiving the data, and the sending and receiving are realized with MPI_Send and MPI_Recv.
6. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 1, 2 or 3, characterized in that in step (2), each node trains the model parameters in a parallel fashion; that is, each compute node trains the network model parameters on the training data set assigned to that node.
7. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 4, characterized in that the basic layer structure used in the model training comprises convolutional layers, down-sampling layers and fully connected layers, and each level involves two kinds of computation, forward propagation and backward feedback; the convolutional layer is a feature extraction layer, in which the input of each neuron is connected to the local receptive field of the previous layer and the local feature is extracted; the down-sampling layer is a feature mapping layer, in which each feature map is a plane and all neurons in the plane share equal weights; the fully connected layer integrates the extracted features into a one-dimensional vector, which is finally connected to a classifier to complete the classification function of the whole network; through the forward propagation computation, the computed result is compared with the training label, the error is back-propagated, partial derivatives are computed according to the stochastic gradient descent algorithm (SGD) to obtain the gradient Δw of each model parameter in each level, and Δw is accumulated; the above forward-backward process is repeated and the model parameter gradient Δw is continuously accumulated, and when the number of iterations on each compute node reaches a certain threshold, synchronous communication is carried out to complete the update of the model parameters.
8. The convolutional neural network parallel processing method based on a large-scale high-performance computing cluster according to claim 7, characterized in that in step (3), when the overall iterative computation reaches a certain number of iterations, all compute nodes send the accumulated parameter gradients Δw back to the host process, and the host process performs a reduction operation on the Δw values returned by each process and updates the model parameters w:
$$\Delta w = \mu \Delta w + \varepsilon \left( \left\langle \frac{\partial E}{\partial w} \right\rangle_i - \omega w \right) \qquad (1)$$
$$w = w + \Delta w \qquad (2)$$
CN201410674860.3A 2014-11-21 2014-11-21 Convolution neural network parallel processing method based on large-scale high-performance cluster Pending CN104463324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410674860.3A CN104463324A (en) 2014-11-21 2014-11-21 Convolution neural network parallel processing method based on large-scale high-performance cluster

Publications (1)

Publication Number Publication Date
CN104463324A true CN104463324A (en) 2015-03-25

Family

ID=52909329

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2659867B2 (en) * 1990-02-20 1997-09-30 インターナショナル・ビジネス・マシーンズ・コーポレイション Method of constructing neural network and defining neural network model
US20040220891A1 (en) * 2003-02-28 2004-11-04 Samsung Electronics Co., Ltd. Neural networks decoder
CN103605972A (en) * 2013-12-10 2014-02-26 康江科技(北京)有限责任公司 Non-restricted environment face verification method based on block depth neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Jeffrey Dean et al., "Large Scale Distributed Deep Networks", NIPS'12: Proceedings of the 25th International Conference on Neural Information Processing Systems *
凡保磊, "Research on the Parallelization of Convolutional Neural Networks" (卷积神经网络的并行化研究), China Master's Theses Full-text Database, Information Science and Technology Series *
李葆青, "A Pattern Classifier Based on Convolutional Neural Networks" (基于卷积神经网络的模式分类器), Journal of Dalian University (大连大学学报) *
龚丁禧, "Sparse Self-Combining Spatio-Temporal Convolutional Neural Network Action Recognition and Its Parallelization" (稀疏自组合时空卷积神经网络动作识别方法及其并行化), China Master's Theses Full-text Database, Information Science and Technology Series *

CN106203621A (en) * 2016-07-11 2016-12-07 姚颂 The processor calculated for convolutional neural networks
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN107688493A (en) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 Train the method, apparatus and system of deep neural network
CN107688493B (en) * 2016-08-05 2021-06-18 阿里巴巴集团控股有限公司 Method, device and system for training deep neural network
CN106293942A (en) * 2016-08-10 2017-01-04 中国科学技术大学苏州研究院 Neutral net load balance optimization method based on the many cards of multimachine and system
CN106355247A (en) * 2016-08-16 2017-01-25 北京比特大陆科技有限公司 Method for data processing and device, chip and electronic equipment
CN106355247B (en) * 2016-08-16 2019-03-08 算丰科技(北京)有限公司 Data processing method and device, chip and electronic equipment
CN106297297A (en) * 2016-11-03 2017-01-04 成都通甲优博科技有限责任公司 Traffic jam judging method based on degree of depth study
CN108073550A (en) * 2016-11-14 2018-05-25 耐能股份有限公司 Buffer unit and convolution algorithm apparatus and method
CN108122032A (en) * 2016-11-29 2018-06-05 华为技术有限公司 A kind of neural network model training method, device, chip and system
CN106650925A (en) * 2016-11-29 2017-05-10 郑州云海信息技术有限公司 Deep learning framework Caffe system and algorithm based on MIC cluster
CN108122032B (en) * 2016-11-29 2020-02-14 华为技术有限公司 Neural network model training method, device, chip and system
CN108154237A (en) * 2016-12-06 2018-06-12 华为技术有限公司 A kind of data processing system and method
CN108154237B (en) * 2016-12-06 2022-04-05 华为技术有限公司 Data processing system and method
WO2018103562A1 (en) * 2016-12-06 2018-06-14 华为技术有限公司 Data processing system and method
CN106599898A (en) * 2016-12-13 2017-04-26 郑州云海信息技术有限公司 Image feature extraction method and system
CN108229687B (en) * 2016-12-14 2021-08-24 腾讯科技(深圳)有限公司 Data processing method, data processing device and electronic equipment
US10943324B2 (en) 2016-12-14 2021-03-09 Tencent Technology (Shenzhen) Company Limited Data processing method, apparatus, and electronic device
CN108229687A (en) * 2016-12-14 2018-06-29 腾讯科技(深圳)有限公司 Data processing method, data processing equipment and electronic equipment
WO2018107934A1 (en) * 2016-12-14 2018-06-21 腾讯科技(深圳)有限公司 Data processing method and apparatus, and electronic device
CN108171323A (en) * 2016-12-28 2018-06-15 上海寒武纪信息科技有限公司 A kind of artificial neural networks device and method
CN108154228A (en) * 2016-12-28 2018-06-12 上海寒武纪信息科技有限公司 A kind of artificial neural networks device and method
CN108171323B (en) * 2016-12-28 2021-03-26 上海寒武纪信息科技有限公司 Artificial neural network computing device and method
CN108268946A (en) * 2016-12-31 2018-07-10 上海兆芯集成电路有限公司 The neural network unit of circulator with array-width sectional
CN106991474B (en) * 2017-03-28 2019-09-24 华中科技大学 The parallel full articulamentum method for interchanging data of deep neural network model and system
CN106991474A (en) * 2017-03-28 2017-07-28 华中科技大学 The parallel full articulamentum method for interchanging data of deep neural network model and system
CN106951926A (en) * 2017-03-29 2017-07-14 山东英特力数据技术有限公司 The deep learning systems approach and device of a kind of mixed architecture
CN108694690A (en) * 2017-04-08 2018-10-23 英特尔公司 Subgraph in frequency domain and the dynamic select to the convolution realization on GPU
CN111915025A (en) * 2017-05-05 2020-11-10 英特尔公司 Immediate deep learning in machine learning for autonomous machines
CN111915025B (en) * 2017-05-05 2024-04-30 英特尔公司 Instant deep learning in machine learning for autonomous machines
CN107146027A (en) * 2017-05-09 2017-09-08 华东师范大学 A kind of factory's intelligent early-warning system
CN107038506A (en) * 2017-05-09 2017-08-11 华东师范大学 A kind of factory's intelligent early-warning method
CN107085743A (en) * 2017-05-18 2017-08-22 郑州云海信息技术有限公司 A kind of deep learning algorithm implementation method and platform based on domestic many-core processor
CN110892477A (en) * 2017-06-08 2020-03-17 D5Ai有限责任公司 Gradient direction data segmentation for neural networks
CN109146073A (en) * 2017-06-16 2019-01-04 华为技术有限公司 A kind of neural network training method and device
US11475300B2 (en) 2017-06-16 2022-10-18 Huawei Technologies Co., Ltd. Neural network training method and apparatus
CN109146073B (en) * 2017-06-16 2022-05-24 华为技术有限公司 Neural network training method and device
WO2019001071A1 (en) * 2017-06-28 2019-01-03 浙江大学 Adjacency matrix-based graph feature extraction system and graph classification system and method
CN107341127B (en) * 2017-07-05 2020-04-14 西安电子科技大学 Convolutional neural network acceleration method based on OpenCL standard
CN107341127A (en) * 2017-07-05 2017-11-10 西安电子科技大学 Convolutional neural networks accelerated method based on OpenCL standards
CN107563507A (en) * 2017-08-29 2018-01-09 南京中蓝数智信息技术有限公司 Deep learning method based on big data
US11354133B2 (en) 2017-08-31 2022-06-07 Cambricon Technologies Corporation Limited Processing device and related products
CN110245752A (en) * 2017-08-31 2019-09-17 北京中科寒武纪科技有限公司 A kind of connection operation method and device entirely
US11531553B2 (en) 2017-08-31 2022-12-20 Cambricon Technologies Corporation Limited Processing device and related products
US11561800B2 (en) 2017-08-31 2023-01-24 Cambricon Technologies Corporation Limited Processing device and related products
US11334363B2 (en) 2017-08-31 2022-05-17 Cambricon Technologies Corporation Limited Processing device and related products
US11347516B2 (en) 2017-08-31 2022-05-31 Cambricon Technologies Corporation Limited Processing device and related products
US11775311B2 (en) 2017-08-31 2023-10-03 Cambricon Technologies Corporation Limited Processing device and related products
US11409535B2 (en) 2017-08-31 2022-08-09 Cambricon Technologies Corporation Limited Processing device and related products
CN111052155B (en) * 2017-09-04 2024-04-16 华为技术有限公司 Distribution of asynchronous gradient averages random gradient descent method
CN111052155A (en) * 2017-09-04 2020-04-21 华为技术有限公司 Distributed random gradient descent method for asynchronous gradient averaging
CN107563392A (en) * 2017-09-07 2018-01-09 西安电子科技大学 The YOLO object detection methods accelerated using OpenCL
CN107590534B (en) * 2017-10-17 2021-02-09 北京小米移动软件有限公司 Method and device for training deep convolutional neural network model and storage medium
CN107590534A (en) * 2017-10-17 2018-01-16 北京小米移动软件有限公司 Train the method, apparatus and storage medium of depth convolutional neural networks model
US11900242B2 (en) 2017-12-14 2024-02-13 Cambricon Technologies Corporation Limited Integrated circuit chip apparatus
CN108304924A (en) * 2017-12-21 2018-07-20 内蒙古工业大学 A kind of pipeline system pre-training method of depth confidence net
CN108021395A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Data parallel processing method and system for neural network
CN108021395B (en) * 2017-12-27 2022-04-29 北京金山安全软件有限公司 Data parallel processing method and system for neural network
CN110018970B (en) * 2018-01-08 2023-07-21 腾讯科技(深圳)有限公司 Cache prefetching method, device, equipment and computer readable storage medium
CN110018970A (en) * 2018-01-08 2019-07-16 腾讯科技(深圳)有限公司 Cache prefetching method, apparatus, equipment and computer readable storage medium
CN108090565A (en) * 2018-01-16 2018-05-29 电子科技大学 Accelerated method is trained in a kind of convolutional neural networks parallelization
CN108268638A (en) * 2018-01-18 2018-07-10 浙江工业大学 A kind of generation confrontation network distribution type implementation method based on Spark frames
CN111819578A (en) * 2018-02-17 2020-10-23 超威半导体公司 Asynchronous training for optimization of neural networks using distributed parameter servers with rush updates
CN110197271A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197270A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197271B (en) * 2018-02-27 2020-10-27 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110197270B (en) * 2018-02-27 2020-10-30 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN110197263A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110197268A (en) * 2018-02-27 2019-09-03 上海寒武纪信息科技有限公司 Integrated circuit chip device and Related product
CN110322020B (en) * 2018-03-28 2023-05-12 国际商业机器公司 Adaptive learning rate scheduling for distributed random gradient descent
CN110322020A (en) * 2018-03-28 2019-10-11 国际商业机器公司 The autoadapted learning rate scheduling of distributed random gradient decline
CN109272118B (en) * 2018-08-10 2020-03-06 北京达佳互联信息技术有限公司 Data training method, device, equipment and storage medium
CN109272118A (en) * 2018-08-10 2019-01-25 北京达佳互联信息技术有限公司 Data training method, device, equipment and storage medium
TWI696072B (en) * 2018-08-20 2020-06-11 旺宏電子股份有限公司 Data storage apparatus, system and method
CN109255755A (en) * 2018-10-24 2019-01-22 上海大学 Image super-resolution rebuilding method based on multiple row convolutional neural networks
CN109255755B (en) * 2018-10-24 2023-05-23 上海大学 Image super-resolution reconstruction method based on multi-column convolutional neural network
CN111178492A (en) * 2018-11-09 2020-05-19 中科寒武纪科技股份有限公司 Computing device, related product and computing method for executing artificial neural network model
CN111191738A (en) * 2018-11-16 2020-05-22 京东城市(南京)科技有限公司 Cross-platform data processing method, device, equipment and readable storage medium
CN109631848B (en) * 2018-12-14 2021-04-16 山东鲁能软件技术有限公司 Transmission line foreign matter intrusion detection system and detection method
CN109631848A (en) * 2018-12-14 2019-04-16 山东鲁能软件技术有限公司 Electric line foreign matter intruding detection system and detection method
CN109657794B (en) * 2018-12-20 2022-09-06 中国科学技术大学 Instruction queue-based distributed deep neural network performance modeling method
CN109657794A (en) * 2018-12-20 2019-04-19 中国科学技术大学 A kind of distributed deep neural network performance modelling method of queue based on instruction
US11640531B2 (en) 2019-02-13 2023-05-02 Advanced New Technologies Co., Ltd. Method, apparatus and device for updating convolutional neural network using GPU cluster
CN110059813A (en) * 2019-02-13 2019-07-26 阿里巴巴集团控股有限公司 The method, device and equipment of convolutional neural networks is updated using GPU cluster
CN110059813B (en) * 2019-02-13 2021-04-06 创新先进技术有限公司 Method, device and equipment for updating convolutional neural network by using GPU cluster
CN110096346A (en) * 2019-03-29 2019-08-06 广州思德医疗科技有限公司 A kind of training mission processing method and processing device of more calculate nodes
CN110209503A (en) * 2019-08-01 2019-09-06 上海燧原智能科技有限公司 Specification calculation method, device, equipment and the medium of multidimensional tensor
CN110209503B (en) * 2019-08-01 2019-10-25 上海燧原智能科技有限公司 Specification calculation method, device, equipment and the medium of multidimensional tensor
WO2021017293A1 (en) * 2019-08-01 2021-02-04 平安科技(深圳)有限公司 Rule training method, apparatus, device, and storage medium
CN110472731A (en) * 2019-08-16 2019-11-19 北京金山数字娱乐科技有限公司 Gradient synchronous method and device during a kind of distribution is trained
CN112396154A (en) * 2019-08-16 2021-02-23 华东交通大学 Parallel method based on convolutional neural network training
CN110516795A (en) * 2019-08-28 2019-11-29 北京达佳互联信息技术有限公司 A kind of method, apparatus and electronic equipment for model variable allocation processing device
CN111008040B (en) * 2019-11-27 2022-06-14 星宸科技股份有限公司 Cache device and cache method, computing device and computing method
CN111008040A (en) * 2019-11-27 2020-04-14 厦门星宸科技有限公司 Cache device and cache method, computing device and computing method
CN112988366A (en) * 2019-12-12 2021-06-18 中科寒武纪科技股份有限公司 Parameter server, master client, and weight parameter processing method and system
CN113128531A (en) * 2019-12-30 2021-07-16 上海商汤智能科技有限公司 Data processing method and device
TWI763168B (en) * 2019-12-30 2022-05-01 大陸商上海商湯智能科技有限公司 Data processing method and apparatus, computer device, storage medium
CN113128531B (en) * 2019-12-30 2024-03-26 上海商汤智能科技有限公司 Data processing method and device
WO2021135810A1 (en) * 2019-12-30 2021-07-08 上海商汤智能科技有限公司 Data processing method and apparatus, computer device, storage medium, and computer program
WO2021136065A1 (en) * 2019-12-30 2021-07-08 中兴通讯股份有限公司 Deep learning method and apparatus, network device, and readable storage medium
CN113298223B (en) * 2020-02-24 2023-12-26 中科寒武纪科技股份有限公司 Data processing method, device, computer equipment and storage medium
CN113298223A (en) * 2020-02-24 2021-08-24 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111324630A (en) * 2020-03-04 2020-06-23 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
CN111324630B (en) * 2020-03-04 2023-07-25 中科弘云科技(北京)有限公司 MPI-based neural network architecture search parallelization method and equipment
WO2021227293A1 (en) * 2020-05-09 2021-11-18 烽火通信科技股份有限公司 Universal training method and system for artificial intelligence models
CN111429142A (en) * 2020-06-10 2020-07-17 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium
CN111860828A (en) * 2020-06-15 2020-10-30 北京仿真中心 Neural network training method, storage medium and equipment
CN111695689A (en) * 2020-06-15 2020-09-22 中国人民解放军国防科技大学 Natural language processing method, device, equipment and readable storage medium
CN111860828B (en) * 2020-06-15 2023-11-28 北京仿真中心 Neural network training method, storage medium and equipment
CN111786688A (en) * 2020-06-16 2020-10-16 重庆邮电大学 Broadband parallel channelization receiving method based on embedded GPU
US11693654B2 (en) 2020-06-16 2023-07-04 Chongqing University Of Posts And Telecommunications Embedded GPU-based wideband parallel channelized receiving method
CN111786688B (en) * 2020-06-16 2021-12-03 重庆邮电大学 Broadband parallel channelization receiving method based on embedded GPU
WO2021253840A1 (en) * 2020-06-16 2021-12-23 重庆邮电大学 Embedded gpu-based wideband parallel channelized receiving method
CN112346704A (en) * 2020-11-23 2021-02-09 华中科技大学 Full-streamline type multiply-add unit array circuit for convolutional neural network
CN112346704B (en) * 2020-11-23 2021-09-17 华中科技大学 Full-streamline type multiply-add unit array circuit for convolutional neural network
CN113762456A (en) * 2020-11-26 2021-12-07 北京沃东天骏信息技术有限公司 Model parameter adjusting method and system
CN112598118B (en) * 2021-03-03 2021-06-25 成都晓多科技有限公司 Method, device, storage medium and equipment for processing abnormal labeling in supervised learning
CN112598118A (en) * 2021-03-03 2021-04-02 成都晓多科技有限公司 Method, device, storage medium and equipment for processing abnormal labeling in supervised learning
CN114221871A (en) * 2021-04-09 2022-03-22 无锡江南计算技术研究所 Full collection method of gridding flowing water
WO2023273579A1 (en) * 2021-06-30 2023-01-05 北京有竹居网络技术有限公司 Model training method and apparatus, speech recognition method and apparatus, and medium and device
CN114842837B (en) * 2022-07-04 2022-09-02 成都启英泰伦科技有限公司 Rapid acoustic model training method
CN114842837A (en) * 2022-07-04 2022-08-02 成都启英泰伦科技有限公司 Rapid acoustic model training method
CN116187426A (en) * 2022-11-09 2023-05-30 北京百度网讯科技有限公司 Model parameter multi-stream broadcasting method and device for deep learning model
CN116187426B (en) * 2022-11-09 2024-04-19 北京百度网讯科技有限公司 Model parameter multi-stream broadcasting method and device for deep learning model

Similar Documents

Publication Publication Date Title
CN104463324A (en) Convolution neural network parallel processing method based on large-scale high-performance cluster
Yu et al. Deep learning for determining a near-optimal topological design without any iteration
WO2021190127A1 (en) Data processing method and data processing device
Cuomo et al. A GPU-accelerated parallel K-means algorithm
WO2021190597A1 (en) Processing method for neural network model, and related device
Jain et al. Gems: Gpu-enabled memory-aware model-parallelism system for distributed dnn training
CN114492782B (en) On-chip core compiling and mapping method and device of neural network based on reinforcement learning
CN107636638A (en) Universal parallel computing architecture
CN114937151A (en) Lightweight target detection method based on multi-receptive-field and attention feature pyramid
CN110659723A (en) Data processing method, device, medium and electronic equipment based on artificial intelligence
US11144291B1 (en) Loop-oriented neural network compilation
CN112163601A (en) Image classification method, system, computer device and storage medium
Wang et al. Minerva: A scalable and highly efficient training platform for deep learning
CN112102165B (en) Light field image angular domain super-resolution system and method based on zero sample learning
Zhou et al. Octr: Octree-based transformer for 3d object detection
CN102024011A (en) Autonomous subsystem architecture
WO2020186061A1 (en) System and method for implementing modular universal reparameterization for deep multi-task learning across diverse domains
Li et al. Multi-task learning with deformable convolution
Liang Ascend AI Processor Architecture and Programming: Principles and Applications of CANN
Zhang et al. A Survey on Graph Neural Network Acceleration: Algorithms, Systems, and Customized Hardware
CN110399970A (en) Wavelet convolution wavelet neural network and intelligence analysis method and system
US11461662B1 (en) Compilation time reduction for memory and compute bound neural networks
Wan et al. Shift-BNN: Highly-efficient probabilistic Bayesian neural network training via memory-friendly pattern retrieving
Du Nguyen et al. Accelerating complex brain-model simulations on GPU platforms
Zhang et al. Enabling highly efficient capsule networks processing through software-hardware co-design

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20150325)