CN112835844A - Communication sparsification method for spiking neural network computing loads - Google Patents

Communication sparsification method for spiking neural network computing loads

Info

Publication number
CN112835844A
CN112835844A (application number CN202110233847.4A; granted publication CN112835844B)
Authority
CN
China
Prior art keywords
node
communication
nodes
neuron
neurons
Prior art date
Legal status
Granted
Application number
CN202110233847.4A
Other languages
Chinese (zh)
Other versions
CN112835844B (en)
Inventor
柴志雷
刘家航
王涛
白云
王皓洋
尤佳
Current Assignee
Suzhou Lanjiachong Robot Technology Co ltd
Original Assignee
Suzhou Lanjiachong Robot Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Lanjiachong Robot Technology Co ltd
Priority to CN202110233847.4A
Publication of CN112835844A
Application granted
Publication of CN112835844B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 Interprocessor communication
    • G06F 15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a communication sparsification method for spiking neural network computing loads, which effectively addresses the scalability of communication efficiency on distributed computing platforms, i.e., the problem that computing efficiency gradually decreases as computing nodes are gradually added. In the technical scheme of this patent, the neurons are redistributed across the nodes by a redistribution operation, so that each neuron assigned to a node has the largest possible number of its post-synaptic neurons on that same node. The nodes use a non-blocking communication mode: in each communication round, every node asynchronously sends spike data to all of its target nodes and waits to receive the spikes sent by its source nodes. That is, each node sends only necessary data to its target nodes and does not communicate with non-target nodes, thereby avoiding communication with non-adjacent processes.

Description

Communication sparsification method for spiking neural network computing loads
Technical Field
The invention relates to the technical field of deep learning, and in particular to a communication sparsification method for spiking neural network computing loads.
Background
With the development of the field of deep learning, more and more research in brain neural computation science has appeared, hoping to overcome the shortcomings of existing deep learning by analyzing the working mechanism of the brain and developing brain-like computing. The basis of brain-like computing is the spiking neural network (SNN), whose operation is closer to the biological brain than that of the traditional deep neural network (DNN).
To obtain optimal computing capacity and to come closer to the scale of brain computation, brain-like computing in the prior art is mostly performed on distributed computing platforms built as large-scale clusters. However, as the number of computing nodes increases, the proportion of time spent on communication between the hardware nodes in the distributed cluster rises, and the proportion of effective computation within the computing nodes falls; that is, as computing nodes are gradually added to the distributed computing platform, the computational efficiency of the SNN gradually decreases.
Those skilled in the art have attempted to solve this problem from various angles. For example, NEST, a common simulator for neural network models, adopts a "buffer dynamic equalization" method to counter the loss of computational efficiency: in the NEST simulation environment, each node directly and dynamically maintains its send buffer, so nodes do not need to communicate buffer sizes. This optimizes communication time to a certain extent, but the method behaves poorly in terms of energy consumption and raises a new problem: NEST's communication energy consumption rises rapidly as nodes are added, so NEST scales poorly in communication energy consumption.
Disclosure of Invention
In order to solve the problem that SNN computational efficiency gradually decreases as computing nodes are added to a distributed computing platform, the invention provides a communication sparsification method for spiking neural network computing loads, which effectively addresses the scalability of communication efficiency on distributed computing platforms.
The technical scheme of the invention is as follows: a communication sparsification method for spiking neural network computing loads comprises the following steps:
s1: constructing a spiking neural network based on a distributed system architecture;
it is characterized by further comprising the following steps:
s2: traversing the spiking neural network to obtain all nodes, all neurons, and the neuron connection table;
s3: performing a redistribution operation on the neurons on each of the nodes;
the redistribution operation ensures that each neuron assigned to a node has the largest possible number of its post-synaptic neurons on that same node;
s4: establishing synapses according to the redistributed nodes and the neurons assigned to each node;
s5: after entering the communication stage and before each communication, each node communicates with the other nodes to obtain the information of all its source nodes and target nodes for this communication;
s6: each node asynchronously sends spike data to all of its target nodes and waits to receive the spikes sent by its source nodes; during communication, all nodes communicate with their source and target nodes in a non-blocking communication mode;
s7: after each node finishes communicating with its source and target nodes, the communication stage ends.
It is further characterized in that:
in step S3, the redistribution operation comprises the following steps:
a1: putting all nodes into a node set and all neurons into a neuron set;
a2: selecting a node from the node set, setting it as the current node, and removing it from the node set;
a3: selecting a neuron from the neuron set and setting it as the current neuron;
a4: assigning the current neuron to the current node and removing it from the neuron set;
a5: checking whether the current node has reached its maximum capacity; if not, executing step a6; otherwise, executing step a8;
a6: based on the neuron connection table, finding the neuron in the neuron set that has the most post-synaptic neurons assigned to the current node, and recording it as the neuron to be assigned;
a7: assigning the neuron to be assigned to the current node and removing it from the neuron set; looping over steps a5 to a7;
a8: checking whether the node set still contains nodes with no assigned neurons;
if so, looping over steps a2 to a8;
otherwise, recording all assignment results and ending the redistribution operation;
in step S1, when constructing the spiking neural network, only the connections between neurons need to be constructed, without creating specific synapses, whereby the neuron connection table is obtained;
the redistribution operation in step S3 is run on any single node, and after the operation finishes, that node sends the redistribution result to the other nodes;
the memory usage of the redistribution operation is calculated as:
M = N * M_int * (2N + 1)
where M is the total memory, N is the total number of neurons, and M_int is the memory occupied by one int-type parameter.
The communication sparsification method for spiking neural network computing loads provided by the invention redistributes the neurons across the nodes through the redistribution operation, so that each neuron assigned to a node has the largest possible number of its post-synaptic neurons on that node. This reduces the generation of cross-node spikes, makes the connections across hardware nodes sparse, reduces inter-node communication and therefore the communication time between hardware nodes in the distributed cluster, and improves the computational efficiency of the SNN. Based on the non-blocking communication mode, each node asynchronously sends spike data to all of its target nodes in each communication round and waits to receive the spikes sent by its source nodes; that is, each node sends only necessary data to its target nodes and does not communicate with non-target nodes, avoiding communication with non-adjacent processes and further optimizing inter-node communication efficiency. Meanwhile, after the redistribution operation reassigns the neurons to the nodes, each update round of the SNN model generates fewer spikes, point-to-point communication is reduced, and only the spike data required by a target process needs to be sent, which alleviates the poor scalability of communication energy consumption caused by collective communication.
Drawings
FIG. 1 is a schematic flow chart of the redistribution operation of the present patent;
FIG. 2 is a schematic flow chart of the implementation of the technical solution of the present patent based on the NEST simulator;
FIG. 3 is a diagram of a parallel SNN connectivity structure in the prior art;
FIG. 4 is a diagram illustrating a communication method between nodes in the prior art;
FIG. 5 is a schematic diagram of a SNN subgraph structure in the present patent;
FIG. 6 shows the results of running the solution of this patent on the cortical microcircuit model: simulation time;
FIG. 7 shows the results of running the solution of this patent on the cortical microcircuit model: communication time;
FIG. 8 shows the results of running the solution of this patent on the cortical microcircuit model: communication data volume;
FIG. 9 shows the results of running the solution of this patent on the cortical microcircuit model: time consumed by the technical scheme of this patent;
FIG. 10 shows the results of running the solution of this patent on the cortical microcircuit model: comparison of the number of neuron-to-node connections.
Detailed Description
In the communication sparsification method for spiking neural network computing loads, as shown in FIG. 2, the NEST simulator is used as the SNN load simulation tool, and the implementation steps of the technical scheme of this patent are described in detail below.
S1: constructing a spiking neural network based on a distributed system architecture;
when constructing the spiking neural network, specific synapses do not need to be established; only the connections between neurons are established, from which the neuron connection table is obtained.
S2: and traversing the impulse neural network to find all nodes and neurons and a neuron connection table.
When the technical scheme of the patent is applied to NEST, Create only needs to obtain the connection condition between neurons; no specific synaptic object is created at Connect time, but a connection table is obtained through all connections.
S3: performing the redistribution operation on the neurons on each node;
the redistribution operation ensures that each neuron assigned to a node has the largest possible number of its post-synaptic neurons on that same node.
In NEST, when the simulation function is called to initialize the simulation, the neurons are first redistributed according to the connection table, after which synapses are created and the simulation runs. That is, at initialization of Simulate, the Redistribute module is called to run the redistribution operation, the result is then applied on all nodes, and the Connect operation subsequently creates the concrete synapse objects.
In a typical SNN implemented on the NEST simulator, neurons 2 are distributed over different nodes 1. As shown in FIG. 3, the lines between neurons 2 represent spike data passing between neurons; as shown in FIG. 4, the lines between nodes 1 represent inter-node communication. As shown in FIG. 5, if neuron No. 20 generates a spike, it sends spike data to all of its post-synaptic neurons 21, 22 and 23; all spike transmission is based on such SNN subgraphs. In the NEST simulator, when several post-synaptic neurons of one neuron reside on the same node, those post-synaptic neurons share one copy of the spike data. Thus, regardless of how many post-synaptic neurons a node holds, the spike sent by that neuron to the node counts as only one transmission, i.e., one neuron-to-node connection. The total number of spikes generated per neuron update round is therefore always at most the number of nodes × the number of neurons.
Therefore, in the technical scheme of this patent, by analyzing the connection relations among the neurons, the redistribution operation each time selects the neuron with the most post-synaptic neurons in the current node's set, placing each SNN subgraph on as few nodes as possible. Redistributing the neurons this way reduces the generation of cross-node spikes and makes the connections across hardware nodes sparse, thereby reducing inter-node communication.
As shown in FIG. 1, the redistribution operation comprises the following steps:
a1: putting all nodes into a node set and all neurons into a neuron set;
a2: selecting a node from the node set, setting it as the current node, and removing it from the node set;
a3: selecting a neuron from the neuron set and setting it as the current neuron;
a4: assigning the current neuron to the current node and removing it from the neuron set;
a5: checking whether the current node has reached its maximum capacity; if not, executing step a6; otherwise, executing step a8;
a6: based on the neuron connection table, finding the neuron in the neuron set that has the most post-synaptic neurons assigned to the current node, and recording it as the neuron to be assigned;
a7: assigning the neuron to be assigned to the current node and removing it from the neuron set; looping over steps a5 to a7;
a8: checking whether the node set still contains nodes with no assigned neurons;
if so, looping over steps a2 to a8;
otherwise, recording all assignment results and ending the redistribution operation.
The redistribution operation distributes the neurons over multiple nodes according to how tightly they are connected. A maximum capacity is set for each node, and limiting the number of neurons each node can hold balances the node load well; this ensures that the technical scheme improves computational efficiency while avoiding node load and performance problems.
In a specific implementation, the redistribution operation (hereinafter abbreviated ReLOC) is described as follows:
ReLOC algorithm
Input: the number of nodes and the neuron connection table;
Output: the neuron assignment of each node;
1: for each k ∈ np
2:   r = rand(neurons);
3:   put r in part[k];
4:   while part[k].size < max_neurons
5:     c = max(post(neurons) in part[k]);
6:     put c in part[k];
where np is the number of nodes, neurons is the set of unassigned neurons, part[k] is the neuron set of node k, max_neurons is the maximum capacity of a node, and post(n) is the set of all post-synaptic neurons of neuron n.
In the ReLOC algorithm, a node is first selected and a random neuron r is assigned to it, with k denoting the current node; then unassigned neurons are placed on the node in turn until node k reaches its maximum capacity max_neurons. The rule is that the neuron placed each time is the one in the unassigned set neurons whose post-synaptic neurons are most numerous among the neurons already on node k. When neurons are matched in sequence this way, the matched neuron is always the one most closely connected to the already-matched set.
In the redistribution operation, it is the number of post-synaptic neurons inside the node's neuron set that measures how closely connected a neuron is. Because each match places as much of an SNN subgraph (FIG. 5) as possible onto the same node, when all matches are optimal each subgraph is assigned to one node, avoiding the traffic increase caused by spreading the same subgraph over multiple nodes.
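A minimal runnable Python sketch of ReLOC follows. It is one interpretation of steps a1-a8 under stated assumptions: the capacity ceil(N / nodes) and the random seed choice are choices of this sketch, not requirements of the patent.

    import random

    def reloc(np_nodes, conn_table, n_neurons):
        neurons = set(range(n_neurons))            # a1: unassigned neuron set
        max_neurons = -(-n_neurons // np_nodes)    # assumed capacity: ceil(N / nodes)
        part = {k: set() for k in range(np_nodes)}
        for k in range(np_nodes):                  # a2: current node k
            if not neurons:
                break                              # a8: every neuron assigned
            r = random.choice(tuple(neurons))      # a3: random seed neuron r
            part[k].add(r); neurons.remove(r)      # a4
            while len(part[k]) < max_neurons and neurons:        # a5
                # a6: unassigned neuron with the most post-synaptic
                # neurons already placed on node k
                c = max(neurons,
                        key=lambda n: sum(p in part[k] for p in conn_table.get(n, ())))
                part[k].add(c); neurons.remove(c)  # a7
        return part

    # Example: the FIG. 5-style subgraph (0 -> 1, 2, 3) plus an isolated
    # neuron 4; the subgraph tends to land on a single node.
    print(reloc(2, {0: [1, 2, 3]}, 5))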
The redistribution operation of this patent runs on any single node and, after it finishes, sends its assignment result to the other nodes. The redistribution operation itself has very high execution efficiency and places low performance demands on the device; that is, the technical scheme does not impose an excessive extra burden on the platform, which ensures its practicality.
The memory usage of the redistribution operation is calculated as:
M = N * M_int * (2N + 1)
where M is the total memory, N is the total number of neurons, and M_int is the memory occupied by one int-type parameter.
S4: establishing synapses according to the redistributed nodes and the neurons assigned to each node.
In this technical scheme, node-to-node communication is realized with a dynamic sparse data exchange algorithm; that is, a node sends only the necessary data to another node. This requires starting multiple MPI point-to-point transfers whose overall efficiency, taken alone, is below that of collective communication (MPI_Alltoall). However, once the redistribution operation has reduced the number of neighbor nodes of each node, the number of MPI transfers started decreases overall, and communication efficiency is improved.
S5: after entering the communication stage and before each communication starts, each node communicates with the other nodes to obtain the information of all its source nodes and target nodes for this communication.
In a specific implementation, each node uses the MPI_Reduce_scatter function to learn its source nodes, i.e., the nodes from which it needs to receive spike data, so that subsequent communication avoids non-adjacent processes and communication efficiency is improved.
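A minimal mpi4py sketch of this source-discovery step (an illustration of the idea only; the example target set is invented, and NEST itself uses the C MPI API). MPI_Reduce_scatter yields each node's number of sources; their identities are then learned as their messages arrive:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    targets = {(rank + 1) % size}            # assumed target set for this rank

    # One 0/1 flag per destination rank: "I will send spikes to you this round."
    is_target = np.zeros(size, dtype='i')
    for t in targets:
        is_target[t] = 1

    # Summing the flag columns over all ranks gives each rank the number of
    # ranks that will send to it, i.e. its source-node count for this round.
    n_sources = np.zeros(1, dtype='i')
    comm.Reduce_scatter(is_target, n_sources, recvcounts=[1] * size, op=MPI.SUM)
    print(f"rank {rank}: expecting data from {int(n_sources[0])} source node(s)")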
S6: each node asynchronously sends spike data to all of its target nodes and waits to receive the spikes sent by its source nodes; during communication, all nodes communicate with their source and target nodes in the non-blocking communication mode.
When a node transmits data to its target nodes, non-blocking communication is used: MPI_Isend sends the data, MPI_Recv receives it, and MPI_Waitall waits for all transfers to complete before the simulation continues. A node need not wait for all other nodes to finish communicating; overlapping the communication and computation stages through non-blocking communication improves the efficiency of the parallel computation.
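Continuing the sketch above under the same assumptions (dummy fixed-size payloads; not NEST's internal code), the non-blocking exchange maps directly onto MPI_Isend, MPI_Recv and MPI_Waitall:

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    targets = [(rank + 1) % size]                    # assumed targets
    n_sources = 1                                    # count obtained in step S5

    spikes = np.array([rank] * 3, dtype='i')         # dummy spike payload
    reqs = [comm.Isend([spikes, MPI.INT], dest=t) for t in targets]  # MPI_Isend

    buf = np.empty(3, dtype='i')
    inbox = []
    for _ in range(n_sources):
        comm.Recv([buf, MPI.INT], source=MPI.ANY_SOURCE)  # MPI_Recv
        inbox.append(buf.copy())

    MPI.Request.Waitall(reqs)  # MPI_Waitall: all sends done, stage S7 ends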
S7: after each node finishes communicating with its source and target nodes, the communication stage ends.
When implementing the SNN on the NEST simulator, the relevant parameter can be set through a Setcomm(type) function, thereby enabling a different communication mechanism; if this function is not called, communication defaults to NEST's native mode. Calling the Setcomm(type) function enables the communication mechanism of this technical scheme.
In this technical scheme, the redistribution operation on the neurons reduces neuron-to-process connections, so cross-node spikes decrease and inter-node connectivity becomes sparser. Because the spikes generated in each update round of the redistributed SNN model are sparse, point-to-point communication occurs less often and only the spike data required by a target process needs to be sent, which alleviates the poor scalability of communication energy consumption caused by NEST collective communication.
The performance of the solution of this patent is validated below on the cortical microcircuit (CM) model.
The experimental environment is a high-performance heterogeneous brain-like platform composed of 28 PYNQ-Z2 boards; each node contains an ARM A9 dual-core processor system on the PS (processing system) side, one FPGA device on the PL (programmable logic) side, and 512 MB of physical memory. The nodes communicate over Ethernet with 1000 Mbps network bandwidth using the TCP/IP protocol. The platform aims to build a high-performance brain-like computing platform along two dimensions: first, parallel computing using the PS side of the many nodes; second, an efficient acceleration architecture built on the PL side of each node.
The cortical microcircuit (CM) model network has four layers, each consisting of an inhibitory and an excitatory neuron population, for 8 populations in total, with 77k neurons and 300 million synapses. The neuron type is iaf_psc_exp, all connections are static synapses, the connection rule is fixed_total_number, and 8 Poisson generators are fully connected to the corresponding neuron populations.
To verify the benefit of the patented solution for distributed SNN computation, the neurons and synapses of the cortical microcircuit model were scaled to 0.1 and 0.02 respectively, simulating 200 ms with a 0.1 ms time step. The cortical microcircuit model is a NEST example written in Python that calls the NEST kernel (implemented in C++), thereby simulating the SNN; it is driven by time steps and uses the MPI library for parallel communication. The time statistics of the model are as follows:
(1) computation time is the time taken by the sequential state updates of all neurons on each node;
(2) communication time is the time from the start of node communication to the completion of data reception;
(3) simulation time is the total time taken to complete a simulation, including computation time and communication time.
The communication data volume statistic counts all data bytes (unsigned int type) actually sent through MPI functions by all nodes within the 200 ms simulation time.
In order to compare the communication efficiency and performance of the technical scheme of this patent with the original NEST mode, simulations were run on 4, 8, 16 and 28 nodes respectively; the results are shown in FIGS. 6-10, where the technical scheme of this patent is labeled "redistribution and sparse exchange" and the original NEST mode is labeled "NEST". In the cortical microcircuit model the initial neuron spikes are dense, so the send buffer is very large from the start and the waste in subsequent communication is more pronounced; simulated on 28 nodes, the communication energy consumption of this technical scheme is therefore about 73 times lower than that of the original NEST mode. As shown in FIG. 6, when there are few nodes the simulation time of the original NEST mode is shorter than that of this technical scheme; as nodes continue to be added, however, the simulation time of the original NEST mode rises rapidly. Consistent with the curves of FIG. 6, FIG. 7 shows that as the number of nodes grows, the communication time of the original NEST mode likewise climbs quickly, whereas with this technical scheme the time spent on communication is effectively suppressed as computing nodes are gradually added, leaving more time for computation. As shown in FIG. 8, the communication data volume of the original NEST mode rises rapidly with the number of nodes, while that of this technical scheme grows slowly, indicating that the scalability of communication energy consumption is greatly improved.
As shown in FIG. 9, as nodes are added, the time taken by the redistribution in this technical scheme keeps decreasing: more nodes mean fewer neurons can be held on each node, so when the algorithm loops to match neurons, the work of counting the post-synaptic neurons located in a node's neuron set shrinks greatly.
Since neuron redistribution is performed before simulation, its cost is amortized as simulation time grows; the cost of the ReLOC algorithm thus shrinks relative to both the node count and the simulation time, and eventually becomes marginal.
The core of the ReLOC algorithm is the matching process: each match selects the optimal matching set, where optimal means the SNN subgraph cut of lowest cost, i.e., SNN subgraphs distributed over the fewest nodes. To describe the topology of the SNN, the neuron connection graph is represented by an adjacency matrix A with A[i,j] ∈ {0,1}, where A[i,j] = 1 when there is a synaptic connection from neuron i to neuron j, and A[i,j] = 0 otherwise. The distribution matrix is denoted T, with T[i,j] = 1 when neuron i is assigned to node j and T[i,j] = 0 otherwise.
Post-synaptic neurons on a node share spike data, so cross-node communication can be measured by N_NP, the number of neuron-to-node connections. First define the matrix P as:
P = A * T, i.e., P[i,j] = Σ_k A[i,k] * T[k,j]
where P[i,j] > 0 indicates that neuron i has a connection to node j. N_NP can then be expressed as:
N_NP = Σ_i Σ_j [P[i,j] > 0]
where [·] is 1 if the condition holds and 0 otherwise.
The experiment shown in FIG. 10 compares the N_NP of the cyclic distribution in the original NEST mode (labeled "cyclic distribution" in the figure) with that of the redistribution in the solution of this patent (labeled "redistribution" in the figure). The results show that the ReLOC algorithm reduces N_NP, i.e., inter-node communication becomes sparser.
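To make the N_NP metric concrete, the following small computation (matrices invented for the example) contrasts keeping a FIG. 5-style subgraph on one node with cutting it across two:

    import numpy as np

    # Adjacency A: neuron 0 synapses onto neurons 1, 2 and 3.
    A = np.zeros((4, 4), dtype=int)
    A[0, [1, 2, 3]] = 1

    # Distribution T: all four neurons on node 0, versus split over two nodes.
    T_together = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])
    T_split    = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

    def n_np(A, T):
        P = A @ T  # P[i, j] = number of post-synaptic neurons of i on node j
        return int(np.count_nonzero(P))

    print(n_np(A, T_together))  # 1: neuron 0 sends its spike to a single node
    print(n_np(A, T_split))     # 2: the same subgraph cut over two nodes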

Claims (4)

1. A communication sparsification method for spiking neural network computing loads, comprising the following steps:
s1: constructing a spiking neural network based on a distributed system architecture;
characterized by further comprising the following steps:
s2: traversing the spiking neural network to obtain all nodes, all neurons, and the neuron connection table;
s3: performing a redistribution operation on the neurons on each of the nodes;
the redistribution operation ensures that each neuron assigned to a node has the largest possible number of its post-synaptic neurons on that same node;
s4: establishing synapses according to the redistributed nodes and the neurons assigned to each node;
s5: after entering the communication stage and before each communication, each node communicates with the other nodes to obtain the information of all its source nodes and target nodes for this communication;
s6: each node asynchronously sends spike data to all of its target nodes and waits to receive the spikes sent by its source nodes; during communication, all nodes communicate with their source and target nodes in a non-blocking communication mode;
s7: after each node finishes communicating with its source and target nodes, the communication stage ends.
2. The communication sparsification method for spiking neural network computing loads according to claim 1, characterized in that: in step S3, the redistribution operation comprises the following steps:
a1: putting all nodes into a node set and all neurons into a neuron set;
a2: selecting a node from the node set, setting it as the current node, and removing it from the node set;
a3: selecting a neuron from the neuron set and setting it as the current neuron;
a4: assigning the current neuron to the current node and removing it from the neuron set;
a5: checking whether the current node has reached its maximum capacity; if not, executing step a6; otherwise, executing step a8;
a6: based on the neuron connection table, finding the neuron in the neuron set that has the most post-synaptic neurons assigned to the current node, and recording it as the neuron to be assigned;
a7: assigning the neuron to be assigned to the current node and removing it from the neuron set; looping over steps a5 to a7;
a8: checking whether the node set still contains nodes with no assigned neurons;
if so, looping over steps a2 to a8;
otherwise, recording all assignment results and ending the redistribution operation.
3. The communication sparsification method for spiking neural network computing loads according to claim 1, characterized in that: in step S1, when constructing the spiking neural network, only the connections between neurons need to be constructed, without creating specific synapses, whereby the neuron connection table is obtained.
4. The communication sparsification method for spiking neural network computing loads according to claim 1, characterized in that: the redistribution operation in step S3 is run on any single node, and after the operation finishes, that node sends the redistribution result to the other nodes;
the memory usage of the redistribution operation is calculated as:
M = N * M_int * (2N + 1)
where M is the total memory, N is the total number of neurons, and M_int is the memory occupied by one int-type parameter.
CN202110233847.4A 2021-03-03 2021-03-03 Communication sparsification method for spiking neural network computing loads Active CN112835844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110233847.4A CN112835844B (en) Communication sparsification method for spiking neural network computing loads

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110233847.4A CN112835844B (en) Communication sparsification method for spiking neural network computing loads

Publications (2)

Publication Number Publication Date
CN112835844A 2021-05-25
CN112835844B CN112835844B (en) 2024-03-19

Family

ID=75934398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110233847.4A Active CN112835844B (en) Communication sparsification method for spiking neural network computing loads

Country Status (1)

Country Link
CN (1) CN112835844B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090261A (en) * 2021-11-26 2022-02-25 江南大学 SNN workload prediction method and system
CN114116596A (en) * 2022-01-26 2022-03-01 之江实验室 Dynamic relay-based infinite routing method and architecture for neural network on chip

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104094293A (en) * 2012-02-08 2014-10-08 高通股份有限公司 Methods and apparatus for spiking neural computation
CN108875846A (en) * 2018-05-08 2018-11-23 河海大学常州校区 A kind of Handwritten Digit Recognition method based on improved impulsive neural networks
WO2021012752A1 (en) * 2019-07-23 2021-01-28 中建三局智能技术有限公司 Spiking neural network-based short-range tracking method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
STEVE B. FURBER et al.: "Overview of the SpiNNaker System Architecture", IEEE Transactions on Computers
李康; 张鲁飞; 张新伟; 郁龚健; 刘家航; 吴东; 柴志雷: "Design of a spiking neural network simulator based on FPGA clusters", Computer Engineering, no. 10
陈浩; 吴庆祥; 王颖; 林梅燕; 蔡荣太: "Vehicle type recognition based on a spiking neural network model", Computer Systems & Applications, no. 04

Also Published As

Publication number Publication date
CN112835844B (en) 2024-03-19

Similar Documents

Publication Publication Date Title
Ji et al. NEUTRAMS: Neural network transformation and co-design under neuromorphic hardware constraints
Shi et al. Development of a neuromorphic computing system
Liu et al. Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems
US10482380B2 (en) Conditional parallel processing in fully-connected neural networks
WO2018058426A1 (en) Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
Thiele et al. Spikegrad: An ann-equivalent computation model for implementing backpropagation with spikes
WO2020133317A1 (en) Computing resource allocation technology and neural network system
CN112835844A (en) Communication sparsization method for load calculation of impulse neural network
Fang et al. Scalable noc-based neuromorphic hardware learning and inference
WO2020133463A1 (en) Neural network system and data processing technology
CN113298222A (en) Parameter updating method based on neural network and distributed training platform system
Davies et al. Population-based routing in the SpiNNaker neuromorphic architecture
CN112149815A (en) Population clustering and population routing method for large-scale brain-like computing network
EP2926301B1 (en) Generating messages from the firing of pre-synaptic neurons
Ogbodo et al. Light-weight spiking neuron processing core for large-scale 3D-NoC based spiking neural network processing systems
Chen et al. Cerebron: A reconfigurable architecture for spatiotemporal sparse spiking neural networks
CN112784972B (en) Synapse implementation architecture for on-chip neural network
Mikhaylov et al. Neuromorphic computing based on CMOS-integrated memristive arrays: current state and perspectives
Ding et al. A hybrid-mode on-chip router for the large-scale FPGA-based neuromorphic platform
Pei et al. Multi-grained system integration for hybrid-paradigm brain-inspired computing
EP3841530A1 (en) Distributed ai training topology based on flexible cable connection
Kang et al. Hardware-aware liquid state machine generation for 2D/3D Network-on-Chip platforms
Ascia et al. Networks-on-chip based deep neural networks accelerators for iot edge devices
Wang et al. A hardware aware liquid state machine generation framework
Feng et al. Power law in deep neural networks: Sparse network generation and continual learning with preferential attachment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant