CN116502683A - Full-flow parallel acceleration brain simulation method and system - Google Patents

Full-flow parallel acceleration brain simulation method and system

Info

Publication number
CN116502683A
Authority
CN
China
Prior art keywords
neuron
cluster
temporary
index
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310349036.XA
Other languages
Chinese (zh)
Inventor
王俊宜
喻富豪
蔡炎松
朱苗苗
梁华驹
彭耿
贾海波
杜帅帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Research Institute Of Electronic Technology Of China
Original Assignee
Nanhu Research Institute Of Electronic Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Research Institute Of Electronic Technology Of China filed Critical Nanhu Research Institute Of Electronic Technology Of China
Priority to CN202310349036.XA
Publication of CN116502683A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065Analogue means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a full-flow parallel acceleration brain simulation method and system, applied to a GPU. The full-flow parallel acceleration brain simulation method comprises the following steps: receiving cluster parameters and synaptic parameters input by a user; creating a cluster table in parallel according to the cluster parameters; creating a routing table in parallel according to the synaptic parameters; creating a pulse issuing matrix and a pulse receiving matrix for each neuron; and performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix. The invention enables parallelized processing and is better suited to the many-core characteristics of the GPU.

Description

Full-flow parallel acceleration brain simulation method and system
Technical Field
The invention belongs to the technical field of brain simulation, and particularly relates to a full-flow parallel acceleration brain simulation method and system.
Background
Brain simulation is a technology that simulates the operating mechanism of the biological brain by software means, and it is a necessary means for researching future technologies such as general intelligence and digital-twin-based brain disease diagnosis. Compared with today's mainstream deep learning networks, a brain simulation network is extremely large in scale. Among mainstream deep learning networks, a ViT model in the vision field typically has 100-200 million parameters, the Galactica language model released by Meta in 2022 has 120 billion parameters, and the GPT-3 language model released by OpenAI in 2020 has 170 billion parameters. In the field of brain simulation, the human brain contains about 86 billion neurons and on the order of 10^15 synapses, so the parameter count of a software-simulated neural network reaches the order of tens of trillions.
A traditional brain simulation network must be run on a supercomputer, and to approach the actual running speed of the biological brain the simulation system needs ultra-high parallelism and communication capability. In recent years, GPUs have gained popularity in dynamics-related computing. The GPU has many-core characteristics: where the CPU emphasizes single-core processing capability, the GPU emphasizes the number of processing units on a single card. The A100 graphics card released by NVIDIA in 2021 has 108 SM processing units and can provide 221,184 concurrent computing threads (108 SMs × 2,048 resident threads per SM), a degree of parallelism far beyond that of a CPU. This many-core characteristic of the GPU is well matched to deploying the large numbers of neurons and synapses in a brain simulation network.
In the European Human Brain Project, the NEST brain simulation system commonly used on supercomputers takes the CPU core as its minimum processing unit, so its parallel processing capability is limited by the number of CPU cores, and because of the cost of inter-CPU communication it cannot parallelize as quickly as a GPU given the same number of processing cores. NEST added support for simulating networks on GPUs in 2022, but it cannot support full-flow parallelism: a large number of steps are still completed by the CPU and main memory, and problems such as overly slow network creation remain. The University of Sussex in the United Kingdom proposed the GeNN simulation system in 2018 and updated it to version 4.8 in 2022, but GeNN's GPU simulation is slower and requires a cumbersome code-generation step to create a network, so its overall execution efficiency is lower than that of the GPU version of NEST released in the same year.
A brain simulation network has numerous parameters and a complex structure, which makes it difficult to support GPU parallelism throughout the whole process of creating and executing the network. Mainstream GPU brain simulation systems currently parallelize only neuron updating and pulse transmission, while the network-creation step becomes complex: creating a network of hundreds of millions of neurons often takes a whole day. To solve the problem that brain simulation cannot be run with full-flow parallelism, a brain simulation network data structure adapted to the many-core characteristics of the GPU needs to be redesigned, supporting rapid network creation and simulation that exploit GPU multithreading and video memory. The main problem is:
the data structures in existing brain simulation systems cannot fully exploit the many-core characteristics of the GPU. Current GPU-side data structures for brain simulation are usually ported from CPU versions, and no structure designed for the GPU that fully supports the whole flow of network creation, neuron updating and pulse processing exists, so GPU thread utilization is suboptimal.
Disclosure of Invention
The invention aims to provide a full-flow parallel acceleration brain simulation method that enables parallelized processing and is better suited to the many-core characteristics of the GPU.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the full-flow parallel acceleration brain simulation method is applied to a GPU, and comprises the following steps:
receiving cluster parameters and synaptic parameters input by a user;
creating a cluster table in parallel according to the cluster parameters;
creating a routing table in parallel according to the synaptic parameters;
creating a pulse issuing matrix and a pulse receiving matrix for each neuron;
brain simulation is performed based on the cluster table, the routing table, the pulse delivery matrix and the pulse receiving matrix.
The following provides several alternatives. They are not additional limitations on the overall scheme described above but only further additions or preferences; each alternative may be combined with the overall scheme on its own, and multiple alternatives may also be combined with one another, provided no technical or logical contradiction arises.
Preferably, the cluster parameters include a total number of clusters, and a number of neurons, a number of neuron parameters, and a neuron initial parameter of each cluster;
the synaptic parameters comprise the total number of synaptic information, and a source cluster index, a target cluster index, a connection rule, a weight distribution and a time delay distribution contained in each piece of synaptic information.
Preferably, the creating a cluster table according to the cluster parameters includes:
acquiring the cluster parameter table of length L, wherein L is the total number of clusters, and the i-th element in the table represents the information of the i-th cluster, comprising the number of neurons M_i, the number of neuron parameters N_i and the initial neuron parameters;
initializing a cluster table of length L, and initializing each element of the cluster table in parallel with L threads, wherein the i-th element of the cluster table represents the cluster matrix of the i-th cluster, the first-dimension length of which is N_i and the second-dimension length of which is M_i;
and starting L threads to process L cluster matrixes in parallel, wherein M_i threads are restarted in the ith thread, and M_i threads are used for filling the initial neuron parameters of M_i neurons of the ith cluster into the cluster matrix of the ith cluster in parallel.
Preferably, the creating a routing table according to the synapse parameters in parallel includes:
creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters;
according to the temporary source neuron table and the temporary delay table, counting the maximum delay value of each neuron and the number of synapses for which it serves as the source neuron;
initializing a one-dimensional empty routing table aiming at each cluster, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of the target address tables is the maximum delay value of the corresponding neurons, and the elements in the target address tables are [ target neuron index, weight ] key value pairs;
and filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
Preferably, the creating a temporary source neuron table, a temporary target neuron table, a temporary weight table, and a temporary delay table corresponding to each two clusters according to the synaptic parameters includes:
starting threads with the same number as the total number of the synaptic information in the synaptic parameters, and processing all the synaptic information by all the threads in parallel, wherein the processing procedure of each thread is as follows:
acquiring corresponding synaptic information, wherein the synaptic information comprises a source cluster index, a target cluster index, a connection rule, weight distribution and time delay distribution;
calculating the number s_c of synapses to be created between two clusters according to a connection rule;
generating, by a random number generator, a temporary source neuron table of length s_c and a temporary target neuron table of length s_c, wherein the elements of the temporary source neuron table are neuron indexes contained in the source cluster and the elements of the temporary target neuron table are neuron indexes contained in the target cluster; and simultaneously generating, by the random number generator according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c.
Preferably, the filling the routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table includes:
starting the same number of parallel threads as the total synapse number, wherein the processing procedure of the s-th thread is as follows: finding the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table whose start index is smaller than and closest to s;
and taking the element with index [s - start index] in the temporary source neuron table as the index into the routing table to obtain a target address table, taking the element with index [s - start index] in the temporary delay table as the first-dimension index of the target address table, and writing the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] in the temporary target neuron table and the temporary weight table into the target address table.
Preferably, the creating a pulse issuing matrix and a pulse receiving matrix for each neuron includes:
starting parallel threads with the same number as the total number of the clusters, initializing a pulse receiving matrix corresponding to one dimension of the clusters by each thread, wherein the length of the first dimension of the pulse receiving matrix is the number of neurons of the clusters, the elements are accumulated values of the number of the pulses, and the initial value is 0;
starting parallel threads with the same number as the total neuron number, initializing a two-dimensional pulse issuing matrix corresponding to the neuron by each thread, wherein the first dimension length of the pulse issuing matrix is the maximum delay value of the neuron, the second dimension length is the synapse number of the neuron serving as a source neuron, the elements are [ target neuron index, weight accumulated value ] key value pairs, the target neuron index in the key value pairs is the index of the target neuron to which the neuron serving as the source neuron is connected, and the initial value of the weight accumulated value in the key value pairs is 0.
Preferably, the brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix comprises:
1) Initializing the current time step to 0;
2) Starting M threads to collect pulses in parallel, taking the j-th element of the i-th pulse receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and then clearing the pulse receiving matrix, wherein M is the total number of neurons;
3) Starting M parallel threads, wherein each thread updates the neuron parameters of its corresponding neuron in the cluster table and judges whether the updated membrane voltage is greater than a threshold; if so, the neuron needs to issue a pulse and step 4) is executed; otherwise step 6) is executed;
4) For the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair;
5) Filling the pulse information into the pulse issuing matrix corresponding to the j-th neuron of the i-th cluster: taking the delay value in the pulse information as the first-dimension index of the pulse issuing matrix to obtain s_j elements; among the obtained s_j elements, determining the element whose target neuron index is the same as that in the pulse information, and adding the weight in the pulse information to the weight accumulated value in the determined element's key-value pair, the weights of multiple pieces of pulse information corresponding to the same target address table being added into the same pulse issuing matrix;
6) Starting M parallel threads, each thread restarting s_j × max_delay_j parallel threads, each of which corresponds to one element of the pulse issuing matrix; if the element processed by a thread has a first-dimension index of 0 in the pulse issuing matrix, the weight accumulated value of that element is accumulated into the element of the pulse receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0; each thread then moves its element forward by one position along the first dimension; finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the pulse issuing matrix, the weight accumulated value in that element is set to 0; wherein max_delay_j is the first-dimension length of the pulse issuing matrix and s_j is the second-dimension length of the pulse issuing matrix;
7) The current time step is increased by 1;
8) Judging whether the current time step is equal to the simulation time step, if not, executing the step 2) to continue simulation, and if so, ending the flow.
The full-flow parallel acceleration brain simulation method provided by the invention supports creating the cluster table and the routing table in parallel, and provides neuron and synapse data structures adapted to the GPU's many-core characteristics, on which rapid parameter addressing and updating are realized. In addition, the pulse issuing matrix and the pulse receiving matrix are created in parallel, providing a pulse storage structure adapted to the GPU's many-core characteristics so as to realize a parallel pulse-processing flow.
The second purpose of the invention is to provide a full-flow parallel acceleration brain simulation system that enables parallelized processing and is better suited to the many-core characteristics of the GPU.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a full-flow parallel acceleration brain simulation system, comprising a memory, a video memory and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
Preferably, the full-flow parallel acceleration brain simulation system further comprises a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
The full-flow parallel acceleration brain simulation system provided by the invention supports creating the cluster table and the routing table in parallel, and provides neuron and synapse data structures adapted to the GPU's many-core characteristics, on which rapid parameter addressing and updating are realized. In addition, the pulse issuing matrix and the pulse receiving matrix are created in parallel, providing a pulse storage structure adapted to the GPU's many-core characteristics so as to realize a parallel pulse-processing flow.
Drawings
FIG. 1 is a flow chart of a full-flow parallel acceleration brain simulation method of the present invention;
fig. 2 is a schematic structural diagram of a full-flow parallel acceleration brain simulation system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
This embodiment aims to solve the problem that the current data structure, which takes the neuron as the minimum unit, is mismatched with the many-core characteristics of the GPU, making processes such as network creation and parameter indexing complex and their parallel efficiency low. It likewise aims to solve the low parallel efficiency of the main processes in brain simulation, such as membrane-voltage updating, route addressing and pulse transmission. The embodiment provides a full-flow parallel acceleration brain simulation method that realizes the complete expression of brain simulation on a GPU, including the data structures related to neuron, synapse and pulse processing; it realizes the complete flow of rapidly creating these data structures in GPU video memory and provides the complete simulation flow over them, including neuron parameter indexing and updating, pulse routing lookup, and pulse data storage, issuing and receiving, thereby realizing GPU-based parallel accelerated simulation.
For ease of understanding, the terms used in the present invention will be explained first:
brain simulation system: the software system provides a basic interface for creating a brain simulation network for a user, and comprises interfaces for creating neuron clusters, inter-cluster synapses, simulation, result inquiry and the like, and hardware resources such as a GPU (graphics processing unit) and the like are called to complete network deployment and simulation.
Neurons: A simulated neuron imitates the actual operating mechanism of a biological brain neuron. It comprises neuron parameters such as membrane voltage, threshold, membrane time constant and conductance, together with parameter-updating logic; it is the minimum independent computing unit in the simulation system, and each neuron has a unique index value.
Neuron clusters (clusters): A set of multiple neurons, typically containing tens of thousands of neurons per cluster; it is the smallest unit with which a user creates a brain simulation network through the simulation system, and each cluster has its own unique index value.
Synapses: Connections between neurons. The user inputs parameters such as the number of synapses between clusters, connection rules, weights and delay distribution values through the simulation system; when deploying the network, the simulation system converts the distribution values into actual values and creates the actual connections between neurons.
Time delay: A synaptic parameter representing the time taken for a pulse to propagate across the synapse, in time steps.
Time step: A discrete time unit with which the simulation system represents real time during simulation; one time step may correspond to 0.1 millisecond or 1 millisecond of real time depending on the simulation precision. The parameters of the entire network are updated once per time step.
Thread: herein referred to as GPU threads.
As shown in fig. 1, the full-flow parallel acceleration brain simulation method of the present embodiment includes the following steps:
s1, receiving cluster parameters and synaptic parameters input by a user.
Cluster parameters: typically represented as a table of length L representing L clusters, i.e. L is the total number of clusters; the i-th element of the table contains the information of cluster i: the number of neurons M_i, the number of neuron parameters N_i, and the initial neuron parameters.
The initial neuron parameters are values such as the initial membrane voltage and time constant input by the user; input is generally supported in units of clusters, i.e. the initial parameters of the neurons within a cluster are consistent.
Synaptic parameters: typically represented as a table of length P, each element of which describes the synaptic information between two clusters; P is the total number of pieces of synaptic information, and each piece comprises: a source cluster index, a target cluster index, a connection rule, a weight distribution and a delay distribution.
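For concreteness, the two input tables might be carried in host-side containers like the following CUDA/C++ sketch. All struct and field names are illustrative assumptions, as are the concrete choices shown for the connection rule (a connection probability) and the two distributions (normal parameters); the patent fixes only the meaning of the fields.

```cuda
// Illustrative host-side containers for the user input of step S1.
// Names and distribution choices are assumptions, not the patent's own API.
#include <vector>

struct ClusterParam {
    int num_neurons;                 // M_i: neurons in cluster i
    int num_params;                  // N_i: parameters per neuron
    std::vector<float> init_params;  // N_i initial values, shared by the whole cluster
};

struct SynapseParam {
    int   src_cluster;               // source cluster index
    int   dst_cluster;               // target cluster index
    float conn_prob;                 // connection rule, here a pairwise probability (assumed)
    float weight_mu, weight_sigma;   // weight distribution, here normal (assumed)
    float delay_mu,  delay_sigma;    // delay distribution in time steps, here normal (assumed)
};

using ClusterParams = std::vector<ClusterParam>;  // length L
using SynapseParams = std::vector<SynapseParam>;  // length P
```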
S2, creating a cluster table according to the cluster parameters.
S2.1, obtaining a table with a cluster parameter of length L, wherein L is the total number of clusters, and the ith element in the table represents the information of the ith cluster and comprises the number M_i of neurons, the number N_i of the neuron parameters and the initial parameters of the neurons.
S2.2, starting a thread to initialize a cluster table with the length L, and starting L threads to initialize each element in the cluster table in parallel, wherein the ith element in the cluster table represents a cluster matrix (also called a cluster matrix i) of the ith cluster, the first dimension length of the cluster matrix of the ith cluster is N_i, and the second dimension length of the cluster matrix of the ith cluster is M_i. The total neuron count is noted as M.
S2.3, starting L threads to process each cluster matrix in the cluster table in parallel; the i-th thread restarts M_i threads, and the M_i threads fill the initial neuron parameters of the M_i neurons of the i-th cluster into the cluster matrix of the i-th cluster in parallel. If the initial parameters of the neurons within a cluster are consistent, each of the M_i threads reads the element with index k from a temporary parameter array storing the N_i parameters and fills it into the element of cluster matrix i whose first-dimension index is k and whose second-dimension index corresponds to the thread index.
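A minimal device-side sketch of this fill step, assuming the parameter-major layout just described (first dimension = parameter index, second dimension = neuron index) and replacing the patent's nested thread launch with one block per cluster; the kernel name, argument layout and launch shape are assumptions.

```cuda
// Sketch of step S2.3: fill cluster matrix i (N_i x M_i, parameter-major) with
// the cluster's shared initial parameters. One block per cluster stands in for
// the patent's "L threads each restarting M_i threads".
__global__ void fill_cluster_matrix(float* const* cluster_matrices,  // L device pointers
                                    const float* const* init_params, // N_i values per cluster
                                    const int* num_neurons,          // M_i per cluster
                                    const int* num_params)           // N_i per cluster
{
    const int i  = blockIdx.x;        // cluster index
    const int Mi = num_neurons[i];
    const int Ni = num_params[i];
    float* mat = cluster_matrices[i];
    // Grid-stride over neurons so a fixed block size covers any M_i.
    for (int j = threadIdx.x; j < Mi; j += blockDim.x)
        for (int k = 0; k < Ni; ++k)
            mat[k * Mi + j] = init_params[i][k];  // row k = parameter k, column j = neuron j
}
// Launch sketch: fill_cluster_matrix<<<L, 256>>>(d_mats, d_inits, d_M, d_N);
```

With this layout, threads with consecutive neuron indexes write to consecutive addresses, so both the fill and the later per-parameter reads coalesce; this is the practical payoff of making the parameter, rather than the neuron, the first dimension.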
S3, creating a routing table in parallel according to the synaptic parameters.
S3.1, creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters.
The total synapse number S is initialized to 0. P threads are started to process all the synaptic information in parallel, with the c-th thread acquiring the synaptic information with index c in the synaptic parameters. Each thread calculates the number s_c of synapses to be created between its two clusters according to the connection rule in the acquired synaptic information, and the synapse counts of all threads are accumulated to obtain S.
A temporary source neuron table of length s_c and a temporary target neuron table of length s_c are generated by a random number generator; the elements of the temporary source neuron table are neuron indexes contained in the source cluster, and the elements of the temporary target neuron table are neuron indexes contained in the target cluster. The random numbers are generated according to the source and target cluster indexes and lie within the corresponding clusters' neuron index ranges.
Each thread also generates, by the random number generator and according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c; the generated random numbers conform to the weight and delay distributions. In total, 4·P temporary tables are generated, four per piece of synaptic information, and the start index of each table is recorded.
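A sketch of the four temporary tables for one piece of synaptic information, assuming cuRAND as the random number generator, a uniform pick of neuron indexes inside each cluster's index range, and normal weight and delay distributions; these concrete choices and all names go beyond what the patent specifies.

```cuda
#include <curand_kernel.h>

// Sketch of step S3.1: s_c threads fill the four temporary tables in parallel
// for one piece of synaptic information. Distribution choices are assumptions.
__global__ void fill_temp_tables(int s_c, unsigned long long seed,
                                 int src_lo, int src_n,  // neuron index range of source cluster
                                 int dst_lo, int dst_n,  // neuron index range of target cluster
                                 float w_mu, float w_sigma,
                                 float d_mu, float d_sigma,
                                 int* tmp_src, int* tmp_dst,
                                 float* tmp_w, int* tmp_delay)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= s_c) return;
    curandState st;
    curand_init(seed, s, 0, &st);                       // independent stream per synapse
    tmp_src[s] = src_lo + (int)(curand(&st) % (unsigned)src_n);
    tmp_dst[s] = dst_lo + (int)(curand(&st) % (unsigned)dst_n);
    tmp_w[s]   = w_mu + w_sigma * curand_normal(&st);
    int d      = (int)lrintf(d_mu + d_sigma * curand_normal(&st));
    tmp_delay[s] = d < 1 ? 1 : d;   // a pulse needs at least one step to cross a synapse
}
```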
And S3.2, counting the maximum time delay value of each neuron and the number of synapses serving as source neurons according to the temporary neuron table and the temporary time delay table.
Starting S parallel threads, wherein each thread inspects one element of the temporary source neuron tables and temporary delay tables, so as to count for each neuron the maximum delay value max_delay_j and the number s_j of synapses for which it serves as the source neuron, where max_delay_j and s_j denote the maximum delay value and source-synapse count of the j-th neuron of the i-th cluster. To simplify the calculation, neuron indexes are continuous across clusters, i.e. the j-th neuron of the i-th cluster is the same neuron as the j-th neuron among all neurons.
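These per-neuron statistics can be gathered with one thread per synapse and atomic operations, as in the following sketch; the array names and the zero-initialization convention are assumptions.

```cuda
// Sketch of step S3.2: one thread per synapse scans (tmp_src, tmp_delay) and
// accumulates, per source neuron, the synapse count s_j and the maximum delay
// max_delay_j. Both output arrays have length M and are zero-initialized.
__global__ void count_per_neuron(int S, const int* tmp_src, const int* tmp_delay,
                                 int* s_per_neuron, int* max_delay_per_neuron)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= S) return;
    int j = tmp_src[s];                                  // global source neuron index
    atomicAdd(&s_per_neuron[j], 1);                      // s_j
    atomicMax(&max_delay_per_neuron[j], tmp_delay[s]);   // max_delay_j
}
```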
S3.3, initializing a routing table: starting L parallel threads, initializing a one-dimensional empty routing table for each cluster by each thread, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of each target address table is the maximum delay value of the corresponding neurons, namely, for the jth target address table, the table length is the max_delay_j of the jth neurons, and the elements in the target address table are [ target neuron index, weight ] key value pairs.
S3.4, filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
Starting S parallel threads, wherein the processing procedure of the S-th thread is as follows: finding a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table with a starting index smaller than and closest to s.
The element with index [s - start index] (here, the start index of the found temporary source neuron table) in the temporary source neuron table is taken as the index into the routing table to obtain a target address table; the element with index [s - start index] (here, the start index of the found temporary delay table) in the temporary delay table is taken as the first-dimension index of the target address table; and the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] (here, the start indexes of the found temporary target neuron table and temporary weight table) in the temporary target neuron table and the temporary weight table is written into the target address table.
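A sketch of this lookup-and-fill step. It assumes the 4·P temporary tables are concatenated into single device arrays with their start indexes recorded, so the binary search recovers each thread's piece (and local offset s - start index) exactly as described; the routing table itself is simplified to one flat per-neuron list of {delay, target, weight} entries, an assumed layout standing in for the patent's delay-indexed target address table.

```cuda
// Sketch of step S3.4. start_idx[c] is where piece c begins in the
// concatenated temporary tables; route[j] is the entry list of neuron j.
struct RouteEntry { int delay; int target; float weight; };

__device__ int find_piece(const int* start_idx, int P, int s)
{
    int lo = 0, hi = P - 1;
    while (lo < hi) {                       // largest c with start_idx[c] <= s
        int mid = (lo + hi + 1) >> 1;
        if (start_idx[mid] <= s) lo = mid; else hi = mid - 1;
    }
    return lo;
}

__global__ void fill_routing_table(int S, int P, const int* start_idx,
                                   const int* tmp_src, const int* tmp_dst,
                                   const float* tmp_w, const int* tmp_delay,
                                   RouteEntry* const* route,
                                   int* cursor)  // per-neuron write cursor, zeroed
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= S) return;
    int c     = find_piece(start_idx, P, s); // which piece of synaptic information made synapse s
    int local = s - start_idx[c];            // the patent's [s - start index] offset
    (void)local;                             // implicit here because the tables are concatenated
    int j = tmp_src[s];
    int k = atomicAdd(&cursor[j], 1);        // parallel-safe slot inside neuron j's list
    route[j][k] = RouteEntry{ tmp_delay[s], tmp_dst[s], tmp_w[s] };
}
```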
S4, a pulse issuing matrix and a pulse receiving matrix are created for each neuron.
Pulse receiving matrix (receiving matrix for short): used to store the pulses received by neurons; there are L receiving matrices. L parallel threads are started, and the i-th thread initializes the one-dimensional receiving matrix corresponding to the i-th cluster, whose first-dimension length is the number of neurons M_i of the i-th cluster (i.e. the first-dimension index is the neuron index); the elements are pulse-count accumulated values with an initial value of 0.
Pulse issuing matrix (issuing matrix for short): used to store the pulses to be issued by neurons; there are M issuing matrices. M parallel threads are started, and the j-th thread initializes the two-dimensional issuing matrix corresponding to the j-th neuron, whose first-dimension length is the neuron's maximum delay value max_delay_j and whose second-dimension length is the number s_j of synapses for which the neuron serves as the source neuron; the elements are [target neuron index, weight accumulated value] key-value pairs, where the target neuron index is the index of a target neuron to which this source neuron connects, and the initial value of the weight accumulated value is 0.
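Continuing the sketch, the two pulse structures might look as follows: the receiving side is a plain per-neuron accumulator that cudaMemset can zero, while each neuron's issuing matrix is initialized so that every delay row repeats the neuron's target columns (RouteEntry is the type assumed in the routing-table sketch above).

```cuda
// Sketch of step S4's structures; names and layout are assumptions.
struct IssueEntry { int target; float acc_weight; };

struct NeuronIssueMatrix {
    int         max_delay;  // first dimension: delay slots (max_delay_j)
    int         s_count;    // second dimension: synapses with this neuron as source (s_j)
    IssueEntry* slots;      // max_delay * s_count entries
};

// One thread per (delay row, synapse column) pair: every row of a column
// carries the same target index, copied from the neuron's routing entries.
__global__ void init_issue_matrix(NeuronIssueMatrix m, const RouteEntry* route_j)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= m.max_delay * m.s_count) return;
    int col = idx % m.s_count;
    m.slots[idx] = IssueEntry{ route_j[col].target, 0.0f };
}
```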
And S5, performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
S5.1, initializing the current time step to 0.
S5.2, starting M threads to collect pulses in parallel, taking the j-th element of the i-th receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and then emptying the receiving matrix.
And S5.3, starting M parallel threads, wherein each thread updates the neuron parameters of the corresponding neurons in the cluster table, judging whether the current membrane voltage in each updated neuron parameter is greater than a threshold value, if so, indicating that the neurons need to release pulses and executing the step S5.4, otherwise, executing the step S5.6.
It should be noted that updating the neuron parameters of the corresponding neurons in the cluster table follows the neuron parameter-updating logic; updating the membrane-voltage values in the cluster table is conventional technology and is not described in detail in this embodiment.
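Since the update rule itself is left to conventional technology, the kernel below assumes a simple leaky integrate-and-fire update purely for illustration; the row assignment inside the cluster matrix (row 0 = membrane voltage, row 1 = threshold) and the Euler step with the time step folded into tau are also assumptions.

```cuda
// Sketch of step S5.3 for one cluster, assuming a minimal LIF neuron.
__global__ void update_neurons(float* mat, int Mi, float tau, float v_reset,
                               const float* input,  // pulse weights collected in step S5.2
                               int* fired)          // 1 if the neuron must issue a pulse
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= Mi) return;
    float v  = mat[0 * Mi + j];                 // row 0: membrane voltage (assumed)
    float th = mat[1 * Mi + j];                 // row 1: firing threshold (assumed)
    v += -v / tau + input[j];                   // leak plus accumulated input weight
    if (v > th) { fired[j] = 1; v = v_reset; }  // the threshold test of step S5.3
    else        { fired[j] = 0; }
    mat[0 * Mi + j] = v;
}
```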
S5.4, for the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair.
If a plurality of key value pairs exist in the target address table, a plurality of pulse information is finally obtained, the delay value in each pulse information is the target address table index, and the target neuron index and the weight in the pulse information are the key value pairs under the target address table index, namely the [ target neuron index, weight ].
S5.5, filling the pulse information into the issuing matrix corresponding to the j-th neuron of the i-th cluster: the delay value in the pulse information is taken as the first-dimension index of the issuing matrix to obtain s_j elements; among the obtained s_j elements, the element whose target neuron index is the same as that in the pulse information is determined, and the weight in the pulse information is added to the weight accumulated value in the determined element's key-value pair; the weights of multiple pieces of pulse information corresponding to the same target address table are added into the same issuing matrix.
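A sketch of steps S5.4-S5.5 together: for each fired neuron, every routing entry becomes a pulse whose weight is accumulated into the issuing-matrix slot addressed by (delay row, synapse column). Using one block per neuron with threads over route entries, and a 1-based delay convention, are assumptions.

```cuda
// Sketch of steps S5.4-S5.5; reuses RouteEntry and NeuronIssueMatrix above.
__global__ void issue_pulses(const int* fired, int Mi,
                             RouteEntry* const* route, const int* s_count,
                             const NeuronIssueMatrix* issue)
{
    int j = blockIdx.x;                     // one block per neuron of the cluster
    if (j >= Mi || !fired[j]) return;
    NeuronIssueMatrix m = issue[j];
    for (int k = threadIdx.x; k < s_count[j]; k += blockDim.x) {
        RouteEntry e = route[j][k];
        int row = e.delay - 1;              // delay d (>= 1) lands in row d-1
        // atomicAdd keeps repeated firings before delivery consistent.
        atomicAdd(&m.slots[row * m.s_count + k].acc_weight, e.weight);
    }
}
```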
S5.6, starting M parallel threads, each thread restarting s_j × max_delay_j threads, each of which corresponds to one element of the issuing matrix. If the element processed by a thread has a first-dimension index of 0 in the issuing matrix, its weight accumulated value is accumulated into the element of the receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0. Each thread then moves its element forward by one position along the first dimension. Finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the issuing matrix, its weight accumulated value is set to 0.
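A per-neuron sketch of step S5.6 that follows the patent's wording and physically shifts the delay rows toward slot 0; a real implementation would more likely advance a ring-buffer head index instead of moving data, which gives the same semantics without the copies.

```cuda
// Sketch of step S5.6 for one neuron: deliver row 0 into the receiving
// accumulators, shift all rows one delay slot forward, clear the last row.
__global__ void advance_issue_matrix(NeuronIssueMatrix m,
                                     float* recv)  // per-neuron receive accumulators
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // synapse column
    if (k >= m.s_count) return;
    IssueEntry head = m.slots[k];                   // row 0, column k
    if (head.acc_weight != 0.0f)
        atomicAdd(&recv[head.target], head.acc_weight);  // pulses due this step
    for (int r = 0; r < m.max_delay - 1; ++r)       // shift rows toward slot 0
        m.slots[r * m.s_count + k].acc_weight =
            m.slots[(r + 1) * m.s_count + k].acc_weight;
    m.slots[(m.max_delay - 1) * m.s_count + k].acc_weight = 0.0f;
}
```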
S5.7, the current time step is increased by 1.
S5.8, judging whether the current time step is equal to the simulation time step, if not, executing the step S5.2, continuing simulation, and if so, ending the flow.
It should be noted that steps S2-S5 are GPU operation steps. In actual operation the GPU starts the actual network creation and simulation processes according to user instructions, which can be input through an external interface; the input parameters also include the number of simulation time steps and the like.
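Putting the sketches together, the host side of step S5 might take the following shape for a single cluster; error handling, multi-cluster iteration and stream management are omitted, and the per-neuron launch in the last line is purely illustrative (a real system would batch it into one kernel).

```cuda
// Host-side shape of the simulation loop of step S5 (single cluster, sketch).
for (int t = 0; t < sim_steps; ++t) {
    // S5.2: hand the accumulated pulses to the update step, then clear them.
    cudaMemcpyAsync(d_input, d_recv, Mi * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemsetAsync(d_recv, 0, Mi * sizeof(float));
    // S5.3: neuron update and threshold test.
    update_neurons<<<(Mi + 255) / 256, 256>>>(d_mat, Mi, tau, v_reset, d_input, d_fired);
    // S5.4-S5.5: fired neurons write pulses into their issuing matrices.
    issue_pulses<<<Mi, 256>>>(d_fired, Mi, d_route, d_scount, d_issue);
    // S5.6: deliver delay-slot-0 pulses and advance the delay slots.
    for (int j = 0; j < Mi; ++j)  // illustrative only; batch in practice
        advance_issue_matrix<<<(h_issue[j].s_count + 255) / 256, 256>>>(h_issue[j], d_recv);
}
```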
The data structures for neuron, synapse and pulse processing, together with the pulse issuing and receiving scheme realized on them, can significantly improve the parallel efficiency of network creation, pulse-weight acquisition, delay updating, and route transmission and reception on the GPU, realizing full-flow parallelism of brain simulation on the GPU (comprising the four processes of neuron creation, synapse creation, neuron updating and pulse transmission). The advantages of the method for parallelizing the different stages of the simulation flow are as follows:
1. Network creation phase (neuron creation, synapse creation): general systems must construct neurons, synapses and so on step by step in the order of the user's input. The invention designs neuron and synapse table structures better suited to the GPU's many-core characteristics, which can be initialized quickly and whose neuron and synapse parameters can be created and filled in parallel.
2. Neuron parameter update phase: systems typically treat each neuron as the minimum independent unit, each handling its own parameter addressing and updating. The method instead takes the parameter as the minimum unit: the same parameter of all neurons is stored in the same place, enabling fast addressing and fast updating.
3. Pulse transfer phase: the pulse-transmission stage of a typical system processes pulses neuron by neuron and spends considerable time on routing logic. Based on the GPU's many-core characteristics, the invention simplifies and accelerates routing, pulse storage and pulse reception through multithreaded parallel processing.
In order to further illustrate the advantages of the full-flow parallel acceleration brain simulation method provided by the invention, a specific experiment is used for the following description.
An open-source cortical microcircuit network is used for the comparison test. To improve the differentiation of the experimental results, the network specification parameters in the actual test are adjusted as follows: the 8 neuron clusters are kept consistent with the original microcircuit network, the total number of neurons is increased from 77,000 to 770,000, and the total number of synapses from 240 million to 2.4 billion; that is, the network scale is increased tenfold while the network structure is unchanged. A biological time of 1 s is simulated, i.e. 10,000 time steps. The compared brain simulation systems (systems for short), hardware types, hardware names, and network creation and simulation time results are shown in Table 1.
Table 1 Network creation and simulation time comparison
As can be seen from Table 1, the method of the invention is faster than the mainstream brain simulation systems on the market in both network creation and simulation running time, which proves that the invention truly realizes full-flow parallelism and has good prospects for market application.
In another embodiment, as shown in fig. 2, a full-flow parallel acceleration brain simulation system is provided, the full-flow parallel acceleration brain simulation system comprising a memory, a video memory, and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
For specific limitations on the full-flow parallel acceleration brain simulation system, reference may be made to the above limitation on the full-flow parallel acceleration brain simulation method, and no further description is given here.
In a specific embodiment, the full-flow parallel acceleration brain simulation system further comprises a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.

Claims (10)

1. The full-flow parallel acceleration brain simulation method is applied to a GPU, and is characterized by comprising the following steps of:
receiving cluster parameters and synaptic parameters input by a user;
creating a cluster table in parallel according to the cluster parameters;
creating a routing table in parallel according to the synaptic parameters;
creating a pulse issuing matrix and a pulse receiving matrix for each neuron;
brain simulation is performed based on the cluster table, the routing table, the pulse delivery matrix and the pulse receiving matrix.
2. The full-flow parallel acceleration brain simulation method of claim 1, wherein the cluster parameters include a total number of clusters, and a number of neurons, a number of neuron parameters, and a neuron initial parameter for each cluster;
the synaptic parameters comprise the total number of synaptic information, and a source cluster index, a target cluster index, a connection rule, a weight distribution and a time delay distribution contained in each piece of synaptic information.
3. The full-flow parallel acceleration brain simulation method of claim 1, wherein the creating a cluster table from the cluster parameters comprises:
acquiring the cluster parameter table of length L, wherein L is the total number of clusters, and the i-th element in the table represents the information of the i-th cluster, comprising the number of neurons M_i, the number of neuron parameters N_i and the initial neuron parameters;
initializing a cluster table of length L, and initializing each element of the cluster table in parallel with L threads, wherein the i-th element of the cluster table represents the cluster matrix of the i-th cluster, the first-dimension length of which is N_i and the second-dimension length of which is M_i;
and starting L threads to process L cluster matrixes in parallel, wherein M_i threads are restarted in the ith thread, and M_i threads are used for filling the initial neuron parameters of M_i neurons of the ith cluster into the cluster matrix of the ith cluster in parallel.
4. The full-flow parallel acceleration brain simulation method of claim 1, wherein creating a routing table in parallel based on the synaptic parameters comprises:
creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters;
according to the temporary source neuron table and the temporary delay table, counting the maximum delay value of each neuron and the number of synapses for which it serves as the source neuron;
initializing a one-dimensional empty routing table aiming at each cluster, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of the target address tables is the maximum delay value of the corresponding neurons, and the elements in the target address tables are [ target neuron index, weight ] key value pairs;
and filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
5. The full-flow parallel acceleration brain simulation method of claim 4, wherein creating a temporary source neuron table, a temporary target neuron table, a temporary weight table, and a temporary delay table corresponding to each two clusters based on the synaptic parameters, comprising:
starting threads with the same number as the total number of the synaptic information in the synaptic parameters, and processing all the synaptic information by all the threads in parallel, wherein the processing procedure of each thread is as follows:
acquiring corresponding synaptic information, wherein the synaptic information comprises a source cluster index, a target cluster index, a connection rule, weight distribution and time delay distribution;
calculating the number s_c of synapses to be created between two clusters according to a connection rule;
generating, by a random number generator, a temporary source neuron table of length s_c and a temporary target neuron table of length s_c, wherein the elements of the temporary source neuron table are neuron indexes contained in the source cluster and the elements of the temporary target neuron table are neuron indexes contained in the target cluster; and simultaneously generating, by the random number generator according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c.
6. The full-flow parallel acceleration brain simulation method of claim 4, wherein filling the routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table, and the temporary delay table, comprises:
starting the same number of parallel threads as the total synapse number, wherein the processing procedure of the s-th thread is as follows: finding the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table whose start index is smaller than and closest to s;
and taking the element with index [s - start index] in the temporary source neuron table as the index into the routing table to obtain a target address table, taking the element with index [s - start index] in the temporary delay table as the first-dimension index of the target address table, and writing the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] in the temporary target neuron table and the temporary weight table into the target address table.
7. The full-flow parallel acceleration brain simulation method of claim 1, wherein creating a pulsing matrix and a pulse receiving matrix for each neuron comprises:
starting parallel threads with the same number as the total number of the clusters, initializing a pulse receiving matrix corresponding to one dimension of the clusters by each thread, wherein the length of the first dimension of the pulse receiving matrix is the number of neurons of the clusters, the elements are accumulated values of the number of the pulses, and the initial value is 0;
starting parallel threads with the same number as the total neuron number, initializing a two-dimensional pulse issuing matrix corresponding to the neuron by each thread, wherein the first dimension length of the pulse issuing matrix is the maximum delay value of the neuron, the second dimension length is the synapse number of the neuron serving as a source neuron, the elements are [ target neuron index, weight accumulated value ] key value pairs, the target neuron index in the key value pairs is the index of the target neuron to which the neuron serving as the source neuron is connected, and the initial value of the weight accumulated value in the key value pairs is 0.
8. The full-flow parallel acceleration brain simulation method of claim 1, wherein the performing brain simulation based on the cluster table, the routing table, and the pulse issuing and receiving matrices comprises:
1) Initializing the current time step to 0;
2) Starting M threads to collect pulses in parallel, taking the j-th element of the i-th pulse receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and emptying the pulse receiving matrix, wherein M is the total number of neurons;
3) Starting M parallel threads, wherein each thread updates the neuron parameters of its corresponding neuron in the cluster table and judges whether the updated membrane voltage is greater than a threshold; if so, the neuron needs to issue a pulse and step 4) is executed; otherwise step 6) is executed;
4) For the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair;
5) Filling the pulse information into the pulse issuing matrix corresponding to the j-th neuron of the i-th cluster: taking the delay value in the pulse information as the first-dimension index of the pulse issuing matrix to obtain s_j elements; among the obtained s_j elements, determining the element whose target neuron index is the same as that in the pulse information, and adding the weight in the pulse information to the weight accumulated value in the determined element's key-value pair, the weights of multiple pieces of pulse information corresponding to the same target address table being added into the same pulse issuing matrix;
6) Starting M parallel threads, each thread restarting s_j × max_delay_j parallel threads, each of which corresponds to one element of the pulse issuing matrix; if the element processed by a thread has a first-dimension index of 0 in the pulse issuing matrix, the weight accumulated value of that element is accumulated into the element of the pulse receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0; each thread then moves its element forward by one position along the first dimension; finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the pulse issuing matrix, the weight accumulated value in that element is set to 0; wherein max_delay_j is the first-dimension length of the pulse issuing matrix and s_j is the second-dimension length of the pulse issuing matrix;
7) The current time step is increased by 1;
8) Judging whether the current time step is equal to the simulation time step, if not, executing the step 2) to continue simulation, and if so, ending the flow.
9. The full-flow parallel acceleration brain simulation system is characterized by comprising a memory, a video memory and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
10. The full-flow parallel acceleration brain simulation system of claim 9, further comprising a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
CN202310349036.XA 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system Pending CN116502683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310349036.XA CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310349036.XA CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Publications (1)

Publication Number Publication Date
CN116502683A true CN116502683A (en) 2023-07-28

Family

ID=87327572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310349036.XA Pending CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Country Status (1)

Country Link
CN (1) CN116502683A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194051A (en) * 2023-11-01 2023-12-08 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117194051B (en) * 2023-11-01 2024-01-23 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117349033A (en) * 2023-12-05 2024-01-05 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117349033B (en) * 2023-12-05 2024-03-08 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117689025A (en) * 2023-12-07 2024-03-12 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Similar Documents

Publication Publication Date Title
CN116502683A (en) Full-flow parallel acceleration brain simulation method and system
Lym et al. Prunetrain: fast neural network training by dynamic sparse model reconfiguration
Zhang et al. Poseidon: A system architecture for efficient gpu-based deep learning on multiple machines
WO2019118299A1 (en) Evolving recurrent networks using genetic programming
Liu et al. Block proposal neural architecture search
Ordentlich et al. Network-efficient distributed word2vec training system for large vocabularies
US11507844B2 (en) Asynchronous evaluation strategy for evolution of deep neural networks
EP3940555A2 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
Groh et al. Ggnn: Graph-based gpu nearest neighbor search
CN108986872B (en) Multi-granularity attribute weight Spark method for big data electronic medical record reduction
CN105184368A (en) Distributed extreme learning machine optimization integrated framework system and method
Song et al. Large-scale training system for 100-million classification at alibaba
Canny et al. Machine learning at the limit
JP7196542B2 (en) Learning device and learning method
JP2020030699A (en) Leaning device and leaning method
CN113962358A (en) Information diffusion prediction method based on time sequence hypergraph attention neural network
Sood et al. Neunets: An automated synthesis engine for neural network design
Li et al. Dlw-nas: Differentiable light-weight neural architecture search
Zhang et al. A Survey on Graph Neural Network Acceleration: Algorithms, Systems, and Customized Hardware
WO2019180314A1 (en) Artificial neural networks
US20210264237A1 (en) Processor for reconstructing artificial neural network, electrical device including the same, and operating method of processor
JP7363145B2 (en) Learning device and learning method
CN115544029A (en) Data processing method and related device
Xu et al. Flexible few-shot class-incremental learning with prototype container
Sun et al. Active learning for image classification: A deep reinforcement learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination