CN116502683A - Full-flow parallel acceleration brain simulation method and system - Google Patents

Full-flow parallel acceleration brain simulation method and system

Info

Publication number
CN116502683A
Authority
CN
China
Prior art keywords
neuron
cluster
temporary
index
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310349036.XA
Other languages
Chinese (zh)
Inventor
王俊宜
喻富豪
蔡炎松
朱苗苗
梁华驹
彭耿
贾海波
杜帅帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Research Institute Of Electronic Technology Of China
Original Assignee
Nanhu Research Institute Of Electronic Technology Of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Research Institute Of Electronic Technology Of China filed Critical Nanhu Research Institute Of Electronic Technology Of China
Priority to CN202310349036.XA
Publication of CN116502683A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/065Analogue means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a full-flow parallel acceleration brain simulation method and system, applied to a GPU. The full-flow parallel acceleration brain simulation method comprises the following steps: receiving cluster parameters and synaptic parameters input by a user; creating a cluster table in parallel according to the cluster parameters; creating a routing table in parallel according to the synaptic parameters; creating a pulse issuing matrix and a pulse receiving matrix for each neuron; and performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix. The invention enables parallelized processing and is better suited to the many-core characteristics of the GPU.

Description

Full-flow parallel acceleration brain simulation method and system
Technical Field
The invention belongs to the technical field of brain simulation, and particularly relates to a full-flow parallel acceleration brain simulation method and system.
Background
Brain simulation is a technology that simulates the operating mechanism of the biological brain by software means, and it is a necessary means for researching future technologies such as general intelligence and digital-twin-based brain disease diagnosis. Compared with today's mainstream deep learning networks, a brain simulation network is extremely large in scale. Among mainstream deep learning networks, a ViT model in the vision field typically has 100-200 million parameters, the Galactica language model released by Meta in 2022 has 120 billion parameters, and the GPT-3 language model released by OpenAI in 2020 has 170 billion parameters. In the field of brain simulation, the human brain contains about 86 billion neurons and on the order of 10^15 synapses, so the parameter count of a software-simulated neural network reaches the order of tens of trillions.
A traditional brain simulation network must be run on a supercomputer, and to approach the actual running speed of the biological brain the simulation system needs ultra-high parallelism and communication capability. In recent years, GPUs have gained popularity in dynamics-related computing. The GPU has many-core characteristics: where the CPU emphasizes single-core processing capability, the GPU emphasizes the number of processing units on a single card. The A100 graphics card released by NVIDIA in 2021 has 108 SM processing units and can provide 221,184 concurrent computing threads (108 SMs × 2,048 resident threads per SM), a degree of parallelism far beyond that of a CPU. This many-core characteristic of the GPU is well matched to deploying the large numbers of neurons and synapses in a brain simulation network.
In the European Human Brain Project, the NEST brain simulation system commonly used on supercomputers takes the CPU core as its minimum processing unit, so its parallel processing capability is limited by the number of CPU cores, and because of the cost of inter-CPU communication it cannot parallelize as quickly as a GPU given the same number of processing cores. NEST added support for simulating networks on GPUs in 2022, but it cannot support full-flow parallelism: a large number of steps are still completed by the CPU and main memory, and problems such as overly slow network creation remain. The University of Sussex in the United Kingdom proposed the GeNN simulation system in 2018 and updated it to version 4.8 in 2022, but GeNN's GPU simulation is slower and requires a cumbersome code-generation step to create a network, so its overall execution efficiency is lower than that of the GPU version of NEST released in the same year.
A brain simulation network has numerous parameters and a complex structure, which makes it difficult to support GPU parallelism throughout the whole process of creating and executing the network. Mainstream GPU brain simulation systems currently parallelize only neuron updating and pulse transmission, while the network-creation step becomes complex: creating a network of hundreds of millions of neurons often takes a whole day. To solve the problem that brain simulation cannot be run with full-flow parallelism, a brain simulation network data structure adapted to the many-core characteristics of the GPU needs to be redesigned, supporting rapid network creation and simulation that exploit GPU multithreading and video memory. The main problem is:
the data structures in existing brain simulation systems cannot fully exploit the many-core characteristics of the GPU. Current GPU-side data structures for brain simulation are usually ported from CPU versions, and no structure designed for the GPU that fully supports the whole flow of network creation, neuron updating and pulse processing exists, so GPU thread utilization is suboptimal.
Disclosure of Invention
The invention aims to provide a full-flow parallel acceleration brain simulation method that enables parallelized processing and is better suited to the many-core characteristics of the GPU.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the full-flow parallel acceleration brain simulation method is applied to a GPU, and comprises the following steps:
receiving cluster parameters and synaptic parameters input by a user;
creating a cluster table in parallel according to the cluster parameters;
creating a routing table in parallel according to the synaptic parameters;
creating a pulse issuing matrix and a pulse receiving matrix for each neuron;
brain simulation is performed based on the cluster table, the routing table, the pulse delivery matrix and the pulse receiving matrix.
The following provides several alternatives. They are not additional limitations on the overall scheme described above but only further additions or preferences; each alternative may be combined with the overall scheme on its own, and multiple alternatives may also be combined with one another, provided no technical or logical contradiction arises.
Preferably, the cluster parameters include a total number of clusters, and a number of neurons, a number of neuron parameters, and a neuron initial parameter of each cluster;
the synaptic parameters comprise the total number of synaptic information, and a source cluster index, a target cluster index, a connection rule, a weight distribution and a time delay distribution contained in each piece of synaptic information.
Preferably, the creating a cluster table according to the cluster parameters includes:
acquiring the cluster parameter table of length L, wherein L is the total number of clusters, and the i-th element in the table represents the information of the i-th cluster, comprising the number of neurons M_i, the number of neuron parameters N_i and the initial neuron parameters;
initializing a cluster table of length L, and initializing each element of the cluster table in parallel with L threads, wherein the i-th element of the cluster table represents the cluster matrix of the i-th cluster, the first-dimension length of which is N_i and the second-dimension length of which is M_i;
and starting L threads to process L cluster matrixes in parallel, wherein M_i threads are restarted in the ith thread, and M_i threads are used for filling the initial neuron parameters of M_i neurons of the ith cluster into the cluster matrix of the ith cluster in parallel.
Preferably, the creating a routing table according to the synapse parameters in parallel includes:
creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters;
according to the temporary source neuron table and the temporary delay table, counting the maximum delay value of each neuron and the number of synapses for which it serves as the source neuron;
initializing a one-dimensional empty routing table aiming at each cluster, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of the target address tables is the maximum delay value of the corresponding neurons, and the elements in the target address tables are [ target neuron index, weight ] key value pairs;
and filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
Preferably, the creating a temporary source neuron table, a temporary target neuron table, a temporary weight table, and a temporary delay table corresponding to each two clusters according to the synaptic parameters includes:
starting threads with the same number as the total number of the synaptic information in the synaptic parameters, and processing all the synaptic information by all the threads in parallel, wherein the processing procedure of each thread is as follows:
acquiring corresponding synaptic information, wherein the synaptic information comprises a source cluster index, a target cluster index, a connection rule, weight distribution and time delay distribution;
calculating the number s_c of synapses to be created between two clusters according to a connection rule;
generating, by a random number generator, a temporary source neuron table of length s_c and a temporary target neuron table of length s_c, wherein the elements of the temporary source neuron table are neuron indexes contained in the source cluster and the elements of the temporary target neuron table are neuron indexes contained in the target cluster; and simultaneously generating, by the random number generator according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c.
Preferably, the filling the routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table includes:
starting the same number of parallel threads as the total synapse number, wherein the processing procedure of the s-th thread is as follows: finding the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table whose start index is smaller than and closest to s;
and taking the element with index [s - start index] in the temporary source neuron table as the index into the routing table to obtain a target address table, taking the element with index [s - start index] in the temporary delay table as the first-dimension index of the target address table, and writing the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] in the temporary target neuron table and the temporary weight table into the target address table.
Preferably, the creating a pulse issuing matrix and a pulse receiving matrix for each neuron includes:
starting parallel threads with the same number as the total number of the clusters, initializing a pulse receiving matrix corresponding to one dimension of the clusters by each thread, wherein the length of the first dimension of the pulse receiving matrix is the number of neurons of the clusters, the elements are accumulated values of the number of the pulses, and the initial value is 0;
starting parallel threads with the same number as the total neuron number, initializing a two-dimensional pulse issuing matrix corresponding to the neuron by each thread, wherein the first dimension length of the pulse issuing matrix is the maximum delay value of the neuron, the second dimension length is the synapse number of the neuron serving as a source neuron, the elements are [ target neuron index, weight accumulated value ] key value pairs, the target neuron index in the key value pairs is the index of the target neuron to which the neuron serving as the source neuron is connected, and the initial value of the weight accumulated value in the key value pairs is 0.
Preferably, the brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix comprises:
1) Initializing the current time step to 0;
2) Starting M threads to collect pulses in parallel, taking the j-th element of the i-th pulse receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and then clearing the pulse receiving matrix, wherein M is the total number of neurons;
3) Starting M parallel threads, wherein each thread updates the neuron parameters of its corresponding neuron in the cluster table and judges whether the updated membrane voltage is greater than a threshold; if so, the neuron needs to issue a pulse and step 4) is executed; otherwise step 6) is executed;
4) For the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair;
5) Filling the pulse information into the pulse issuing matrix corresponding to the j-th neuron of the i-th cluster: taking the delay value in the pulse information as the first-dimension index of the pulse issuing matrix to obtain s_j elements; among the obtained s_j elements, determining the element whose target neuron index is the same as that in the pulse information, and adding the weight in the pulse information to the weight accumulated value in the determined element's key-value pair, the weights of multiple pieces of pulse information corresponding to the same target address table being added into the same pulse issuing matrix;
6) Starting M parallel threads, each thread restarting s_j × max_delay_j parallel threads, each of which corresponds to one element of the pulse issuing matrix; if the element processed by a thread has a first-dimension index of 0 in the pulse issuing matrix, the weight accumulated value of that element is accumulated into the element of the pulse receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0; each thread then moves its element forward by one position along the first dimension; finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the pulse issuing matrix, the weight accumulated value in that element is set to 0; wherein max_delay_j is the first-dimension length of the pulse issuing matrix and s_j is the second-dimension length of the pulse issuing matrix;
7) The current time step is increased by 1;
8) Judging whether the current time step is equal to the simulation time step, if not, executing the step 2) to continue simulation, and if so, ending the flow.
The full-flow parallel acceleration brain simulation method provided by the invention supports creating the cluster table and the routing table in parallel, and provides neuron and synapse data structures adapted to the GPU's many-core characteristics, on which rapid parameter addressing and updating are realized. In addition, the pulse issuing matrix and the pulse receiving matrix are created in parallel, providing a pulse storage structure adapted to the GPU's many-core characteristics so as to realize a parallel pulse-processing flow.
The second purpose of the invention is to provide a full-flow parallel acceleration brain simulation system that enables parallelized processing and is better suited to the many-core characteristics of the GPU.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
a full-flow parallel acceleration brain simulation system, comprising a memory, a video memory and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
Preferably, the full-flow parallel acceleration brain simulation system further comprises a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
The full-flow parallel acceleration brain simulation system provided by the invention supports creating the cluster table and the routing table in parallel, and provides neuron and synapse data structures adapted to the GPU's many-core characteristics, on which rapid parameter addressing and updating are realized. In addition, the pulse issuing matrix and the pulse receiving matrix are created in parallel, providing a pulse storage structure adapted to the GPU's many-core characteristics so as to realize a parallel pulse-processing flow.
Drawings
FIG. 1 is a flow chart of a full-flow parallel acceleration brain simulation method of the present invention;
fig. 2 is a schematic structural diagram of a full-flow parallel acceleration brain simulation system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
This embodiment aims to solve the problem that the current data structure, which takes the neuron as the minimum unit, is mismatched with the many-core characteristics of the GPU, making processes such as network creation and parameter indexing complex and their parallel efficiency low. It likewise aims to solve the low parallel efficiency of the main processes in brain simulation, such as membrane-voltage updating, route addressing and pulse transmission. The embodiment provides a full-flow parallel acceleration brain simulation method that realizes the complete expression of brain simulation on a GPU, including the data structures related to neuron, synapse and pulse processing; it realizes the complete flow of rapidly creating these data structures in GPU video memory and provides the complete simulation flow over them, including neuron parameter indexing and updating, pulse routing lookup, and pulse data storage, issuing and receiving, thereby realizing GPU-based parallel accelerated simulation.
For ease of understanding, the terms used in the present invention will be explained first:
brain simulation system: the software system provides a basic interface for creating a brain simulation network for a user, and comprises interfaces for creating neuron clusters, inter-cluster synapses, simulation, result inquiry and the like, and hardware resources such as a GPU (graphics processing unit) and the like are called to complete network deployment and simulation.
Neurons: A simulated neuron imitates the actual operating mechanism of a biological brain neuron. It comprises neuron parameters such as membrane voltage, threshold, membrane time constant and conductance, together with parameter-updating logic; it is the minimum independent computing unit in the simulation system, and each neuron has a unique index value.
Neuron clusters (clusters): A set of multiple neurons, typically containing tens of thousands of neurons per cluster; it is the smallest unit with which a user creates a brain simulation network through the simulation system, and each cluster has its own unique index value.
Synapses: Connections between neurons. The user inputs parameters such as the number of synapses between clusters, connection rules, weights and delay distribution values through the simulation system; when deploying the network, the simulation system converts the distribution values into actual values and creates the actual connections between neurons.
Time delay: A synaptic parameter representing the time taken for a pulse to propagate across the synapse, in time steps.
Time step: A discrete time unit with which the simulation system represents real time during simulation; one time step may correspond to 0.1 millisecond or 1 millisecond of real time depending on the simulation precision. The parameters of the entire network are updated once per time step.
Thread: herein referred to as GPU threads.
As shown in fig. 1, the full-flow parallel acceleration brain simulation method of the present embodiment includes the following steps:
s1, receiving cluster parameters and synaptic parameters input by a user.
Cluster parameters: typically represented as a table of length L representing L clusters, i.e. L is the total number of clusters; the i-th element of the table contains the information of cluster i: the number of neurons M_i, the number of neuron parameters N_i, and the initial neuron parameters.
The initial neuron parameters are values such as the initial membrane voltage and time constant input by the user; input is generally supported in units of clusters, i.e. the initial parameters of the neurons within a cluster are consistent.
Synaptic parameters: typically represented as a table of length P, each element of which describes the synaptic information between two clusters; P is the total number of pieces of synaptic information, and each piece comprises: a source cluster index, a target cluster index, a connection rule, a weight distribution and a delay distribution.
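For concreteness, the two input tables might be carried in host-side containers like the following CUDA/C++ sketch. All struct and field names are illustrative assumptions, as are the concrete choices shown for the connection rule (a connection probability) and the two distributions (normal parameters); the patent fixes only the meaning of the fields.

```cuda
// Illustrative host-side containers for the user input of step S1.
// Names and distribution choices are assumptions, not the patent's own API.
#include <vector>

struct ClusterParam {
    int num_neurons;                 // M_i: neurons in cluster i
    int num_params;                  // N_i: parameters per neuron
    std::vector<float> init_params;  // N_i initial values, shared by the whole cluster
};

struct SynapseParam {
    int   src_cluster;               // source cluster index
    int   dst_cluster;               // target cluster index
    float conn_prob;                 // connection rule, here a pairwise probability (assumed)
    float weight_mu, weight_sigma;   // weight distribution, here normal (assumed)
    float delay_mu,  delay_sigma;    // delay distribution in time steps, here normal (assumed)
};

using ClusterParams = std::vector<ClusterParam>;  // length L
using SynapseParams = std::vector<SynapseParam>;  // length P
```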
S2, creating a cluster table according to the cluster parameters.
S2.1, obtaining a table with a cluster parameter of length L, wherein L is the total number of clusters, and the ith element in the table represents the information of the ith cluster and comprises the number M_i of neurons, the number N_i of the neuron parameters and the initial parameters of the neurons.
S2.2, starting a thread to initialize a cluster table with the length L, and starting L threads to initialize each element in the cluster table in parallel, wherein the ith element in the cluster table represents a cluster matrix (also called a cluster matrix i) of the ith cluster, the first dimension length of the cluster matrix of the ith cluster is N_i, and the second dimension length of the cluster matrix of the ith cluster is M_i. The total neuron count is noted as M.
S2.3, starting L threads to process each cluster matrix in the cluster table in parallel; the i-th thread restarts M_i threads, and the M_i threads fill the initial neuron parameters of the M_i neurons of the i-th cluster into the cluster matrix of the i-th cluster in parallel. If the initial parameters of the neurons within a cluster are consistent, each of the M_i threads reads the element with index k from a temporary parameter array storing the N_i parameters and fills it into the element of cluster matrix i whose first-dimension index is k and whose second-dimension index corresponds to the thread index.
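A minimal device-side sketch of this fill step, assuming the parameter-major layout just described (first dimension = parameter index, second dimension = neuron index) and replacing the patent's nested thread launch with one block per cluster; the kernel name, argument layout and launch shape are assumptions.

```cuda
// Sketch of step S2.3: fill cluster matrix i (N_i x M_i, parameter-major) with
// the cluster's shared initial parameters. One block per cluster stands in for
// the patent's "L threads each restarting M_i threads".
__global__ void fill_cluster_matrix(float* const* cluster_matrices,  // L device pointers
                                    const float* const* init_params, // N_i values per cluster
                                    const int* num_neurons,          // M_i per cluster
                                    const int* num_params)           // N_i per cluster
{
    const int i  = blockIdx.x;        // cluster index
    const int Mi = num_neurons[i];
    const int Ni = num_params[i];
    float* mat = cluster_matrices[i];
    // Grid-stride over neurons so a fixed block size covers any M_i.
    for (int j = threadIdx.x; j < Mi; j += blockDim.x)
        for (int k = 0; k < Ni; ++k)
            mat[k * Mi + j] = init_params[i][k];  // row k = parameter k, column j = neuron j
}
// Launch sketch: fill_cluster_matrix<<<L, 256>>>(d_mats, d_inits, d_M, d_N);
```

With this layout, threads with consecutive neuron indexes write to consecutive addresses, so both the fill and the later per-parameter reads coalesce; this is the practical payoff of making the parameter, rather than the neuron, the first dimension.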
S3, creating a routing table in parallel according to the synaptic parameters.
S3.1, creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters.
The total synapse number S is initialized to 0. P threads are started to process all the synaptic information in parallel, with the c-th thread acquiring the synaptic information with index c in the synaptic parameters. Each thread calculates the number s_c of synapses to be created between its two clusters according to the connection rule in the acquired synaptic information, and the synapse counts of all threads are accumulated to obtain S.
A temporary source neuron table of length s_c and a temporary target neuron table of length s_c are generated by a random number generator; the elements of the temporary source neuron table are neuron indexes contained in the source cluster, and the elements of the temporary target neuron table are neuron indexes contained in the target cluster. The random numbers are generated according to the source and target cluster indexes and lie within the corresponding clusters' neuron index ranges.
Each thread also generates, by the random number generator and according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c; the generated random numbers conform to the weight and delay distributions. In total, 4·P temporary tables are generated, four per piece of synaptic information, and the start index of each table is recorded.
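A sketch of the four temporary tables for one piece of synaptic information, assuming cuRAND as the random number generator, a uniform pick of neuron indexes inside each cluster's index range, and normal weight and delay distributions; these concrete choices and all names go beyond what the patent specifies.

```cuda
#include <curand_kernel.h>

// Sketch of step S3.1: s_c threads fill the four temporary tables in parallel
// for one piece of synaptic information. Distribution choices are assumptions.
__global__ void fill_temp_tables(int s_c, unsigned long long seed,
                                 int src_lo, int src_n,  // neuron index range of source cluster
                                 int dst_lo, int dst_n,  // neuron index range of target cluster
                                 float w_mu, float w_sigma,
                                 float d_mu, float d_sigma,
                                 int* tmp_src, int* tmp_dst,
                                 float* tmp_w, int* tmp_delay)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= s_c) return;
    curandState st;
    curand_init(seed, s, 0, &st);                       // independent stream per synapse
    tmp_src[s] = src_lo + (int)(curand(&st) % (unsigned)src_n);
    tmp_dst[s] = dst_lo + (int)(curand(&st) % (unsigned)dst_n);
    tmp_w[s]   = w_mu + w_sigma * curand_normal(&st);
    int d      = (int)lrintf(d_mu + d_sigma * curand_normal(&st));
    tmp_delay[s] = d < 1 ? 1 : d;   // a pulse needs at least one step to cross a synapse
}
```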
And S3.2, counting the maximum time delay value of each neuron and the number of synapses serving as source neurons according to the temporary neuron table and the temporary time delay table.
Starting S parallel threads, wherein each thread inspects one element of the temporary source neuron tables and temporary delay tables, so as to count for each neuron the maximum delay value max_delay_j and the number s_j of synapses for which it serves as the source neuron, where max_delay_j and s_j denote the maximum delay value and source-synapse count of the j-th neuron of the i-th cluster. To simplify the calculation, neuron indexes are continuous across clusters, i.e. the j-th neuron of the i-th cluster is the same neuron as the j-th neuron among all neurons.
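These per-neuron statistics can be gathered with one thread per synapse and atomic operations, as in the following sketch; the array names and the zero-initialization convention are assumptions.

```cuda
// Sketch of step S3.2: one thread per synapse scans (tmp_src, tmp_delay) and
// accumulates, per source neuron, the synapse count s_j and the maximum delay
// max_delay_j. Both output arrays have length M and are zero-initialized.
__global__ void count_per_neuron(int S, const int* tmp_src, const int* tmp_delay,
                                 int* s_per_neuron, int* max_delay_per_neuron)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= S) return;
    int j = tmp_src[s];                                  // global source neuron index
    atomicAdd(&s_per_neuron[j], 1);                      // s_j
    atomicMax(&max_delay_per_neuron[j], tmp_delay[s]);   // max_delay_j
}
```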
S3.3, initializing a routing table: starting L parallel threads, initializing a one-dimensional empty routing table for each cluster by each thread, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of each target address table is the maximum delay value of the corresponding neurons, namely, for the jth target address table, the table length is the max_delay_j of the jth neurons, and the elements in the target address table are [ target neuron index, weight ] key value pairs.
S3.4, filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
Starting S parallel threads, wherein the processing procedure of the S-th thread is as follows: finding a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table with a starting index smaller than and closest to s.
The element with index [s - start index] (here, the start index of the found temporary source neuron table) in the temporary source neuron table is taken as the index into the routing table to obtain a target address table; the element with index [s - start index] (here, the start index of the found temporary delay table) in the temporary delay table is taken as the first-dimension index of the target address table; and the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] (here, the start indexes of the found temporary target neuron table and temporary weight table) in the temporary target neuron table and the temporary weight table is written into the target address table.
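A sketch of this lookup-and-fill step. It assumes the 4·P temporary tables are concatenated into single device arrays with their start indexes recorded, so the binary search recovers each thread's piece (and local offset s - start index) exactly as described; the routing table itself is simplified to one flat per-neuron list of {delay, target, weight} entries, an assumed layout standing in for the patent's delay-indexed target address table.

```cuda
// Sketch of step S3.4. start_idx[c] is where piece c begins in the
// concatenated temporary tables; route[j] is the entry list of neuron j.
struct RouteEntry { int delay; int target; float weight; };

__device__ int find_piece(const int* start_idx, int P, int s)
{
    int lo = 0, hi = P - 1;
    while (lo < hi) {                       // largest c with start_idx[c] <= s
        int mid = (lo + hi + 1) >> 1;
        if (start_idx[mid] <= s) lo = mid; else hi = mid - 1;
    }
    return lo;
}

__global__ void fill_routing_table(int S, int P, const int* start_idx,
                                   const int* tmp_src, const int* tmp_dst,
                                   const float* tmp_w, const int* tmp_delay,
                                   RouteEntry* const* route,
                                   int* cursor)  // per-neuron write cursor, zeroed
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= S) return;
    int c     = find_piece(start_idx, P, s); // which piece of synaptic information made synapse s
    int local = s - start_idx[c];            // the patent's [s - start index] offset
    (void)local;                             // implicit here because the tables are concatenated
    int j = tmp_src[s];
    int k = atomicAdd(&cursor[j], 1);        // parallel-safe slot inside neuron j's list
    route[j][k] = RouteEntry{ tmp_delay[s], tmp_dst[s], tmp_w[s] };
}
```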
S4, a pulse issuing matrix and a pulse receiving matrix are created for each neuron.
Pulse receiving matrix (receiving matrix for short): used to store the pulses received by neurons; there are L receiving matrices. L parallel threads are started, and the i-th thread initializes the one-dimensional receiving matrix corresponding to the i-th cluster, whose first-dimension length is the number of neurons M_i of the i-th cluster (i.e. the first-dimension index is the neuron index); the elements are pulse-count accumulated values with an initial value of 0.
Pulse issuing matrix (issuing matrix for short): used to store the pulses to be issued by neurons; there are M issuing matrices. M parallel threads are started, and the j-th thread initializes the two-dimensional issuing matrix corresponding to the j-th neuron, whose first-dimension length is the neuron's maximum delay value max_delay_j and whose second-dimension length is the number s_j of synapses for which the neuron serves as the source neuron; the elements are [target neuron index, weight accumulated value] key-value pairs, where the target neuron index is the index of a target neuron to which this source neuron connects, and the initial value of the weight accumulated value is 0.
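Continuing the sketch, the two pulse structures might look as follows: the receiving side is a plain per-neuron accumulator that cudaMemset can zero, while each neuron's issuing matrix is initialized so that every delay row repeats the neuron's target columns (RouteEntry is the type assumed in the routing-table sketch above).

```cuda
// Sketch of step S4's structures; names and layout are assumptions.
struct IssueEntry { int target; float acc_weight; };

struct NeuronIssueMatrix {
    int         max_delay;  // first dimension: delay slots (max_delay_j)
    int         s_count;    // second dimension: synapses with this neuron as source (s_j)
    IssueEntry* slots;      // max_delay * s_count entries
};

// One thread per (delay row, synapse column) pair: every row of a column
// carries the same target index, copied from the neuron's routing entries.
__global__ void init_issue_matrix(NeuronIssueMatrix m, const RouteEntry* route_j)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= m.max_delay * m.s_count) return;
    int col = idx % m.s_count;
    m.slots[idx] = IssueEntry{ route_j[col].target, 0.0f };
}
```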
And S5, performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
S5.1, initializing the current time step to 0.
S5.2, starting M threads to collect pulses in parallel, taking the j-th element of the i-th receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and then emptying the receiving matrix.
And S5.3, starting M parallel threads, wherein each thread updates the neuron parameters of the corresponding neurons in the cluster table, judging whether the current membrane voltage in each updated neuron parameter is greater than a threshold value, if so, indicating that the neurons need to release pulses and executing the step S5.4, otherwise, executing the step S5.6.
It should be noted that updating the neuron parameters of the corresponding neurons in the cluster table follows the neuron parameter-updating logic; updating the membrane-voltage values in the cluster table is conventional technology and is not described in detail in this embodiment.
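Since the update rule itself is left to conventional technology, the kernel below assumes a simple leaky integrate-and-fire update purely for illustration; the row assignment inside the cluster matrix (row 0 = membrane voltage, row 1 = threshold) and the Euler step with the time step folded into tau are also assumptions.

```cuda
// Sketch of step S5.3 for one cluster, assuming a minimal LIF neuron.
__global__ void update_neurons(float* mat, int Mi, float tau, float v_reset,
                               const float* input,  // pulse weights collected in step S5.2
                               int* fired)          // 1 if the neuron must issue a pulse
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= Mi) return;
    float v  = mat[0 * Mi + j];                 // row 0: membrane voltage (assumed)
    float th = mat[1 * Mi + j];                 // row 1: firing threshold (assumed)
    v += -v / tau + input[j];                   // leak plus accumulated input weight
    if (v > th) { fired[j] = 1; v = v_reset; }  // the threshold test of step S5.3
    else        { fired[j] = 0; }
    mat[0 * Mi + j] = v;
}
```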
S5.4, for the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair.
If a plurality of key value pairs exist in the target address table, a plurality of pulse information is finally obtained, the delay value in each pulse information is the target address table index, and the target neuron index and the weight in the pulse information are the key value pairs under the target address table index, namely the [ target neuron index, weight ].
S5.5, filling the pulse information into the issuing matrix corresponding to the j-th neuron of the i-th cluster: the delay value in the pulse information is taken as the first-dimension index of the issuing matrix to obtain s_j elements; among the obtained s_j elements, the element whose target neuron index is the same as that in the pulse information is determined, and the weight in the pulse information is added to the weight accumulated value in the determined element's key-value pair; the weights of multiple pieces of pulse information corresponding to the same target address table are added into the same issuing matrix.
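A sketch of steps S5.4-S5.5 together: for each fired neuron, every routing entry becomes a pulse whose weight is accumulated into the issuing-matrix slot addressed by (delay row, synapse column). Using one block per neuron with threads over route entries, and a 1-based delay convention, are assumptions.

```cuda
// Sketch of steps S5.4-S5.5; reuses RouteEntry and NeuronIssueMatrix above.
__global__ void issue_pulses(const int* fired, int Mi,
                             RouteEntry* const* route, const int* s_count,
                             const NeuronIssueMatrix* issue)
{
    int j = blockIdx.x;                     // one block per neuron of the cluster
    if (j >= Mi || !fired[j]) return;
    NeuronIssueMatrix m = issue[j];
    for (int k = threadIdx.x; k < s_count[j]; k += blockDim.x) {
        RouteEntry e = route[j][k];
        int row = e.delay - 1;              // delay d (>= 1) lands in row d-1
        // atomicAdd keeps repeated firings before delivery consistent.
        atomicAdd(&m.slots[row * m.s_count + k].acc_weight, e.weight);
    }
}
```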
S5.6, starting M parallel threads, each thread restarting s_j × max_delay_j threads, each of which corresponds to one element of the issuing matrix. If the element processed by a thread has a first-dimension index of 0 in the issuing matrix, its weight accumulated value is accumulated into the element of the receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0. Each thread then moves its element forward by one position along the first dimension. Finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the issuing matrix, its weight accumulated value is set to 0.
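A per-neuron sketch of step S5.6 that follows the patent's wording and physically shifts the delay rows toward slot 0; a real implementation would more likely advance a ring-buffer head index instead of moving data, which gives the same semantics without the copies.

```cuda
// Sketch of step S5.6 for one neuron: deliver row 0 into the receiving
// accumulators, shift all rows one delay slot forward, clear the last row.
__global__ void advance_issue_matrix(NeuronIssueMatrix m,
                                     float* recv)  // per-neuron receive accumulators
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // synapse column
    if (k >= m.s_count) return;
    IssueEntry head = m.slots[k];                   // row 0, column k
    if (head.acc_weight != 0.0f)
        atomicAdd(&recv[head.target], head.acc_weight);  // pulses due this step
    for (int r = 0; r < m.max_delay - 1; ++r)       // shift rows toward slot 0
        m.slots[r * m.s_count + k].acc_weight =
            m.slots[(r + 1) * m.s_count + k].acc_weight;
    m.slots[(m.max_delay - 1) * m.s_count + k].acc_weight = 0.0f;
}
```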
S5.7, the current time step is increased by 1.
S5.8, judging whether the current time step is equal to the simulation time step, if not, executing the step S5.2, continuing simulation, and if so, ending the flow.
It should be noted that steps S2-S5 are GPU operation steps. In actual operation the GPU starts the actual network creation and simulation processes according to user instructions, which can be input through an external interface; the input parameters also include the number of simulation time steps and the like.
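Putting the sketches together, the host side of step S5 might take the following shape for a single cluster; error handling, multi-cluster iteration and stream management are omitted, and the per-neuron launch in the last line is purely illustrative (a real system would batch it into one kernel).

```cuda
// Host-side shape of the simulation loop of step S5 (single cluster, sketch).
for (int t = 0; t < sim_steps; ++t) {
    // S5.2: hand the accumulated pulses to the update step, then clear them.
    cudaMemcpyAsync(d_input, d_recv, Mi * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemsetAsync(d_recv, 0, Mi * sizeof(float));
    // S5.3: neuron update and threshold test.
    update_neurons<<<(Mi + 255) / 256, 256>>>(d_mat, Mi, tau, v_reset, d_input, d_fired);
    // S5.4-S5.5: fired neurons write pulses into their issuing matrices.
    issue_pulses<<<Mi, 256>>>(d_fired, Mi, d_route, d_scount, d_issue);
    // S5.6: deliver delay-slot-0 pulses and advance the delay slots.
    for (int j = 0; j < Mi; ++j)  // illustrative only; batch in practice
        advance_issue_matrix<<<(h_issue[j].s_count + 255) / 256, 256>>>(h_issue[j], d_recv);
}
```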
The data structures for neuron, synapse and pulse processing, together with the pulse issuing and receiving scheme realized on them, can significantly improve the parallel efficiency of network creation, pulse-weight acquisition, delay updating, and route transmission and reception on the GPU, realizing full-flow parallelism of brain simulation on the GPU (comprising the four processes of neuron creation, synapse creation, neuron updating and pulse transmission). The advantages of the method for parallelizing the different stages of the simulation flow are as follows:
1. Network creation phase (neuron creation, synapse creation): general systems must construct neurons, synapses and so on step by step in the order of the user's input. The invention designs neuron and synapse table structures better suited to the GPU's many-core characteristics, which can be initialized quickly and whose neuron and synapse parameters can be created and filled in parallel.
2. Neuron parameter update phase: systems typically treat each neuron as the minimum independent unit, each handling its own parameter addressing and updating. The method instead takes the parameter as the minimum unit: the same parameter of all neurons is stored in the same place, enabling fast addressing and fast updating.
3. Pulse transfer phase: the pulse-transmission stage of a typical system processes pulses neuron by neuron and spends considerable time on routing logic. Based on the GPU's many-core characteristics, the invention simplifies and accelerates routing, pulse storage and pulse reception through multithreaded parallel processing.
In order to further illustrate the advantages of the full-flow parallel acceleration brain simulation method provided by the invention, a specific experiment is used for the following description.
An open-source cortical microcircuit network is used for the comparison test. To improve the differentiation of the experimental results, the network specification parameters in the actual test are adjusted as follows: the 8 neuron clusters are kept consistent with the original microcircuit network, the total number of neurons is increased from 77,000 to 770,000, and the total number of synapses from 240 million to 2.4 billion; that is, the network scale is increased tenfold while the network structure is unchanged. A biological time of 1 s is simulated, i.e. 10,000 time steps. The compared brain simulation systems (systems for short), hardware types, hardware names, and network creation and simulation time results are shown in Table 1.
Table 1 Network creation and simulation time comparison
As can be seen from Table 1, the method of the invention is faster than the mainstream brain simulation systems on the market in both network creation and simulation running time, which proves that the invention truly realizes full-flow parallelism and has good prospects for market application.
In another embodiment, as shown in fig. 2, a full-flow parallel acceleration brain simulation system is provided, the full-flow parallel acceleration brain simulation system comprising a memory, a video memory, and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
For specific limitations on the full-flow parallel acceleration brain simulation system, reference may be made to the above limitation on the full-flow parallel acceleration brain simulation method, and no further description is given here.
In a specific embodiment, the full-flow parallel acceleration brain simulation system further comprises a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.

Claims (10)

1. The full-flow parallel acceleration brain simulation method is applied to a GPU, and is characterized by comprising the following steps of:
receiving cluster parameters and synaptic parameters input by a user;
creating a cluster table in parallel according to the cluster parameters;
creating a routing table in parallel according to the synaptic parameters;
creating a pulse issuing matrix and a pulse receiving matrix for each neuron;
brain simulation is performed based on the cluster table, the routing table, the pulse delivery matrix and the pulse receiving matrix.
2. The full-flow parallel acceleration brain simulation method of claim 1, wherein the cluster parameters include a total number of clusters, and a number of neurons, a number of neuron parameters, and a neuron initial parameter for each cluster;
the synaptic parameters comprise the total number of synaptic information, and a source cluster index, a target cluster index, a connection rule, a weight distribution and a time delay distribution contained in each piece of synaptic information.
3. The full-flow parallel acceleration brain simulation method of claim 1, wherein the creating a cluster table from the cluster parameters comprises:
acquiring the cluster parameter table of length L, wherein L is the total number of clusters, and the i-th element in the table represents the information of the i-th cluster, comprising the number of neurons M_i, the number of neuron parameters N_i and the initial neuron parameters;
initializing a cluster table of length L, and initializing each element of the cluster table in parallel with L threads, wherein the i-th element of the cluster table represents the cluster matrix of the i-th cluster, the first-dimension length of which is N_i and the second-dimension length of which is M_i;
and starting L threads to process L cluster matrixes in parallel, wherein M_i threads are restarted in the ith thread, and M_i threads are used for filling the initial neuron parameters of M_i neurons of the ith cluster into the cluster matrix of the ith cluster in parallel.
4. The full-flow parallel acceleration brain simulation method of claim 1, wherein creating a routing table in parallel based on the synaptic parameters comprises:
creating a temporary source neuron table, a temporary target neuron table, a temporary weight table and a temporary time delay table corresponding to each two clusters according to the synaptic parameters;
according to the temporary source neuron table and the temporary delay table, counting the maximum delay value of each neuron and the number of synapses for which it serves as the source neuron;
initializing a one-dimensional empty routing table aiming at each cluster, wherein the length of the routing table is the number of neurons in the cluster, the elements in the routing table are one-dimensional target address tables, the length of the target address tables is the maximum delay value of the corresponding neurons, and the elements in the target address tables are [ target neuron index, weight ] key value pairs;
and filling a routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary time delay table.
5. The full-flow parallel acceleration brain simulation method of claim 4, wherein creating a temporary source neuron table, a temporary target neuron table, a temporary weight table, and a temporary delay table corresponding to each two clusters based on the synaptic parameters, comprising:
starting threads with the same number as the total number of the synaptic information in the synaptic parameters, and processing all the synaptic information by all the threads in parallel, wherein the processing procedure of each thread is as follows:
acquiring corresponding synaptic information, wherein the synaptic information comprises a source cluster index, a target cluster index, a connection rule, weight distribution and time delay distribution;
calculating the number s_c of synapses to be created between two clusters according to a connection rule;
generating, by a random number generator, a temporary source neuron table of length s_c and a temporary target neuron table of length s_c, wherein the elements of the temporary source neuron table are neuron indexes contained in the source cluster and the elements of the temporary target neuron table are neuron indexes contained in the target cluster; and simultaneously generating, by the random number generator according to the weight distribution and the delay distribution, a temporary weight table of length s_c and a temporary delay table of length s_c.
6. The full-flow parallel acceleration brain simulation method of claim 4, wherein filling the routing table according to the temporary source neuron table, the temporary target neuron table, the temporary weight table, and the temporary delay table, comprises:
starting the same number of parallel threads as the total synapse number, wherein the processing procedure of the s-th thread is as follows: finding the temporary source neuron table, the temporary target neuron table, the temporary weight table and the temporary delay table whose start index is smaller than and closest to s;
and taking the element with index [s - start index] in the temporary source neuron table as the index into the routing table to obtain a target address table, taking the element with index [s - start index] in the temporary delay table as the first-dimension index of the target address table, and writing the [target neuron index, weight] key-value pair formed by the elements with index [s - start index] in the temporary target neuron table and the temporary weight table into the target address table.
7. The full-flow parallel acceleration brain simulation method of claim 1, wherein creating a pulsing matrix and a pulse receiving matrix for each neuron comprises:
starting parallel threads with the same number as the total number of the clusters, initializing a pulse receiving matrix corresponding to one dimension of the clusters by each thread, wherein the length of the first dimension of the pulse receiving matrix is the number of neurons of the clusters, the elements are accumulated values of the number of the pulses, and the initial value is 0;
starting parallel threads with the same number as the total neuron number, initializing a two-dimensional pulse issuing matrix corresponding to the neuron by each thread, wherein the first dimension length of the pulse issuing matrix is the maximum delay value of the neuron, the second dimension length is the synapse number of the neuron serving as a source neuron, the elements are [ target neuron index, weight accumulated value ] key value pairs, the target neuron index in the key value pairs is the index of the target neuron to which the neuron serving as the source neuron is connected, and the initial value of the weight accumulated value in the key value pairs is 0.
8. The full-flow parallel acceleration brain simulation method of claim 1, wherein the performing brain simulation based on the cluster table, the routing table, and the pulse issuing and receiving matrices comprises:
1) Initializing the current time step to 0;
2) Starting M threads to collect pulses in parallel, taking the j-th element of the i-th pulse receiving matrix as the pulses received by the j-th neuron of the i-th cluster in the current time step, and emptying the pulse receiving matrix, wherein M is the total number of neurons;
3) Starting M parallel threads, wherein each thread updates the neuron parameters of its corresponding neuron in the cluster table and judges whether the updated membrane voltage is greater than a threshold; if so, the neuron needs to issue a pulse and step 4) is executed; otherwise step 6) is executed;
4) For the j-th neuron of the i-th cluster that needs to issue a pulse, querying the j-th target address table of the i-th routing table, taking out all key-value pairs, and forming [delay value, target neuron index, weight] pulse information from each target address table index (taken as the delay value) and its corresponding key-value pair;
5) Filling the pulse information into the pulse issuing matrix corresponding to the j-th neuron of the i-th cluster: taking the delay value in the pulse information as the first-dimension index of the pulse issuing matrix to obtain s_j elements; among the obtained s_j elements, determining the element whose target neuron index is the same as that in the pulse information, and adding the weight in the pulse information to the weight accumulated value in the determined element's key-value pair, the weights of multiple pieces of pulse information corresponding to the same target address table being added into the same pulse issuing matrix;
6) Starting M parallel threads, each thread restarting s_j × max_delay_j parallel threads, each of which corresponds to one element of the pulse issuing matrix; if the element processed by a thread has a first-dimension index of 0 in the pulse issuing matrix, the weight accumulated value of that element is accumulated into the element of the pulse receiving matrix indexed by the target neuron index in its key-value pair, and the weight accumulated value in the element is then set to 0; each thread then moves its element forward by one position along the first dimension; finally, if the element processed by a thread has a first-dimension index of (max_delay_j)-1 in the pulse issuing matrix, the weight accumulated value in that element is set to 0; wherein max_delay_j is the first-dimension length of the pulse issuing matrix and s_j is the second-dimension length of the pulse issuing matrix;
7) The current time step is increased by 1;
8) Judging whether the current time step is equal to the simulation time step, if not, executing the step 2) to continue simulation, and if so, ending the flow.
9. The full-flow parallel acceleration brain simulation system is characterized by comprising a memory, a video memory and a GPU, wherein:
the memory is used for storing cluster parameters and synaptic parameters input by a user;
the GPU is used for creating a cluster table in a video memory according to the cluster parameters, creating a routing table in the video memory according to the synaptic parameters, creating a pulse issuing matrix and a pulse receiving matrix for each neuron in the video memory, and then performing brain simulation based on the cluster table, the routing table, the pulse issuing matrix and the pulse receiving matrix.
10. The full-flow parallel acceleration brain simulation system of claim 9, further comprising a parameter input interface and an operation command interface;
the parameter input interface is used for receiving cluster parameters and synaptic parameters input by a user and forwarding the cluster parameters and the synaptic parameters to the memory for storage;
and the operation command interface is used for receiving an operation command and a simulation time step input by a user, and the GPU starts to operate after recognizing the operation command.
CN202310349036.XA 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system Pending CN116502683A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310349036.XA CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310349036.XA CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Publications (1)

Publication Number Publication Date
CN116502683A true CN116502683A (en) 2023-07-28

Family

ID=87327572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310349036.XA Pending CN116502683A (en) 2023-03-29 2023-03-29 Full-flow parallel acceleration brain simulation method and system

Country Status (1)

Country Link
CN (1) CN116502683A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117194051A (en) * 2023-11-01 2023-12-08 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117194051B (en) * 2023-11-01 2024-01-23 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117349033A (en) * 2023-12-05 2024-01-05 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117349033B (en) * 2023-12-05 2024-03-08 北京灵汐科技有限公司 Brain simulation processing method and device, electronic equipment and computer readable storage medium
CN117689025A (en) * 2023-12-07 2024-03-12 上海交通大学 Quick large model reasoning service method and system suitable for consumer display card

Similar Documents

Publication Publication Date Title
CN116502683A (en) Full-flow parallel acceleration brain simulation method and system
Lym et al. Prunetrain: fast neural network training by dynamic sparse model reconfiguration
Zhang et al. Poseidon: A system architecture for efficient gpu-based deep learning on multiple machines
WO2019118299A1 (en) Evolving recurrent networks using genetic programming
Liu et al. Block proposal neural architecture search
Ordentlich et al. Network-efficient distributed word2vec training system for large vocabularies
US11507844B2 (en) Asynchronous evaluation strategy for evolution of deep neural networks
EP3940555A2 (en) Method and apparatus of processing information, method and apparatus of recommending information, electronic device, and storage medium
Groh et al. Ggnn: Graph-based gpu nearest neighbor search
CN108986872B (en) Multi-granularity attribute weight Spark method for big data electronic medical record reduction
CN105184368A (en) Distributed extreme learning machine optimization integrated framework system and method
Song et al. Large-scale training system for 100-million classification at alibaba
Canny et al. Machine learning at the limit
JP7196542B2 (en) Learning device and learning method
JP2020030699A (en) Leaning device and leaning method
CN113962358A (en) Information diffusion prediction method based on time sequence hypergraph attention neural network
Sood et al. Neunets: An automated synthesis engine for neural network design
Li et al. Dlw-nas: Differentiable light-weight neural architecture search
Zhang et al. A Survey on Graph Neural Network Acceleration: Algorithms, Systems, and Customized Hardware
WO2019180314A1 (en) Artificial neural networks
US20210264237A1 (en) Processor for reconstructing artificial neural network, electrical device including the same, and operating method of processor
JP7363145B2 (en) Learning device and learning method
CN115544029A (en) Data processing method and related device
Xu et al. Flexible few-shot class-incremental learning with prototype container
Sun et al. Active learning for image classification: A deep reinforcement learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination