WO2018187487A1 - General-purpose parallel computing architecture - Google Patents

General-purpose parallel computing architecture

Info

Publication number
WO2018187487A1
Authority
WO
WIPO (PCT)
Prior art keywords
coprocessors
computing
cores
core
soma
Application number
PCT/US2018/026108
Other languages
English (en)
Inventor
Paul BURCHARD
Ulrich Drepper
Original Assignee
Goldman Sachs & Co. LLC
Priority claimed from US15/481,201 (published as US11449452B2)
Application filed by Goldman Sachs & Co. LLC
Priority to CN201880037698.0A (published as CN110720095A)
Priority to EP18780648.4A (published as EP3607454A4)
Priority to AU2018248439A (published as AU2018248439C1)
Priority to JP2019554765A (published as JP7173985B2)
Priority to CA3059105A (published as CA3059105A1)
Publication of WO2018187487A1
Priority to AU2021203926A (published as AU2021203926B2)
Priority to JP2022177082A (published as JP2023015205A)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8007Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors single instruction multiple data [SIMD] multiprocessors
    • G06F15/8023Two dimensional arrays, e.g. mesh, torus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • This disclosure relates generally to hardware architectures for computing devices and computing systems. More specifically, this disclosure relates to a general-purpose parallel computing architecture, which can support advanced computing functions such as those used in artificial intelligence.
  • The human brain is a massively parallel system typically containing around 100 billion neurons connected by one quadrillion synapses. Ideally, simulating the operation of the human brain could support advanced computing functions such as artificial intelligence. However, conventional attempts at simulating the human brain or designing computing systems that rival the abilities of the human brain have generally been inadequate for a number of reasons, such as not substantially matching the connectivity or three-dimensional structure of the brain.
  • This disclosure provides a general-purpose parallel computing architecture.
  • In a first embodiment, an apparatus includes multiple parallel computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes multiple parallel coprocessors associated with each computing core.
  • Each computing core is configured to provide the input data generated by that computing core to a designated one of the coprocessors associated with each of the computing cores.
  • The coprocessors are configured to process the input data and generate output data.
  • The apparatus further includes multiple reducer circuits.
  • Each computing core is associated with one of the reducer circuits.
  • Each reducer circuit is configured to receive the output data from each of the coprocessors of the associated computing core, to apply one or more functions to the output data, and to provide one or more results to the associated computing core.
  • The computing cores, the coprocessors, and the reducer circuits are arranged laterally side-by-side in a two-dimensional layout.
  • In a second embodiment, an apparatus includes multiple parallel computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes multiple parallel coprocessors associated with each computing core.
  • Each computing core is configured to provide the input data generated by that computing core to a designated one of the coprocessors associated with each of the computing cores.
  • The coprocessors are configured to process the input data and generate output data.
  • The coprocessors in a subset of the coprocessors for each computing core are also configured to collectively apply one or more functions to the output data, and one of the coprocessors in the subset is further configured to provide one or more results to the associated computing core.
  • In a third embodiment, an apparatus includes N parallel computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes NxN coprocessors, where each computing core is associated with N parallel coprocessors. Each computing core is configured to provide the input data generated by that computing core to a designated one of the coprocessors associated with each of the computing cores.
  • The coprocessors are configured to process the input data and generate output data.
  • The apparatus further includes N reducer circuits. Each computing core is associated with one of the reducer circuits. Each reducer circuit is configured to receive the output data from each of the coprocessors of the associated computing core, to apply one or more functions to the output data, and to provide one or more results to the associated computing core.
  • The computing cores, the coprocessors, and the reducer circuits are arranged laterally side-by-side in a two-dimensional layout, and N is an integer having a value of at least sixteen.
  • In a fourth embodiment, an apparatus includes multiple computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes multiple coprocessors associated with each computing core, where each coprocessor is configured to receive the input data from at least one of the computing cores, process the input data, and generate output data.
  • The apparatus further includes multiple reducer circuits, where each reducer circuit is configured to receive the output data from each of the coprocessors of an associated computing core, apply one or more functions to the output data, and provide one or more results to the associated computing core.
  • In addition, the apparatus includes multiple communication links communicatively coupling the computing cores and the coprocessors associated with the computing cores.
  • In a fifth embodiment, an apparatus includes multiple computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes multiple coprocessors associated with each computing core, where each coprocessor is configured to receive the input data from at least one of the computing cores, process the input data, and generate output data.
  • The apparatus further includes multiple communication links communicatively coupling the computing cores and the coprocessors associated with the computing cores.
  • The coprocessors in a subset of the coprocessors for each computing core are also configured to collectively apply one or more functions to the output data, where one of the coprocessors in the subset is further configured to provide one or more results to the associated computing core.
  • In a sixth embodiment, an apparatus includes N parallel computing cores, where each computing core is configured to perform one or more processing operations and generate input data.
  • The apparatus also includes NxN coprocessors, where each computing core is associated with N parallel coprocessors. Each coprocessor is configured to receive the input data from at least one of the computing cores, process the input data, and generate output data.
  • The apparatus further includes N reducer circuits, where each computing core is associated with one of the reducer circuits. Each reducer circuit is configured to receive the output data from each of the coprocessors of the associated computing core, apply one or more functions to the output data, and provide one or more results to the associated computing core.
  • In addition, the apparatus includes multiple communication links communicatively coupling the computing cores and the coprocessors associated with the computing cores.
  • The communication links include links to a shared memory.
  • The shared memory is configured to store the input data from the computing cores and to provide the input data to the coprocessors.
  • The shared memory includes multiple memory locations having multiple memory addresses.
  • The computing cores are configured to write the input data to different memory addresses, and the coprocessors are configured to read the input data from the different memory addresses.
  • FIGURES 1A through 1C illustrate an example general-purpose parallel computing architecture according to this disclosure;
  • FIGURES 2 and 3 illustrate example communications in the computing architecture of FIGURES 1A through 1C according to this disclosure;
  • FIGURES 4 and 5 illustrate example coprocessor functionality in the computing architecture of FIGURES 1A through 1C according to this disclosure;
  • FIGURE 6 illustrates an example programmable coprocessor and reduction functionality in the computing architecture of FIGURES 1A through 1C according to this disclosure;
  • FIGURES 7 and 8 illustrate example computing systems using a general-purpose parallel computing architecture according to this disclosure;
  • FIGURE 9 illustrates an example method for supporting advanced computing functions using a general-purpose parallel computing architecture according to this disclosure;
  • FIGURES 10 through 12 illustrate other example connectivity of components in a general-purpose parallel computing architecture according to this disclosure; and
  • FIGURES 13 through 19 illustrate example communication schemes in a general-purpose parallel computing architecture according to this disclosure.
  • FIGURES 1A through 19, discussed below, and the various embodiments used to describe the principles of the present invention in this patent document are by way of illustration only and should not be construed in any way to limit the scope of the invention. Those skilled in the art will understand that the principles of the invention may be implemented in any type of suitably arranged device or system.
  • The human brain is a massively parallel system that typically contains around 100 billion neurons connected by one quadrillion synapses.
  • The synapses support the transport of signals between the neurons.
  • The human brain is structured very differently from classical Turing machines. Simulating the human brain using a classical Turing machine is impractical given the large number of neurons and synapses typically in the human brain.
  • Attempts to design systems that even somewhat rival the abilities of the human brain have generally been inadequate for a number of reasons. For example, such enormous fan-in and fan-out cannot be practically mapped into a two-dimensional (2D) circuit, which has kept such highly-connected computing architectures out of the mainstream.
  • This disclosure describes various new general-purpose "connectionist" hardware architectures that include a number of highly interconnected processing cores. Among other things, these hardware architectures can accelerate a broad class of algorithms in machine learning, scientific computing, video games, and other areas. In some embodiments, these hardware architectures can be manufactured at reasonable cost using modern techniques such as three-dimensional (3D) integrated circuit techniques.
  • FIGURES 1A through 1C illustrate an example general-purpose parallel computing architecture according to this disclosure.
  • FIGURES 1A through 1C illustrate an example multi-level structure that provides a hardware architecture with high communication bandwidth. Different levels of the structure perform different functions as described below.
  • FIGURE 1A illustrates a first level 100 of the hardware architecture.
  • This level 100 includes multiple computing or processing cores 102, which are referred to as soma cores.
  • Each soma core 102 can receive one or more data signals, perform some type of processing, and transmit one or more input signals.
  • The structure included in each soma core 102 for performing processing operations can range from a very simple processing core to a very complex processing core.
  • For example, the processing unit in each soma core 102 could be a relatively simplistic computing core, such as a general-purpose single instruction, multiple data (SIMD) arithmetic unit.
  • The soma cores 102 could also represent full processing cores, such as those from ARM, INTEL, or other computer processor manufacturers.
  • The group of soma cores 102 could be implemented using existing "many core" processor designs. However, any suitable computing cores could be used to implement the soma cores 102. While the hardware architecture here includes sixteen soma cores 102, any number of soma cores 102 could be supported in the hardware architecture. In particular embodiments, all of the soma cores 102 here could be implemented within a single integrated circuit chip that is referred to as a processor. Also, it should be noted that the soma cores 102 may or may not be homogenous.
  • Each soma core 102 includes processing circuitry 104 and at least one memory device 106.
  • The processing circuitry 104 generally denotes circuitry used to perform some type of processing within the soma core 102. As noted above, the processing could be simplistic or complex, and the processing circuitry 104 can vary depending on the specific processing to be performed.
  • The memory device 106 generally denotes any suitable storage and retrieval device(s), such as one or more registers, for storing data used, generated, or received by the soma core 102. While the memory device 106 is shown in FIGURE 1A as being embedded within a soma core 102, each memory device 106 could in whole or in part be located in any other suitable position(s) accessible to a soma core 102.
  • FIGURE 1B illustrates a second level 110 of the hardware architecture, which is associated with the first level 100 of the hardware architecture.
  • The second level 110 includes a number of coprocessors 112 (referred to as synapse coprocessors) associated with each soma core 102.
  • The synapse coprocessors 112 generally process input data transmitted over signal lines (discussed below) between the soma cores 102.
  • Each soma core 102 could be associated with multiple synapse coprocessors 112.
  • For each soma core 102 in a group of N soma cores, up to N synapse coprocessors 112 could be provided and used to support communications from the soma cores 102 in the group (including that soma core 102 itself) to that soma core 102.
  • Each soma core 102 is communicatively coupled to one synapse coprocessor 112 for each of the soma cores 102 in the group.
  • Each soma core 102 can be communicatively coupled to all N soma cores 102 (via their respective synapse coprocessors 112), although other approaches (including those discussed below) need not do this.
  • The synapse coprocessors 112 of a "local" or "host" soma core 102 are used to receive and process incoming input data from all soma cores 102 (including itself). This effectively allows all N synapse coprocessors 112 for each soma core 102 to receive input data from all N soma cores 102 in parallel in some embodiments. Note that each soma core 102 may typically include the same number of synapse coprocessors 112, although other embodiments could be used. An illustrative model of this arrangement appears in the sketch below.
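  • The following Python sketch is purely illustrative and not part of this disclosure: it models N soma cores, each with N synapse coprocessors, where a value broadcast by soma core i lands in the i-th coprocessor slot of every soma core (including the sender). The array name and sizes are assumptions chosen for the example.

```python
# Illustrative sketch only: an N x N mailbox modeling one synapse coprocessor
# per (receiving core, sending core) pair.
N = 4

# incoming[j][i] holds the input data most recently received by the i-th
# synapse coprocessor of soma core j, i.e. the value broadcast by soma core i.
incoming = [[None] * N for _ in range(N)]

def broadcast(sender_id, value):
    """Soma core `sender_id` publishes `value` to one coprocessor of every core."""
    for receiver_id in range(N):
        incoming[receiver_id][sender_id] = value

# All N soma cores can broadcast in the same cycle; each core then sees N inputs.
for core_id in range(N):
    broadcast(core_id, core_id * 10)

print(incoming[2])  # soma core 2's coprocessors hold [0, 10, 20, 30]
```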
  • Each synapse coprocessor 112 includes any suitable structure supporting the processing of incoming input data for a soma core 102.
  • The synapse coprocessors 112 could have limited capabilities and could be reprogrammable.
  • In some embodiments, each synapse coprocessor 112 includes a programmable or other arithmetic unit 113 and at least one memory device 114.
  • The arithmetic unit 113 denotes any suitable structure configured to execute one or more sequences of instructions to support various functions in the hardware architecture. Examples of these functions include receiving and processing of data in a specific sequence, performing an arithmetic operation on a received input and stored parameters, or forwarding values.
  • The memory device 114 generally denotes any suitable storage and retrieval device(s), such as one or more registers, for storing data used, generated, or received by the synapse coprocessor 112.
  • While the memory device 114 is shown in FIGURE 1B as being embedded within a synapse coprocessor 112, each memory device 114 could in whole or in part be located in any other suitable position(s) accessible to a synapse coprocessor 112.
  • The second level 110 of the hardware architecture also includes various reducer circuits or "reducers" 115.
  • Each reducer 115 receives output data that is produced by all of the synapse coprocessors 112 associated with one of the soma cores 102, processes the received output data in some way, and passes the result or results of the processing to the local soma core 102.
  • For example, each reducer 115 could sum or otherwise accumulate received output data values, identify a minimum or maximum received output data value, or perform some other processing operation. In this way, each reducer 115 processes the output data for a soma core 102 and reduces the amount of data provided to that soma core 102.
  • Each reducer 115 includes any suitable structure for processing multiple output values.
  • In some embodiments, each reducer 115 includes processing circuitry 116 and at least one memory device 117.
  • The processing circuitry 116 generally denotes circuitry used to perform some type of processing within the reducer 115 and is oftentimes much more specialized than the processing circuitry 104 of the soma cores 102.
  • For example, the processing circuitry 116 could include an adder tree formed by accumulators used to sum all of the output values from the synapse coprocessors 112 associated with one soma core 102.
  • The memory device 117 generally denotes any suitable storage and retrieval device(s), such as one or more registers, for storing data used, generated, or received by the reducer 115. While the memory device 117 is shown in FIGURE 1B as being embedded within a reducer 115, each memory device 117 could in whole or in part be located in any other suitable position(s) accessible to a reducer 115.
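  • As a purely illustrative sketch (not part of this disclosure), the following Python function models the kind of pairwise adder tree that a reducer's accumulators could form to sum the outputs of one soma core's synapse coprocessors; the function name and example values are assumptions.

```python
def adder_tree_sum(values):
    """Reduce a list of coprocessor outputs by repeated pairwise addition."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # carry an unpaired value to the next stage
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else 0

print(adder_tree_sum([1, 2, 3, 4, 5]))  # 15, reduced in log-depth stages
```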
  • FIGURE 1C illustrates a third level 120 of the hardware architecture, which is associated with the first and second levels 100 and 110 of the hardware architecture here.
  • The third level 120 includes multiple signal lines 122 and 124 that communicatively couple the soma cores 102, thereby supporting the transport of signals to, from, and between the soma cores 102.
  • The soma cores 102 are fully connected in that each soma core 102 in a group can communicate directly with all other soma cores 102 in the same group via the signal lines 122 and 124 and appropriate configuration of the synapse coprocessors 112.
  • However, less than full connectivity could also be supported within the hardware architecture.
  • Note that the physical layout of the signal lines 122 and 124 in FIGURE 1C is for illustration only and need not represent the actual physical arrangement of signal lines in the hardware architecture.
  • There are various ways to design a network between the soma cores 102, which may or may not support direct communication between all of the soma cores 102 and the synapse coprocessors 112 that receive input data from the soma cores 102.
  • The signal lines 122 and 124 could therefore be arranged to support any desired communication paths in the hardware architecture.
  • Direct connections between each soma core 102 and its associated synapse coprocessors 112 are provided as an example at the logic level and not necessarily as a concrete implementation of a required network.
  • Various mechanisms could be used to provide connections between each soma core 102 and its associated synapse coprocessors 112.
  • In general, each soma core 102 operates to execute desired instructions and process data, possibly including data received from its reducer 115 or other source(s).
  • Each soma core 102 can provide the results of its processing operations to other soma cores 102 (and possibly itself) as input data, and each soma core 102 could receive the input data generated by other soma cores' processing operations via its synapse coprocessors 112.
  • The synapse coprocessors 112 for each soma core 102 can perform desired processing operations on the input data, and data output by the synapse coprocessors 112 can be further processed by the reducer 115 for each soma core 102. Results from the reducers 115 are provided to the local/host soma cores 102, which can use the data to perform additional processing operations.
  • Each synapse coprocessor 112 could receive input data over multiple channels from one soma core 102, and the synapse coprocessors 112 connected to that soma core 102 could perform different processing operations depending on the channels used for the input data.
  • Also, each reducer 115 could receive output data from its associated synapse coprocessors 112 for multiple channels, and the reducer 115 could perform different processing operations depending on the channel over which the output data was received by the synapse coprocessors 112.
  • The channels could denote actual physical channels (such as when data is sent over different signal lines) or logical channels (such as when data is sent over a common signal line with different channel identifiers).
  • Different registers or other memory locations in the soma cores 102, synapse coprocessors 112, and reducers 115 could be used to store different data and different programming instructions for different channels. This allows the hardware architecture to support concurrency or other types of programming operations.
  • The memory device 114 of each synapse coprocessor 112 can include a number of registers.
  • The registers can include registers associated with each possible connection partner (each soma core 102) that are used to hold incoming input data for each connection partner's channel(s).
  • The registers could also include local registers used to hold parameter values and other values used during execution of programming instructions.
  • In some embodiments, processing operations of the synapse coprocessors 112 are described using one or more instructions executed in response to incoming input data, and there are no command loops in the synapse coprocessors 112.
  • Each soma core 102 could individually control the installation of program instructions on its synapse coprocessors 112, and different program instructions can be provided for different channels. For example, there might be an instruction causing a soma core 102 to load the same program to some or all of its synapse coprocessors 112. There might also be instructions causing the soma core 102 to load parameter registers of its synapse coprocessors 112, often with different values. Note that a soma core 102 could load all of this data from a given memory area that is large enough to hold values for all registers of all of the soma core's synapse coprocessors 112.
  • Each soma core 102 could be allowed to read the individual parameter registers of its synapse coprocessors 112 but not the values of the per-channel registers. Instead, the values in the per-channel registers can be processed by the synapse coprocessors 112 and/or be fed into the associated reducer 115, which can be programmed by the local/host soma core 102 to operate on the data received for each channel appropriately.
  • The inputs to each reducer 115 can represent the output values from all synapse coprocessors 112 for the associated soma core 102 on a specific channel.
  • Each soma core 102 could support a number of instructions to facilitate the use of the synapse coprocessors 112 and the reducers 115 as described above.
  • For example, each soma core 102 could support instructions for sending an input data element to (a specific channel of) all soma cores 102, for sending input data to a specific channel of its own synapse coprocessors 112, for receiving results from its own reducer 115, for installing or selecting programs or other instructions in its synapse coprocessors 112 and reducer 115, and for storing data in the parameter registers of the synapse coprocessors 112. Additional details of example instructions supported in the hardware architecture are provided below.
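  • The following Python sketch is purely illustrative and not part of this disclosure: it mimics the kinds of instructions a soma core might expose. The class and method names (Fabric, SomaCore, send_all, install_program, recv_reduced) are assumptions invented for the example and do not appear in the document.

```python
class Fabric:
    """Toy interconnect: routes broadcasts into per-core, per-channel mailboxes."""
    def __init__(self, n_cores, n_channels=2):
        self.n_cores = n_cores
        self.mail = {(j, ch): {} for j in range(n_cores) for ch in range(n_channels)}

    def broadcast(self, sender, channel, value):
        for receiver in range(self.n_cores):
            self.mail[(receiver, channel)][sender] = value

class SomaCore:
    def __init__(self, core_id, fabric):
        self.core_id, self.fabric = core_id, fabric
        self.programs = {}                 # per-channel coprocessor programs

    def send_all(self, channel, value):
        """Broadcast an input data element to (a channel of) all soma cores."""
        self.fabric.broadcast(self.core_id, channel, value)

    def install_program(self, channel, program):
        """Install the same program on all of this core's synapse coprocessors."""
        self.programs[channel] = program

    def recv_reduced(self, channel, reduce_fn=sum):
        """Apply the per-channel coprocessor program, then reduce, as a reducer would."""
        inputs = self.fabric.mail[(self.core_id, channel)].values()
        outputs = [self.programs.get(channel, lambda x: x)(v) for v in inputs]
        return reduce_fn(outputs)

fabric = Fabric(n_cores=3)
cores = [SomaCore(i, fabric) for i in range(3)]
for c in cores:
    c.install_program(0, lambda x: 2 * x)
    c.send_all(0, c.core_id + 1)           # cores broadcast 1, 2, 3
print(cores[0].recv_reduced(0))            # 2*1 + 2*2 + 2*3 = 12
```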
  • In some embodiments, the hardware architecture shown in FIGURES 1A through 1C could be implemented within a single integrated circuit chip.
  • The integrated circuit chip could be fabricated in any suitable manner, such as by using long-standing fabrication techniques like Silicon-on-Insulator (SOI) processes or more recently developed techniques like three-dimensional integrated circuit fabrication.
  • If needed or desired, multiple instances of the hardware architecture shown in FIGURES 1A through 1C could be coupled together and used in order to expand the number of soma cores 102 available for use.
  • For example, multiple integrated circuit chips could be communicatively coupled together to provide any desired number of soma cores 102, such as by coupling the signal lines 122 and 124 of each instance of the hardware architecture using one or more high-speed connections.
  • In some embodiments, each soma core 102 could be configured to perform a specific function or a combination of functions in order to provide desired functionality in the hardware architecture. In other embodiments, each soma core 102 could be programmable so that the function(s) of the soma cores 102 can be defined and can change over time or as desired.
  • Similarly, in some embodiments, each synapse coprocessor 112 and reducer 115 could be configured to perform a specific function or a combination of functions in order to provide desired functionality in the hardware architecture. In other embodiments, each synapse coprocessor 112 and reducer 115 could be programmable so that the function(s) of the synapse coprocessors 112 and reducers 115 can be defined and can change over time or as desired.
  • Note that each soma core 102 is able to communicate via multiple signal lines 122 and 124 at the same time given sufficient communication infrastructure between the soma cores 102.
  • As a result, this hardware architecture can support a massive number of communication connections between computing cores, and those communication connections can all be available for use at the same time. This design therefore represents a hardware architecture with significantly more communication bandwidth than conventional designs.
  • Although FIGURES 1A through 1C illustrate one example of a general-purpose parallel computing architecture, various changes may be made to these figures.
  • For example, a hardware architecture could support any suitable number of soma cores, along with a suitable number of synapse coprocessors and reducers.
  • Also, each soma core, synapse coprocessor, and reducer could be implemented in any other suitable manner, such as by using shared computing resources for the soma cores or synapse coprocessors or by using multiple reducers that allow performing more than one operation concurrently.
  • In addition, various components in FIGURES 1A through 1C could be combined, further subdivided, rearranged, or omitted and additional components could be added according to particular needs.
  • For instance, one or more soma cores 102 may not need to be used in conjunction with a reducer 115.
  • FIGURES 10 through 12 show other possible layouts and connections between components of a general-purpose parallel computing architecture.
  • FIGURES 2 and 3 illustrate example communications in the computing architecture of FIGURES 1A through 1C according to this disclosure.
  • As shown in FIGURE 2, each soma core 102 can have synapse coprocessors 112 that receive input data from all soma cores 102 (including itself). This same pattern can be repeated for all soma cores 102 in a group of soma cores 102.
  • The signal lines 122 and 124 described above can be used to couple each soma core 102 to one synapse coprocessor 112 of all soma cores 102 in a suitable manner to support these communications.
  • In a group of N soma cores 102, each soma core 102 could therefore be provided with N synapse coprocessors 112 (one synapse coprocessor 112 per soma core 102, including itself).
  • Each soma core 102 can broadcast information to all soma cores 102, and each soma core 102 can receive information from all other soma cores 102 via its synapse coprocessors 112.
  • In this way, the N synapse coprocessors 112 for each of the N soma cores 102 can support N independent communication networks between the soma cores 102.
  • FIGURE 3 illustrates one specific example of two of the independent communication networks between soma cores.
  • In this example, one soma core 102a can broadcast input data to one synapse coprocessor 112 of each soma core 102 in the system.
  • Similarly, another soma core 102b can broadcast data to one synapse coprocessor 112 of each soma core 102 in the system.
  • The broadcasting by the soma cores 102a and 102b can, in some embodiments, occur simultaneously.
  • As a result, N soma cores 102 can engage in N broadcasts of data simultaneously.
  • Note that each synapse coprocessor 112 that is broadcasting data could alternatively broadcast the data directly to the synapse coprocessors 112 of all soma cores 102 via the signal lines 122 and 124.
  • Although FIGURES 2 and 3 illustrate examples of communications in the computing architecture of FIGURES 1A through 1C, various changes may be made to these figures.
  • For example, a hardware architecture could support any suitable number of soma cores, along with a suitable number of synapse coprocessors.
  • Also, various components in FIGURES 2 and 3 could be combined, further subdivided, rearranged, or omitted and additional components could be added according to particular needs.
  • In addition, any suitable communications amongst the soma cores 102 could be supported.
  • FIGURES 4 and 5 illustrate example coprocessor functionality in the computing architecture of FIGURES 1A through 1C according to this disclosure.
  • FIGURES 4 and 5 illustrate example mechanisms for implementing the synapse coprocessors 112 described above. Note that these example implementations are for illustration only and that the synapse coprocessors 112 could be implemented in other ways.
  • As shown in FIGURE 4, a synapse coprocessor 112 for the jth soma core 102 can be implemented using the arithmetic unit 113 described above.
  • The arithmetic unit 113 performs one or more desired computations using incoming input data received from the ith soma core 102.
  • The arithmetic unit 113 then outputs the resulting output data to a reducer 115 associated with the jth soma core 102.
  • The reducer 115 can process the outputs from multiple arithmetic units 113 of multiple synapse coprocessors 112 associated with the jth soma core 102 and provide the result(s) to the jth soma core 102.
  • The operation(s) performed by the arithmetic unit 113 in FIGURE 4 could be defined or controlled using a program (φ) 402, and the program 402 operates using one or more parameters 404.
  • The program 402 and the parameter(s) 404 can be stored within the memory device 114 or other location(s).
  • The one or more parameters 404 can be set or controlled by the synapse coprocessor 112, by the associated soma core 102, or in any other suitable manner.
  • Example operations that could be performed by the arithmetic unit 113 can include adding, subtracting, or multiplying values; generating a constant value across all synapse coprocessors 112 associated with a soma core 102; outputting an identifier for the synapse coprocessor 112; selecting one of multiple values based on a test value; or calculating the sign or inverse square root of a value.
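  • As a purely illustrative sketch (not part of this disclosure), the following Python functions model small programs of the kind an arithmetic unit could apply to an incoming value using stored parameters; the function names and parameter names are assumptions.

```python
import math

def weighted(x, p):
    """Multiply the incoming value by a stored weight parameter."""
    return p["weight"] * x

def constant(x, p):
    """Output the same constant across all coprocessors of a soma core."""
    return p["const"]

def select(x, p):
    """Pick one of two stored values based on a test of the incoming value."""
    return p["if_true"] if x > p["threshold"] else p["if_false"]

def inv_sqrt(x, p):
    """Calculate the inverse square root of the incoming value."""
    return 1.0 / math.sqrt(x)

params = {"weight": 0.5, "const": 1.0, "threshold": 0.0, "if_true": 1, "if_false": -1}
print(weighted(8.0, params), constant(8.0, params), select(-3.0, params), inv_sqrt(4.0, params))
```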
  • a "channel identifier" value can be used by the synapse coprocessor 112 to identify which of multiple selectable programs ( ⁇ ) 502 are to be executed by the arithmetic unit 113 on incoming data.
  • the "channel identifier” can also be used to control which parameter(s) 504 are used by the arithmetic unit 113 and where results generated by the arithmetic unit 113 are sent.
  • the selectable programs 502 and the parameters 504 could be stored in the memory device 114 of the synapse coprocessor 112 or in other location(s).
  • each of the arithmetic units 113 and the reducers 115 could be implemented in a pipelined fashion, and incoming data could denote scalar values or small vectors of values.
  • multiple scalar values or at least one vector of values could be received from the i th soma core 102, and a single program 502 or different programs 502 could be applied to the values by the arithmetic unit 113 to produce a sequence of output values.
  • the sequence of output values could be provided to the reducer 115 for further processing.
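  • The following Python sketch is purely illustrative and not part of this disclosure: it uses a channel identifier to pick which program and stored parameter a coprocessor applies, and applies that program element-wise to a small vector of values as a pipeline might. The table layout and values are assumptions.

```python
# channel id -> (program, stored parameter); both entries are invented examples
programs = {
    0: (lambda x, p: p * x,  2.0),    # channel 0: scale by a stored parameter
    1: (lambda x, p: x + p, -1.0),    # channel 1: add a stored offset
}

def process(channel_id, values):
    """Apply the channel's program to each incoming value, one output per input."""
    program, param = programs[channel_id]
    return [program(v, param) for v in values]

print(process(0, [1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0] would be forwarded to the reducer
print(process(1, [1.0, 2.0, 3.0]))  # [0.0, 1.0, 2.0]
```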
  • Although FIGURES 4 and 5 illustrate examples of coprocessor functionality in the computing architecture of FIGURES 1A through 1C, various changes may be made to FIGURES 4 and 5.
  • For example, each synapse coprocessor 112 could be implemented in any other defined or reconfigurable manner.
  • FIGURE 6 illustrates an example programmable coprocessor and reduction functionality in the computing architecture of FIGURES 1A through 1C according to this disclosure.
  • FIGURE 6 illustrates an example mechanism for controlling the programming of the synapse coprocessors 112 and the reducers 115 described above.
  • As shown in FIGURE 6, a reducer 115 is configured to receive the output data from multiple synapse coprocessors 112 associated with a soma core 102. The reducer 115 then performs at least one operation (identified by γ) using the outputs from the synapse coprocessors 112 to generate at least one result that is provided to the associated soma core 102.
  • The one or more computations performed by the reducer 115 could include any suitable operations performed using the outputs from multiple synapse coprocessors 112.
  • For example, the reducer 115 could execute one or more sequences of instructions to support various functions in the hardware architecture.
  • The reducer 115 could perform a programmable operation on the received data and output the result(s) to the associated soma core 102.
  • Example operations can include summing or multiplying the outputs from all synapse coprocessors 112, identifying a minimum or maximum output from the synapse coprocessors 112, or selecting a specific synapse coprocessor's value as the output.
  • A memory device 602 can be used in this structure to store one or more programs (φ) executed by the synapse coprocessors 112.
  • The memory device 602 can also be used to store one or more programs (γ) executed by the reducer 115.
  • The memory device 602 represents any suitable volatile or non-volatile storage and retrieval device or devices, such as part of one or more of the memories 106, 114, 117.
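  • As a purely illustrative sketch (not part of this disclosure), the following Python table shows how a configurable set of reduction programs could be selected when the reducer is programmed by the host soma core; the table name and keys are assumptions.

```python
# A table of candidate reduction programs a reducer could be configured with.
reduction_programs = {
    "sum": sum,
    "min": min,
    "max": max,
    "pick_first": lambda outs: outs[0],   # select a specific coprocessor's value
}

coprocessor_outputs = [4.0, -2.0, 7.5, 0.0]
selected = "max"                           # chosen when the reducer is programmed
print(reduction_programs[selected](coprocessor_outputs))  # 7.5 goes to the soma core
```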
  • Although FIGURE 6 illustrates one example of programmable coprocessor and reduction functionality in the computing architecture of FIGURES 1A through 1C, various changes may be made to FIGURE 6.
  • For example, a hardware architecture could support any suitable number of soma cores, along with a suitable number of synapse coprocessors and reducers.
  • Also, various components in FIGURE 6 could be combined, further subdivided, rearranged, or omitted and additional components could be added according to particular needs.
  • FIGURES 7 and 8 illustrate example computing systems using a general-purpose parallel computing architecture according to this disclosure.
  • As shown in FIGURE 7, a computing system 700 includes at least one processor 702, at least one storage device 704, at least one communications unit 706, and at least one input/output (I/O) unit 708.
  • The processor 702 could denote an integrated circuit chip incorporating the soma cores 102, synapse coprocessors 112, reducers 115, and signal lines 122 and 124 described above.
  • The processor 702 executes instructions, such as those that may be loaded into a memory device 710 and then loaded into the registers or other memories of the soma cores 102, synapse coprocessors 112, and reducers 115.
  • The processor 702 may include any suitable numbers of soma cores 102, synapse coprocessors 112, reducers 115, and signal lines 122 and 124.
  • The memory device 710 and a persistent storage 712 are examples of storage devices 704, which represent any structure(s) capable of storing and facilitating retrieval of information (such as data, program code, and/or other suitable information) on a temporary or permanent basis.
  • The memory device 710 may represent a random access memory or any other suitable volatile or non-volatile storage device(s).
  • The persistent storage 712 may contain one or more components or devices supporting longer-term storage of data, such as a read only memory, hard drive, Flash memory, or optical disc.
  • The communications unit 706 supports communications with other systems or devices.
  • For example, the communications unit 706 could include a network interface card or a wireless transceiver facilitating communications over a wired or wireless network.
  • The communications unit 706 may support communications through any suitable physical or wireless communication link(s).
  • The I/O unit 708 allows for input and output of data.
  • For example, the I/O unit 708 may provide a connection for user input through a keyboard, mouse, keypad, touchscreen, or other suitable input device.
  • The I/O unit 708 may also send output to a display, printer, or other suitable output device.
  • If needed or desired, multiple instances of the hardware architecture shown in FIGURES 1A through 1C could be coupled together and used in order to expand the number of soma cores 102 available for use. For example, multiple integrated circuit chips could be communicatively coupled together to provide any desired number of soma cores 102.
  • An example of this is shown in FIGURE 8, where a multi-processor arrangement 800 could be used in the computing system 700 as the processor 702 or in another computing system.
  • The multi-processor arrangement 800 here includes at least two processors coupled by at least one high-speed connection. In this example, four processors 802-808 are coupled by four high-speed connections 810 in a ring, although any other suitable numbers and arrangements of processors and high-speed connections could be used.
  • Each high-speed connection 810 can support any suitable communication path(s) for coupling multiple instances of the hardware architecture shown in FIGURES 1A through 1C.
  • For example, each high-speed connection 810 can be communicatively coupled to the third level 120 of each instance of the hardware architecture so that the high-speed connection 810 supports the transport of signals between the signal lines 122 and/or 124 of the hardware instances.
  • Each high-speed connection 810 includes any suitable structure for transporting signals between hardware instances, such as between multiple integrated circuit chips.
  • For example, each high-speed connection 810 could be implemented using a photonic connection between two integrated circuit chips.
  • As another example, the integrated circuit chips themselves could support "quilt" packaging, where each integrated circuit chip includes electrical connections along at least one side and the integrated circuit chips are mounted so that electrical connections on different chips contact one another. Note, however, that any other or additional high-speed connections 810 could also be used.
  • Although FIGURES 7 and 8 illustrate examples of computing systems using a general-purpose parallel computing architecture, various changes may be made to FIGURES 7 and 8.
  • For example, the hardware architecture shown in FIGURES 1A through 1C could be used in any other suitable system to perform any suitable functions.
  • FIGURE 9 illustrates an example method 900 for supporting advanced computing functions using a general-purpose parallel computing architecture according to this disclosure.
  • For ease of explanation, the method 900 is described with respect to the hardware architecture shown in FIGURES 1A through 1C. However, the method 900 could be used with any other suitable hardware architecture.
  • As shown in FIGURE 9, processing operations are executed using multiple parallel computing cores at step 902.
  • The processing operations could denote simplistic operations performed by SIMD soma cores 102 up to complex operations performed by full-processor soma cores 102.
  • Note that the operations shown in FIGURE 9 can be executed in order because of dependencies of the operations. Multiple independent chains of the same operations can be performed concurrently, and communication and synapse/reducer operations can be performed in parallel using channel addressing as described above.
  • The processing results from each computing core are published to other computing cores at step 904.
  • The processing results from the computing cores are processed at step 906 and reduced at step 908.
  • The reduced results are provided to the computing cores at step 910. This could include, for example, the reducers 115 providing outputs to their associated soma cores 102.
  • At this point, the method 900 could be repeated, with the computing cores using the reduced results during further execution of the processing operations. Alternatively, the method 900 could end and be repeated later with new data. One pass of this execute/publish/process/reduce cycle is sketched below.
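  • As a purely illustrative sketch (not part of this disclosure), the following Python function walks through one pass of the execute, publish, process, reduce, and provide steps of FIGURE 9; the function names, update rule, and values are assumptions.

```python
def run_step(states, coprocessor_fn, reduce_fn):
    published = [s * s for s in states]              # steps 902/904: compute and publish
    reduced = []
    for core_id in range(len(states)):               # steps 906/908: per-core coprocessing
        outputs = [coprocessor_fn(v) for v in published]
        reduced.append(reduce_fn(outputs))           # step 910: result back to each core
    return reduced

states = [1.0, 2.0, 3.0]
for _ in range(2):                                   # the method can repeat with new data
    states = run_step(states, coprocessor_fn=lambda v: 0.1 * v, reduce_fn=sum)
print(states)
```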
  • Although FIGURE 9 illustrates one example of a method 900 for supporting advanced computing functions using a general-purpose parallel computing architecture, various changes may be made to FIGURE 9.
  • For example, various steps in FIGURE 9 could overlap, occur in parallel, occur in a different order, or occur any number of times.
  • Note that the architecture is described above as being a multi-level structure.
  • For example, the synapse coprocessors 112 and reducers 115 could be located above the soma cores 102, and the signal lines 122 and 124 could be located above the synapse coprocessors 112 and reducers 115.
  • FIGURES 10 through 12 illustrate other example connectivity of components in a general-purpose parallel computing architecture according to this disclosure. For ease of explanation, the connections are described with respect to the components of the architecture shown in FIGURES 1A through 1C.
  • As shown in FIGURE 10, a layout 1000 includes the soma cores 102, a subset of which are shown here. Also, in FIGURE 10, the synapse coprocessors 112 for each soma core 102 are shown as being vertically aligned above that soma core 102. In between each soma core 102 and its synapse coprocessors 112 is a reducer 115. Each soma core 102 sends data to its respective synapse coprocessor 112 for each soma core 102 (meaning to one synapse coprocessor 112 in each column) using its respective signal line 1002.
  • The results of the computations in the synapse coprocessors 112 are sent to the reducers 115 for the soma cores 102 via signal lines 1004. Each reducer 115 sends a result back to its soma core 102 via a respective signal line 1006.
  • In some embodiments, communication on the signal lines 1002 and/or 1004 can be point-to-point, where a synapse coprocessor 112 receives data and then forwards it to the next synapse coprocessor 112 on the line.
  • In other embodiments, each signal line 1002 and/or 1004 includes multiple separate signal lines (such as up to N signal lines).
  • In that case, each signal line 1002 could connect a soma core 102 directly to each synapse coprocessor 112 on the signal line 1002, and each signal line 1004 could connect all synapse coprocessors 112 directly with the associated reducer 115.
  • In still other embodiments, each reducer 115 can be integrated into its associated soma core 102, and no signal lines 1006 are needed.
  • The reducers 115 in this case can be implemented using the computing functionality of the soma cores 102, or the reducers 115 could have their own compute functionality.
  • As shown in FIGURE 11, a layout 1100 includes the soma cores 102, a subset of which are shown here.
  • The soma cores 102 in FIGURE 11 are shown as in FIGURE 10, and they send data through signal lines 1102.
  • However, each soma core 102 here is associated with multiple coprocessor/reducer cores 1104, rather than coprocessors 112 and a separate reducer 115.
  • The coprocessor/reducer cores 1104 are functional units that combine the functionality of the coprocessors 112 and parts of the functionality of the reducers 115.
  • The reducers' functionality can be distributed when certain types of operations are used in the reducers 115, such as associative operations like summing values or finding a minimum or maximum value.
  • The use of associative operations allows intermediate results to be generated in some of the coprocessor/reducer cores 1104.
  • The last coprocessor/reducer core 1104 in the chain for each soma core 102 generates the final reducer result. This can reduce the total length of signal lines 1106, possibly simplifying the physical layout.
  • Each coprocessor/reducer core 1104 includes any suitable structure supporting the processing of incoming input data for a soma core 102. At least some of the coprocessor/reducer cores 1104 also include any suitable structures supporting binary associative or other reduction operations.
  • The signal lines 1106 couple the coprocessor/reducer cores 1104 to one another and to the soma cores 102.
  • Several signal lines 1106 here are shown as loops, going from one coprocessor/reducer core 1104 to the same coprocessor/reducer core 1104. These signal lines 1106 could denote internal communications within those coprocessor/reducer cores 1104 and need not represent actual pathways outside of the coprocessor/reducer cores 1104.
  • In some embodiments, only those coprocessor/reducer cores 1104 that receive data from the signal lines 1106 may include reduction functionality, while the remaining coprocessor/reducer cores 1104 could denote synapse coprocessors only.
  • For example, the coprocessor/reducer cores 1104 in the first, third, fifth, and seventh rows could denote synapse coprocessors only, while the coprocessor/reducer cores 1104 in the second, fourth, sixth, and eighth rows could denote synapse coprocessors with reduction functionality.
  • In this example, each of the coprocessor/reducer cores 1104 in the second row can sum two values or find a minimum or maximum of two values (its own value and a value from the first row) and output the result.
  • Each of the coprocessor/reducer cores 1104 in the fourth row can sum three values or find a minimum or maximum of three values (its own value, a value from the second row, and a value from the third row) and output the result.
  • Each of the coprocessor/reducer cores 1104 in the sixth row can sum two values or find a minimum or maximum of two values (its own value and a value from the fifth row) and output the result.
  • Each of the coprocessor/reducer cores 1104 in the eighth row can sum four values or find a minimum or maximum of four values (its own value, a value from the fourth row, a value from the sixth row, and a value from the seventh row) and output the result.
  • The result from each coprocessor/reducer core 1104 in the eighth row would denote the sum or maximum/minimum value for the associated column. A sketch of this chained reduction appears below.
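  • As a purely illustrative sketch (not part of this disclosure), the following Python function models distributing an associative reduction along a chain of coprocessor/reducer cores, where each core combines its own value with the partial result it receives and the last core emits the final result; the function name and values are assumptions.

```python
def chained_reduce(column_values, combine):
    """Walk the chain of cores in order, carrying a running partial result."""
    partial = None
    for value in column_values:               # one value per coprocessor in a column
        partial = value if partial is None else combine(partial, value)
    return partial                            # produced by the last core in the chain

column = [3, 1, 4, 1, 5, 9, 2, 6]
print(chained_reduce(column, combine=lambda a, b: a + b))  # 31 (sum)
print(chained_reduce(column, combine=max))                 # 9 (maximum)
```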
  • Note that the number of inputs and the source(s) of the input values used by the reduction functionality can vary as needed or desired, and the exact positions of the coprocessor/reducer cores 1104 implementing the reduction functionality can also vary as needed or desired.
  • For example, the number of inputs can vary depending on the overall size of the architecture, such as the number of soma cores and the associated number of coprocessors.
  • Also, the inputs that are used for the reduction operations need not come only from the coprocessor/reducer cores 1104.
  • The inputs used for the reduction operations could also or alternatively come from one or more external agents, such as when the inputs include outputs from other reducers.
  • The signal lines 1102 and 1106 can be used to couple the components together in a suitable manner, and the signal lines 1102 and 1106 can be fabricated using any suitable technique(s).
  • Also, the components in FIGURE 11 could be formed in a single device or in multiple devices that are coupled together.
  • For instance, all of the components shown in FIGURE 11 could be fabricated in a single integrated circuit chip, or different components shown in FIGURE 11 could be fabricated in different integrated circuit chips that are coupled together using electrical connections like serial point-to-point connections, a high-speed bus, or other connections.
  • FIGURE 12 illustrates an example layout 1200 in which multiple integrated circuit chips 1202 and 1204 are used to implement the soma cores 102, the synapse coprocessors 112, and the reducers 115.
  • The coprocessor/reducer cores 1104 could be used here instead of separate synapse coprocessors 112 and reducers 115.
  • Also, at least one additional reducer 115 could be used to further reduce the data sent between the integrated circuit chips 1202 and 1204.
  • One or more communication links 1206 or other communication interfaces could be used to couple the components in different integrated circuit chips 1202 and 1204.
  • For example, the communication link(s) 1206 could include connections from the soma cores 102 in the chip 1202 to the synapse coprocessors 112 in the chip 1204, as well as connections from the reducers 115 in the chip 1204 to the soma cores 102 in the chip 1202.
  • This type of layout may allow for different combinations of integrated circuit chips containing different numbers or types of soma cores 102 and synapse coprocessors 112/reducers 115.
  • Although FIGURES 10 through 12 illustrate examples of physical layouts of components in a general-purpose parallel computing architecture, various changes may be made to FIGURES 10 through 12.
  • For example, a hardware architecture could support any suitable number of soma cores, along with a suitable number of synapse coprocessors, reducers, coprocessor/reducer cores, or signal lines.
  • Also, a wide variety of physical layouts could be used, and FIGURES 10 through 12 do not limit this disclosure to only the illustrated layouts.
  • FIGURES 13 through 19 illustrate example communication schemes in a general- purpose parallel computing architecture according to this disclosure.
  • The actual implementation of a network or other communication mechanism to support data transport between the soma cores 102 and their synapse coprocessors 112 can take on many different forms. The following describes several specific examples of these communication mechanisms, but any other suitable communication scheme can be used to transport data between the soma cores 102 and their synapse coprocessors 112. Also, the example communication schemes provided below apply equally to implementations that do or do not have the soma cores 102 and their respective synapse coprocessors 112 physically collocated.
  • As shown in FIGURE 13, each soma core 102 has a single connection 1302 to one of its synapse coprocessors 112, and the remaining synapse coprocessors 112 for that soma core 102 are daisy-chained together.
  • The first synapse coprocessor 112 in the chain receives data from the soma core 102, and all other synapse coprocessors 112 in the chain receive data from the previous synapse coprocessor 112 in the chain.
  • In this way, the synapse coprocessors 112 receive data from the soma core 102 in sequence, one after the other, until all synapse coprocessors 112 of the soma core 102 have the data.
  • As shown in FIGURE 14, each soma core 102 can instead have multiple connections 1402 to multiple chains of synapse coprocessors 112.
  • The first synapse coprocessor 112 in each chain receives data from the soma core 102, and the data is passed in sequence through the synapse coprocessors 112 in each chain.
  • The data can be provided to the different chains in parallel, allowing for faster delivery of the data to all synapse coprocessors 112 compared to FIGURE 13. Note that while two chains are shown here, any number of synapse coprocessor chains could be used. A sketch of this parallel chain delivery appears below.
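  • As a purely illustrative sketch (not part of this disclosure), the following Python function models delivering one broadcast value through two daisy chains of coprocessors in parallel, so delivery takes roughly half as many forwarding steps as a single chain would; the function name and chain contents are assumptions.

```python
def deliver_through_chains(chains, value):
    """Forward `value` one hop per step along every chain simultaneously."""
    received, steps = {}, 0
    pending = [list(chain) for chain in chains]    # coprocessor IDs still waiting
    while any(pending):
        for chain in pending:
            if chain:
                received[chain.pop(0)] = value     # next coprocessor in each chain gets it
        steps += 1
    return received, steps

chains = [[0, 1, 2, 3], [4, 5, 6, 7]]              # two chains fed in parallel
received, steps = deliver_through_chains(chains, value=42)
print(steps, sorted(received))                     # 4 steps instead of 8
```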
  • As shown in FIGURE 15, a soma core 102 can have a dedicated connection 1502 to each of its synapse coprocessors 112.
  • In this case, the synapse coprocessors 112 receive data directly from the soma core 102, and all of the synapse coprocessors 112 could receive the data in parallel.
  • FIGURE 16 illustrates that a single connection 1602 couples the soma core 102 to one synapse coprocessor 112, and multiple connections 1604 couple that synapse coprocessor 112 to multiple chains of synapse coprocessors 112.
  • Data can be provided from the soma core 102 to the first synapse coprocessor 112, and the data can then be provided into the multiple chains of synapse coprocessors 112 in parallel.
  • FIGURE 17 illustrates that a single connection 1702 couples the soma core 102 to one synapse coprocessor 112.
  • Multiple connections 1704 couple that synapse coprocessor 112 to another level of synapse coprocessors 112, and multiple connections 1706 couple that level of synapse coprocessors 112 to yet another level of synapse coprocessors 112.
  • This configuration can be repeated additional times to support the use of synapse coprocessors 112 on any suitable number of hierarchical levels.
  • While each synapse coprocessor 112 here is coupled to three synapse coprocessors 112 in the next level, this is for illustration only. Any other suitable tree configuration could be supported as needed or desired. Note that another possible arrangement has more than one tree, meaning more than one synapse coprocessor 112 receives data directly from the soma core 102, with each such coprocessor serving as the root node of its own tree.
  • FIGURES 13 through 17 illustrate example configurations of communication paths, these are for illustration only. Various combination of these approaches could also be used, as long as each synapse coprocessor 1 12 can receive data from its associated soma core 102. Also, other or additional approaches could also be used, such as a mesh network that communicatively couples the soma core 102 to neighboring synapse coprocessors 112, which then pass along data to other synapse coprocessors 1 12 through mesh networking. An appropriate structure of the mesh guarantees that each synapse coprocessor 112 receives the data.
  • As shown in FIGURE 18, one possible alternative implementation is to use a virtual network 1802 for the soma cores 102, where data is effectively routed from the soma cores 102 to their synapse coprocessors 112 via the network 1802.
  • the virtual network 1802 could be implemented using logic executed by the synapse coprocessors 112 themselves or using components external to the synapse coprocessors 112.
  • each data package sent over the virtual network 1802 has meta information that allows the data package to reach the correct destination(s).
  • this meta information could have an identifier for the originating soma core 102.
  • using a routing table 1804 (either static or dynamic), data packages can be forwarded from their respective soma cores 102 to the appropriate synapse coprocessors 112.
  • One specific implementation may involve the use of a static routing table for each synapse coprocessor 112, where the address is used as an index into the routing table.
  • each data package could have one or more destination addresses specified by the sending soma core 102, and the virtual network 1802 can route the data packages according to their target addresses.
  • Any suitable mechanism can be used to specify the originating soma cores' identifiers or the target addresses.
  • Example mechanisms include attaching explicit information to each data package or storing each data package at a specific address (such as in the soma cores' address spaces), where the specified address implicitly conveys the required information.
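  • As an illustration of this addressing idea, the following Python sketch models a static routing table keyed by the originating soma core's identifier; the table contents, packet format, and delivery callback are hypothetical and only show how meta information can steer a data package to its destinations.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    source_soma: int   # identifier of the originating soma core
    payload: float

# Hypothetical static routing table: source soma id -> list of
# (destination soma id, synapse coprocessor index) pairs.
ROUTING_TABLE = {
    0: [(0, 0), (1, 0), (2, 0)],
    1: [(0, 1), (1, 1), (2, 1)],
}

def route(packet: Packet, deliver):
    """Forward the packet to every destination listed for its source."""
    for soma_id, synapse_idx in ROUTING_TABLE[packet.source_soma]:
        deliver(soma_id, synapse_idx, packet.payload)

route(Packet(source_soma=1, payload=3.5),
      deliver=lambda s, k, v: print(f"soma {s}, synapse {k} <- {v}"))
```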
  • a "store and forward” network denotes a network in which data is stored by one or more components and retrieved (forwarded) by one or more components.
  • a "store and forward" network allows each soma core 102 to communicate data to the synapse coprocessors 112 by storing the data at specific addresses, and the synapse coprocessors 112 can then read the data from the same addresses.
  • An example is shown in FIGURE 19, where a shared memory 1902 is used to transfer data from the soma cores 102 to the coprocessor/reducer cores 1104 (although the synapse coprocessors 112 and reducers 115 could also be used).
  • the shared memory 1902 includes a number of memory locations 1904.
  • the soma cores 102 can write data to those memory locations 1904, and the synapse coprocessors 112 or coprocessor/reducer cores 1104 can read the data from those memory locations 1904. This can be done in a manner that is optimized internally for the communication pattern of soma cores 102 broadcasting to synapse coprocessors 112 or to coprocessor/reducer cores 1104.
  • memory interfaces 1906 and 1908 are provided and are used to write data to or receive data from the memory locations 1904. Each of the memory interfaces 1906 and 1908 can receive an address, and the memory interface 1906 can also receive data. The memory interface 1906 writes the received data to the received address, and the memory interface 1908 reads requested data from the received address. Note, however, that the memory interfaces 1906 and 1908 could be omitted if the soma cores 102 and the synapse coprocessors 112 or coprocessor/ reducer cores 1104 are configured to read from and write to specified memory locations.
  • the synapse coprocessors 112 or coprocessor/reducer cores 1104 could access the shared memory 1902 in any suitable manner.
  • the synapse coprocessors 112 or coprocessor/reducer cores 1104 could poll the shared memory 1902 to identify new data, or the synapse coprocessors 112 or coprocessor/reducer cores 1104 could receive out-of-band notifications when data is stored in the shared memory 1902.
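  • A minimal sketch of this store-and-forward pattern, assuming a simple convention in which soma core i writes its broadcast value to location i and the coprocessors poll the sender's location; the memory layout and polling loop are illustrative only.

```python
# Shared memory modeled as a flat list of locations, one per soma core.
NUM_SOMAS = 4
shared_memory = [None] * NUM_SOMAS

def soma_write(soma_id: int, value: float) -> None:
    """A soma core stores its broadcast value at its own location."""
    shared_memory[soma_id] = value

def synapse_poll(soma_id: int):
    """A synapse coprocessor polls the sender's location for new data."""
    return shared_memory[soma_id]

soma_write(2, 0.75)
for sender in range(NUM_SOMAS):
    value = synapse_poll(sender)
    if value is not None:
        print(f"received {value} from soma {sender}")
```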
  • each soma core 102 may not have N synapse coprocessors 112. Instead, each soma core 102 may implement N "logical" synapse coprocessors using a smaller number of actual synapse coprocessors 112. In those embodiments, a subset of the logical communication links can be implemented physically, and the different methods described above can be simplified.
  • Although FIGURES 13 through 19 illustrate examples of communication schemes in a general-purpose parallel computing architecture, various changes may be made to FIGURES 13 through 19.
  • any number of other or additional techniques could be used to transfer data between soma cores 102 and associated synapse coprocessors 112.
  • any of the techniques shown here could be used in architectures that include synapse coprocessors 112 and reducers 115 or coprocessor/reducer cores 1104.
  • each soma core 102 can program its synapse coprocessors 112 to execute at least one program φ, and the program(s) φ can be executed as soon as incoming data arrives.
  • the reducer 115 for a soma core 102 executes at least one program ψ using the results of program φ from all of the synapse coprocessors 112 for that soma core 102.
  • each program φ can often execute in O(1) time given a fixed vector size and no loops, and the program ψ can often execute in O(log N) time.
  • the collective processing performed by the synapse coprocessors 112 and the reducer 115 for each soma core 102 could be expressed as:
y_j = ψ( φ_j(x_1, p), φ_j(x_2, p), ..., φ_j(x_N, p) )
  • i denotes the identity of a sender soma core 102 (or the identity of a soma core 102 plus a soma group identifier of the soma core 102), and N denotes the number of soma cores 102 (or the number of soma cores 102 times the number of soma groups).
  • j denotes a channel identifier
  • p denotes one or more parameters (such as parameters 402 or 502) used in the synapse coprocessors 112 (such as state or local variables, which may or may not be channel-specific).
  • x_i denotes the output of the i-th soma core 102
  • y_j denotes the output provided by a reducer 115 as a result to the soma core 102 in channel j.
  • φ_j() denotes the function performed by the synapse coprocessors 112 for the j-th channel using the incoming data x_i and possibly the parameters p
  • ψ() denotes the function performed by the reducer 115 for the local soma core 102 using the outputs of the synapse coprocessors 112.
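  • As an illustration only, the following Python sketch mirrors this formulation for a single channel j, with a hypothetical φ_j that multiplies the incoming value by a stored parameter and a ψ that sums the results (matrix-vector product style); the sizes and values are made up.

```python
N = 4  # hypothetical number of soma cores / synapse coprocessors per core

def phi_j(x_i: float, p_i: float) -> float:
    """Example per-synapse function: multiply incoming data by a stored parameter."""
    return p_i * x_i

def psi(values):
    """Example reduction: sum of all synapse coprocessor outputs."""
    return sum(values)

x = [1.0, -2.0, 0.5, 3.0]   # outputs x_i of the N soma cores
p = [0.1, 0.2, 0.3, 0.4]    # per-synapse parameters p_i for channel j

y_j = psi(phi_j(x[i], p[i]) for i in range(N))
print(y_j)  # the reducer's result delivered back to the local soma core
```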
  • Examples of the φ_j() functions could include:
  • a, b, c, and r could denote names of registers in a synapse coprocessor 112, and x could denote an input value from a soma core 102 (although another register of the synapse coprocessor 112 could also be used instead).
  • the select operation tests the condition in the first parameter (such as by performing a simple non-zero test) and returns either the second parameter or the third parameter based on the result of the test.
  • the index operation may be specific to an implementation with multiple soma groups. Each soma group could include the same number of soma cores 102. More details of soma groups are provided below. In some embodiments, none of the functions implemented by the synapse coprocessors 112 involves loops.
  • Examples of the ψ() functions could include:
  • v denotes the output of a reducer 115 provided to a soma core 102
  • r_i denotes the inputs received by the reducer 115 from the synapse coprocessors 112 (multiple values from the same synapse coprocessor 112 could be obtained in an implementation with multiple soma groups).
  • Each of the max and min functions could return both (i) the maximum or minimum value and (ii) the index value i of the synapse coprocessor 112 that provided the maximum or minimum value.
  • the result of the ⁇ () function could be made available to the soma core 102 using one or more registers.
  • the synapse coprocessors 112 might not be programmed with a traditional program that runs in a loop and that actively retrieves (and if necessary waits for) input. Instead, each channel can be associated with a program φ, and the program φ can be marked as executable when data arrives for the channel and eventually executed when compute resources become available. When all synapse coprocessor programs φ finish, the result of the reduction program ψ can be computed. The computation of the result by the reduction program ψ could start as soon as a minimal number of the synapse coprocessor results are available, with caveats such as the one mentioned below. The results of the reduction program ψ can be saved in per-channel registers. When a soma core 102 issues an instruction to read a reduction result, the reducer 115 may then be ready to produce the next reduction result for that channel. Until then, operation of the reducer 115 for that channel could be blocked.
  • the allocation of registers in the synapse coprocessors 112 and reducers 115 and the allocation of channels can be abstracted if desired. For example, instead of referring to an absolute index for each of these resources in a program specification, an allocation mechanism could be used to achieve the equivalent of multi-program execution. For example, when a program (including the φ and ψ programs) is loaded, the actual registers used can be chosen from available registers of a register file, and an available channel can be selected. No explicit concurrency has to be created since the program is invoked based on incoming data. Upon finishing the program, the used resources in terms of registers and channels can be made available again.
  • the actual instructions executed by the synapse coprocessors 112 and reducers 115 do not have to know about any of this. Rather, the instructions of the uploaded program code could use absolute register numbers or indices, and the abstraction can occur at a higher level where the program loading by the soma core 102 is preceded by appropriate code generation or rewriting based on the needs of the program and the available resources.
  • the reducer 115 could be programmed to either wait for the input values so that the operation order is always maintained (resulting in slowdowns), or the reducer 115 could be programmed to perform the sums out of order (allowing results to be obtained more quickly but with potentially less repeatability).
  • an implementation of the hardware architecture can include more than one group of soma cores 102.
  • Such an approach could implement the soma groups in a single integrated circuit, or different soma groups could be implemented as separate integrated circuits (and the integrated circuits can be coupled together, such as with electrical or optical connections).
  • Several types of programs can be sped up significantly with this type of hardware architecture if an entire data set can be mapped to the soma cores 102.
  • each synapse coprocessor 112 could receive results from exactly one soma core 102.
  • each synapse coprocessor 112 could receive results from one soma core 102 per soma group.
  • this can be expressed just like in an implementation with a single soma group if the resources related to data transfers (such as a register to hold transmitted data and a register to hold a result) are duplicated.
  • a single processor can therefore be implemented to work with up to S soma groups if there are S duplicates of each synapse coprocessor register.
  • each of N independent networks can have one of the N soma cores 102 as its source and connect that soma core 102 to N synapse coprocessors 112 (one on each soma core 102). While a dedicated network for each output of each soma core 102 would minimize possible contention in data transfers, it means that resources go unused when no transmissions are occurring.
  • soma cores 102 work in lockstep and transmit data at approximately the same time, which could be handled well only with dedicated signal lines.
  • the soma cores 102 can lose sync due to various factors, such as minute effects in execution like waiting for resources or different dynamic decisions like branch predictions. In that case, the transmissions would not happen at exactly the same time. Since the transmitted data is usually small, the use of one (or a small number) of networks to connect the soma cores 102 might suffice without significant slowdowns, and it would provide improved utilization of resources.
  • the soma ID can be dropped if each soma core 102 per soma group has its own dedicated network connecting it to a synapse coprocessor 112 on each soma core 102.
  • Another implementation of the connection network could have one single network per soma group, and all data packages have complete addresses attached to them.
  • Another approach would be to provide point-to-point connections with a limited set of soma cores 102 and have recipients distribute data packages further.
  • the recipients can be connected to different subsets of the soma cores 102, and these subsets can be selected to ensure that all soma cores 102 are connected.
  • the subsets can be selected to reduce or minimize the "diameter" of the network, where the diameter of a network refers to the maximal distance (the number of soma cores 102 to step through to reach a target) between two cores 102. Given a fixed upper limit on the number of connections per soma core 102, a hypercube architecture of that degree could minimize the diameter.
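  • For concreteness, the following sketch (with a hypothetical size) builds the neighbor sets of a binary hypercube and checks by breadth-first search that its diameter equals its degree d, so 2^d soma cores can be reached in at most d hops using only d connections per core.

```python
from collections import deque

def hypercube_neighbors(node: int, dim: int):
    """Neighbors of a node in a dim-dimensional binary hypercube."""
    return [node ^ (1 << b) for b in range(dim)]

def diameter(dim: int) -> int:
    """Maximum BFS distance from node 0 (equal to the diameter by symmetry)."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in hypercube_neighbors(u, dim):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

print(diameter(4))  # 16 soma cores, 4 links each, diameter 4
```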
  • soma cores 102 receive data and spread transmissions over as many individual connections as possible.
  • various approaches could be used. For example, well-known algorithms can take the index of a sender soma core 102 and the link that data was received from into account. In those cases, data from each soma core 102 can be sent in a fixed pattern, but the pattern can be different for individual soma cores 102, maximizing the utilization of connections. This approach also allows elimination of a central starting location for each network since each soma core 102 could just communicate with selected neighbors and the neighbors could forward data if necessary.
  • One or more soma cores 102 in a network could be responsible for sending data to other soma groups, and different soma cores 102 may be responsible for communications with different soma groups.
  • Dynamic algorithms can also be used. For example, every received packet can be forwarded from one soma core 102 to all neighbors (except the soma core 102 sending the packet). Each neighbor soma core 102 could then keep track of whether it has already seen the packet. If so, the packet can simply be discarded. If not, the synapse coprocessor 112 for the neighbor soma core 102 receives and forwards the packet.
  • One advantage of this approach is that the network can be completely flooded more quickly.
  • Another advantage of this approach is that integrating multiple soma groups into the design is more straightforward. Changing a 1:N bus architecture (which never has to check for sender conflicts) to an S:N architecture can be a big step.
  • when a soma core 102 of one soma group forwards a packet to another soma core 102 in another soma group, the latter can regard the packet similarly to how it would regard any other incoming packet.
  • the inter-soma-group link can be regarded like a normal inter-soma connection within a soma group.
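  • As an illustration of the dynamic flooding approach described above, the following sketch floods a packet through an arbitrary connection graph, with each node forwarding to all neighbors except the sender and discarding packets it has already seen; the topology is hypothetical.

```python
def flood(graph, start):
    """Flood a packet from start; each node forwards to all neighbors
    except the one it received the packet from, and ignores duplicates."""
    seen = {start}
    frontier = [(start, neighbor) for neighbor in graph[start]]
    while frontier:
        next_frontier = []
        for sender, node in frontier:
            if node in seen:
                continue  # duplicate packet: simply discarded
            seen.add(node)
            next_frontier += [(node, n) for n in graph[node] if n != sender]
        frontier = next_frontier
    return seen  # every core that received the packet

# Hypothetical 6-core topology (adjacency lists).
graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 5], 4: [2, 5], 5: [3, 4]}
print(sorted(flood(graph, start=0)))
```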
  • new instructions can be used to facilitate the use of the synapse coprocessors 112 and the reducers 115.
  • These instructions include instructions executed by the soma cores 102, as well as instructions provided to and executed by the synapse coprocessors 112 and the reducers 115.
  • Table 1 illustrates example instructions that could be executed by a soma core 102 and the synapse coprocessors.
  • oreg denotes a soma core register (such as in the memory device 106)
  • yreg denotes a synapse coprocessor register (such as in the memory device 114).
  • recv channel → oreg1 [, oreg2]: Receive from the local reducer the results of the last computation in channel. The results are stored in the provided registers. Two results are returned for certain reduction operations, which then require two result registers.
  • synapse channel → recv {yreg ...} [reduce ...]: Receive a value from a specified channel and store it in the synapse coprocessor's yreg register.
  • the source of the data can be a 'send' or 'store' instruction. This event may then trigger further synapse coprocessor instructions.
  • a reduction step optionally happens with the different operations as shown in Table 2.
  • Table 2 illustrates example operations that could be executed by a reducer 115. Reduction operations could take logarithmically many cycles, so the reduction operations could benefit from pipelining multiple such operations in different tree levels.
  • each synapse coprocessor 112 can perform SIMD operations.
  • Each soma core 102 can upload, ahead of data communications on a specific channel, sequences of instructions for that channel to a local synapse coprocessor 112. Additionally, each soma core 102 can upload sequences of instructions for that channel to all its synapse coprocessors 112 by broadcasting. The soma core 102 can further program into the reducer 115 the operation that should be performed once the necessary input data becomes available.
  • Table 3 illustrates examples of the types of instructions that could be uploaded to the synapse coprocessors 112 for execution.
  • the hardware architectures described above can accelerate a broad class of algorithms in machine learning, scientific computing, video games, and other areas. Based on the types of instructions above, the following describes how six example types of problems can be accelerated and solved using the hardware architectures described in this patent document.
  • sparse coding takes a normalized input vector x with ‖x‖ = 1 and computes a normalized sparse output vector y that minimizes an energy e, which is defined as:
e = ‖y − Fx‖² + λ‖y‖₁
  • F is a factor matrix
  • ‖y‖₁ denotes the sum of the absolute values of the entries in y
  • λ is a constant that controls the sparseness of the output.
  • the factor matrix F is chosen to minimize the sum E of the energies e_i across a set of training inputs x_i.
  • One way to accomplish both minimizations is gradient descent, with the negative gradients defined as:
  • the training inputs x and the outputs y can reside in a shared virtual or local soma memory.
  • the entries of the factor matrix F (which is not sparse) can reside in registers of the synapse coprocessors 112.
  • the entry F_jk of the factor matrix F can reside in a register of the k-th synapse coprocessor 112 of the j-th soma core 102.
  • the SIMD instructions broadcast by the soma cores 102 to their synapse coprocessors 112 can use relative addressing so that, simultaneously across soma cores 102, the k-th soma core 102 can broadcast the input entry x_k to the k-th synapse coprocessor 112 of the j-th soma core 102.
  • the k-th synapse coprocessor 112 of the j-th soma core 102 in SIMD fashion performs the multiplication F_jk x_k, which is then summed in logarithmic time by the reducer 115 of the j-th soma core 102 across that soma core's synapse coprocessors 112 to yield (Fx)_j and thus the j-th entry (y − Fx)_j.
  • the entry F_jk is incremented proportionally to (y − Fx)_j x_k.
  • the j-th soma core 102 has just computed (y − Fx)_j, and its k-th synapse coprocessor 112 has received the most recent x_k value and stored it in a register of the synapse coprocessor 112.
  • the j-th soma core 102 broadcasts (y − Fx)_j to its k-th synapse coprocessor 112, which then in SIMD fashion multiplies that result by the stored x_k value and adds a multiple of the product to the F_jk value stored at that synapse coprocessor 112.
  • soma cores 102 are multiple instruction, multiple data (MIMD) cores
  • the instructions may be parameterized by i.
  • each soma core 102 can broadcast the same instruction sequence to all of its synapse coprocessors 112.
  • registers are labeled with variable names instead of register numbers. Given these conventions, the sparse coding for deep learning problem can be solved using the hardware architecture as follows.
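  • Since the actual instruction listing is not reproduced here, the following NumPy sketch restates the same computation in conventional matrix form; it assumes the energy e = ‖y − Fx‖² + λ‖y‖₁ reconstructed above, and the step sizes, dimensions, and normalization are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, eta, lam = 16, 8, 0.05, 0.1

def sparse_code(F, x, steps=100):
    """Gradient descent on y for e = ||y - Fx||^2 + lam * ||y||_1."""
    y = np.zeros(len(F))
    for _ in range(steps):
        r = y - F @ x                        # (y - Fx)_j, one entry per soma core j
        y -= eta * (2.0 * r + lam * np.sign(y))
    return y / max(np.linalg.norm(y), 1e-12)  # keep the output normalized

def update_factors(F, x, y):
    """Each F_jk is incremented in proportion to (y - Fx)_j * x_k."""
    return F + eta * np.outer(y - F @ x, x)

F = rng.normal(scale=0.1, size=(n_out, n_in))  # F_jk held by synapse coprocessor k of soma core j
x = rng.normal(size=n_in)
x /= np.linalg.norm(x)
y = sparse_code(F, x)
F = update_factors(F, x, y)
```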
  • a {−1, 1}-valued input vector x and an output vector y can be probabilistically related by a Boltzmann distribution as follows:
  • x is a training input
  • y is sampled from x as explained above
  • x′ is sampled from y
  • y′ is sampled from x′.
  • the training inputs x_k and the outputs y_j can reside in a shared virtual or local soma memory.
  • the couplings F_jk can reside in registers of the synapse coprocessors 112. Specifically, each coupling F_jk can reside in a register of the k-th synapse coprocessor 112 of the j-th soma core 102. To explain how this algorithm is accelerated, the sampling step is first explained.
  • the k-th soma core 102 broadcasts the input entry x_k to the k-th synapse coprocessor 112 of the j-th soma core 102.
  • the k-th synapse coprocessor 112 of the j-th soma core 102 then in SIMD fashion performs the multiplication F_jk x_k, which is then summed in logarithmic time by the reducer 115 of the j-th soma core 102 across that soma core's synapse coprocessors 112 to yield Σ_k F_jk x_k.
  • the j-th soma core 102 then computes the logistic function of this sum and uses it as a probability to randomly sample y_j from {−1, 1}.
  • the computation of the gradient occurs.
  • the j-th soma core 102 broadcasts y_j and (y′)_j to all its synapse coprocessors 112 to be stored in registers there.
  • high-bandwidth communication is used to simultaneously transmit (x′)_k from the k-th soma core 102 to the k-th synapse coprocessor 112 of every soma core 102.
  • the k-th synapse coprocessor 112 of the j-th soma core 102 calculates (y′)_j (x′)_k − y_j x_k and subtracts a multiple of this from the value F_jk that it holds.
  • the backward sampling can be analogous.
  • Given the sampling, the gradient algorithm can be expressed as:
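  • The instruction listing for this algorithm is likewise not reproduced here; the following NumPy sketch expresses the sampling and gradient steps in matrix form. The use of the transposed couplings for the backward sample and the learning rate are assumptions; the update follows the description above of subtracting a multiple of (y′)_j (x′)_k − y_j x_k from F_jk.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_y, eta = 12, 6, 0.01
F = rng.normal(scale=0.1, size=(n_y, n_x))   # couplings F_jk

def sample(W, v):
    """Sample a {-1, 1} vector using the logistic function of W @ v."""
    p = 1.0 / (1.0 + np.exp(-(W @ v)))        # per-soma-core probability
    return np.where(rng.random(len(p)) < p, 1.0, -1.0)

x = rng.choice([-1.0, 1.0], size=n_x)         # training input
y = sample(F, x)                               # y sampled from x
x_p = sample(F.T, y)                           # x' sampled from y (assumes transposed couplings)
y_p = sample(F, x_p)                           # y' sampled from x'

# Each synapse coprocessor holds F_jk and subtracts a multiple of
# (y')_j (x')_k - y_j x_k, which in matrix form is:
F -= eta * (np.outer(y_p, x_p) - np.outer(y, x))
```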
  • As a third example, a different machine learning method that can benefit from better communication is hierarchical clustering.
  • the simplest hierarchical clustering method starts with each item in its own cluster. Then, at each hierarchy level, the hierarchical clustering method groups the two clusters separated by the smallest minimum distance into a single cluster.
  • the first step of an improved hierarchical clustering method involves calculating an initial matrix of distances between clusters.
  • Each active soma core 102 can represent a cluster, and its synapse coprocessors 112 can store the squared distances to other clusters.
  • each cluster is a single item, so each active soma core 102 broadcasts its item's coordinates to the corresponding synapse coprocessors 112 of the other soma cores 102, and its synapse coprocessors 112 in parallel compute the squared distances of the other items to its own item.
  • the second step of the improved hierarchical clustering method involves finding the minimum squared distance between clusters.
  • Each soma core 102 reduces its own synapse coprocessors' squared distances using the minimum operation, and each soma core 102 broadcasts this number to all soma cores 102, which again reduce the values (through their reducers 115) with a minimum operation.
  • the second minimum operation produces the same result on all soma cores 102, assuming there is a predictable tie breaker in cases of equal values (such as selecting the lowest-index synapse coprocessor's value).
  • An alternative is to perform the second minimum operation on one soma core 102 and broadcast back the result to all other soma cores 102.
  • the third step of the improved hierarchical clustering method involves finding the two clusters that are separated by this minimum distance.
  • the soma core 102 corresponding to the best cluster computes the minimum distance to a soma core 102 other than itself, and this next best cluster is then broadcast back to all soma cores 102.
  • the fourth step of the improved hierarchical clustering method involves combining the two chosen clusters into a single cluster.
  • Each soma core 102 takes the minimum of its distances to the best and next best clusters, stores the minimum distance back in the synapse coprocessor 112 corresponding to the best cluster, and broadcasts the minimum distance on this soma core's channel.
  • the soma core 102 corresponding to the best cluster then has all of its synapse coprocessors 112 replace their distances with these broadcast ones. Finally, the next best soma core 102 and its corresponding synapse coprocessors 112 drop out of the computation. The second through fourth steps are then repeated until there is only a single cluster.
  • the first step of calculating the squared distance matrix (repeating for each coordinate) can be expressed as:
  • the second step of finding the minimum distance between clusters can be expressed as: send mindist → cid2
  • the third step of finding the two clusters separated by the minimum distance can be expressed as:
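  • Since the pseudocode for these steps is not reproduced here, the following NumPy sketch runs the same single-linkage procedure on a squared-distance matrix, mirroring the four steps above (distance matrix, global minimum, the two closest clusters, and the merge by taking element-wise minima); the data and cluster count are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.normal(size=(6, 3))               # one item per active soma core

# Step 1: squared distances between clusters (initially single items).
D = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
np.fill_diagonal(D, np.inf)
active = list(range(len(points)))

while len(active) > 1:
    sub = D[np.ix_(active, active)]
    # Steps 2-3: global minimum distance and the two clusters achieving it.
    j, k = np.unravel_index(np.argmin(sub), sub.shape)
    best, next_best = active[j], active[k]
    # Step 4: merge by taking the minimum distance to either cluster.
    merged = np.minimum(D[best], D[next_best])
    D[best] = merged
    D[:, best] = merged
    D[best, best] = np.inf
    active.remove(next_best)                   # next-best cluster drops out
    print(f"merged cluster {next_best} into {best}")
```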
  • Another popular machine learning method involves Bayesian networks, which decompose a complicated joint probability function of many variables into a product of conditional probabilities, each of which involves only a small number of variables (up to the in-degree of the network). The problem then is to compute the marginal distribution of each variable.
  • this can be accomplished using the Belief Propagation Algorithm, which takes time proportional to:
  • Another class of algorithms that can be accelerated from quadratic to constant time by the proposed architectures involves geometric algorithms, such as convex hull algorithms. These algorithms may not require the nonlinear capabilities of the proposed architectures and may only rely on the matrix processing capabilities of the proposed architectures. It has been shown that one key step of these algorithms in high dimensions is dynamic determinant computation. This computation can be accomplished serially in quadratic time by matrix-vector multiplications. However, these multiplications can be reduced to constant time using the proposed architectures.
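  • As one concrete instance of dynamic determinant computation, the following sketch applies the matrix determinant lemma together with the Sherman-Morrison formula; the per-update cost is dominated by matrix-vector products, which is the step the proposed architectures can reduce to constant time. The matrices here are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.normal(size=(n, n))
A_inv = np.linalg.inv(A)
det_A = np.linalg.det(A)

def rank_one_update(A_inv, det_A, u, v):
    """Update det and inverse of A after the rank-one change A + u v^T.

    det(A + u v^T) = det(A) * (1 + v^T A^{-1} u)   (matrix determinant lemma)
    (A + u v^T)^{-1} follows from the Sherman-Morrison formula.
    The work is dominated by matrix-vector multiplications.
    """
    Ainv_u = A_inv @ u
    vT_Ainv = v @ A_inv
    factor = 1.0 + v @ Ainv_u
    new_det = det_A * factor
    new_inv = A_inv - np.outer(Ainv_u, vT_Ainv) / factor
    return new_inv, new_det

u, v = rng.normal(size=n), rng.normal(size=n)
A_inv, det_A = rank_one_update(A_inv, det_A, u, v)
print(np.isclose(det_A, np.linalg.det(A + np.outer(u, v))))  # True
```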
  • the hardware architectures and associated instructions/operations described in this patent document can provide various advantages over prior approaches, depending on the implementation.
  • this disclosure provides hardware architectures that (if implemented with an adequate number of components) allow the architectures to rival the abilities of the human brain.
  • the functionalities of the hardware architectures can be used to improve other fields of computing, such as artificial intelligence, deep learning, molecular simulation, and virtual reality.
  • various functions described in this patent document are implemented or supported by a computer program that is formed from computer readable program code and that is embodied in a computer readable medium.
  • computer readable program code includes any type of computer code, including source code, object code, and executable code.
  • computer readable medium includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), or any other type of memory.
  • a “non-transitory” computer readable medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals.
  • a non-transitory computer readable medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
  • the terms "application" and "program" refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in suitable computer code (including source code, object code, or executable code).
  • the term "communicate," as well as derivatives thereof, encompasses both direct and indirect communication.
  • the term “or” is inclusive, meaning and/or.
  • phrases "associated with,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like.
  • the phrase "at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, "at least one of: A, B, and C" includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Medical Informatics (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)
  • Complex Calculations (AREA)
  • Logic Circuits (AREA)

Abstract

An apparatus includes multiple parallel computing cores (102), each computing core (102) configured to perform one or more processing operations and generate input data. The apparatus also includes multiple parallel coprocessors (112) associated with each computing core. Each computing core is configured to provide the input data generated by that computing core to a designated one of the coprocessors associated with each of the computing cores. The coprocessors are configured to process the input data and generate output data. The apparatus further includes multiple reducer circuits (115). Each computing core is associated with one of the reducer circuits. Each reducer circuit is configured to receive the output data from each of the coprocessors of the associated computing core, to apply one or more functions to the output data, and to provide one or more results to the associated computing core. The computing cores, coprocessors, and reducer circuits are arranged laterally side by side in a two-dimensional configuration.
PCT/US2018/026108 2017-04-06 2018-04-04 Architecture informatique parallèle polyvalente WO2018187487A1 (fr)

Priority Applications (7)

Application Number Priority Date Filing Date Title
CN201880037698.0A CN110720095A (zh) 2017-04-06 2018-04-04 通用并行计算架构
EP18780648.4A EP3607454A4 (fr) 2017-04-06 2018-04-04 Architecture informatique parallèle polyvalente
AU2018248439A AU2018248439C1 (en) 2017-04-06 2018-04-04 General-purpose parallel computing architecture
JP2019554765A JP7173985B2 (ja) 2017-04-06 2018-04-04 汎用並列コンピューティングアーキテクチャ
CA3059105A CA3059105A1 (fr) 2017-04-06 2018-04-04 Architecture informatique parallele polyvalente
AU2021203926A AU2021203926B2 (en) 2017-04-06 2021-06-14 General-purpose parallel computing architecture
JP2022177082A JP2023015205A (ja) 2017-04-06 2022-11-04 汎用並列コンピューティングアーキテクチャ

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/481,201 2017-04-06
US15/481,201 US11449452B2 (en) 2015-05-21 2017-04-06 General-purpose parallel computing architecture

Publications (1)

Publication Number Publication Date
WO2018187487A1 true WO2018187487A1 (fr) 2018-10-11

Family

ID=63712341

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/026108 WO2018187487A1 (fr) 2017-04-06 2018-04-04 Architecture informatique parallèle polyvalente

Country Status (6)

Country Link
EP (1) EP3607454A4 (fr)
JP (2) JP7173985B2 (fr)
CN (1) CN110720095A (fr)
AU (2) AU2018248439C1 (fr)
CA (1) CA3059105A1 (fr)
WO (1) WO2018187487A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112068942A (zh) * 2020-09-07 2020-12-11 北京航空航天大学 一种基于单节点模拟的大规模并行系统模拟方法
CN118590497A (zh) * 2024-08-02 2024-09-03 之江实验室 一种基于异构通信的全归约通信方法及装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407238B (zh) * 2020-03-16 2024-09-24 北京灵汐科技有限公司 一种具有异构处理器的众核架构及其数据处理方法
CN114356541B (zh) * 2021-11-29 2024-01-09 苏州浪潮智能科技有限公司 一种计算核心的配置方法及装置、系统、电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US8949577B2 (en) * 2010-05-28 2015-02-03 International Business Machines Corporation Performing a deterministic reduction operation in a parallel computer
US20150261535A1 (en) * 2014-03-11 2015-09-17 Cavium, Inc. Method and apparatus for low latency exchange of data between a processor and coprocessor
US20160342568A1 (en) 2015-05-21 2016-11-24 Goldman, Sachs & Co. General-purpose parallel computing architecture

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0535867A (ja) * 1990-09-06 1993-02-12 Matsushita Electric Ind Co Ltd 画像処理装置
JPH05242065A (ja) * 1992-02-28 1993-09-21 Hitachi Ltd 情報処理装置及びシステム
JP2561028B2 (ja) * 1994-05-26 1996-12-04 日本電気株式会社 サイドローブキャンセラ
US6829697B1 (en) * 2000-09-06 2004-12-07 International Business Machines Corporation Multiple logical interfaces to a shared coprocessor resource
TWI234737B (en) * 2001-05-24 2005-06-21 Ip Flex Inc Integrated circuit device
US8756264B2 (en) * 2006-06-20 2014-06-17 Google Inc. Parallel pseudorandom number generation
JP5684704B2 (ja) * 2008-05-27 2015-03-18 スティルウォーター スーパーコンピューティング インコーポレイテッド 実行エンジン
EP2565786A4 (fr) * 2010-04-30 2017-07-26 Nec Corporation Dispositif de traitement d'informations et procédé de commutation de tâches
EP3150463B1 (fr) * 2014-05-30 2019-05-22 Mitsubishi Electric Corporation Dispositif de commande de direction

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4965717A (en) * 1988-12-09 1990-10-23 Tandem Computers Incorporated Multiple processor system having shared memory with private-write capability
US4965717B1 (fr) * 1988-12-09 1993-05-25 Tandem Computers Inc
US8949577B2 (en) * 2010-05-28 2015-02-03 International Business Machines Corporation Performing a deterministic reduction operation in a parallel computer
US20150261535A1 (en) * 2014-03-11 2015-09-17 Cavium, Inc. Method and apparatus for low latency exchange of data between a processor and coprocessor
US20160342568A1 (en) 2015-05-21 2016-11-24 Goldman, Sachs & Co. General-purpose parallel computing architecture

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3607454A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112068942A (zh) * 2020-09-07 2020-12-11 北京航空航天大学 一种基于单节点模拟的大规模并行系统模拟方法
CN112068942B (zh) * 2020-09-07 2023-04-07 北京航空航天大学 一种基于单节点模拟的大规模并行系统模拟方法
CN118590497A (zh) * 2024-08-02 2024-09-03 之江实验室 一种基于异构通信的全归约通信方法及装置

Also Published As

Publication number Publication date
AU2018248439B2 (en) 2021-06-03
CN110720095A (zh) 2020-01-21
AU2018248439A1 (en) 2019-10-17
EP3607454A4 (fr) 2021-03-31
JP7173985B2 (ja) 2022-11-17
AU2021203926B2 (en) 2022-10-13
AU2021203926A1 (en) 2021-07-08
JP2020517000A (ja) 2020-06-11
CA3059105A1 (fr) 2018-10-11
AU2018248439C1 (en) 2021-09-30
JP2023015205A (ja) 2023-01-31
EP3607454A1 (fr) 2020-02-12

Similar Documents

Publication Publication Date Title
AU2021229205B2 (en) General-purpose parallel computing architecture
US10210134B2 (en) General-purpose parallel computing architecture
US20220269637A1 (en) General-purpose parallel computing architecture
AU2021203926B2 (en) General-purpose parallel computing architecture
Jones et al. Distributed Simulation of Statevectors and Density Matrices
US20230058749A1 (en) Adaptive matrix multipliers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18780648

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3059105

Country of ref document: CA

ENP Entry into the national phase

Ref document number: 2019554765

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018248439

Country of ref document: AU

Date of ref document: 20180404

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2018780648

Country of ref document: EP

Effective date: 20191106