WO2018026261A1 - Simulation methods, computing nodes and simulation systems for parallel execution

Info

Publication number: WO2018026261A1
Authority: WO (WIPO/PCT)
Application number: PCT/MY2016/050043
Prior art keywords: simulation, region, output, computing node, boundary
Other languages: French (fr)
Inventors: Atsushi Tomoda, Tomohiro Nakamura
Original assignee: Hitachi, Ltd.
Application filed by Hitachi, Ltd.

Classifications

    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/23 - Design optimisation, verification or simulation using finite element methods [FEM] or finite difference methods [FDM]
    • G01V1/282 - Application of seismic models, synthetic seismograms
    • G01V2210/66 - Subsurface modeling
    • G01V2210/67 - Wave propagation modeling
    • G01V99/00 - Subject matter not provided for in other groups of this subclass
    • G06F2111/10 - Numerical modelling

Definitions

  • The present invention relates to simulation methods, computing nodes and simulation systems.
  • Seismic analysis is a process used in oil and gas exploration to recover the subsurface structure. Seismic analysis may enable the discovery of new hydrocarbon layers by analyzing observed artificial seismic waves.
  • Historically, seismic analysis was mostly restricted to two-dimensional (2D) seismic analysis. 2D seismic analysis images a slice of the subsurface. The amount of computation and the size of the seismic data analyzed were limited by the performance of the computing environment.
  • Three-dimensional (3D) seismic analysis consumes a large amount of memory and computational resources. This may be especially so where an algorithm requires the retention of all intermediate data of the simulation in memory until the simulation ends. Therefore, there is a need for a simulation method that can perform 3D seismic analysis with reduced memory and computational requirements.
  • There is provided a simulation method including: providing a region, the region being part of a simulation space; and processing the region in a computing node, wherein the processing includes: dividing the region into an interior and a boundary; generating an interior output based on the region; receiving a neighbor output including output generated based at least partially on another region of the simulation space at an earlier time step; generating a boundary output based at least partially on the received neighbor output; and sending the boundary output to another computing node, the other computing node processing the other region.
  • There is provided a computing node configured to process a region, the computing node including: a processing unit configured to divide the region into an interior and a boundary, the region being part of a simulation space, wherein the processing unit is further configured to generate an interior output based on the region; and a communication unit configured to receive a neighbor output, the neighbor output including output generated based at least partially on another region of the simulation space at an earlier time step, wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output; and wherein the communication unit is further configured to send the boundary output to another computing node processing the other region.
  • There is provided a simulation system including: a divider configured to divide a simulation space into a plurality of regions; and a plurality of computing nodes, each computing node of the plurality of computing nodes configured to process a respective region of the plurality of regions; wherein a computing node of the plurality of computing nodes includes: a processing unit configured to divide a region of the plurality of regions into an interior and a boundary, and further configured to generate an interior output based on the region; and a communication unit configured to receive a neighbor output, the neighbor output including output generated based at least partially on another region of the plurality of regions at an earlier time step, wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output; and wherein the communication unit is further configured to send the boundary output to another computing node of the plurality of computing nodes processing the other region.
  • FIG. 1 shows a flow diagram of a simulation method according to various embodiments.
  • FIG. 2 shows a flow diagram of a simulation method according to various embodiments.
  • FIG. 3 shows a conceptual diagram of a computing node according to various embodiments.
  • FIG. 4 shows a conceptual diagram of a simulation system according to various embodiments.
  • FIG. 5A shows a representation diagram of a simulation space.
  • FIG. 5B shows a diagram of a simulation process.
  • FIG. 6 shows a diagram showing a simulation process according to various embodiments.
  • FIG. 7 shows a timing diagram of a simulation process.
  • FIG. 8 shows a conceptual diagram showing a simulation method according to various embodiments.
  • FIG. 9 shows a timing diagram illustrating the processing time in each computing node according to various embodiments.
  • FIG. 10A shows a timing diagram for a simulation process.
  • FIG. 10B shows a timing diagram of a simulation method according to various embodiments.
  • FIG. 11 shows a sequence diagram of a user interface according to various embodiments.
  • FIG. 12 shows a hardware architecture for a simulation system according to various embodiments.
  • FIG. 13 shows a system diagram of the simulation system.
  • FIG. 14 shows a system diagram of the simulation system after execution of a simulation.
  • FIG. 15 shows an example of a pseudo code of an interface for a user-defined function.
  • FIG. 16A shows an example of a first pseudo code of the user-defined function.
  • FIG. 16B shows an example of a second pseudo code of the user-defined function.
  • FIG. 17 shows a flowchart of a parallel processing manager according to various embodiments.
  • FIG. 18 shows an example of a management table for a parallel processing manager according to various embodiments.
  • FIG. 19 shows a flowchart of a parallel processing worker according to various embodiments.
  • FIG. 20 shows a first management table and a second management table.
  • FIG. 21 shows a conceptual diagram of a memory space for simulated data, according to various embodiments.
  • FIG. 22 shows examples of pseudo code for the data access libraries used to access the virtual matrix.
  • Embodiments described below in the context of the methods are analogously valid for the respective computing nodes and simulation systems, and vice versa. Furthermore, it will be understood that the embodiments described below may be combined; for example, a part of one embodiment may be combined with a part of another embodiment. Any property described herein for a specific method may also hold for any method described herein; any property described herein for a specific computing node may also hold for any computing node described herein; and any property described herein for a specific simulation system may also hold for any simulation system described herein. Furthermore, it will be understood that for any method, computing node or simulation system described herein, not all of the described components or steps must necessarily be included; only some (but not all) of the components or steps may be included.
  • The simulation system as described in this description may include a memory which is, for example, used in the processing carried out in the device.
  • A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory), or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), an EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).
  • "Coupled" may be understood as electrically coupled, as being in communication, or as mechanically coupled, for example attached or fixed, or just in contact without any fixation, and it will be understood that both direct coupling and indirect coupling (in other words, coupling without direct contact) may be provided.
  • Particular embodiments will now be described by way of example, and not limitation, with reference to the figures.
  • High speed simulation with high accuracy is required in many industrial fields, for example, to analyze and forecast physical phenomena.
  • One of the industrial fields requiring high-speed and high-accuracy simulation is seismic analysis, a process used in oil and gas exploration.
  • Seismic analysis recovers the subsurface structure and enables the discovery of new hydrocarbon layers by analyzing observed artificial seismic waves.
  • Historically, seismic analysis was mostly restricted to two-dimensional (2D) seismic analysis.
  • 2D seismic analysis images a slice of the subsurface.
  • The amount of computation and the size of the seismic data analyzed were limited by the performance of the computing environment.
  • 3D seismic analysis consumes a large amount of memory and computational resources. This may be especially so where an algorithm requires keeping all intermediate data of the simulation in memory until the simulation ends. Therefore, there is a need for a simulation method that can perform 3D seismic analysis with reduced memory and computational requirements.
  • A "neighboring region" may also be referred to herein, without limitation, as a "neighbor region".
  • A "neighboring node" may also be referred to herein, without limitation, as a "neighbor node", a "neighbor computing node" or a "neighbor worker computing node".
  • FIG. 1 shows a flow diagram 100 of a simulation method according to various embodiments.
  • The simulation method may include various processes, including 102 and 104.
  • In 102, a region may be provided.
  • The region may be part of a simulation space.
  • In 104, the region may be processed in a computing node.
  • The processing of the region in the computing node may include a plurality of sub-processes, including 106, 108, 110, 112 and 114.
  • In 106, the region may be divided into an interior and a boundary.
  • In 108, an interior output may be generated based on the region.
  • In 110, a neighbor output, comprising output generated based at least partially on another region of the simulation space at an earlier time step, may be received.
  • In 112, a boundary output may be generated based at least partially on the received neighbor output.
  • In 114, the boundary output may be sent to another computing node.
  • The other computing node (the "another computing node") may process the other region (the "another region").
  • The other region may be adjacent to the region in the simulation space.
  • A simulation method may include process 102, in which a region that is part of a simulation space is provided.
  • The region may be provided by dividing a simulation space into a plurality of regions.
  • The simulation method may further include process 104, in which the region may be processed in a computing node.
  • Each region of the plurality of regions may be processed in a respective computing node of a plurality of computing nodes.
  • The plurality of computing nodes may be configured to process the plurality of regions at least substantially concurrently. At least two computing nodes may process their respective regions at least substantially concurrently.
  • The process 104 may include sub-process 106, in which the region may be divided into an interior and a boundary.
  • The process 104 may further include sub-process 108, in which an interior output may be generated based on the region.
  • The interior output may be generated based on data within the region only, in other words, without requiring any data from neighboring regions or neighboring computing nodes (see the sketch below).
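  • As an illustration of this interior/boundary split, the following sketch (not from the patent text; the stencil radius and function names are assumptions) computes the interior of a region given how far the user-defined function reaches along each axis:

```python
# Illustrative sketch: splitting a region into an interior and a boundary,
# assuming the user-defined function reads neighbor values up to a fixed
# stencil radius along each axis.

def split_region(x_range, y_range, z_range, radius=1):
    """Return the interior range; all remaining points form the boundary."""
    return tuple((lo + radius, hi - radius)
                 for lo, hi in (x_range, y_range, z_range))

# Example: the region [300:399, 200:299, 100:199] with radius 1 has the
# interior [301:398, 201:298, 101:198], matching the example of FIG. 20.
print(split_region((300, 399), (200, 299), (100, 199)))
# ((301, 398), (201, 298), (101, 198))
```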
  • The process 104 may further include sub-process 110, in which a neighbor output is received.
  • The neighbor output may include output generated based at least partially on another region of the simulation space at an earlier time step.
  • The process 104 may further include sub-process 112, in which a boundary output may be generated based at least partially on the received neighbor output.
  • The process 104 may further include sub-process 114, in which the boundary output may be sent to another computing node.
  • The neighbor output may include an interior output of another region generated by a neighboring computing node at an earlier time step.
  • The neighbor output may also include a boundary output of the other region generated by the neighboring computing node at the earlier time step.
  • The neighboring computing node processing the other region may be referred to herein as the other computing node.
  • The other computing node may generate a boundary output of the other region in a future time step, based at least partially on the sent boundary output.
  • FIG. 2 shows a flow diagram 200 of a simulation method according to various embodiments.
  • The simulation method may be similar to the simulation method shown in the flow diagram 100, in that it includes processes 102 and 104.
  • The simulation method shown in flow diagram 200 may further include processes 116 and 118.
  • In 116, a user-defined function may be received.
  • The dividing of the region, i.e. sub-process 106, may include determining the inputs of the user-defined function.
  • The user-defined function may be the simulation function.
  • The user-defined function may be the function through which each point in the region is processed.
  • In 118, a user interface may be provided.
  • The user interface may be configured to receive user inputs comprising at least one of information on the computing node, a request to execute the simulation, or a request to access results of the executed simulation.
  • FIG. 3 shows a conceptual diagram of a computing node 300 according to various embodiments.
  • The computing node 300 may be configured to process a region. The region may be part of a simulation space.
  • The computing node 300 may include a processing unit 330 and a communication unit 332.
  • The processing unit 330 may be coupled to the communication unit 332.
  • The processing unit 330 may be configured to divide the region into an interior and a boundary.
  • The processing unit 330 may be further configured to generate an interior output based on the region.
  • The communication unit 332 may be configured to receive a neighbor output; the processing unit 330 may be further configured to generate a boundary output based at least partially on the received neighbor output.
  • The neighbor output may include output generated based at least partially on another region of the simulation space at an earlier time step.
  • The communication unit 332 may be further configured to send the boundary output to another computing node processing the other region.
  • The computing node 300 may be configured to carry out the simulation method shown in the flow diagram 100 or the simulation method shown in the flow diagram 200.
  • FIG. 4 shows a conceptual diagram of a simulation system 400 according to various embodiments.
  • The simulation system 400 may include a divider 440 and a plurality of computing nodes 300.
  • The divider 440 may be configured to divide a simulation space into a plurality of regions.
  • Each computing node 300 of the plurality of computing nodes 300 may be configured to process a respective region of the plurality of regions. For example, a first region may be allocated to a first computing node. The first computing node may process the first region.
  • A second region may be allocated to a second computing node. The second computing node may process the second region. If the second region is adjacent to the first region in the simulation space, the second region may be a neighbor region to the first region.
  • The second computing node may be a neighbor computing node to the first computing node.
  • Each computing node 300 may include a processing unit 330 and a communication unit 332, as shown in FIG. 3.
  • The divider 440 may be coupled to the plurality of computing nodes.
  • Each computing node may be coupled to the other computing nodes. The coupling may be in the form of a data connection.
  • A simulation method may enable fine-grained parallelization.
  • The simulation may be applied in physical simulation. Physical simulation may be realized by discretizing a partial differential equation. Since a partial differential equation may be described locally, the result of the simulation at a point at a time step may be calculated just from past values near the point.
  • The simulation space may be divided and the divided spaces may be assigned to multiple computing nodes. The amount of memory consumed in each computing node may thereby be reduced. However, if the size of each divided space is very small, the time taken to transfer the calculated data to the other computing nodes may be larger than the time taken to calculate. The total processing time for the simulation may then be bounded by the sum of the time for calculation and the time for transfer.
  • The simulation method may "hide" the data transfer time to reduce the total processing time.
  • The data transfer time is "hidden" by having the transfer performed in parallel with the computation in the computing node.
  • The computing node may compute the values for points in the region even before having access to the data from neighboring regions, as the computation is conducted in two steps, wherein the first step may not require data from neighboring regions.
  • A simulation method may be applied to graphs.
  • A 3D physical space may be viewed as a graph, for example taking the form of a 3D mesh.
  • The simulation may be a simulation on a graph. If a given graph can be divided into multiple subgraphs and the calculation for the simulation at a point and a time step is described by past values near the point, the simulation method may be applied.
  • The simulation method may be applied to traffic simulation in a city, since the road network may be represented as a graph and the flow of traffic may have a local description.
  • FIG. 5A shows a representation diagram 500A of a simulation space 550.
  • The simulation space 550 may be a physical region R, or may contain representations of a physical region.
  • The simulation space may be a 2D or a 3D matrix. Assuming the simulation space 550 is a 3D matrix, its coordinates may be denoted as (x, y, z), where each of x, y and z represents a respective dimension of the simulation space 550. Any point in the simulation space 550 may be identified by its coordinates. Each point in the simulation space 550 may also be referred to herein as a cell.
  • FIG. 5B shows a diagram 500B of a simulation process.
  • A simulation process may involve multiple time steps.
  • The points in the simulation space 550 may evolve with each time step of the multiple time steps.
  • The simulation process may compute the changes in the values of each point in the simulation space 550 over a period of time.
  • The period of time may be represented by the multiple time steps, or discretized into a plurality of time steps.
  • The value of each point in the simulation space 550 is then computed for each time step of the plurality of time steps.
  • The physical state S_t of the simulation space 550 may change with each time step in the simulation process.
  • The physical state of the simulation space 550 at the i-th time step may be denoted as S_i 556.
  • At the initial time step, the physical state of the simulation space may be S_0 552.
  • At the first time step, the physical state of the simulation space may be S_1 554.
  • The value of each cell in the matrix may be replaced by the simulation results of the respective cell from S_0 552.
  • Each cell of S_0 552 may be processed using a simulation function, also referred to herein as a user-defined function, to become the new value of the same cell of S_1 554.
  • The processing of each cell of the simulation space may continue with each subsequent time step.
  • FIG. 6 shows a diagram 600 showing a simulation process according to various embodiments.
  • The physical state S_i may be calculated from the physical states of the two previous time steps, i.e. S_{i-2} and S_{i-1}.
  • The computation process 660 may be denoted as P_i, as expressed below.
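  • In the notation of FIG. 6, the two-step dependence may be written as the following recurrence (a restatement of the text, with P_i denoting the computation process 660):

```latex
S_i = P_i\left(S_{i-1},\, S_{i-2}\right), \qquad i \geq 2
```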
  • FIG. 7 shows a timing diagram 700 of a simulation process.
  • The timing diagram 700 may include an axis 770 representing time.
  • The simulation space 550 may be divided into a plurality of regions, where each region is assigned to a respective computing node.
  • The simulation process may, for example, be applied to process seismic algorithms, where most of the processing time may be consumed in simulating the wave equation and similar processes.
  • Each computing node may only need to process a relatively small amount of data as compared to the entire simulation space 550. As such, the consumption of memory in each computing node may be reduced.
  • Each region may be processed concurrently in the respective computing node.
  • In each time step, each computing node may again carry out the computation process 660 before the results of the processing are transferred between the plurality of computing nodes.
  • The time taken for each computing node to carry out the computation process 660 may be referred to herein as the processing time 772.
  • The time taken for the data transfer and synchronization may be referred to herein as the transfer time 774.
  • The total amount of time taken for the simulation may be the sum of the processing time 772 and the transfer time 774.
  • This method of performing the simulation may be efficient only if the processing time 772 for each time step is sufficiently large compared to the transfer time 774.
  • To increase parallelism, the simulation space should preferably be divided into a large number of regions.
  • However, the transfer time 774 may increase as data needs to be transferred between a large quantity of computing nodes.
  • The overall time taken to complete the simulation process may then not be reduced even with an increased number of computing nodes, as the overall time taken may be bounded by the transfer time 774.
  • The transfer time 774 may thus be a bottleneck to speeding up the simulation process.
  • The simulation method according to various embodiments may hide the time taken for data transfer between computing nodes, or in other words, perform processing tasks in parallel with transferring data. As such, the simulation method may reduce the overall processing time by enabling parallelized simulation across more computing nodes.
  • FIG. 8 shows a conceptual diagram 800 showing a simulation method according to various embodiments.
  • The simulation method may include parallelizing the simulation across multiple computing nodes.
  • The simulation method may include dividing a simulation space 550, also referred to herein as simulation region R or a physical cube, into a plurality of regions. Each region may be a small cube. For each time step, in either ascending or descending order, each region may be processed by a respective computing node.
  • The processing of each region at each time step may include dividing the region into a boundary 882 and an interior 880. Processing, in other words calculating, the values of each point in the boundary 882 may require the calculated values of the neighbor regions from an earlier time step. In contrast, calculating the values of each point in the interior 880 may not require the calculated values of the neighbor regions from an earlier time step. Take, for example, a region 890.
  • The neighbor regions of the region 890 may include R_n 884a, R_w 884b, R_s 884c and R_e 884d.
  • R_n 884a may be another region in the simulation space 550, located to the north of the region 890.
  • R_w 884b may be another region in the simulation space 550, located to the west of the region 890.
  • R_s 884c may be another region in the simulation space 550, located to the south of the region 890.
  • R_e 884d may be another region in the simulation space 550, located to the east of the region 890.
  • Processing the region 890 at each time step may further include calculating values for the interior 880 while waiting to receive the calculated values of the neighboring regions from the earlier time step, calculating values for the boundary 882, sending the calculated values to the neighboring nodes, and then moving to the next time step. The sending of the calculated values to the neighboring nodes need not be completed before the computing nodes move on to process their respective regions in the next time step.
  • The process of dividing the region 890 into the interior 880 and the boundary 882 may include analyzing which points in the region 890 can be processed with the user-defined function without requiring values from outside of the region 890.
  • The inputs of the user-defined function may be determined.
  • The division of the region into the interior 880 and the boundary 882 may be based on the furthest distance between the determined inputs of the user-defined function. For example, if the user-defined function requires inputs from x - 1, then the most exterior row along the x dimension may be defined as the boundary, as the values of that row cannot be processed with the user-defined function without data from outside of the region. A sketch of this analysis follows.
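  • The following sketch is an illustration, not the patent's implementation; describing the user-defined function's inputs as a list of spatial offsets is an assumption. It derives the boundary width along each axis from the furthest input offset:

```python
# Hypothetical sketch: deriving the boundary width from the inputs of the
# user-defined function, described here as offsets (dx, dy, dz) relative to
# the point being computed.

def boundary_width(stencil_offsets):
    """Furthest input distance along each axis gives the boundary thickness."""
    return tuple(max(abs(offset[axis]) for offset in stencil_offsets)
                 for axis in range(3))

# The 7-point stencil of the wave-equation example (FIG. 16B) reads the
# point itself and its distance-1 neighbors along x, y and z:
offsets = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0),
           (0, 0, 1), (0, 0, -1)]
print(boundary_width(offsets))  # (1, 1, 1): a one-cell-thick boundary
```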
  • The user-defined function may be received through a user interface provided on a client computer.
  • A user may enter parameters through the client computer to provide the user-defined function.
  • The parameters may also include information on the plurality of computing nodes, for example the IP addresses and the identifier codes of the computing nodes.
  • The user interface may also be configured to receive a request to execute the simulation.
  • The request to execute the simulation may include a range of the physical cube and the range of time steps to simulate.
  • In other words, the request to execute the simulation may include a range of the simulation space 550 to be processed and a range of time steps to be processed.
  • The user interface may also be configured to receive a request to access the results of the simulation.
  • The request to access the results of the simulation may include the range of the simulation space 550 and the range of time steps to access.
  • FIG. 9 shows a timing diagram 900 illustrating the processing time in each computing node according to various embodiments.
  • The timing diagram 900 includes an axis 990 indicating time.
  • The processing time P_i may include an interior processing time 992a, PI_i, and a boundary processing time 992b, PB_i.
  • The interior processing time 992a may be the time taken to perform calculations which do not require calculation results from neighboring regions, i.e. the interior processing time 992a may be the time taken to process the interior.
  • The boundary processing time 992b may be the time taken to perform calculations requiring calculation results from neighboring regions, i.e. the boundary processing time 992b may be the time taken to process the boundary.
  • FIG. 10A shows a timing diagram 1000A for a simulation process.
  • The timing diagram 1000A may include a time axis 1010.
  • The method may include dividing a simulation space 550 into small regions, assigning each small region to a respective computing node, and executing the simulation in each computing node in parallel.
  • Physical simulation in a time step may require data generated in the simulation at earlier time steps. In some cases, simulation at a point may only require the data generated in the simulation of the neighborhood at earlier time steps.
  • The neighborhood may be processed by a neighbor node.
  • The neighborhood may also be referred to herein as a neighbor region.
  • The first node, also referred to herein as the own node, may process the points in its own region in parallel with the processing in the neighbor node.
  • Each of the own node and the neighbor node may take the processing time 772 to perform the computation process P_i.
  • The sending of the results from the neighbor node to the own node may take up the transfer time 774.
  • The total time taken for the simulation may be the sum of the processing time 772 and the transfer time 774, multiplied by the number of time steps.
  • FIG. 10B shows a timing diagram 1000B showing a simulation method according to various embodiments.
  • The timing diagram 1000B may include a time axis 1012.
  • This simulation method may complete a simulation in a shorter time as compared to the method shown in FIG. 10A.
  • This simulation method may include classifying each region assigned to a respective computing node into inner regions.
  • The inner regions may include an interior and a boundary.
  • The interior may be defined as an inner part of the region, where any point in the interior may be calculated solely based on data inside the region.
  • The boundary may be defined as an outer part of the region, where a point in the boundary may be calculated based on data outside of the region. A point in the boundary may also be calculated further based on data inside the region.
  • The computing node may calculate the values in the interior, wait to receive the calculation results from the computing nodes assigned to neighboring regions, calculate the values in the boundary, send the calculation results for the boundary to the computing nodes assigned to the neighboring regions, and move to the next time step without waiting for completion of the data transfer.
  • The computing node may calculate the interior values concurrently with receiving the calculation results of the neighboring regions from the previous time step.
  • The interior values may also be referred to herein as the interior output.
  • The computing node may calculate the boundary values using the received calculation results of the neighboring regions from the previous time step.
  • The boundary values may also be referred to herein as the boundary output.
  • The computing node may send the boundary calculations and, optionally, the interior calculations to the neighboring regions.
  • The computing node may then start the simulation at the next time step, beginning with calculating the interior.
  • The computing node may simultaneously receive calculation results from the computing nodes assigned to neighboring regions while sending its own calculation results to those computing nodes.
  • The computing node may simultaneously be sending and receiving data from other computing nodes while also calculating values in the interior or the boundary.
  • The computing node may calculate the interior values in advance, prior to calculating the boundary values, so that part of the calculation of the region does not need to wait for the data transfer from other computing nodes.
  • The computing node may generate the interior output at least substantially simultaneously with receiving the neighbor output, and may also send the boundary output at least substantially simultaneously with generating the interior output of a future time step.
  • The time for data transfer may be hidden, as it is overlapped with the receiving of data from other computing nodes and with computation, thereby reducing the processing time.
  • The data transfer time may then no longer be a bottleneck to the overall simulation time. Therefore, the quantity of computing nodes employed to process the simulation space may be increased without affecting the overall simulation duration. A runnable toy illustration of this ordering follows.
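  • The sketch below is a runnable toy, an illustration only: the 1-D regions, the trivial smoothing update and the queue-based transport are assumptions, not the patent's implementation. Two "computing nodes" each process their interior first, wait only then for the neighbor's boundary value from the previous step, update their boundary, and queue their own boundary value without waiting for delivery:

```python
import queue
import threading

def node(region, inbox, outbox, steps):
    for t in range(steps):
        # 1) interior: uses only local data (a trivial smoothing update)
        for i in range(1, len(region) - 1):
            region[i] = (region[i - 1] + region[i] + region[i + 1]) / 3
        # 2) only now wait for the neighbor's boundary value of the previous step
        halo = inbox.get()
        # 3) boundary: uses the received neighbor value
        region[-1] = (region[-2] + region[-1] + halo) / 3
        # 4) queue own boundary value; do not wait for delivery
        outbox.put(region[-1])

a_to_b, b_to_a = queue.Queue(), queue.Queue()
a_to_b.put(0.0); b_to_a.put(0.0)  # seed halo values for the first time step
ra, rb = [1.0] * 8, [2.0] * 8
ta = threading.Thread(target=node, args=(ra, b_to_a, a_to_b, 5))
tb = threading.Thread(target=node, args=(rb, a_to_b, b_to_a, 5))
ta.start(); tb.start(); ta.join(); tb.join()
print(ra[-1], rb[-1])
```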
  • FIG. 11 shows a sequence diagram 1100 of a user interface according to various embodiments.
  • The simulation process may be parallelized across multiple computing nodes.
  • A parallel processing manager 301 may be provided to manage the parallelization.
  • The parallel processing manager 301 may provide a unified interface, in other words a front-end interface, to a user 101, even though the simulation is performed across a plurality of computing nodes at the back end.
  • The parallel processing manager 301 may be configured to at least one of: control parameters for the simulation, execute the simulation, or provide access to results of the simulation.
  • The parallel processing manager 301 may be configured to provide the user interface shown in FIG. 11.
  • A user 101 may log into a computing node running the parallel processing manager 301.
  • The user 101 may enter a command that brings up the user interface.
  • The user interface may be part of the parallel processing manager 301.
  • The user interface is not restricted to being executed or displayed in the same computing node as the computing node that the user 101 logged into.
  • The user interface or the parallel processing manager 301 may run on a remote computer or server, and may be connected to the other computing nodes and the log-in computing node. The connection may be provided, for example, via TCP/IP communication.
  • The user interface may be configured to receive user inputs including a setting of parameters 1102, a request for execution 1106 and a request for access to results 1110. If the user 101 executes a command for the setting of parameters 1102, the parallel processing manager 301 may set the parameters and may send a completion notice 1104 on completion of the setting of the parameters.
  • The parameters to be set may include information on the computing nodes.
  • FIG. 12 shows a hardware architecture for a simulation system 1200 according to various embodiments.
  • The simulation system 1200 may include a master computing node 1220 and a plurality of worker computing nodes 210.
  • The simulation system 1200 may further include a shared storage 220.
  • The simulation system 1200 may also include a network 230 which may connect the master computing node 1220 to at least one of the worker computing nodes 210 or the shared storage 220.
  • The master computing node 1220 may include a central processing unit (CPU) 201, a memory 202, a storage 203 and an input/output (I/O) adapter 204.
  • Each worker computing node 210 may include a CPU 211, a memory 212, a storage 213 and an I/O adapter 214.
  • The master computing node 1220 may be similar to the worker computing nodes 210; in other words, the master computing node 1220 and the worker computing nodes 210 may be physically similar in composition or may be the same type of device, for example a computer server.
  • The master computing node 1220 may communicate with other computing nodes, such as the worker computing nodes 210, and with the shared storage 220 through the I/O adapter 204.
  • Each worker computing node 210 may communicate with other computing nodes, including other worker computing nodes 210, and with the shared storage 220 through the I/O adapter 214.
  • FIG. 13 shows a system diagram 1300 of the simulation system 1200.
  • The system diagram 1300 shows the software and data modules of the simulation system 1200.
  • The memory 202 in the master computing node 1220 may store a parallel processing manager 302.
  • The parallel processing manager 302 may be the parallel processing manager 301 of FIG. 11.
  • The memory 202 may further store a management table 303, parameters 304 and an operating system (OS) 301.
  • The memory 212 in the worker computing node 210 may store a parallel processing worker 312.
  • The memory 212 may further store an OS 311 and parameters 313.
  • The shared storage 220 may store a user-defined function 321, a data access library 322, initial data 323 and output data 324.
  • The user-defined function 321 may be the function that is to be run on the data points in the simulation space.
  • The initial data 323 may be the data points in the simulation space before the simulation is run.
  • The output data 324 may be the output of the simulation.
  • FIG. 14 shows a system diagram 1400 of the simulation system 1200 after execution of a simulation.
  • The system diagram 1400 shows the software and data modules of the simulation system 1200.
  • The user-defined function 321, the data access libraries 322 and the initial data 323 may be loaded into the memory 212 in the worker computing node 210.
  • The memory 212 may also store simulated data 324.
  • The simulated data 324 may be intermediate data from the simulation.
  • The memory 212 may further store a management table 325 for the parallel processing worker 312.
  • FIG. 15 shows an example of a pseudo code 1500 of an interface for a user-defined function.
  • A user 101 may designate the parameters 304 for the variables of the interface.
  • The parameters 304 may include a range of the simulation space and time.
  • The simulation space may be three-dimensional, represented by dimensions x, y and z. Time may be represented by t.
  • The user 101 may also designate a user-defined function 321.
  • The simulation system may allocate a memory space of size (size of x) × (size of y) × (size of z) × (size of t) as a virtual matrix.
  • The allocated memory space may be accessed via data access libraries such as VMatrixRead and VMatrixWrite inside the user-defined function 321.
  • The user 101 may access the virtual matrix through the user interface shown in FIG. 11 by specifying the range of x, y, z and t.
  • The simulation may be executed in ascending or descending order in time, although the illustrative example in FIG. 15 only shows the simulation executed in ascending order.
  • The simulation at the time step t may be represented as the execution of the user-defined function 321 for all x, y and z. A speculative reconstruction of this interface follows.
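  • Since the pseudo code of FIG. 15 is not reproduced here, the sketch below is a speculative reconstruction. VMatrixRead and VMatrixWrite are the data access libraries named in the description; the dict-backed storage and the other function and parameter names are assumptions for illustration:

```python
# Speculative reconstruction of the FIG. 15 interface (assumptions as noted).

virtual_matrix = {}  # stands in for the allocated |x|*|y|*|z|*|t| memory space

def VMatrixRead(x, y, z, t):
    return virtual_matrix.get((x, y, z, t), 0.0)

def VMatrixWrite(x, y, z, t, value):
    virtual_matrix[(x, y, z, t)] = value

def simulate(x_size, y_size, z_size, t_size, user_defined_function):
    # Ascending order in time, as in the illustrative example of FIG. 15;
    # descending order would iterate t from t_size - 1 downwards.
    for t in range(t_size):
        for x in range(x_size):
            for y in range(y_size):
                for z in range(z_size):
                    user_defined_function(x, y, z, t)
```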
  • FIG. 16A shows an example 1600A of a first pseudo code of the user-defined function 321.
  • The first pseudo code 1600A illustrates an abstract description of the user-defined function 321.
  • The user-defined function 321 may be defined as a function of x, y, z and t.
  • The user-defined function 321 may calculate a value by using a combination of values at arbitrary points (a, b, c, s), where s is less than t.
  • The user-defined function 321 may set the value for a given point (x, y, z, t) in the simulation space. If the simulation is performed in descending order in time, the condition on s may be replaced by "s is greater than t".
  • FIG. 16B shows an example of a second pseudo code which illustrates an explicit example of the user-defined function 321.
  • The user-defined function 321 in this example may be an implementation of the wave equation. Given x, y, z and t, the user-defined function 321 may read the value at (x, y, z, t-2) and the values of the neighborhood in the directions of x, y and z at time step t-1 through the data access library. The user-defined function 321 may calculate the new value at time t based on the read values, i.e. the value at (x, y, z, t-2) and the values of the neighborhood in the directions of x, y and z at time step t-1.
  • The calculated new value may be set to (x, y, z, t) through the data access library 322.
  • The value at a point (x, y, z) may be derived from the value at (x, y, z) at time step t-2 and further from the values of the neighborhood at time step t-1.
  • The neighborhood may refer to all (a, b, c) such that x-1 ≤ a ≤ x+1, y-1 ≤ b ≤ y+1 and z-1 ≤ c ≤ z+1.
  • The distance may refer to the distance along each axis, i.e. along the x-axis, y-axis and z-axis. The distance may depend on the form of the partial differential equation which the simulation executes, and on the order of approximation. A sketch of such a user-defined function follows.
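  • The following sketch builds on the VMatrixRead/VMatrixWrite interface reconstructed above. The text specifies which points are read, but not the exact update, so a standard second-order finite-difference wave-equation scheme with an assumed Courant-number constant C2 is used here:

```python
# Sketch of the wave-equation user-defined function described for FIG. 16B.
# C2 = (c * dt / dx)^2 is an assumed stability constant, not from the text.
C2 = 0.25

def wave_udf(x, y, z, t):
    prev2 = VMatrixRead(x, y, z, t - 2)    # value at (x, y, z, t-2)
    center = VMatrixRead(x, y, z, t - 1)
    laplacian = (                           # distance-1 neighbors along x, y, z
        VMatrixRead(x + 1, y, z, t - 1) + VMatrixRead(x - 1, y, z, t - 1)
        + VMatrixRead(x, y + 1, z, t - 1) + VMatrixRead(x, y - 1, z, t - 1)
        + VMatrixRead(x, y, z + 1, t - 1) + VMatrixRead(x, y, z - 1, t - 1)
        - 6.0 * center
    )
    # Leapfrog update: u(t) = 2 u(t-1) - u(t-2) + C2 * Laplacian(u(t-1))
    VMatrixWrite(x, y, z, t, 2.0 * center - prev2 + C2 * laplacian)
```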
  • The user 101 need not specify a way to parallelize. The user 101 may not need to know about or provide inputs relating to the parallelization. The user 101 may only need to define the user-defined function 321 through the user interface, for example as in the pseudo code shown in FIGS. 15 and 16.
  • FIG. 17 shows a flowchart 1700 of a parallel processing manager 302 according to various embodiments.
  • The execution flow shown in the flowchart 1700 may begin once a user 101 executes a command of the user interface of FIG. 11.
  • The parallel processing manager 302 may check whether the command is a request for setting parameters. If so, the next step may be S702, in which the parallel processing manager 302 may receive the parameters from the user 101 through the user interface.
  • The parameters may include information about the worker computing nodes.
  • The information about the worker computing nodes may be entered into the management table 303.
  • The execution flow may then proceed to the end, i.e. the parallel processing manager 302 may finish the execution flow.
  • The parameters may be specified directly in the command, for example through variables in the command.
  • The parameters may also be provided through a file containing the parameters; for example, the command may include a variable indicating the name of the file.
  • The state of the worker computing nodes may be set to "Idle".
  • The parallel processing manager 302 may, in step S703, check whether the command is a request for execution of the simulation. If so, the parallel processing manager 302 may receive variables through the command in step S704.
  • The variables may include the range of the simulation space (x, y, z), the range of time (t) and the user-defined function 321.
  • The user-defined function 321 may be provided through a file.
  • The variable indicating the user-defined function 321 may be the file name of the file containing the user-defined function 321.
  • The file name of the file containing the user-defined function 321 may be entered into the parameters 304.
  • The parallel processing manager 302 may fill in the rest of the variables in the management table 303 in step S705.
  • An example of the management table 303 is shown in FIG. 18.
  • As an example, the simulation space may be a cube of size 500³.
  • The parallel processing manager 302 may divide the cubic simulation space into 125 small cubes.
  • Each small cube may also be referred to herein as a region.
  • Each small cube may have a size of 100³, i.e. contain 100³ points.
  • The parallel processing manager 302 may assign the simulation task for each small cube to a respective worker computing node 210.
  • A worker computing node 210 with node ID 1 may be assigned to a first small cube.
  • The first small cube may be [0:99, 0:99, 0:99] of the simulation space.
  • A worker computing node 210 with node ID 2 may be assigned to a second small cube.
  • The second small cube may be [0:99, 0:99, 100:199] of the simulation space.
  • The parallel processing manager 302 may continue to assign the remaining small cubes to the remaining worker computing nodes 210.
  • The notation [x_min:x_max, y_min:y_max, z_min:z_max] may denote the range of x, y, z. A sketch of this assignment follows.
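  • The sketch below illustrates this assignment (the iteration order over x, y and z is an assumption; the text fixes only the first two assignments):

```python
# Illustrative sketch: dividing the 500^3 simulation space into 125 small
# cubes of 100^3 points and assigning node IDs, using the
# [x_min:x_max, y_min:y_max, z_min:z_max] notation from the text.

def assign_regions(space=500, cube=100):
    table, node_id = {}, 1
    for x in range(0, space, cube):
        for y in range(0, space, cube):
            for z in range(0, space, cube):
                table[node_id] = (x, x + cube - 1, y, y + cube - 1,
                                  z, z + cube - 1)
                node_id += 1
    return table

table = assign_regions()
print(len(table))  # 125 small cubes
print(table[1])    # (0, 99, 0, 99, 0, 99)    -> [0:99, 0:99, 0:99]
print(table[2])    # (0, 99, 0, 99, 100, 199) -> [0:99, 0:99, 100:199]
```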
  • The requests may be sent through a networking protocol, for example transmission control protocol/internet protocol (TCP/IP) communication.
  • The requests may include the range of x, y, z assigned to each parallel processing worker 312, the information of the neighbor parallel processing workers 312, and the file name of the user-defined function 321.
  • Neighbor parallel processing workers 312 may refer to the parallel processing workers in the neighbor worker computing nodes.
  • The neighbor worker computing nodes may be worker computing nodes 210 that are processing a neighboring region of the simulation space, or in other words, an adjacent small cube.
  • For example, the first small cube containing [0:99, 0:99, 0:99] of the simulation space may be a neighboring region to the second small cube containing [0:99, 0:99, 100:199] of the simulation space.
  • The first worker computing node processing the first small cube may be a neighbor worker computing node to the second worker computing node processing the second small cube.
  • The parallel processing worker 312 of the first worker computing node may be a neighbor parallel processing worker 312 to the parallel processing worker 312 of the second worker computing node.
  • The parallel processing manager 302 may wait to receive a notice of finishing the simulation from each parallel processing worker 312. In other words, when each parallel processing worker 312 in the respective worker computing node completes its simulation task, it may notify the parallel processing manager 302. Every time the parallel processing manager 302 receives such a notice, it may change the state of the corresponding worker computing node to "Done" in the management table 303. If the state of all parallel processing workers is "Done", the parallel processing manager 302 may update all of their states to "Idle" in the management table 303 and complete the execution flow.
  • The parallel processing manager 302 may check whether the command is a request for accessing the results of the simulation in step S708. If so, the parallel processing manager 302 may receive variables through the command from the user 101 in step S709. The variables may include the range of the simulation space (x, y, z) and the range of time (t) of the simulation for which the results are to be accessed. In the next step S710, the parallel processing manager 302 may search the management table 303 for the worker computing nodes 210 which processed part of or all of the range designated by the variables.
  • The parallel processing manager 302 may request the relevant worker computing nodes 210 to send the corresponding results to the parallel processing manager 302.
  • The request may be sent to the parallel processing workers 312 of the respective relevant worker computing nodes 210.
  • The request may include the range of x, y, z, t, where the range of x, y, z may be inside the small cube assigned to the respective parallel processing worker 312.
  • The parallel processing manager 302 may wait to receive data from each parallel processing worker 312. Once the parallel processing manager 302 has received the data from all of the parallel processing workers 312, it may combine the data in step S713.
  • The parallel processing manager 302 may format and return the combined data to the user 101 in step S714.
  • Alternatively, each parallel processing worker 312 may save the requested data to one of its internal memory 212, its storage 213, or the shared storage 220, and then return the file name of the requested data to the parallel processing manager 302.
  • The memory 212 may include a random-access memory disk.
  • The parallel processing manager 302 may then return a set of file names to the user 101.
  • The user 101 may access the requested data by retrieving the files saved in the internal memory 212, the storage 213 or the shared storage 220, searching by the file names.
  • The parallel processing manager 302 may finish the execution flow after completing step S714. If the command is none of a request for setting parameters, a request for execution of the simulation, or a request for accessing the results of the simulation, the parallel processing manager 302 may proceed to end the execution flow.
  • FIG. 18 shows an example 1800 of a management table 303 for a parallel processing manager 302, according to various embodiments.
  • The management table 303 may include a plurality of columns.
  • The plurality of columns may include columns for the node identity (ID) 1802, the internet protocol (IP) address 1804, the state 1806, i.e. the status of the processing, the range of x 1808, the range of y 1810, and the range of z 1812.
  • A simulation system may include multiple worker computing nodes 210.
  • One row in the management table 303 may be assigned to each worker computing node 210.
  • The node ID 1802 may be a unique number of the worker computing node 210; in other words, each worker computing node 210 may have its own unique node ID.
  • The IP address 1804 may be used to access each worker computing node 210.
  • The state 1806 may represent the state of execution of the simulation.
  • The state 1806 may be "Idle", "Running" or "Done".
  • The initial state may be "Idle".
  • When the simulation is being executed on the worker computing node, the state 1806 may be changed to "Running".
  • When the worker computing node finishes its simulation task, the state 1806 may be changed to "Done".
  • Once the state 1806 of all parallel processing workers 312 is "Done", the states 1806 may be changed to "Idle" again.
  • The range of x, the range of y and the range of z may represent the range of the simulation space, i.e. its minimum and maximum coordinates, which is assigned to each parallel processing worker 312 and which each worker computing node 210 should simulate. A sketch of one table row follows.
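  • A minimal sketch of one row of the management table 303 of FIG. 18; the field names follow the columns described above, while the types and the documentation-range IP address are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ManagementRow:
    node_id: int     # unique worker computing node ID (column 1802)
    ip_address: str  # address used to reach the node (column 1804)
    state: str       # "Idle", "Running" or "Done" (column 1806)
    x_range: tuple   # (x_min, x_max) assigned to the worker (column 1808)
    y_range: tuple   # (y_min, y_max) (column 1810)
    z_range: tuple   # (z_min, z_max) (column 1812)

row = ManagementRow(1, "192.0.2.1", "Idle", (0, 99), (0, 99), (0, 99))
print(row)
```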
  • FIG. 19 shows a flowchart 1900 of a parallel processing worker 312 according to various embodiments.
  • The flowchart 1900 may show an execution flow of the parallel processing worker 312.
  • The parallel processing worker 312 may execute the flow every time the parallel processing manager 302 sends a request to the parallel processing worker 312.
  • The parallel processing worker 312 may check, in step S901, whether the request is a request for execution of the simulation. If so, the parallel processing worker 312 may receive parameters from the parallel processing manager 302.
  • The parameters may include the range of (x, y, z) and the range of t.
  • The parameters may also include the file name of the user-defined function 321 and information on the neighbor computing nodes.
  • The parallel processing worker 312 may set the range of time and the file name of the user-defined function 321 to the parameters 313 in step S902.
  • FIG. 20 shows a first management table 2000A and a second management table 2000B.
  • The first management table 2000A and the second management table 2000B may be management tables 325 for a parallel processing worker 312 according to various embodiments.
  • The first management table 2000A may include the range of x, y, z which is assigned to the parallel processing worker 312.
  • The first management table 2000A may include a first column 2002 indicating the range of x, a second column 2004 indicating the range of y and a third column 2006 indicating the range of z.
  • The first column 2002 may include a sub-column indicating the maximum value of x and another sub-column indicating the minimum value of x.
  • The second column 2004 may include a sub-column indicating the maximum value of y and another sub-column indicating the minimum value of y.
  • The third column 2006 may include a sub-column indicating the maximum value of z and another sub-column indicating the minimum value of z.
  • The first management table 2000A may include a first row 2008 representing the total range of x, y, z assigned to the parallel processing worker 312. In other words, the total range may define the small cube.
  • The small cube may be the region referred to in FIGS. 1 and 2.
  • The first management table 2000A may include a second row 2010 representing the interior range within the total range.
  • The interior range may be a subset of the total range.
  • The second management table 2000B shows the inner division of the small cube. In other words, the second management table 2000B shows how the small cube, i.e. the region, is divided internally.
  • The second management table 2000B may include a first column 2012 indicating the divisions of the small cube, also referred to herein as inner regions; a second column 2020 indicating the starting address of each division; a third column 2022 indicating the neighbor node number, i.e. the worker computing node that is processing the neighboring small cube; and a fourth column 2024 indicating the internet protocol address of the neighbor node.
  • The second management table 2000B may also include a plurality of rows, wherein each row represents a division within the small cube.
  • The plurality of rows may include a first row 2014 for the interior, a second row 2016 for the upper boundary, a third row 2018 for the lower boundary, a fourth row 2020 for the north boundary, a fifth row 2022 for the east boundary, a sixth row 2024 for the south boundary and a seventh row 2026 for the west boundary.
  • The calculation of a point at a time step may depend on the values of the neighborhood at earlier time steps. If a point is near the boundary of the small cube assigned to the parallel processing worker 312, the calculation of the point may require values from outside of the small cube. Areas outside of the small cube may be assigned to other parallel processing workers 312.
  • The area where the calculation of a point depends only on values inside the small cube may be referred to herein as the interior.
  • The upper boundary may be defined as the area, complementary to the interior, from just outside the maximum z of the interior up to the maximum z of the small cube.
  • The lower boundary may be defined as the area, complementary to the interior, from just outside the minimum z of the interior down to the minimum z of the small cube.
  • The north boundary, the east boundary, the south boundary and the west boundary may be defined as areas which are complementary to the interior and which include the areas with the maximum y, the maximum x, the minimum y and the minimum x, respectively. A sketch of this inner division follows.
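  • The sketch below derives such an inner division from the total range and the interior range. It is an illustration: how the corner and edge cells are apportioned between adjacent boundaries is an assumption, since the text defines each boundary only by the extreme coordinate it contains:

```python
# Hypothetical sketch: deriving the interior and the six boundaries of a
# small cube from its total range and interior range (inclusive ranges).

def inner_division(total, interior):
    (tx0, tx1), (ty0, ty1), (tz0, tz1) = total
    (ix0, ix1), (iy0, iy1), (iz0, iz1) = interior
    return {
        "interior": interior,
        "upper": ((tx0, tx1), (ty0, ty1), (iz1 + 1, tz1)),  # contains max z
        "lower": ((tx0, tx1), (ty0, ty1), (tz0, iz0 - 1)),  # contains min z
        "north": ((tx0, tx1), (iy1 + 1, ty1), (iz0, iz1)),  # contains max y
        "south": ((tx0, tx1), (ty0, iy0 - 1), (iz0, iz1)),  # contains min y
        "east":  ((ix1 + 1, tx1), (iy0, iy1), (iz0, iz1)),  # contains max x
        "west":  ((tx0, ix0 - 1), (iy0, iy1), (iz0, iz1)),  # contains min x
    }

division = inner_division(((300, 399), (200, 299), (100, 199)),
                          ((301, 398), (201, 298), (101, 198)))
print(division["upper"])  # ((300, 399), (200, 299), (199, 199))
```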
  • Each boundary may have a unique neighbor cube.
  • For example, in FIG. 20 the upper boundary may be adjacent to the small cube assigned to the parallel processing worker 312 with node ID 86, and vice versa.
  • The parallel processing manager 302 may send the information of the neighbor parallel processing workers 312 to each parallel processing worker 312 when the simulation is executed.
  • the parallel processing worker 312 may fill in the first management table 2000A.
  • the parallel processing worker 312 may set the first row 2008 indicating the total range of x, y, z assigned to the parallel processing worker 312 as designated by the variables received from the parallel processing manager 302. In the example shown, the range is set to [300:399, 200:299, 100: 199].
  • the parallel processing worker 312 may also fill in the second table 2000B.
  • the parallel processing worker 312 may determine the information required for the third column 2022 and the fourth column 2024.
  • the parallel processing worker 312 may analyze the user-defined function 321, in other words, step S904 of the flowchart 1900.
  • the parallel processing worker 312 may specify the range where the calculation depends only on values inside the small cube and set the range to the second row 2010 of the first management table 2000A, i.e. to specify the interior in the first management table 2000A for the worker computing node 210. In this case, since values with distance more than 2 do not affect the calculation, the range of the interior may be set to [301:398, 201:298, 101:198].
  • the parallel processing worker 312 may determine the size of the interior and each boundary from the range of x, y, z, in other words, step S905 of the flowchart 1900.
  • the parallel processing worker 312 may fill in the second management table 2000B for the worker computing node 210.
  • the parallel processing worker 312 may set the column of the starting address for each inner area and may allocate memory according to the determined size.
  • the required size for the interior may be the product of the range of x, the range of y, the range of z, the size of a floating-point value and the number of time steps.
  • the required size for the interior may be equal to 98^3 * 2 * 1000 bytes (98^3 points, 2-byte floating-point values, 1000 time steps), which may be about 1.8 GB. By similar calculations, each boundary may require about 40 MB.
  • the parallel processing worker 312 may set the second column 2020 indicating the starting address as shown in the second management table 2000B and also in FIG. 21.
  • FIG. 21 shows a conceptual diagram 2100 of a memory space for simulated data, according to various embodiments.
  • the data at a time step may be saved next to, in other words, adjacent to, the data at the previous time step.
  • a memory space may first be allocated to each time step and the memory space for each time step may be divided into the interior and the boundaries. If the simulation does not require the results of all time steps, the implementation may reduce memory consumption, as the parallel processing worker 312 may reuse the memory space allocated to the earlier time steps.
  • the parallel processing worker 312 may start to simulate for the user-defined function 321.
  • the parallel processing worker 312 may continue the loop of steps S908 to S911.
  • the parallel processing worker 312 may calculate for all points (x, y, z) in the interior at a time step by executing the user-defined function 321 with variables (x, y, z, t) in step S908. After calculating for all points in the interior, the parallel processing worker 312 may wait for receiving the data of the previous time step t-1 from neighbor worker computing nodes in step S909.
  • the parallel processing worker 312 may save the received data to the corresponding area and may calculate for all points (x, y, z) in boundary areas at a time step by executing the user-defined function 321 with variables (x, y, z, t) in step S910.
  • the parallel processing worker 312 may send the results of calculation for boundaries to the neighbor worker computing nodes in step S911.
  • the parallel processing worker 312 may search for the worker computing node and its IP address in the second management table 2000B.
  • the parallel processing worker 312 may send the notice of finishing simulation to the parallel processing manager 302 and finish the flow. If the request is not for execution of simulation, the parallel processing worker 312 may check whether the request is for accessing the result of simulation in step S913. If the request is for accessing the result of simulation, the parallel processing worker may receive parameters including the range of (x, y, z) and (t) in step S914. The parallel processing worker 312 may search the second management table for the area containing the designated range in step S915. The parallel processing worker 312 may read the simulated data from one or multiple areas and may format the data into one continuous block of data in step S916. The parallel processing worker 312 may send the formatted data to the parallel processing manager 302 in step S917.
  • the parallel processing worker 312 may save the data to RAM disk in memory 212, local storage 213, or the shared storage 220, and may send the file name to the parallel processing manager 302 instead of sending the data itself in order to reduce the amount of data transfer. If the request is not for accessing result of simulation, the parallel processing worker 312 may finish the flow.
  • FIG. 22 shows pseudo codes of data access libraries to access the virtual matrix. These functions may be used to access the virtual matrix from the user-defined functions 321. They may be executed on each worker computing node. An executable sketch of these functions follows this list.
  • the first pseudo code 2200A may be a function to read the virtual matrix. Given a point and time step (x, y, z, t), the function may search the management table for the area containing the point. The function may read the starting address of the searched area and may calculate the address for (x, y, z, t). The function may read the value from the address and may return it.
  • the second pseudo code 2200B may be a function to write value to the virtual matrix.
  • the function may search the management table for the area containing the point.
  • the function may read the starting address of the searched area and calculate the address for (x, y, z, t).
  • the function may write the value to the address.
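
In executable form, the pseudo codes of FIG. 22 may be rendered as the following Python sketch. This is illustrative only, not the patented implementation: the names AREAS, BUFFER, find_area, flat_index, vmatrix_read and vmatrix_write are hypothetical; the sizes follow the worked example above (a 100^3 small cube whose interior is [301:398, 201:298, 101:198], 2-byte floating-point values and 1000 time steps); and the memory layout follows FIG. 21, with one block per time step divided into the interior and the boundaries.

    import numpy as np

    NT = 1000                    # retained time steps (assumed)
    STEP_CELLS = 100 ** 3        # cells per time step in the small cube
    # Flat memory space of FIG. 21: one block per time step, each block
    # divided into the interior and the boundaries. float16 matches the
    # 2-byte floating-point size used in the size calculation above, so
    # the buffer is about 2 GB, in line with that estimate.
    BUFFER = np.zeros(NT * STEP_CELLS, dtype=np.float16)

    # Stand-in for the second management table 2000B: each entry holds the
    # (x, y, z) range of one inner area and the starting offset of that
    # area within the memory block of a single time step.
    AREAS = [
        {"range": ((301, 398), (201, 298), (101, 198)), "start": 0},  # interior
        # ... one entry per boundary (upper, lower, north, east, south, west)
    ]

    def find_area(x, y, z):
        # Search the management table for the area containing the point.
        for area in AREAS:
            (x0, x1), (y0, y1), (z0, z1) = area["range"]
            if x0 <= x <= x1 and y0 <= y <= y1 and z0 <= z <= z1:
                return area
        raise KeyError((x, y, z))

    def flat_index(area, x, y, z, t):
        # Read the starting address of the searched area and calculate the
        # address for (x, y, z, t); "% NT" reuses memory of earlier steps.
        (x0, _), (y0, y1), (z0, z1) = area["range"]
        ny, nz = y1 - y0 + 1, z1 - z0 + 1
        offset = ((x - x0) * ny + (y - y0)) * nz + (z - z0)
        return (t % NT) * STEP_CELLS + area["start"] + offset

    def vmatrix_read(x, y, z, t):
        # First pseudo code 2200A: read a value from the virtual matrix.
        return BUFFER[flat_index(find_area(x, y, z), x, y, z, t)]

    def vmatrix_write(x, y, z, t, value):
        # Second pseudo code 2200B: write a value to the virtual matrix.
        BUFFER[flat_index(find_area(x, y, z), x, y, z, t)] = value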

Abstract

According to various embodiments, there is provided a simulation method including providing a region, the region being part of a simulation space; processing the region in a computing node, wherein the processing includes dividing the region into an interior and a boundary; generating an interior output based on the region; receiving a neighbor output including output generated based at least partially on another region of the simulation space at an earlier time step; generating a boundary output based at least partially on the received neighbor output; sending the boundary output to another computing node, the another computing node processing the another region.

Description

SIMULATION METHODS, COMPUTING NODES AND
SIMULATION SYSTEMS FOR PARALLEL EXECUTION
TECHNICAL FIELD
The present invention relates to simulation methods, computing nodes and simulation systems.
BACKGROUND
High speed simulation with high accuracy is required in many industrial fields, for example, to analyze and forecast physical phenomena. One of the industrial fields requiring high speed and high accuracy simulation is seismic analysis. Seismic analysis is a process in oil and gas exploration, for recovering the subsurface structure. Seismic analysis may enable the finding of new hydro-carbon layers by analyzing the observed artificial seismic wave. In the past, seismic analysis was mostly restricted to two-dimensional (2D) seismic analysis. 2D seismic analysis images a slice of the subsurface. The amount of computations and the size of seismic data analyzed were limited by the performance of the computing environment. As computing technologies improve, it is now possible to apply three-dimensional (3D) seismic analysis to investigate subsurface structures that are more complex. However, 3D seismic analysis consumes a large amount of memory and computation resources. This may be especially so in the case where some algorithms require the retention of all intermediate data of the simulation in memory until the simulation ends. Therefore, there is a need for a simulation method that may process 3D seismic analysis with a reduced requirement for memory and computational resources.
SUMMARY
According to various embodiments, there may be provided a simulation method including: providing a region, the region being part of a simulation space; processing the region in a computing node, wherein the processing includes: dividing the region into an interior and a boundary; generating an interior output based on the region; receiving a neighbor output including output generated based at least partially on another region of the simulation space at an earlier time step; generating a boundary output based at least partially on the received neighbor output; sending the boundary output to another computing node, the another computing node processing the another region.
According to various embodiments, there may be provided a computing node configured to process a region, the computing node including: a processing unit configured to divide the region into an interior and a boundary, the region being part of a simulation space, wherein the processing unit is further configured to generate an interior output based on the region; a communication unit configured to receive a neighbor output, the neighbor output including output generated based at least partially on another region of the simulation space at an earlier time step, wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output; wherein the communication unit is further configured to send the boundary output to another computing node processing the another region.
According to various embodiments, there may be provided a simulation system including: a divider configured to divide a simulation space into a plurality of regions; and a plurality of computing nodes, each computing node of the plurality of computing nodes configured to process a respective region of the plurality of regions; wherein a computing node of the plurality of computing nodes includes: a processing unit configured to divide a region of the plurality of regions into an interior and a boundary, and further configured to generate an interior output based on the region; and a communication unit configured to receive a neighbor output, the neighbor output including output generated based at least partially on another region of the plurality of regions at an earlier time step, wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output; and wherein the communication unit is further configured to send the boundary output to another computing node of the plurality of computing nodes processing the another region.
BRIEF DESCRIPTION OF THE DRAWINGS
In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments are described with reference to the following drawings, in which:
FIG. 1 shows a flow diagram of a simulation method according to various embodiments.
FIG. 2 shows a flow diagram showing a simulation method according to various embodiments.
FIG. 3 shows a conceptual diagram of a computing node according to various embodiments.
FIG. 4 shows a conceptual diagram of a simulation system according to various embodiments.
FIG. 5A shows a representation diagram of a simulation space.
FIG. 5B shows a diagram showing a simulation process.
FIG. 6 shows a diagram showing a simulation process according to various embodiments.
FIG. 7 shows a timing diagram of a simulation process.
FIG. 8 shows a conceptual diagram showing a simulation method according to various embodiments.
FIG. 9 shows a time diagram illustrating the processing time in each computing node according to various embodiments.
FIG. 10A shows a timing diagram for a simulation process.
FIG. 10B shows a timing diagram showing a simulation method according to various embodiments.
FIG. 11 shows a sequence diagram of a user interface according to various embodiments.
FIG. 12 shows a hardware architecture for a simulation system according to various embodiments.
FIG. 13 shows a system diagram of the simulation system.
FIG. 14 shows a system diagram of the simulation system after execution of a simulation.
FIG. 15 shows an example of a pseudo code of an interface for a user- defined function.
FIG. 16A shows an example of a first pseudo code of the user-defined function.
FIG. 16B shows an example of a second pseudo code of the user-defined function.
FIG. 17 shows a flowchart of a parallel processing manager according to various embodiments.
FIG. 18 shows an example of a management table for a parallel processing manager according to various embodiments.
FIG. 19 shows a flowchart of a parallel processing worker according to various embodiments.
FIG. 20 shows a first management table and a second management table.
FIG. 21 shows a conceptual diagram of a memory space for simulated data, according to various embodiments.
FIG. 22 shows pseudo codes of data access libraries to access virtual matrix.
DESCRIPTION
Embodiments described below in context of the methods are analogously valid for the respective computing nodes and the respective simulation systems, and vice versa. Furthermore, it will be understood that the embodiments described below may be combined, for example, a part of one embodiment may be combined with a part of another embodiment. It will be understood that any property described herein for a specific method may also hold for any method described herein. It will be understood that any property described herein for a specific computing node may also hold for any computing node described herein. It will be understood that any property described herein for a specific simulation system may also hold for any simulation system described herein. Furthermore, it will be understood that for any method, computing node or simulation system described herein, not necessarily all the components or steps described must be enclosed in the method, computing node or simulation system, but only some (but not all) components or steps may be enclosed.
In this context, the simulation system as described in this description may include a memory which is for example used in the processing carried out in the device. A memory used in the embodiments may be a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory).

In the specification the term "comprising" shall be understood to have a broad meaning similar to the term "including" and will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. This definition also applies to variations on the term "comprising" such as "comprise" and "comprises".
The term "coupled" (or "connected") herein may be understood as electrically coupled, or as in being in communication, or as in mechanically coupled, for example attached or fixed, or just in contact without any fixation, and it will be understood that both direct coupling or indirect coupling (in other words: coupling without direct contact) may be provided. In order that the invention may be readily understood and put into practical effect, particular embodiments will now be described by way of examples and not limitations, and with reference to the figures. High speed simulation with high accuracy is required in many industrial fields, for example, to analyze and forecast physical phenomena. One of the industrial fields requiring high speed and high accuracy simulation is seismic analysis. Seismic analysis is a process in oil and gas exploration. Seismic analysis recovers the subsurface structure and enables the finding of a new hydro-carbon layer by analyzing the observed artificial seismic wave. In the past, seismic analysis was mostly restricted to two-dimensional (2D) seismic analysis. 2D seismic analysis images a slice of the subsurface. The amount of computations and the size of seismic data analyzed were limited by the performance of the computing environment. As computing technologies improve, it is now possible to apply three-dimensional (3D) seismic analysis to investigate subsurface structures that are more complex. However, 3D seismic analysis consumes a large amount of memory and computation resources. This may be especially so in the case where some algorithms require keeping all intermediate data of the simulation in memory until the simulation ends. Therefore, there is a need for a simulation method that may process 3D seismic analysis with a reduced requirement for memory and computation resources.
In the context of various embodiments, the phrase "neighboring region" may be but is not limited to being interchangeably referred to as a "neighbor region".
In the context of various embodiments, the phrase "neighboring node" may be but is not limited to being interchangeably referred to as a "neighbor node" or "neighbor computing node" or "neighbor worker computing node".
FIG. 1 shows a flow diagram 100 of a simulation method according to various embodiments. The simulation method may include various processes, including 102 and 104. In 102, a region may be provided. The region may be part of a simulation space. In 104, the region may be processed in a computing node. The processing of the region in the computing node may include a plurality of sub-processes including 106, 108, 110, 112 and 114. In 106, the region may be divided into an interior and a boundary. In 108, an interior output may be generated based on the region. In 110, a neighbor output comprising output generated based at least partially on another region of the simulation space at an earlier time step may be received. In 112, a boundary output may be generated based at least partially on the received neighbor output. In 114, the boundary output may be sent to another computing node. The another computing node, in other words, the other computing node, may be processing the another region, in other words, the other region. The other region may be adjacent to the region in the simulation space.
In other words, according to various embodiments, a simulation method may include process 102, in which a region that is part of a simulation space is provided. The region may be provided by dividing a simulation space into a plurality of regions. The simulation method may further include process 104, in which the region may be processed in a computing node. Each region of the plurality of regions may be processed in a respective computing node of a plurality of computing nodes. The plurality of computing nodes may be configured to process the plurality of regions at least substantially concurrently. At least two computing nodes may process their respective regions at least substantially concurrently. The process 104 may include sub-process 106, in which the region may be divided into an interior and a boundary. The process 104 may further include sub-process 108, in which an interior output may be generated based on the region. The interior output may be generated based on data within the region only, in other words, without requiring any data from neighboring regions or neighboring computing nodes. The process 104 may further include sub-process 110, in which a neighbor output is received. The neighbor output may include output generated based at least partially on another region of the simulation space at an earlier time step. The process 104 may further include sub-process 112, in which a boundary output may be generated based at least partially on the received neighbor output. The process 104 may further include sub-process 114, in which the boundary output may be sent to another computing node. The neighbor output may include an interior output of another region generated by a neighboring computing node at an earlier time step. The neighbor output may also include a boundary output of the other region generated by the neighboring computing node at the earlier time step. The neighboring computing node processing the other region may be referred herein as the other computing node. The other computing node may generate a boundary output of the other region in a future time step, based at least partially on the sent boundary output.
FIG. 2 shows a flow diagram 200 showing a simulation method according to various embodiments. The simulation method may be similar to the simulation method shown in the flow diagram 100, in that it includes processes 102 and 104. In addition, the simulation method shown in flow diagram 200 may further include processes 116 and 118. In 116, a user-defined function may be received. The dividing of the region, i.e. sub-process 106, may include determining inputs of the user-defined function. The user-defined function may be the simulation function. The user-defined function may be the function through which each point in the region is processed. In 118, a user interface may be provided. The user interface may be configured to receive user inputs comprising at least one of information on the computing node, a request to execute the simulation, or a request to access results of the executed simulation.
FIG. 3 shows a conceptual diagram of a computing node 300 according to various embodiments. The computing node 300 may be configured to process a region. The region may be part of a simulation space. The computing node 300 may include a processing unit 330 and a communication unit 332. The processing unit 330 may be coupled to the communication unit 332. The processing unit 330 may be configured to divide the region into an interior and a boundary. The processing unit 330 may be further configured to generate an interior output based on the region. The communication unit 332 may be configured to receive a neighbor output. The processing unit 330 may be further configured to generate a boundary output based at least partially on the received neighbor output. The neighbor output may include output generated based at least partially on another region of the simulation space at an earlier time step. The communication unit 332 may be further configured to send the boundary output to another computing node processing the other region. The computing node 300 may be configured to carry out the simulation method shown in the flow diagram 100 or the simulation method shown in the flow diagram 200.
FIG. 4 shows a conceptual diagram of a simulation system 400 according to various embodiments. The simulation system 400 may include a divider 440 and a plurality of computing nodes 300. The divider 440 may be configured to divide a simulation space into a plurality of regions. Each computing node 300 of the plurality of computing nodes 300 may be configured to process a respective region of the plurality of regions. For example, a first region may be allocated to a first computing node. The first computing node may process the first region. A second region may be allocated to a second computing node. The second computing node may process the second region. If the second region is adjacent to the first region in the simulation space, the second region may be a neighbor region to the first region. Accordingly, the second computing node may be a neighbor computing node to the first computing node. Each computing node 300 may include a processing unit 330 and a communication unit 332, as shown in FIG. 3. The divider 440 may be coupled to the plurality of computing nodes. Each computing node may be coupled to other computing nodes. The coupling may be in the form of data connection.
A simulation method according to various embodiments may enable fine-grained parallelization. The simulation method may be applied in physical simulation. Physical simulation may be realized by discretizing a partial differential equation. Since a partial differential equation may be described locally, the result of simulation at a point at a time step may be calculated just from past values near the point. To reduce the processing time of the simulation, the simulation space may be divided and the divided space may be assigned to multiple computing nodes. The amount of memory consumption in each computing node may also be reduced. If the size of each divided space is very small, the time taken to transfer the calculated data to the other computing node may be larger than the time taken to calculate. Then the total processing time for the simulation may be bounded by the sum of time for calculation and transfer. To overcome this limitation that the total processing time may be bounded by the sum of calculation time and data transfer time, the simulation method may "hide" the data transfer time to reduce the total processing time. The data transfer time is "hidden" by having it performed in parallel to the computation in the computing node. The computing node may compute the values for points in the region even before having access to the data from neighboring regions, as the computation is conducted in two steps, wherein the first step may not require data from neighboring regions.
According to various embodiments, a simulation method may be applied to graphs. A 3D physical space may be viewed as a graph, for example taking the form of a 3D mesh. The simulation may be a simulation on a graph. If a given graph may be divided into multiple graphs and the calculation for simulation at a point and a time step is described by the past value near the point, the simulation method may be applied. For example, the simulation method may be applied to traffic simulation in a city since the road network may be represented as a graph and the flow of traffic may have a local description.

FIG. 5A shows a representation diagram 500A of a simulation space 550. The simulation space 550 may be a physical region R, or may contain representations of a physical region. The simulation space may be a 2D or a 3D matrix. Assuming the simulation space 550 is a 3D matrix, its coordinates may be denoted as (x, y, z), where each of x, y and z represent a respective dimension of the simulation space 550. Any point in the simulation space 550 may be identified by its coordinates. Each point in the simulation space 550 may also be referred herein as a cell.
FIG. 5B shows a diagram 500B showing a simulation process. A simulation process may involve multiple time steps. The points in the simulation space 550 may evolve with each time step of the multiple time steps. In other words, the simulation process may compute the changes in values of each point in the simulation space 550 over a period of time. The period of time may be represented by the multiple time steps, or discretized into a plurality of time steps. The value of each point in the simulation space 550 is then computed for each time step of the plurality of time steps. The time step may be represented by t = i in FIG. 5A, where i denotes the sequence or position in time. The physical state S_i of the simulation space 550 may change with each time step in the simulation process. The physical state of the simulation space 550 at the i-th time step may be denoted as S_i 556. For example, before the simulation begins, i.e. at the 0th time step, the physical state of the simulation space may be S_0 552. At the next time step, i = 1, the physical state of the simulation space may be S_1 554. At S_1 554, the value of each cell in the matrix may be replaced by the simulation results of the respective cell from S_0 552. In other words, each cell of S_0 552 may be processed using a simulation function, also referred herein as a user-defined function, to become the new value of the same cell of S_1 554. The processing of each cell of the simulation space may continue with each subsequent time step.
FIG. 6 shows a diagram 600 showing a simulation process according to various embodiments. In the case where the simulation function is the wave equation, the physical state S_i may be calculated from the physical states of the two previous time steps, i.e. S_{i-2} and S_{i-1}. The simulation space of the time step t = i-2 and the simulation space of the time step t = i-1 may be provided as inputs to a computation process 660 to generate the desired physical state S_i. The computation process 660 may be denoted as P_i.
FIG. 7 shows a timing diagram 700 of a simulation process. The timing diagram 700 may include an axis 770 representing time. In the simulation process, the simulation space 550 may be divided into a plurality of regions, where each region is assigned to a respective computing node. The simulation process may be, for example, applied to process seismic algorithms, where most of the processing time may be consumed in simulating the wave equation and similar processes. By assigning a region to each computing node, each computing node may only need to process a relatively small amount of data as compared to the entire simulation space 550. As such, the consumption of memory in each computing node may be reduced. For each time step, each region may be processed concurrently in the respective computing node. After completing the computation process 660 of the first time step, the results of the processing in each computing node may be transferred between the plurality of computing nodes for synchronization, before the simulation process proceeds to the second time step. After that, each computing node may again carry out the computation process 660, before the results of the processing are transferred between the plurality of computing nodes. The time taken for each computing node to carry out the computation process 660 may be referred herein as the processing time 772. The time taken for the data transfer and synchronization may be referred herein as the transfer time 774. As can be seen from the timing diagram 700, the total amount of time taken for the simulation may be a sum of the processing time 772 and the transfer time 774. This method of performing the simulation may be efficient only if the processing time 772 for each time step is sufficiently large compared to the transfer time 774. To reduce the processing time 772 and to retain all intermediate data in the memory of the computing nodes, the simulation space should preferably be divided into a large number of regions. However, if the simulation space is divided into too many regions, the transfer time 774 may increase as data needs to be transferred between a large quantity of computing nodes. As a result, the overall time taken to complete the simulation process may not be reduced even with an increased number of computing nodes, as the overall time taken may be bounded by the transfer time 774. In other words, the transfer time 774 may be a bottleneck to speeding up the simulation process.
According to various embodiments, the simulation method may hide the time taken for data transfer between computing nodes, or in other words, perform processing tasks in parallel to transferring data. As such, the simulation method may reduce the overall processing time by enabling parallelized simulation in more computing nodes.

FIG. 8 shows a conceptual diagram 800 showing a simulation method according to various embodiments. The simulation method may include parallelizing simulation in multiple computing nodes. For a given simulation program, the simulation method may include dividing a simulation space 550, also referred herein as simulation region R or a physical cube, into a plurality of regions. Each region may be a small cube. For each time step in either ascending or descending order, each region may be processed by a respective computing node. The processing of each region at each time step may include dividing the region into a boundary 882 and an interior 880. Processing, in other words, calculating the values of each point in the boundary 882 may require the calculated values of the neighbor regions from an earlier time step. In contrast, calculating values of each point in the interior 880 may not require the calculated values of the neighbor regions from an earlier time step. Take, for example, a region 890. The neighbor regions of the region 890 may include Rn 884a, Rw 884b, Rs 884c and Re 884d. Rn 884a may be another region in the simulation space 550, located to the north of the region 890. Rw 884b may be another region in the simulation space 550, located to the west of the region 890. Rs 884c may be another region in the simulation space 550, located to the south of the region 890. Re 884d may be another region in the simulation space 550, located to the east of the region 890. Processing the region 890 at each time step may further include calculating values for the interior 880 and waiting to receive the calculated values of the neighboring regions from the earlier time step, calculating values for the boundary 882, sending the calculated values to the neighboring nodes and then moving to the next time step. The sending of the calculated values to the neighboring nodes need not be completed before the computing nodes move on to process their respective regions in the next time step. The process of dividing the region 890 into the interior 880 and the boundary 882 may include analyzing which points in the region 890 can be processed with the user-defined function without requiring values from outside of the region 890. The inputs of the user-defined function may be determined. The division of the region into the interior 880 and the boundary 882 may be based on a furthest distance between the determined inputs of the user-defined function. For example, if the user-defined function requires inputs from x-1, then the most exterior row along the x dimension may be defined as the boundary, as the values of the most exterior row along the x dimension cannot be processed with the user-defined function without requiring data from outside of the region. If the user-defined function requires inputs from x-1 and x-3, then the most exterior three rows along the x dimension may be defined as the boundary, as the values of these three rows cannot be processed with the user-defined function without requiring data from outside of the region. The user-defined function may be received through a user interface provided on a client computer.
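To illustrate this analysis step, the following minimal sketch (the helper interior_range is hypothetical, not from the patent) derives the interior range along one dimension from the spatial offsets read by the user-defined function; the boundary is the remainder of the total range.

    def interior_range(total, offsets):
        # total   -- (min, max) range assigned to this node along one dimension
        # offsets -- spatial offsets the user-defined function reads, e.g. (-1, 0, 1)
        width = max(abs(o) for o in offsets)  # furthest distance among the inputs
        lo, hi = total
        return (lo + width, hi - width)

    # A function reading x-1 and x+1 leaves a one-cell boundary:
    print(interior_range((300, 399), (-1, 0, 1)))         # (301, 398)
    # A function reaching out to x-3 leaves a three-cell boundary:
    print(interior_range((300, 399), (-3, -1, 0, 1, 3)))  # (303, 396)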
A user may enter parameters through the client computer, to provide the user-defined function. The parameters may also include information on the plurality of computing nodes, for example the IP addresses and the identifier codes of the computing nodes. The user interface may also be configured to receive a request to execute simulation. The request to execute simulation may include a range of the physical cube and the range of time steps to simulate. In other words, the request to execute simulation may include a range of the simulation space 550 to be processed and a range of time steps to be processed. The user interface may also be configured to receive a request to access the results of the simulation. The request to access the result of the simulation may include the range of the simulation space 550 and the range of time steps to access.
FIG. 9 shows a timing diagram 900 illustrating the processing time in each computing node according to various embodiments. The timing diagram 900 includes an axis 990 indicating time. The processing time P_i may include an interior processing time 992a, P_{Ii}, and a boundary processing time 992b, P_{Bi}. The interior processing time 992a may be the time taken to perform calculations which do not require calculation results from other neighboring regions, i.e. the interior processing time 992a may be the time taken to process the interior. The boundary processing time 992b may be the time taken to perform calculations requiring calculation results from other neighboring regions, i.e. the boundary processing time 992b may be the time taken to process the boundary. In general, for user-defined functions that are partial differential equations, the boundary processing time 992b may be small relative to the interior processing time 992a.

FIG. 10A shows a timing diagram 1000A for a simulation process. The timing diagram 1000A may include a time axis 1010. The method may include dividing a simulation space 550 into small regions, assigning each small region to a respective computing node, and executing the simulation in each computing node in parallel. Physical simulation in a time step may require data generated in simulation in earlier time steps. In some cases, simulation at a point may only require the data generated in simulation in the neighborhood at earlier time steps. The neighborhood may be processed by a neighbor node. The neighborhood may also be referred herein as a neighbor region. At time step i = 1, the first node, also referred herein as own node, may process the points in the own node, in parallel to the processing in the neighbor node. Each of the own node and the neighbor node may take processing time 772 to perform the computation processing P_i. Following the computation processing, the neighbor node may send the results of its computation performed at i = 1, to the own node. The sending of the results from the neighbor node to the own node may take up transfer time 774. After receiving the results, the own node may proceed with the computation processing of the next time step i = 2. As can be seen from the timing diagram 1000A, the total time taken for the simulation may be a sum of the processing time 772 and transfer time 774, multiplied by the number of time steps.
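As a rough illustration with made-up numbers (not taken from the patent): if the transfer can be overlapped with the interior calculation of the next time step, as described below for FIG. 10B, the transfer adds to the total time only when it exceeds the interior processing time.

    steps = 1000
    p_interior, p_boundary, transfer = 80.0, 5.0, 60.0  # milliseconds, illustrative

    # FIG. 10A: every step pays for calculation and transfer in sequence.
    naive = steps * (p_interior + p_boundary + transfer)  # 145000.0 ms
    # FIG. 10B: the transfer runs during the next interior calculation,
    # so it adds time only if it is longer than that calculation.
    overlapped = steps * (p_interior + p_boundary
                          + max(0.0, transfer - p_interior))  # 85000.0 ms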
FIG. 10B shows a timing diagram 1000B showing a simulation method according to various embodiments. The timing diagram 1000B may include a time axis 1012. This simulation method may complete a simulation in a shorter time as compared to the method shown in FIG. 10A. This simulation method may include classifying each region assigned to a respective computing node into inner regions. The inner regions may include an interior and a boundary. The interior may be defined as an inner part of the region, where any point in the interior may be calculated solely based on data inside the region. The boundary may be defined as an outer part of the region, where a point in the boundary may be calculated based on data outside of the region. The point in the boundary may also be calculated further based on data inside the region. In other words, the calculation of a point in the boundary may require not only data or value inside the region but also data from neighbor regions. For a partial differential equation, the interior may occupy most of the small region as compared to the boundary. Hence, in a simulation at a time step in a computing node, the computing node may calculate the value in the interior, wait for receiving the calculation result from computing nodes assigned to neighboring regions, calculate the value in the boundary, send the calculation result of the value in the boundary to computing nodes assigned to the neighboring regions, and move to the next time step without waiting for completion of the data transfer. In other words, the computing node may calculate the interior values concurrent to receiving the calculation result of neighboring regions in a previous time step. The interior values may also be referred herein as the interior output. Next, the computing node may calculate the boundary values using the received calculation results of neighboring regions in a previous time step. The boundary values may also be referred herein as the boundary output. After that, the computing node may send boundary calculations and optionally, interior calculations to the neighboring regions. Concurrently, the computing node may start the simulation at the next time step, beginning with calculating the interior. The computing node may simultaneously receive calculation results from computing nodes assigned to neighboring regions, while sending its own calculation results to the computing nodes assigned to the neighboring regions. The computing node may simultaneously be sending and receiving data from other computing nodes, while also calculating values in the interior or the boundary. In other words, the computing node may calculate the interior values in advance, prior to calculating the boundary values so that part of the calculations of the region does not need to wait for the data transfer from other computing nodes. The computing node may generate the interior output at least substantially simultaneous to receiving the neighbor output, and may also send the boundary output at least substantially simultaneous to generating the interior output in a future time step. By doing so, the time for data transfer may be hidden as the time for data transfer is overlapped with the time of receiving data from other computing nodes, thereby reducing the processing time. The data transfer time may no longer be a bottleneck to the overall simulation time.
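The per-time-step loop described above may be sketched with non-blocking message passing, for example using the mpi4py package. The sketch below is a minimal illustration, not the patented implementation: it simplifies to a single neighbor and assumes a hypothetical region object exposing halo_size, boundary(t), calculate_interior(t) and calculate_boundary(t, halo) as stand-ins for the inner areas and the user-defined function applied to them.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    def simulate(region, neighbor_rank, n_steps):
        halo = np.zeros(region.halo_size, dtype=np.float32)  # neighbor output buffer
        # Exchange the boundary values of the initial data (time step 0) first.
        send_reqs = [comm.Isend([region.boundary(0), MPI.FLOAT],
                                dest=neighbor_rank, tag=0)]
        for t in range(1, n_steps + 1):
            # Post the receive for the neighbor's output of step t-1, then
            # calculate the interior while that data is still in flight.
            recv_req = comm.Irecv([halo, MPI.FLOAT], source=neighbor_rank, tag=t - 1)
            region.calculate_interior(t)        # needs no data from neighbors
            recv_req.Wait()                     # neighbor output of step t-1 arrives
            region.calculate_boundary(t, halo)  # uses the received neighbor output
            # Send this step's boundary output without waiting for completion;
            # the transfer overlaps the interior calculation of step t+1.
            send_reqs.append(comm.Isend([region.boundary(t), MPI.FLOAT],
                                        dest=neighbor_rank, tag=t))
        MPI.Request.Waitall(send_reqs)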
Therefore, the quantity of computing nodes employed to process the simulation space may also be increased without affecting the overall simulation time duration.

FIG. 11 shows a sequence diagram 1100 of a user interface according to various embodiments. In a simulation method, the simulation process may be parallelized in multiple computing nodes. A parallel processing manager 301 may be provided to manage the parallelization. The parallel processing manager 301 may provide a unified interface, in other words, a front end interface to a user 101, even though the simulation is performed across a plurality of computing nodes at the back end. The parallel processing manager 301 may be configured to at least one of control parameters for the simulation, execute the simulation or provide access to results of the simulation. The parallel processing manager 301 may be configured to provide the user interface shown in FIG. 11. A user 101 may log into a computing node running the parallel processing manager 301. The user 101 may enter a command that brings up the user interface. The user interface may be part of the parallel processing manager 301. The user interface may not be restricted to being executed or displayed in the same computing node as the computing node that the user 101 logged into. The user interface or the parallel processing manager 301 may run on a remote computer or server, and may be connected to the other computing nodes and the log-in computing node. The connection may be provided, for example, via TCP/IP communication. The user interface may be configured to receive user inputs including setting of parameters 1102, request of execution 1106 and request of access to results 1110. If the user 101 executes a command for setting of parameters 1102, the parallel processing manager 301 may set the parameters and may send a completion notice 1104 on completion of the setting of parameters. The setting parameters may include information on the computing nodes. An example of the setting parameters is illustrated in FIG. 18. If the user 101 executes a command for a request of execution, the parallel processing manager 301 may execute the simulation as requested and may send a completion notice 1108 on completion of the simulation. Through the command, some parameters for execution may be designated. If a user 101 executes a command for a request of access to the results of the simulation, the parallel processing manager 301 may respond to the request with a file name 1112 or a pointer to the result. Through the command, some parameters for access to result and ranges of area and time may be designated.

FIG. 12 shows a hardware architecture for a simulation system 1200 according to various embodiments. The simulation system 1200 may include a master computing node 1220 and a plurality of worker computing nodes 210. The simulation system 1200 may further include a shared storage 220. The simulation system 1200 may also include a network 230 which may connect the master computing node 1220 to at least one of the worker computing nodes 210 or the shared storage 220. The master computing node 1220 may include a central processing unit (CPU) 201, a memory 202, a storage 203 and an input/output (I/O) adapter 204. Each worker computing node 210 may include a CPU 211, a memory 212, a storage 213 and an I/O adapter 214.
The master computing node 1220 may be similar to the worker computing node 210, in other words, the master computing node 1220 and the worker computing node 210 may be physically similar in terms of their composition or may be the same type of device, for example a computer server. The master computing node 1220 may communicate with other computing nodes such as the worker computing nodes 210, and the shared storage 220 through the I/O adapter 204. Each worker computing node 210 may communicate with other computing nodes including other worker computing nodes 210, and the shared storage 220 through the I/O adapter 214.
FIG. 13 shows a system diagram 1300 of the simulation system 1200. The system diagram 1300 shows the software and data modules of the simulation system 1200. The memory 202 in the master computing node 1220 may store a parallel processing manager 302. The parallel processing manager 302 may be the parallel processing manager 301 of FIG. 11. The memory 202 may further store a management table 303, parameters 304 and an operating system (OS) 301. The memory 212 in the worker computing node 210 may store a parallel processing worker 312. The memory 212 may further store an OS 311 and parameters 313. The shared storage 220 may store a user-defined function 321, a data access library 322, an initial data 323 and an output data 324. The user-defined function 321 may be the function that is to be run on the data points in the simulation space. The initial data 323 may be the data points in the simulation space before the simulation is run. The output data 324 may be the output of the simulation.
FIG. 14 shows a system diagram 1400 of the simulation system 1200 after execution of a simulation. The system diagram 1400 shows the software and data modules of the simulation system 1200. Once a simulation is executed, the user- defined function 321, the data access libraries 322 and the initial data 323 may be loaded to the memory 212 in the worker computing node 210. The memory 212 may also store simulated data 324. The simulated data 324 may be intermediate data from the simulation. The memory 212 may further store a management table 325 for the parallel processing worker 312.
FIG. 15 shows an example of a pseudo code 1500 of an interface for a user-defined function. A user 101 may designate the parameters 304 for variables of the interface. The parameters 304 may include a range of the simulation space and time. The simulation space may be three-dimensional, represented by dimensions x, y and z. Time may be represented by t. The user 101 may also designate a user-defined function 321. Once the parameters 304 and the user-defined function 321 are designated, the simulation system may allocate a memory space having a size that is the size of x times the size of y times the size of z times the size of t, as a virtual matrix. The allocated memory space may be accessed via data access libraries such as VMatrixRead and VMatrixWrite, inside the user-defined function 321. The user 101 may access the virtual matrix through the user interface shown in FIG. 11 by specifying the range of x, y, z, and t. The simulation may be executed in ascending or descending order in time, although the illustrative example in FIG. 15 only shows that the simulation is executed in ascending order. For each time step t, the simulation at the time step t may be represented as execution of the user-defined function 321 for all x, y and z.
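In executable terms, the allocation just described amounts to the following minimal sketch (the function name allocate_virtual_matrix is hypothetical; the 2-byte floating-point size matches the worked example given later in the description).

    import numpy as np

    def allocate_virtual_matrix(nx, ny, nz, nt):
        # Size of x times size of y times size of z times size of t values;
        # for the example space, 500 * 500 * 500 * 1000 * 2 bytes = 250 GB,
        # which is why the virtual matrix is spread across worker nodes.
        return np.zeros((nt, nx, ny, nz), dtype=np.float16)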
FIG. 16A shows an example 1600A of a first pseudo code of the user-defined function 321. The first pseudo code 1600A illustrates an abstract description of the user-defined function 321. The user-defined function 321 may be defined as a function of x, y, z and t. The user-defined function 321 may calculate a value by using a combination of values at arbitrary points (a, b, c, s) where s is less than t. The user-defined function 321 may set the value for a given point (x, y, z, t) in the simulation space. If the simulation is performed in descending order in time, the condition for s may be replaced by "s is greater than t".
FIG. 16B shows an example of a second pseudo code which illustrates an explicit example of the user-defined function 321. The user-defined function 321 in this example may be an implementation for the wave equation. Given x, y, z and t, the user-defined function 321 may read the value at (x, y, z, t-2) and the values of the neighborhood in directions of x, y and z at time step t-1 through the data access library. The user-defined function 321 may calculate the new values at time t based on the read values, i.e. the value at (x, y, z, t-2) and the values of the neighborhood in directions of x, y and z at time step t-1. The calculated new values may be set to (x, y, z, t) through the data access library 322. Through analyzing this pseudo code, it may be found that the value at a point (x, y, z) may be derived from the value at (x, y, z) at time step t-2 and may be further derived from values in the neighborhood at time step t-1. The neighborhood may refer to all (a, b, c) such that x-1 ≤ a ≤ x+1, y-1 ≤ b ≤ y+1, and z-1 ≤ c ≤ z+1. This implies that the value at (x, y, z) may not be affected by the value at points with distance more than 2 away from (x, y, z). The distance may refer to the distance along each axis, i.e. along the x-axis, the y-axis and the z-axis. The distance may depend on the form of the partial differential equation which the simulation executes, and the order of approximation. In this interface of the user-defined function 321, the user 101 may not specify a way to parallelize. The user 101 may not need to know about or provide inputs relating to the parallelization. The user 101 may only need to define the user-defined function 321 through the user interface, for example like in the pseudo codes shown in FIGS. 15 and 16.
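An executable rendering of this wave-equation example might look as follows. It is illustrative only: C is an assumed constant folding together the wave speed, grid spacing and time step, and vmatrix_read and vmatrix_write are the data access functions sketched after the FIG. 22 pseudo codes above.

    C = 0.1  # assumed constant combining wave speed, grid spacing and time step

    def user_defined_function(x, y, z, t):
        # New value from (x, y, z, t-2) and the axis neighbors at t-1,
        # as a standard second-order discretization of the wave equation.
        prev2 = vmatrix_read(x, y, z, t - 2)
        center = vmatrix_read(x, y, z, t - 1)
        laplacian = (vmatrix_read(x - 1, y, z, t - 1) + vmatrix_read(x + 1, y, z, t - 1)
                     + vmatrix_read(x, y - 1, z, t - 1) + vmatrix_read(x, y + 1, z, t - 1)
                     + vmatrix_read(x, y, z - 1, t - 1) + vmatrix_read(x, y, z + 1, t - 1)
                     - 6.0 * center)
        vmatrix_write(x, y, z, t, 2.0 * center - prev2 + C * laplacian)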
FIG. 17 shows a flowchart 1700 of a parallel processing manager 302 according to various embodiments. The execution flow shown in the flowchart 1700 may begin once a user 101 executes a command of the user interface of FIG. 11. In step S701, the parallel processing manager 302 may check if the command is a request for setting parameters. If the command is a request for settings, the next step may be S702 where the parallel processing manager 302 may receive the parameters from the user 101 through the user interface. The parameters may include information about the worker computing nodes. In step S702, the information about the worker computing nodes may be entered into the management table 303. After setting the parameters in the management table 303, the execution flow may proceed to end, i.e. the parallel processing manager 302 may finish the execution flow. The parameters may be specified directly in the command, for example through variables in the command. The parameters may also be provided through a file containing the parameters, for example, the command may include a variable indicating the name of the file. During the setting of the parameters, the state of the worker computing nodes may be set to "Idle".
If the command is not a request for setting, the parallel processing manager 302 may, in step S703, check whether the command is a request for execution of simulation. If the command is a request for execution of simulation, the parallel processing manager 302 may receive variables through the command in step S704. The variables may include the range of the simulation space (x, y, z), the range of time (t) and the user-defined function 321. The user-defined function 321 may be provided through a file. The variable indicating the user-defined function 321 may be the file name of the file containing the user-defined function 321. In step S704, the file name of the file containing the user-defined function 321 may be entered into the parameters 304. Next, the parallel processing manager 302 may fill up the rest of the variables in the management table 303 in step S705. The management table 303 is shown in FIG. 18. For illustration purposes, take for example a case where each of the range of x, the range of y and the range of z is from 0 to 499. In other words, in the example, the simulation space may be a cube with size 500^3. In this example, there may be 125 worker computing nodes 210. The parallel processing manager 302 may divide the cubic simulation space into 125 small cubes. Each small cube may also be referred herein as a region. Each small cube may have a size of 100^3, i.e. contain 100^3 points. The parallel processing manager 302 may assign the simulation task for each small cube to a respective worker computing node 210. A worker computing node 210 with node ID 1 may be assigned to a first small cube. The first small cube may be [0:99, 0:99, 0:99] of the simulation space. A worker computing node 210 with node ID 2 may be assigned to a second small cube. The second small cube may be [0:99, 0:99, 100:199] of the simulation space. The parallel processing manager 302 may continue to assign the remaining small cubes to the remaining worker computing nodes 210. The symbol [x_min:x_max, y_min:y_max, z_min:z_max] may denote the range of x, y, z. After setting the management table 303, the parallel processing manager 302 may proceed to step S706 to send a request to each parallel processing worker 312.
The request may be sent through a networking protocol, for example transmission control protocol/internet protocol (TCP/IP) communication. The request may include the range of x, y, z which is assigned to the parallel processing worker 312, the information of neighbor parallel processing workers 312, and the file name of the user-defined function 321. Neighbor parallel processing workers 312 may refer to the parallel processing workers in the neighbor worker computing nodes. The neighbor worker computing nodes may be worker computing nodes 210 that are processing a neighboring region of the simulation space, or in other words, an adjacent small cube. For example, using the abovementioned example, the first small cube containing [0:99, 0:99, 0:99] of the simulation space may be a neighboring region to the second small cube containing [0:99, 0:99, 100:199] of the simulation space. The first worker computing node processing the first small cube may be a neighbor worker computing node to the second worker computing node processing the second small cube. The parallel processing worker 312 of the first worker computing node may be a neighbor parallel processing worker 312 to the parallel processing worker 312 of the second worker computing node.
In step S707, the parallel processing manager 302 may wait to receive a notice of finishing simulation from each parallel processing worker 312. In other words, when each parallel processing worker 312 in the respective worker computing node completes its simulation task, it may notify the parallel processing manager 302. Every time the parallel processing manager 302 receives the notice, the parallel processing manager 302 may change the state of the worker computing node to "Done" in the management table 303. If the state of all parallel processing workers is "Done", the parallel processing manager 302 may update all of their states to "Idle" in the management table 303 and complete the execution flow.
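The division performed in step S705 may be sketched as follows. This is a minimal illustration with hypothetical names that reproduces the worked example above, a 500^3 space divided into 125 small cubes of 100^3 points each.

    def assign_small_cubes(extent=500, side=100):
        # Divide a cubic simulation space into small cubes and pair each
        # with a node ID, as in the management table 303 of FIG. 18.
        assignments = []
        node_id = 1
        for x0 in range(0, extent, side):
            for y0 in range(0, extent, side):
                for z0 in range(0, extent, side):
                    assignments.append({
                        "node": node_id,
                        "range": (x0, x0 + side - 1,
                                  y0, y0 + side - 1,
                                  z0, z0 + side - 1),
                    })
                    node_id += 1
        return assignments

    cubes = assign_small_cubes()
    print(len(cubes))         # 125 small cubes of 100^3 points each
    print(cubes[0]["range"])  # (0, 99, 0, 99, 0, 99)
    print(cubes[1]["range"])  # (0, 99, 0, 99, 100, 199)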
If the command is not a request for setting parameters and also not a request for execution of simulation, the parallel processing manager may check whether the command is a request for accessing the result of the simulation in step S708. If the command is a request for accessing result of simulation, the parallel processing manager 302 may receive variables through the command from the user 101 in step S709. The variables may include the range of simulation space (x, y, z) and the range of time (t) of the simulation for which the results are to be accessed. In the next step S710, the parallel processing manager 302 may search in the management table 303 for the worker computing node 210 which processed part of or all of the range designated by the variables. In step S711, the parallel processing manager 302 may request the relevant worker computing nodes 210 to send the corresponding results to the parallel processing manager 302. The request may be sent to the parallel processing workers 312 of the respective relevant worker computing nodes 210. The request may include the range of x, y, z, t, where the range of x, y, z may be inside the small cube assigned to the respective parallel processing worker 312. In step S712, the parallel processing manager 302 may wait to receive data from each parallel processing worker 312. Once the parallel processing manager 302 has received the data from all of the parallel processing workers 312, the parallel processing manager 302 may combine the data in step S713. The parallel processing manager 302 may format and return the combined data to the user 101 in step S714. To reduce the amount of data transfer between the master computing node 1220 and the worker computing nodes 210, each parallel processing worker 312 may save the requested data to one of its internal memory 212, storage 213, or shared storage 220 and then return the file name of the requested data to the parallel processing manager 302. The memory 212 may include a random-access memory disk. The parallel processing manager 302 may then return a set of file names to the user 101. The user 101 may access the requested data by retrieving the files saved in the internal memory 212, the storage 213 or the shared storage 220, by searching for the file names. The parallel processing manager 302 may finish the execution flow after completing the step S714. If the command is none of a request for setting parameters, a request for execution of simulation or a request for accessing result of simulation, the parallel processing manager 302 may proceed to end the execution flow.
FIG. 18 shows an example 1800 of a management table 303 for a parallel processing manager 302, according to various embodiments. The management table 303 may include a plurality of columns. The plurality of columns may include columns for the node identity (ID) 1802, the internet protocol (IP) address 1804, the state 1806, i.e. the status of the processing, the range of x 1808, the range of y 1810, and the range of z 1812. A simulation system may include multiple worker computing nodes 210. One row in the management table 303 may be assigned to each worker computing node 210. The node ID 1802 may be a unique number of the worker computing node 210; in other words, each worker computing node 210 may have its own unique node ID. The IP address 1804 may be used to access each worker computing node 210. The state 1806 may represent the state of execution of the simulation. The state 1806 may be "Idle", "Running", or "Done". The initial state may be "Idle". Once a parallel processing worker 312 is assigned the task of simulation, the state 1806 may be changed to "Running". When a parallel processing worker 312 finishes simulating, the state 1806 may be changed to "Done". When the state 1806 of all parallel processing workers 312 is "Done", the states 1806 may be changed to "Idle" again. The range of x 1808, the range of y 1810 and the range of z 1812 may represent the range of the simulation space, i.e. the minimum and maximum of the simulation space, which is assigned to each parallel processing worker 312, and which each worker computing node 210 should simulate.
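For concreteness, one row of such a management table might be represented as below; the class name WorkerRow and the field layout are assumptions for illustration, mirroring the columns 1802 to 1812.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class WorkerRow:
    """One row of the manager's management table 303 (FIG. 18);
    field names mirror the columns, the values below are illustrative."""
    node_id: int          # unique worker node ID (column 1802)
    ip: str               # IP address used to reach the node (1804)
    state: str            # "Idle" | "Running" | "Done" (1806)
    x: Tuple[int, int]    # (min, max) of x assigned to the node (1808)
    y: Tuple[int, int]    # (min, max) of y (1810)
    z: Tuple[int, int]    # (min, max) of z (1812)

# Example: a worker simulating the small cube [0:99, 0:99, 100:199].
row = WorkerRow(node_id=1, ip="192.0.2.11", state="Idle",
                x=(0, 99), y=(0, 99), z=(100, 199))
```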
FIG. 19 shows a flowchart 1900 of a parallel processing worker 312 according to various embodiments. The flowchart 1900 may show an execution flow of the parallel processing worker 312. The parallel processing worker 312 may execute the flow every time the parallel processing manager 302 sends a request to the parallel processing worker 312. The parallel processing worker 312 may check in step S901 whether the request is a request for execution of simulation. If the request is a request for execution of simulation, the parallel processing worker 312 may receive parameters from the parallel processing manager 302. The parameters may include the range of (x, y, z) and the range of t. The parameters may also include the file name of a user-defined function 321 and information on the neighbor worker computing nodes. The parallel processing worker 312 may set the range of time and the file name of the user-defined function 321 to the parameters 313 in step S902.
FIG. 20 shows a first management table 2000A and a second management table 2000B. The first management table 2000A and the second management table 2000B may be management tables 325 for a parallel processing worker 312 according to various embodiments. The first management table 2000A may include the range of x, y, z which is assigned to the parallel processing worker 312. The first management table 2000A may include a first column 2002 indicating the range of x, a second column 2004 indicating the range of y and a third column 2006 indicating the range of z. The first column 2002 may include a sub-column indicating the maximum value of x and another sub-column indicating the minimum value of x. The second column 2004 may include a sub-column indicating the maximum value of y and another sub-column indicating the minimum value of y. The third column 2006 may include a sub-column indicating the maximum value of z and another sub-column indicating the minimum value of z. The first management table 2000A may include a first row 2008 representing the total range of x, y, z assigned to the parallel processing worker 312. In other words, the total range may define the small cube. The small cube may be the region referred to in FIGS. 1 and 2. The first management table 2000A may include a second row 2010 representing the interior range of the total range. The interior range may be a subset of the total range.
The second management table 2000B shows the inner division of the small cube. In other words, the second management table 2000B shows how the small cube, i.e. the region, is divided internally. The second management table 2000B may include a first column 2012 indicating the divisions of the small cube, also referred to herein as inner regions; a second column 2020 indicating the starting address of the divisions; a third column 2022 indicating the neighbor node number, i.e. a worker computing node that is processing a neighboring small cube; and a fourth column 2024 indicating the internet protocol address of the neighbor node. The second management table 2000B may also include a plurality of rows, wherein each row represents a division within the small cube. For example, the plurality of rows may include a first row 2014 for the interior, a second row 2016 for the upper boundary, a third row 2018 for the lower boundary, a fourth row 2020 for the north boundary, a fifth row 2022 for the east boundary, a sixth row 2024 for the south boundary and a seventh row 2026 for the west boundary. As explained earlier, the calculation of a point at a time step may depend on the values of its neighborhood at earlier time steps. If a point is near the boundary of the small cube assigned to the parallel processing worker 312, the calculation of the point may require values from outside of the small cube. Areas outside of the small cube may be assigned to other parallel processing workers 312. The area where the calculation of a point depends only on values inside the small cube may be referred to herein as the interior. The upper boundary may be defined to be the area complement of the interior, from outside of the maximum of z of the interior up to the maximum of z in the small cube. The lower boundary may be defined to be the area complement of the interior, from outside of the minimum of z of the interior down to the minimum of z in the small cube. Similarly, the north boundary, the east boundary, the south boundary and the west boundary may be defined to be areas which are the complement of the interior and which include the areas with the maximum of y, the maximum of x, the minimum of y, and the minimum of x, respectively. Each boundary may have a unique neighbor cube. For example, the upper boundary may be adjacent to the small cube assigned to the parallel processing worker 312 with node ID 86, and vice versa. The parallel processing manager 302 may send the information of the neighbor parallel processing workers 312 to each parallel processing worker 312 when execution of the simulation is requested.
The parallel processing worker 312 may fill in the first management table 2000A. The parallel processing worker 312 may set the first row 2008, indicating the total range of x, y, z assigned to the parallel processing worker 312, as designated by the variables received from the parallel processing manager 302. In the example shown, the range is set to [300:399, 200:299, 100:199]. The parallel processing worker 312 may also fill in the second management table 2000B. The parallel processing worker 312 may determine the information required for the third column 2022 and the fourth column 2024. The parallel processing worker 312 may analyze the user-defined function 321, in other words, step S904 of the flowchart 1900. The parallel processing worker 312 may determine the range where the calculation depends only on values inside the small cube and set that range to the second row 2010 of the first management table 2000A, i.e. specify the interior in the first management table 2000A for the worker computing node 210. In this case, since values at a distance of more than 2 do not affect the calculation, the range of the interior may be set to [301:398, 201:298, 101:198]. The parallel processing worker 312 may determine the size of the interior and of each boundary from the range of x, y, z, in other words, step S905 of the flowchart 1900. The parallel processing worker 312 may fill in the second management table 2000B for the worker computing node 210. The parallel processing worker 312 may set the starting address for each inner area in the second column 2020 and may allocate memory according to the determined size. In this example, the required size for the interior may be the product of the range of x, the range of y, the range of z, the size of a floating-point value and the range of time steps. The required size for the interior may be equal to 98 × 98 × 98 × 2 × 1000 bytes, which may be almost 1.8 GB. By similar calculations, each boundary may require about 40 MB. Then the parallel processing worker 312 may set the second column 2020 indicating the starting addresses as shown in the second management table 2000B and also in FIG. 21.
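The derivation of the interior range and the memory sizing of steps S904 and S905 can be illustrated with a short sketch. The halo width is a parameter of the sketch; halo=1 reproduces the interior range given above, and the 2-byte value size and 1000 time steps follow the worked example. The function and variable names are assumptions for illustration.

```python
def interior_range(total, halo):
    """Shrink the assigned (min, max) range on each axis by the stencil
    reach, so every interior point depends only on values inside the cube."""
    return {axis: (lo + halo, hi - halo) for axis, (lo, hi) in total.items()}

total = {"x": (300, 399), "y": (200, 299), "z": (100, 199)}
print(interior_range(total, halo=1))
# -> {'x': (301, 398), 'y': (201, 298), 'z': (101, 198)}, as in the text

# Step S905 sizing, with the 2-byte values and 1000 time steps of the example:
nx = ny = nz = 398 - 301 + 1               # 98 interior points per axis
interior_bytes = nx * ny * nz * 2 * 1000
print(interior_bytes)                      # 1_882_384_000, i.e. almost 1.8 GB
```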
FIG. 21 shows a conceptual diagram 2100 of a memory space for simulated data, according to various embodiments. Inside the memory space allocated to each area, the data at a time step may be saved next to, in other words adjacent to, the data at the previous time step. In another implementation, a memory space may first be allocated to each time step and the memory space for each time step may be divided into interior and boundaries. If the simulation does not require the results of all time steps, this implementation may reduce memory consumption, as the parallel processing worker 312 may reuse the memory space allocated to earlier time steps. After allocating memory, in other words, step S906 of the flowchart 1900, the parallel processing worker 312 may start to simulate the user-defined function 321. Until the simulation reaches the end of the simulation time, i.e. step S907, the parallel processing worker 312 may continue the loop of steps S908 to S911. In the loop, the parallel processing worker 312 may calculate all points (x, y, z) in the interior at a time step by executing the user-defined function 321 with variables (x, y, z, t) in step S908. After calculating all points in the interior, the parallel processing worker 312 may wait to receive the data of the previous time step t-1 from the neighbor worker computing nodes in step S909. Once the parallel processing worker 312 receives the data of the previous time step t-1 from the neighbor worker computing nodes, the parallel processing worker 312 may save the received data to the corresponding area and may calculate all points (x, y, z) in the boundary areas at the time step by executing the user-defined function 321 with variables (x, y, z, t) in step S910. Once the parallel processing worker 312 finishes calculating in the boundary areas, the parallel processing worker 312 may send the results of the calculation for the boundaries to the neighbor worker computing nodes in step S911. To send the result of each boundary to the corresponding worker computing node, the parallel processing worker 312 may search for the worker computing node and its IP address in the second management table 2000B. If the simulation reaches the end of the simulation time, the parallel processing worker 312 may send the notice of finishing simulation to the parallel processing manager 302 and finish the flow. If the request is not for execution of simulation, the parallel processing worker 312 may check whether the request is for accessing the result of the simulation in step S913. If the request is for accessing the result of the simulation, the parallel processing worker 312 may receive parameters including the range of (x, y, z) and (t) in step S914. The parallel processing worker 312 may search the second management table 2000B for the areas containing the designated range in step S915. The parallel processing worker 312 may read the simulated data from one or multiple areas and format the data into one continuous block of data in step S916. The parallel processing worker 312 may send the formatted data to the parallel processing manager 302 in step S917. In another implementation, the parallel processing worker 312 may save the data to a RAM disk in the memory 212, the local storage 213, or the shared storage 220, and may send the file name to the parallel processing manager 302 instead of sending the data itself, in order to reduce the amount of data transfer. If the request is not for accessing the result of the simulation, the parallel processing worker 312 may finish the flow.
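The loop of steps S907 to S911 amounts to a halo-exchange time-stepping scheme: the interior, which needs no remote data, is computed first, and the boundaries are computed once the neighbor data for the previous time step has arrived. A minimal sketch follows, with the communication abstracted behind caller-supplied callables; this abstraction is an assumption, since the embodiments only specify that data of time step t-1 is exchanged between neighbors.

```python
def run_worker_loop(interior_pts, boundary_pts, t_end, user_fn,
                    recv_halos, send_boundaries, store):
    """One worker's simulation loop (steps S907 to S911).

    user_fn(x, y, z, t)          -- the user-defined function 321
    recv_halos(t)                -- blocks until neighbor data for step t
                                    has arrived (step S909)
    send_boundaries(t, results)  -- ships boundary results to the neighbor
                                    worker computing nodes (step S911)
    store                        -- dict standing in for the virtual matrix
    """
    for t in range(1, t_end + 1):                      # S907: time loop
        for (x, y, z) in interior_pts:                 # S908: interior first;
            store[(x, y, z, t)] = user_fn(x, y, z, t)  # needs no remote data
        store.update(recv_halos(t - 1))                # S909: wait for t-1 halos
        results = {}
        for (x, y, z) in boundary_pts:                 # S910: boundary points
            results[(x, y, z, t)] = user_fn(x, y, z, t)
        store.update(results)
        send_boundaries(t, results)                    # S911: send to neighbors
```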
FIG. 22 shows pseudo code of data access library functions used to access the virtual matrix. These functions may be used to access the virtual matrix from the user-defined functions 321. They may be executed on each worker computing node. The first pseudo code 2200A may be a function to read from the virtual matrix. Given a point and time step (x, y, z, t), the function may search the management table for the area containing the point. The function may read the starting address of the searched area and may calculate the address for (x, y, z, t). The function may read the value from the address and may return it. The second pseudo code 2200B may be a function to write a value to the virtual matrix. Given a point and time step (x, y, z, t) and a value, the function may search the management table for the area containing the point. The function may read the starting address of the searched area and calculate the address for (x, y, z, t). The function may write the value to the address.
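Rendered as runnable Python, the two functions of FIG. 22 might look as follows. The flat per-area buffer and the address arithmetic are assumptions consistent with the contiguous per-time-step layout of FIG. 21, and find_area stands in for the management-table search; the actual disclosed pseudo code is only sketched by these names.

```python
def find_area(table, x, y, z):
    """Management-table lookup: return the area whose (min, max) ranges
    contain the point. Each entry is assumed to carry its axis ranges
    and a flat backing buffer."""
    for area in table:
        if all(area[axis][0] <= v <= area[axis][1]
               for axis, v in (("x", x), ("y", y), ("z", z))):
            return area
    raise KeyError((x, y, z))

def address(area, x, y, z, t):
    """Flat offset of (x, y, z, t) inside the area's buffer: the data of one
    time step is stored contiguously, step after step (FIG. 21)."""
    nx = area["x"][1] - area["x"][0] + 1
    ny = area["y"][1] - area["y"][0] + 1
    nz = area["z"][1] - area["z"][0] + 1
    i, j, k = x - area["x"][0], y - area["y"][0], z - area["z"][0]
    return ((t * nz + k) * ny + j) * nx + i

def read_virtual_matrix(table, x, y, z, t):
    area = find_area(table, x, y, z)       # search the management table
    return area["buffer"][address(area, x, y, z, t)]

def write_virtual_matrix(table, x, y, z, t, value):
    area = find_area(table, x, y, z)
    area["buffer"][address(area, x, y, z, t)] = value
```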
While embodiments of the invention have been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced. It will be appreciated that common numerals, used in the relevant drawings, refer to components that serve a similar or the same purpose.

Claims

1. A simulation method comprising:
providing a region, the region being part of a simulation space;
processing the region in a computing node, wherein the processing comprises:
dividing the region into an interior and a boundary;
generating an interior output based on the region;
receiving a neighbor output comprising output generated based at least partially on another region of the simulation space at an earlier time step;
generating a boundary output based at least partially on the received neighbor output;
sending the boundary output to another computing node, the another computing node processing the another region.
2. The simulation method of claim 1, wherein providing the region comprises dividing the simulation space into a plurality of regions.
3. The simulation method of claim 2, further comprising processing each region of the plurality of regions in a respective computing node of a plurality of computing nodes.
4. The simulation method of claim 3, wherein the plurality of computing nodes are configured to process the plurality of regions at least substantially concurrently.
5. The simulation method of claim 1, wherein the another computing node generates a boundary output of the another region in a future time step, based at least partially on the sent boundary output.
6. The simulation method of claim 1, wherein generating the interior output takes place at least substantially simultaneous to receiving the neighbor output.
7. The simulation method of claim 1, wherein sending the boundary output takes place at least substantially simultaneous to generating the interior output in a future time step.
8. The simulation method of claim 1, further comprising:
receiving a user-defined function,
wherein the dividing the region comprises determining inputs of the user-defined function.
9. The simulation method of claim 8, wherein dividing the region is based on a furthest distance between the determined inputs of the user-defined function.
10. The simulation method of claim 1, further comprising:
providing a user interface configured to receive user inputs comprising at least one of information on the computing node, a request to execute the simulation, or a request to access results of the executed simulation.
11. The simulation method of claim 10, wherein the request to execute the simulation comprises a range of the simulation space to be processed and a range of time steps to be processed.
12. The simulation method of claim 10, wherein the request to access the results of the executed simulation comprises a range of the simulation space to be accessed and a range of time steps to be accessed.
13. The simulation method of claim 1, wherein the another region is adjacent to the region.
14. A computing node configured to process a region, the computing node comprising:
a processing unit configured to divide the region into an interior and a boundary, the region being part of a simulation space,
wherein the processing unit is further configured to generate an interior output based on the region;
a communication unit configured to receive a neighbor output, the neighbor output comprising output generated based at least partially on another region of the simulation space at an earlier time step,
wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output;
wherein the communication unit is further configured to send the boundary output to another computing node processing the another region.
15. A simulation system comprising:
a divider configured to divide a simulation space into a plurality of regions; and
a plurality of computing nodes, each computing node of the plurality of computing nodes configured to process a respective region of the plurality of regions;
wherein a computing node of the plurality of computing nodes comprises: a processing unit configured to divide a region of the plurality of regions into an interior and a boundary, and further configured to generate an interior output based on the region; and
a communication unit configured to receive a neighbor output, the neighbor output comprising output generated based at least partially on another region of the plurality of regions at an earlier time step,
wherein the processing unit is further configured to generate a boundary output based at least partially on the received neighbor output; and wherein the communication unit is further configured to send the boundary output to another computing node of the plurality of computing nodes processing the another region.
PCT/MY2016/050043 2016-08-03 2016-08-03 Simulation methods, computing nodes and simulation systems for parallel execution WO2018026261A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/MY2016/050043 WO2018026261A1 (en) 2016-08-03 2016-08-03 Simulation methods, computing nodes and simulation systems for parallel execution

Publications (1)

Publication Number Publication Date
WO2018026261A1 (en)

Family

ID=57288481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2016/050043 WO2018026261A1 (en) 2016-08-03 2016-08-03 Simulation methods, computing nodes and simulation systems for parallel execution

Country Status (1)

Country Link
WO (1) WO2018026261A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011100002A1 (en) * 2010-02-12 2011-08-18 Exxonmobil Upstream Research Company Method and system for partitioning parallel simulation models
EP2608084A1 (en) * 2011-12-22 2013-06-26 Airbus Operations S.L. Heterogeneous parallel systems for accelerating simulations based on discrete grid numerical methods
US20160171131A1 (en) * 2014-06-18 2016-06-16 The University Of North Carolina At Chapel Hill Methods, systems, and computer readable media for utilizing parallel adaptive rectangular decomposition (ard) to perform acoustic simulations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
J. E. KILLOUGH ET AL: "Parallelization of a General-Purpose Reservoir Simulator", SEG/SAN ANTONIO 2007 ANNUAL MEETING, 3 September 1996 (1996-09-03), pages 3 - 6, XP055363811, ISSN: 2214-4609, ISBN: 978-90-73-78153-5, DOI: 10.3997/2214-4609.201406864 *
P.I. CRUMPTON ET AL: "Parallel Computing Using a Commercial Reservoir Simulator", ECMOR VIII - 8TH EUROPEAN CONFERENCE ON THE MATHEMATICS OF OIL RECOVERY, 3 September 2002 (2002-09-03), pages 1 - 8, XP055363733, DOI: 10.3997/2214-4609.201405950 *

Legal Events

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 16795171; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 16795171; Country of ref document: EP; Kind code of ref document: A1)