WO2015102484A1

WO2015102484A1 - Method for generating a real-time haptic feedback signal for a haptic device of a virtual surgery simulator and virtual surgery simulator

Info

Publication number: WO2015102484A1
Application number: PCT/NL2013/050963
Authority: WO
Inventors: Cecill Edward ETHEREDGE; Anthonius Johannes Bernardus SANDERS; Oguz METEER; Berry KLOMP; Eduard Eelco KUNST
Original assignee: Vrest B.V.
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2015-07-09

Abstract

The invention relates to a method for generating a real-time haptic feedback signal for a haptic device of a virtual surgery simulator, a computer program for executing the method and corresponding data storage medium, and a virtual surgery simulator. The method comprises - generating a physical model of an object using the virtual surgery simulator; - updating the physical model of the object on the basis of an interaction between the object and a virtual tool associated with the haptic device; - performing a calculation of the force the object exerts on the virtual tool; and - generating the haptic feedback signal on the basis of the calculated force, wherein the force is calculated using a parallel computing device.

Description

METHOD FOR GENERATING A REAL-TIME HAPTIC FEEDBACK SIGNAL FOR A

HAPTIC DEVICE OF A VIRTUAL SURGERY SIMULATOR AND VIRTUAL SURGERY SIMULATOR The invention relates to a method for generating a real-time haptic feedback signal for a haptic device of a virtual surgery simulator. Virtual surgery simulators are emerging as a training method for medical specialists and are expected to provide a virtual environment that is realistic and responsive enough to be able to physically simulate a wide variety of medical scenarios.

Haptic interaction with the environment requires an underlying physical model that is dynamic, deformable and computable in real-time at very high frame rates.

Virtual surgery simulators simulate an anatomy of a human (e.g. a patient) or animal and support an extensive degree of user interaction. For a medical training scenario that is simulated, the environment must be accurate enough in order to be used for extensive practice by medical specialists. A virtual surgery simulator may include a haptic device which represents a controllable virtual surgical tool within the simulator. For examples, a user may select from a number of virtual tools ranging from simple probes to scalpels. These tools are able to interact and physically deform the virtual object, e.g. the virtual patient. Output of such a system is provided visually and through the haptic device. In others words, the user can see and feel the interactions with the virtual object. Haptic perception of the virtual environment and corresponding visual cues in response to the interaction of the tool are important. They serve as the main factor in the simulator's perceived realism and are expected to face close scrutiny from field experts, i.e. medical professionals.

In order to translate the above idea of the ideal simulator into practice, the virtual environment has to be represented by an underlying dynamic physical model that supports this kind of interaction, while its computation must be fast enough to support real-time I/O with haptic feedback devices. In particular, the simulation must run at a fast-enough rate to support real-time I O with haptic devices and allow smooth haptic interaction. These haptic devices contain actuators that apply a set of forces to the user' s hand and generally require a force to be streamed real-time at a minimum rate of 1000 Hz, with lower data streaming rates resulting in uncontrollable oscillation and irregular actuation. Conventional virtual surgery simulators are not able to provide real-time haptic feedback at the required rate, in particular not in combination with high resolution graphics.

An object of the invention is to solve the above problems and to provide a method for generating a real-time haptic feedback signal for a haptic device of a virtual surgery simulator which enables generating the haptic feedback signal at sufficiently high rates such that smooth and realistic haptic interaction is provided.

This goal is achieved using the method according the invention, comprising:

- generating a physical model of an object using the virtual surgery simulator; - updating the physical model of the object on the basis of an interaction between the object and a virtual tool associated with the haptic device;

- performing a calculation of the force the object exerts on the virtual tool; and

- generating the haptic feedback signal on the basis of the calculated force,

wherein the force is calculated using a parallel computing device.

A parallel computing device is a device that can perform calculations simultaneously, in contrast to conventional computational devices such as CPUs which perform calculation in series, i.e. sequentially. For example, the parallel computing device may comprise one or more GPU and/or one or more multi-core processor. For example, the parallel computing device is a computer equipped with both a CPU and a GPU. In another example, the parallel computing device comprises a computer cluster.

By calculating the force using the parallel computing device, the calculation can be performed at a high rate. Therefore, real-time realistic haptic feedback is provided.

In a preferred embodiment, the physical model is a mass-spring model.

A mass-spring model, is a volumetric model wherein the volumes of objects are modelled as a set of point masses interconnected by many elastic springs that follow general physical laws. A mass-spring model does not require complex mathematical operations, while providing a realistic simulation of the virtual object. In one type of conventional virtual surgery simulator a finite-elements method (FEM) is employed. However, FEM comes at a high computational expense.

In a preferred embodiment, the mass-spring model comprises N position vectors X;(t₀), each defining the position of a mass M; at a first time instance t₀, wherein an iteration over all masses M; is performed comprising the following operations for each mass M_;:

- retrieve the index j of each of K; masses M_j neighbouring mass M_;;

- for each neighbouring mass M

- retrieve the position vector Xj(t₀) defining the position of neighbouring mass M_j at the first time instance;

- calculating the length yy of a spring Sy connecting mass M; and the neighbouring mass M_j on the basis of the position vectors X;(t₀) and Xj(t₀);

- calculating a spring force vector Fy on the basis of the calculated length yy;

- summing the spring force vectors Fy of all neighbouring masses M_j of mass M; to obtain a total force vector F;(t₀) defining the total force exerted on mass M; at first time instance t₀;

- computing the position vector X;(t₀ + At) defining the position of mass M; at a second time instance t₀ + At on the basis of the total force vector F_;.

In a further preferred embodiment, the force the object exerts on the virtual tool is calculated on the basis of the total force vectors F;(t₀) of the masses in contact with the virtual tool. The masses in contact with the virtual tool are for example determined by determining the masses closest to the outer surface of the virtual tool and/or by determining the intersection between the virtual tool and the masses.

In a further preferred embodiment, the iteration over all masses M; is performed in parallel using the parallel computing device.

The method according to the invention is scalable. If the number of masses in the model is increased, the number of calculations to be performed increases as well. In conventional systems, i.e. employing serial processors, this will always lead to an increase in computation time. In contrast, in the invention the calculations are performed in parallel using at least one parallel processing unit. Of course, the total computation time is still limited by the specific hardware on which the method of the invention is performed. For example, a given parallel computing device has a maximum number of threads which can run in parallel. However, this maximum can be increased by upgrading the hardware. For example, an additional parallel processing unit is added, thereby increasing the maximum number of parallel threads. Furthermore, the method of the invention profits from the developments that parallel processing devices are able to handle increasingly more threads. The method according to the invention avoids race conditions, as at a given time instance t₀ the calculation for each mass is independent of the calculations on the other masses. For example, the input data for the calculations performed for mass Mi is: the position vector Xi(t₀) and the positions vectors X2(to), X3(t₀), x₄(t_o) and Xs(t₀) of neighbouring masses M₂, M₃, M₄ and M₅, while the output data of the calculation for mass mi is Xi(t₀+At) and Fi(t₀). In this example only 4 neighbouring masses are considered. However, the number of neighbouring masses may vary, and may even be different for different masses M_;. Preferably, 0 < K; < 26.

Therefore, the need for synchronization between the parallel computations is minimized. Therefore, the time required for completing the calculations over all masses is minimized.

For example, the iteration is parallelized such that the calculation for each mass M; is run in a separate thread on a parallel processing unit, such as a GPU or multi-core processor.

The above description relates to the calculation for a single time step At. In other words, the above description explains how the physical model advances from t₀ to ti = t₀ + At.

Subsequently, the calculation may be performed once more, now with ti as starting point and obtaining the physical model at time ti + At, and so on. The time step At may be kept constant or be variable. Preferably, the time step At is selected corresponding to the desired rate of the haptic feedback signal. For example, in case a haptic feedback signal rate of at least 1000 Hz is desired, the time step is chosen as At < 1 ms.

In a further preferred embodiment, a feedback force array Ff_k and a feedback force array index k are defined, wherein, in the iteration, for each mass M; in contact with the virtual tool the total force vector F;(t₀) of that mass M; is copied to the feedback force array Ff_k at index k and subsequently the feedback force array index k is incremented using an atomic increment, and the force the object exerts on the virtual tool is calculated by summing the feedback force array Ff_k.

In a conventional approach, a global feedback force vector would be defined. This vector represents the force exerted on the virtual tool by the objects in contact with the virtual tool. The fact that the feedback force vector is global indicates that all parallel threads can read and write to the feedback force vector. For each mass M; in contact with the virtual tool, the total force vector F;(t₀) would be added to the feedback force vector. When the parallel computations have been completed, the feedback force vector has accumulated all forces of the masses in contact with the virtual tool and can be used for generating the haptic feedback signal. However, this conventional approach has the disadvantage that synchronizations between the threads is required to update the feedback force vector. Updating the feedback force vector requires a) reading the current value of the feedback force vector (requiring synchronization) and b) writing the updated value to the feedback force vector (also requiring synchronization). This naive approach will thus result in threads waiting for each other to complete, therefore increasing the time necessary to perform the calculations over all masses.

The approach according to the invention on the other hand does not require read and write synchronization. The only synchronization necessary is the atomic increment, which is a relatively fast instruction provided for by parallel computing hardware. The atomic increment is orders of magnitudes faster than the atomic locks as discussed for the conventional approach.

For example, a mass M₃ is in contact with the virtual tool. The thread corresponding to this mass reads the global index k, e.g. k = 0, and writes the calculated force F₃(t₀) to the array Fc at position 0, i.e. Fc₀ = F₃(t₀). Subsequently, the thread increments k to k = 1, using the atomic increment instruction provided by the hardware. Continuing the example, a thread for another mass M₂o in contact with the virtual tool reads the global index k, now having the value k = 1. The thread then writes the calculated force F₂o(to) to the array Fc at position 1, i.e. Fcj = F₂o(to).

Subsequently, the thread increments k to k = 2, using the atomic increment instruction provided by the hardware. It is noted that the order of the elements in the array Fc will be arbitrary. In the example above, it would also be possible that Fc₀ = F2o(t₀) and Fcj = F₃(t₀), as it is not known beforehand which thread will be the first to write the total force vector into the feedback force array. However, as in the end only the sum of the array is of relevance, the order of the elements in the array is irrelevant.

In preferred embodiment, in case the total force vector F;(t₀) of a mass M; is smaller than a predetermined force threshold the position vector x_;(t₀ + At) defining the position of a mass M; at the second time instance is set to be equal to the position vector X;(t₀) defining the position of said mass M; at the first time instance.

This reduces the time required for performing the calculations. In a preferred embodiment, computing the position vector X;(t₀ + At) defining the position of mass Mi at a second time instance t₀ + At comprises performing an integration of second order or higher, preferably a Verlet integration.

The relation between acceleration of a mass M; and the force exerted on said mass is given by Newton's second law:

Μ;¾ = Ρ; (1),

wherein a_; is the acceleration vector and F; is the force vector. To obtain the position of a mass due to the current force, a numerical integration method is performed. Second-order Verlet integration for the mass-spring system may for example be defined as:

x;(t₀ + At) = 2*Xi(to) - Xi(to - At) + 0,5*a;(t₀)(At)² (2),

wherein

At is the (constant) time step

X;(t₀ - At) is the position vector at the previous time instance, tO - At

X;(t₀) is the position vector at the current time instance, t₀

X;(t₀ + At) is the position vector at the next time instance, t₀ + At

a;(t₀) is the acceleration vector at the current time instance, t₀

and

a_i(to) = F_i(to) / m_i (3).

This last term reduces to a;(t₀) = F;(t₀) when choosing η¾ = 1 for all i.

By using a second order integration, the error in the outcome stays second-order. It is possible to reduce the error even further by using the higher-order Runge-Kutta integration method, though at the cost of significantly more memory space and reduced performance.

Depending on the hardware of the system this may or may not be desirable. By using a second- order integration, performance is guaranteed, while accuracy is sufficient for a realistic simulation.

In a preferred embodiment, the operation performed on each mass M; further comprises retrieving spring data for each of the springs Sy connecting the mass M; to its neighbouring masses M_j, wherein the spring data comprises spring properties, preferably the rest length y°y and the spring constant ky.

This embodiment enabling setting different spring properties for parts of the virtual object. For example, the force Fy between mass M; and M_j is calculated using Hooke's law as:

fy = *y ( |yy | - yy) j¾ (4),

wherein

Fy = the force of the spring Sy connecting masses M_; and M_j

ky = the spring constant of spring Sy

yy = X;(t₀) - X_j(to), i.e. a vector from mass M; to M_j at time instance t₀ ly l = the current length of the spring, i.e. the distance between masses M; and M_j at t₀ y°i_j = the rest length of the spring

and

- the direction of the force F of the spring S on mass M; at t₀

In a further preferred embodiment, retrieving spring data comprises determining the spring properties on the basis of material data defining material properties for the corresponding mass M;.

In the above example, a spring constant k and rest length y°y had to be defined for each spring, i.e. for all connections between mass Mj and its neighbouring masses M_j. Typically, a mass M; will have up to K = 26 neighbours. Therefore, N * 26 * 2 values have to be stored in memory to define the spring properties, wherein N is the number of masses M_;. This memory is reduced by linking spring properties to material properties of the masses. According to the embodiment of the invention, only an index representing a material has to be stored for each mass, instead of up to 26*2 spring properties per mass.

In a preferred embodiment, the parallel computing device comprises a GPU. For example, the parallel computing device is a computer comprising at least one GPU. The computer may for example further comprise a CPU.

For example, the parallel computation is performed using a number of GPUs.

In a preferred embodiment, the method further comprises displaying the physical model of the object on a display on the basis of the position vectors.

In a preferred embodiment the step of generating and updating the physical model is performed using a first GPU and the step of displaying the physical model is performed using a second GPU.

The first GPU is dedicated to perform the calculations for updating the physical model and providing the haptic feedback, while the second GPU is dedicated to convert the data from the physical model into graphics.

According to the invention, more than one GPU may be used to update the physical model and/or more than one GPU may be used to display the model.

In a further preferred embodiment, the method comprises sending from the first GPU to the second GPU display data for initialising displaying the physical model; and subsequently sending from the first GPU to the second GPU only the changes in the display data.

In other words, only data corresponding to changes in the physical model is communicated from the first GPU to the second GPU. It is noted that the rate of data transfer of the first GPU to the second GPU need not be the same as the rate of the haptic feedback signal. In general the display data transfer rate will be lower, e.g. 30-60 Hz, as the required frame rate for smooth visual display will generally be in that range. In one embodiment, the first GPU sends display data to the second GPU after updating the physical model for N > 1 time instances. For example, the first GPU calculates the position vectors for 15 incremental time steps and only then sends the corresponding changes in the display data to the second GPU.

The display data may for example comprises the position vectors x_; or data derived from the position vectors.

In a further preferred embodiment, displaying the physical model of the object further comprises:

- determining the masses M; which have less than a predetermined number Z of

neighbouring masses M_j; and

- displaying the physical model on the basis of the position vectors X; of said determined masses.

Only the outer surfaces have to be displayed, i.e. it is not necessary to perform rendering for the inner volume of the physical model. The method determines which masses correspond to an outer surface by determining which masses have less than Z neighbouring masses. For example, in a model wherein masses have a maximum number of K = 26 neighbours, Z may for example be chosen as Z = 26. In other words, the method determines the masses having less than the maximum number of springs attached.

In a further preferred embodiment, displaying the physical model of the object further comprises applying a smoothing operation on the position vectors x_;.

For example, the smoothing operation comprises averaging, low pass filtering or Gaussian smoothing.

In a preferred embodiment, the haptic feedback signal is generated at a frequency of at least 100 Hz, preferably at least 250 Hz, more preferably at least 500 Hz and most preferably at least 1000 Hz.

In a preferred embodiment a double buffer is used for input from the haptic device to the parallel computing device and output from the parallel computing device to the haptic device.

This double buffer strategy requires very little synchronization and lowers execution stalling.

In a preferred embodiment, the method further comprises the step of communicating data of the physical model with an e-learning environment.

The visual surgery simulation is preferably combined with an e-learning environment. For example, the e-learning environment defines lessons comprising theory and exercises. The theory may for example comprise text and multimedia, such as photos, videos and sound. The exercises may for example comprise a description of the task to be performed by the user and a link to the virtual surgery simulation. When the virtual surgery simulator is launched from the e-learning environment, a virtual object corresponding to the current task is loaded. For example, the method comprises evaluating the actions performed by the user in the virtual surgery environment. For example, a score is assigned to the performance of the user in the virtual surgery environment. This score can be communicated with the e-learning environment, e.g. to list acquired skills or to track the progress of the user.

In a preferred embodiment, the method further comprises storing a snapshot of the physical model at a selected time instance and/or an animation of the dynamics of the physical model over a selected time frame.

Preferably, the snapshot or animation is communicated to an e-learning environment. This enables the user to replay tasks performed by the user in the simulation environment or to discuss snapshots or animations with colleagues. For example, the snapshot is stored in an image format enabling viewing the image on a computer, smartphone and/or tablet. For example, the animation is stored in a video file format enabling replaying the video on a computer, smartphone and/or tablet. In another example, the snapshot comprises a snapshot of the physical model, e.g. the mass- spring model. This enables the use of the snapshot as a starting point for the simulator. For example, a simulation can be "paused" and saved for continuing the simulation later. In another example, if a mistake is made, the user can revert the simulation back to a snapshot corresponding to a moment prior to the mistake. In another example, the snapshot comprises a 3D model, e.g. a surface or volumetric model, which can be used in the e-learning environment.

The invention further relates to a computer program which, when executed on a computer, executes the steps of the method as described above and a computer-readable data storage medium having embodied thereon said computer program. Further, the invention relates to a virtual surgery simulator. The same features, advantages and effects apply to such a program, data storage medium and virtual surgery simulator as described in this application with respect to the method according to the invention.

In an embodiment, the virtual surgery simulator according to the invention comprises:

- a parallel computing device arranged to generate a physical model of an object;

- a haptic device for providing input from the haptic device to the computing device and providing a haptic feedback signal from the computing device to the haptic device, wherein a virtual tool is associated with the haptic device;

wherein the computing device is arranged to update the physical model on the basis of an interaction between the object and the virtual tool, to perform a calculation of the force the object exerts on the virtual tool and to generate the haptic feedback signal on the basis of the calculated force.

In a preferred embodiment, the physical model is a mass-spring model. In a preferred embodiment, the mass-spring model comprising N position vectors X;(t₀), each defining the position of a mass M; at a first time instance t₀, wherein the parallel computing device is arranged to perform the following operations for each mass M_;:

- retrieve the index j of each of K; masses M_j neighbouring mass M_;;

- for each neighbouring mass M

— retrieve the position vector Xj(t₀) defining the position of neighbouring mass M_j at the first time instance;

- computing the position vector X;(t₀ + At) defining the position of mass Mi at a second time instance t₀ + At on the basis of the total force vector F_;.

In a preferred embodiment, the parallel computing device is arranged to calculate the force the object exerts on the virtual tool on the basis of the total force vectors F;(t₀) of the masses in contact with the virtual tool.

In a preferred embodiment, the parallel computing device is arranged to perform the iteration over all masses M; in parallel.

In a preferred embodiment, the parallel computing device is arranged to define a feedback force array Ff_k and a feedback force array index k, and to copy, in the iteration over all masses M_;, for each mass M; in contact with the virtual tool the total force vector F;(t₀) of that mass M to the feedback force array Ff_k at index k and to subsequently increment the feedback force array index k using an atomic increment, and to calculate the force the object exerts on the virtual tool by summing the feedback force array Ff_k.

In a preferred embodiment, the parallel computing device is arranged to set the position vector X;(t₀ + At) defining the position of a mass M_; at the second time instance to the same value as the position vector X;(t₀) defining the position of said mass M_; at the first time instance in case the total force vector F;(t₀) of a mass M_; is smaller than a predetermined force threshold.

In a preferred embodiment, the parallel computing device is arranged to compute the position vector X;(t₀ + At) defining the position of mass Mi at a second time instance t₀ + At by performing an integration of second order or higher, preferably a Verlet integration.

In a preferred embodiment, wherein the parallel computing device is arranged to include in the operation performed on each mass M_;: retrieving spring data for each of the springs Sy connecting the mass M; to its neighbouring masses M_j, wherein the spring data comprises spring properties, preferably the rest length y°y and the spring constant ky. In a preferred embodiment, retrieving spring data comprises determining the spring properties on the basis of material data defining material properties for the corresponding mass M_;.

In a preferred embodiment, the parallel computing device comprises a first GPU.

In a preferred embodiment, the simulator further comprises a display arranged to display the physical model of the object the basis of the position vectors.

For example, the display comprises a computer screen, TV screen or 3D glasses.

In a preferred embodiment, the simulator further comprises a second GPU wherein the first GPU is arranged to perform the step of generating and updating the physical model and the second GPU is arranged to generate the graphics for displaying the physical model.

In a preferred embodiment, the first GPU is arranged to send the second GPU display data for initialising displaying the physical model and to subsequently send only the changes in the display data.

In a preferred embodiment, the first GPU is arranged to determine the masses M; which have less than a predetermined number Z of neighouring masses M_j and the second GPU is arranged to display the physical model on the basis of the position vectors X; of said determined masses. In a preferred embodiment, the first GPU and/or second GPU is arranged to apply a smoothing operation on the position vectors x_;.

In a preferred embodiment, the parallel computing device is arranged to generate the haptic feedback signal at a frequency of at least 100 Hz, preferably at least 250 Hz, more preferably at least 500 Hz and most preferably at least 1000 Hz.

In a preferred embodiment, the simulator comprises a double buffer for input from the haptic device to the parallel computing device and output from the parallel computing device to the haptic device.

In a preferred embodiment, the simulator further comprises a communication unit arranged to communicate data of the physical model to an e-learning unit.

In a preferred embodiment, the simulator comprises a recording unit arranged to store a snapshot of the physical model at a selected time instance and/or to store an animation of the dynamics of the physical model over a selected time frame.

The method and apparatus according to the invention may be adapted for visual output only, i.e. a method for performing the calculations for updating the physical model without generating a haptic feedback signal. In other words, the invention also relates to a method for updating a physical model of an object using a virtual surgery simulator, wherein the physical model is a mass-spring model comprising N position vectors X;(t₀), each defining the position of a mass M; at a first time instance t₀, wherein an iteration over all masses M; is performed comprising the following operations for each mass M_;:

- retrieve the index j of each of K; masses M_j neighbouring mass M_;; - for each neighbouring mass M .

- summing the spring force vectors Fy of all neighbouring masses M_j of mass M_; to obtain a total force vector F;(t₀) defining the total force exerted on mass M_; at first time instance t₀;

Preferably, the iteration over all masses M; is performed in parallel using a parallel computing device.

Furthermore, the invention relates to a virtual surgery simulator, comprising a computing device arranged to generate a physical model of an object wherein the computing device is arranged to update a physical model, the physical model being a mass-spring model comprising N position vectors X;(t₀), each defining the position of a mass M_; at a first time instance t₀, wherein the computing device is arranged to perform the following operations for each mass M_;:

- retrieve the index j of each of K_; masses M_j neighbouring mass M_;;

- for each neighbouring mass M_j:

- calculating the length yy of a spring Sy connecting mass M_; and the neighbouring mass M_j on the basis of the position vectors X;(t₀) and Xj(t₀);

Preferably, the computing device is a parallel computing device arranged to perform the iteration over all masses M_; in parallel.

Further advantages, features and details of the invention are elucidated on the basis of exemplary embodiments thereof, wherein reference is made to the accompanying drawings.

- Figure 1 shows schematically a first exemplary embodiment of a virtual surgery simulator according to the invention; - Figure 2 shows the virtual surgery simulator according to a second embodiment of the invention;

- Figure 3 shows the graphics displayed in the virtual surgery simulator according to the second embodiment;

- Figure 4 shows a 1 -dimensional mass-spring system;

- Figure 5 shows three-dimensional grid layouts connecting a central mass to up to 26 neighbouring masses;

- Figure 6 shows the pseudo-code of a serial implementation of a mass-spring model

- Figure 7 shows the pseudo-code of a parallel implementation of a mass-spring model according to an embodiment of the invention;

- Figure 8 shows a schematic drawing of the parallel implementation of figure 7;

- Figure 9 shows a schematic drawing an embodiment of the virtual surgery

simulation according to the invention illustrating a double buffering strategy;

- Figure 10 shows as schematic drawing of the surface generation for display of the physical model;

- Figure 11 shows a graph of the output rate of the method versus the number of springs used in the model;

- Figure 12 shows a graph of computational time versus the number of springs used in the model; and

- Figure 13 shows a graph of the GPU kernel time versus the number of threads. A virtual surgery simulator 2 (figure 1) comprises a display 4, a GPU 6, a CPU 8 and a haptic device 10. In the embodiment shown display 4 is connected to GPU 6. However, display 4 may alternatively or additionally be connected to CPU 8. Haptic device 10 is connected to CPU 8.

More than one haptic device may be provided, as indicated in figure 2. A first haptic device 10a is provided to provide input with a left hand and to provide feedback to the left hand. A second haptic device 10b is controllable by the right hand and provides feedback to the right hand. The haptic devices 10a, 10b are each provided with a pen-like tool 12a, 12b which user 14 can interact with. Tools of different size and shapes may be provided. In the embodiment of figure 2, the virtual surgery simulator comprises two displays. One display 4a displays the surgery simulation, the other display 4b displays the corresponding e-learning environment. Furthermore, 3D glasses or a 3D screen may alternatively or additionally be provided to obtain a 3-dimensional view of the virtual objects.

The virtual object is a patient 18 (figure 3). User 14 can perform virtual surgery on patient 18 using the input devices 10a, 10b. Haptic device 10a is associated with virtual tweezers 18. When user 14 moves the pen 12a, the movement is registered by the haptic device 10a. The movement will be translated into a movement of virtual tweezers 18. The physical model generates a feedback signal for haptic device 10a corresponding to the calculated force exerted on the virtual tweezers 18. Similarly, the virtual lancet 20 is associated with haptic device 10b. Movement of the pen 12b by user 14 is translated into movement of the virtual lancet 20 in the simulation environment. The computing device comprising GPU 6 and CPU 8 calculate the force the virtual objects in the simulation exert on virtual lancet 20 and provide a corresponding feedback signal to haptic device 10b. Other virtual tools may be selected by user 14, e.g. a stapler or a hand.

To illustrate how a mass-spring model works, reference is made to figure 4, showing a 1-D mass-spring system. This particular case comprises masses M₀, M_l5 M₂, M₃, M₄ and springs S₀i, S₁₂, S₂3, S₃₄. In the mass-spring model, forces that are exerted on the connected masses are summed according to Newton's second law of motion (as in equation 1). The total spring force vector F; exerted on mass M; is determined by summing all forces exerted on mass M; by springs:

Fi =∑jes_i F_tJ (5),

wherein S; is the set of springs connected to mass M_;. The vector F is the force exerted by the spring connecting the two masses M; and M_j that is obtained from e.g. Hooke's law (as in equation

4). A different relation for the force F (x) may also be used, e.g. F = -k x or F = -k(l-e^bx).

In order for the mass-spring system to exhibit movement, it is necessary to translate these forces into positions. Considering equation 1 , a naive approach to derive a position is to take the acceleration a_; and assume a non-existing velocity:

Χ;(ΐ+Δΐ) = X;(t) + 0,5 * ¾ * (At)² (6).

The acceleration vector a_; is determined from the total force F; on mass M; by using equation 1.

Such a method is however incomplete as it does not take any velocity in the system into account.

To obtain a set of useful results, a numerical integration method is performed to approximate the velocities and positions in the system based on current and previous data.

Obtained accuracy directly corresponds to the order of the numerical integration method, where lower-order methods yield more error but also decrease complexity. Examples of such method are

Euler (first-order), Verlet (second-order) and Runge-Kutta (e.g., fourth-order). For example, a first order Euler integration method would be defined as follows:

Xi(t₀ + At) = X;(to) + V;(t₀)At (7)

V_i(to + At) = _Vi(to) + a_;(t₀)At (8),

wherein

X;(t₀) is the position vector at time instance t₀

X;(t₀ + At) is the position vector at time instance t₀ + At

V;(t₀) is the velocity vector at time instance t₀

V;(t₀ + At) is the velocity vector at time instance t₀ + At

a_;(t₀) is the acceleration vector at time instance t₀ and

a;(t₀) = Fi(t₀) / n !

When looking at the Euler method, it can be seen that the velocity term always lags one step behind. When dealing with mass-spring systems, this can potentially lead to springs overshooting their positions resulting in unwanted oscillation of the system. For a better balance between error and complexity, Verlet integration is preferred as described above (equations 2 and 3).

Additionally, other factors such as universal gravity can be added by further extending the system with additional forces:

Π1;¾ = F; + m;*g ( 9),

wherein g is the universal gravity vector. For simplicity, the masses can be omitted by assuming that mi = 1.

The time step At is preferably kept constant. It may prove to be difficult to control the time difference between iterations, and given that frame rate in software is not very easy to control, it is possible to pass in a constant time step. Then, regardless of the frame rate at which the software runs, the calculation of the model advances by a constant time step every frame. Theoretically this has the disadvantage that the model does not advance at constant speed in relation to the software, e.g. it will seem to advance slower when the frame rate drops. In practice however using a time step that approximates the software's (ideal) update rate provides sufficient model stability, assuming a single integration iteration-step per frame. Note that the ideal update rate and, thus the ideal time step, has to be "small enough" to ensure a sable model. A rate equal to the haptic update rate of typically 1000 Hz with a single iteration-step per frame is sufficient.

The mass-spring algorithm according to the invention calculates all variables within the model by using the physics equations described above. Each time the algorithm is invoked on the model, the result is a new physical state of the model containing the masses and springs with new positions and other variables adjusted according to all applied forces (e.g., by springs or by interaction with a virtual tool) within the predefined time step At.

The basic layout of the mass-spring model in the three-dimensional case is represented by a uniform grid comprising masses connected by springs as shown in figure 5. The uniform grid has the advantage of simplicity and structural integrity of the model. However, other grids are possible. Every mass within the grid is connected to all of its neighbouring masses, all of which are within a 3 x 3 x 3 grid surrounding the mass. The rest length y°y of each spring Sy is defined as the initial Euclidian distance between a mass M; and any of its neighbouring masses M_j forming the spring. Because of the discrete grid layout, this distance is either 1 (figure 5a), 2 (figure 5b) or 3 (figure 5c). For every mass, we can define up to 3 * 3 * 3 -1 = 26 neighbouring masses (and thus springs). However, a mass may have fewer neighbours. For example, masses at the outer surface of the grid will have fewer neighbours.

The physically approximate nature of the mass-spring model introduces practical limitations in the behaviour of the model. Partly due to the uniform grid structure of the mode, the most significant limitation is spring inversion: a situation where a mass, in (normal) equilibrium, is displace beyond a certain point until a new equilibrium is formed where one or more springs now have a direction that is inverse to the earlier equilibrium. This problem manifests itself as irreversible "collapse" or "tangling" of masses, especially at the surface of the volumetric model and becomes very apparent with frequent mass displacement (e.g. by means of collision tools).

To alleviate the issue of spring inversion, different solutions or combinations thereof may be used according to the invention:

Shape-preserving springs

A shape -preserving spring is introduces that reduces the collapsing by restoring the model to the original shape. It is a zero-length spring, located at the initial position of the corresponding mass and prevents excessive displacement by exerting a force that pushes the mass back to its initial position. This is an elegant solution, but less effective for models that are to be pushed around and manipulated, as is often the case in virtual surgery.

Torsion springs

By rewriting Hooke's law in angular form, it is possible to simulate a spring that exerts a force proportional to an angle instead of a length. The result is a torsion spring (e.g. as used in mousetraps), and it can be added in between every connected mass in the model. Shearing of the model can now be resisted by the torsion springs and, as a result, spring inversion is further reduced.

Increased surface stiffness

As spring inversion occurs mostly at the surface of the model due to the interaction with virtual tools, the stiffness of the springs that form (or are connected to) the surface of the model is doubled. As stiffer springs are harder to invert, this typically reduces the occurrence of the problem without modifying the majority of the model.

Increased model resolution

While increasing the resolution of the model itself does not solve the issue of spring inversion, it does make its effect significantly less apparent. As more masses are available and their springs are inherently smaller and stiffer, it becomes more difficult to push the well-connected masses beyond their equilibrium.

As the design according to the invention allows for ever-increasing model resolution, the two last options are the most straightforward and non-invasive solutions. Therefore, preferably at least one of these solutions is used. As mentioned above, the mass-spring model must be capable of being manipulated by external factors such as a virtual surgical tool. The position and orientation of this virtual tool is typically controlled by corresponding input from a haptic feedback device such as 10a and 10b, and collision handling is used to exert forces on any masses where the tool touches the model. As the collision handling is very straightforward, we only consider the simple case of a probe tool (e.g., a hand) that is capable of pushing masses away. A capsule (or sphere) with predefined geometry is placed at the virtual position of the haptic device, and any masses within the radius of the shape are placed on the closest surface point of the geometry, in a constraint-based manner. Equally important, all masses that are displaced as a result of this collision interaction are marked and their total spring force is summed and provided as output to the haptic feedback device, completing the haptic feedback cycle.

The most straightforward implementation of the mass-spring algorithm simply iterates over every spring S connecting masses M; and M_j within the model and applies Equations (4) and (5), as can be seen in the pseudo code of figure 6.

For each mass M_;, forces exerted by the connected springs are first accumulated in∑F_;.

Given the total force Fi and using Equations (2), (3) and (9), Verlet integration is used to determine the new position x;(t+At) of mass M_;.

A first step for parallel computation of algorithm 1 (figure 6) would be to process the existing for-loops in parallel, instead of performing sequential iterations. However, when each of the iterations is processed by a different thread in parallel, the accumulation of variables∑F; and ∑Fj will lead to race conditions due to different threads trying to read and write the variables at the same time. This poses a serious threat to the stability of the model. The problem could be mitigated by using synchronization and, thus, making the accumulation atomic so that only one thread at a time can modify its value. This would however degrade performance due to excessive locking.

Therefore, the invention proposes an algorithm which does not require synchronization, or at least requires a small number of synchronizations. In fact, in some embodiments no

synchronization at all is required. Reference is made to figure 7, showing pseudo-code of a possible implementation of the method according to the invention on the basis of Nvidia' s CUDA programming environment. It is stressed that other GPGPU programming environments may be used as well.

In the code of figure 7, there is an iteration over every mass M; instead of every spring S as in the previous algorithm. The combination of a particular mass M; and any of its neighbouring masses M_j represents the spring S . Since every possible combination is iterated, all springs in the model are guaranteed to be iterated. The for-loop is parallelized, and race conditions are eliminated because the only data that is updated (χ_;(ι+Δι)) is unique to every iteration and not accessed by any other thread within the same time step. Every iteration runs in a separate thread and requires multiple memory access transactions with different memory access characteristics. At line 2, variable X;(t) is retrieved by thread i. As the index i is always known in advance, the memory transactions for variables x;(t), p;, X;(t-At) and X;(t+At) at respectively, lines 2, 3, 17, and 22 are thread-unique and can be predicted and optimized for data coalescing by CUD A. By using thread synchronization at lines 4 and 18, these memory transactions— performed by threads in parallel— are coalesced into single transactions by CUDA, saving considerable memory bandwidth. At line 6 function N;(k) retrieves the index j for the neighbouring mass Mj, represented by a particular 0 < k <26 for M_;; N; thus represents a data structure containing at most 26 indices for every mass M_;. At line 19, the algorithm performs the necessary collision handling for haptic interaction, as further explained below.

Data in memory that is accessed by predictable transactions should be structured in such a way that allows for coalescence optimization. As a general rule for CUDA, data elements should preferably be aligned to 32-bit, 64-bit, or 128-bit words so that multiple parallel transactions at sequential addresses are coalesced into single transactions. The data structures used in the CUDA algorithm are organized as simple one -dimensional arrays of aligned data. A position vector array contains the new and old positions of x_; as accessed by the algorithm at lines 2, 9, 17, and 22. A spring data array contains the indices for N;(k) at line 6 as well as unique spring properties such as y°i_j at line 10. A mass data array contains the mass properties p_j at line 8.

Since the number of springs (at most N * 26) quickly increases with more complex models, additional optimizations are added to decrease memory resources and bandwidth. Since spring properties, such as stiffness k , are often similar in local regions of springs, we reduce memory resources by classifying every spring S by a predefined mass material Mat(l). The mass properties array p_j stores an index 1 corresponding to a material type Mat(l). Spring properties, such as kj_j, are therefore defined by Mat(l), and, thus, implicitly shared with all springs connected to mass M; as well as all other springs sharing the same material. Mat(l) contains the predefined spring properties defined for every material Mat(l) where 1 = [0;L-1], while the material index Pi.material for every mass M; is encoded in the mass property array. Since a model usually only contains a few materials (e.g., L = 16, with the encoded index pi:material only requiring 4 bits), considerable memory resources are saved: now only L (e.g. 16), instead of at most N * 26 spring properties are required, excluding the overhead of the encoded index pi:material.

The algorithm is schematically drawn in figure 8. The input for a thread i is the position vector X;(t₀) and the position vectors Xj(t₀) for all neighbours M_j. As described above the indices of the neighbours of mass M; may be obtained via the function N;(k), with 0 < k < 26. The threads further have access to the mass property array p_;. Array p_; comprises an index 1 for looking up the spring properties in material table Mat(l). In other words, thread i reads the index value 1 in p_; and subsequently reads Mat(l) to obtain k; and y⁰ . The thread i calculates the total force vector F; and writes it into the total force vector array. The thread further calculates the next position of mass M_;, i.e. X;(t₀ + At) and writes it in the corresponding array. Note that also Χ;(ι₀-Δι) will be input to thread i if a Verlet integration is used for obtaining X;(t₀ + At). After the threads have completed, the force vectors F; corresponding to the masses marked as being in contact with the virtual tool are summed to obtain a feedback force.

In an alternative embodiment, each thread i stores the total force vector F; initially in local memory only for computation of Χ;(ι₀+Δι), wherein the thread i stores the total force vector in an array in global memory only if mass M; is in contact with the virtual tool. After completion of the threads all total force vectors stored in the global array are summed. To avoid synchronization issues due to simultaneous writing to the global array, a global integer index is maintained. Each thread i reads the index and writes its total force vector F; to the corresponding position of the total force vector array. Subsequently the thread i increments the global integer index, using an atomic increment instruction.

Total computation time can be even further reduced by skipping the computation of masses with negligible movement. If the total force IF; I of a mass M; does not exceed a predefined threshold ε, its force calculations are skipped and its position X; is left unchanged, i.e. X;(t₀+At) = X;(t₀). Thereby, it is prevented that time and resources are being spent on masses with negligible impact on the model.

Given the extended CUDA implementation of the algorithm as described with respect to figure 7 and 8, we now have the means to accurately calculate the physical state of the model at given time t. At this point, our algorithm is implemented as an integral part of the VICTAR surgery simulation framework. Figure 9 shows the multi-threaded system diagram of VICTAR containing all of the relevant subsystems in their own threads and contexts. To properly integrate the CUDA mass-spring model algorithm, it is necessary to employ a thread-safe, but fast data synchronization strategy to ensure that the model data is kept up-to-date and synchronized in all subsystems. Figure 9 provides a diagram of the complete data flow for the system and also describes whether the data is residing on the CPU or GPU. User input is provided by a haptic device 10a, 10b representing a virtual tool for manipulating and deforming the mass-spring physics model. Any subsequent force response from the physics model acting on the virtual tool is then delivered back to the device as haptic feedback output, providing the user with full two-way haptic interaction with the model. The visual representation of the model is generated by the system' s graphics subsystem, implemented using Microsoft DirectX 10 and the shader capabilities of Shader Model 4.0. A specialized volumetric multi-pass shader-rendering technique is used to render the visual representation of the mass-spring model. While details of the renderer fall beyond the scope of this paper, the synchronization between the graphics subsystem and CUDA is especially important, as will be explained later in this section. Needless to say, while we have chosen to use DirectX, alternative graphics API' s such as OpenGL are also an option.

The framework implements two-way haptic interaction by means of a haptic device and the SensAble OpenHaptics programming interface. The device has tactile sensors in six degrees of freedom to register the current stylus orientation and velocity while built-in actuators provide the ability to apply force feedback to the device in three degrees of freedom. It is controlled through a real-time priority haptic device I/O thread scheduled to run at a constant rate of 1000 Hz in which the sensors are read out and an actuator force is applied. The minimum I O streaming rate of 1000 Hz is necessary, as lower rates will result in uncontrollable oscillation and irregularities in the actuators that may cause a negative user experience.

In Figure 9, the vector representing the position of the virtual tool is acquired from the device sensors in the haptic device I/O thread and is immediately transferred to the CUDA host thread, where it is subsequently copied into GPU RAM where it is eventually read by the CUDA physics kernel. At the same time, another vector containing the force on the virtual tool is calculated by the kernel and copied back to CPU RAM, where it is immediately transferred to the haptic device I/O thread and written to the device actuators. To prevent read-while -write situations, where either of the threads is reading a value that is currently being written to by the other thread, locking can normally be applied. However, in this case, locking would introduce unacceptable execution stalling in either of the real-time priority threads. Instead, we employ a variant of the double buffering strategy usually found in the field of computer graphics. Our case comprises a single -reader single -writer situation where buffer A is used for reading while B is used for writing. At any point in time, when there are no threads reading or writing at all, buffers A and B are flipped so that the reader is able to read the updated data in B while the writer can overwrite outdated data in A. This strategy requires very little synchronization and lowers execution stalling in the CUDA thread hosting the mass-spring algorithm. Both vectors can now be quickly synchronized in a thread-safe manner. The CUDA physics kernel uses the virtual tool position vector to perform collision detection and deformation of the model as it interacts with the model, generating a force by summing the spring forces F acting on the masses that are currently within the bounding geometry of the virtual tool, as explained above. The user is therefore able to interact with the mass-spring model by feeling the resistance of the model acting on the virtual tool as it touches the surface.

Visual output is accomplished by close cooperation of the CUDA host thread and the DirectX graphics renderer. The renderer takes a position vector buffer (also known as vertex data buffer) as input and presents this buffer to the GPU through DirectX, where it is rendered onto the monitor. Without going into too much detail, specialized HLSL shaders for geometry processing and filtering are used to make sense of the volumetric vertex data (mass positions) in combination with the available data of neighbouring masses to turn the model into a visual representation that can be rendered on the screen. The renderer typically runs in sync with the monitor's vertical refresh rate, which we assume to be at a constant 60 Hz.

The shaders accept a vertex data structure (and an accompanying neighbour data structure) that is identical to the structure used by the CUDA physics kernel. In other words, the position vectors x_; of the masses can be used directly as vertex data structure. The vertex data structure is simply an aligned one-dimensional array of floating-point vectors of size N. As both CUDA and DirectX will only access data that is residing on the GPU, the vertex data never has to leave the GPU by using the CUDA graphics interoperability API, saving unnecessary memory copies between GPU and CPU.

Up to this point, it can be assumed that the CUDA physics kernel and graphics renderer are executed on the same GPU device. In this single device configuration, it is important to notice that the renderer and physics kernel typically run at vastly different rates and in different threads. The CUDA DirectX interoperability API ensures that all buffers that are to be mapped in CUDA are thread-safe, and thus free from data corruption. This is done by stalling execution while waiting for buffers that are already in use by DirectX and, subsequently, holding a lock while the buffers are mapped and used by any CUDA kernels. As a consequence, the vertex data structure is locked by both the DirectX thread (during rendering) and CUDA thead (during physics calculation) and, given the different rates of these threads, frequent stalling may occur. As frequent stalling will incur a serious performance penalty, we apply the double buffering strategy as described above on the vertex data structure.

In a more ideal situation, the physics kernel is allowed to run exclusively on a dedicated GPU device, unhampered by any other (graphical) GPU activity and any resulting locking. This multi-device configuration consists of at least two GPU devices, one of which functions as the primary DirectX display device, while the other is used exclusively to run the CUDA physics kernel. Needless to say, it is still necessary to provide some sort of synchronization between the GPU device running the CUDA physics kernel and the primary GPU graphics device; this presents a challenge similar to that in the single device configuration. Again, the double buffering strategy can be used to avoid unnecessary locks and stalls.

Figures 10A-D illustrate rendering the physical model on the basis of the calculated position vectors. Figure 10A shows the 3x3x3 grid for each mass M_;, i.e. each mass M; has K=26 neighbours in this example. The mass M; for which the calculations are to be performed is positioned at the centre of the grid. Figure 10A further defines a coordinate system with axes x, y and z. Further, three primary planes xy, yz and xz can be defined, as shown in figure 10B. These planes are used for surface integration. The number of quadrilaterals (tetragons) per primary plane depends on the number of neighbours defined for mass Mi. The maximum of number of quadrilaterals in this example is four, as can be seen in figure IOC. All quadrilaterals include four masses as vertices, of which one is formed by central mass M_;. Each quadrilateral (also known as "face") can be subdivided into two triangles, of which an example is shown in figure 10D. A first triangle comprises central mass M; and neighbouring masses M_k and Mi, a second triangle comprises central mass M_; and neighbouring masses Mi and M_j. These triangles are used for rendering the virtual object.

In this example, a maximum of 12 faces is defined (figure IOC). For each of these faces, it is determined whether all three neighbouring masses are defined. If three neighbouring masses are indeed defined, then the corresponding quadrilateral is provided as input to the rendering GPU. Optionally, the quadrilateral is broken down into two triangles which are provided as input to the rendering GPU.

Preferably, the above procedure is only applied to masses M_; which have less than 26 neighbouring masses M_j. This ensures that only masses at or near the surface of the model are rendered.

For rendering graphics, a normal vector may be calculated. The normal vector is the vector perpendicular to the surface. This vector may be used for ray trace calculations and model smoothing. For a quadrilateral, the vector perpendicular to the two triangular surfaces is determined and summed in a vector V_normai, which is subsequently normalized. Optionally, the direction of V_normai may be inverted on the basis of heuristics of which neighbours are present.

The model as defined above is discrete by definition. To avoid a pixelated appearance, a smoothing operation may be applied to the model. The smoothing operation displaces the vertices of the polygons for rendering the model such that sharp edges are smoothed. In one embodiment, the algorithm determines whether each of the six primary neighbouring masses is defined. The primary neighbouring masses are the neighbouring masses on grid positions (x, y, z) = (-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1). The position of each existing primary neighbouring mass is summed in a vector V_sum. The position M_; is adjusted to V_sum divded by the number of existing primary neighbouring masses. This process of averaging the position may be iterated a number of times. However, the smoothing operation may reduce the volume of the virtual object, especially when multiple iterations are applied. To compensate for this effect, the position of the central mass M_; may be corrected by adding to the position vector X; of mass M_; the corresponding normal vector V_normai times a factor I.

Performance analysis

We have implemented the parallel mass-spring model as described above as part of the VICTAR surgery simulator framework. More specifically, the physics engine of VICTAR is now composed of our mass-spring model implementation with the required buffering and is able to interface with one or more haptic devices and an existing graphics renderer. In order to test our implementation within the framework, we use a minimal general-purpose scripted scenario that generates a simple cubical mass-spring model at increasing resolutions and interfaces the model with a single haptic device. A single virtual probe tool is added to verify the model's haptic response. The increasing resolution of the model will provide a set of measurements at increasing algorithmic workload and will allow us to observe and quantify effects such as CPU overhead. Our proposed system design allows for two configurations, namely single and multi-device, with two buffering strategies— single and double buffering. Note that single buffering merely consists of a single buffer that is locked during either a read or a write. To investigate the advantage of double buffering, all combinations of configuration and buffering are implemented and tested with the above models.

Note that for the implementation of the multi-device synchronization, it is necessary to copy buffers (peer-to-peer) between two different GPU devices. Unfortunately, currently CUDA does not allow peer-to-peer copies for GPU devices involved with DirectX, so it is necessary to perform two additional copies to and from CPU host memory. Since copies between CPU and GPU have limited memory bandwidth, this introduces an additional memory bottleneck. It is therefore not currently possible to implement and test the single buffer locking strategy in the multi-device configuration within the CUDA environment.

The configuration of our hardware is a workstation with an Intel Core i7-860 CPU, 6 GB of RAM with Windows 7 x64 and two NVIDIA Tesla C2050 GPU devices with Compute capability 2.0. Each of our Tesla C2050 GPU devices contains 15 streaming multiprocessors (SMs) with a maximum of 1536 threads each at 100% kernel occupancy. For the single device configuration, the non-primary graphics card is left unused. Haptic interaction is provided by SensAble PHANTOM Omni device (see Figure 2), supported by the SensAble OpenHaptics programming interface and operating at 1000 Hz. The CUDA Visual profiler is used to perform profiled runs of our implementation, providing GPU kernel timing data, register usage, and occupancy results. Our implementation uses CUDA 4.2 and is profiled by running the complete framework (with above script and scenarios) inside the profiler, allowing it to capture measured data of each iteration of the physics engine. Practical timing measurements are provided through the framework, as explained below. Note that the CUDA Visual profiler can be set to capture a set of specific measurements and requires more separate runs of the application when more detailed infor- mation (such as branch divergence, coalesced global memory reads and writes, etc.) is required. As our implementation is only available as part of a bigger application, and not as a stand-alone executable, it is very difficult to perform multiple profiler runs: the application behaviour may differ between runs (e.g., due to varying loading times) causing issues whenever the profiler can no longer compare the results. As we are solely interested in production performance of this initial implementation, we have chosen specifically not to create a stand-alone version, and instead performed our tests with a single profiler run, capturing only the necessary performance measurements (kernel time, kernel occupancy, and register usage) thus, leaving in- depth CUDA optimization as a later exercise. These measurements represent the actual on-chip GPU performance of the implementation.

Additionally, we have added high-precision timing functionality (typically in nanoseconds) to the VICTAR framework by using the available timing APIs of the operating system

(QueryPerformanceCounter). This second class of time measurements represents the actual time that passes between calls inside the framework (e.g., one iteration of the physics engine using our implementation) including all overhead such as CUDA API calls and other blocking calls and represents the real practical performance of the physics engine or of our implementation as it is integrated in the framework. These measurements give insight in the amount of (CPU) overhead, as measured on the CPU or host, that is used in the framework and outside the GPU kernel.

With regard to the model, we have chosen not to implement the force threshold optimization described above, in order to allow for full consistent testing where no mass is assumed to have negligible movement; thus, all possible computations within the model are implicitly performed. Furthermore, we found that the two preferred solutions for reducing the spring inversion problem (increased surface stiffness and increased model resolution) reduced said problem to such an extent, that it was no longer a significant issue in our tests when performing typical surgery model interaction.

Figure 11 shows graphs of the total performance rate of the implementations for both single- and multi-device configurations and single and double buffering strategies, where the total framework performance consists of GPU kernel performance, as measured on the GPU device, and any CPU overhead, as measured on the CPU or host.

The most important variable to consider when looking at the performance is the total number of springs, as it dictates the amount of spring computations and is therefore closely related to the overall performance of the CUDA kernel, while the total number of masses provides better insight on the parallel scalability. It is evident that the multi-device configuration provides the fastest available implementation.

Allowing the CUDA physics kernel to run on its own dedicated GPU device, where it is unhampered by any other GPU activity, significantly increases the overall practical performance. This can be seen in Figure 11 , where the multi-device implementation performs four to eight times faster than the fastest single-device implementation.

Differences between single and double buffering on the multi-device configuration are virtually non-existent, due to the fact that peer-to-peer copies are currently not supported by CUDA. This makes it impossible to implement a proper single -buffer implementation with locking, as a second CPU host buffer is always present to act as a bridge between the two GPU devices and effectively cancels out any effects of single buffering. Both cases for multi-device are therefore combined into a single set of measurements.

The advantage of double buffering (over single buffering) on a single device is immediately clear when looking at the figures. The significantly lower performance of single buffering is found to be caused by internal command queues and resulting "greedy" locking in DirectX. In this case, performance is hampered by execution stalls triggered by the CUDA interoperability API waiting for the release of buffer locks by DirectX. In other words, DirectX keeps a lock on any buffer used in the graphics pipeline until all render commands involving that buffer (which have been internally queued up) have been processed. As a result, it is very difficult to control the locking, and locks will be held far longer than necessary, typically until a command buffer flush or back-buffer swap occurs. Double buffering proves to be an effective solution to this issue.

Figure 12 shows graphs of the detailed performance results of the multi-device configuration with double buffering, currently the fastest implementation of our algorithm. The total framework time, or practical performance of our implementation inside the software, is the sum of the GPU kernel time and any time caused by overhead on the CPU. This overhead primarily consists of delays induced by other operations that are executed in between subsequent launches of the CUDA kernel. This includes unavoidable operations such as CUDA kernel argument setup, thread synchronization, device management, and implementation-specific memory copies of collision and vertex data and provides only little room for further optimization. The CPU overhead is very visible in Figure 12 as a variable that is mostly constant regardless of the total number of springs (or model resolution). This implies that as the GPU kernel time is increased, e.g., when the model resolution is increased, the effect of the overhead becomes less and less apparent. Note that the CPU overhead may also vary for different GPU architectures.

Our performance results show that our algorithm and its implementations scale almost linearly with the total number of springs, which is an overall satisfactory result. This, however, does not tell much about the algorithm' s parallel scalability: the efficiency of the algorithm when using increasing numbers of parallel processing elements (threads).

To quantify the parallel scalability, we first have to look at the CUDA kernel profiling results in the graph of figure 13. Our particular Tesla C2050 GPU devices are capable of running a maximum of 1536_15 = 23040 parallel threads with all 15 SMs 100% occupied. Our profile results show an occupancy of 50%, mainly due to the high register usage of the kernel. The maximum amount of parallel threads at any time for our CUDA kernel is thus 23040_0:50 = 11520. Since our kernel requires a thread for every mass, the relevant performance range to inspect is that between 0 and the parallel maximum of 11520, after which the device is saturated, as visualized in Figure 13. When looking at the typical model resolutions (e.g., 95487) in comparison to the parallel maximum of 11520 masses, one can observe that the device is quickly saturated, emphasizing the relevance of the overall performance over parallel scalability. An ideal parallel kernel will achieve linear scaling when performance stays constant while the workload is increased in direct proportion to the number of threads. Note that the kernel is not computation-bound, but memory- bound, due to its frequent accesses in global GPU memory. This creates a memory access bottleneck that causes a certain degree of serialization to occur during execution. Parallel scalability can therefore not be ideally linear, and, thus, the performance pattern cannot be fully constant.

Figure 13 shows a significant stepped pattern that alternates between near-constant sustained performance and sudden decrease and confirms that the kernel is indeed parallelized. The stepping can be explained by the internal mechanisms (e.g., warp scheduling, memory access handling) of the GPU device that are involved with scheduling the kernel' s threads on the available SMs and may continue in a similar pattern beyond the point of device saturation. As this typically varies with different GPU architectures, it provides an opportunity for improvement by taking advantage of more recent, more efficient CUDA hardware. Together with the suboptimal distribution of our CUDA kernel on the SMs, this leaves enough room for future kernel optimization in terms of parallel scalability.

The present invention is by no means limited to the above described preferred

embodiments thereof. The rights sought are defined by the following claims, within the scope of which many modifications can be envisaged.

Claims

1. Method for generating a real-time haptic feedback signal for a haptic device of a virtual surgery simulator, comprising:

- generating a physical model of an object using the virtual surgery simulator;

- updating the physical model of the object on the basis of an interaction between the object and a virtual tool associated with the haptic device;

- generating the haptic feedback signal on the basis of the calculated force,

wherein the force is calculated using a parallel computing device.

2. Method according to claim 1 , wherein the physical model is a mass-spring model.

3. Method according to claim 2, the mass-spring model comprising N position vectors X;(t₀), each defining the position of a mass M; at a first time instance t₀, wherein an iteration over all masses M; is performed comprising the following operations for each mass M_;:

- retrieve the index j of each of K; masses M_j neighbouring mass M_;;

- for each neighbouring mass M

4. Method according to claim 3, wherein the force the object exerts on the virtual tool is calculated on the basis of the total force vectors F;(t₀) of the masses in contact with the virtual tool.

5. Method according to claim 3 or 4, wherein the iteration over all masses M; is performed in parallel using the parallel computing device

6. Method according to claim 4 and 5, further comprising defining a feedback force array Ff_k and a feedback force array index k, wherein, in the iteration, for each mass M; in contact with the virtual tool the total force vector F;(t₀) of that mass is M; copied to the feedback force array Ff_k at index k and subsequently the feedback force array index k is incremented using an atomic increment, and the force the object exerts on the virtual tool is calculated by summing the feedback force array Ff_k.

7. Method according to any of the claims 3-6, wherein in case the total force vector F;(t₀) of a mass M; is smaller than a predetermined force threshold the position vector X;(t₀ + At) defining the position of a mass M_; at the second time instance is set to be equal to the position vector X;(t₀) defining the position of said mass M_; at the first time instance.

8. Method according to any of the claims 3-7, wherein computing the position vector X;(t₀ + At) defining the position of mass Mi at a second time instance t₀ + At comprises performing an integration of second order or higher, preferably a Verlet integration.

9. Method according to any of the claims 3-8, wherein the operation performed on each mass M; further comprises retrieving spring data for each of the springs S connecting the mass M; to its neighbouring masses M_j, wherein the spring data comprises spring properties, preferably the rest length y⁰ and the spring constant k .

10. Method according to claim 9, wherein retrieving spring data comprises determining the spring properties on the basis of material data defining material properties for the corresponding mass M;.

11. Method according to any preceding claim, wherein the parallel computing device comprises a GPU.

12. Method according to any of the claims 3-11, further comprising displaying the physical model of the object on a display on the basis of the position vectors.

13. Method according to claims 11 and 12, wherein the step of generating and updating the physical model is performed using a first GPU and the step of displaying the physical model is performed using a second GPU.

14. Method according to claims 13, comprising:

- sending from the first GPU to the second GPU display data for initialising displaying the physical model; andsubsequently sending from the first GPU to the second GPU only the changes in the display data.

15. Method according to claim 12, 13 or 14, displaying the physical model of the object further comprising:

- determining the masses M_; which have less than a predetermined number Z of

neighbouring masses M_j; and

- displaying the physical model on the basis of the position vectors x_; of said determined masses.

16. Method according to any of the claims 12-15, displaying the physical model of the object further comprising applying a smoothing operation on the position vectors x_;.

17. Method according to any preceding claim, wherein the haptic feedback signal is generated at a frequency of at least 100 Hz, preferably at least 250 Hz, more preferably at least 500 Hz and most preferably at least 1000 Hz.

18. Method according to any preceding claim, wherein a double buffer is used for input from the haptic device to the parallel computing device and output from the parallel computing device to the haptic device.

19. Method according to any preceding claim, further comprising the step of communicating data of the physical model with an e-learning environment.

20. Method according to any preceding claim, further comprising storing a snapshot of the physical model at a selected time instance and/or an animation of the dynamics of the physical model over a selected time frame.

21. Computer program which, when executed on a computer, executes the steps of the method according to any of the preceding claims.

22. A computer-readable data storage medium having embodied thereon the computer program according to claim 21.

23. Virtual surgery simulator, comprising

24. Virtual surgery simulator according to claim 23, wherein the physical model is a mass- spring model.

25. Virtual surgery simulator according to claim 24, the mass-spring model comprising N position vectors X;(t₀), each defining the position of a mass M; at a first time instance t₀, wherein the parallel computing device is arranged to perform the following operations for each mass M_;:

- retrieve the index j of each of K_; masses M_j neighbouring mass M_;;

- for each neighbouring mass M

26.. Virtual surgery simulator according to claim 25, wherein the parallel computing device is arranged to calculate the force the object exerts on the virtual tool on the basis of the total force vectors F;(t₀) of the masses in contact with the virtual tool.

27. Virtual surgery simulator according to claim 25 or 26, wherein the parallel computing device is arranged to perform the iteration over all masses M_; in parallel.

28. Virtual surgery simulator according to claim 26 and 27, wherein the parallel computing device is arranged to define a feedback force array Ff_k and a feedback force array index k, and to copy, in the iteration over all masses M_;, for each mass M; in contact with the virtual tool the total force vector F;(t₀) of that mass M; to the feedback force array Ff_k at index k and to subsequently increment the feedback force array index k using an atomic increment, and to calculate the force the object exerts on the virtual tool by summing the feedback force array Ff_k.

29. Virtual surgery simulator according to any of the claims 25-28, wherein the parallel computing device is arranged to set the position vector X;(t₀ + At) defining the position of a mass M; at the second time instance to the same value as the position vector X;(t₀) defining the position of said mass M; at the first time instance in case the total force vector F;(t₀) of a mass M; is smaller than a predetermined force threshold.

30. Virtual surgery simulator according to any of the claims 25-29, wherein the parallel computing device is arranged to compute the position vector X;(t₀ + At) defining the position of mass Mi at a second time instance t₀ + At by performing an integration of second order or higher, preferably a Verlet integration.

31 Virtual surgery simulator according to any of the claims 25-30, wherein the parallel computing device is arranged to include in the operation performed on each mass M_;: retrieving spring data for each of the springs S connecting the mass M; to its neighbouring masses M_j, wherein the spring data comprises spring properties, preferably the rest length y⁰ and the spring constant k .

32 Virtual surgery simulator according to claim 31 , wherein retrieving spring data comprises determining the spring properties on the basis of material data defining material properties for the corresponding mass M_;.

33. Virtual surgery simulator according to any of the claims 23-32, wherein the parallel computing device comprises a first GPU.

34. Virtual surgery simulator according to any of the claims 23-33, further comprising a display arranged to display the physical model of the object on the basis of the position vectors.

35. Virtual surgery simulator according to claim 33 and 34, further comprising a second GPU wherein the first GPU is arranged to perform the step of generating and updating the physical model and the second GPU is arranged to generate the graphics for displaying the physical model.

36. Virtual surgery simulator according to claim 35, wherein the first GPU is arranged to send the second GPU display data for initialising displaying the physical model and to subsequently send only the changes in the display data.

37. Virtual surgery simulator according to claim 34, 35 or 36, wherein the first GPU is arranged to determine the masses M_; which have less than a predetermined number Z of neighouring masses M_j and the second GPU is arranged to display the physical model on the basis of the position vectors X; of said determined masses.

38. Virtual surgery simulator according to any of the claims 34-37, wherein the first GPU and/or second GPU is arranged to apply a smoothing operation on the position vectors x_;.

39. Virtual surgery simulator according to any of the claims 23-38, wherein the parallel computing device is arranged to generate the haptic feedback signal at a frequency of at least 100 Hz, preferably at least 250 Hz, more preferably at least 500 Hz and most preferably at least 1000 Hz.

40. Virtual surgery simulator according to any of the claims 23-39, comprising a double buffer for input from the haptic device to the parallel computing device and output from the parallel computing device to the haptic device.

41. Virtual surgery simulator according to any of the claims 23-40, further comprising a communication unit arranged to communicate data of the physical model to an e-learning unit.

42. Virtual surgery simulator according to any of the claims 23-41, comprising a recording unit arranged to store a snapshot of the physical model at a selected time instance and/or to store an animation of the dynamics of the physical model over a selected time frame.