WO2014032011A2

WO2014032011A2 - Method and system for generating four dimensional mesh from images

Info

Publication number: WO2014032011A2
Application number: PCT/US2013/056477
Authority: WO
Inventors: Panagiotis FOTEINOS; Nikos CHRISOCHOIDES
Original assignee: Old Dominion University Reasearch Foundation
Priority date: 2012-08-23
Filing date: 2013-08-23
Publication date: 2014-02-27
Also published as: WO2014032011A3

Abstract

A system, recording medium, and methods therefor for generating mesh from an image. The method includes meshing a 4D medical image, and simulating a time-evolving medical condition with the 4D medical image. The method comprises recovering an underlying hyper- surface of a moving object; and filling the volume of the hyper-surface with sliver free pentatopes. The system implementing the method can be configured to process images for at least one of finite element analysis, simulation, interpolation, computer vision, or robotics

Description

METHOD AND SYSTEM FOR GENERATING FOUR DIMENSIONAL MESH FROM

IMAGES

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of L S. Provisional Application No. 61/692,51.4, filed August 23, 2012, which is hereby incorporated by reference in its entirety.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH

This invention was made with Government support under contract CCF-1 139864, CCF- 1 136538, and CSI-1 136536 awarded by the U.S. National Science Foundation. The Government has certain rights in the invention.

TECHNICAL, FIELD

The present invention relates to solid modeling through computer graphics.

BACKGROUND ART

There is a large body of work on constructing guaranteed quality meshes for Computer Aided Design (CAD) models. The specificity of CAD-onented approaches is that, the meshes have to match exactly to the boundaries of the models. The most widely used guaranteed-quality CAD-oriented approach is based on De!aunay refinement.

Delaunay refinement is a popular automatic mesh generation and refinement method which generates and refines meshes by present algorithms and at the same time ensures

termination and good grading. Mesh generation by Delaunay refinement is a widely used

technique for constructing guaranteed quality triangular and teirahedrai meshes. The quality guarantees are usually provided in terms of the bounds on circumradius-to-shortest edge ratio and on the grading of the resulting mesh, Traditionally circumcenters of skinny elements and middle points of boundary faces and edges are used for the positions of inserted points.

Delaunay refinement is very useful for parallel and/or time critical applications when human intervention and reruns are prohibitively time consuming.

Delaunay refinement algorithms are based on the method of inserting new points into the mesh to improve the aggrega te quality of elements (triangles in two dimensions or tetrahedra in three dimensions). Quality is traditionally defined as the ratio of the circumradius of the element to the length of its shortest edge. The us of this measure leads to the improvement of the minimum angle in two dimensions which helps to improve the conditioning of the stiffness matrix used by a field solver. In three dimensions this measure does not yield such direct benefits, however, it has been shown that bounded circumradius-io-shortest edge ratio of mesh elements is sufficient to obtain optimal convergence rates for the solution of Poisson equation using the control volume method. However, meshing methods prior to the present invention have been confined to three- dimensional (two dimensions plus time) meshing.

As described, for example, in related U.S. Application Publication Nos. 2012/0154397 and 2012/0200566, the entirety of each of which is incorporated by reference herein, one of the problems with Delaunay refinement in 3D is that it allows only for a bound on circumradius-to- shortest edge ratio of tetrahedra, which does not help to improve the dihedral angles. A bound represents a criterion for evaluating the quality of an element in the mesh. As a result, almost flat tetrahedra called slivers can survive. There are a number of post-processing techniques to eliminate slivers, as referenced in U.S. Application Publication No. 2012/0154397. While some of them have been shown to produce very good dihedral angles in practice, no implementation is available that can guarantee significant ( and above) dihedral angle bounds.

As further noted in U.S. Application Publication No. 2012/0154397, other work describes a guaranteed quality tetrahedra! meshing algorithm for genera] surfaces. Such work offers a one-sided fidelity guarantee (from the mesh to the model) in terms of the Hausdorff distance, and, provided the surface is sufficiently smooth, also the guarantee in the other direction (from the model to the mesh). The referenced algorithm first constructs an octree that covers the model, then fills the octree leaves with high quality template elements, and finally warps the mesh vertices onto the model surface, or inserts vertices on the surface, and locally modifies the mesh. Using interval arithmetic, others have shown that new elements have dihedral angles above a certain threshold. However, images are not smooth surfaces and this technique has not been extended to mesh images. One approach could be to interpolate or approximate the boundary pixels by a smooth surface, but it would be complicated by the need to control the maximum approximation (interpolation) error. On the other hand, as described in U.S. Application

Publication No. 2012/0154397, an Ϊ2Μ solution can benefit from the fact that images provide more information on their structure than general surfaces. Heuristic solutions to the 12M problem fall into two categories; (1 ) first coarsen the boundary of the image, and then apply CAD-based algorithms to construct the final mesh, (2) construct the mesh which covers the image, and then warp some of the mesh vertices onto the image surface. The first approach tries to address the fidelity before the quality requirements, whi le the second approach does it in reverse order. Neither of these approaches is able to guarantee die quality of elements in terms of dihedral angles. As a result, the output of one step does not produce an optimal input for the other step.

As further noted in related U.S. Application Publication No. 2012/0200566, one of the centra! questions in the Delaunay refinement method has been the choice of the positions for the new points. The traditional approach uses circumcenters of mesh elements, however, a number of other locations have been used to achieve various mesh optimizations.

SUMMARY OF INVENTION

According to an embodiment, the present disclosure is directed to a system and method therefor for generating mesh from an image, According to another embodiment, the present disclosure is directed to a recording medium storing a program that instructs a computer to generate a mesh from an image according to the above claimed methods. A system, recording medium and methods therefor for generating four dimensional mesh from an image are disclosed. The method includes meshing a 4~D medical image, and simulating a time-evolving medical condition with the 4-D medical image. The method comprises recovering an underlying hyper-surface of a moving object; and filling the volume of the hyper-surface with sliver free pentatopes. The system implementing the method can be configured to process images for at least one of finite element analysis, simulation, interpolation, computer vision, or robotics. The methods, systems, media, and other inventions described herein may be adapted to the processing or meshing of four dimensional digital image /in the form of four dimensional medical images.

One aspect of the present approach include embodiments of a method for generating mesh from a four dimensional image, the method being implemented by a computer system operatively connected to at least one data storage device for storing image data for images, at least one computer, and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs the method. Such a method may include (a) receiving a four dimensional digital image /of a three dimensional object Ω over time t_h the three dimensional object Ω having a volume and a surface δΩ; (b) applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet feature point z = cfp_dn (x) is at least 2 _dn (z); (c) generating meshes of the four dimensional digital image / by insertion and deletion of points in a Delaunay fashion; (d) recovering an underlying hyper-surface of the object Ω; and (e) filling the volume of the hyper-surface with sliver free pentatopes. in some embodiments, the method may include generating meshes until a final mesh is a set of pentatopes having circumcenters inside Ω. in some methods, the generating of meshes is processed by providing a piuraiity of parallel processing threads T,, each of threads ,· having a poor element list of tetrahedra t requiring refining, as described below. The dynamic insertion and deletion of points in a Delaunay fashion may include refining the meshes by the application of a number of rules, Such rules may be, where δ is a predetermined sampling density and the meshes include a piuraiity of tetrahedrons t, the following steps of:

inserting point z if z is at a distance not closer than Δ_3ιβ (ζ) to any other feature vertex, when z is equal to cfp_¾j (Ο_σ4) and 'S_o4 intersects οΩ;

inserting C_o4 if radius R„ > 2 Δ_3Λ (ζ), when z is equal to cfp¾2 (Ο_σ ) and ^'Φ^ intersects ΒΩ; inserting Q4 if p_a4≥ p, when C-,4 lies inside Ω;

inserting a good point z inside Pil(o4) if σ4 contains a sliver when C₀ lies inside Ω;

inserting a good point z inside Pll(p3} if all vertices of σ^'3 are not feature vertices, when σ3(σ3 C er4) is a restricted facet, and

deleting ail tree vertices closer than ά _Ω ζ to z„

As discussed herein, the generating of meshes may be processed by providing a piuraiity of parallel processing threads T_h each of threads T_s having a poor element list of tetrahedra t requiring refining. Optionally, the method may include the step of finding and triangulating cavities C of insertions. Further, the method may include the step of finding and triangulating cavities C of insertions according to a Bowyer-Watson kernel In a multi-thread embodiment, the method may include balancing processing load among threads 1) by one or more of random work stealing, hierarchical work stealing and static box partitioning, or managing contention by pausing at least one of the piuraiity of processing threads T without liveloeks in the remaining threads, in some embodiments, contention may be locally managed by designating one of the plurality of processing threads J^'„ as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads 1} without Hvelocks.

Aspects of the present approach may include a non-transitory recording medium storing a program that instructs a computer to generate a. mesh from an image, the method comprising (a) receiving a four dimensional digital image I of a three dimensional object Ω over tim 4 the three dimensional object Ω having a volume and a surface dQ; (b) applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet feature point z = cfps_a(x) is at least 2Δ_3β(ζ); (c) generating meshes of the four dimensional digital image / by insertion and deletion of points in a Delaunay fashion; (d) recovering an underlying hyper-surface of the object Ω; and (e) filling the volume of the hyper-surface with sliver free pentatopes. The method may inc lude ail the variations or aspects discussed herein, such as the application of refining rules, or multi-threaded processing.

Similarly, the present approach may include a system for performing a process on an image, the process being implemented by a. computer system operatively connected to at least one data storage device in which is stored image data, and comprising at least one computer and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs a method of the present approach, in one embodiment, such a system may implement a method involving the steps of receiving a four dimensional digital image / of a three dimensional object Ω over time t_if the three dimensional object Ω having a volume and a surface 3Ω; applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet ieature point z— cfp_dn(x) is at least 2A_dn (z);

generating meshes of the four dimensional digital image I by insertion and deletion of points in a Delaunay fashion; recovering an underlying hyper-surface of the object Ω; and filling the volume of the hyper-surface with sliver free pentatopes. The invention herein should expressly include systems implementing the various methods disclosed herein. For example, without limitation, a system may include a methods that, includes generating meshes until a final mesh is a set of pentatopes having circumcenters inside Ω. The method may include refining the mesh, wherein δ is a predetermined sampling density, the meshes include a plurality of tetrahedrons t. and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes by: inserting point z if z is at a distance not closer than &_dii (z) to any other feature vertex, when z is equal to cfp_¾j (C_o4) and B_o4 intersects 8Ω:

inserting€_σ4 if radius R_!S > 2 A_da(z), when z is equal to cfp<¾2 (C_o4) and B_tj4 intersects 8Ω; inserting C_C4 if ρ_σ4.≥ p, when C_a4 lies inside Ω;

inserting a good point z inside PR(a4J if σ4 contains a sliver when C_y4 lies inside Ω inserting a good point z inside PR(a3) if all vertices of σ3 are not feature vertices, when

0^'3 (<x3 C σ4) is a restricted facet, and deleting all free vertices closer than Δ^ (ζ) to z. Of course, a system may implement such a method by providing a plurality of parallel processing threads Ί , each of threads 7* having a poor element iist of tetrahedra t requiring refining.

BRIEF DESCRIPTION OF DRAWINGS

To the accomplishment of the foregoing and related ends, certain illustrative

embodiments of the invention are described herein in connection with the following description and the annexed drawings. These embodiments are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include ail such aspects and their equivalents. Other advantages, embodiments and novel features of the invention may become apparent from the following description of the invention when considered in conjunction with the drawings. The following description, given by way of example, but not intended to limit the invention solely to the specific embodiments described, may best be understood in conjunction with the accompanying drawings, in which:

FIG, 1 illustrates functional modules of a computer according to an embodiment,

FIG. 2 shows a 2D illustration as described below.

FIG, 3 shows a 3D illustration of the proof of Lemma 3 in accordance with one embodiment of the present invention,

FIG, 4 is a flow diagram depicting the relationship among the rules in accordance with one embodiment of the present invention.

FIG . 5 is a diagram illustration proof of Lemma 8 in accordance with the present

invention.

FIG. 6 shows a normalized volume histogram of the output mesh obtained for the input image Patl identified in Table 1 , in accordance with one aspect of the present invention.

FIG. 7 shows an implementation design,

FIGS. 8A and 8B show two-dimensionai illustrations with re-triangulated areas.

FIG, 9 shows a graph representing the number of threads against non-local PEL removals. FIG, 1 OA shows an exemplary whole mesh as generated in accordance with the present invention,

FIG. 10B shows a mesh cross-section of FIG, 5A.

FIG. IOC shows statistics regarding performance and quality metrics of the present invention. FIGS. 1 1A-C shows graphs illustrating performance by the present invention during testing. FIGS, 12A and B show exemplary communication patterns for embodiments of the present invention.

FIG. 13 shows an illustration of a iiveloek as caused during an execution scenario for local- CM.

FIG. 14 shows a table illustrating a comparison among contention managers in accordance with one aspect of the present invention,

FIG. 15 shows a table illustrating specifications of cc-NUMA machines used in accordance with one aspect of the present invention,

FIGS. 16A and B show graphs comparing Random Work Stealing (RWS) and Hierarchical

Work Stealing (HWS) in accordance with one aspect of the present invention,

FIG, 16C shows a graph illustrating the breakdown of the overhead time for HWS in accordance with one aspect of the present invention,

FIG, 17 is a table illustrating weak scaling performance as described herein.

FIG. 18 is a diagram showing overhead time breakdown with respect to the wall time for the experiment on 176 cores of the abdominal atlas portion of the table in FIG, 17.

FIG. 19 shows an illustration of a virtual box meshed into six teirahedrals in accordance with one aspect of the present invention.

FiG. 20 shows an illustra tion of a the mesh in FIG. 19 during refinement as it is being carved in accordance with one aspect of the present invention.

FIG. 21 shows an illustration of the final set of tetrahedra as formed from the depictions in FIGS. 19 and 20 in accordance with one aspect of the present invention.

FIG, 22 depicts an exemplary parallel mesh generation algorithm for use in embodiments of the present invention.

FIG . 23 is a depiction of pseudocode elaborating on the implem entation of the

Local Contention Manager in accordance with one aspect of the present invention.

FIG. 24 shows an illustration of possible execution scenarios .for local~CM during six Time Steps in accordance with one erntodiment of the present invention,

FIG. 25 is a table illustrating hyperexecution of the case study shown in the table of Fig. 17 a.s described herein.

FIG, 26 is a table illustrating information about two input images used in a comparison as described herein. FIG. 27 shows diagrams illustration output meshes generated by PI2M of the present invention on an MR knee alias and a CT head-neck atlas in accordance with embodiments of the present invention,

FIG, 28 shows a table illustrating statistics regarding the single threaded performance in accordance with aspects of the present invention.

DESCRIPTION OF EMBODIMENTS

Introduction

It is noted that in this disclosure and particularly in the claims and/or paragraphs, terms such as "comprises," "comprised," "comprising," and the like can have the meaning attributed to it in U.S. patent law; that is, they can mean "includes," "included," "including," "including, but not limited to" and the like, and allow for elements not explicitly recited. Terms such as "consisting essentially of and "consists essentially of have the meaning ascribed to them in U.S. patent law: that is, they allow for elements not explicitly recited, but exclude elements that are found in the prior art or that affect a basic or novel characteristic of the invention. Embodiments of the present invention are disclosed or are apparent from and encompassed by, the following description.

Embodiments of the invention may be implemented by a programmable digital computer or a system using one or more programmable digital computers and computer readable storage media, including those designed for processing CAD-based algorithms as known to those of ordinary skill in the art. In one embodiment, Figure 1 depicts an example of one such computer system 1 00. which includes at least one processor 1 10, such as, e.g., an Intel or Advanced Micro Devices

microprocessor, coupled to a communications channel or bus 1 12. The computer system 100 further includes at least one input device 1 14 such as, e.g., a keyboard, mouse, touch pad or screen, or other selection or pointing device, at least one output device 1 1 such as, e.g., an electronic display device, at least one communications interface 1 18, at least one computer readable medium or data storage device 120 such as a magnetic disk or an optical disk and memory 122 such as Random- Access Memory (RAM), each coupled to the communications channel 1 12. The communications interface 1 18 may be coupled to a network 142.

One skilled in the art will recognize that many variations of the system 100 are possible, e.g., the system 100 may include multiple channels or buses 1 12, various arrangements of storage devices 120 and memory 122, as different units or combined units, one or more computer-readable storage medium (CRSM) readers 136, such as, e.g., a magnetic disk drive, magneto-optical drive,

S optical disk drive, or flash drive, multiple components of a given type, e.g., processors 1 10, input devices 1 14, communications interfaces 1 18, etc.

In one or more embodiments, computer system 100 communicates over the network 142 with at least one computer 144, which may comprise one or more host computers and/or server computers and/or one or more other computers, e.g. computer system 100, performing host arid/or server functions including web server and/or application server functions. In one or more embodiments, a database 146 is accessed by the at least one computer 244. The at least one computer 144 may include components as described for computer system 100, and other components as is well known in the computer arts. Network 142 may comprise one or more LANS, WANS, intranets, the internet, and other networks known in the art, in one or more embodiments, computer system 100 is configured as a workstation that communicates with the at least one computer 144 over the network 142. In one or more embodiments, computer system 100 is configured as a client in a client-server system in which the at least one other computer comprises one or more servers. Additional computer systems 100, any of which may be configured as a work station and/or client computer, may communicate with the at least, one computer 144 and/or another computer system 100 over the network 142,

Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may be found in U.S. Patent Application Serial No. 13/311 ,335, filed on December S, 201 1 , which claims priority to Provisional Application No.

61 /419,632 filed on December 3, 2010, and to Provisional Application No. 61/482,797 filed on May

5, 201 1 , the entirety of each of which is incorporated herein by reference.

Non-limiting exemplary embodiments of systems, methods and recording mediums which can implement the embodiments disclosed herein may also be found in U.S. Patent Application Serial No. 13/31 1 ,289, filed on December 5, 201 1 , which claims priority to Provisional Application

No. 61/419,603 filed on December 3, 2010, the entirety of each of which is incorporated herein by reference.

Section 1

In one aspect, the present invention pertains to a Deiaunay refinement algorithm for 4- dimensional (3D-H) segmented images. The output mesh is proved to consist of sliver-free simplices. Assuming that the hyper-surface is a closed smooth manifold, one may also guarantee 56477

faithful geometric and topological approximation. It is implemented and demonstrated that the effectiveness of the present method on publicly available segmented cardiac images, in one aspect.

Technological advances in imaging have made the acquisition of 4D medical images feasible. At the same time, pentatope capable FEM solvers operating directly on 4D data have been shown to be effective for advection-diffusion and Navier-Stokes formulations.

As described herein, a 4-dimensional Deiaunay mesh algorithm operates directly on a 4- dimensional image J. J represents the domain Ω to be meshed as the temporal evolution of a 3D where ^li is the 3D object at time (i.e., the i^Eh slice of

Volume mesh generation methods can be divided into two categories: PLC- based and Isosurface-based, The PLC-based methods assume that the surface

9Ω of the volume Ω (about to be meshed) is given as a Piecewise Linear Complex (PLC) which contains linear sub-faces embedded in 3 or 4 dimensions. The limitation of this method is that the success of meshing depends on the quality of the given PLC: if the PLC forms very small angles, then the overall mesh quality deteriorates and termination might be compromised. In Computer Aided Design (CAD) applications, the surface is usually given as a PLC. in biomedical Computer Aided Simulations (CAS), however, there is no reason to use this approach, since it might (depending on the geometry of the image, its segmentation, and its decimation) add the additional small input angle limitation. A workaround for this small input angle limitation is to treat the surface voxels as the PLC of the domain, since those input facets meet at large angles (90 or 380 degrees). This, however, would introduce an unnecessarily large number of elements and little control over the density of the domain.

The Isosurface-based methods assume that Ω is known through a function f ' R^d → R, such that points in different regions of interest evaluate / differently. This assumption covers a wide range of inputs used in modeling and simulation, such as parametric surfaces/volumes, level- sets and segmented multi-labeled images. Of course, these type of functions can also represent PLCs, a fact that makes the Isosurface-based method a general approach, isosurface-based methods ought to recover and mesh both the isosurface c and the volume. This method does not suffer from any unnecessary small input angle artifacts introduced by the initial conversion to PLCs, since 3Ω is recovered and meshed during refinement.

In one aspect, the present invention provides a space-time Deiaunay isosurface-based meshing technique for 4 dimensions. The present approach shows that the resulting mesh is sliver free consisting of pen tatopes whose boundary is a correct approximation of the underlying isosurface that space-time meshing is different from dynamic, surface simulations, In those simulations, the isosurface is not known; instead, a tetrahedral mesh is adapted on each time step thai describes accurately the tree surface dynamics.

One way to solve the space-time 4D problem is to mesh separately each

O

3D object "ti and then connect the elements between two consecutive objects to obtain space-time elements. However, finding such correspondence— which also has to satisfy the quality criteria- is not intuitive, especially when the topology and the geometry of the two objects varies drastically,

Q

Alternatively, one could mesh a single object ⁶U and then deform the mesh to match the shape of the other temporal instances. The limitation of this approach is twofold, first, the quality of the deformed mesh might be much worse than the original; second, there is no control over the mesh density across both the spatial and the temporal direction, since the mesh size of the original instance determines the size of the rest of the instances.

Space-time meshing methods have already been proposed in the past. They assume,

Q

however, that the evolving object has the same spatial space across time. Furthermore, the implementation of these techniques is confined to only the 2D + i case (i.e., the space-time elements are tetrahedra). The more general 3D -j- 1 meshing has been the focus in, but they consider only convex hyper-surfaces such as hyper-cubes or hypercyiinders. To our knowledge, the method presented in this paper is the first to address the 3D + t problem where the topology and the geometry of the evolving object may differ subs tantially through time, and hence, it is allowed to form complex hyper-surfaces.

In the past literature, it has been shown that given a sufficiently dense sample on a surface 8Ω, the restriction of its Delaunay triangulation to 3Ω. is a topological^ good approximation, or, alternatively, it satisfies the elosed-topoiogicai-ball property. The focus, however, was not. on volume meshing, but rather, on surface reconstruction. As described herein, the present approach fills the space- time volume Ω with sliver-free pentatopes, such that dtl is approximated correctly.

Computing the appropriate sample of the surface is a challenging task. In the literature, however, it is assumed that either such a sample is known or that an initial sparse sample is given. As described herein, the present method starts directly from labeled images (of one or many more connected components) and computes the appropriate sample on the fly, respecting at the same time the quality and fidelity guarantees.

3 1 in one aspect, the algorithm of the present invention guarantees that the resulted pentatopes are of bounded aspect ratio. We achieve that by generating elements of low radius-edge ratio and b proving the absence of slivers. One may clean the mesh from slivers by integrating into the present framework the theory presented in other literature, where the surface is given as an already meshed polyhedral domain (i.e., the method in the literature is a PLC-based method), a different problem than here, since it is our algorithm's responsibility to mesh both the underlying zero-surfaces and the bounded volume with topological and geometric guarantees.

The present disclosure introduces some basic terminology, presents the algorithm according to embodiments of the present invention, provides proofs and e valuates the methods described herein on segmented 4D cardiac data.

The input of the algorithm according to embodiments of the present invention is a segmented n dimensional image I C Rⁿ . The object Ω £ J is assumed to be represented as a cut function f : Rⁿ → R , such that its surface c¾Q is defined by the set {f (p) = 0}. Clearly, from a segmented image, the zero-surface {f (p) = 0} can be easily computed by

interpolating the voxel values.

It is assumed that given a point p e R⁴ , one can ask for p's closest point on 8Ω. This can be accomplished by an Euclidean Distance Transform (EDT). Specifically, the EDT returns the voxel p e (%l which is closest to p. Then, one may traverse the ray pp-' and compute the intersection between the ray and ΘΩ by interpolating the positions of different signs. Points on 8Ω are referred to as feature points.

Fig. 2 shows a 2D illustration 12 as described. The simplex σ = {v, w} and its surface ball Βζ,σ. m is the midpoint of σ. Observe that since the radius Rz,a αϊ Βζ,σ is larger than the radius Rff - jm - vj of Βσ , the picking region of σ as defined here is larger than the picking region described in the literature.

The local feature size ίί¾_Ω (x) of a point x≡ dil is defined as the (closest) distance between x and the medial axis of 5Ω, Since 0Ω. is smooth, the local feature size is bounded from below by a positive constant ffsgn , that is, Ifsea (x) > ifsgn > 0. Another useful property is that the local feature size is l-Lipsehitz, that is,

lfsan (p)≤ |p ^~ q| ÷ ifs¾i (q) . (1)

A point set V C d£l is called an ε-sample, if for every point p <≡ dQ there is a point v V at a distance at most ε ¹ lfs_¾i (p) from p, Let V be a finite set. of vertices V = {vj , . . . , V } ^C " . The Delaunay triangulation oj;,V is denoted by D (V ). A k-simpiex = {vj , . . . , Vk+i} *= D (V ) is a simplex defined by k + 1 vertices. We denote the length of the shortest edge of a simplex σ with L_l7. The circumball B₀ of a simplex σ is the smallest closed bail circumscribing a^'s vertices. R_o is the circumradius length of the simplex and C_o is its circumcenter, The radius-edg ratio of a simplex σ is defined as ρ_σ ^::::™.

The voronoi cell V (v) of a vertex v ε V is the set V (v) = {p = Rn j |v ^~ p|≤ !q ~ p|, Vq <≡ V } . The voroiioi dual of a simplex σ€≡ D (V ) is defined as the set V (σ) = {V (vi ) Π V (vj ) j

The restriction of D (V ) to a topological space X is denoted by D_i (V ). D;x (V ) is a simplicial complex (as is D (V )) that contains simplices of D (V ) whose voronoi dual intersects X in a non-empty set, Consider a k simplex σ and let V (σ) intersect X at a point z. Any ball centered at z circumscribing σ is called a surface bail The corresponding surface bail is denoted by B_{7 !S} and its radius by R^, in the sequel By the definition of Voronoi diagrams, B_Zj(! does not contain any vertex of V in its interior. The metric used to characterize the quality of a simplex o¾ is Τ_σ& = Low

values of T imply a poor-quality element,

Definition 1 (Sliver), Simplex σ is a sliver if it contains a k-simpiex (k < 4) such that P_Gk < p, Τ_ΰ < and for any m-simplex σ„, of σ (m < k), Ρ_σ < p , T_0m ≥ f.

The picking region PR (σ₄ ) of a 4-simpiex 04 is defined as the 4~dimensional solid bail centered at C with radius ζ Κ_ο4, ζ < ί . Consider a restricted k-simpiex σ-_κ and its surface ball B_¾e k < 4. its picking region PR (σ_¾) is the intersection between dQ and the 4-dimensionai solid ball centered at z with radius ζ R_Z,1 , ζ < 1 , Note that PR (σ₄) and PR (oij are contained in B₀ and Β_Ζ>σ, respectively. Observe that the picking region of σ_¾ (k < 4) is a topological k-bali and does not belong (necessarily) in the affine k dimensional space defined by , This is different than other definitions, where the picking regions are defined inside the intersection of Β with the affine space of o. The reason for this change is the fact that the input of our algorithm is not a Pieeewise Linear Complex (PLC) but a cut function,

A good point p e= PR (σ) is a point thai does not introduce smaller slivers, A sliver is small when its radius is less than bR₀, It has been proven that (a) the number of small slivers 8(σ) possibly created after the insertion of p is constant, and (b) the volume |FG| (the forbidden region) inside which p forms a small sliver is bounded from above. The same findings hold in our case too, where the picking region of a restricted facet σ3 is not inside the intersection of Β_Σ3 and a^s affine space, but inside the intersection of Β_{¾ 3} and dil.

Lemma 1 , Given an almost-good mesh, a point p inside the picking region of a can be found in a constant number of random rounds, such that any new sliver created after the insertion of ρ has eircumradius no smaller than bR_<j_k if k = 4„ or no smaller than b _¾„_k if k = 3.

Remark 1 . The proof is similar to [30], since |Fe? \ and S(r) do not change and the volume of the intersection ofB^ and σ₃ 's affine space is smaller than the intersection of B_Z;C3 and ΘΩ. See Fig. 2 for an illustration.

Algorithm

The user specifies a parameter δ. It will be clear as described elsewhere herein that the lower 6 is, the better the mesh boundary will approximate il. For brevity, the quantity δ · lfs_¾j(z) is denoted by Δ_<¾α(ζ), where z is a feature point,

The algorithm according to aspects of the present invention initially inserts the 16 corners of a hyper-hox that contains the 4 dimensional object Ω, such thai the distance between a box comer x and its closest feature point z = efj¾o(x) is at least 2Δ;;.π(ζ). After the computation of this initial triangulation, the refinement starts dictating which extra points are inserted. At any time, the Delaunay triangulation D (V ) of the current vertices V is maintained. Note that by construction, D (V ) always covers the entire hyper-vohsme and that any point on the box is separated from dil by a distance at least 2Δ_¾>(ζ), where z is a feature point.

During the refinement, some vertices are inserted exactly on the box; these vertices are called box vertices. The box vertices might he on 1, 2, or 3-diraensional box faces. This description shall refer to the vertices that are neither box vertices nor feature vertices as free vertices.

The algorithm inserts new vertices for three reasons: to guarantee that (a) 5Ω is correctly recovered, (b) all the elements have small radius-edge ratio, and (c) there are no slivers.

Specifically, for a 4-sirnpSex 0 in the mesh, the following rules are checked in this order; ^■ R 1 : Let B_o4 intersect d€ and z be equal to cfpao(C _CA). If Z is at a distance no closer than Δ&-)(ζ) to any other feature vertex, then z is inserted,

■ R2: Let B„ intersect 5Ω and z be equal to cfpea(C_<,₄). If _o ≥ 2Aaa(z), C„4 is inserted. ^■ RJ: Lei C ₀-₄ lie inside Ω, if _σ > p , C_a!fr is inserted,

^■ R4: Let C _s4 lie inside Ω. if σ₄ contains a sliver, a good point inside PR (σ₄) is inserted.

^■ R5 : Lei σ¾ (σ₃ c c?₄) be a restricted facet If the vertices of 03 are not feature vertices, then a good point z inside PR (σ₃ ) is inserted, All the free vertices closer than Δ_5Ω(ζ) to z are deleted.

5

For i < j, priority is given to Ri over Rj. That is, right before the insertion of a point, because of Rj, there is no element that violates a rule Ri, Also, in R4, priority is given to the lower dimensional slivers that σ₄ might contain. Whenever there is no simplex for which R I , R2, R3. or R4 apply, the refinement process terminates. The final mesh reported is the set of pentatopes whose

J O circumcenters lie inside Ω.

In a nutshell, R3 and R2 is responsible for generating a sufficiently dense sample on di^'l. R5 makes sure that the vertices of the simplices restricted io d l lie on 3Ω. Lastly, R3 and R4 deal with the quality guarantees. As described herein, the present invention shows that there are values for h, ζ , and p thai do not compromise termination.

15 To prove termination,^ 0 vertices should be inserted outside the bounding box. Notice,

however, that vertices inserted due to R2 may lie outside the bounding box. To deal with such cases, C_(J4 is rejected for insertion. Instead, its projection C\_T4 on the box is inserted in the

triangulation. That is, C _σ4 is the closest to C₄ box point. As described elsewhere herein, we prove that the insertion of projected points do not compromise either quality or fidelity. Note that these 0 projections are different than the traditional encroachment rules described in prior publications, R.ecali that pentatopes with circumcenters on 9Ω or outside Ω are not part of the final mesh, and that is why rule R3 and R4 do not check them.

Fig, 3 shows a 3D illustration 14 of the proof of Lemma 3 in accordance with the present invention.

5

Termination and Quality

in accordance with the present approach, the appropriate values for ζ , p, and b are specified, so that the algorithm terminates. Specifically, it is shown during refinement the shortest edge introduced into the mesh cannot be arbitrarily small. Suppose that σ violates a rule Ri. σ is called an 0 Ri element. R i dictates the insertion of a point p (and possibly the removal of free points). Point p is called an Ri point. The insertion radius Rp of p is defined as the length of the shortest edge incident to p created right after the end of Ri and the parent P(p) of ρ as the most recently inserted vertex incident to the shortest edge of σ.

Lemma 2. Let p and q define the shortest edge of a simplex σ and q being inserted after p. Then Rq

Proof. Assume that right after the insertion of q, p is the closest point to q. in this case, R_q ^::: jp - qj = L_c. Otherwise, there has to be another closest vertex to q, which implies that Rq < jp ~ qj - L₀,

The following Lemma bounds from below the shortest edge introduced into the mesh after the insertion of a box vertex:

Lemma 3, Let v be a box vertex inserted into the mesh. Then, R_v > 2Δϋπ(ζ), where z is a feature point.

Proof A box point v is inserted only because of R2. The circumcenteg C_c of a pentatope σ iies on or outside the box and its projection C'„ ^::: falls on the box. Without loss of generality, assume that the projection iies on the interior of a 3 -face (i.e. a box tetrahedron) F . See Fig, 3 for a 3D illustration 14, (If C₀ lies precisely on the box, C'₀ is equal to v.) Consider the (2D) disk (drawn) of B₀ which is copianar with F . That disk contains v and separates Β_σ in two sides: the side that contains C_a and the side that contains a part of the box. It is claimed that the closest vertex—-say w— to v lies on the intersection of B_r s boundary and the ray C,_sv. To see why, note that Β_σ is empty of vertices, and therefore, the closest to v that an arbitrary vertex v already in the tri angulation can be is when it lies on the boundary of B₀- and on the side of Β_σ that contains a part of the box, as shown. Consider the triarigleAw'vC_o-. From the law of cosines:

|v - w' p = jC_c - w + |C₀ - vp - 2 jC_<s - wj jCa - vj cos ω

> |C_e - w' p + |C_CT - vp - 2 IQ - w'i jC_o - v| since cos ω < 1

= (|C₀ - w' j - jC_lT - vj)²

:^::: (Ι _σ ^~ IC_ff - vj)² since w' lies on the sphere and the claim is proved.

Therefore, any possible new edge connected to v has length at leagt jv - wj. Since, however, σ triggers 2, B„ has to intersect dLI, Therefore, there has to be a feature point q £≡ Ω

(illustrated) inside Β_σ and on the same side of F as w. Let us denote with q' , the projection of q to the box face F, By construction, jq - q' i is at least 2Δ_ί;Ω(ζ), where z is a feature point. Observe, however, that jv - wj is always larger than jq - q^f j, because vw k qq^; , and the statement holds. Similar reasoning applies in the case where CO lies on a box triangle or a box edge.

The following Lemma proves a lower hound on the lengths created into the mesh because of Rl and R2:

Lemma 4, Let p be a vertex inserted into the mesh because of Rl or R2. Then, R_p ≥ Δ&-} (z), where z is the closest feature point to p. Proof. If Rl is triggered, then p is equal to z and since there is no other feature point already inserted in the mesh closer than &_5Ω (p) to p. the statement holds. Otherwise. R2 applies for a simplex 0 and p is equal to C_o4. Due to the empty ball property, Rp is at least R_o4≥ 2Δ¾} (cfpso(p)), and the statement holds. Lemma 5, Let p be a verte inserted into the mesh because R3 applies for an element σ, Then, R_p ≥ p R„. _Λ

Proof, Since p is equal to C_a, R_v ≥ R_a - p_a l_a ≥ pL_a„ Lemma 2 suggests that

L_c > i?_Pfft) , and the results follows.

Lemma 6. Let p be inserted into the mesh because of R4. Then, ^RP ≥ ¾^ Λ_Ρ(νι), if P(p)ts neither R4 or R5,

R₀≥ bR_v , otherwise. Proof. Let a be the simplex that violates R4.

Suppose that P (p) is neither R4 nor Ri. Since p belongs to the picking region of σ, R_P > (1 -- ζ )R_ff ≥— L,,. From Lemma 2, we have that R„≥ ™ R~. ,

Otherwise, consider the case P (p) is an R4 point. From Lemma L one may know that the circurnradius of σ is more than b times the circumradius of the R4 simplex σ' that inserted P (p). Therefore, R_P > (1 - ζ )R_<j > ( I - ζ )bR^ . However, the quantity ( 1 - ζ ) Κ_(Ι/ is equal to R_p . and the statement holds.

The exact same logic holds when P (p) is an RS point, by just substituting ^- for R„/ , where σ^'' is an R5 simplex.

Lemma 7. Let. p he inserted into the mesh because of R5. Then,

* — fip,^ if P(p) is not an R5 point,

' Aar.(z), otherwise. Proof. Let σ.3 he the simplex that violates R4.

Suppose that P (p) is not an R5 point. Because of Lemma 2, the shortest edge of σ₃ is at least p _p), Therefore, any surface ball of σ_¾ has radius at least ¾¾¾_>)■ Since the surface ball does not contain anv vertex in its interior, R„≥— , .

Suppose that P (p) is an R5 point. Note that when P (p) is inserted, all th free vertices closer than Δβη(Ρ(ρ)) to P (p) are deleted. Due to R5, σ₃ contains at least one free vertex. Since P (p) is the most recently inserted vertex incident to the closest edge of σ_¾, the edge that contains P (p) and the free vertex has to he at least Δ^α (P (ρ))· Therefore, any surface ball of σ₃ has radius at least

~ Δ_βΩ (Ρ {pj). Hence, R_p≥ ^ A_d (P (p)).

Putting all the Lemmas together, the solid arrows of Fig. 4 show the insertion radius of the inserted poin t as a fraction of the insertion radius of its parent. An arrow from Ri to Rj wi th label x implies that the insertion radius of an Rj point p is at least x times iarger than the insertion radius of its Ri parent P (p). The label x of the dashed arrows is the absolute value of R._P. Note that the labels of the dashed arrows depend on the local feature size of

3Ω, and as such are always positive constants.

Recall that during refinement, free vertices might be deleted (because of RS). Nevertheless, 7

such deletions of vertices are always preceded by insertion of feature points. Considering the fact that feature vertices are never deleted from the mesh, termination is guaranteed if we prove that the insertion radii of the inserted vertices cannot decrease indefinitely. Clearly, if there is no solid cycle of product less than 1 , termination is guaranteed.

Theorem I . The algorithm terminates producing simpHces of bounded aspect ratio, if

(ΐ-ζ)^ζ

V ≥

Proof, See Fig. 4. FIG, 4 is a flow diagram 16 depicting th relationship among the rules in accordance with one embodiment of the present invention, As shown, no solid cycle should have a product less than i , The dashed arrows in the diagram break the cycle. The smallest produc is produced by the solid cycles R3-->R4-→R5-^→ 3 and R4→R5→R4, By requiring the iabel product of these loops to be more than 1, the desired result follows,

Accuracy

As described herein, the mesh boundary is equal to the restriction of a SO sample Δ to dCl, In the literature, it is proved that these tetrahedra approximate the surface correctly, in geometric and topological sense. First, we show that 6 directly controls the density of the feature vertices. Let V be the set of vertices in the final mesh and Δ be equal to V Π 3Ω, i sS

Lemma 8. Let δ < ~, Then Δ is a-——--sample of ο'Ω.

Proof. Recall that upon termination, there is no tetrahedron for which Rl , R2, R3, R4, or R5 apply, See Fig. 5. Let p be an arbitrary point on dCl. Since D(V) covers all the domain, point p has to lie on or inside the circum sphere of a pentatope σ (not shown). Fig. 5 is a diagram 18 illustration proof of Lemma 8 in accordance with the present invention. Hence, Β_σ intersects 0Ω, Let point p' be the feature poin closest to C₀. Note that |C₀ ^~ p\≥ \C„ ^~ p'j and therefore p^'' lies on or inside o's circumsphere. We also know that a's circumradius has to be less than 2,Asn (p since otherwise R2 would apply for t, Therefore, one has the following; ρ - p'j < 2R.3 (because both p and ' He on or inside Βσ )

< 4A¾t (p*) (because oi^'R2)

< 4δ O - '! + IfS_ftO (p)) (from Inequality (1)), and by reordering the terms, one may obtain that:

Moreover, there must exist a feature vertex v in the triangulation closer than Δ¾ι ρ') ~ δ ifsan(p') to P^' _> sinc otherwise Rl would apply for σ. Hence, iv -p^!| < 6 - Ifssn (p and using Inequality ( 1), one may have that:

jv -p'| < δ (|p -p'j + !fssn(p)) (3)

Applying the triangle inequality for Δρνρ' yields the following:

ip -- vj < 1— ρρΊ + iv -p'j

< ip - ρ'] + δ (ip ^~ p^'j + Ifsaa(p)) (from inequality (3))

= |p - p'| (l + δ) + δ · lfs¾i(p)

< -^— ifs_t¾-j(p) (1 + δ) + 6-lfs_¾>(p) (from Inequality (2))

and the proof is complete.

Denote with coi one of the it connected components that Ω consists of: Ω = . The next two Lemmas prove a few useful properties for the mesh ΜΩ and its boundary dMQ. Our goal is to show that 3ΜΩ. is always non-empty and does not have boundary (Lemma 10), a fact that will be used for proving the fidelity guarantees (Theorem 2).

Lemma 9. Let 6 <-. Then, for every coj there is a pentatope σ 6 D (V ), such that C„ lies inside co .

4

Proof. Consider a single connected component co_:. The same reasoning applies for any connected component of Ω. For the sake of co radiction, assume that there is no pentatope whose eircuracenter lies inside a _t. Since the triangulation D (V ) covers all the domain, the eircumbafls of the pentatopes in D (V ) also cover the domain ω,. Therefore, there has to he a drcumbali B₀- (σ & D (V )) which intersects a point m on the medial axis of d®-, such that m lies inside o¾. By our assumption, the circuracenter C₀ cannot lie inside coj. Therefore, B._T intersects οω,. Also, recall that R2 cannot apply to any pentatope. Hence, we have the following:

2 · δ^■ lfsen (cfpan (C„ )) > ¾ (from R2)

> (since m and cf g_a (C_c) do not lie outside B_s)

> Vsto & m m _{(since m {s on the medial axis)}

δ >

which raises a contradiction.

Lemma 10, Let δ≤ -. Then, <5 n is a non-empty set and does not have boundary.

Proof, The fact that dMa is a non-empty set follows directly from Lemma 9; since Ma cannot be empty, its boundary dMa cannot be empty too, For the other part, since CMQ is the boundary of a set of tetrahedra, it cannot have a boundary.

The following Theorem proves the fidelity guarantees:

Theorem 2. The mesh boundary OM_Q is the restriction to 5Ω of Δ ^:::: V Π dil,

Proof. Let f be a tetrahedron σ₃ in dM - As such, V(o₃) intersects ΘΩ. Due to 1 5, the vertices of <¾ lie on 5Ω. Recall that the surface ball does not contain vertices in its interior. Therefore, is empty of vertices in V Γ ι 3Ω also. Without loss of generality, assume that the vertices in V are in general position. Since there is a ball that circumscribes σ₃ and does not contain vertices of V Π δί^'ϊ in its interior, σ₃ has to appear as a simplex in D (V Π 5Ω), Since the center z of the surface bail lies on dQ., then the voronoi dual of σ₃ intersects dil in Ό& (<3Ω ⁽Ί V ), as well. Hence, ΒΜΩ, £ D;sn tan n v ).

For the other direction, we will prove that Ma cannot be a proper subset of Dja j (3Ω Π V), and therefore, equality between these 2 sets is forced. Toward this direction, we will prove that any proper non-empty subset of D n Ω. Π V ) has boundary; this is enough, because we have proved in Lemma 10 that 3Μ_Ω is non-empty and does not have boundary,

O_m (dQ (Ί V ) is the restriction of a sample of a closed manifold ο'Ω and therefore it is a 3-manifoJd without boundary. That means that any 2-simplex in D_|¾_¾ (δΩ Π V ) is incident to exactly two 3-simplices of D_|¾i (5ii Π V ), Since any proper non-empty subset A of (3Ω, Π V) has fewer 3-simplices, A contains at least a 2-simpiex σ₂ incident to only one 3-simplex. Bui this implies that σ₂ belongs to the boundary of A, and the proof is complete, Experimental Evaluation

The algorithm is implemented in C++. We employed the Bowyer- Watson kernel for point insertions. The removal of a point p is implemented by computing the small DeSaunay

triangulation of the vertices incident to p, such that the vertices inserted earlier in the triangulation are inserted into the small triangulation first, it can be shown that these new created pentatopes can always be connected back to the original triangulation without introducing invalid elements. For the Euclidean Distance Transform, we made use of the related filter implemented in ITK, Lastly, we borrowed CGAL's exact predicates for the accurate computation of the 4D in-sphere tests,

Table 1 below shows information about the images of the five patients used in this section. The spacing for ail the images is (1.77, 1 .77, 6, l)mni⁴.

Table 1

Case Pat2 Pat3 Pat4 PatS

#Voxeis (100x100x44x15) I (300 100x34x55) (100x100x26x15) 7ioikTb¾ 21xi^'5) 00χΪ00χ2 χΪ5)^'

We ran our code on five (segmented) images obtained from the 4D Heart Database. The first three represent the moving left ventricle of the patients, while the last two the ventricle together with the myocardium for 15 cardiac cycles, information about these data is given in Table 1.

Recall that the present algorithm needs the distance of any point on dil from the medial axis. The robust computation of the medial axis is a very difficult problem (involving computing the exact medial axis, for a review of image-based medial axis methods, and for computing the medial axis given a set of surface points).

Table 2 below shows statistics of the output meshes generated for each patient.

Table 2

Therefore, in the implementation, it is assumed thai Ifsani ) is uniform and equal to the unit, which implies that Δ^ρ) becomes equal to δ. That is, in practice, δ determines a uniform and (if small enough) dense sample of the surface, it was experimentally verified that a 5 value equal to 5 (the length of five consecutive voxels along the fourth dimension) yielded manifold mesh boundaries with vertices lying precisely on the iso-surface in accordance with Theorem 2,

The quantity ¾ determines the aspect ratio ofpentatope σ, but it is not normalized, and therefore, it is hard to draw comparative conclusions, For this reason, for a pentatope σ of the final mesh, we report its normalized Volume _f( defined as the ratio of its volume over the volume of a regular pentatope with eircumradius equal to the circumradius of σ (or alternatively†„ =™~^.).

Clearly, f_a€ [0, 1 ], where a value of 0 implies a degenerate

pentatope, while 1 implies a perfect quality.

Fig. 6 shows a normalized volume histogram 20 of the output mesh obtained for the input image Patl identified in Table 1 , for example.

Table 2 shows quantitative data for the mesh generated on each image, We set the radius of the picking regions equal to ζ = ~. Theorem 1 suggests that be at least 16 and b at least 4, We experimentally observed that by selecting 4 to 10 random points within the picking regions (both the 4- and the 3-topologieal balls), no small element σ was created with f₀ less than 0.01 , Despite the fact a value of 0.01 is rather small, it is three orders of magnitude larger than the minimum normalized volume reported in the case where no picking regions are employed at all. Also, notice that the average normalized volume is much higher than the minimum value. This fact together with the observed small standard deviation implies that most pentatopes have normalized volume away from the minimum valu and very close to the a verage. Fig. 6 shows the histogram 20 of the normalized volumes for the first experiment of Table 2, that is, when the input image Patl was used. Similar histograms are observed for all the other inputs as well. 6477

it can thus be seen thai the present invention provides a space-time meshing method for (3D-H) image data. The method is able to provably clean up slivers and recover the hyper-surfaces faithfully. Experiments on five 4D cardiac images show thai the resulting meshes consist of elements of bounded aspect ratio. Efficient Discontinuous Galerkin formulations require that not only the hyper-surface should be recovered but also the evolving 3D object at certain time steps. This is a more challenging task considering the non-manifold nature of the underlying space-time domain, Because of the increased memory space needed for high dimensional meshing, the 4D algorithm of the present invention may be slower than a three dimensional Delaunay rnesher.

Section 2; Multi-threaded processing

The design 155 in FIG, 7 illustrates the basic building blocks of an embodiment of a multithreaded design. It is contemplated that embodiments of 4D meshing may be implemented in multithreaded applications similar to those described herein, In other words, examples below may discuss multi-thread applications in reference to 3D methods, but such embodiments or examples extend to 4D methods disclosed herein.

1) Poor Element List (PEE): Each thread Γ maintains a poor element (multi) list (FEE,). A tetrahedron t can only reside in at most one PEL. Depending on the rule that applies for t, a specific point /> is computed for insertion or removal and is passed to the Triangulator [6j. When a certain bad element t in PEL. is examined, T, acquires it by locking its vertices. If t is already locked (by the Triangulator, see below), then T. moves on to the next bad element in PEL- , If locking was successful, then a point p for insertion or removal is computed, according to the Rule that applies for t,

2) Triangulator: If the operation is an insertion, then the cavity C (p) needs to be computed and re-triangulated. The cavity is defined as the set of those elements whose circumba!l contains p, i.e., they violate the Delaunay property. Computing the cavity involves two steps: (a) locating the first tetrahedron in C (p), and (b) delete the tetrahedra in C (ji) and create new ones that satisfy the Delaunay property. For finding the location of the first element in the cavity, we make use of the visibility walk, Specifically, starting from the poor element orientations tests are performed with respect to p, until the first element that contains JE? is reached, Having found that first element, the second step is performed via the Bowyer- Watson kernel: point J? is connected with the boundary triangles that are left in the cavity, it can be shown that, the new elements do not intersect each others' interior and also satisfy the Delaimay property. See FIG. 8A for an illustration 160 in 2D,

Removing p involves the computation and re-triangulation ofjt's ball 2> (p). This is a more challenging operation than insertion, because the re-triangulation of the hole in degenerate cases is not unique which implies the creation of illegal elements, i.e., elements that cannot be connected with the corresponding elements outside the ball. We overcome this difficulty by computing a local Delaimay triangulation _r.,,(or 2>,^ for brevity) of the vertices incident to p, such thai the vertices inserted earlier in the shared triangulation are inserted into Ό„ first. It can be shown [6] that these new locally

S

created ball telr hedra can always be connected hack to the shared triangulation without introducing invalid elements, See FIG. 8B for an illustration 162 in 2D.

in order to avoid races associated with writing, reading, and deleting vertices/cells from a PEL or the shared triangulation, any vertex touched during the operation of element location, cavity expansion, or bail filling need to he locked. We utilize GCC's atomic built-in functions for this goal, in the case a vertex is already locked by another thread, then we have a rollback: the operation is stopped and the changes are discarded. When a rollback occurs, the thread moves on to the next bad element in the list.

3) Update Cells: When an operation is successfully completed, the new and deleted cells have to be updated, Specificaiiy, deleted elements already in T!s PFJ, have to be removed from the list and the newly created elements have to be checked in order to determine whether or not a Rule applies for them and pushed back into a Poor Element List, The PEL that the new poor elements are pushed into is determined by the Load Balancer.

4) Load Balancer: When the algorithm starts, only one thread (the main thread) has elements in its PEL, as a result of the triangulation of the initial box. Clearly, the work needs to be distributed to other threads as well from the very early stage of refinement. We implemented and compared three load balancing techniques that naturally matched our implementation design: random work stealing, hierarchical work stealing, and a static box partitioning. Load balancing is described in detail herein,

5) Contention Manager (CM): in order to eliminate live-locks caused by repeated rollbacks, threads talk to a Contention Manager (CM). Its purpose is to pause the execution of some threads making sure that at least one will do useful work so that system throughput can never get stuck. See Section VII for two approaches that not only guarantee absence of Hvelocks, but they are able to greatly reduce the number of rollbacks and yield a considerable speedup, even in the generation of very small meshes. We also show that CMs also decrease the amount of idle time, which implies a higher FLG PS/Watt ratio. The contention manager is invoked every time there is a rollback.

Detail optimizations implemented justifying their selection with extensive experimental analysis are described herein, along with performance evaluation between the single-threaded algorithm of the present invention and CGAL, the state-of-the art open source mesh generation took

Optimizations

We use the PA PI library to count TLB+LLC misses and inter-socket QPI remote accesses, For this study, we used both CRTC (our cc-NUMA machine) and Blacklight, See Table 1 for the specifications of these systems. The i tra-node communication is achieved through Intel's Quick Path Interconnect (QPI). QPi also enforces cache-coherence (hardware enabled) via MES1F snooping. Accesses through QPI is an order of magnitude higher than accesses to the socket's local memory (as a result of L3 cache misses). The real overhead, however, is when a socket has to talk outside its node. Biacklight's inter-blade communication is achieved via a NUMAIink® 5 fat tree network, Despite the network's sufficient bandwidth, there is a 2,000 cycle latency per hop, which is by far the bottleneck in communication. Our goal therefore, is to: (a) decrease TLB+LLC misses for improved infra-socket communication, and (b) to decrease the inter-socket accesses as a result of intra~bla.de or (worse) inter- blade communication, Hereafter, we shall refer to inter-blade accesses as remote accesses,

The main optimizations we implemented include: (1) vector reuse, (2) custom memory pools, (3) class member packing, (4) local PEL cell removals, (5) replication, and (6) light-threaded memory pool interactions, The first two optimizations are extensions of the optimizations suggested by Antonopoulos et ciL, so that dynamic removals can benefit as well. Table 1 summarizes our findings,

s) Use last two ofsdsmigstjoas f m^*

For the first four optimizations (see Table Ha), the algorithm of the present invention ran on one and on six cores of CRTC, For each execution, the present invention reports the savings on the execution time and on the TLB+L3 misses that we observed when we turned the optimization on. We used six threads for the multi-threaded experiments, because we wanted to pin them on a single socket (of CRTC). Therefore, inter-socket communication via the Intel QPI link do not affect the timings, and thus, we measure only the impact of TLB and LLC misses. The goal of the last, two optimizations is to decrease the inter-socket accesses (see Table 1 lb). For this experiment, we used four and eight sockets of Biacklight (or equivalency, two and four blades) to demonstrate the improvements yielded by these optimizations. When a certain optimization is evaluated, the previously evaluated optimizations are kept on.

Note that different runs in Biacklight might be assigned to a different CPU mask, which might be associated with a higher number of hops. Therefore, in order to get consistent results, we made sure that each related run is pinned on the same physical cores,

1) Vector reuse: During refinement, certain vectors are accessed very often. There are three types of such vectors: (a) the vector that stores the cells in a cavity to be deleted, (b) the vector that stores the ball cells in Ό _m to be connected back to the shared triansulation. and (c) the vector that stores the incident vertices of a vertex to be removed.

in the early stages of our implementation, these vectors allocated new space every time and after the completion of the operation this space was freed. In order to decrease the number of system cails on the shared memory (which introduces also extra overhead for synchronizing the global memory resources), each vector reserves space with size equal to a certain number of elements. When the operation completes, only the elements are discarded; the memory allocated is reused to store elements for the next operation. When there is not enough space to accommodate the desired number of elements, the space is expanded (doubled). In order to minimize the frequency of vector expansions without introducing intense memory segmentation, we performed some tests on various linages to determine the length of these vectors. Table III provides measurements for the length of the vectors frequently used during the course of refinement.

The first column corresponds to a inesh of 42,049 elements and the second to a much larger mesh of 833,860 elements. Both meshes were created for the knee atlas, Although the max length values strongly depend on the size of the mesh, the average values are consistent and seemingly independent of the size of the mesh. (Slight change in the average number of cells in a cavity agrees with the findings in as well). Therefore, for each vector, we initially reserve space able to hold its average number of elements, Vector reuse boosts the speed by a little (see first row of Table 11a), which contradicts the finding in where a 36% improvement is reported. We attribute this difference to the fact that in 3D refinement; a) the computation involved is three times more than in 2D and b) the pointer chasing is dominating over the vector traversa!s. Indeed, the average size of a 3D cavity is more than 3 times larger than the size of a 2D cavity, which implies that finding the elements in a 3D cavity involves more pointer chasing. For these reasons, the improvement of vector re-use does not show.

2) Custom memory manager: Rather than allocating and deallocating vertices or ceils one at a time, the present invention maintains a memory pool for ceils and vertices for each thread. When a new element is created, it requests space from its (private) pool, if an element is deleted, its memory returns back to the pool, and therefore the memory footprint is decreased which improves both the single and the multt -threaded application. Case studies show that the number of cells is roughly 3 times more than the number of vertices created during the refinement. The present invention sets the chunk size for ceils three times larger than that for the vertices. Setting the chunk size for the vertices o E^¾^ito¾¾l yields the best results. See the second rows of Table Ila, In contrast to vector reuse, the memory pools paid off. This is also confirmed by the misses decrease,

3) Class Member Packing: The classes that represent the cell and the vertex need to store boolean variables used for checking the status during a. traversal, the invalidations, and the rule that applies for them, instead of declaring them as boolean, we pack them in BitSets, so that every variable occupies now a single bit. This decreases the memory occupied per cell and it also results in a. more compact alignment. The size of the cell and vertex class are now reduced by 20% (88 bytes totally) and 40% (48 bytes totally), respectively, and therefore, caching is likely to be improved, indeed, the savings in TLB+LLC misses (see third row of Table Ila) arc more than those of vector reuse. Although the memory manager is on, we still get a small improvement in time,

4) Local PEL. cell removals: Extra care is required when T; deletes a cell ¾ already in the PELj; of another thread T, Our first attempt was to let T. remove C, from PEL:,, We call this removal a non-local PEL removal to distinguish it from a local PEL removal where T removes a cell from its own PEL, This, however, although easier to implement, requires extra synchronization and communication between T, and T, Specifically, T, has to lock 3 PEL cells each time it wants to visit a node, because a removal might be in progress nearby. Figure 4, however, shows that even in the case of a high number of threads (128), non-local PEL removals account for about 33%.

For this reasons, Ti is not allowed to remove c. from PEL.. Rather, it raises c.'s invalidation j j

flag. When T., reaches q before anything else, it checks whether or not is invalidated. If yes, then it removes from its own PEL and it moves on to the next bad element. The advantage of this approach is that there is no need for synchronization. The disadvantage is a possible increase in the memory footprint, since now C\ does not return its address to the memory poof for re-use immediately after it is not needed any more, but only when T. finds out that it is invalidated. However, we do not expect a

J

radical increase in memory footprint as the graph 170 in FIG. 9 suggests. Indeed, the last row of Table Ila shows that local PEL removal is superior, in fact, we even observe a better caching which contradicts our intuition. We attribute this result to the fact that the memory pools are large enough to cover the need for more ceils even if they are not deleted right away.

5) Replication: in order to decrease the number of QPI accesses, we replicate read-only data structures (the segmented image for example) per socket, taking advantage of the "distributed part" of MUMA machines' architecture. Therefore, threads of the same socket have their own copy stored in their local memory bank. As shown in Table lib, we get much faster executions, as a pure result of a 99% remote accesses decrease in both 32 and 64 threads (the TLB+LLC misses remained almost intact),

6) Light-threaded pool interaction: As explained elsewhere herein, when T; deletes an element e, e's address is returned to Tj's private pooh However, it might be the case that e Is allocated by a remote thread. As a result, T. re-uses remote space. To resolve this issue, T returns e's address to the remote thread's memory pool. This introduces an extra pool synchronization, but as Table ΪΙΒ suggests, when the number of threads increases, this optimization decreased the QPI accessed and as a result the overall performance. For lower core counts, the optimization yielded worse results (see the negative percentages), because the extra intra-blade communication associated with the pool synchronization dominates. Nevertheless, since we are interested in iarge number of cores as well, we kept this optimization on.

Single-threaded evaluation

The first sequential prototypes of our refiner techniques were initially built on top of CGAL's triangulator. CGAL is the state-of-the-art mesh generation tool widely accepted by the community, It is a well-tested and robust library, arid is believed to be the fastest sequential Delaunay mesh generator which operates directly on images. CGAL incorporates many optimizations that speeds up the performance such as: Delaunay hierarchies for locating, ray shooting, occasional switch from the BW kernel to edge flipping, object counters, and so forth. Although these choices improve the

performance on a single core, their parailelization is quite difficult and might not pay off. (For example, CGAL's reference counters render even the read accesses thread unsafe.)

See Fig. IOC. in order for the evaluation to make sense, the sizing parameters of CGAL were set to values that produced meshes of similar size with the present invention. Observe that the resulting meshes are of similar quality and that the single-threaded ΡΪ2Μ is 35% faster than CGAL. Notice, that since our refiner removes points (CGAL performs only insertions for refinement), our actual ^seconc^^s is larger than what it is reported in Figure IOC,

5 The input image used in this study is a 48-tissue RI knee atlas freely available in outside references, Other input images yielded the same results: PI2M generates meshes of similar quality to CGAL's and 30% faster. FIG. 10A illustrates the mesh 180 generated by PI2M in accordance with the present invention, and FIG. 10B illustrates a cross-section 182 thereof. In one embodiment, CRTC (see Table 1 for its specifications) was used for this case study.

10 The role of the Contention Manager (CM) is twofold: (a) to guarantee correctness in a multithreaded algorithm and (b) to reduce the number of rollbacks. A common pitfall of non-blocking parallel algorithms is the lack of liveloek prevention. A iiveiock is a system-wide, global starvation, where there is no progress: threads compute their cavity/hole, lock the corresponding vertices, only to find out that they have to roll back again. Kohout et at. overlook that problem and, for that reason,

I S invalid Deiaunay elements are reported, B landlord et al. solve the problem by bootstrapping: they insert the first 500, 000 vertices sequentially (they do not report the exact number of elements, but based on experience and the fact that these points followed the uniform distribution, there must be more than a million tetrahedra). Similarly, Nave and Chrisochoides construct first an initial mesh of 100,000 tetrahedra sequentially and then the work is distributed among the workers. We believe that 0 even in the beginning (i.e., when there are not many elements present) there is parallelism that can be exploited, Therefore, instead of bootstrapping, we decided to follow a more dynamic (yet simple to implement and with little overhead) contention policy.

Two contention techniques were implemented: a global Contention Manager and a local Contention Manager.

5

A globai-CM

Each threads T_¾ tracks down its progress by computing its corresponding progress ratio ρη defined as: ρ ~ Sops;/Aops„ where Sops, is the number of successful operations and Aopsj is the number of attempted operations, i.e., completed operations plus the number of operations that started 0 but were cancelled because of a rollback, A high value of pr- means that _t does not encounter many rollbacks and thus does useful work. If pr j drops below a threshold pr^", then the execution of T; is suspended (unless ail the other threads have been already suspended). If pr, exceeds a threshold prt , then it randomly wakes an already suspended thread. The awakened thread sets its progress ratio to a predefined value pt. The computation of the progress ratio is performed locally by each thread without synchronization. Although the user is required to specify the values of three parameters (pr , pr, pr⁺), timings were not observed to vary considerably for different triplets. For this study, the triplet is assigned to (0.1. 0.2, 0.9),

Local-CM

One drawback of globai-CM is that there are global variables that are accessed by ail threads. For example, the suspend/awake function is implemented as a global condition variable. Also, the number of "active" threads should be tracked down in order to prevent the suspension of the last active thread. This cannot scale well tor large number of cores, Therefore, global-CM can be replaced by iocahCM: a manager with zero communication and synchronization among threads. Thread T now counts the number of rollbacks ri it encounters that were not interrupted by any successfully operation. That is, if Tj performs a successful operation r, is set to 0, When a rollback^" occurs, T; sleeps for r, x f seconds (f is a parameter). Lastly, the main thread is not allowed to sleep at all. This policy is locally performed by each thread and it also guarantees the absence of liveiocks. The reason is that if all threads get stuck, then they will eventually sleep for a long time, letting at least one thread (the main thread) to do some useful work, In one embodiment setting r to the duration of 5,000 clock cycles yielded the best results.

Comparison

A comparison is presented in Table IV.

TA.BLE iVl Com arison amssg so Co&fcsra i&B Manager (€ )_> glo ¾l-€M,

Is gene ated

For this case study, the final mesh consists of about 80,000 tetrahedra and the number of threads is twelve, Even in this case of a very small mesh (in the literature, speedups are reported for multi-million element meshes), considerable improvements are obtained, Generating small meshes (i.e., highly contented meshes) in real time is an application best suitable to many-core chips; they can drastically benefit from contention managers. This technique speeds up also the generation of larger meshes (i.e., the ease of mor cores), since the reduction of rollbacks implies mor useful FLOPS earlier in the refinement.

The single threaded execution time was ! .19 seconds, in one embodiment. Launching twelve threads without any contention policy caused a Iivelock. Global-CM solved the problem immediately yielding a speedup of 1.9. The 8-fold speed up of local-CM reduced not only the number of rollbacks (as compared to global-CM), but also the number of idle seconds. This improvement may be due to three reasons: (a) there is extra overhead associated to a condition variable call than a simple sleep command, (b) threads are given the chance to sleep less time since the time they go to sleep is a multiple of //sees proportional to their failure rate, and (c) local-CM requires no

communication/synchronization. Other increment ratios for ? (for example exponential counter) did not yield considerably different results.

The strong scaling speedup * ^> and the weak scaling efficiency achieved by

PI2M of the present invention are described herein. The input image we used is a CT abdominal atlas obtained by IRCAD Laparoscopic Center (http://wwvvJrcad.fr/). Load Balancing

Recall that initially there are only six elements in the mesh (initial box elements) which are likely to belong to the main thread's PEL. Thus, load balancing is needed from the beginning of refinement,

Load imbalance is a challenging problem associated with both regular and irregular applications. See for example the work stealing, multi-list, diffusive, gradient, and re-partitioning techniques described in referenced works in U.S. provisional application no. 61/692,538, which is incorporated by reference herein in its entirety. In referenced works, it is shown that these techniques behave very well (some better than others) for unstructured mesh generation, assuming that the tasks are independent. In other referenced works, however, it is also shown that equi-distributing the tasks among processors would not yield good results for mesh generators characterized by intensive intertask dependencies.

For the reasons above, the present invention implements the traditional work stealing (WS) as a base load balancer. Specifically, if the poor element list (TEL) of a thread Tj is empty of elements, Tj writes its ID to the begging list, a global array that tracks down threads without work, A running thread T_j, every time it completes an operation, it gathers the newly created threads and places the ones that are poor to the PEL of the first thread Tj found in the begging list. The classification of whether or not a newly created cell is poor or not is done by Tj. in the lines of FIGS, 1 1A-C, the speedup, the number of QPI accesses, and the percentage of the total number of rollbacks with respect to the total number of attempted operations are plotted, Specifically, FIG. 1 1 A is a graph 190 showing scaling performance on Blacklight achieved by Work Stealing (WS), FIG, i IB is a graph 192 showing scaling performance on Blacklight achieved by Hierarchical Work Stealing (HWS), and FIG, 1 1 C is a graph 1 4 showing scaling performance on Blacklight achieved by HWS with Static image decomposition (HWS+Statie).

Optimizations

In order to decrease the communication overhead associated with remote memory accesses, the present invention implements a Hierarchical Work Stealing scheme (HWS) by taking advantage of the architecture of Blacklight. The present invention can split the begging list in three levels.

Specifically, there is a Level 1 begging list per socket (abbreviated by Bl), a Level2 begging list per blade (abbreviated by B2), and a global LeveB begging list B3 for the whole system. When the poor element list of thread Ti is empty, it throws itself to its Bl . Bl is accessed only by the S cores (i.e., up to 16 threads if hyper-threading is enabled) of a specific socket. The last thread asking for work, however, does not use B l , but instead, it promotes itself to the second level begging list B2. Hence, B2 is accessed by only two threads per blade (one thread per socket) at a given time. Similarly, the second thread of the "B2 group" is promoted to the last level begging list B3, In the worst case, B3 will be accessed by a thread of each blade. A running thread Ti, which completed an operation and gathered the newly poor elements, sends new work to the begging threads giving priority to the lower level begging lists first. Only when Bl and B2 of T_j are empty, Tj accesses the global begging list. See the lines of FIGS, 27A-C, HWS greatly reduces the number of remote accesses and the associated latency, thus achieving a better speed-up than the plain WS.

Notice that threads do not ask for work until they completely run out of bad elements. HWS can be modified such that a thread asks for work, when the number of its bad elements drops below a certain threshold. This change not only did not improve performance, but it actually increased the execution time by 2 to 10 seconds, potentially for two reasons. First, extra synchronization and communication for the Poor Element Lists (PELs) is now introduced: a thread might add elements to another PEL while it is being traversed by thread Tj. Second, the improvement of the time that a thread is waiting (idle time) is diminishing. Before the modification with the threshold, no thread is busy-waiting for more than 1% of the total execution time.

FIG. 12A depicts a communication pattern 200 for Hierarchical Work Stealing (HWS) and FIG. 12B depicts a communication pattern 210 for Hierarchical Work Stealing with Decomposition (HWS+Static). FIG, 12A depicts the cells created by each thread in different shading, including areas where intensive thread communication and synchronization take place, in certain areas, three or even four threads meet. In order to decrease this communication, the image is divided into structured blocks. Each block is assigned to a. specific thread. The goal is to force the threads to operate only on their dedicated block as much as possible. For this reason, if a newly created element t of thread T is classified as poor, T determines which block contains the point that has to be inserted/removed for t, Then T places t to the Poor Element List of the thread of the corresponding block. Clearly, this static image decomposition is possible, only if we are given some extra knowledge about the orientation of the object. n certain respects, this is the best achievable, since the actual polyhedral surface of the represented object is not known a priori, and therefore, other more elaborate partitioning schemes cannot be applied without introducing overheads. Also, note that this decomposition does not replace HWS; rather, it augments it, since again if a thread runs out of work, it takes some from either of the three begging lists. Figure 12B shows that this scheme (referred to as HWS+Static) decreased thread contention. However, its performance is not better than HWS when the core count increases (see the lines of FIGS. 1 1 A~C), This result can be attributed to the fact that with many blades, the communication and synchronization overhead, associated with T_¾ placing t to a remote PEL, dominates. As FIG. 1 IB illustrates, after two blades, the remote accesses of HWS+Static are more than HWS's.

Table V depicts the weak scaling performance for HWS. (HWS is faster than HWS+Static by 1% up to 50%.) The algorithm of the present invention ran on two configurations: one thread per core (non hyper threaded configuration, or n-FIT) and two threads per core (hyper-threaded configuration or HT), Note that the size of the problem increases with respect to the number of threads. Also, the problem size increases slower than it should.

TABU! V? Weak ^callts perfora¾a3ice for HierarMcal ork Staling,

ESkimscy m

{¾ C¾c thre d p«∞rc

Cfe) Two thread per SO S (h ers

For the hyper-threading configuration (HT), the algorithm of the present invention is applied on the same problem sizes, but each core accommodates two threads. Therefore, the HT configuration involves half the blades of the corresponding n-HT configuration, and hence, it is expected to reduce the QPI accesses. However, the 32-thread (1 blade) HT configuration involves 70% more QPI accesses than the corresponding n~HT configuration (2 blades), This is counter- intuitive (since the HT configuration has fewer blades and therefore less communication), but it can be explained by the fact that there are more threads per socket and the possibility to talk to another socket increases, in the case where the QPI accesses do decrease (by 21 % on 256 threads and by 100% on 1 ), the resource sharing still deteriorates performance.

The present invention thereby provides PI2M as a novel parallel Image-to-Mesh (I.2M) Conversion tool. Starting directly from a segmented 3D image, the present invention is able to recover and mesh both the iso-surface of the domain with geometric and topological guarantees and the underlying volume with quality elements.

Th present invention shows excellent strong and weak scaling performance given the dynamic nature of its application. Specifically, the present invention observes a 91% strong sealing efficiency and a s perlinear weak scaling efficiency for up to 32 cores (2 nodes connected via network switch). For a higher number of nodes, the overhead of the irregular thread communication dominates deteriorating performance. The present invention shows that 28% (16 threads) up to 33% (128 threads) of the total number of poor elements appeared in the mesh were not refined by the same threads that created them. Nevertheless, the present invention still obtains a 71 % weak scaling performance on 64 cores with the non-hyper threaded Hierarchical Work Stealing configuration.

Extensive case studies show that 3D Delaunay meshing involves three times more computation than 2D on the average, it is three times more memory intensive (pointer chasing), and if the point removals are taken into account, eighteen times more. This difference renders some optimizations used in 2D (for example vector re-use) redundant in the 3D setting and the need for decrease in communication more critical. In a multi-threaded environment, the present invention provides evidence that: the use of custom memory managers per thread, the decrease of the ceil and vertex size (member packing), and the reduced synchronization of the threads' Poor Element Lists (local PEL ceil removal) yielded a 46% total reduction of TLB and LLC misses, and as a result a 30% total reduction of the execution time. For large core counts, the present invention shows thai:

replicating structures across sockets and minimizing remote memory re-use (lightweight pools) achieved a 99.3% total reduction of Intel QPI accesses, and as a result a 73% total reduction of the execution time. The QPI remote accesses are also decreased by up to 60% when we used the Hierarchical Work Stealing (HWS) approach, an approach that takes advantage of the unique NUMA architecture of Biacklight Moreover, HWS decreased the number of rollbacks by up to 63%, reducing in this way the communication between threads. Great decrease in the number of rollbacks is also achieved by the present invention's specially designed Contention Managers (CM), The local-CM can reduce the rollbacks by 85%, and the idle time by 96%o, making in this way full use of the dissipated power.

It is worth noting that the ΡΓ2Μ according to the present invention exhibits excellent single- threaded performance. Despite the extra overhead associated with synchronization, contention management, and load balancing, the present invention generates meshes 30% faster than CGAL and with similar quality.

The present invention thus provides a flexible 12M infrastructure able to accept any refinement strategies with minimal code change.

Section; 3

An Example of a Parallel Delaunay Refinement for Smooth Surfaces

As explained herein, before the mesh generation starts, the Euclidean Distance Transform (EDT) of the image is needed for the on-the-fly computation of the appropriate isosurfaee vertices. For this pre-processing step, one may make use of the publicly available parallel aurer filter. It can be shown that this parallel EDT scales linearly with the respect to the number of threads.

The rest of this section describes the main aspects of our parallel code. Algorithm 1 shown at 290 in Fig. 22 illustrates the basic building blocks of our multi-threaded mesh-generation design. Note thai the tightly-coupled parallelization does not alter the fidelity (Theorem 1) and the quality guarantees described in the previous section,

Poor Element List (PEL)

Each thread '/'.^■ may maintain its own Poor Element List (PEL) PEL_/. PEL, contains the ietrahedra that violate the Refinement Rules and need to be refined by thread Γ, accordingly.

Operation

An operation that refines an element can be either an insertion of a point p or the removal of a vertex p. in the case of insertion, the cavity C (p) needs to be found and re-triangulated according to the well known Bowyer-Watson kernel. Specifically, C (p) consists of the elements whose circu nsphere contains p. These elements are deleted (because they violate the Delaunay property) and p is connected to the vertices of the boundary of C (p). In the case of a removal, the ball B (p) needs to be re-triangulated. This is a more challenging operation than insertion, because the re- triangulation of the ball in degenerate cases is not unique which implies the creation of illegal elements, i.e., elements that cannot be connected with the corresponding elements outside the ball. We overcome this difficulty by computing a local Delaunay triangulation ¾_ίρ·> (or D_B for brevity) of the vertices incident to p, such thai the vertices inserted earlier in the shared triangulation are inserted into D_B first. In order to avoid races associated with writing, reading, and deleting vertices/cells from a PEL or the shared mesh, any vertex touched during the operation of cavity expansion, or ball filling needs to be locked. We utilize GCC's atomic built-in functions for this goal, since they perform faster than the conventional pthread try jocks. Indeed, replacing p thread locks (our first implementation) with GCC's atomic buiit-ins (current implementation) decreased the execution time by 3.6% on 1 core and by 4.2% on 12 cores.

in the case a vertex is already locked by another thread, then we have a rollback: the operation is stopped and the changes are discarded. When a rollback occurs, the thread moves on to the next bad element in its PEL,

Update new and deleted cells

After a thread Γ, completes an operation, ne cells are created and some cells are invalidated. The new cells are those that re-triangulate the cavity (in case of an insertion) or the ball (in case of a removal) of a point p and the invalidated cells are those that used to form the cavity or the ball of p right before the operation, Γ, determines whether a newly created element violates a rule. If it does, then Tj pushes it back to PBLi (or to another thread's PEL, see below) for future refinement. Also, T, removes the invalidated elements from the PEL they have been residing in so far, which might be the PEL, of another thread. To decrease the synchronization involved for the concurrent access to the PELs, if the invalidated cell c resides in another thread Tj 's PEL,. , then Tj removes c from PELj only if T- belongs to the same socket with Τ,· . Otherwise, T_f raises cell c's invalidation flag, so that J) can remove it when 7} examines c. Load Balancer

Right after the triangulation of the virtual box and the sequential creation of the first 6 tetrahedra, only the main thread might have a non-empty PEL. Clearly, Load Balancing is a fundamental aspect of our implementation. A base (i.e., not optimized) Load Balancer is the classic Random Work Stealing (RHW) technique, since it best fits our implementation design. An optimized work stealing b lancer has been implemented that takes advantage of the NUMA architecture and achieves an excellent performance.

If the poor element list PEL, of a thread Γ, is empty of elements, 7- "pushes back" its ID to the Begging List, a global array that tracks down threads without work, Then, T,- is busy-waiting and can be awaken by a thread 7} right after T gives some work to T, . A running thread 7} , ever}- time it completes an operation (i.e., a Delaunay insertion or a Delaunay removal), it gathers the newly created elements and places the ones that, are poor to the PEL of the first thread T> found in the begging list. The classification of whether or not a newly created cell is poor or not is done by T. . Ί) also removes Ί from the Begging List.

To decrease unnecessary communication, a thread is not allowed to give work to threads, if it does not have enough poor elements in its PEL. Hence, each thread 7\- maintains a counter that keeps track of all the poor and valid cells that reside in PEL; , T_t is forbidden to give work to a thread, if the counter is less than a threshold. We set that threshold equal to 5, since it yielded the best results. When Tj invalidates an element c or when it makes a poor element c not to be poor anymore, it decreases accordingly the counter of the thread that contains c in its PEL. Similarly, when T_:, gives extra poor elements to a thread, Τ₍· increases the counter of the corresponding thread.

Contention Manager (CM) In order to eliminate livelocks caused by repeated rollbacks, threads talk to a Contention Manager (CM), its purpose is to pause on run-time the execution of some threads making sure that at least one will do useful work so thai system throughput can never get stuck. See the description elsewhere herein for approaches able to greatly reduce the number of rollbacks and yield a considerable speedup, even in the absence of enough parallelism, Contention managers avoid energy waste because of rollbacks and reduce dynamic power consumption, by throttling the number of threads that contend, thereby providing an opportunity for the runtime system to place some cores in deep Sow power states. Contention Manager

The goal of the Contention Manager (CM) is to reduce the number of rollbacks and guarantee the absence of livelocks, if possible.

In this embodiment, four contention techniques were implemented: the Aggressive

Contention Manager (Aggressive-CM), the Random Contention Manager (Random-CM), the Global Contention Manager (Global-CM), and the Local Contention Manager (Locai-CM).

The Aggressive-CM and Random-CM are non-blocking schemes. As is usually the case for non-blocking schemes, the absence of livelocks was not proven for these techniques. Nevertheless, they are useful for comparison purposes as Aggressive-CM is the simplest to implement, and

Random-CM has already been presented in the mesh generation literature.

The Global-CM is a blocking scheme and it is proven that it does not introduce any deadlock.

(Blocking schemes are guaranteed not to introduce livelocks).

The last one, Local-CM, is semi-hiocking, that is, it has both blocking and non-blocking parts. Because of its (partial) non-blocking nature, it was found difficult to prove starvation-freedom, but it could guarantee absence of deadlocks and livelocks. It should be noted, however, that the inventors have never experienced any thread starvation when using Locai-CM: all threads in all case studies made progress concurrently for about the same period of time.

Note that none of the earlier Transactional Memory techniques and the Random Contention Managers presented in the past solve the liveloek problem. In this section, it is shown that if livelocks are not provably eliminated in our application, then termination is compromised on high core counts.

For the next of this Section assume that (without loss of generality) each thread always finds elements to refine in its Poor Element List (PEL). This assumption simplifies the presentation of this Section, since it hides several details that are mainly related to Load Balancing, The interaction between the Load Balancing and the Contention Manager techniques does not invalidate the proofs of this Section,

Aggressive-CM

Th Aggressive-CM is a brute-force technique, since there is no special treatment. Threads greedily attempt to apply the operation, and in case of a rollback, they just discard the changes, and move on to the next poor element to refine (if there is any). The purpose of this technique is to show that reducing the number of rollbacks is not just a matter of performance, but a matter of correctness. Indeed, experimental evaluation (see discussion elsewhere herein) shows that Aggressive-CM very often suffers from livelocks.

Random-CM

Random-CM has already been presented (with minor differences) in the literature and worked fairly well, i.e, no livelocks were observed in practice. This scheme lets "randomness" choose the execution scenario that would eliminate livelocks. We implement this technique as well to show that our application needs considerably more elaborate CMs. ndeed, recall that in this case, there is no much parallelism in the beginning of refinement and therefore, there is no much randomness that can be used to break the liveiock,

Each thread T₍ counts the number of consecutive rollbacks r,_; . Ifr_£ exceeds a specified upper value r⁺, then 7^*,· sleeps for a random time interval t_t . If the consecutive rollbacks break because an operation was successfully finished then r< is reset to 0. The time interval fcj is in milliseconds and is a randomly generated number between 1 and r⁺ . The value of r⁺ is set to 5, Other values yielded similar results. Note tha lower values for r⁺ do not necessarily imply faster executions. A low r^* decreases the number of rollbacks much more, but increases the number of times that a contented thread goes to sleep (for f,_: milliseconds). On the other hand, a high r* increases the number of rollbacks, but randomness is given more chance to avoid livelocks; that is, a contented thread has now more chances to find other elements to refine before it goes to sleep (for t_t milliseconds),

Random-CM cannot guarantee the absence of livelocks. This randomness can rarely lead to livelocks, but it should be rejected as si is not a valid solution. It was also experimentally verified that livelocks are not that rare (see discussion elsewhere herein).

GIoba!-CM G3obal-CM maintains a global Contention List (CL), If a thread T_t en-counters a rollback, then it writes its id in CL and it busy waits (i.e., it blocks). Threads waiting in CL are potentially awaken (in FIFO order) by threads that ha ve made a lot of progress, or in other words, by threads that have not recently encountered many rollbacks. Therefore, each thread Tjcomputes its

"progress" by counting how many consecutive successful operations s- have been performed without an interruption by a rollbacks. If exceeds a upper value s⁺, then T_t awakes the first, thread in CL, if any. The value for s⁺ is set to 10. Experimentally, this was found to value yield the best results,

Glohal-CM can never create livelocks, because it is a blocking mechanism as opposed to randorn-CM which does not block any thread. Nevertheless, the system might end up to a deadlock, because of the interaction with the Load Balancing's Begging List BL (see the Load Balancer discussion above).

Therefore, at any time, the number of active thread needs to be tracked down, that is, the number of threads that do not busy jvvait in either the CL or the Begging List. A thread is forbidden to enter CL and busy__wait, if it sees that there is only one (i.e., itself) active thread: instead, it skips CL and attempts to refine the next element in its Poor Element List, Similarly, a thread about to enter the Begging List (because it has no work to do) checks whether or not it is the only active thread at this moment, in which case, it awakes a thread from the CL, before it starts idling for extra work. In this simple way, the absence of livelocks and deadlocks are guaranteed, since threads aiways block in case of a rollback and there will always be at least one active thread. The disadvantage of this method is that there is global communication and synchronization: the CL,, and the number of active threads are giobai structures/variables that are accessed by all threads.

Locai-CM

The local Contention Manager (iocai-CM) distributes the previously global Contention List (CL) across threads. The Contention List CL, ofa thread _/ contains the ids of threads that encountered a rollback because of 7,· (i.e, they attempted to acquire a vertex already acquired by 7,·) and now they busy wait, As above, if 7,· is doing a lot of progress, i.e., the number of consecutive successful operations exceed s+ , then _/ awakes one thread from its local CL, .

Extra care should be taken, however, to guarantee not only the absence of livelocks, but also, the absence of deadlocks, It is possible that T) encounters a rollback because of T₂ (and we symbolize this relationship by writing Ί → T₂ and T₂ encounters a rollback because of Ti (i.e., T₂ ---> Ti): both threads write their ids to the other thread's CL, and no one else can wake them up. Clearly, this dependency cycle (Tj — * T2 ^→ Tj ) leads T; and T2 to a deadlock, because under no circumstances will these threads ever be awaken again,

To solve these issues, each thread is now equipped with two extra variables: conflicting_id and busy_wait. See Figure _ for a detailed pseudo-code of local-C .

The algorithm illustrated at 295 in Fig, 23 is called by a Γ,· every time it does not finish the operation successfully (i.e., it encounters a- rollback). Suppose attempts to acquire a vertex already locked by 7} (T_* → 7} ). in this case, Γ; does not complete the operation, but rather, it rollbacks by disregarding the so far changes, unlocking ail the associated vertices, and finally executing the Rollback_Occured function, with confIicting_icl equal to/, in other words, the conflicting id variables represent dependencies among threads; T_t → 7} xonflictmg_id =/.

For example, if 7\· encounters a rollback because of 7} and 7} encounters a rollback because of 7 _¾ , then the dependency path from Γ, is 7_t → 7} → T_k , which corresponds to the following values: 7^, ₍.conflicting_id =/, 7}.conflict3ng_id = k, T_k xonflicting_id = ^~1 (where -1 denotes the absence of dependency).

Lines 4-14 of Rollback Occurred decide whether or not 7- .should block (via busy-waiting).

7'i is not allowed to bioek if T_C011fij_Cfj_ngJl has already decided to block (Lines 6- 10). Threads communicate their decision to block by setting their busy_wait flags to true. If T_C£iniii«mgjd .busyjwait has already been set to true, it is imperative that Ti is not allowed to block, because it might be the case that the dependency of T, forms a cycle. By not letting Γ, to block, the dependency cycle "breaks". Otherwise, Ti writes its id to CUon ctingjd (Lines 1 5-17) and loops around its busy _wait. flag (Line 18).

The algorithm at 297 in Fig. 23 is called by a T every time it completes an operation, i.e., ever}- time T_t does not encounter a rollback. If T, has done a lot of progress (Lines 2-5 of

Rollback __Not_Oceurred), then it awakes a thread 7} from its Contention List CL, by setting 7} 's busy_wait flag to false. Therefore, 7} escapes from the loop of Line 18 in Rollback_Occurred and is free to attempt the next operation.

Fig, 24 shows an illustration 300 of possible execution scenarios for local-CM during six Time Steps. Below, is described in detail what might happen in each step: · Time Step 1 : Ail four threads are making progress without any roll-backs.

Time Step 2: Tt and T₄ attempted to acquire a vertex already owned by T₂ . Both Γ/ and TV call the code of Figure . Their conf!icting_id variables represent those exact dependencies (Line 3 of Rollback J3ccurred).

* Time Step 3: Ί) and T₄ set their busy_wait flag to true (Line 12 of Rollback Occurred), they write their ids to CL₂ (Lines 15-17), and they block via a husy__wait (Line 18),

* Time Step 4: T₂ has done lots of progress and executes the Lines 6-9 of

Roilback_Not_Occurred, awaking in this way 7V.

* Time Step 5: A dependency cycle is formed: Γ? →T^ — * 7 ^→T2 . Lines 4-14 of Rollback Occurred will determine which threads block and which ones do not, Note that the mutex locking of Lines 4-5 cannot be executed at the same time by these 3 threads. Only one thread can enter its critical section (Lines 6-14) at a time.

·· Time Step 6: Here it is shown that T₄ executed its critical section first, T₂ executed its critical section second, and T_$ was last. Therefore, TV and T? block, since the condition in Line 6 was false: their conflicting threads at that time had not set their busyjwait to true, The last thread Ί realized that its conflicting thread T₄ has already decided to block, and therefore, 2"_? returns at Line 10, without blocking.

Note that in Time Step 6, T₂ blocks without awaking the threads in its CL, and that is why both CLj and CL₃ are not empty. It might be tempting to instruct a thread 7} to awake all the threads in CL,- , when T_s is about to block. This could clearly expedite things, Nevertheless, such an approach could easily cause a live!ock as shown in illustration 215 of Fig, 13.

Local-CM is substantially more complex than global-CM, and the deadlock- free/live lock-free guarantees are not very intuitive. As described elsewhere herein, it is proved that locai-CM indeed can never introduce deadlocks or livelocks.

As shown in Fig. 13, a thread is about to busy-wait on another thread's Contention List (CL) should not awake the threads already in its own CL. Otherwise, a livelock might happen, as illustrated in this Figure. Time Step 8 leads the system to the same situation of Time Step 1 : this can be taking place for an undefined period of time with none of the threads making progress.

The following two Remarks follow directly from the definition of deadlock and livelock.

Remark 1. If a deadlock arises, then there has to be a dependency cycle where all the participant threads block. Only then these blocked threads will never be awaken again. Remark 2. If a Hvelock arises, then there has to be a dependency cycle where all the participant threads are not blocked, Since ail the participant threads break the cycle without making any progress, this "cycle breaking" might be happening indefinitely without completing any operations, In the only case where the rest threads of the system are blocked waiting on these participant threads' Contention Lists (or all the system's threads participate in such a cycle), then system-wide progress is indefinitely postponed.

The next Lemmas prove that in a dependency cycle, at least one thread will block and at ieast one thread will not block, This is enough to prove absence of deadlocks and Iivelocks.

Lemma 1 (Absence of deadlocks). In a dependency cycle at Ieast one thread will not block.

Proof. For the sake_of contradiction, assume that th¾ threads, T^- T_t 7 participate SE3a cycle, that is.T; —> Ti · ' ·→Ti -→T, , such that ail threads block, This means that ail threads evaluated Line 6 of illustration 295 in Fig. 23 to false, Therefore, since 's conflicting id is , right before ?_i decides to block (Line 12), T_{ 's busy_, wait flag was false. The same argument applies for ail the pairs of consecutive threads: {T , 7 _i }, , T_hi }, . . . , {T_ini , 7 }. But 7_ini could not have evaluated Line 6 to false, because, by our assumption, 7_t had already decided to block and 7

.busy wait had been already set to true when acquired T s mutex. A contradiction: T_lni returns from Rollback . Occurred without blocking.

Lemma 2 (Absence of livelocks). In a dependency cycle at least one thread will block, Proof For the sake of contradiction, assume that the threads 7_i , 7_i? , . . . , _ini participate in a cycle, that is, ₍ →T_l ~^~> · · · ^→7_ini ^→7ι , such that ail threads do not block. This means that all threads evaluated Line 6 of Figure to true. Consider for example 7_t . When _it acquired 7j,'s mutex, it evaluated Line 6 to true, That means that T_t had already acquired and released its mutex having executed Line 12: a contradiction because 7_¾, blocks.

Comparison For this case study, each CM on the CT abdominal atlas of IRCAD Laparoscopic Center was evaluated using 128 and 256 Blacklight cores (see Table 2 for its specification). The final mesh consists of about 150 X 106 tetrahedra, The single-threaded execution time on Blacklight was 1 ,080 seconds. .

There are three direct sources of wasted cycles in our algorithm, and all of them are shown in Table illustrated at 220 of Fig. 14.

* contention overhead time: it is the total time that threads spent busy- waiting on a Contention

List (or busy-waiting for a random number of seconds as is the ease of Random-CM) and accessing the Contention List (in case of Global-CM),

« load balance overhead time: it is the total time that threads spent busy-waiting on the

Begging List waiting tor more work to arrive (see discussion elsewhere herein) and accessing the Begging List, and

® rollback overhead time: it is the total time that threads had spent for the partial completion of an operation right before they decided that they had to discard the changes and roil back.

Observe that Aggressive-CM was stuck in a Hveloek on both 128 and 256 cores. It is known for sure that these were livelocks because it was found out that no tetrahedron was refined, i.e., no thread actually made any progress, in the time period of an hour.

Random-CM terminated successfully on 128 cores, but it was very slow compared to Global- CM and Local-CM. indeed, Random-CM exhibits a large number of rollbacks that directly increases both the contention overhead and the rollback overhead. Also, since threads' progress is much slower, threads wait for extra work for much longer, a fact that also increases the load balance overhead considerably. As explained above, Random-CM does not eliminate livelocks, and this is manifested on the 256 core experiment, where a liveloek did occur.

On both 128 and 256 cores, Loeal-CM performed better, indeed, observe that the total overhead time is approximately twice as small as Global-CM^* s overhead time, This is mainly due to the little contention overhead achieved by Locai-CM. Since Giobal-CM maintains a global Contention List, a thread 7)· waits for more time before it gets awaken from another thread for two reasons: (a) because there are more threads in front of T- that need to be awaken first, and (b) because the

Contention List and the number of active threads are accessed by ail threads which causes longer communication latencies. Although Loeal-CM is the fastest scheme, observe that it introduces higher number of rollbacks on 256 cores than Giobal~CM. This aiso justifies the increased rollback overhead (see lower portion of table illustrated at 220 in Fig. 14. in other words, fewer rollbacks do not always imply taster executions, a fact that renders the optimization the application a challenging task. This result can be explained by the following observation: the number of rollbacks (and subsequently, the rollback overhead) and the contention overhead constitute a tradeoff. The more a thread waits in a Contention List, the more its contention overhead is, but the fewer the rollbacks it encounters are, since it does not attempt to perform any operation. Conversely, the less a thread waits in a Contention List, the less its contention overhead is, but since it is given more chances to apply an operation, it might encounter more rollbacks. Nevertheless, the table illustrated at 220 in Fig. 14 suggests that. Local -CM does a very good job balancing this tradeoff on runtime.

Although there are other elaborate and hybrid contention techniques, none of them guarantees the absence of Hvelocks. Therefore, Local-CM was chosen because of its efficiency and correctness.

Performance

A load balancing optimization and the strong and weak scaling performance on Blaeklight arc described as follows. See the table illustrated at 225 in Fig. 15 for its specifications.

Hierarchical Work Stealing (HWS)

In order to further decrease the communication overhead associated with remote memory accesses, a Hierarchical Work Stealing scheme (HWS) was implemented by taking advantage of the cc-NUMA architecture.

The Begging List was re-organized into three levels: BLl , BL2, and BL3. Threads of a single socket that run out of work place themselves into the first level begging list BLl which is shared among threads of a single socket. If the thread realizes that all the other socket threads wait on BLl , it skips BLL and places itself to BL2, which is shared among threads of a single blade. Similarly, if the thread realizes that BL2 already accommodates a thread from the other socket in its blade, it asks work by placing itself into the last level begging list BL3. When a thread completes an operation and is about to send extra work to an idle thread, if gives priority to BLl threads first, then to BL2, and lastly to BL3 threads, in other words, BLl is shared among the threads of a single socket and is able to accommodate up to number of threads per socket - 1 idle threads (in Blaeklight, that is ? threads). BL2 is shared among the sockets of a single blade and is able to accommodate up to number of sockets per blade - 1 idle threads (in Biackiight, that is 1 thread). Lastly, BL3 is shared among all the allocated blades and can accommodate at most one thread per blade. In this way. an idle thread tends to take work first from threads inside its socket, if there is none, T- takes work from a thread of the other socket inside its blade, if any. Finally, if all the other threads inside Zj^'s blade are idling for extra work, Ί) places its id to BL3, asking for work from a thread of another blade,

Figs. 16A-16C show graphical illustrations 230, 235 and 240, respectively, of the strong sealing experiment demonstrating both the Random Work Stealing (RWS) load balance and the Hierarchical Work Stealing (HWS). The input image we used is the CT abdominal atlas obtained from IR CAD Laparoscopic Center. The final mesh generated consists of 124 X 10" elements. On a single Biackiight core, the execution time was 1 100 seconds.

Observe that the speed-up of RWS deteriorates by a lot for more than 64 cores (see the green line in diagram 230 of Fig. 16A). In contrast. HWS manages to achieve a (slight) improvement even on 1 76 cores. This could be attributed to the fact that the number of inter-blade (i.e., remote) accesses are greatly reduced by HWS (see diagram 235 of Fig. 16B), since begging threads are more likely to get poor elements created by threads of their own socket and blade first. Clearly, this reduces the communication involved when a thread reads memory residing in a remote memory bank. Indeed, on 176 cores, 98,9% of all the number of times threads asked for work, they received it from a thread of their own blade, yielding a 28.8% reduction in inter-blade accesses, as diagram 235 in Fig. 16B shows,

Fig. 16C shows at 240 the breakdown of the overhead time per thread for H WS across runs.

Note that since this is a strong scaling case study, the ideal behavior is a linear increase of the height of the bars with the respect to the number of threads. Observe, however, that the overhead time pet- thread is always below the overhead time measured on 16 threads. This means that Locai-CM and the Hierarchical Work Stealing method (HWS) are able to serve threads fast and tolerate congestion efficiently on runtime.

Weak Scaling Results

In this section is presented the weak scaling performance of ΡΪ2Μ on the CT abdominal atlas and the knee atlas. Other inputs exhibit very similar results on comparable mesh sizes.

The number of tetrahedra created per second across the runs was measured. Specifically, define Elements (n) and Time («), the number of elements created and the time elapsed, when n threads are employed. Then, the speedup is defined as ^e~ ^Tim-?-^- With n threads, a perfect speedup would be equal to n.

One can directly control the size of the problem (i.e., the number of generated tetrahedra) via the parameter δ (see discussion elsewhere herein). This parameter sets an upper limit on the volume of the tetrahedra generated. With a simple volume argument, one can show that a decrease of 6 by a factor of x results in an x" times increase of the mesh size, approximately.

See the table illustrated at 245 in Fig, 17. Each reported Time is computed as the average among three runs. Although the standard deviation for up to 128 cores is practically zero on both inputs, the same does not apply for higher core counts. Indeed, the standard deviation on the 144-, 160-, and 176-core executions is about 10, 15, and 29 seconds respectively, for both inputs. This behavior is attributed to the fact that in those experiments, the network switches responsible for the cache coherency were close to the root of the fat-tree topology and therefore, they were shared among many more users, affecting in this way the timings of our application considerably. (Note that the increased bandwidth of the upper level switches does not alleviate this problem, since the bottleneck of our application is latency.) This conjecture agrees with the fact that the maximum number of hops on the experiments for up to 128 cores was 3, while for 144, 160 and 176 cores, this number became 5.

Nevertheless, observe the excellent speedups for up to 128 threads. On 144 cores, an unprecedented efficiency of more than 82% was achieved, and a rate of more than 14.3 Million Elements per second for both inputs, it is worth mentioning that CGAL, the fastest sequential publicly available isosurface-based mesh generation tool, on the same CT image input, is 81 % slower than our single-threaded performance, indeed, CGAL took 548/21 seconds to generate a similarly- sized mesh (1 ,00 X 107 tetrahedra) with comparable quality and fidelity to ours (see discussion elsewhere herein for a more thorough comparison case study). Thus, compared to CGAL, the speedup we achieve on 144 cores is 751 ,25.

Observe, however, that this performance may deteriorate beyond this core count. It is believed that a main reason of this degradation is not the overhead cycles spent on rollbacks, contention lists, and begging lists (see discussion elsewhere herein), but the congested network responsible for the communication.

As may be seen in diagram 250 of Fig, 18, the overhead time breakdown with respect to the wall time for the experiment on 176 cores of the table illustrated at the top of diagram 245 in Fig. 17. A pair (x, y) tells us that up to the x^ih second of execution, threads have not been doing useful work so far for y seconds all together. First, note that the total overhead time per thread increases. Since this is a weak scaling case study, the best that can happen is a constant number of overhead seconds per thread. But this is not happening. The reason is that in the beginning of refinement, the mesh is practically empty; only the six tetrahedra needed to fill the virtual box are present (see elements 275, 280 and 285 of Figs. 19, 20, and 21 , respectively). Therefore, during the early stages of refinement, the problem does not behave as a weak scaling ease study, but as a strong scaling one; more threads, but in fact the same size, which renders our application a very challenging problem. See Fig. 18 for an illustration 250 of the 176-core experiment of table 245 in Fig. 17. The X-axis shows the wall-time clock of the execution, The Y-axis shows the total number of seconds that threads have spent on useless computation (i.e., rollback, contention, and load balance overhead, as described elsewhere herein) so far, cumulatively, The more straight the lines are, the more useful work the threads perform.

Rapidly growing lines imply lack of parallelism and intense contention, Observe that in the first 14 seconds of refinement (Phase 1 ), there is high contention and severe load imbalance. Nevertheless, even in this case, « 73%of the time, ail 1 76 threads were doing useful work, i.e., the la

threads were working on their full capacity.

However, this overhead time increase cannot explain the performance deterioration. See for example the numbers on 176 threads of the table 245 illustrated in Fig. 17. 176 threads run for 181.10s each, and, on the average, they do useless work for 10.55s each, in other words, if there were no rollbacks, no contention list overhead, and no load balancing overhead, the execution time would have to be 181.10s-i0.55s =170.55s. 170.55s, however, is far from the ideal 90,37s (that the first column with 1 thread shows) by 1 0.55s-90.37=80.18s. Therefore, while rollbacks, contention management, and load balancing introduce a merely 10.55s overhead, the real bottleneck is the 80.18s overhead spent on memory (often remote) loads/stores, indeed, since the problem size in- creases linearly with respect to the number of threads, either the communication traffic per network switch increases across runs, or it goes through a higher number of hops (each of which adds a 2,000 cycle latency penalty), or both, it seems that after 144 cores, this pressure on the switches slows performance down. A hybrid approach able to scale for larger network hierarchies is left for future work.

Hyper-threading

The table illustrated at 305 in Fig. 25 shows the performance achieved by the hyper-threaded version of our code. For this case study, we used the same input and parameters as the ones used in the experiment the Table illustrated at 245 in Fig, 17, The only difference is that now there are twice as many threads as there were in the Table illustrated at 245 in Fig. 17,

Since the hardware threads share the TLB, the cache hierarchy, and the pipeline, we report the impact of hyper-threading on TLB misses, Last Level Cache (LLC) misses, and Resource stall cycles. Specifically, we report the increase of those counters relatively to the non hyper-threaded experiment of the Table illustrated at 245 in Fig. 17. The reported Speedup is also relative to the non byper- threaded experiment.

The last three rows of Che Table illustrated at 305 in Fig. 25 suggest that the hyper-threaded version utilized the core resources more efficiently. Surprisingly enough, the TLB and LLC misses actually decrease (notice the negative sign in front of the percent- ages) when two hardware threads are launched per core. Also, as expected, the pipeline in the hyper-threaded version is busier executing micro-ops, as the decrease of resource stall cycles suggest.

Although hyper-threading achieves a better utilization of the TLB, LLC, and pipeline, there is a considerable slowdown after 64 cores (i.e., 128 hard- ware threads). Observe that hyper-threading expedited the execution for up to 64 cores, indeed, the hyper-threaded version is 47% faster on 64 cores compared to the non hyper-threaded version, Beyond this point, however, there is a considerable slowdown. This slowdown cannot be explained by merely the increase in the number of overhead seconds.

See for example the overhead sees per thread on 176 cores in the Table illustrated at. 305 in Fig. 25, It is indeed 13 times higher than its non hyper-threaded counterpart.: this is, however, expected because the size of the problem is the same but now we use twice as many hardware threads as before. If we subtract the overhead time of the hyper-threaded version on 176 cores, we get that for 480.83s - 143,37s = 337.46s, all hardware threads were doing useful work. But this is still way longer than the 181.10s - 10.55s = 1 0.55s useful seconds of the non hyper-threaded execution (see Table at 245 in Fig, 17).

We attribute this behavior to the increased communication traffic caused not by the increased problem size (as was mostly the case in the non hyper- threaded version), but by the increased number of "senders" and "receivers". That is, even though the problem size is the same, the hyper-threaded version utilizes more threads. This means that at a given moment, there will be more packages (originated by the more than before threads) in the switches waiting to be routed than before. This phenomeno increases the communication latency, It seems that the network cannot handle this pressure for more than 64 cores, or equivalently, 128 hardware threads. Note that this agrees with the fact that in the non hyper-threaded version, the slowdown occurred on more than 128 cores, which is again 128 threads (see the Table illustrated at 245 in Fig. 17).

Single-threaded evaluation

In this section, it is shown that the single-threaded execution time of thismethod (ΡΪ2Μ), although it introduces extra overhead due to locking, synchronization, contention management bookkeeping (see discussion elsewhere herein), and hierarchical load balance (see discussion elsewhere herein), it is faster than CGAL and 1 etGen, the state-of-the-art open source mesh generation tools, Moreover, PI2M has comparable quality with CGAL and much better quality than TetGen. PI2M, CGAL, and TetGe are very robust Delaunay methods, since they all use exact predicates. Specifically, PI2M adopts the exact predicates as implemented in CGAL,

It should be mentioned that although CGAL is able to operate directly on segmented multi- tissue images (i.e., it is an isosurface-based method), TetGen is a PLC-based method (see discussion elsewhere herein). That is, TetGen' s inputs are triangulated domains that separate the different tissues. For this reason, pass to TetGen the triangulated isosurfaees as recovered by this method, and then let TetGen to fill the underlying volume.

PI2M, CGAL, and TetGen were run on two different multi-tissue 3D input images obtained from a Hospital Surgical Planning Laboratory. The first is an MR knee-atlas and the second is a CT head-neck atlas, information about these two inputs is displayed in the table illustrated at 310 in Fig. 26, The resulting output meshes generated by our method PI2M are illustrated in diagrams 315 and 320 of Fig. 27.

The table il!ustrated at 325 in Fig, 28 shows timings and quality statistics for P.12M, CGAL, and TetGen. CRTC (see table illustrated at 225 in Fig, 15 for its specifications) was used for this case study. The timings reported account for everything but for disk 10 operations. The execution time reported for PI2M incorporates the 1.9 seconds and 1 ,2 seconds time interval needed for the computation of the Euclidean distance transform (see discussion elsewhere herein) for the knee atlas and the head-neck atlas, respectively.

The sizing parameters of CGAL and TetGen were set to values that produced meshes of similar size to those here, since generally, meshes with more elements exhibit better quality and fidelity, The achieved quality of these methods were assessed in terms of radius-edge ratio and dihedral angles. Those metrics are of importance, as they are shown to improve the speed and robustness of medical application solvers dealing with isotropic materials. Ideally, the radius-edge ratio should be low, the minimum dihedral angle should he large, and the maximum dihedral angle should be low. The smallest boundary planar angles are also reported. This measures the quality of the mesh boundary. Large smallest boundary planar angles imply belter boundary quality,

PI2M, CGAL, and TetGen allow users to specify the target radius-edge ratio. Apart from TetGen, these methods also allow users to specify' the target boundary planar angles. The corresponding parameters were set accordingly, so that the maximum radius-edge ratio is 2 (for PI2M, CGAL, and TetGen), and the smallest boundary planar angle is more than 30° (for PI2M and CGAL only, since TetGen does not give this parameter).

Fidelity measures how well the mesh boundary represents the isosurfaces. The fidelity achieved by these methods is assessed in terms of the symmetric (double-sided) Hausdorff distance. A low Hausdorff distance implies a good representation. The Hausdorff distance for TetGen is not reported since is the triangular mesh that represents the isosurfaces is given to it as an input.

The speed of the methods above was assessed by comparing the rate of generated tetrahedra per second. Note that since our method not only inserts but also removes points from the mesh (thus reducing the number of mesh elements), a perhaps fairer way to assess speed is to compare the rate of performed operations per second. Nevertheless, this metric is note reported for two reasons. First, a high rate of operations does not always imply a high rate of generated tetrahedra. The later, however, is the only thing that matters, since comparing the quality/fidelity achieved by meshes of very different mesh sizes makes no sense. Second, the number of removals performed by ΡΓ2 accounts for only 2% over the total number of operations. Thus, the rate of generated tetrahedra is very close the rate of operations per second; indeed, and it was experimentally found out that those two rates are practically the same.

Observe that the P12M and CGAL generate meshes of similar dihedral angles, and fidelity, but the present method is much faster, Indeed, the rate of the single-threaded PI2M is 68,7% higher than CGAL on the knee atlas and more than 3 times higher on the head-neck atlas, Also note that both PI2M and CGAL prove that the smallest boundary planar angles are more than 30° and that radius- edge ratio is less than 2. Due to numerical errors, however, these bounds might be smaller in practice than what theory suggests. Nevertheless, observe that PI2M yields much better boundary planar angles and radius-edge ratio than CGAL on the head-neck atlas.

TetGen is faster than PI2M only on the knee atlas by a couple of seconds. For larger meshes (as is the case with the head-neck atlas), TetGen is slower. Indeed, for small meshes, the computation of the Euclidean Distance Transform (EOT) accounts for a considerable percentage over the total execution time, a fact that slows down the overall execution time by a lot, For example, the actual meshing time on the knee atlas was just 4.6 sees, very close to TetGen's time and rate. Another notable observation is that this method generates meshes with much better dihedral angles and radius- edge ratio than TetGen. The achieved boundary planar angles are similar simply because the PLC that is given to TetGen was in fact the triangular boundary mesh of P12M,

Disclosed is an embodiment of PI2M: the first parallel Image-to-Mesh (PI2M) Conversion Isosurface-based algorithm and its implementation. Starting directly from a multi-label segmented 3D image, it is able to recover and mesh both the isosurface dO with geometric and topological guarantees (see Theorem 1) and the underlying volume O with quality elements.

This work is different from parallel TrianguJaior,, since parallel mesh generation and refinement focuses on the quality of elements (tetrahedra and facets) and the conformal representation of the tissues' boundaries/isosurfaces by computing on demand the appropriate points for insertion or deletion. Parallel Triangulators tesseilate only the convex hull of a set of points.

The present tightly-coupled method greatly reduces the number of rollbacks and scales up to a much higher core count, compared to the tightly-coupled method our group developed in the past. The data decomposition method does not support DeSaunay removals, a technique thai it is shown to be effectiv in the sequential mesh generation literature. The extension of partially-coupled and decoupled methods to 3D is a very difficult task, since Deiaunay-admissible 3D medial decomposition is an unsolved problem. On the contrary, the present method does not rely on any domain decomposition, and could be extended to arbitrary dimensions as well. Indeed, we plan to extend ΡΪ2Μ to 4 dimensions and generate space- time elements (needed for spatio-temporal simulations) in parallel, thus, exploiting parallelism in the fourth dimension.

The present approach is highly optimized through carefully designed contention managers, and load balancers which take advantage of NUMA architectures. Aspects of the Global Contention Manager (G!obal-CM) and Local Contention Manager (Locai-CM) provably eliminate deadlocks and livelocks. They achieve a speedup even on 256 cores, when other traditional contention managers, found in the mesh generation literature, fail to terminate. Locai-CM also reduced the number of overhead cycles by a factor of 2 compared to die Globai-CM on 256 cores, improving energy- efficiency by avoiding energy waste because of rollbacks. Lastly, the Hierarchical Work Stealing load balancer (HWS) sped up the execution by a factor of 1.45 on 176 cores, as a result of a 22.8% remote accesses reduction.

All in all, P12M achieves a strong scaling efficiency of more than 82% on 64 cores. It also achieves a weak scaling efficiency of more than 82% on up to 144 cores. The Inventors are not. aware of any 3D parallel Delaunay mesh refinement algorithm achieving such a performance. It is worth noting that ΡΪ2Μ exhibits excellent single-threaded performance, Despite the extra overhead associated with synchronization, contention management, and load balancing, PI2M generates meshes 40% faster than CGAL and with similar quality. Moreover, P12M achieves better quality than TetGen, and it is also faster than TetGen for large mesh sizes,

Recall that in the present method, threads spend time idling on the contention and load balancing lists. And this is necessary in our algorithm for correctness and performance efficiency. This fact offers great opportunities to control the power consumption, or alternatively, to maximize the ratio. Since idling is not the time critical component, the CPU frequency could be decreased during such an idling.

As explained, for core counts higher than 144, weak scaling performance deteriorates because communication traffic (per switch) is more intense and passes through a larger number of hops, in the future, scalability may be increased by employing a hierarchically layered (disiributed and shared memory) implementation design and combine this tightly-coupled method with the decoupled and partially coupled methods.

While the invention has been described and illustrated with reference to certain preferred embodiments herein, other embodiments are possible, Additionally, as such, the foregoing illustrative embodiments, examples, features, advantages, and attendant advantages are not meant to be limiting of the present invention, as the invention may be practiced according to various alternative embodiments, as well as without necessarily providing, for example, one or more of the features, advantages, and attendant advantages that may be provided by the foregoing illustrative embodiments. Systems and modules described herein may comprise software, firmware, hardware, or any combination(s) of software, firmware, or hardware suitable for the purposes described herein.

Software and other modules may reside on servers, workstations, personal computers, computerized tablets, PDAs, and other devices suitable for the purposes described herein. Software and other modules may be accessible via local memory, via a network, via a browser or other application in an ASP context, or via other means suitable for the purposes described herein. Data structures described herein may comprise computer files, variables, programming arrays, programming structures, or any electronic information storage schemes or methods, or any combinations thereof, suitable for the purposes described herein. User interface elements described herein may comprise elements from graphical user interfaces, command line interfaces, and other interfaces suitable for the purposes described herein. Except to the extent necessary or inherent in the processes themselves, no particular order to steps or stages of methods or processes described in this disclosure, including the Figures, is implied. In many cases the order of process steps may be varied, and various illustrative steps may be combined, altered, or omitted, without changing the purpose, effect or import of the methods described.

Accordingly, while the invention has been described and illustrated in connection with preferred embodiments, many variations and modifications as will be evident to those skilled in this art may be made without departing from the scope of the invention, and the invention is thus not to be limited to the precise details of methodology or construction set forth above, as such variations and modification are intended to be included within the scope of the invention. Therefore, the scope of the appended claims should not be limited to the description and illustrations of the embodiments contained herein.

Claims

What is claimed is; 1 , A method for generating mesh from a four dimensional image, the method being implemented by a computer system operativelv connected to at least one data storage device for storing image data for images, at least one computer, and at least one non-transitorycomputer readable medium storing thereon computer code which when executed by the at least one computer performs the method, the method comprising: receiving a four dimensional digital image /of a three dimensional object Ω over time r,-, the three dimensional object Ω having a volume and a surface 8Ω; applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet feature point z ~ cfpg_n (x^") k at least 2&_dn z); generating meshes of the four dimensional digital image / by insertion and deletion of points in a Deiaunay fashion; recovering an underlying hyper-surface of the object Ω; and filling the volume of the hyper-surface with sliver free pentatopes,

2. The method according to claim 1 , wherein the method comprises:

generating meshes until a final mesh is a set of pentatopes having cireum centers inside Ω,

3. The method according to claim 1 , wherein the four dimensional digital image / is a four dimensional medical image,

4. The method of claim 2, wherein the generating of meshes is processed by providing a plurality of parallel processing threads I , each of threads 7; having a poor element list of tetrahedra ί requiring refining,

5. The method of claim 2, wherein δ is a predetermined sampling density, the meshes include a plurality of tetrahedrons t, and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes by: inserting point z if z is at a distance not closer than Δ_Μ (ζ) to any other feature vertex, when z is equal to (€ ₄) and ¾_<,.¾ intersects 8Ω;

inserting€_σ4 if radius R_a > 2 A_da (z), when z is equal to d _¾; (€_σ4) and Φ_σ4 intersects 3Ω; inserting C_o4 il ^' p_o4.≥ p, when C_o4 lies inside Ω;

inserting a good point z inside P " BJa4) if σ4 contains a sliver when€_σ4 lies inside Ω;

inserting a good point z inside li( 3) if all vertices of σ3 are not feature vertices, when <τ3(σ3 C <74) is a restricted facet, and

deleting ail free vertices closer than Δ_Μ (ζ) ΐο z„.

6. The method of claim 5 , wherein the generating of meshes is processed by providing a plurality of parallei processing threads 7}, each of threads Ti having a poor element list of letrahedra t requiring refining.

7, The method of claim 6, further comprising the step of finding and triangulating cavities C of insertions.

8. The method of claim 6, further comprising the step of finding and triangulating cavities C of insertions according to a Bowyer-Watson kernel.

9. The method of claim 6 further comprising balancing processing load among threads

Ti by one or more of random work stealing, hierarchical work stealing and static box partitioning.

10. The method of claim 6, further comprising managing contention by pausing at least one of the plurality of processing threads Ti, without liveiocks in the remaining threads.

1 1. The method of claim 6, further comprising locally managing contention by designating one of the plurality of processing threads T_h as a main non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads Ί) without iiveiocks.

12. A non-transitory recording medium storing a program that instructs a computer to generate a mesh from an image, the method comprising: receiving a four dimensional digital image /of a three dimensional object Ω over time ί,·, the three dimensional object Ω having a volume and a surface 3Ω applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet feature point z = cfp_dn (x^") is at least 2A_da (z ; generating meshes of the four dimensional digital image I by insertion and deletion of points in a Delaunay fashion; recovering an underlying hyper-surface of the object Ω; and filling the volume of the hyper-surface with sliver free pentatopes.

13. A system for performing a process on an image, the process being implemented by a computer system operatively connected to at least one data storage device in which is stored image data, and comprising at least one computer and at least one non-transitory computer readable medium storing thereon computer code which when executed by the at least one computer performs a method, the method comprising: receiving a four dimensional digital image / of a three dimensional object Ω over time the three dimensional object Ω having a volume and a surface 3Ω; applying a virtual hyper-box of sixteen corners to the image of object Ω such that the distance between a box corner x and its closet feature point z ~ cfp_dn (x) is at least 2A_dn (z); generating meshes of the four dimensional digital image I by insertion and deletion of points in Delaunay fashion; recovering an underlying hyper-surface of the object Ω; and filling the volume of the hyper-surface with sliver free pentatopes.

14. The system of claim 13, wherein the method comprises:

generating meshes until a final mesh is a set of pentatopes having circumcenters inside Ω.

15. The system of claim 13, wherein the four dimensional digital image J is a four dimensional medical image.

16. The system of claim 14, wherein the generating of meshes is processed by providing a plurality of parallel processing threads T_h each of threads Ϊ) having a poor element list of tetrahedra i requiring refining.

1 7. The system of claim 14, wherein δ is a predetermined sampling density, the meshes inciude a plurality of tetrahedrons c, and wherein the dynamic insertion and deletion of points in a Delaunay fashion comprises refining the meshes by: inserting point z if z is at a distance not closer than Δ_3ιΩ(ζ) to any other feature vertex, when z is equal to efp_¾> (C_c4) and Β_σ4 intersects 6Ω:

inserting Ο_σ4 if radius R_a > 2 _βΩ(ζ , when z is equal to cf ac (Q4) and Β_α4 intersects 3Ω; inserting C_an if ρ_σ4 > p, when C₀ lies inside Ω

inserting a good point z inside PR(oA) ifo4 contains a sliver when C„4 lies inside Ω;

inserting a good point z inside PR(a3) if all vertices of σ3 are not feature vertices, when σ3(ί?3 C cr4) is a restricted facet, and

deleting all free vertices closer than Δ_Ι½ (2) to z.

18. The system of claim 17, wherein the generating of meshes is processed by providing a plurality of parallel processing threads T each of threads T) having a poor element list of tetrahedra t requiring refining.

19, The system of claim 18, further comprising the step of finding and triangulating cavities C of insertions.

20. The system of claim 18, further comprising d e step of finding and triangulating cavities C of insertions according to a Bowyer-Watson kernel

21. The system of claim 1 8 further comprising balancing processing load among threads T_j by one or more of random work stealing, hierarchical work stealing and static box partitioning,

22. The system of claim 1 8, further comprising managing contention by pausing at least one of the plurality of processing threads 7), vvithoui Ilvelocks in the remaining threads.

23. The system of claim 18, further comprising locally managing contention by designating one of the plurality of processing threads T_h as a ma in non-pausing thread and pausing as a function of the number of rollbacks at least one of the remaining of the plurality of processing threads Tj without Hvefoeks.