CN116149835A - Method, apparatus, device and medium for processing unstructured grid data

Info

Publication number
CN116149835A
Authority
CN
China
Prior art keywords
data
processed
elements
block
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210964262.4A
Other languages
Chinese (zh)
Inventor
余畅
徐懿
匡冶
胡渊鸣
刘天添
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Taiqi Graphics Technology Co ltd
Original Assignee
Beijing Taiqi Graphics Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Taiqi Graphics Technology Co ltd
Priority to CN202210964262.4A
Publication of CN116149835A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method, apparatus, device and medium for processing unstructured grid data. The method comprises: acquiring grid data to be processed, the grid data comprising a plurality of elements; partitioning the grid data to be processed based on a data block size to obtain at least one data block, wherein the data block size is determined according to the capacity of a processor cache of a computing device and is smaller than that capacity, and wherein each of the at least one data block comprises a plurality of intra-block elements that are at least a portion of the plurality of elements; determining a data block to be processed among the at least one data block; and loading the data block to be processed into the processor cache, wherein all intra-block elements included in the data block to be processed are retained in the processor cache until processing of the data block is complete.

Description

Method, apparatus, device and medium for processing unstructured grid data
Statement of divisional application
This application is a divisional application of Chinese invention patent application No. 202111401270.X, filed on November 19, 2021 and entitled "Method, apparatus, device, and medium for processing unstructured grid data."
Technical Field
The present disclosure relates to the field of computers, and in particular to a method, an apparatus, an electronic device, a non-transitory computer-readable storage medium, and a computer program product for processing unstructured grid data.
Background
A mesh (Mesh) is a common format for data discretization. Depending on their shape, meshes can generally be divided into structured meshes (Structured Mesh) and unstructured meshes (Unstructured Mesh). In a structured mesh, all internal points have exactly the same topology; common structured meshes include quadrilateral (two-dimensional) and hexahedral (three-dimensional) grids. Structured grids have excellent numerical properties and allow simple data access, but they can generally only represent regular shapes. For objects with irregular structures, a structured grid can hardly describe the surface accurately. In contrast, the internal points of an unstructured grid may have different local shapes, so unstructured grids can accurately represent most irregular geometries. Owing to this expressive power, unstructured grids are widely used in visual computing applications such as finite element simulation and geometry processing. As the name suggests, unstructured grids have no regular topological relationships, and these relationships generally need to be maintained explicitly.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
It would be advantageous to provide a mechanism that alleviates, mitigates or even eliminates one or more of the above problems.
According to an aspect of the present disclosure, there is provided a method for processing unstructured grid data, comprising: acquiring grid data to be processed, wherein the grid data to be processed comprises a plurality of elements; partitioning the grid data to be processed based on a data block size to obtain at least one data block, wherein the data block size is determined according to the capacity of a processor cache of a computing device and is smaller than that capacity, and wherein each of the at least one data block comprises a plurality of intra-block elements that are at least a portion of the plurality of elements; determining a data block to be processed among the at least one data block; and loading the data block to be processed into the processor cache, wherein all intra-block elements included in the data block to be processed are retained in the processor cache until processing of the data block is complete.
According to another aspect of the present disclosure, there is provided an apparatus for processing unstructured grid data, comprising: an acquisition module configured to acquire grid data to be processed, wherein the grid data to be processed includes a plurality of elements; a partitioning module configured to partition the grid data to be processed based on a data block size to obtain at least one data block, wherein the data block size is determined according to the capacity of a processor cache of a computing device and is smaller than that capacity, and wherein each of the at least one data block comprises a plurality of intra-block elements that are at least a portion of the plurality of elements; a first determining module configured to determine a data block to be processed among the at least one data block; and a loading module configured to load the data block to be processed into the processor cache, wherein all intra-block elements included in the data block to be processed are retained in the processor cache until processing of the data block is complete.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: at least one processor, wherein each of the at least one processor comprises a processor cache; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for processing unstructured grid data described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method for processing unstructured grid data.
According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the method for processing unstructured grid data described above.
According to one or more embodiments of the present disclosure, unstructured grid data to be processed is partitioned into at least one data block based on a block size determined according to the capacity of a processor cache of a computing device. Each data block is no larger than the processor cache and includes a plurality of intra-block elements (e.g., points, edges, faces, or higher-dimensional elements of the grid to be processed, such as tetrahedra). After the data blocks are obtained, a data block to be processed is determined among them, loaded into the processor cache, and retained there until its processing is complete. The intra-block elements can therefore be read directly from the processor cache throughout the processing of the data block, avoiding the extra time that would otherwise be spent re-reading and re-loading elements from memory, or even from a data storage unit, after they had been evicted from the cache. In this way the processor cache is utilized to the greatest extent, cache utilization is improved, memory read/write overhead is reduced, and the efficiency of processing unstructured grid data is improved.
These and other aspects of the disclosure will be apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Further details, features and advantages of the present disclosure are disclosed in the following description of exemplary embodiments with reference to the drawings. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 is a flowchart illustrating a method for processing unstructured grid data in accordance with an example embodiment;
FIG. 2 is a flowchart illustrating the partitioning of grid data to be processed in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating an instruction to be executed according to an example embodiment;
FIG. 4 is a flowchart illustrating a method for processing unstructured grid data in accordance with an example embodiment;
FIG. 5 is a flowchart illustrating a method for processing unstructured grid data in accordance with an example embodiment;
FIG. 6 is a flowchart illustrating a method for processing unstructured grid data in accordance with an example embodiment;
FIG. 7 is a schematic block diagram illustrating an apparatus for processing unstructured grid data in accordance with an example embodiment;
FIG. 8 is a schematic block diagram illustrating a partitioning module in accordance with an example embodiment;
FIG. 9 is a schematic block diagram illustrating an apparatus for processing unstructured grid data in accordance with an example embodiment;
FIG. 10 is a schematic block diagram illustrating an apparatus for processing unstructured grid data in accordance with an example embodiment;
FIG. 11 is a schematic block diagram illustrating an apparatus for processing unstructured grid data in accordance with an example embodiment; and
FIG. 12 is a block diagram illustrating an exemplary computer device that can be applied to exemplary embodiments.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings. Various details of the embodiments are included to facilitate understanding and should be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, an element may be one or more if its number is not expressly limited. As used herein, the term "plurality" means two or more, and the term "based on" should be interpreted as "based at least in part on". Furthermore, the terms "and/or" and "at least one of" encompass any and all possible combinations of the listed items.
Exemplary embodiments of the present disclosure are described in detail below with reference to the attached drawings.
Fig. 1 is a flowchart illustrating a method 100 for processing unstructured grid data according to an example embodiment.
Referring to fig. 1, at step 102, grid data to be processed is acquired.
According to some embodiments, the grid data to be processed may be used to describe an unstructured grid to be processed. The mesh to be processed may represent the topology and geometry of a two-dimensional, three-dimensional, or other-dimensional object. The mesh to be processed is composed of a plurality of mesh elements (Mesh Elements) of specific dimensions, and the mesh data to be processed includes a plurality of element data items in one-to-one correspondence with the mesh elements. Common mesh elements include zero-dimensional, one-dimensional, two-dimensional, and three-dimensional elements, corresponding to points (Vertex), edges (Edge), faces (Face), and volumes (Cell), respectively.
For convenience of description, an "element" in the present disclosure may refer to either a mesh element or element data corresponding to the mesh element, and accordingly, when an "element" refers to element data, a dimension of the "element" may refer to a dimension of the mesh element corresponding to the element data, which is not limited herein.
According to some embodiments, the plurality of elements included in the mesh data to be processed may have the same dimensions. In one exemplary embodiment, the to-be-processed mesh data describing the two-dimensional to-be-processed mesh may include only two-dimensional elements. In another exemplary embodiment, the to-be-processed mesh data describing the three-dimensional to-be-processed mesh may include only three-dimensional elements. It will be appreciated that the high-dimensional to-be-processed grid may also be described using low-dimensional elements, for example, using only two-dimensional elements to describe a three-dimensional to-be-processed grid, without limitation.
According to some embodiments, the plurality of elements included in the mesh data to be processed may also have different dimensions. In some embodiments, for any integer n satisfying 0 ≤ n ≤ N, the plurality of elements includes a plurality of n-dimensional elements, where N is a preset integer not less than 2. That is, the plurality of elements includes a plurality of points, a plurality of edges, and a plurality of faces. In embodiments where N is greater than 2, the plurality of elements may also include higher-dimensional elements, such as a plurality of volumes, without limitation.
According to some embodiments, each k-dimensional element of the plurality of elements includes k+1 (k-1)-dimensional elements, where k is any integer satisfying 0 < k ≤ N. In other words, each edge includes two points, each face includes three edges, and optionally each volume includes four faces. It should be noted that the k+1 (k-1)-dimensional elements included in each k-dimensional element define the boundary of that k-dimensional element: the two points included in each edge are its two endpoints, the three edges included in each face form that face, and so on. In such an embodiment, each face is a triangle and, optionally, each volume is a tetrahedron (Tetrahedron).
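This fixed-arity boundary structure can be illustrated with a minimal sketch, not taken from the patent, using flat arrays for a single triangle; the array names are illustrative:

```python
import numpy as np

# Each 1-D element (edge) lists its 2 endpoints (0-D elements).
edge_points = np.array([[0, 1], [1, 2], [2, 0]])   # 3 edges of one triangle
# Each 2-D element (face) lists its 3 bounding edges (1-D elements).
face_edges = np.array([[0, 1, 2]])                 # 1 triangular face
# Optionally, each face also stores the 3 points shared by its edges.
face_points = np.array([[0, 1, 2]])

# Because every k-dimensional element includes exactly k+1 elements of
# dimension k-1, each relation is a dense (num_elements, k+1) array,
# convenient for contiguous storage and cache-friendly access.
assert edge_points.shape[1] == 2 and face_edges.shape[1] == 3
```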
According to some embodiments, each k-dimensional element may further include lower-dimensional elements. In some exemplary embodiments, each face may further include the three points shared by its three edges, and optionally each volume may further include the six edges shared by its four faces and the four points shared by those six edges. It will be appreciated that in embodiments where N is greater than 3, the same holds for higher-dimensional elements, without limitation. In this way, each element includes a constant number of lower-dimensional elements, which is advantageous for data storage and processing; by storing, in the mesh data to be processed, the lower-dimensional elements included in each element, the efficiency of processing the data can be improved.
According to some embodiments, the above N may be set to 3; that is, the plurality of elements included in the mesh data to be processed includes a plurality of three-dimensional elements. Thus, using the method of the present disclosure, three-dimensional object simulation can be performed with information on "volumes". For convenience of description, N is set to 3 in the embodiments of the present disclosure, but this example is merely illustrative; one skilled in the art may apply the method of the present disclosure to any scenario in which N is an integer not less than 2, and such adaptations are within the scope of the present disclosure.
According to some embodiments, elements may have neighbor relationships between them. Two different elements of the plurality of elements are neighbors of each other when any of the following conditions is satisfied:
Condition A: both elements have dimension k and include the same (k-1)-dimensional element, where k is an integer satisfying 0 < k ≤ N;
Condition B: the two elements have a complete inclusion relationship; and
Condition C: the two elements are the two 0-dimensional elements included in one 1-dimensional element of the plurality of elements.
In a spring-mass system example, two springs (one-dimensional elements) sharing an endpoint (a zero-dimensional element) are neighbors; the two endpoints included in a spring are neighbors of that spring and are also neighbors of each other; and all springs connected to an endpoint are neighbors of that endpoint, as sketched below.
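A hypothetical sketch of the adjacency rules A to C for this spring-mass example, with springs given as endpoint pairs; all names are illustrative:

```python
from collections import defaultdict

springs = [(0, 1), (1, 2), (2, 3)]   # each spring = (point, point)

point_to_springs = defaultdict(set)  # condition B: containment relationship
for s, (u, v) in enumerate(springs):
    point_to_springs[u].add(s)
    point_to_springs[v].add(s)

# Condition A: two springs (1-D) sharing an endpoint (0-D) are neighbors.
spring_neighbors = {
    s: {t for p in springs[s] for t in point_to_springs[p]} - {s}
    for s in range(len(springs))
}
# Condition C: the two endpoints of one spring are neighbors of each other.
point_neighbors = defaultdict(set)
for u, v in springs:
    point_neighbors[u].add(v)
    point_neighbors[v].add(u)

print(spring_neighbors[1])  # {0, 2}: spring 1 shares an endpoint with 0 and 2
```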
In step 104, the grid data to be processed is partitioned based on a data block size to obtain at least one data block.
According to some embodiments, the data block size is determined according to the capacity of a processor cache of the computing device and is smaller than that capacity. It should be noted that the present disclosure does not limit the type of processor of the computing device. In some embodiments, the mesh data to be processed may be processed using a central processing unit (CPU), and the processor cache may be an upper-level cache in the CPU, such as the L1 or L2 cache. In other embodiments, the mesh data may be processed using a graphics processing unit (GPU), and the processor cache may be shared memory in the GPU.
In some processors, the capacity of the processor cache can be adjusted within a certain range. In one example, under CUDA (Compute Unified Device Architecture) there is a trade-off between the data block size and the number of thread blocks (i.e., the processing units described below). A larger processor cache can accommodate larger data blocks, thereby improving data locality. However, in a processor including a plurality of processing units as described later, increasing the cache capacity of each processing unit reduces the number of processing units and thus reduces concurrency when processing the mesh data. It will be appreciated that similar situations exist in some other processors. In one exemplary embodiment, the amount of data that needs to be cached may be estimated from the instructions to be executed in order to determine the capacity of the processor cache, the data block size, or both.
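A back-of-envelope sketch of this trade-off with illustrative numbers (the capacities and element footprint are assumptions, not taken from the patent):

```python
total_shared_mem = 96 * 1024   # bytes of shared memory per GPU SM (assumed)
bytes_per_element = 32         # assumed footprint of one cached element

for per_block_cache in (16 * 1024, 32 * 1024, 48 * 1024):
    max_elements = per_block_cache // bytes_per_element    # data block size cap
    resident_blocks = total_shared_mem // per_block_cache  # concurrency per SM
    print(f"cache/block={per_block_cache // 1024}KB -> "
          f"block holds <= {max_elements} elements, "
          f"{resident_blocks} blocks resident")
# Larger per-block caches improve locality but leave fewer blocks resident,
# i.e., less concurrency: the trade-off described above.
```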
According to some embodiments, the data block size may be determined according to the instructions to be executed. In some embodiments, a user may specify or modify the value of the data block size by an instruction. In some embodiments, for a particular processor (e.g., a particular GPU), several default values of the data block size may be provided, from which the user may choose.
According to some embodiments, each data block resulting from the partitioning of the grid data to be processed may include at least a portion of the elements in the grid data, which may be referred to as the intra-block elements (owned elements) of the corresponding data block. The partitioning of the mesh data may be based, for example, on the structure of the mesh to be processed. In some embodiments, each of the plurality of intra-block elements may be adjacent to at least one other intra-block element of the same data block; in other words, each data block is a connected subset of the grid to be processed. In this way, when the data is processed, the element data and element relationships within the same data block, which are frequently read together, exhibit better data locality, which improves processing speed.
According to some embodiments, each data block obtained by partitioning the grid data may be no larger than the data block size and as close to it as possible, so as to make full use of the processor cache. The sizes of these data blocks may also have a small variance, which further improves cache utilization and, as described below, makes parallel processing of the data blocks more efficient.
It is understood that any blocking algorithm may be used by those skilled in the art or may be designed according to the requirements to block the mesh data to be processed, which is not limited herein.
According to some embodiments, when the grid data to be processed is partitioned, the highest-dimensional elements may be partitioned first, and the lower-dimensional elements they include are then added to the data blocks of the corresponding highest-dimensional elements. In some embodiments, no two of the at least one data block contain the same intra-block element; that is, each element belongs to exactly one data block as an intra-block element. In one exemplary embodiment, each highest-dimensional element is assigned to exactly one data block, while a lower-dimensional element shared by several higher-dimensional elements may be assigned randomly to one of the corresponding data blocks.
According to some embodiments, each element may have a unique global index in the grid data to be processed. After the grid data is converted into a plurality of data blocks, the elements of a data block can be stored contiguously in the cache to further improve data locality, and local indices allow the cached data block to be read as contiguous data. According to some embodiments, as shown in FIG. 2, step 104 of partitioning the mesh data to be processed may include: step 202, for each of the at least one data block, determining a local index for each of the plurality of intra-block elements included in that data block; and step 204, determining index mappings between the local indices of the intra-block elements and their global indices in the grid data to be processed.
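A minimal sketch of steps 202 and 204, assuming each block is given as a list of global element indices; the function and variable names are illustrative, not from the patent:

```python
def build_index_maps(blocks):
    """blocks: list of lists of global element indices, one list per block."""
    local_to_global = []   # per block: local index -> global index
    global_to_local = {}   # global index -> (block id, local index)
    for b, elements in enumerate(blocks):
        l2g = list(elements)          # position in the list = local index
        local_to_global.append(l2g)
        for local, g in enumerate(l2g):
            global_to_local[g] = (b, local)
    return local_to_global, global_to_local

l2g, g2l = build_index_maps([[4, 7, 9], [0, 2]])
assert l2g[0][1] == 7 and g2l[7] == (0, 1)   # both directions of the mapping
```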
When reading data related to intra-block elements located at the edge of a data block, elements in other data blocks are very likely to be accessed, so determining the neighborhood elements (ghost elements) of a data block makes processing the data block more convenient, as described below.
According to some embodiments, step 104 of partitioning the mesh data to be processed may further include: step 206, for each of the at least one data block, determining every element that is adjacent to one of the intra-block elements of the data block but does not belong to the data block as a neighborhood element of that data block. Such a neighborhood element can be regarded as a one-ring neighbor of the data block. In some embodiments, the two-ring neighbors of the intra-block elements (i.e., the neighbors of their neighbors) or even higher-ring neighbors may also be determined as neighborhood elements of the corresponding data block, without limitation. It will be appreciated that each element belongs to only one data block as an intra-block element but may belong to several different data blocks as a neighborhood element; the data block that contains an element as an intra-block element may therefore be regarded as the data block in which that element is located.
According to some embodiments, a partitioning approach may be employed that gives each data block fewer neighborhood elements. The fewer the neighborhood elements, the higher the proportion of intra-block elements in the data block and the higher the cache utilization.
According to some embodiments, to further improve data locality, local indices may also be assigned to the neighborhood elements of a data block. Step 104 of partitioning the mesh data to be processed may thus further include: step 208, for each of the at least one data block, determining a local index for each neighborhood element of the data block; and step 210, determining index mappings between the local indices of the neighborhood elements and their global indices in the grid data to be processed. In this way, both the intra-block elements and the neighborhood elements of a data block have local indices.
According to some embodiments, the local indices of intra-block elements and neighborhood elements may be distinguished. In some embodiments, when determining the indices, smaller index values may be assigned to intra-block elements and larger index values to neighborhood elements. In one exemplary embodiment, two variables, "num_owned" and "num_total", may record the number of intra-block elements and the total number of locally indexed elements in each data block, and these two variables may be used to determine whether an element is an intra-block element or a neighborhood element of the corresponding data block.
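A sketch of the layout implied by these two counters, assuming owned (intra-block) elements occupy the low local indices and neighborhood (ghost) elements the high ones; the class and field names are illustrative:

```python
class DataBlock:
    def __init__(self, owned, ghosts):
        # local index order: [owned... | ghosts...]
        self.elements = list(owned) + list(ghosts)  # local -> global index
        self.num_owned = len(owned)
        self.num_total = len(self.elements)

    def is_owned(self, local_index):
        # any local index below num_owned refers to an intra-block element
        return local_index < self.num_owned

blk = DataBlock(owned=[10, 11, 12], ghosts=[3, 42])
assert blk.is_owned(2) and not blk.is_owned(3) and blk.num_total == 5
```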
In step 106, a data block to be processed is determined in the at least one data block.
According to some embodiments, the data block to be processed may be determined, for example, based on instructions to be executed. The instructions to be executed may be input by a user, generated by a machine, or stored in a storage medium, without limitation. In some embodiments, the data block containing the element data indicated by an instruction to be executed may be taken as the data block to be processed.
According to some embodiments, step 106 of determining the data block to be processed among the at least one data block may include: in response to receiving a query loop instruction, determining the data block containing the target elements queried by the query loop instruction as the data block to be processed. The query loop instruction is used, for example, to query target elements that satisfy a preset condition and to instruct execution of the data processing instructions corresponding to the query loop instruction on those target elements. The query loop instruction is similar to the struct-for loop for traversing voxels and the range-for loop over indices in the Taichi programming language (e.g., Taichi version 0.8.0); it queries mesh elements that satisfy a preset condition and may therefore also be called a mesh-for loop. The preset condition can be read from the query loop instruction and typically indicates a series of target elements that have a neighbor, inclusion, or other relationship with a particular element or mesh; these target elements usually have the same dimension, as described below.
In one exemplary embodiment of the instructions to be executed shown in FIG. 3, the first line is a query loop instruction that queries (or traverses) all points in a mesh to be processed named bunny. In this mesh-for, the preset condition is indicated by the expression "bunny.vertices", i.e., all points of bunny. Lines two to seven are the data processing instructions executed for each point queried by the mesh-for of the first line. Specifically, the second line is an instruction that operates on the stress of the current point, and the third line is a query loop instruction that queries (or traverses) all edges connected to the point u (the element queried by the mesh-for of the first line). In the mesh-for of the third line, the preset condition is indicated by the expression "u.edges", i.e., all edges connected to the point u. Lines four to seven are the data processing instructions, corresponding to the mesh-for of the third line, executed for each edge connected to the point u. Specifically, the fourth line queries the endpoint v of the edge e (the element queried by the mesh-for of the third line) other than the point u, the fifth line computes the difference between the coordinates of the points u and v, the sixth line normalizes that difference, and the seventh line further accumulates the stress of the point u.
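The kernel itself is not reproduced in the text, so the following is a speculative reconstruction of the seven lines just described, sketched as plain Python over a toy mesh so that it runs without any particular mesh library; all names (vertices, edges, pos, stress) are assumptions:

```python
import math

class Vert:
    def __init__(self, pos):
        self.pos, self.stress, self.edges = pos, [0.0, 0.0], []

class Edge:
    def __init__(self, a, b):
        self.vertices = (a, b)
        a.edges.append(self)
        b.edges.append(self)

bunny_vertices = [Vert([0.0, 0.0]), Vert([1.0, 0.0]), Vert([0.0, 1.0])]
Edge(bunny_vertices[0], bunny_vertices[1])
Edge(bunny_vertices[0], bunny_vertices[2])

for u in bunny_vertices:                                 # line 1: mesh-for over all points
    u.stress = [0.0, 0.0]                                # line 2: operate on the point's stress
    for e in u.edges:                                    # line 3: mesh-for over edges at u
        v = e.vertices[1] if e.vertices[0] is u else e.vertices[0]  # line 4: other endpoint
        d = [u.pos[i] - v.pos[i] for i in range(2)]      # line 5: coordinate difference
        n = math.hypot(d[0], d[1]) or 1.0
        d = [x / n for x in d]                           # line 6: normalize the difference
        u.stress = [s + x for s, x in zip(u.stress, d)]  # line 7: accumulate stress at u
```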
At step 108, the block of data to be processed is loaded into a processor cache.
According to some embodiments, all intra-block elements included in a data block to be processed may be retained in the processor cache until processing of that data block is complete. In some embodiments, processing of the data block is complete when, for example, all instructions that perform data processing on elements included in the data block have finished executing. Before executing the instructions to be executed, they may first be analyzed to determine the data block to be processed; the instructions that process the data block or its elements are then identified, and the data block is kept in the processor cache until execution of those processing instructions ends. It will be appreciated that the instructions that directly process the data block or its elements may not be contiguous, so the data block may need to be retained in the processor cache until the last such processing instruction has finished. In one exemplary embodiment, in the example instructions shown in FIG. 3, the fifth line only performs a further calculation on an intermediate result and does not directly process the data block to be processed or its elements; but because the seventh line after it does process elements of the data block, the corresponding data block may be retained in the processor cache until execution of the seventh line is complete.
According to some embodiments, step 108 of loading the data block to be processed into the processor cache may include: loading the neighborhood elements of the data block to be processed into the processor cache. As described above, the neighborhood elements are very likely to be accessed when reading data related to intra-block elements at the edge of the data block, so loading and retaining the neighborhood elements in the processor cache allows them to be accessed quickly, improving the efficiency of processing the data block to be processed.
In one exemplary embodiment, all intra-block elements of the data block to be processed may be kept in the processor cache until execution of the query loop instruction ends; during this time, the neighborhood elements of the data block may also be kept in the processor cache. Because the query loop instruction instructs execution of its corresponding data processing instructions on the target elements, and those instructions usually need to read the target elements and related elements, treating the data block containing the target elements as the data block to be processed and retaining it in the processor cache until the data processing instructions finish improves cache utilization and avoids the extra time of re-reading and re-loading intra-block elements from memory, or even from a data storage unit, after they have been evicted from the cache.
It will be appreciated that the end of execution of the query loop instruction may be the end of execution of the data processing instructions for the target element currently being processed, or for all queried target elements, without limitation.
In the course of processing the grid data to be processed, certain topological relationships between elements are accessed frequently, depending on the data processing task. Among these topological relationships, the one or more accessed most often may be referred to as primary relationships. By determining the primary relationships, the element relationships satisfying them can be identified among the plurality of elements in the grid data, and reads of these element relationships can then be optimized, improving the efficiency of processing the grid data, as described below.
Referring to fig. 4, the operations of steps 402-408 in method 400 are similar to the operations of steps 102-108 in method 100 and are not described in detail herein.
In step 410, in response to receiving a query loop instruction, a primary relationship is determined based at least on the dimension of the target elements queried by the query loop instruction.
According to some embodiments, the primary relationship may indicate an association from a first element having a first dimension to a second element, adjacent to the first element, having a second dimension. The first element may be called the "from end" of the primary relationship and the second element its "to end". In some embodiments, the dimension of the target element may be taken as the first dimension, and the second dimension is then determined. In one exemplary embodiment, the target element is a point, and the primary relationship may be at least one of a "point-point", "point-edge", "point-face", and "point-volume" relationship. In other embodiments, the dimension of the target element may be taken as the second dimension, and the first dimension is then determined. In another exemplary embodiment, the target element is a point, and the primary relationship may be at least one of a "point-point", "edge-point", "face-point", and "volume-point" relationship. It will be appreciated that one skilled in the art may also determine the primary relationship from the dimension of the target element in other ways, without limitation. It should also be noted that the first dimension and the second dimension may be the same or different, without limitation.
According to some embodiments, step 410 of determining the primary relationship may include: determining the dimension of the target element as the first dimension; and determining the dimension of the element queried by the first query in the data processing instructions as the second dimension. Typically, the first queried topological relationship in a mesh-for (i.e., the relationship between the target elements queried by the mesh-for and the elements queried by the first query in the data processing instructions) is the most frequently accessed relationship, so determining it as the primary relationship further improves the benefit of optimizing reads of the element relationships that satisfy it.
The first query in the data processing instructions may be, for example, a query for a particular element related to the target element, typically an element adjacent to it. In one exemplary embodiment, the target element queried by the query loop instruction is an edge e, and the first query in the corresponding data processing instructions may be "v = e.vertices[0]", i.e., a query for a particular point adjacent to (i.e., included by) the target element e; the primary relationship may then be determined as an "edge-point" relationship. Alternatively or additionally, the first query may itself be a query loop instruction over elements related to the target element. In another exemplary embodiment, the target element queried by the query loop instruction is a point u, and the first query in the corresponding data processing instructions may be the query loop "for e in u.edges", i.e., a query (or traversal) of the edges adjacent to (i.e., connected to) the target element u; the primary relationship may then be determined as a "point-edge" relationship.
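A hypothetical sketch of how this determination might look as a compiler pass: given the dimension of the target element and the attribute named in the first query, return the (first dimension, second dimension) pair of the primary relationship. The attribute-to-dimension table is an assumption:

```python
DIM = {"vertices": 0, "edges": 1, "faces": 2, "cells": 3}

def primary_relation(target_dim, first_query_attr):
    # e.g. the target is a point (dim 0) and the first query is "u.edges"
    return (target_dim, DIM[first_query_attr])

assert primary_relation(DIM["edges"], "vertices") == (1, 0)  # "edge-point"
assert primary_relation(DIM["vertices"], "edges") == (0, 1)  # "point-edge"
```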
According to some embodiments, the query loop instruction may be the outermost of a plurality of nested query loop instructions. In one exemplary embodiment, in the example instructions of FIG. 3, the mesh-for of the first line and the mesh-for of the third line are nested, so the primary relationship may be determined, and the corresponding optimization performed, based on the outermost query loop instruction (i.e., the query loop instruction of the first line). As described above, the first queried topological relationship in the outermost query loop instruction is usually the most frequently accessed, while topological relationships in nested inner query loop instructions are accessed relatively less often. Optimizing only the primary relationship determined from the outermost query loop instruction may therefore be more efficient than optimizing primary relationships determined from all query loop instructions.
At step 412, a plurality of element relationships satisfying the primary relationship are determined between the plurality of elements and the elements adjacent to each of them.
According to some embodiments, step 412 of determining the plurality of element relationships satisfying the primary relationship may include: determining the plurality of element relationships between the intra-block elements of the data block containing the target elements and the elements adjacent to each of those intra-block elements. When processing the data block containing the target elements, accesses to element relationships are in most cases made inside that data block. Restricting the determination to element relationships within the data block to be processed therefore reduces the number of element relationships whose reads need to be optimized, with little effect on the optimization benefit.
At step 414, the plurality of element relationships are loaded into a processor cache.
According to some embodiments, the plurality of element relationships may be kept in the processor cache until execution of the query loop instruction ends. As described above, retaining these element relationships in the processor cache allows them to be read quickly during execution of the query loop instruction and its corresponding data processing instructions, improving the processing speed of the data block to be processed and of the grid data to be processed.
According to some embodiments, relationships in a mesh-for other than the primary relationship may be determined as secondary relationships, and the element relationships satisfying them as secondary element relationships. Secondary element relationships may also be loaded and kept in the processor cache, but because they are read relatively rarely, the benefit of caching them is limited. In some embodiments, the operation data of the corresponding element (e.g., the to end of a secondary relationship) may be handled such that, when that operation data is cached, the secondary element relationships having that element as their to end are cached along with it.
When processing the data block to be processed, frequently read operation data can be identified from the data processing instructions and loaded into the cache, avoiding the heavy cost of reading it from memory or from an external data storage unit.
At step 416, operation data associated with the plurality of element relationships is determined based on the data processing instructions.
According to some embodiments, the operation data may be, for example, operation data associated with the second elements of the plurality of element relationships. Within one mesh-for, the first element of the primary relationship (typically the queried element) may be accessed only once, while a second element may be accessed many times, so caching the operation data associated with the second elements improves data locality. It will be appreciated that if a first element is also accessed many times, its associated operation data may be cached as well, without limitation.
In one exemplary embodiment, where the primary relationship is a "point-face" relationship, the operation data may be, for example, the normals of the corresponding faces. In another exemplary embodiment, where the primary relationship is a "volume-point" relationship, the operation data may be, for example, the positions, velocities, etc. of the corresponding points.
At step 418, the operation data is loaded into the processor cache.
According to some embodiments, before execution of the query loop instruction ends, the operation data has a first priority in the processor cache that is higher than the default priority, i.e., the priority that general data has when loaded into the processor cache. When the processor cache frees space, low-priority data may be released first; therefore, during data processing, operation data with the higher priority is less likely to be pushed out of the processor cache by data at the default priority and can remain cached for reading over a long period.
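A simplified sketch of such priority-aware eviction; the policy details are assumptions, not the patent's exact mechanism. Entries at the default priority are released before higher-priority operation data:

```python
DEFAULT, FIRST_PRIORITY = 0, 1

class PriorityCache:
    def __init__(self, capacity):
        self.capacity, self.entries = capacity, {}  # key -> (priority, data)

    def load(self, key, data, priority=DEFAULT):
        while len(self.entries) >= self.capacity:
            # evict the lowest-priority entry; same-or-lower priority data
            # may be pushed out, higher-priority data never is
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            if self.entries[victim][0] > priority:
                return False        # nothing evictable at this priority
            del self.entries[victim]
        self.entries[key] = (priority, data)
        return True

cache = PriorityCache(capacity=2)
cache.load("general", b"...")                    # default priority
cache.load("op_data", b"...", FIRST_PRIORITY)    # kept until the loop ends
cache.load("more", b"...")                       # evicts "general" first
assert "op_data" in cache.entries
```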
According to some embodiments, a dedicated area may be allocated in the processor cache for data that needs to be retained there, and such data may be loaded into the dedicated area. The dedicated area may also hold data at the default priority, but that data may be released when higher-priority data is loaded. When the dedicated area is already fully occupied by data that must be retained, new data that needs to be loaded and retained may push out data of the same or lower priority. It will be appreciated that those skilled in the art may set a static capacity for the dedicated area, design a scheme to adjust it dynamically, or resolve conflicts among the various kinds of data needing to be loaded and retained and data at the default priority in other ways, without limitation.
In step 420, in response to receiving a data cache instruction in the data processing instructions, target data indicated by the data cache instruction is loaded into the processor cache.
According to some embodiments, the data cache instruction may be, for example, a user-input instruction to load the indicated target data into the processor cache. According to some embodiments, before execution of the query loop instruction ends, the target data has a second priority in the processor cache that is higher than the default priority, so the target data can remain cached for reading over a long period.
According to some embodiments, the second priority may be higher than the first priority; that is, the target data indicated by the data cache instruction has a higher priority than operation data determined from the data processing instructions, ensuring that, for example, user-specified data is not pushed out by operation data determined in other ways. It is understood that the second priority may instead be equal to or lower than the first priority, without limitation.
Referring to fig. 5, the operations of steps 502-516 in method 500 are similar to the operations of steps 402-416 in method 400 and are not described in detail herein.
At step 518, at least a portion of the operation data on which atomic write operations (atomic write operations) are performed is determined.
An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism: once started, it runs to completion without switching to another thread. An atomic write operation is an atomic operation that writes data. According to some embodiments, the atomic write operation may include, for example, "+=", i.e., adding the value on the right of "+=" to the operand data on the left and writing the sum back to that operand. It is understood that atomic write operations may also include other operations, without limitation.
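A minimal sketch of an atomic write in the "+=" style, using a Python lock to stand in for a hardware atomic add; on a GPU this would map to a native atomic instruction:

```python
import threading

class AtomicFloat:
    def __init__(self, value=0.0):
        self._value, self._lock = value, threading.Lock()

    def add(self, delta):
        # the read-modify-write completes without interleaving other writers
        with self._lock:
            self._value += delta
            return self._value

acc = AtomicFloat()
threads = [threading.Thread(target=lambda: acc.add(1.0)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert acc._value == 8.0   # no increments lost despite concurrent writers
```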
At step 520, the at least a portion of the operation data is loaded into the processor cache.
According to some embodiments, the at least a portion of the operation data may have a priority in the processor cache higher than the default priority before execution of the query loop instruction ends. Operation data on which atomic write operations are performed is typically accessed frequently, so it may be loaded into the processor cache with a priority above the default.
According to some embodiments, step 408 of loading the data block to be processed into the processor cache may include: loading the index mappings of the intra-block elements into the processor cache. The index mappings of the intra-block elements may be kept in the processor cache until processing of the data block to be processed is complete; loading and retaining them in the processor cache further improves data locality. In some embodiments, for each element in the grid data to be processed there is a unique correspondence between its local index as an intra-block element and its global index, so the global index of an element can be represented as its local index plus an offset of the corresponding data block. Such an offset can therefore serve as the index mapping.
According to some embodiments, the index mappings may include local-to-global index mappings and may further include global-to-local index mappings. In one exemplary embodiment, both local-to-global and global-to-local index mappings may be generated, loaded, and kept in the processor cache.
According to some embodiments, step 408 of loading the data block to be processed into the processor cache may further include: loading the index mappings of the neighborhood elements of the data block to be processed into the processor cache, where they are retained until processing of the data block is complete. Loading and retaining the index mappings of the neighborhood elements in the processor cache further improves data locality.
According to some embodiments, the total amount of data, including the data block to be processed, the plurality of element relationships, the operation data, the target data, the index mappings of the intra-block elements of the data block to be processed, the neighborhood elements of the data block, and the index mappings of those neighborhood elements, may be smaller than the capacity of the processor cache. In embodiments where a dedicated area is allocated in the processor cache for such data, the total amount of such data may be smaller than the capacity of the dedicated area.
In summary, unstructured grid data to be processed is partitioned into at least one data block based on a block size determined according to the capacity of a processor cache of a computing device. Each data block is no larger than the processor cache and includes a plurality of intra-block elements (e.g., points, edges, faces, or higher-dimensional elements of the grid to be processed, such as tetrahedra). After the data blocks are obtained, a data block to be processed is determined among them, loaded into the processor cache, and retained there until its processing is complete, so that the intra-block elements can be read directly from the processor cache throughout processing, avoiding the extra time of re-reading and re-loading from memory, or even from a data storage unit, after eviction. The processor cache is thus utilized to the greatest extent, cache utilization is improved, memory read/write overhead is reduced, and the efficiency of processing unstructured grid data is improved.
As described above, various types of processors may be used to process the grid data to be processed, and such a processor (e.g., a CPU or GPU) may include a plurality of processing units together with a plurality of processing-unit caches corresponding to them. In one exemplary embodiment, the processor is a GPU comprising a plurality of thread blocks (Thread Block, i.e., processing units), each of which has a shared memory (Shared Memory, i.e., a processing-unit cache) shared by a plurality of threads (Thread).
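A schematic sketch, with plain Python threads standing in for GPU dispatch, of the mapping described here: one data block per processing unit, whose threads share that unit's cache; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def process_block(block):
    shared = {"cached": list(block)}   # stand-in for per-unit shared memory
    # the threads of this processing unit would all read shared["cached"]
    return sum(shared["cached"])       # placeholder work over intra-block data

data_blocks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
with ThreadPoolExecutor() as pool:     # each worker plays one processing unit
    results = list(pool.map(process_block, data_blocks))
assert results == [6, 9, 30]
```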
The present disclosure further optimizes the processing of the grid data to be processed using such a processor having a plurality of processing units, enabling parallel execution of query loop instructions, as will be described below.
According to some embodiments, the data block size may be smaller than the capacity of the processing-unit cache, so that a data block can be loaded into the processing-unit cache in its entirety. It will be appreciated that the foregoing description of the processor cache applies directly to the processing-unit cache: for example, the corresponding element relationships, operation data, target data, etc. may be loaded into the corresponding processing-unit cache, and data that needs to be loaded and retained in the processor cache may be loaded and retained in the corresponding processing-unit cache, which is not repeated here.
According to some embodiments, the query loop instruction may query a plurality of target elements, and the plurality of target elements may belong to at least one data block to be processed. In one exemplary embodiment, some of the target elements belong to the same data block to be processed. In another exemplary embodiment, the target elements belong to a plurality of different data blocks to be processed.
According to some embodiments, each of the at least one data block to be processed may be loaded into the processing-unit cache of exactly one of the at least one processing unit. The same data block to be processed is therefore never loaded into the caches of several processing units, saving cache space.
Referring to fig. 6, the operations of steps 602-608 in method 600 are similar to the operations of steps 102-108 in method 100, and are not described in detail herein.
According to some embodiments, step 608 of loading the data block to be processed into the processor cache may include: loading the at least one data block to be processed into at least one of the plurality of processing-unit caches.
At step 610, the query loop instruction is executed in parallel by the at least one processing unit corresponding to the at least one processing-unit cache.
Therefore, by loading the at least one data block to be processed into the at least one processing unit cache, each processing unit can quickly read the intra-block elements of its corresponding data block to be processed from its processing unit cache while processing that data block. Parallel processing of the at least one data block to be processed is thereby achieved, and the processing of each data block to be processed is accelerated by the processing unit cache.
Each processing unit of the at least one processing unit corresponding to the at least one processing unit cache executes the data processing instruction corresponding to the query cycle instruction on the target elements in its corresponding data block to be processed, so that the query cycle instruction is executed in parallel and the processing efficiency of the grid data to be processed is improved.
According to some embodiments, step 610, executing the query cycle instruction in parallel, may include: for each of the at least one processing unit, executing the data processing instruction with that processing unit for at least a portion of the plurality of target elements. The at least a portion of the target elements may be the target elements included in the data block to be processed that is loaded in the processing unit cache corresponding to that processing unit.
Because each processing unit executes the data processing instruction for every target element in the data block to be processed loaded into its processing unit cache, that data block needs to be loaded into the processing unit cache only once while the data processing instruction is executed for all of the target elements it includes. Cache utilization is thereby further improved, and the time spent repeatedly writing the data block into the cache is reduced.
Since the at least one processing unit processes the target elements of the respective data blocks to be processed in parallel, and assuming that the proportion of target elements is similar across data blocks, the smaller the variance of the sizes of the data blocks to be processed, the shorter the total processing time for the at least one data block to be processed. Therefore, when the grid data to be processed is partitioned, the variance between data block sizes can be kept as small as possible, so as to improve the processing efficiency of the grid data to be processed.
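The effect of block-size variance on total processing time can be illustrated with a small scheduling sketch. The Python below assumes processing time proportional to block size and uses fabricated block sizes; it illustrates the argument above rather than any part of the disclosed method.

```python
# Hedged sketch: low variance in block sizes shortens the parallel makespan.
def makespan(block_sizes, num_units):
    """Greedy longest-processing-time assignment of blocks to processing units."""
    loads = [0] * num_units
    for size in sorted(block_sizes, reverse=True):
        loads[loads.index(min(loads))] += size   # give block to least-loaded unit
    return max(loads)

even   = [100, 100, 100, 100]   # low variance
skewed = [10, 30, 60, 300]      # same total work, high variance
print(makespan(even, 2), makespan(skewed, 2))   # 200 vs 300
```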
According to some embodiments, step 610, executing the query cycle instruction in parallel, may further include: executing, by a plurality of threads included in a processing unit, the data processing instruction in parallel on the target elements in the corresponding data block to be processed.
Thus, by having the plurality of threads in a processing unit execute the data processing instruction in parallel on the target elements in a data block to be processed, the processing efficiency of the data block to be processed, and of the grid data to be processed, can be further improved.
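The two-level parallelism described above (data blocks across processing units, target elements across the threads of one unit) can be sketched as follows. Thread pools stand in for GPU thread blocks, a local copy stands in for the processing unit cache, and the kernel body is a placeholder; all of these are illustrative assumptions, not the disclosed implementation.

```python
# Hedged sketch of block-level and thread-level parallel execution.
from concurrent.futures import ThreadPoolExecutor

def process_block(block, threads_per_unit=4):
    """One processing unit: its block stays resident while several threads
    run the data processing instruction over the block's target elements."""
    cached = dict(block)                  # stands in for the processing unit cache
    targets = cached["targets"]
    def kernel(elem):                     # placeholder data processing instruction
        return elem * 2
    with ThreadPoolExecutor(threads_per_unit) as threads:
        return list(threads.map(kernel, targets))

blocks = [{"targets": [1, 2, 3]}, {"targets": [4, 5]}]
with ThreadPoolExecutor(len(blocks)) as units:    # one "unit" per block to process
    print(list(units.map(process_block, blocks)))  # [[2, 4, 6], [8, 10]]
```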
Fig. 7 is a schematic block diagram illustrating an apparatus 700 for processing unstructured grid data according to an example embodiment. Referring to fig. 7, the apparatus 700 includes: an acquisition module 702 configured to acquire grid data to be processed, wherein the grid data to be processed includes a plurality of elements; a partitioning module 704 configured to partition the grid data to be processed based on a data block size to obtain at least one data block, wherein the data block size is determined according to the capacity of a processor cache of the computing device and is smaller than that capacity, and wherein each of the at least one data block comprises a plurality of intra-block elements that are at least a portion of the plurality of elements; a first determination module 706 configured to determine a data block to be processed among the at least one data block; and a loading module 708 configured to load the data block to be processed into the processor cache, wherein all of the plurality of intra-block elements included in the data block to be processed are retained in the processor cache before processing of the data block to be processed is completed.
Thus, at least one data block can be obtained by partitioning the unstructured grid data to be processed based on a partition size determined according to the capacity of a processor cache of the computing device. Each data block is no larger than the processor cache capacity and includes a plurality of intra-block elements (e.g., points, edges, faces, or higher-dimensional elements of the grid to be processed, such as tetrahedra). After the data blocks are obtained, a data block to be processed can be determined among them, loaded into the processor cache, and retained there until processing of that data block is complete. The intra-block elements can therefore be read directly from the processor cache while the data block is being processed, avoiding the extra time that would be spent re-reading them from memory, or even from a data storage unit, if they had been evicted from the cache. The processor cache is thus used to the fullest extent, cache utilization is improved, memory read/write overhead is reduced, and the processing efficiency of unstructured grid data is improved.
According to some embodiments, each data block resulting from the partitioning of the grid data to be processed may include at least a portion of the elements in the grid data to be processed, which may be referred to as the intra-block elements (owned elements) of the corresponding data block. The partitioning of the mesh data to be processed may be based, for example, on the structure of the mesh to be processed. In some embodiments, each of the plurality of intra-block elements may be adjacent to at least one other element among the plurality of intra-block elements. In other words, each data block is a contiguous subset of the mesh to be processed. In this way, frequent reads of element data and of the relationships between elements within the same data block enjoy better data locality when the grid data is processed, which improves the data processing speed.
According to some embodiments, each data block obtained by partitioning the grid data to be processed may be no larger than the data block size, and may be as close as possible to it, so as to make full use of the processor cache. The sizes of these data blocks may also have a small variance, which further helps to use the processor cache fully and, as described below, makes parallel processing of the data blocks more efficient.
It is understood that those skilled in the art may use any existing partitioning algorithm, or design one as required, to partition the mesh data to be processed; this is not limited herein.
According to some embodiments, when the grid data to be processed is partitioned, the highest-dimension elements included in the grid data to be processed may be partitioned first, and the lower-dimension elements included in the highest-dimension elements may then be added to the data blocks in which the corresponding highest-dimension elements are located. In some embodiments, no two data blocks of the at least one data block contain the same intra-block element; that is, each element belongs to exactly one data block as an intra-block element. In one exemplary embodiment, each highest-dimension element is uniquely assigned to a data block, while a lower-dimension element shared by several higher-dimension elements may be randomly assigned to one of the corresponding data blocks.
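A minimal sketch of this partitioning order, assuming a toy tetrahedral mesh and a fabricated block size; the random assignment of shared vertices mirrors the random assignment described above, and none of the numbers come from the disclosure.

```python
# Hedged sketch: block highest-dimension elements first, then attach vertices.
import random
random.seed(0)  # deterministic output for the example

tets = {0: [0, 1, 2, 3], 1: [1, 2, 3, 4], 2: [2, 3, 4, 5]}  # tet -> its vertices
BLOCK_SIZE = 2  # illustrative number of tets per block

# 1) Partition the highest-dimension elements (tetrahedra) into blocks.
tet_ids = sorted(tets)
blocks = [tet_ids[i:i + BLOCK_SIZE] for i in range(0, len(tet_ids), BLOCK_SIZE)]

# 2) Each vertex joins one block among those of the tets that contain it.
candidates = {}
for b, members in enumerate(blocks):
    for t in members:
        for v in tets[t]:
            candidates.setdefault(v, set()).add(b)

vertex_block = {v: random.choice(sorted(cands)) for v, cands in candidates.items()}
print(blocks, vertex_block)
```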
According to some embodiments, each element may have a unique global index in the grid data to be processed. After the grid data to be processed is divided into a plurality of data blocks, the elements of each data block can be cached contiguously to further improve data locality; the elements within a data block can then be addressed by local indices so that the cached block is read as consecutive data. According to some embodiments, as shown in fig. 8, a partitioning module 800 may include: a first local index sub-module 802 configured to determine, for each of the at least one data block, a local index for each of the plurality of intra-block elements included in that data block; and a first map generation sub-module 804 configured to determine an index mapping relationship between the local index of each of the plurality of intra-block elements and the global index of that element in the grid data to be processed. It is understood that the operation of the partitioning module 800 in fig. 8 is similar to that of the partitioning module 704 in fig. 7, and will not be repeated here.
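A minimal sketch of the two index mappings for in-block elements, using fabricated block contents; the dictionary representation is an illustrative choice, not the disclosed data layout.

```python
# Hedged sketch: local <-> global index maps for intra-block elements.
blocks = [[7, 2, 9], [4, 11]]           # global indices of each block's elements

local_to_global = {}                    # (block_id, local index) -> global index
global_to_local = {}                    # global index -> (block_id, local index)
for b, elems in enumerate(blocks):
    for local, g in enumerate(elems):
        local_to_global[(b, local)] = g
        global_to_local[g] = (b, local)

assert local_to_global[(0, 1)] == 2 and global_to_local[11] == (1, 1)
```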
When acquiring data related to intra-block elements located at the edge of a data block, elements in other data blocks will most likely need to be accessed, so determining the neighborhood elements of a data block makes processing the block more convenient, as described below. According to some embodiments, the partitioning module 800 may further include: a neighborhood determination sub-module 806 configured to determine, for each of the at least one data block, the elements that are adjacent to the intra-block elements of the data block but do not belong to the data block as the neighborhood elements of the data block. Such neighborhood elements can be regarded as the one-ring neighbors of the data block. In some embodiments, the two-ring neighbors (i.e., the neighbors of the neighbors of the intra-block elements) or higher-ring neighbors may also be determined as neighborhood elements of the corresponding data block; this is not limited herein. It will be appreciated that each element belongs to exactly one data block as an intra-block element but may belong to several different data blocks as a neighborhood element; the data block in which an element is an intra-block element can therefore be regarded as the data block where that element is located.
According to some embodiments, a partitioning approach may be chosen that gives each data block fewer neighborhood elements. The fewer the neighborhood elements, the higher the proportion of intra-block elements in the data block, and the higher the cache utilization.
According to some embodiments, to further enhance data locality, local indices may also be assigned to the neighborhood elements of a data block. According to some embodiments, the partitioning module 800 may further include: a second local index sub-module 808 configured to determine, for each of the at least one data block, a local index for each neighborhood element of the data block; and a second map generation sub-module 810 configured to determine an index mapping relationship between the local index of the neighborhood element and the global index of the neighborhood element in the grid data to be processed. Thus, both the intra-block elements and the neighborhood elements of a data block are provided with local indices.
According to some embodiments, the local indices of intra-block elements and neighborhood elements may be distinguished. In some embodiments, when determining the indices of intra-block elements and neighborhood elements, smaller index values may be assigned to intra-block elements and larger index values to neighborhood elements. In one exemplary embodiment, two variables, "num_owned" and "num_total", may be used to track the number of intra-block elements and the total number of elements in each data block, and these two variables may be used to determine whether a given element is an intra-block element or a neighborhood element of that data block.
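A minimal sketch of this indexing scheme, with fabricated adjacency data: owned (intra-block) elements receive the smaller local indices, neighborhood elements the larger ones, and a single comparison against num_owned classifies an element.

```python
# Hedged sketch: local indexing of owned vs. neighborhood (ghost) elements.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}   # element -> adjacent elements
block = [0, 1]                                        # intra-block (owned) elements

owned = set(block)
neighbors = sorted({n for e in block for n in adjacency[e]} - owned)

local_order = block + neighbors      # owned first, then neighborhood elements
num_owned, num_total = len(block), len(local_order)

def is_owned(local_index):
    return local_index < num_owned

# Local indices 0, 1 are owned; local indices 2, 3 (elements 2 and 3) are ghosts.
assert is_owned(1) and not is_owned(2)
```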
According to some embodiments, the data block to be processed may be determined, for example, based on an instruction to be executed. The instruction to be executed may be an instruction input by a user, an instruction generated by a machine, or an instruction stored in a storage medium, which is not limited herein. In some embodiments, the data block in which the element data indicated by the instruction to be executed is located may be taken as the data block to be processed.
According to some embodiments, the first determination module 706 may be further configured to: in response to receiving a query cycle instruction, determine the data block in which the target element queried by the query cycle instruction is located as the data block to be processed. The query cycle instruction is used, for example, to query target elements that satisfy a preset condition and to instruct execution of the data processing instruction corresponding to the query cycle instruction on those target elements. The preset condition may be read from the query cycle instruction and typically designates a series of target elements that have a neighborhood, containment, or other relationship with a particular element or grid; these target elements typically have the same dimension, as described below.
According to some embodiments, the plurality of intra-block elements included in the data block to be processed may all be retained in the processor cache until processing of the data block to be processed is complete. In some embodiments, processing of the data block to be processed is complete when, for example, all instructions that direct data processing on elements included in the data block have finished executing. Before the instructions to be executed are run, they may first be analyzed to determine the data block to be processed; the instructions that process the data block to be processed, or elements within it, are then identified, and the data block is retained in the processor cache until execution of those processing instructions has finished. It will be appreciated that, among the instructions to be executed, the instructions that directly process the data block to be processed or its elements may not be contiguous, so the data block may be retained in the processor cache until execution of the last such processing instruction has ended. In an exemplary embodiment, as in the example instructions shown in fig. 3, the instruction on the fifth line performs further computation on an intermediate result without directly processing the data block to be processed or its elements; but because the seventh line afterwards processes an element in the data block to be processed, the corresponding data block may be retained in the processor cache until execution of the seventh line completes.
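The retention rule can be sketched as a last-use computation over an instruction stream. The stream below is fabricated and stands in for the analysis described above; it only illustrates that a block may be released after the last instruction that touches it.

```python
# Hedged sketch: keep a block resident until the last instruction that uses it.
instructions = [
    ("load",    "B0"),
    ("compute", None),    # intermediate arithmetic, no direct block access
    ("process", "B0"),    # last direct use of block B0
    ("process", "B1"),
]

last_use = {}
for i, (_, block) in enumerate(instructions):
    if block is not None:
        last_use[block] = i

# Block B0 may be released from the cache only after instruction 2 finishes.
assert last_use["B0"] == 2
```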
According to some embodiments, the loading module 708 may be further configured to: load the neighborhood elements of the data block to be processed into the processor cache. As described above, the neighborhood elements will most likely need to be accessed when acquiring data related to intra-block elements located at the edge of the data block; loading and retaining them in the processor cache therefore allows them to be accessed quickly, improving the efficiency of processing the data block to be processed.
In one exemplary embodiment, the plurality of intra-block elements included in the data block to be processed may all be retained in the processor cache until execution of the query cycle instruction ends. The query cycle instruction instructs execution of its corresponding data processing instruction on the target element, and the target element and its related elements usually need to be read while the data processing instruction executes. Taking the data block containing the target element as the data block to be processed, and retaining it in the processor cache until execution of the data processing instruction ends, therefore improves processor cache utilization and avoids the extra time that re-reading the elements of the data block from memory, or even from a data storage unit, would cost if they had been evicted from the cache.
It will be appreciated that the end of execution of the query loop instruction may be the end of execution of the data processing instruction for the target element currently being processed, or the end of execution of the data processing instruction for all the target elements being queried, which is not limited herein.
In the process of processing the grid data to be processed, specific topological relations among elements can be frequently accessed according to different data processing tasks. Of these topological relationships, one or more relationships that are most commonly accessed may be referred to as primary relationships. By determining the primary relationships, element relationships that satisfy the primary relationships can be determined among the plurality of elements in the grid data to be processed, and further reading of these element relationships can be optimized, thereby improving the efficiency of processing the grid data to be processed, as will be described below.
Referring to fig. 9, the operations of the modules 902-908 in the apparatus 900 for processing unstructured grid data are similar to the operations of the modules 702-708 in the apparatus 700, and are not described in detail herein.
According to some embodiments, as shown in fig. 9, the apparatus 900 may further include: a second determination module 910 configured to determine, in response to receiving the query cycle instruction, a primary relationship based at least on a dimension of a target element queried by the query cycle instruction.
According to some embodiments, the primary relationship may indicate an association of a first element having a first dimension with a second element, adjacent to the first element, having a second dimension. In some embodiments, for example, the dimension of the target element may be taken as the first dimension, and the second dimension may then be determined. In one exemplary embodiment, the target element is a point, and the primary relationship may be, for example, at least one of a "point-point" relationship, a "point-edge" relationship, a "point-face" relationship, and a "point-volume" relationship. In other embodiments, the dimension of the target element may instead be taken as the second dimension, and the first dimension may then be determined. In another exemplary embodiment, the target element is a point, and the primary relationship may be, for example, at least one of a "point-point" relationship, an "edge-point" relationship, a "face-point" relationship, and a "volume-point" relationship. It will be appreciated that those skilled in the art may also determine the primary relationship from the dimension of the target element in other ways, which is not limited herein.
According to some embodiments, the second determination module 910 may be further configured to: determine the dimension of the target element as the first dimension; and determine the dimension of the element queried by the first query in the data processing instruction as the second dimension. Typically, the first pair of topological relationships in a mesh-for (i.e., the relationship between the target elements queried by the mesh-for and the elements queried by the first query in the data processing instruction) is the most commonly accessed relationship, so determining it as the primary relationship further improves the optimization of reading element relationships that satisfy the primary relationship.
The first query in the data query instruction may be, for example, a query for a particular element related to the target element, typically a particular element adjacent to the target element. In one exemplary embodiment, the target element queried by the query cycle instruction is an edge e, and the first query in the corresponding data query instruction may be "v = e.vertices[0]", i.e., a query for a particular point adjacent to (i.e., included by) the target element e. Alternatively or additionally, the first query in the data query instruction may itself be a query cycle instruction over elements related to the target element. In another exemplary embodiment, the target element queried by the query cycle instruction is a point u, and the first query in the corresponding data query instruction may be the query loop "for e in u.edges", i.e., a query (or traversal) over the edges adjacent to (i.e., connected to) the target element u.
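The two first-query patterns can be restated concretely with a toy mesh; the attribute names vertices and edges follow the snippets above, but the classes and mesh are illustrative assumptions, not a documented API.

```python
# Hedged sketch: the two first-query patterns on a toy mesh.
class Edge:
    def __init__(self, vertices):
        self.vertices = vertices          # the points this edge connects

class Vertex:
    def __init__(self):
        self.edges = []                   # edges incident to this point

v0, v1, v2 = Vertex(), Vertex(), Vertex()
edges = [Edge([v0, v1]), Edge([v1, v2])]
for e in edges:                           # build vertex -> edge incidence
    for v in e.vertices:
        v.edges.append(e)

# Pattern 1: target is an edge; first query reads a vertex ("edge-point").
for e in edges:
    v = e.vertices[0]

# Pattern 2: target is a point; first query loops over incident edges ("point-edge").
for u in (v0, v1, v2):
    for e in u.edges:
        pass

assert len(v1.edges) == 2                 # v1 touches both edges
```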
According to some embodiments, the query cycle instruction may be the outermost query cycle instruction of a plurality of query cycle instructions having a nested relationship. As described above, the first pair of topological relationships in the outermost query cycle instruction is typically the most frequently accessed, while the topological relationships in nested inner query cycle instructions are accessed relatively less often. Optimizing only the primary relationship determined from the outermost query cycle instruction can therefore be more efficient than optimizing the primary relationships determined from all query cycle instructions.
According to some embodiments, as shown in fig. 9, the apparatus 900 may further include a third determination module 912 configured to determine a plurality of element relationships that satisfy the primary relationship between the plurality of elements and elements that are respectively adjacent to the plurality of elements.
According to some embodiments, the third determination module 912 may be further configured to: determine the plurality of element relationships between the plurality of intra-block elements included in the data block in which the target element is located and the elements adjacent to each of those intra-block elements. When the data block to be processed containing the target element is processed, accesses to element relationships occur in most cases inside that data block. Determining only the element relationships within the data block to be processed therefore reduces the number of element relationships whose reading needs to be optimized, without greatly affecting the optimization.
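A minimal sketch of restricting element relationships to one data block, assuming a "face-point" primary relationship and fabricated face/vertex data; only the relationships of this block would be loaded into the cache.

```python
# Hedged sketch: element relationships confined to a single data block.
faces_in_block = {10: [0, 1, 2], 11: [1, 2, 3]}   # face -> its adjacent vertices

element_relationships = [(f, v) for f, verts in faces_in_block.items()
                         for v in verts]

# Only these pairs are candidates for caching; relationships of other
# blocks stay in memory.
print(element_relationships)   # [(10, 0), (10, 1), (10, 2), (11, 1), (11, 2), (11, 3)]
```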
According to some embodiments, the loading module 908 may be further configured to load the plurality of element relationships into the processor cache.
According to some embodiments, the plurality of element relationships may be maintained in the processor cache until the end of the query cycle instruction execution. As described above, by retaining these element relationships in the processor cache, these element relationships can be quickly read during execution of the query cycle instruction and its corresponding data processing instructions, thereby improving the processing speed of the data blocks to be processed and the grid data to be processed.
When processing the data block to be processed, frequently read operational data can be identified from the data processing instruction and loaded into the cache, avoiding the heavy time cost of reading operational data from memory or from an external data storage unit.
According to some embodiments, as shown in fig. 9, the apparatus 900 may further include a fourth determination module 914 configured to determine operational data associated with the plurality of element relationships based on the data processing instruction.
According to some embodiments, the operational data may be, for example, operational data associated with the second elements of the plurality of element relationships. In one mesh-for, the first element of a primary relationship (typically the queried target element) may be accessed only once, while the second element may be accessed many times. Caching the operational data associated with the second elements therefore improves data locality. It will be appreciated that if the first elements are also accessed many times, the operational data associated with them may be cached as well; this is not limited herein.
In one exemplary embodiment, where the primary relationship is a "point-face" relationship, the operational data may be, for example, the normals of the corresponding faces. In another exemplary embodiment, where the primary relationship is a "volume-point" relationship, the operational data may be, for example, the positions, velocities, etc. of the corresponding points.
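For the "point-face" case, the operational data could be gathered as below; the geometry is fabricated and the normal computation is the standard cross-product formula, shown only to make the notion of cacheable operational data concrete.

```python
# Hedged sketch: per-face normals as the operational data of a "point-face"
# primary relationship, gathered once so they can be loaded into the cache.
import numpy as np

vertices = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
faces = np.array([[0, 1, 2], [0, 1, 3]])          # vertex indices per face

def face_normal(tri):
    a, b, c = vertices[tri]
    n = np.cross(b - a, c - a)                    # standard triangle normal
    return n / np.linalg.norm(n)

# Operational data to cache: one normal per second element (face).
operational_data = {f: face_normal(tri) for f, tri in enumerate(faces)}
print(operational_data[0])                        # [0. 0. 1.]
```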
According to some embodiments, the loading module 908 may be further configured to load the operational data into the processor cache.
According to some embodiments, the operational data has a first priority in the processor cache that is higher than the default priority until execution of the query cycle instruction ends. The default priority is the priority that ordinary data receives when loaded into the processor cache. When the processor cache evicts data, low-priority data may be evicted first; during data processing, operational data with a higher priority is therefore less likely to be pushed out of the processor cache by data with the default priority, and can remain cached for reading over a long period.
According to some embodiments, a dedicated area may be allocated in the processor cache for data that needs to be retained there, and such data may be loaded into the dedicated area. The dedicated area may also hold data with the default priority, but such data may be evicted when higher-priority data is loaded. When the dedicated area is already fully occupied by data that must be retained in the processor cache, new data that needs to be loaded and retained may push out data of the same or lower priority. It will be appreciated that those skilled in the art may resolve conflicts among the various kinds of data that need to be loaded and retained in the processor cache and data with the default priority in other ways, which is not limited herein.
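The priority scheme can be sketched as a small eviction policy over a fixed-capacity dedicated area; the capacities and priority levels below are illustrative assumptions, not the disclosed cache mechanism.

```python
# Hedged sketch: priority-aware eviction in a fixed-capacity dedicated area.
DEFAULT, FIRST, SECOND = 0, 1, 2     # default < operational data < target data

class DedicatedArea:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}            # key -> priority

    def load(self, key, priority=DEFAULT):
        if len(self.entries) >= self.capacity:
            # Evict the lowest-priority entry, but only if its priority
            # does not exceed that of the incoming data.
            victim = min(self.entries, key=self.entries.get)
            if self.entries[victim] > priority:
                return False         # incoming data cannot displace anything
            del self.entries[victim]
        self.entries[key] = priority
        return True

area = DedicatedArea(capacity=2)
area.load("operational_data", FIRST)
area.load("target_data", SECOND)
assert not area.load("scratch", DEFAULT)   # default data cannot push either out
```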
According to some embodiments, the loading module 908 may be further configured to load target data indicated by a data cache instruction into the processor cache in response to receiving the data cache instruction among the data processing instructions.
According to some embodiments, the data cache instruction may be, for example, a user-entered instruction to load the indicated target data into the processor cache. According to some embodiments, the target data has a second priority in the processor cache that is higher than the default priority before the end of execution of the query cycle instruction. Thus, the target data may be cached in the processor cache for long periods of time for reading.
According to some embodiments, the second priority may be higher than the first priority. That is, the target data indicated by the data cache instruction has a higher priority than the operation data determined according to the data processing instruction, thereby ensuring that, for example, user-specified data is not pushed out by the operation data determined by other means. It is understood that the second priority may be equal to the first priority or lower than the first priority, which is not limited herein.
Referring to fig. 10, the operations of modules 1002-1012 in apparatus 1000 for processing unstructured grid data are similar to the operations of modules 902-912 in apparatus 900, and are not described in detail herein.
According to some embodiments, as shown in fig. 10, the apparatus 1000 may further include: a fifth determination module 1014 configured to determine operational data associated with the plurality of element relationships based on the data processing instruction; and a sixth determination module 1016 configured to determine at least a portion of the operational data that is subjected to an atomic write operation.
An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once it starts, it runs to completion without switching to another thread. An atomic write operation is an atomic operation that writes data. According to some embodiments, atomic write operations may include, for example, "+=": the value on the right of "+=" is added to the operand data on the left, and the sum is then written back to that operand. It is understood that atomic write operations may also include other operations, which is not limited herein.
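A minimal sketch of an atomic "+=" accumulation; a lock stands in for the hardware atomic-add instruction a GPU would use, which is an assumption about the execution environment rather than the disclosed implementation.

```python
# Hedged sketch: an uninterruptible read-modify-write over operational data.
from threading import Lock, Thread

operational_data = {"v0": 0.0}
lock = Lock()

def atomic_add(key, value):
    with lock:                        # the += below cannot be interleaved
        operational_data[key] += value

threads = [Thread(target=atomic_add, args=("v0", 1.0)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
assert operational_data["v0"] == 8.0  # no updates lost
```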
According to some embodiments, the loading module 908 may be further configured to load the at least a portion of the operational data into the processor cache.
According to some embodiments, the at least a portion of the operational data may have a priority in the processor cache that is higher than the default priority until execution of the query cycle instruction ends. Operational data subjected to atomic write operations is typically accessed frequently, so it may be loaded into the processor cache with a priority above the default.
According to some embodiments, the loading module 908 may be further configured to: load the index mapping relationships of each of the plurality of intra-block elements into the processor cache. The index mapping relationship of each of the plurality of intra-block elements may be retained in the processor cache until processing of the data block to be processed is complete. Loading and retaining the index mappings of the intra-block elements in the processor cache thus further improves data locality. In some embodiments, each element in the grid data to be processed has a unique correspondence between its local index as an intra-block element and its global index; the global index of an element can then be represented as its local index plus an offset for the data block in which it is an intra-block element, and this offset can serve as the index mapping relationship.
According to some embodiments, the index mapping relationships may include local-to-global index mappings and may further include global-to-local index mappings. In one exemplary embodiment, both local-to-global and global-to-local index mappings are generated, loaded into, and retained in the processor cache.
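When the intra-block elements of each block happen to be globally contiguous, both mappings reduce to per-block offsets, as in this minimal sketch with fabricated offsets.

```python
# Hedged sketch: offset-based local <-> global index mappings.
block_offsets = [0, 3, 8]            # first global index of each block (fabricated)

def to_global(block_id, local_index):
    return block_offsets[block_id] + local_index

def to_local(global_index):
    for b in reversed(range(len(block_offsets))):
        if global_index >= block_offsets[b]:
            return b, global_index - block_offsets[b]

assert to_global(1, 2) == 5 and to_local(5) == (1, 2)
```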
According to some embodiments, the loading module 908 may be further configured to: load the index mapping relationships of the neighborhood elements of the data block to be processed into the processor cache, where these index mapping relationships are retained in the processor cache until processing of the data block to be processed is complete. Loading and retaining the index mappings of the neighborhood elements in the processor cache thus further improves data locality.
According to some embodiments, the total amount of data comprising the data block to be processed, the plurality of element relationships, the operational data, the target data, the index mapping relationships of the intra-block elements included in the data block to be processed, the neighborhood elements of the data block to be processed, and the index mapping relationships of those neighborhood elements may be smaller than the capacity of the processor cache. In some embodiments, where a dedicated area is allocated in the processor cache for such data, their total amount may be smaller than the capacity of the dedicated area.
As previously described, various types of processors may be used to process the grid data to be processed, and each such processor (e.g., a CPU or GPU) may include multiple processing units together with multiple processing unit caches corresponding to those units. In one exemplary embodiment, the processor is a GPU that includes a plurality of thread blocks (i.e., processing units), each of which provides a shared memory (i.e., a processing unit cache) shared by a plurality of threads.
The present disclosure further optimizes the processing of the grid data to be processed on such a processor having a plurality of processing units, enabling query cycle instructions to be executed in parallel, as described below.
According to some embodiments, the data block size may be smaller than the capacity of the processing unit cache, so that a data block can be loaded into the processing unit cache in its entirety. It will be appreciated that the foregoing description of the processor cache applies directly to the processing unit cache: for example, the corresponding element relationships, operational data, target data, and so on may be loaded into the corresponding processing unit cache, and data that needs to be loaded into and retained in the processor cache may instead be loaded into and retained in the corresponding processing unit cache. This will not be repeated here.
According to some embodiments, the query cycle instruction may query a plurality of target elements, and the plurality of target elements may belong to at least one data block to be processed. In one exemplary embodiment, some of the plurality of target elements may belong to the same block of data to be processed. In another exemplary embodiment, the plurality of target elements respectively belong to a plurality of data blocks to be processed.
According to some embodiments, each of the at least one data block to be processed may be loaded into the processing unit cache of exactly one of the at least one processing unit. The same data block to be processed is thus never loaded into the processing unit caches of multiple processing units, which saves cache space.
Referring to fig. 11, the operations of modules 1102-1108 in apparatus 1100 for processing unstructured grid data are similar to the operations of modules 702-708 in apparatus 700, and are not described in detail herein.
According to some embodiments, the loading module 1108 is further configured to: at least one block of data to be processed is loaded into at least one of a plurality of processing unit caches.
According to some embodiments, as shown in fig. 11, the apparatus 1100 may further include a parallel execution module 1110 configured to execute the query cycle instruction in parallel using at least one processing unit corresponding to the at least one processing unit cache.
Therefore, by loading the at least one data block to be processed into the at least one processing unit cache, each processing unit can quickly read the intra-block elements of its corresponding data block to be processed from its processing unit cache while processing that data block. Parallel processing of the at least one data block to be processed is thereby achieved, and the processing of each data block to be processed is accelerated by the processing unit cache.
Each processing unit of the at least one processing unit corresponding to the at least one processing unit cache executes the data processing instruction corresponding to the query cycle instruction on the target elements in its corresponding data block to be processed, so that the query cycle instruction is executed in parallel and the processing efficiency of the grid data to be processed is improved.
According to some embodiments, the parallel execution module 1110 may be further configured to: for each of the at least one processing unit, execute the data processing instruction with that processing unit for at least a portion of the plurality of target elements. The at least a portion of the target elements may be the target elements included in the data block to be processed that is loaded in the processing unit cache corresponding to that processing unit.
Because each processing unit executes the data processing instruction for every target element in the data block to be processed loaded into its processing unit cache, that data block needs to be loaded into the processing unit cache only once while the data processing instruction is executed for all of the target elements it includes. Cache utilization is thereby further improved, and the time spent repeatedly writing the data block into the cache is reduced.
According to some embodiments, the parallel execution module 1110 may be further configured to: execute, using a plurality of threads included in a processing unit, the data processing instruction in parallel on the target elements in the corresponding data block to be processed.
Thus, by having the plurality of threads in a processing unit execute the data processing instruction in parallel on the target elements in a data block to be processed, the processing efficiency of the data block to be processed, and of the grid data to be processed, can be further improved.
It should be appreciated that the various modules of the apparatus 700 shown in fig. 7 may correspond to the various steps in the method 100 described with reference to fig. 1, the various modules of the apparatus 900 shown in fig. 9 may correspond to the various steps in the method 400 described with reference to fig. 4, the various modules of the apparatus 1000 shown in fig. 10 may correspond to the various steps in the method 500 described with reference to fig. 5, and the various modules of the apparatus 1100 shown in fig. 11 may correspond to the various steps in the method 600 described with reference to fig. 6. Thus, the operations, features and advantages described above with respect to method 100 apply equally to apparatus 700 and the modules comprising it, the operations, features and advantages described above with respect to method 400 apply equally to apparatus 900 and the modules comprising it, the operations, features and advantages described above with respect to method 500 apply equally to apparatus 1000 and the modules comprising it, and the operations, features and advantages described above with respect to method 600 apply equally to apparatus 1100 and the modules comprising it. For brevity, certain operations, features and advantages are not described in detail herein.
Although specific functions are discussed above with reference to specific modules, it should be noted that the functions of the various modules discussed herein may be divided into multiple modules and/or at least some of the functions of the multiple modules may be combined into a single module. The particular module performing the actions discussed herein includes the particular module itself performing the actions, or alternatively the particular module invoking or otherwise accessing another component or module that performs the actions (or performs the actions in conjunction with the particular module). Thus, a particular module that performs an action may include that particular module itself that performs the action and/or another module that the particular module invokes or otherwise accesses that performs the action. As used herein, the phrase "entity a initiates action B" may refer to entity a issuing an instruction to perform action B, but entity a itself does not necessarily perform that action B.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 7, 9, 10, and 11 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules may be implemented as hardware logic/circuitry.
According to an aspect of the present disclosure, an electronic device is provided that includes at least one processor and a memory communicatively coupled to the at least one processor. Each of the at least one processor includes a processor cache. The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer readable storage media, and computer program products are described below in connection with fig. 12.
Fig. 12 illustrates an example configuration of a computer device 1200 that may be used to implement the methods described herein. Each of the above-described apparatus 700, apparatus 900, apparatus 1000, and apparatus 1100 may also be implemented, in whole or at least in part, by computer device 1200 or a similar device or system.
Computer device 1200 may be any of a variety of different types of devices. Examples of computer device 1200 include, but are not limited to: a desktop, server, notebook, or netbook computer; a mobile device (e.g., a tablet, a cellular or other wireless telephone such as a smartphone, a notepad computer, a mobile station); a wearable device (e.g., glasses, a watch); an entertainment device (e.g., a set-top box communicatively coupled to a display device, a gaming machine); a television or other display device; an automotive computer; and so forth.
The computer device 1200 may include at least one processor 1202, memory 1204, communication interface(s) 1206, display device 1208, other input/output (I/O) devices 1210, and one or more mass storage devices 1212, capable of communicating with each other, such as through a system bus 1214 or other suitable connection.
The processor 1202 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 1202 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 1202 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 1204, mass storage device 1212, or other computer-readable medium, such as program code of the operating system 1216, program code of the application programs 1218, program code of other programs 1220, and the like.
Memory 1204 and mass storage device 1212 are examples of computer-readable storage media for storing instructions that are executed by processor 1202 to implement the various functions as previously described. For example, memory 1204 may generally include both volatile memory and non-volatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 1212 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. Memory 1204 and mass storage device 1212 may both be referred to herein as memory or computer readable storage media, and may be non-transitory media capable of storing computer readable, processor executable program instructions as computer program code that may be executed by processor 1202 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of programs may be stored on mass storage device 1212. These programs include an operating system 1216, one or more application programs 1218, other programs 1220, and program data 1222, and may be loaded into the memory 1204 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: method 100, method 400, method 500, and/or method 600 (including any suitable steps of method 100, method 400, method 500, method 600), and/or additional embodiments described herein.
Although illustrated in fig. 12 as being stored in memory 1204 of computer device 1200, modules 1216, 1218, 1220, and 1222, or portions thereof, may be implemented using any form of computer readable media accessible by computer device 1200. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer-readable storage media and communication media.
Computer-readable storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device. In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism. Computer-readable storage media as defined herein do not include communication media.
One or more communication interfaces 1206 are used to exchange data with other devices, such as via a network or a direct connection. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as an IEEE 802.11 Wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. The communication interface 1206 may facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable) and wireless networks (e.g., WLAN, cellular, satellite), the Internet, and so forth. The communication interface 1206 may also provide communication with external storage devices (not shown), such as in a storage array, network attached storage, or storage area network.
In some examples, a display device 1208, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 1210 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
The techniques described herein may be supported by these various configurations of computer device 1200 and are not limited to the specific examples of techniques described herein. For example, this functionality may also be implemented in whole or in part on a "cloud" using a distributed system. The cloud includes and/or represents a platform for the resource. The platform abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud. Resources may include applications and/or data that may be used when performing computing processing on servers remote from computer device 1200. Resources may also include services provided over the internet and/or over subscriber networks such as cellular or Wi-Fi networks. The platform may abstract resources and functions to connect computer device 1200 with other computer devices. Thus, implementations of the functionality described herein may be distributed throughout the cloud. For example, the functionality may be implemented in part on computer device 1200 and in part by a platform that abstracts the functionality of the cloud.
While the disclosure has been illustrated and described in detail in the drawings and the foregoing description, such illustration and description are to be considered illustrative and schematic rather than restrictive; the present disclosure is not limited to the disclosed embodiments. Variations of the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed, the indefinite article "a" or "an" does not exclude a plurality, the term "plurality" means two or more, and the term "based on" is to be interpreted as "based at least in part on". The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (52)

1. A method for processing unstructured grid data, comprising:
acquiring grid data to be processed, wherein the grid data to be processed comprises a plurality of elements;
partitioning the grid data to be processed based on a data chunk size to obtain at least one data chunk, wherein the data chunk size is determined according to a capacity of a processor cache of a computing device and the data chunk size is smaller than the capacity of the processor cache, and wherein each of the at least one data chunk comprises a plurality of intra-chunk elements that are at least a portion of the plurality of elements;
determining a data block to be processed in the at least one data block; and
loading the data block to be processed into the processor cache, wherein all of the plurality of intra-block elements included in the data block to be processed are retained in the processor cache before processing of the data block to be processed is completed.
2. The method of claim 1, wherein each element of the plurality of elements has a dimension, wherein the method further comprises:
in response to receiving a query cycle instruction, determining a primary relationship based at least on dimensions of a target element queried by the query cycle instruction, wherein the query cycle instruction is used for querying the target element meeting a preset condition and is used for indicating that a data processing instruction corresponding to the query cycle instruction is executed for the target element, and wherein the primary relationship indicates an association relationship of a first element with a first dimension to a second element with a second dimension adjacent to the first element;
determining a plurality of element relationships that satisfy the primary relationship between the plurality of elements and elements that are each adjacent to the plurality of elements; and
loading the plurality of element relationships into the processor cache, wherein the plurality of element relationships are retained in the processor cache until execution of the query cycle instruction ends.
3. The method of claim 2, wherein determining a primary relationship comprises:
determining a dimension of the target element as the first dimension; and
the dimension of the element queried by the first query in the data processing instruction is determined to be the second dimension.
4. The method of claim 2 or 3, wherein determining a plurality of element relationships between the plurality of elements and elements adjacent to each of the plurality of elements that satisfy the primary relationship comprises:
the plurality of element relationships are determined between a plurality of intra-block elements included in a data block in which the target element is located and elements adjacent to each of the plurality of intra-block elements.
5. The method of any of claims 2-4, wherein the query cycle instruction is an outermost query cycle instruction of a plurality of query cycle instructions having a nested relationship.
6. The method of any of claims 2-5, further comprising:
determining operational data associated with the plurality of element relationships based on the data processing instructions; and
and loading the operation data into the processor cache, wherein the operation data has a first priority higher than a default priority in the processor cache before the execution of the query cycle instruction is finished.
7. The method of claim 6, wherein the operational data is operational data associated with a second element of the plurality of element relationships.
8. The method of claim 6 or 7, further comprising:
and in response to receiving a data cache instruction in the data processing instruction, loading target data indicated by the data cache instruction into the processor cache, wherein the target data has a second priority higher than a default priority in the processor cache before the execution of the query cycle instruction is finished.
9. The method of claim 8, wherein the second priority is higher than the first priority.
10. The method of any of claims 2-9, further comprising:
determining operational data associated with the plurality of element relationships based on the data processing instructions;
determining at least a portion of the operation data that is subjected to an atomic write operation; and
and loading the at least a portion of the operational data into the processor cache, wherein the at least a portion of the operational data has a priority in the processor cache that is higher than a default priority before execution of the query cycle instruction ends.
11. The method of any of claims 1-10, wherein partitioning the grid data to be processed comprises:
determining, for each of the at least one data block, a local index for each of a plurality of intra-block elements included in the data block; and
determining an index mapping relationship between the local index of each of the plurality of intra-block elements and the global index of that element in the grid data to be processed.
12. The method of claim 11, wherein loading the block of data to be processed into the processor cache comprises:
and loading the index mapping relation of each element in the plurality of blocks into the processor cache, wherein the index mapping relation of each element in the plurality of blocks is reserved in the processor cache before the processing of the data block to be processed is completed.
13. The method of any of claims 1-12, wherein each of the plurality of intra-block elements is adjacent to at least one of the other elements of the plurality of intra-block elements.
14. The method of any of claims 1-13, wherein any two of the at least one data block do not contain the same intra-block element.
15. The method of any of claims 1-14, wherein partitioning the grid data to be processed further comprises:
for each of the at least one data block, determining an element adjacent to each of a plurality of intra-block elements included in the data block and not belonging to the data block as a neighborhood element of the data block,
wherein loading the data block to be processed into the processor cache comprises:
loading the neighborhood elements of the data block to be processed into the processor cache.
16. The method of claim 15, wherein partitioning the grid data to be processed further comprises:
for each of the at least one data block, determining a local index for a neighborhood element of the data block; and
determining an index mapping relationship between the local index of the neighborhood element and a global index of the neighborhood element in the grid data to be processed,
wherein loading the data block to be processed into the processor cache comprises:
loading the index mapping relationship of the neighborhood elements of the data block to be processed into the processor cache, wherein the index mapping relationship of the neighborhood elements of the data block to be processed is retained in the processor cache before the processing of the data block to be processed is completed.
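The neighborhood ("halo") elements of claims 15 and 16 can be pictured with the short sketch below, which assumes an adjacency table mapping each global element index to its adjacent elements; the function and variable names are hypothetical.

```python
def neighborhood_elements(block, adjacency):
    """Elements adjacent to an in-block element but outside the block."""
    in_block = set(block)
    halo = set()
    for element in block:
        for neighbor in adjacency[element]:
            if neighbor not in in_block:
                halo.add(neighbor)
    return sorted(halo)

# Usage: element 0 touches 1 and 2; element 1 touches 0 and 3.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(neighborhood_elements([0, 1], adjacency))  # [2, 3]
```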
17. The method of claim 16, wherein a sum of data amounts of the data block to be processed, the plurality of element relationships, the operation data, the target data, the index mapping relationships of the plurality of intra-block elements included in the data block to be processed, the neighborhood elements of the data block to be processed, and the index mapping relationships of the neighborhood elements of the data block to be processed is smaller than the capacity of the processor cache.
18. The method of any of claims 1-17, wherein determining a data block to be processed in the at least one data block comprises:
in response to receiving a query cycle instruction, determining a data block in which a target element queried by the query cycle instruction is located as the data block to be processed, wherein the query cycle instruction is used for querying a target element that meets a preset condition and for indicating that a data processing instruction corresponding to the query cycle instruction is to be executed on the target element,
and wherein, before execution of the query cycle instruction ends, all of the plurality of intra-block elements included in the data block to be processed are retained in the processor cache.
19. The method of claim 18, wherein the processor cache comprises a plurality of processing unit caches corresponding to a plurality of processing units in the computing device, wherein the data block size is smaller than a capacity of the processing unit caches, and wherein the target element comprises a plurality of target elements and the data block to be processed comprises at least one data block to be processed in which the plurality of target elements are located,
wherein loading the data block to be processed into the processor cache comprises:
loading the at least one data block to be processed into at least one processing unit cache of the plurality of processing unit caches,
and wherein the method further comprises:
executing the query cycle instruction in parallel by using at least one processing unit corresponding to the at least one processing unit cache.
20. The method of claim 19, wherein executing the query cycle instruction in parallel comprises:
for each processing unit of the at least one processing unit, executing the data processing instruction on at least a portion of the plurality of target elements by using the processing unit, wherein the at least a portion of the target elements are target elements included in the data block to be processed that is loaded in the processing unit cache corresponding to the processing unit.
21. The method of claim 19 or 20, wherein each of the plurality of processing units comprises a plurality of threads,
and wherein executing the query cycle instruction in parallel further comprises:
executing the data processing instruction in parallel for the at least a portion of the target elements by using the plurality of threads included in the processing unit.
22. The method of any of claims 19-21, wherein each of the at least one data block to be processed is loaded into exactly one processing unit cache of the at least one processing unit cache.
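A rough sketch of the parallel dispatch in claims 19-22 follows, assuming a GPU-like model in which each data block to be processed is assigned to exactly one processing unit with its own cache, and that unit's threads then execute the data processing instruction on the block's target elements. Python thread pools stand in for processing units and their threads purely for illustration; a real realization would more likely be GPU thread blocks with per-block shared memory (compare claim 27), and every name below is invented.

```python
from concurrent.futures import ThreadPoolExecutor

def run_query_loop(blocks_to_process, process_element, units=4):
    def run_on_unit(block):
        # "Load" the block into this unit's cache, then let the unit's
        # threads execute the data processing instruction per element.
        cached = list(block)  # stand-in for the processing unit cache
        with ThreadPoolExecutor() as threads:
            list(threads.map(process_element, cached))

    # Each block goes to exactly one "processing unit" (claim 22).
    with ThreadPoolExecutor(max_workers=units) as pool:
        list(pool.map(run_on_unit, blocks_to_process))

run_query_loop([[0, 1], [2, 3]], lambda e: print("processed", e))
```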
23. The method of any of claims 1-22, wherein the plurality of elements comprises a plurality of n-dimensional elements, for any integer n satisfying 0 ≤ n ≤ N, where N is a preset integer not less than 2.
24. The method of claim 23, wherein each k-dimensional element of the plurality of elements comprises k+1 (k-1)-dimensional elements, wherein k is any integer satisfying 0 < k ≤ N.
25. The method of claim 23 or 24, wherein N is 3.
26. The method of any of claims 1-25, wherein, for a third element of the plurality of elements and a fourth element different from the third element, the third element and the fourth element are adjacent in response to determining that one of the following is satisfied:
the dimensions of the third element and the fourth element are both k, and the third element and the fourth element comprise the same (k-1)-dimensional element, wherein k is an integer satisfying 0 < k ≤ N;
there is a complete containment relationship between the third element and the fourth element; and
the third element and the fourth element are two 0-dimensional elements included in one 1-dimensional element of the plurality of elements.
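To make the adjacency rules of claims 23-26 concrete, the sketch below represents each element as a frozenset of vertex identifiers, so that an element's dimension is its vertex count minus one; this matches claim 24's structure (a k-dimensional element has k+1 faces of dimension k-1, as in a simplicial mesh) but is purely an assumed encoding for illustration.

```python
def dimension(element):
    # A k-simplex has k+1 vertices under this assumed encoding.
    return len(element) - 1

def adjacent(a, b):
    if a == b:
        return False
    k = dimension(a)
    # Rule 1: same dimension k > 0, sharing a common (k-1)-dimensional
    # face, i.e. sharing k vertices (two triangles sharing an edge).
    if k == dimension(b) and k > 0 and len(a & b) == k:
        return True
    # Rule 2: complete containment (e.g. a vertex of a triangle).
    if a < b or b < a:
        return True
    # Rule 3 (two 0-dimensional elements joined by a 1-dimensional
    # element) would need the mesh's edge set and is omitted here.
    return False

tri1, tri2 = frozenset({1, 2, 3}), frozenset({2, 3, 4})
assert adjacent(tri1, tri2)            # share the edge {2, 3}
assert adjacent(frozenset({2}), tri1)  # containment
```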
27. The method of any of claims 1-26, wherein the processor is a graphics processor and the processor cache is a shared memory.
28. An apparatus for processing unstructured grid data, comprising:
an acquisition module configured to acquire grid data to be processed, wherein the grid data to be processed includes a plurality of elements;
a blocking module configured to block the grid data to be processed based on a data block size to obtain at least one data block, wherein the data block size is determined according to a capacity of a processor cache of a computing device and the data block size is smaller than the capacity of the processor cache, and wherein each of the at least one data block comprises a plurality of intra-block elements that are at least a portion of the plurality of elements;
a first determining module configured to determine a data block to be processed in the at least one data block; and
a loading module configured to load the data block to be processed into the processor cache, wherein a plurality of intra-block elements included in the data block to be processed are all retained in the processor cache before the processing of the data block to be processed is completed.
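An end-to-end toy version of the acquisition, blocking, determination, and loading steps of claim 28 might look like the sketch below: choose a data block size below the processor cache capacity, partition the grid elements, and keep each block resident while it is processed. The capacity figures, sizes, and function names are all invented for the example, and the uniform list slicing here ignores the adjacency-aware blocking of claims 13 and 40.

```python
def partition(elements, block_size):
    """Split the element list into consecutive blocks of block_size."""
    return [elements[i:i + block_size]
            for i in range(0, len(elements), block_size)]

def process_grid(elements, cache_capacity, element_bytes, process_block):
    # Block size chosen so a block's data stays below the cache capacity.
    block_size = max(1, (cache_capacity // element_bytes) - 1)
    blocks = partition(elements, block_size)
    for block in blocks:        # each block becomes "to be processed"
        cached = list(block)    # stand-in: resident until processing ends
        process_block(cached)

# Usage: 10 elements of 8 assumed bytes against a 32-byte toy cache.
process_grid(list(range(10)), cache_capacity=32, element_bytes=8,
             process_block=lambda b: print("block", b))
```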
29. The apparatus of claim 28, wherein each element of the plurality of elements has a dimension, wherein the apparatus further comprises:
a second determination module configured to determine, in response to receiving a query cycle instruction, a primary relationship based at least on a dimension of a target element queried by the query cycle instruction, wherein the query cycle instruction is used for querying a target element that meets a preset condition and for indicating that a data processing instruction corresponding to the query cycle instruction is to be executed on the target element, and wherein the primary relationship indicates an association of a first element having a first dimension with respect to a second element having a second dimension and adjacent to the first element; and
a third determination module configured to determine, between the plurality of elements and elements adjacent to each of the plurality of elements, a plurality of element relationships that satisfy the primary relationship,
wherein the loading module is further configured to load the plurality of element relationships into the processor cache, wherein the plurality of element relationships are retained in the processor cache before execution of the query cycle instruction ends.
30. The apparatus of claim 29, wherein the second determination module is further configured to:
determine the dimension of the target element as the first dimension; and
determine the dimension of the element queried by the first query in the data processing instruction as the second dimension.
31. The apparatus of claim 29 or 30, wherein the third determination module is further configured to:
determine the plurality of element relationships between a plurality of intra-block elements included in a data block in which the target element is located and elements adjacent to each of the plurality of intra-block elements.
32. The apparatus of any of claims 29-31, wherein the query cycle instruction is an outermost query cycle instruction of a plurality of query cycle instructions having a nested relationship.
33. The apparatus of any of claims 29-32, further comprising:
a fourth determination module configured to determine operation data associated with the plurality of element relationships based on the data processing instruction,
wherein the loading module is further configured to load the operation data into the processor cache, wherein the operation data has a first priority higher than a default priority in the processor cache before execution of the query cycle instruction ends.
34. The apparatus of claim 33, wherein the operation data is operation data associated with a second element of the plurality of element relationships.
35. The apparatus of claim 33 or 34, wherein the loading module is further configured to:
in response to receiving a data cache instruction in the data processing instruction, load target data indicated by the data cache instruction into the processor cache, wherein the target data has a second priority higher than the default priority in the processor cache before execution of the query cycle instruction ends.
36. The apparatus of claim 35, wherein the second priority is higher than the first priority.
37. The apparatus of any of claims 29-36, further comprising:
a fifth determination module configured to determine operation data associated with the plurality of element relationships based on the data processing instruction; and
a sixth determination module configured to determine at least a portion of the operation data on which an atomic write operation is performed,
wherein the loading module is further configured to load the at least a portion of the operation data into the processor cache, wherein the at least a portion of the operation data has a priority in the processor cache that is higher than the default priority before execution of the query cycle instruction ends.
38. The apparatus of any of claims 28-37, wherein the chunking module comprises:
a first local index sub-module configured to determine, for each of the at least one data block, a local index for each of a plurality of intra-block elements included in the data block; and
a first map generation sub-module configured to determine an index mapping relationship between a local index of each of the plurality of intra-block elements and a global index of each of the plurality of intra-block elements in the grid data to be processed.
39. The apparatus of claim 38, wherein the loading module is further configured to:
load the index mapping relationship of each of the plurality of intra-block elements into the processor cache, wherein the index mapping relationship of each of the plurality of intra-block elements is retained in the processor cache before the processing of the data block to be processed is completed.
40. The apparatus of any of claims 28-39, wherein each of the plurality of intra-block elements is adjacent to at least one other element of the plurality of intra-block elements.
41. The apparatus of any of claims 28-40, wherein any two of the at least one data block do not contain the same intra-block element.
42. The apparatus of any one of claims 28-41, wherein the chunking module further comprises:
a neighborhood determination sub-module configured to determine, for each of the at least one data block, an element that is adjacent to any of the plurality of intra-block elements included in the data block and does not belong to the data block as a neighborhood element of the data block,
wherein the loading module is further configured to:
load the neighborhood elements of the data block to be processed into the processor cache.
43. The apparatus of claim 42, wherein the partitioning module further comprises:
a second local index sub-module configured to determine, for each of the at least one data block, a local index for a neighborhood element of the data block; and
a second map generation sub-module configured to determine an index mapping relationship between a local index of the neighborhood element and a global index of the neighborhood element in the grid data to be processed,
wherein the loading module is further configured to:
load the index mapping relationship of the neighborhood elements of the data block to be processed into the processor cache, wherein the index mapping relationship of the neighborhood elements of the data block to be processed is retained in the processor cache before the processing of the data block to be processed is completed.
44. The apparatus of claim 43, wherein a sum of data amounts of the data block to be processed, the plurality of element relationships, the operation data, the target data, the index mapping relationships of the plurality of intra-block elements included in the data block to be processed, the neighborhood elements of the data block to be processed, and the index mapping relationships of the neighborhood elements of the data block to be processed is smaller than the capacity of the processor cache.
45. The apparatus of any of claims 28-44, wherein the first determination module is further configured to:
determine, in response to receiving a query cycle instruction, a data block in which a target element queried by the query cycle instruction is located as the data block to be processed, wherein the query cycle instruction is used for querying a target element that meets a preset condition and for indicating that a data processing instruction corresponding to the query cycle instruction is to be executed on the target element,
and wherein, before execution of the query cycle instruction ends, all of the plurality of intra-block elements included in the data block to be processed are retained in the processor cache.
46. The apparatus of claim 45, wherein the processor cache comprises a plurality of processing unit caches corresponding to a plurality of processing units in the computing device, wherein the data block size is smaller than a capacity of the processing unit caches, and wherein the target element comprises a plurality of target elements and the data block to be processed comprises at least one data block to be processed in which the plurality of target elements are located,
wherein the loading module is further configured to:
load the at least one data block to be processed into at least one processing unit cache of the plurality of processing unit caches,
and wherein the apparatus further comprises:
a parallel execution module configured to execute the query cycle instruction in parallel by using at least one processing unit corresponding to the at least one processing unit cache.
47. The apparatus of claim 46, wherein the parallel execution module is further configured to:
for each processing unit of the at least one processing unit, execute the data processing instruction on at least a portion of the plurality of target elements by using the processing unit, wherein the at least a portion of the target elements are target elements included in the data block to be processed that is loaded in the processing unit cache corresponding to the processing unit.
48. The apparatus of claim 46 or 47, wherein each of the plurality of processing units comprises a plurality of threads,
and wherein the parallel execution module is further configured to:
execute the data processing instruction in parallel for the at least a portion of the target elements by using the plurality of threads included in the processing unit.
49. The apparatus of any of claims 46-48, wherein each of the at least one data block to be processed is loaded into exactly one processing unit cache of the at least one processing unit cache.
50. An electronic device, comprising:
at least one processor, wherein each of the at least one processor comprises:
a processor cache; and
a memory communicatively coupled to the at least one processor, wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-27.
51. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-27.
52. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-27.
CN202210964262.4A 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data Pending CN116149835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210964262.4A CN116149835A (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210964262.4A CN116149835A (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data
CN202111401270.XA CN114064286B (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202111401270.XA Division CN114064286B (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data

Publications (1)

Publication Number Publication Date
CN116149835A (en) 2023-05-23

Family

ID=80276941

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202111401270.XA Active CN114064286B (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data
CN202210964262.4A Pending CN116149835A (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202111401270.XA Active CN114064286B (en) 2021-11-19 2021-11-19 Method, apparatus, device and medium for processing unstructured grid data

Country Status (1)

Country Link
CN (2) CN114064286B (en)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7966301B2 (en) * 2003-05-09 2011-06-21 Planeteye Company Ulc System and method for employing a grid index for location and precision encoding
EP2811112B1 (en) * 2010-09-07 2019-07-24 Saudi Arabian Oil Company Machine, computer program product and method to generate unstructured grids and carry out parallel reservoir simulation
GB201101810D0 (en) * 2011-02-03 2011-03-16 Rolls Royce Plc A method of connecting meshes
CN102968456B (en) * 2012-10-30 2016-08-24 北京地拓科技发展有限公司 A kind of raster data reading and processing method and device
CN103281376B (en) * 2013-05-31 2015-11-11 武汉大学 The automatic buffer memory construction method of magnanimity sequential remote sensing image under a kind of cloud environment
GB2531585B8 (en) * 2014-10-23 2017-03-15 Toshiba Res Europe Limited Methods and systems for generating a three dimensional model of a subject
CN105760529B (en) * 2016-03-03 2018-12-25 福州大学 A kind of spatial index of mobile terminal vector data and caching construction method
EP3837673A1 (en) * 2018-09-21 2021-06-23 Siemens Industry Software Inc. Feature based abstraction and meshing
CN109918461B (en) * 2019-01-28 2020-10-30 北京瓴域航空技术研究院有限公司 Multidimensional grid airspace application method
CN111079078B (en) * 2019-11-25 2022-04-22 清华大学 Lower triangular equation parallel solving method for structural grid sparse matrix

Also Published As

Publication number Publication date
CN114064286A (en) 2022-02-18
CN114064286B (en) 2022-08-05

Similar Documents

Publication Publication Date Title
US11675785B2 (en) Dynamic asynchronous traversals for distributed graph queries
US10922316B2 (en) Using computing resources to perform database queries according to a dynamically determined query size
US8996504B2 (en) Plan caching using density-based clustering
US20150293994A1 (en) Enhanced graph traversal
US8813091B2 (en) Distribution data structures for locality-guided work stealing
CN103455531B (en) A kind of parallel index method supporting high dimensional data to have inquiry partially in real time
US20090254594A1 (en) Techniques to enhance database performance
Mostak An overview of MapD (massively parallel database)
US20220229809A1 (en) Method and system for flexible, high performance structured data processing
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
Kumar et al. Efficient data restructuring and aggregation for I/O acceleration in PIDX
CN110134335A (en) A kind of RDF data management method, device and storage medium based on key-value pair
Tao et al. Clustering massive small data for IOT
US20200104425A1 (en) Techniques for lossless and lossy large-scale graph summarization
WO2024041376A1 (en) Distributed graph data processing system, method, apparatus and device, and storage medium
CN116149835A (en) Method, apparatus, device and medium for processing unstructured grid data
EP4390646A1 (en) Data processing method in distributed system, and related system
CN105573834B (en) A kind of higher-dimension vocabulary tree constructing method based on heterogeneous platform
CN112950451B (en) GPU-based maximum k-tress discovery algorithm
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
Wang et al. GLIN: A (G) eneric (L) earned (In) dexing Mechanism for Complex Geometries
CN113688064A (en) Method and equipment for allocating storage address for data in memory
US20230376562A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
US20240220334A1 (en) Data processing method in distributed system, and related system
CN110413313B (en) Parameter optimization method and device for Spark application

Legal Events

Date Code Title Description
PB01 Publication