CN110163791B - GPU processing method and device of data computation flow graph - Google Patents

GPU processing method and device of data computation flow graph

Info

Publication number
CN110163791B
CN110163791B (application CN201910421763.6A)
Authority
CN
China
Prior art keywords
node
memory block
result
display
video memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910421763.6A
Other languages
Chinese (zh)
Other versions
CN110163791A (en)
Inventor
颜俊超
李家军
鄢贵海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yusur Technology Co ltd
Original Assignee
Yusur Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yusur Technology Co ltd
Priority to CN201910421763.6A
Publication of CN110163791A
Application granted
Publication of CN110163791B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Controls And Circuits For Display Device (AREA)

Abstract

The invention provides a GPU processing method and apparatus for a data computation flow graph, wherein the method comprises the following steps: storing received input data in a first video memory block allocated to a source node of the computation flow graph; obtaining the expression of a successor node pointed to by the source node, reading the input data from the first video memory block, and allocating a second video memory block to the successor node according to the video memory blocks allocated to the source nodes of the computation flow graph; once all data required as input by the successor node's expression have been obtained, feeding the read input data into the expression, computing the result of the successor node, and storing that result in the second video memory block; and reading the result of the successor node from the second video memory block, deriving from it the result of processing the input data according to the computation flow graph, and outputting the processing result. The scheme avoids copying data back and forth between the GPU and external devices, thereby improving the speed at which the GPU processes data.

Description

GPU processing method and device of data computation flow graph
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a GPU (Graphics Processing Unit) processing method and apparatus for a data computation flow graph.
Background
Sequence data is ubiquitous in real life (for example, a time series: a series of values of some physical quantity observed at equal intervals) and spans many fields such as finance, medicine, biology, and chemistry. Processing sequence data often involves complex computational procedures that exhibit good data parallelism, i.e., the same operation can be performed on many data items simultaneously.
The processing of sequence data can be represented as a computation flow graph, which clearly expresses the computation flow of complex sequence data. In a computation flow graph, data is represented by nodes and operations are described by the edges between nodes; a single operation may be described jointly by one or more edges, and the direction of an edge indicates the direction in which the sequence data flows, with each step of the processing flow representing one operation.
When a GPU is used to process a computation flow graph, data must first be obtained from a CPU (Central Processing Unit). The result of each node is then computed on the GPU according to some policy and stored for use in computing the result of the next node, node by node, until the value of the result node is obtained and transmitted back to the CPU. During this process, data must be copied back and forth between the GPU and the CPU, and this copying is time-consuming, which in turn reduces the speed of highly parallel sequence-data processing on the GPU.
Disclosure of Invention
In view of this, the present invention provides a GPU processing method and apparatus for a data computation flow graph, so as to improve the speed at which a GPU processes data.
According to an aspect of the embodiments of the present invention, a GPU processing method for a data computation flow graph is provided, including:
receiving input data and storing it in a first video memory block allocated to a first source node of a computation flow graph;
obtaining a first expression of a first successor node pointed to by the first source node, reading the input data from the first video memory block, and allocating a second video memory block to the first successor node according to the video memory blocks allocated to the source nodes of the computation flow graph;
when all data required as input by the first expression have been obtained, feeding the read input data into the first expression to compute the result of the first successor node, and storing that result in the second video memory block; and
reading the result of the first successor node from the second video memory block, deriving from it the result of processing the input data according to the computation flow graph, and outputting that processing result.
According to another aspect of the embodiments of the present invention, there is provided a computer-readable storage medium on which a computer program is stored, which, when executed by a GPU, implements the steps of the method described in the above embodiments.
According to another aspect of the embodiments of the present invention, there is provided a computer device including a memory, a GPU, and a computer program stored on the memory and executable on the GPU, wherein the GPU implements the steps of the method of the above embodiments when executing the program.
According to the GPU processing method, computer-readable storage medium, and computer device for a data computation flow graph described above, allocating a video memory block to each source node of the computation flow graph allows the input data copied from an external device to be held temporarily on the GPU side; allocating a video memory block to a successor node of a source node allows the result of that successor node to be held on the GPU side as well. A video memory block is allocated to a successor node only when its result needs to be computed, and the allocation takes the source nodes' video memory blocks into account, so GPU video memory is saved as far as possible, which makes it practical to process complex computation flow graphs.
Drawings
In order to illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments and the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a flow chart of a GPU processing method of a data computation flow graph according to an embodiment of the invention;
FIG. 2 is a flowchart illustrating a GPU processing method of a time series computation flow graph according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a chained storage management unit according to an embodiment of the present invention;
FIG. 4 is a block diagram of a chained storage management unit in an embodiment of the invention;
FIG. 5 is a diagram illustrating a process for obtaining results of a final node according to an embodiment of the invention;
FIG. 6 is a diagram illustrating a process of changing data in a memory cell according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a GPU processing apparatus of a data computation flow graph according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions are provided to explain the present invention, not to limit it.
In the prior art, when a GPU is used to process a large amount of data according to a computation flow graph, the data must be copied back and forth between the GPU and a CPU, which consumes a great deal of time, so that highly parallel data processing cannot be achieved on the GPU.
FIG. 1 is a flowchart of a GPU processing method of a data computation flow graph according to an embodiment of the present invention. As shown in FIG. 1, the GPU processing method of some embodiments may include:
step S110: receiving input data and storing it in a first video memory block allocated to a first source node of a computation flow graph;
step S120: obtaining a first expression of a first successor node pointed to by the first source node, reading the input data from the first video memory block, and allocating a second video memory block to the first successor node according to the video memory blocks allocated to the source nodes of the computation flow graph;
step S130: when all data required as input by the first expression have been obtained, feeding the read input data into the first expression to compute the result of the first successor node, and storing that result in the second video memory block;
step S140: reading the result of the first successor node from the second video memory block, deriving from it the result of processing the input data according to the computation flow graph, and outputting that processing result.
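As an informal illustration of steps S110 to S140, the following sketch models the flow on the host side in Python (it is not part of the patent; names such as `process_flow_graph` are hypothetical, and video memory blocks are modeled as entries in an ordinary dictionary rather than GPU allocations):

```python
# Minimal sketch of steps S110-S140 for a linear chain of nodes.
# Each "video memory block" is simply a dictionary entry keyed by node name.

def process_flow_graph(graph, source_name, input_data):
    """Run a chain source -> successor -> ... as in steps S110-S140."""
    blocks = {source_name: input_data}           # S110: store input in the source's block
    node = source_name
    while graph[node]["successor"] is not None:  # walk the chain of successor nodes
        succ = graph[node]["successor"]
        expr = graph[succ]["expr"]               # S120: obtain the successor's expression
        value = blocks[node]                     # S120: read from the predecessor's block
        blocks[succ] = expr(value)               # S130: compute and buffer the result
        node = succ
    return blocks[node]                          # S140: final result, output to the CPU side

graph = {
    "src": {"expr": None, "successor": "n1"},
    "n1":  {"expr": lambda x: x * 2, "successor": "n2"},
    "n2":  {"expr": lambda x: x + 1, "successor": None},
}
result = process_flow_graph(graph, "src", 10)    # (10 * 2) + 1 = 21
```

The point of the sketch is that every intermediate value stays in `blocks` (i.e., on the GPU side in the real scheme) and only the final value crosses back to the caller.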
Steps S110 to S140 may be executed by a GPU, or by any other device capable of processing data on the basis of video memory. The input data generally come from outside the GPU, for example from a CPU; correspondingly, the processing result is generally output to the outside of the executing device, for example back to the CPU.
It should be noted that "first" in "first source node" merely denotes some source node of the computation flow graph without restricting which one; it is used only for brevity, and the source nodes of the computation flow graph include the first source node. "First" and "second" applied to video memory blocks merely distinguish different blocks and imply nothing about their size or order. "First" in "first successor node" denotes a node pointed to by the first source node, either directly or indirectly; when it is pointed to indirectly, additional steps that compute the nodes between the first source node and the first successor node may occur between steps S110 and S120. "First" in "first expression" merely distinguishes it from other expressions (for example, the second expression described later) and does not restrict its content. Further, an expression here may be program code that performs an operation: the left-hand side may be the result of the expression and the right-hand side the code of a function. An expression may have one or more inputs but typically has a single output.
In step S110, the input data may be sequence data, such as time-series or space-series data. Time-series data may be, for example, stock prices or weather temperatures. One or more input data items of one sequence may be received at a time, or input data of several sequences may be received simultaneously. Input data of different sequences can be stored in the video memory blocks of their respective source nodes, and different source nodes can be distinguished by tag information such as node names. In some embodiments, several source nodes may share one video memory block, in which data corresponding to the tag information of the different source nodes are recorded.
Before or when the input data are received, one or more small blocks can be carved out of the GPU's video memory during initialization to serve as video memory blocks and allocated to the different source nodes. The size of a video memory block may be chosen according to the space typically occupied by the data to be stored, i.e., the input data, the computation result of each node, and so on. A computation flow graph may have one or more source nodes; each source node may be allocated one video memory block, different source nodes may be allocated different blocks, and the sizes of different blocks may be equal or different.
A source node can be associated with its video memory block through the node's tag information (e.g., its node name) and the block's address. When input data are received together with the tag information of their source node, the correspondence between tag information and block addresses can be looked up to find the address of the matching block, and hence the block itself, so that the input data are stored in the video memory block of their source node. Alternatively, the received input data may already designate their source node (for example, through a parameter), in which case that node's block can be located directly; and when there is only one source node, the received input data can simply be stored in its block.
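The tag-information lookup described above can be sketched as a small mapping from node names to blocks (a hedged illustration only; `VideoMemoryPool` and its methods are hypothetical names, and blocks are plain Python lists standing in for GPU allocations):

```python
class VideoMemoryPool:
    """Associates a source node's tag information (node name) with its block."""

    def __init__(self):
        self.block_of = {}                     # node name -> block address (here: a list)

    def allocate(self, node_name, size):
        """Carve out a block of the given size for a source node."""
        self.block_of[node_name] = [None] * size
        return self.block_of[node_name]

    def store_input(self, node_name, data):
        """Look up the block by tag information and store the input there."""
        self.block_of[node_name][0] = data

    def read(self, node_name):
        """Read back the data buffered in the node's block."""
        return self.block_of[node_name][0]

pool = VideoMemoryPool()
pool.allocate("src", 4)
pool.store_input("src", [1.0, 2.0, 3.0])
```

In a real implementation the dictionary values would be device pointers returned by the GPU allocator rather than host lists.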
In step S120, the order of obtaining the first expression, reading the input data, and allocating the second video memory block is not limited and may be changed. The address of the first video memory block can be found from the node name of the first source node, and the block itself from that address, so that the input data can be read. In addition, when several input data items of the same sequence are stored in the first video memory block, they can carry identifiers for distinguishing them, so that the required item can be read by its identifier.
"The video memory blocks allocated to the source nodes of the computation flow graph" may refer to all blocks of source nodes that have already been allocated one: blocks may be allocated to some or all of the source nodes, one block may serve one or more source nodes, and more blocks than there are source nodes may be allocated.
Allocating the second video memory block to the first successor node according to the blocks allocated to the source nodes means taking the state of the source nodes' blocks into account when deciding how to allocate the second block. For example, with several source nodes, once some of them have taken part in the computation and are no longer needed, their blocks become idle, and the second block can be taken from an idle block; when all the source nodes' blocks are occupied, a new block can be appended to the blocks' data structure to serve as the second block.
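The reuse-or-grow policy just described is, in effect, a free-list allocator. A minimal sketch (illustrative only; `allocate_block` is a hypothetical name, and `bytearray` stands in for a GPU allocation):

```python
def allocate_block(free_blocks, all_blocks, make_block):
    """Allocate a block for a successor node: reuse an idle block if one
    exists, otherwise append a brand-new block to the data structure."""
    if free_blocks:
        return free_blocks.pop()      # reuse a block freed by a finished node
    block = make_block()
    all_blocks.append(block)          # grow the block list when all are occupied
    return block

all_blocks = [bytearray(16), bytearray(16)]   # blocks of two source nodes
free_blocks = []                              # both still occupied
b = allocate_block(free_blocks, all_blocks, lambda: bytearray(16))
```

Since `free_blocks` is empty here, a third block is appended; had a source node already been released, its block would have been reused instead and the total video memory footprint would not grow.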
In step S130, the first expression can be found in the computation flow graph from the name of the first successor node. When the first expression has a single input, obtaining the input data means that all its required inputs have been obtained; when it has several inputs, the remaining data must also be obtained, either by a method similar to steps S110 to S120 or in some other way, for example by reading from other GPU video memory or by copying from the CPU. Besides its input data (for example, the data originally transmitted from the CPU), the first expression may also need parameter data; frequently used parameters may be transmitted from the CPU once and kept in GPU video memory so that they are available whenever an expression needs them, in which case the parameter data are fed into the first expression along with the input data.
In the computation flow graph, each node's in-degree can be recorded. Each time one input datum required by the node's expression is obtained, the node's in-degree is decremented by one and checked: when the in-degree reaches zero, all input data required by the node's expression have been obtained.
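The in-degree bookkeeping can be sketched in a few lines (a hedged illustration; `receive_input` is a hypothetical name):

```python
in_degree = {"n1": 2}      # node n1's expression needs two inputs

def receive_input(node):
    """Decrement the node's in-degree; return True once all inputs are in."""
    in_degree[node] -= 1
    return in_degree[node] == 0

first_ready = receive_input("n1")    # one input obtained -> not ready yet
second_ready = receive_input("n1")   # second input obtained -> ready to compute
```

Only when the counter hits zero is the node's expression evaluated, which is exactly the readiness test used in steps S130 and S142.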
In step S140, the result of processing the input data according to the computation flow graph is derived from the read result of the first successor node: that result may directly be the final processing result, or further computation may be carried out on it, for example by repeating a procedure similar to steps S120 to S130 to obtain the results of one or more further successor nodes until the final result is reached. Once obtained, the final processing result can be output to a device external to the GPU, for example copied to the CPU.
In this embodiment, allocating a video memory block to a source node of the computation flow graph allows the input data copied from an external device to be held temporarily on the GPU side; allocating a block to a successor node allows its result to be held on the GPU side as well. A block is allocated to a successor node only when its result needs to be computed, and the allocation takes the source nodes' blocks into account, so GPU video memory is saved as far as possible, which makes it practical to process complex computation flow graphs.
To prepare for temporarily storing the input data in step S110, video memory blocks may be allocated to the source nodes in advance, or the blocks' storage management unit may be initialized, before the input data are received. In some embodiments, before step S110, i.e., before receiving input data and storing them in the first video memory block allocated to the first source node of the computation flow graph, the method of the embodiments may further include:
S150: generating video memory blocks according to the number of source nodes of the computation flow graph and allocating them to the source nodes, whereby the first source node is allocated the first video memory block.
In step S150, the number of generated video memory blocks may equal or exceed the number of source nodes. For example, if the computation flow graph has four source nodes, four blocks may be generated, one per source node; the blocks of different source nodes may be of equal or different sizes.
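Step S150 amounts to a one-shot initialization pass over the source nodes. A sketch (hypothetical names; `bytearray` stands in for a GPU allocation, and a uniform size is assumed purely for brevity):

```python
def init_source_blocks(source_names, block_size):
    """S150 sketch: generate one video memory block per source node.
    In practice blocks could have per-node sizes."""
    return {name: bytearray(block_size) for name in source_names}

blocks = init_source_blocks(["s1", "s2", "s3", "s4"], 1024)
```

With four source nodes, four blocks are produced, each addressable by its node name as in the tag-information lookup described earlier.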
In this embodiment, video memory blocks are generated and allocated according to the number of source nodes of the computation flow graph; when the data of one of the source nodes are no longer needed, its block can be released or reused to hold the data of other nodes, reducing waste of video memory.
In the computation flow graph, the first source node points to the first successor node, which, when it is not the final node, may itself point to a next node, e.g., a second successor node. In that case the result of the second successor node can be computed by a method similar to steps S120 to S130. In some embodiments, step S140, i.e., reading the result of the first successor node from the second video memory block and deriving from it the result of processing the input data according to the computation flow graph, may include:
S141: reading the result of the first successor node from the second video memory block, obtaining a second expression of a second successor node pointed to by the first successor node, and allocating a third video memory block to the second successor node according to the blocks allocated to the source nodes of the computation flow graph and the second block allocated to the first successor node;
S142: when all data required as input by the second expression have been obtained, feeding the result of the first successor node read from the second block into the second expression to compute the result of the second successor node, and storing that result in the third block;
S143: reading the result of the second successor node from the third block and deriving from it the result of processing the input data according to the computation flow graph.
In step S141, the order of reading the result of the first successor node, obtaining the second expression, and allocating the third video memory block is not limited. The address of the second block can be found from the node name of the first successor node, and the block itself from that address, so that the result of the first successor node can be read; the second expression can be obtained from the computation flow graph. If the second block was newly appended, "the blocks allocated to the source nodes and the second block allocated to the first successor node" refers to each source node's block plus the newly appended block; if the second block was taken from an idle source node's block, it refers simply to the source nodes' blocks. Allocating the third block according to these blocks means that the third block is taken from an idle block among them (the source nodes' blocks plus the second block), or newly appended when all of them are occupied.
In step S142, whether all data required as input by the second expression have been obtained can be judged as in step S130: the initial in-degree of the second successor node is known, the in-degree is decremented by one each time one required input is obtained, and obtaining all inputs corresponds to the in-degree reaching zero.
In step S143, if the second successor node is the final node of the computation flow graph, its result read from the third block can be processed and the processing result output; if further nodes follow the second successor node, their results can be computed by a method similar to steps S141 to S142. After a node's result is obtained, it can be checked, before the result is buffered, whether both the node's in-degree and out-degree are zero, i.e., the node needs neither input nor output; in that case its result may be the final processing result, and it can be output directly to an external device (e.g., the CPU) without being buffered in a video memory block. In other embodiments, the node's result may first be buffered, after which it is determined whether the node is the final node; if so, the result is read out and output to the external device.
In this embodiment, when a successor node follows the source node, the input required by the successor node is likewise read from video memory on the GPU side, so that in this case too, copying data back and forth between the GPU and external devices is avoided, further improving the speed at which the GPU processes the computation flow graph.
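The decision in step S143 between "buffer and continue" and "output directly" can be sketched as follows (illustrative only; `handle_result` and the callback are hypothetical names):

```python
def handle_result(node, result, out_degree, gpu_buffer, output_to_cpu):
    """S143 sketch: if the node's out-degree is zero it is a final node and its
    result is output directly; otherwise the result stays buffered on the GPU
    side for the next successor's computation."""
    if out_degree[node] == 0:
        output_to_cpu(result)        # final node: copy the result out to the CPU
        return True
    gpu_buffer[node] = result        # intermediate node: keep result on the GPU side
    return False

out_degree = {"n1": 1, "n2": 0}      # n2 is the final node
sent, buffered = [], {}
done = handle_result("n2", 42, out_degree, buffered, sent.append)  # output directly
kept = handle_result("n1", 7, out_degree, buffered, sent.append)   # buffered instead
```

Only the final node's value ever crosses the GPU-CPU boundary; every intermediate value stays in `gpu_buffer`.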
In a computation flow graph, the in-degree of each node (source nodes and successor nodes alike) can be recorded; from a node's in-degree it is known whether all inputs required by the node's expression have been obtained. If they have, the node's result can be computed; if not, the computation must wait until all inputs are available. In some embodiments, before step S130, i.e., before feeding the read input data into the first expression to compute the result of the first successor node and storing that result in the second video memory block, the method of the embodiments may further include:
S160: after the input data are read from the first video memory block, decrementing the in-degree of the first successor node by one, and judging from its current in-degree whether all data required as input by the first expression have been obtained.
In step S160, reading the input data from the first video memory block refers to part of the procedure of step S120. For example, if the in-degree of the first successor node is one, then after the input data are read the in-degree is decremented to zero, which indicates that the inputs of the first successor node are complete, and its result can be computed with the first expression corresponding to it.
Similarly, before step S142, i.e., before feeding the result of the first successor node read from the second video memory block into the second expression to compute the result of the second successor node and storing that result in the third block, step S140 may further include:
S144: after the result of the first successor node is read from the second video memory block, decrementing the in-degree of the second successor node by one, and judging from its current in-degree whether all data required as input by the second expression have been obtained.
In step S144, reading the result of the first successor node from the second video memory block refers to part of the procedure of step S141.
In the computation flow graph, the out-degree of each node (source nodes and successor nodes alike) can also be recorded; from a node's out-degree it is known whether the node still points to a next node. If the out-degree is not zero, the node's result is still needed for computing the next node's result; if it is zero, the node is a final node of the computation flow graph, i.e., its result is the final processing result. In some embodiments, before step S140, i.e., before reading the result of the first successor node from the second video memory block, deriving the processing result from it, and outputting the processing result, the method of the embodiments may further include:
S170: after the result of the first successor node is obtained, decrementing the out-degree of the first source node by one; when the current out-degree of the first source node is zero, deleting the first source node and releasing the first video memory block, or recycling it so that it can be allocated to a successor node of the computation flow graph that has not yet been allocated a block.
In step S170, obtaining the result of the first successor node refers to part of the procedure of step S130.
Similarly, before step S143 — that is, before reading the result of the second successor node from the third video memory block and obtaining from it the result of processing the input data according to the computation flow graph — step S140 may further include:
S145: after the result of the second successor node is obtained, decrement the out-degree of the first successor node by one; when the current out-degree of the first successor node is zero, delete the first successor node and either release the second video memory block or recycle it to serve as a memory block for a successor node in the computation flow graph that has not yet been allocated one.
In step S145, obtaining the result of the second successor node is part of the process performed in step S142.
In the process of computing the input data according to the computation flow graph, the result of each node must be calculated. After each node is computed, it can further be judged whether that node is the last node in the computation flow graph; if so, its result can be output as the final processing result. In some embodiments, step S143 — obtaining the result of processing the input data according to the computation flow graph from the read result of the second successor node — may include:
S14431: judge from the current out-degree of the second successor node whether it is the final node in the computation flow graph; if so, take the read result of the second successor node as the result of processing the input data according to the computation flow graph.
In this embodiment, after a node's result is obtained by calculation, its current out-degree may be examined. For example, if the out-degree of a node is decremented by one each time its result is used to calculate a successor, whether it is the final node can be determined by checking whether its current out-degree is zero. If the out-degree is not zero, the node's result is still needed to compute the result of a next node, and the node can be deleted only after that next result has been calculated. In other embodiments, if the out-degree is adjusted in another manner or at other times, whether the current node is the final node can be judged from the out-degree by a correspondingly adapted criterion.
In some embodiments, step S120 — allocating a second video memory block to the first successor node according to the video memory blocks allocated to source nodes in the computation flow graph — may include:
S121: when all video memory blocks allocated to source nodes in the computation flow graph are occupied, add a new video memory block and allocate it to the first successor node as the second video memory block; when an idle block exists among the video memory blocks allocated to source nodes, allocate that idle block to the first successor node as the second video memory block.
In step S121, for the video memory blocks allocated to the nodes (including each source node) in the computation flow graph, a node and its block can be associated through a correspondence between the node name and the block address. For example, the node name may serve as the key and the block address as the value; when a node is deleted, its entry in the correspondence is deleted as well.
A video memory management unit can be formed by storing the block addresses in a specified data structure, such as a linked list or an array. When a new video memory block is needed, it can be added to the management unit: for example, a small region of GPU video memory may be newly carved out as the new block, and its address appended to the management unit in the form of that data structure.
In addition, a computation flow graph may have multiple source nodes; once some of them have been consumed in calculation and are no longer needed, their blocks become idle. In implementation, a block can be marked occupied or idle with a flag bit, for example one attached to the block address. The blocks of those source nodes can then be marked idle, and idle blocks can be located by their state and used to store the output of successor nodes.
In this embodiment, new blocks are added within the same data structure, which simplifies block management. Using idle blocks to store node outputs makes full use of the video memory and avoids wasting storage space.
In a more specific embodiment, in the method of each embodiment, the video memory blocks allocated to source nodes in the computation flow graph may be connected by a linked list: their addresses are linked, and a block address can be found by moving the head or tail pointer. In some embodiments, in step S121, when an idle block exists among the blocks allocated to source nodes, allocating it to the first successor node as the second video memory block may include:
S1211: when a video memory block allocated to a source node changes from occupied to idle, move the newly idle block to the first end of the linked list;
S1212: check whether the block at the first end of the linked list is idle; if so, allocate it to the first successor node as the second video memory block, move it to the second end of the linked list, and mark its state as occupied.
The first end is the position pointed to by one of the linked list's head and tail pointers, and the second end is the position pointed to by the other. For example, if the first end is the block address pointed to by the head pointer, the second end is the block address pointed to by the tail pointer, and vice versa.
In step S1211, while the data stored in a source node's block is still needed for node calculations, the block is considered occupied; once those calculations complete and the data is no longer needed, the block changes from occupied to idle. In other words, occupied or idle is determined by whether the data stored in the block is still of use. When a flag bit marks the block state and is updated as the state changes, whether a node's block is occupied or idle can be judged from the flag bit.
In some embodiments, in step S121, when all blocks allocated to source nodes in the computation flow graph are occupied, adding a new video memory block and allocating it to the first successor node as the second video memory block may include:
S1213: check whether the block at the first end of the linked list is idle; if not, add a new block at the second end of the linked list, allocate it to the first successor node as the second video memory block, and mark its state as occupied.
In steps S1212 and S1213 above, the block addresses may be linked by a singly or doubly linked list. After blocks are allocated to the source nodes, their addresses are linked into the list, with the head pointer at one end and the tail pointer at the other. Each block stores the input data of its source node and is therefore occupied at this stage. In that case, a check of the first end of the list (for example, the block address pointed to by the head pointer) finds the block occupied, so a new block address is appended at the second end (for example, at the tail pointer), the tail pointer is moved to the new address, and the corresponding block is allocated to the first successor node. Once the data of some source node has been consumed and is no longer needed, its block becomes idle; the idle block is then moved to, say, the end pointed to by the head pointer, the head pointer is updated to its address, and that block can be allocated to a successor node. This cycle repeats, allocating a block for the output of each successor node.
In short, idle blocks are moved to one end of the linked list (for example, the head-pointer end) and occupied blocks to the other (for example, the tail-pointer end), so checking whether the block at the head pointer is idle reveals whether any block managed by the list is idle. This management scheme makes idle blocks easy to find and reduces the time spent searching for them.
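As a concrete illustration of this head/tail discipline, the following sketch is a hypothetical Python model, not the patent's GPU implementation; `ChainedBlockPool`, `alloc`, and `release` are invented names. Idle blocks sit at the head of a deque and occupied blocks at the tail, so a reusable block is found by one check at the head:

```python
from collections import deque

class ChainedBlockPool:
    """Toy model of the linked-list scheme: idle blocks at the head,
    occupied blocks at the tail, and a table from node names to blocks."""

    def __init__(self):
        self.chain = deque()   # left end = head pointer, right end = tail pointer
        self.table = {}        # node name -> block record
        self._next_addr = 0

    def alloc(self, node):
        """Allocate a block for `node`, reusing the head block if it is idle."""
        if self.chain and self.chain[0]["free"]:
            block = self.chain.popleft()           # reuse an idle block
        else:
            block = {"addr": self._next_addr, "free": False}
            self._next_addr += 1                   # grow the chain instead
        block["free"] = False
        self.chain.append(block)                   # occupied blocks go to the tail
        self.table[node] = block
        return block["addr"]

    def release(self, node):
        """Mark `node`'s block idle and move it to the head for reuse."""
        block = self.table.pop(node)
        block["free"] = True
        self.chain.remove(block)
        self.chain.appendleft(block)
```

Because a freed block is always moved to the head, the single check in `alloc` answers "is any managed block idle?" without scanning the whole chain.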
In other embodiments, the data structure may be an array; that is, the block addresses are stored in an array. For example, the array length may be set to the maximum number of concurrently occupied blocks during processing of the computation flow graph. A corresponding number of blocks is then divided according to the number of source nodes, allocated to them, and their addresses stored in order from the first array position. When allocating a block for a successor node, the array is traversed from the beginning to find an address whose block is idle; the first idle block found is allocated to that node. An idle block may sit between positions holding occupied addresses or after all of them. Because the array is sized at creation to the maximum length it may ever need, it is not very long, and with continual reuse of idle blocks it remains sufficient.
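The array variant can be modeled the same way; in this hedged sketch the record shape and function name are illustrative, with the array pre-sized to the peak concurrent demand:

```python
def alloc_from_array(blocks):
    """Scan the fixed-length array of block records from the front and
    return the address of the first idle block, marking it occupied.
    `blocks` is sized to the peak number of concurrently occupied blocks."""
    for blk in blocks:
        if blk["free"]:
            blk["free"] = False
            return blk["addr"]
    raise RuntimeError("array was sized below the peak concurrent demand")
```

The linear scan is the cost of the simpler structure; the linked-list scheme above avoids it by keeping idle blocks at one known end.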
In some embodiments, reading the input data from the video memory block corresponding to the source node in step S110 may include: looking up a pre-established correspondence between node mark information and block addresses, using the source node's mark information, to obtain the corresponding block address; then finding the block at that address and reading the input data from it.
In some embodiments, the set data structure is a linked list. Step S120 — when all blocks corresponding to source nodes in the computation flow graph are occupied, generating a new block, storing its address based on the set data structure, and allocating it to the successor node — may more specifically include: when all blocks corresponding to source nodes are occupied, generating a new block, storing its address in a storage unit appended at the first end of the linked list, and associating the new block's address with the successor node's mark information.
In some embodiments, the set data structure is a linked list. Before reading the successor node's result from its block, using that result to obtain the result of processing the input data according to the computation flow graph, and outputting the processing result, the method may further include: when the out-degree of the source node corresponding to the input data is zero, moving the storage unit holding that source node's block address to the second end of the linked list, marking the block idle, and deleting the source node's mark information.
Reading the successor node's result from its block, using it to obtain the result of processing the input data according to the computation flow graph, and outputting the processing result may include: reading the successor node's result from its block and acquiring the successor node's expression; checking whether the block corresponding to the storage unit at the second end of the linked list is idle, and if so, associating that block's address with the successor node's mark information and moving the storage unit from the second end to the first end; inputting the result into the successor node's expression to obtain the successor node's result, storing it in the block of the storage unit moved to the first end, and decrementing the successor node's out-degree by one; then reading the result from that block, obtaining from it the result of processing the input data according to the computation flow graph, and outputting the processing result.
In some embodiments, the input data may be a time series. The series is copied from the CPU to the GPU, processed on the GPU according to a given computation flow graph, and the result copied back to the CPU; throughout, the data structure managing the video memory blocks may be a linked list, such as a doubly linked list. In this case, in the GPU acceleration or processing method for time-series computation flow graphs, the block management method may include the steps of:
s11: divide the GPU video memory into small blocks and connect them with a linked list to form a chained storage unit; the addresses of the blocks in the chained storage unit need not be contiguous;
s12: record the address of each block on the chain in a table whose key is a node name and whose value is the address of the node's block on the chain, so that a block can be indexed in constant time;
s13: let the head and tail pointers point to the head and tail of the chained storage unit, respectively;
s14: mark the state of each block, idle or occupied, with a flag bit;
s15: to add a block, apply for new video memory, append it to the tail of the chained storage unit, and move the tail pointer;
s16: to delete a block, move the tail pointer to release it;
s17: to look up a block, find its on-chain address in the table by the given node name, and then locate the block at that address.
The GPU acceleration or processing method for time-series computation flow graphs provided by this embodiment may include the steps of:
s21: initialize the chained storage management unit from the source nodes obtained by the traversal module, initialize the number of blocks in the unit according to the number of source nodes, and set the block states;
s22: the traversal module obtains each layer of nodes in width-first topological order and calls the execution module to evaluate each node's expression; when a node is evaluated, the expression's input address is obtained in constant time by querying the chained storage management unit, and an output address is allocated at the same time;
s23: adjust the storage management unit according to a video memory adjustment strategy;
s24: repeat steps s22 and s23 until the final node has executed, yielding the processing result.
Block address acquisition in the GPU acceleration or processing method for time-series computation flow graphs provided in this embodiment may be implemented as follows:
s31: first consult the table; if it contains the node's information (its name), return that information directly through the table index. The information of each node in the computation flow graph can be recorded in the table during initialization;
s32: then query the head of the chained storage unit; if that position is idle, allocate its block to the node, set the state to occupied, and store the block address at the node's position in the table;
s33: if that position is occupied, the whole storage management unit has no idle block, so a block must be added at the tail of the storage unit, its state set to occupied, and its position in the chained storage unit (the block address) recorded at the node's position in the table.
The video memory adjustment strategy (step s23) in the GPU acceleration or processing method for time-series computation flow graphs provided in this embodiment may be implemented as follows:
s41: when a node is executed, check the head of the chained storage management unit; if that position is idle, occupy it and set its state to occupied; otherwise add a block at the tail of the unit, adjust the tail pointer, and record the block's position (its address) in the table;
s42: when a node in the computation flow graph is no longer needed, i.e. its out-degree is 0, set the state of its block in the storage management unit to idle, move that block to the head of the chained storage unit, and delete the node's record from the table.
The traversal module described in this embodiment works from a given computation flow graph and the in-degrees of its nodes, each time returning a node that currently has no data dependencies. It may include the following steps:
1) traverse the in-degrees of all nodes and return those with in-degree 0;
2) decrement by one the in-degree of each successor pointed to by a returned node;
3) if the in-degree of a successor drops to 0, return that node;
4) when the end point is reached, the traversal ends.
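The four steps above are essentially a width-first topological traversal (Kahn's algorithm). A minimal sketch — the `nodes`/`edges` input shapes are assumptions for illustration, not the patent's interface:

```python
from collections import deque

def traverse(nodes, edges):
    """Yield nodes in width-first topological order: emit each node whose
    current in-degree is zero, then decrement its successors' in-degrees."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)   # step 1): source nodes
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:                # step 2): decrement each successor
            indeg[v] -= 1
            if indeg[v] == 0:            # step 3): newly dependency-free node
                ready.append(v)
    return order                         # step 4): ends at the end point
```

Nodes returned in the same round have no data dependencies on each other, which is what allows them to run simultaneously.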
The execution module described in this embodiment executes the computation of the nodes in a given computation flow graph by calling the storage management unit and the traversal unit, producing the final result. It may include the following steps:
1) obtain the node to be executed through the traversal unit;
2) call the storage management unit to acquire the required node address information;
3) perform the node's calculation;
4) execute the video memory adjustment strategy to adjust the storage management unit;
5) repeat the above process until the end-point calculation is completed.
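Putting the five steps together, the following hedged sketch is a pure-Python model of the execution loop; the `graph` shape (node → `(function, operand names)`) is an assumption. A value is discarded as soon as its out-degree reaches zero, mirroring the adjustment strategy in s42:

```python
def execute(graph, inputs):
    """Evaluate a computation flow graph: `inputs` holds source-node values,
    `graph` maps each computed node to (fn, operand names). Storage release
    is modeled by deleting a value once its remaining use count hits zero."""
    outdeg = {}                                      # remaining uses per node
    for fn, args in graph.values():
        for a in args:
            outdeg[a] = outdeg.get(a, 0) + 1
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        for node, (fn, args) in list(pending.items()):
            if all(a in values for a in args):       # data dependencies met
                values[node] = fn(*(values[a] for a in args))
                del pending[node]
                for a in args:                       # adjustment strategy:
                    outdeg[a] -= 1
                    if outdeg[a] == 0:
                        del values[a]                # block becomes idle
    return values                                    # only unconsumed results remain
```

At the end, only the end-point result (and any node whose result is never consumed) remains resident, just as only the final node's block stays meaningfully occupied on the GPU.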
The advantages of the invention are as follows: the GPU-based method for accelerating time-series computation flow graph processing optimizes the memory accesses of time-series processing and increases its speed. The chained storage management unit proposed in this embodiment acquires memory in constant time; for each calculation the input and output address spaces are obtained without manually allocating and releasing memory, which eases programming, and data need not be copied back and forth between host and device, greatly improving efficiency, while the unit's adjustment strategy greatly reduces memory waste. The traversal module provides an optimal node execution order that deconstructs the data dependencies of the time series, so nodes without data dependencies can run simultaneously. The execution module coordinates the traversal module and the chained storage unit to execute the computation flow graph and obtain the final result.
In order that those skilled in the art will better understand the present invention, embodiments of the present invention will be described below with reference to specific examples.
Fig. 2 is a flowchart of a GPU processing method for a time-series computation flow graph according to an embodiment of the present invention. As shown in fig. 2, taking the processing of time-series data according to a given computation flow graph as an example, the data computation flow graph processing method may include the steps of:
s1: calling an execution module;
s2: calling a traversal module to return the node;
s3: calling a storage management unit to acquire address information of the node;
s4: node computations are performed.
S5: the steps S2-S4 are repeated until the final node calculation is completed.
In order to improve the execution and memory access efficiency of time series processing in the memory access optimization of the time series processing, the method comprises the following steps: calling an execution module; calling a traversal module to obtain a node; calling a chained memory cell module to obtain address information of a node; performing a computation of a node; executing a video memory adjustment strategy to adjust the chained memory units; if the graph is traversed to the end point, the execution of the graph is ended, otherwise, the traversal module is continuously called, and the process is repeated.
FIG. 3 is a diagram illustrating the structure of a chained storage management unit according to an embodiment of the invention. As shown in fig. 3, the chained storage management unit in this example contains storage units for 5 data blocks; k1 to k5 are the keys of the table (in practice, node names), M1 to M5 are the video memory blocks (in practice, their addresses), and the table maps node names to the positions of their blocks in the chained storage unit. The chained storage unit may be a doubly linked structure, with a head pointer and a tail pointer pointing to its head and tail, respectively.
An example time-series computation flow graph may include nodes d1, d2, d3, …, d8, d9, d10, with d10 as the final node. FIG. 4 shows this computation flow graph according to an embodiment of the present invention; as shown in FIG. 4, nodes d1 to d10 are labeled 1 to 10, and arrows indicate the pointing relationships between them. Fig. 5 is a schematic diagram of the processing procedure for obtaining the final node's result; its six parts s1 to s6 represent initializing the source nodes, calculating the first-layer results, the second-layer results, the third-layer results, the final node's result, and fetching the final node's value. FIG. 6 is a schematic diagram of how the data in the storage unit changes; as shown in fig. 6, s11 to s88 show the data stored in the storage unit at initialization and while each node is calculated, where hatched portions are idle and unhatched portions are occupied. Referring to figs. 3 to 6, time-series data is processed according to the computation flow graph as follows:
in the above step s1, the traversal module is called to obtain the input source nodes d1, d2, d4 and d7 (s2 in fig. 5), and at the same time the storage management unit is initialized (s22 in fig. 6);
in the above step s2 (s3 in fig. 5), the nodes d3, d5 and d9 with in-degree 0 are accessed, and d3 = ts_add(d1, d2), d5 = ts_max(d3, 5) and d9 = ts_max(d7, 10) are calculated, where:
1) when node d3 is calculated there is no idle storage space, so a new video memory block has to be appended to the chained storage unit (s33 in fig. 6); after d3 is calculated, d1 and d2 become idle and those two positions are set to idle;
2) when node d5 is calculated (s44 in fig. 6), the head of the chained storage unit is checked: the block that originally held d1 is idle, so d5 now occupies that space, and the block is moved to the tail of the chained storage unit with its state set to occupied; at this point the data of node d4 is no longer needed, so its block is set to idle and moved to the head of the chained storage unit;
3) when node d9 is calculated (s55 in fig. 6), the head of the chained storage unit is checked: the block of d4 is idle, so d9 occupies that position and the block is moved to the tail of the chained storage unit;
in the above step s3, the traversal module is called and node d6 with in-degree 0 is accessed (s4 in fig. 5); d6 = ts_sub(d3, d5) is calculated (s66 in fig. 6), occupying the idle space of d2 and moving to the tail of the storage unit; at the same time, the data of nodes d5 and d3 is no longer needed, so their blocks are set to idle and moved to the head of the chained storage unit;
in the above step s4, the traversal module is called (s5 in fig. 5) and node d8 with in-degree 0 is accessed; d8 = ts_add(d6, d7) is calculated (s77 in fig. 6), occupying the idle space of d5 and moving to the tail of the storage unit; at the same time, the data of nodes d6 and d7 is no longer needed, so their blocks are set to idle and moved to the head of the chained storage unit;
in the above step s5, the traversal module is called (s6 in fig. 5) and node d10 with in-degree 0 is accessed; d10 = ts_add(d8, d9) is calculated (s88 in fig. 6), occupying the idle space of d6 and moving to the tail of the storage unit; at the same time, the data of nodes d8 and d9 is no longer needed, so their blocks are set to idle and moved to the head of the chained storage unit. Finally, the result of end point d10 is obtained, completing the processing of the computation flow graph.
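The block traffic in this example can be checked with a toy free-list pool (a hypothetical Python model, not GPU code; the exact block each node reuses depends on the list's ordering, so the check here is only that no sixth block is ever needed). Replaying the allocations and releases above shows that the ten-node graph runs within the five storage units of fig. 3:

```python
class Pool:
    """Toy free-list: reuse an idle block if one exists, else create one."""
    def __init__(self):
        self.free, self.count = [], 0
    def alloc(self):
        if self.free:
            return self.free.pop(0)
        self.count += 1
        return self.count - 1
    def release(self, addr):
        self.free.append(addr)

pool = Pool()
blk = {n: pool.alloc() for n in ("d1", "d2", "d4", "d7")}  # initialize sources
blk["d3"] = pool.alloc()                            # no idle block: a 5th is added
pool.release(blk["d1"]); pool.release(blk["d2"])    # d1, d2 consumed by d3
blk["d5"] = pool.alloc(); pool.release(blk["d4"])   # d5 reuses a freed block
blk["d9"] = pool.alloc()
blk["d6"] = pool.alloc(); pool.release(blk["d5"]); pool.release(blk["d3"])
blk["d8"] = pool.alloc(); pool.release(blk["d6"]); pool.release(blk["d7"])
blk["d10"] = pool.alloc()                           # end point d10
assert pool.count == 5   # peak demand matches the 5 storage units of fig. 3
```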
Based on the same inventive concept as the GPU processing method of the data computation flow graph shown in fig. 1, an embodiment of the present application further provides a GPU processing apparatus for the data computation flow graph, described in the following embodiments. Because the principle by which the apparatus solves the problem is similar to that of the method, the implementation of the apparatus may refer to the implementation of the method, and repeated parts are not described again.
Fig. 7 is a schematic diagram of a GPU processing device of a data computation flow graph according to an embodiment of the present invention. As shown in fig. 7, the GPU processing device of the data computation flow graph of some embodiments may include:
an input receiving unit 210, configured to receive input data and store the input data to a first video memory block allocated for a first source node in a computation flow graph;
a calculation preparation unit 220, configured to obtain a first expression of a first subsequent node pointed by the first source node, read the input data from the first video memory block, and allocate a second video memory block to the first subsequent node according to a video memory block allocated to a source node in the calculation flow graph;
a node calculating unit 230, configured to, when all data required to be input to the first expression has been acquired, input the read input data into the first expression for calculation to obtain a result of the first subsequent node, and store the result of the first subsequent node in the second video memory block;
a result output unit 240, configured to read the result of the first subsequent node from the second video memory block, obtain, from the read result of the first subsequent node, a result of processing the input data according to the computation flow graph, and output the processing result of the input data.
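As a hypothetical sketch, the four units of the apparatus in fig. 7 could be mirrored as methods of one class. The unit names and numbering (210-240) come from the description above, while every signature and the dictionary-backed "video memory" are illustrative assumptions:

```python
# Minimal sketch of the apparatus in fig. 7; a dict stands in for the
# per-node video memory blocks, and all method signatures are assumptions.
class FlowGraphGPUApparatus:
    def __init__(self):
        self.blocks = {}                       # node -> video memory block

    def input_receiving_unit_210(self, node, data):
        self.blocks[node] = data               # store input to first block

    def calculation_preparation_unit_220(self, src, succ):
        # read the source node's data and allocate a block for the successor
        self.blocks.setdefault(succ, None)
        return self.blocks[src]

    def node_calculating_unit_230(self, succ, expr, *operands):
        self.blocks[succ] = expr(*operands)    # compute and store the result

    def result_output_unit_240(self, succ):
        return self.blocks[succ]               # read and output the result

dev = FlowGraphGPUApparatus()
dev.input_receiving_unit_210("d1", 3)
x = dev.calculation_preparation_unit_220("d1", "d2")
dev.node_calculating_unit_230("d2", lambda a: a * 2, x)
print(dev.result_output_unit_240("d2"))        # prints 6
```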
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a GPU, implements the steps of the method described in the above embodiments.
The embodiment of the invention also provides a computer device, which includes a memory, a GPU, and a computer program stored in the memory and executable on the GPU, where the steps of the method of the above embodiments are implemented when the GPU executes the program.
In summary, according to the GPU processing method, the computer-readable storage medium, and the computer device for the data computation flow graph in the embodiments of the present invention, by allocating a video memory block to the source node in the computation flow graph, the input data copied from an external device can be temporarily stored on the GPU side; by allocating a video memory block to a subsequent node of the source node, the result of the subsequent node can likewise be temporarily stored on the GPU side; and because a video memory block is allocated to a subsequent node only when its result needs to be calculated, and is allocated with reference to the video memory blocks of the source nodes, the video memory of the GPU can be saved as much as possible, which facilitates processing complex computation flow graphs.
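The video-memory-block reuse summarized here can be sketched as a pool whose free blocks are kept at one end of a chained list: allocation checks that end first and only grows the pool when no free block exists. All names and the tuple-based block records are illustrative assumptions, not the claimed implementation:

```python
from collections import deque

# Sketch of the reuse scheme: freed blocks go to the first end of a
# linked list; allocation reuses a free block from that end if present,
# otherwise it adds a new block, which is placed at the second end.
class BlockPool:
    def __init__(self):
        self.chain = deque()        # linked list of (block_id, state)
        self.next_id = 0

    def allocate(self):
        # check the first end: reuse an idle block if one is there
        if self.chain and self.chain[0][1] == "free":
            block_id, _ = self.chain.popleft()
        else:
            block_id, self.next_id = self.next_id, self.next_id + 1
        self.chain.append((block_id, "occupied"))   # mark, move to second end
        return block_id

    def release(self, block_id):
        # mark the block free and move it to the first end of the chain
        self.chain = deque(e for e in self.chain if e[0] != block_id)
        self.chain.appendleft((block_id, "free"))

pool = BlockPool()
a, b = pool.allocate(), pool.allocate()    # two new blocks: 0 and 1
pool.release(a)                            # block 0 becomes free
c = pool.allocate()                        # reuses block 0
print(a, b, c)                             # 0 1 0
```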
In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. The sequence of steps involved in the various embodiments is provided to schematically illustrate the practice of the invention, and the sequence of steps is not limited and can be suitably adjusted as desired.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A GPU processing method of a data computation flow graph is characterized by comprising the following steps:
receiving input data and storing the input data to a first video memory block allocated for a first source node in a computation flow graph;
acquiring a first expression of a first subsequent node pointed by the first source node, reading the input data from the first video memory block, and allocating a second video memory block to the first subsequent node according to the video memory block allocated to the source node in the computation flow graph;
under the condition that all data required to be input to the first expression has been acquired, inputting the read input data into the first expression for calculation to obtain a result of the first subsequent node, and storing the result of the first subsequent node into the second video memory block;
reading the result of the first subsequent node from the second video memory block, obtaining a result of processing the input data according to the computation flow graph according to the read result of the first subsequent node, and outputting a processing result of the input data;
the first video memory block, the second video memory block and the video memory block allocated for the source node in the computation flow graph are all partitioned from the video memory of the same GPU.
2. The method of GPU processing of a data computation flow graph of claim 1, wherein before receiving input data and storing the input data to a first video memory block allocated for a first source node in the computation flow graph, the method further comprises:
and generating video memory blocks according to the number of source nodes of the computation flow graph, and allocating the generated video memory blocks to the source nodes in the computation flow graph, the first video memory block being the video memory block allocated to the first source node.
3. The method for GPU processing of a data computation flow graph of claim 1, wherein reading the result of the first successor node from the second video memory block, and obtaining the result of processing the input data according to the computation flow graph according to the read result of the first successor node, comprises:
reading the result of the first successor node from the second video memory block, obtaining a second expression of a second successor node pointed by the first successor node, and allocating a third video memory block for the second successor node according to the video memory block allocated for the source node in the computation flow graph and the second video memory block allocated for the first successor node;
under the condition that all data needing to be input in the second expression are acquired, inputting the result of reading the first successor node from the second video memory block into the second expression for calculation to obtain the result of the second successor node, and storing the result of the second successor node into the third video memory block;
and reading the result of the second subsequent node from the third video memory block, and obtaining the result of the input data processed according to the computation flow graph according to the read result of the second subsequent node.
4. The GPU processing method of a data computation flow graph of claim 3,
when all data required to be input in the first expression is acquired, inputting the read input data into the first expression for calculation to obtain a result of the first successor node, and before the result of the first successor node is stored in the second video memory block, the method further includes:
after the input data is read from the first video memory block, decrementing the in-degree of the first subsequent node by one, and judging, according to the current in-degree of the first subsequent node, whether all data required to be input to the first expression has been acquired;
under the condition that all data required to be input in the second expression is acquired, inputting the result of reading the first successor node from the second video memory block into the second expression for calculation to obtain the result of the second successor node, and before storing the result of the second successor node into the third video memory block, the method further includes:
and after the result of the first successor node is read from the second video memory block, decrementing the in-degree of the second successor node by one, and judging, according to the current in-degree of the second successor node, whether all data required to be input to the second expression has been acquired.
5. The GPU processing method of a data computation flow graph of claim 3,
before reading the result of the first subsequent node from the second video memory block, obtaining a result of processing the input data according to the computation flow graph according to the read result of the first subsequent node, and outputting a processing result of the input data, the method further includes:
after the result of the first subsequent node is obtained, decrementing the out-degree of the first source node by one, and under the condition that the current out-degree of the first source node is zero, deleting the first source node and releasing the first video memory block, or recycling the first video memory block to be allocated to a subsequent node in the computation flow graph to which no video memory block has been allocated;
before reading the result of the second subsequent node from the third video memory block and obtaining the result of the input data processed according to the computation flow graph according to the read result of the second subsequent node, the method further includes:
and after the result of the second successor node is obtained, decrementing the out-degree of the first successor node by one, and under the condition that the current out-degree of the first successor node is zero, deleting the first successor node and releasing the second video memory block, or recycling the second video memory block to be allocated to a successor node in the computation flow graph to which no video memory block has been allocated.
6. The method of GPU processing of a data computation flow graph of claim 3, wherein obtaining a result of processing said input data according to said computation flow graph based on a result of reading said second successor node, comprises:
and judging, according to the current out-degree of the second subsequent node, whether the second subsequent node is the final node in the computation flow graph, and if so, taking the read result of the second subsequent node as the result of processing the input data according to the computation flow graph.
7. The method of GPU processing of a data computation flow graph of claim 1, wherein allocating a second video memory block for the first successor node according to the video memory block allocated for the source node in the computation flow graph comprises:
under the condition that the video memory blocks allocated to the source nodes in the computation flow graph are all occupied, adding a new video memory block, and allocating the added new video memory block to the first successor node as the second video memory block; and under the condition that an idle video memory block exists among the video memory blocks allocated for the source nodes in the computation flow graph, allocating the idle video memory block to the first successor node as the second video memory block.
8. The GPU processing method of a data computation flow graph of claim 7, wherein the video memory blocks allocated for source nodes in the computation flow graph are connected by a linked list;
when an idle video memory block exists among the video memory blocks allocated to the source nodes in the computation flow graph, allocating the idle video memory block to the first subsequent node as the second video memory block comprises:
if a video memory block allocated for a source node in the computation flow graph changes from the occupied state to the idle state, making the video memory block that has become idle the first end of the linked list;
checking whether the video memory block at the first end of the linked list is in an idle state, if so, distributing the video memory block at the first end to the first subsequent node to be used as the second video memory block, enabling the second video memory block to become the second end of the linked list, and marking the state of the second video memory block as occupied;
when all the video memory blocks allocated to the source nodes in the computation flow graph are occupied, adding a new video memory block, and allocating the added new video memory block to the first successor node as the second video memory block comprises:
checking whether a video memory block at the first end of the linked list is in an idle state, if not, adding a new video memory block at the second end of the linked list, distributing the added new video memory block to the first subsequent node to be used as the second video memory block, and marking the state of the second video memory block as occupied;
the first end is a position pointed by one of a head pointer and a tail pointer of the linked list, and the second end is a position pointed by the other of the head pointer and the tail pointer of the linked list.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a GPU, carries out the steps of the method according to any of claims 1 to 8.
10. A computer device comprising a memory, a GPU and a computer program stored on the memory and executable on the GPU, characterized in that the steps of the method according to any of claims 1 to 8 are implemented by the GPU when executing said program.