WO2021196745A1 - Data processing device, integrated circuit, and AI accelerator - Google Patents

Data processing device, integrated circuit, and AI accelerator

Info

Publication number
WO2021196745A1
WO2021196745A1 (PCT/CN2020/136960)
Authority
WO
WIPO (PCT)
Prior art keywords
data
heap
stack
units
unit
Prior art date
Application number
PCT/CN2020/136960
Other languages
English (en)
French (fr)
Inventor
张启荣
王文强
胡英俊
蒋科
Original Assignee
上海商汤智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Priority to JP2021557465A priority Critical patent/JP2022531075A/ja
Priority to KR1020217031349A priority patent/KR20210129715A/ko
Publication of WO2021196745A1 publication Critical patent/WO2021196745A1/zh

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/901 - Indexing; Data structures therefor; Storage structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 - Improving or facilitating administration, e.g. storage management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/25 - Integrating or interfacing systems involving database management systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 - Organizing or formatting or addressing of data
    • G06F 3/0644 - Management of space entities, e.g. partitions, extents, pools
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 - Interfaces specially adapted for storage systems
    • G06F 3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0655 - Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 - Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/22 - Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F 7/24 - Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general

Definitions

  • The present disclosure relates to the field of data processing technology, and in particular to data processing devices, integrated circuits, and artificial intelligence (AI) accelerators.
  • Heap sort refers to a sorting method designed around the heap data structure.
  • the present disclosure provides data processing devices, integrated circuits, and AI accelerators.
  • According to a first aspect, a data processing device is provided, comprising: a plurality of heap storage units, each heap storage unit being used to store the data of a group of nodes of the heap, the group of nodes including at least some of the nodes in the same layer of the heap; and a plurality of heap adjustment units, each of which is used to access at least two heap storage units so as to sort the input original data together with the data stored in the at least two heap storage units.
  • According to a second aspect, an integrated circuit is provided, including the data processing device described in the first aspect.
  • According to a third aspect, an AI accelerator is provided, and the AI accelerator includes the integrated circuit described in the second aspect.
  • Because the embodiments of the present disclosure store the data of the nodes of the heap in multiple heap storage units whose data can be read and written independently, later data can be put into the heap while earlier data is still being sorted by the multiple heap adjustment units; sorting can therefore proceed during heap building, which improves sorting efficiency.
  • Figure 1A is a schematic diagram of a heap of some embodiments.
  • Figure 1B is a schematic diagram of the heap sorting process of some embodiments.
  • Fig. 2 is a schematic diagram of a data processing device according to an embodiment of the present disclosure.
  • FIG. 3A and FIG. 3B are schematic diagrams of data storage manners according to embodiments of the present disclosure, respectively.
  • Fig. 4 is a schematic diagram of a data processing device according to other embodiments of the present disclosure.
  • 5A to 5F are schematic diagrams of data changes during the heap sorting process of an embodiment of the present disclosure.
  • Fig. 6 is a schematic diagram of a data flow process of an embodiment of the present disclosure.
  • Although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
  • For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • The word "if" as used herein can be interpreted as "when", "upon", or "in response to determining".
  • Heap sort is widely used to deal with sorting problems. Heap sort refers to a sorting method designed around the heap data structure. As shown in Figure 1A, the heap is an approximately complete binary tree; when the heap is a min-heap, the data of each node is always less than or equal to the data of its child nodes, and when the heap is a max-heap, the data of each node is always greater than or equal to the data of its child nodes.
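The min-heap and max-heap properties described above can be stated compactly in Python, using the usual array representation in which node i has children 2i+1 and 2i+2 (an illustration, not part of the patent):

```python
# Check the heap property over the array representation of a binary heap:
# node i's parent sits at index (i - 1) // 2.
def is_min_heap(a):
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))

def is_max_heap(a):
    return all(a[(i - 1) // 2] >= a[i] for i in range(1, len(a)))

assert is_min_heap([1, 3, 2, 7, 4])
assert is_max_heap([9, 7, 8, 1, 3])
assert not is_min_heap([5, 2, 6])   # 5 > 2 violates the min-heap property
```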
  • In a conventional design, a single complete storage unit can be used to store the entire heap, that is, the data of every node of the heap is stored in the same storage unit. Due to read-write conflicts, the data of only one node and its child nodes can be sorted at a time.
  • Fig. 1B is a schematic diagram of a heap including 5 nodes, where the data of these 5 nodes are all stored in the same storage unit, namely mem in the figure.
  • After the data are exchanged, the sorted max-heap is obtained, as shown in the schematic diagram in the lower left corner.
  • Then the data at the top of the heap (that is, the root node of the heap) is written out from the storage unit, and the above sorting process is repeated on the remaining data until the data of every node in the heap has been written out from the storage unit. It can be seen that the sorting efficiency of this heap sorting method is low.
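The conventional single-storage-unit flow, in which the heap top is written out and the remaining data is re-sorted one step at a time, can be sketched with Python's standard heapq module (shown with a min-heap for brevity; the hardware names are not implied):

```python
import heapq

# Conventional heap sort with a single storage area: the whole heap lives in
# one list ("mem"), so only one node-vs-children comparison is resolved at a
# time while the heap top is repeatedly written out.
def heap_sort(values):
    mem = list(values)
    heapq.heapify(mem)                   # build the min-heap in place
    out = []
    while mem:
        out.append(heapq.heappop(mem))   # write out the top, re-sift the rest
    return out

assert heap_sort([5, 1, 4, 2, 3]) == [1, 2, 3, 4, 5]
```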
  • The device may include multiple heap storage units 201 and multiple heap adjustment units 202.
  • Each of the plurality of heap storage units 201 is used to store the data of a group of nodes of the heap, and the group of nodes includes at least some of the nodes in the same layer of the heap.
  • Each of the plurality of heap adjustment units 202 is used to access at least two heap storage units to sort the input original data together with the data stored in the at least two heap storage units.
  • In a conventional design, data enters the heap from the bottom and is then sorted from the top; heap building and sorting are therefore performed separately, and sorting cannot proceed in parallel while data is entering the heap.
  • In the embodiments of the present disclosure, the data of the nodes of the heap is stored in multiple heap storage units 201 that can be read and written independently; while earlier data is being sorted by the multiple heap adjustment units 202, later data can be put into the heap, so sorting can proceed during heap building, which improves sorting efficiency.
  • For the last heap adjustment unit, heap adjustment unit n shown in Figure 2, although it is likewise connected to two heap storage units, namely heap storage unit n and heap storage unit n+1, no adjustment unit writes data to heap storage unit n+1; therefore, heap adjustment unit n does not actually read data from heap storage unit n+1.
  • The heap storage unit n+1 may be a virtual storage unit, or a storage unit similar to the other heap storage units.
  • FIG. 2 schematically illustrates the data flow direction in which the heap adjustment units access the heap storage units during sorting.
  • The present disclosure does not limit heap adjustment unit i to only writing data to heap storage unit i, and/or to only reading data from heap storage unit i+1.
  • FIG. 3A is a schematic diagram of a heap including 4 layers of nodes and of the storage mode of the data of each node in the heap.
  • The i-th heap storage unit can be used to store the data of all nodes located in the i-th layer of the heap.
  • For example, the first heap storage unit is used to store the data of the layer-1 node P11 of the heap, the second heap storage unit is used to store the data of the layer-2 nodes P21 and P22 of the heap, and so on.
  • the embodiment shown in FIG. 3A is only one possible implementation of the present disclosure, and the present disclosure is not limited thereto.
  • For example, the data of all nodes in any layer of the heap can also be stored across multiple heap storage units.
  • For instance, the heap storage unit that stores the data of nodes P31 and P32 may be different from the heap storage unit that stores the data of nodes P33 and P34.
  • Each heap adjustment unit can access two heap storage units, where the two heap storage units are used to store the data of some or all nodes in two adjacent layers of the heap.
  • For example, heap adjustment unit 1 can access the first and second heap storage units, heap adjustment unit 2 can access the second and third heap storage units, heap adjustment unit 3 can access the third and fourth heap storage units, and so on.
  • Each heap adjustment unit may also access more than two heap storage units to sort the data in those heap storage units, where the data in the two or more heap storage units can be the data of some or all of the nodes of two adjacent layers, or the data of some or all of the nodes of three or more adjacent layers.
  • Each heap adjustment unit can also sort at least part of the data in any two or more non-adjacent layers of the heap, to meet the sorting requirements of different application scenarios, which will not be repeated here.
  • At least two of the plurality of heap adjustment units may sort in parallel, thereby improving data processing efficiency.
  • Of course, the multiple heap adjustment units may also sort the data in the multiple heap storage units serially.
  • The heap storage units accessed by at least two heap adjustment units that sort in parallel are different from each other.
  • For example, the heap storage units accessed by heap adjustment unit 2 include the second and third heap storage units, and the heap storage units accessed by heap adjustment unit 3 include the third and fourth heap storage units; since the heap storage units accessed by heap adjustment unit 2 and heap adjustment unit 3 both include the third heap storage unit, these two units do not sort in parallel.
  • In contrast, the heap storage units accessed by heap adjustment unit 1 include the first and second heap storage units, and the heap storage units accessed by heap adjustment unit 3 include the third and fourth heap storage units; the heap storage units accessed by these two heap adjustment units do not include any heap storage unit in common, so heap adjustment unit 1 and heap adjustment unit 3 can sort in parallel.
  • The heap storage units respectively accessed by two adjacent heap adjustment units among the plurality of heap adjustment units include one identical heap storage unit.
  • For example, the heap storage units accessed by heap adjustment unit 1 include the first and second heap storage units, the heap storage units accessed by heap adjustment unit 2 include the second and third heap storage units, and so on.
  • While heap adjustment unit 1 accesses the second heap storage unit, heap adjustment unit 2 can access the third heap storage unit; while heap adjustment unit 2 accesses the second heap storage unit, heap adjustment unit 1 can access the first heap storage unit, thus avoiding data read-write conflicts.
  • As another example, the heap storage units accessed by heap adjustment unit 1 include the first to third heap storage units, the heap storage units accessed by heap adjustment unit 2 include the third to fifth heap storage units, and so on.
  • While heap adjustment unit 1 accesses the third heap storage unit, heap adjustment unit 2 can access the fourth or fifth heap storage unit.
  • At least one heap adjustment unit is spaced between any two heap adjustment units that sort in parallel. For example, heap adjustment unit 1, which accesses the first and second heap storage units, and heap adjustment unit 3, which accesses the third and fourth heap storage units, are separated by heap adjustment unit 2, which accesses the second and third heap storage units, so heap adjustment unit 1 and heap adjustment unit 3 can sort in parallel.
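The parallelism rule above can be sketched in Python (an illustration of the rule, not the hardware): with heap adjustment unit i accessing heap storage units i and i+1, two units may run in the same cycle only when their storage-unit sets are disjoint, i.e. when at least one adjustment unit lies between them.

```python
# Two pipeline stages may sort in parallel only if the heap storage units
# they access are disjoint: unit i touches storage units {i, i + 1}.
def can_run_in_parallel(i, j):
    units_i = {i, i + 1}
    units_j = {j, j + 1}
    return units_i.isdisjoint(units_j)   # equivalent to abs(i - j) >= 2

assert can_run_in_parallel(1, 3)         # no shared storage unit
assert not can_run_in_parallel(2, 3)     # both touch storage unit 3
```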
  • Each time one piece of data is put into the heap, the newly entered data and the data stored in the plurality of heap storage units can be sorted by the plurality of heap adjustment units.
  • Each of the plurality of heap adjustment units can acquire data and sort the acquired data together with the data in at least one of the at least two heap storage units it accesses.
  • Taking as an example the case where each heap storage unit stores the data of all nodes in one layer of the heap and the heap storage units accessed by each heap adjustment unit store the data of two adjacent layers of the heap, the scheme of the embodiment is described below. Suppose heap adjustment unit i is used to access the i-th and (i+1)-th heap storage units, where i is a positive integer.
  • the sorting method in other cases is similar to the above case, and will not be repeated here.
  • In some embodiments, ceil(log2(k)) heap adjustment units are used to form a heap adjustment pipeline, where ceil represents the round-up operation and k is the total number of ordered data that needs to be obtained, that is, the k in the aforementioned top-k sorting problem.
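As a quick illustration of the formula above (plain Python, not part of the patent):

```python
import math

# Number of heap adjustment units needed for a top-k pipeline: ceil(log2(k)).
def pipeline_depth(k):
    return math.ceil(math.log2(k))

assert pipeline_depth(8) == 3       # a heap of 8 elements needs 3 stages
assert pipeline_depth(1000) == 10   # log2(1000) ~ 9.97, rounded up
```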
  • The original data d1 is first input to heap adjustment unit 1, which sorts d1 together with the previously stored data of at least one of the first and second heap storage units, and outputs data d1' to heap adjustment unit 2 according to the sorting result, where d1' can be the original data d1 or a piece of data from the second heap storage unit.
  • The data d1' is input to heap adjustment unit 2 as its original data; heap adjustment unit 2 sorts d1' together with the data of at least one of the second and third heap storage units and outputs data d1'' to heap adjustment unit 3 according to the sorting result, and so on.
  • When the heap is a min-heap and the heap is full of data, heap adjustment unit 1 first compares the original data d1 with the data of the two child nodes of the root node P11, and writes the smallest of these data (assumed to be the data of the left child node P21 of the root) into the heap storage unit position corresponding to the root node. Then the original data d1 is used as the original data of heap adjustment unit 2; heap adjustment unit 2 compares d1 with the data of the two child nodes of node P21 and writes the smallest of these data (assumed to be the data of the left child node P31 of P21) into the position corresponding to node P21, and so on.
  • In some examples, heap adjustment unit 1 first compares the original data d1 with the data of the two child nodes of the root node P11; if d1 is smaller than both child nodes of P11, d1 can be further compared with the data of the root node P11. If d1 is less than or equal to the data of the root node P11, d1 is directly discarded; if d1 is greater than the data of the root node P11, d1 is stored in the first heap storage unit, and the subsequent heap adjustment units do not need to be started. In this case, heap adjustment unit 1 can read the data of the first heap storage unit.
  • When the heap is a max-heap and the heap is full of data, heap adjustment unit 1 first compares the original data d1 with the data of the two child nodes of the root node P11, and writes the largest of these data (assumed to be the data of the left child node P21 of the root) into the position corresponding to the root node. Then d1 is used as the original data of heap adjustment unit 2; heap adjustment unit 2 compares d1 with the data of the two child nodes of node P21 and writes the largest of these data (assumed to be the data of the left child node P31 of P21) into the position corresponding to node P21, and so on.
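The per-stage behavior described above (min-heap, heap full, root already written out) can be modeled in software. This is a sketch of the algorithm with one list per heap layer, not the hardware pipeline; the function and variable names are illustrative:

```python
# One heap-adjustment stage: compare the incoming datum d with the two
# children of the current node, promote the smaller child upward, and hand
# d on to the next stage until d settles (at a node smaller than both
# children, or at a leaf).
def adjust_stage(layers, layer, idx, d):
    left, right = 2 * idx, 2 * idx + 1       # children's indices in layer+1
    children = []
    if layer + 1 < len(layers):
        children = [(layers[layer + 1][c], c) for c in (left, right)
                    if c < len(layers[layer + 1])]
    if not children or d <= min(children)[0]:
        layers[layer][idx] = d               # d settles at the current node
        return None
    cmin, cidx = min(children)
    layers[layer][idx] = cmin                # promote the smaller child
    return (layer + 1, cidx, d)              # pass d to the next stage

layers = [[1], [2, 3], [4, 5, 6, 7]]  # a full min-heap, one list per layer
state = (0, 0, 9)                     # root 1 written out, datum 9 enters
while state is not None:
    state = adjust_stage(layers, *state)
# layers is now [[2], [4, 3], [9, 5, 6, 7]], a valid min-heap again
```

In the hardware described, successive stages belong to different heap adjustment units, so a new datum can enter stage 1 while the previous datum is still moving through stage 2 or 3.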
  • In some embodiments, the data of the child nodes of the same node of the heap is stored at the same address of the same heap storage unit.
  • For example, if the data bit length is n, the data of the left child node can be stored in the low n bits of the corresponding storage address and the data of the right child node of the same node in the high n bits, with the bit width of the heap storage unit being twice the data bit length.
  • For example, the data of node P11 is stored in heap storage unit mem1; the data of the two child nodes of node P11 is stored at the same address of heap storage unit mem2; the data of the two child nodes of node P21 is stored at one address of heap storage unit mem3 (such as the first row of mem3); and the data of the two child nodes of node P22 is stored at another address of mem3 (such as the second row of mem3).
  • In this way, the data of the two child nodes of the same node can be read from the same storage address of the same storage unit in one clock cycle, thereby reducing the number of data reads and improving data processing efficiency.
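Assuming, for illustration, a data bit length of n = 16, the layout above can be sketched as follows (the word width 2n and the constant names are assumptions for the example):

```python
# Pack two sibling nodes into one 2n-bit word: left child in the low n bits,
# right child in the high n bits, so one read returns both children.
N = 16
MASK = (1 << N) - 1

def pack(left, right):
    return (right << N) | (left & MASK)

def unpack(word):
    return word & MASK, (word >> N) & MASK

word = pack(0x1234, 0xBEEF)
assert unpack(word) == (0x1234, 0xBEEF)
```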
  • the device may further include: a pre-processing unit configured to perform pre-screening processing on the raw data obtained from the data storage device.
  • the pre-screened data is input to the subsequent heap adjustment unit.
  • Pre-screening refers to filtering out of the original data the data that does not need to enter the heap. Pre-screening reduces the number of times data enters the heap, thereby improving data processing efficiency; the larger the amount of input data, the more obvious the benefit, especially in the aforementioned top-k sorting scenario.
  • the data storage device may be a memory located outside the device provided by the present disclosure, and the external memory is connected to the data processing device of the present disclosure.
  • the present disclosure does not limit the type of external memory.
  • It can be a volatile memory, such as RAM (Random Access Memory), SDRAM (Synchronous Dynamic RAM), or DDR (Double Data Rate) SDRAM, or it can be a non-volatile memory, such as a hard drive, a removable hard drive, or a magnetic disk.
  • the pre-processing unit may perform a pre-screening process on the newly acquired original data when the data stored in the heap storage unit reaches a preset amount.
  • Before that, the pre-processing unit may directly output the original data to the plurality of heap adjustment units.
  • The preset amount may be equal to the total number of data that can be stored in the heap storage units, that is, newly acquired original data is pre-screened only when the multiple heap storage units are full.
  • Alternatively, the number of enabled heap storage units may be determined according to the amount of original data, and newly acquired original data is pre-screened only when the enabled heap storage units are full.
  • When the amount of original data is less than the total number of data that can be stored in all heap storage units, only some of the heap storage units are enabled, so that the total number of data that can be stored by the enabled heap storage units matches the amount of original data; otherwise, all heap storage units can be enabled.
  • The pre-processing unit may pre-screen the original data by comparing the acquired original data with the data of the root node of the heap, so as to determine in advance whether the original data needs to enter the heap.
  • When the heap is a min-heap, the data of the root node of the heap is less than or equal to the data of any other node.
  • If a piece of original data is less than or equal to the data of the root node, it must also be no greater than the data of any other node of the heap, so it does not need to be sorted by the heap adjustment units; only when a piece of original data is greater than the data of the root node does it need to be sorted by the heap adjustment units. Therefore, when the acquired original data is less than or equal to the data of the root node of the heap, it is determined that the original data does not need to enter the heap; otherwise, it is determined that the original data needs to enter the heap.
  • Conversely, when the heap is a max-heap, if the acquired original data is greater than or equal to the data of the root node of the heap, it is determined that the original data does not need to enter the heap; otherwise, it is determined that the original data needs to enter the heap.
  • In the scenario of selecting the k largest data, using the min-heap can effectively improve data processing efficiency; in the scenario of selecting the k smallest data, using the max-heap can effectively improve data processing efficiency.
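The min-heap pre-screening rule for keeping the k largest values can be sketched with Python's standard heapq (a software analogue of the described data path, not the hardware):

```python
import heapq

# Select the k largest values with a k-element min-heap. Once the heap is
# full, a new value enters only if it exceeds the heap top (the root);
# otherwise it is dropped (or, in the device, written back for a later round).
def top_k(stream, k):
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)      # heap not yet full: always enter
        elif x > heap[0]:                # pre-screen against the root
            heapq.heapreplace(heap, x)   # evict the root, sift x down
    return sorted(heap, reverse=True)

assert top_k([7, 2, 9, 4, 8, 1, 6], 3) == [9, 8, 7]
```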
  • There may be multiple pre-processing units, and the multiple pre-processing units may pre-screen the acquired original data in parallel.
  • For original data that needs to enter the heap, the pre-processing unit may transmit it to the first cache unit or to the heap adjustment units.
  • For example, the original data may first be transmitted to the first cache unit, and the original data in the first cache unit is then sequentially output to the plurality of heap adjustment units for sorting; alternatively, the pre-processing unit may directly output the original data that needs to enter the heap to the plurality of heap adjustment units for sorting.
  • For original data that does not need to enter the heap, the pre-processing unit may delete it.
  • In some embodiments, the pre-processing unit returns the original data that does not need to enter the heap to the data storage device, and the heap adjustment units return the original data squeezed out during sorting to the data storage device, thereby removing the limitation the heap storage units place on the amount of ordered data that can be output and improving the versatility of the data processing device.
  • By deleting the data that does not need to enter the heap, storage space can be saved; the original data returned to the data storage device can be conveniently reused in subsequent processing.
  • The plurality of heap adjustment units may sort the data returned to the data storage device again once all the data in the plurality of heap storage units has been sorted.
  • The amount of ordered data output by one sort is limited by the capacity of the heap; for example, due to the number of layers of the heap, the number of heap adjustment units, and the size of the heap storage units, a single sort may not be able to output a sufficient amount of ordered data.
  • The device provided by the embodiments of the present disclosure supports writing the unselected original data (such as the original data that did not enter the heap and the original data squeezed out after entering the heap) back to the data storage device during sorting so that multiple rounds of sorting can be performed, thereby improving the versatility of the data processing device.
  • a first round of sorting may be performed on the data that has entered the heap, and after the first round of sorting, the next round of sorting is performed on the unselected data in the first round of sorting. Further, in the second round of sorting, the same processing can be performed as in the first round of sorting, including pre-screening again. In this way, multiple rounds of sorting can be carried out until a certain stopping condition is met.
  • the stopping condition may be that all the original data to be sorted are sorted.
  • the stop condition may also be that the number of sorted data reaches the required number.
  • In this way, a data processing device with a smaller heap capacity can be used to sort a larger amount of original data, which avoids sorting failures caused by insufficient heap capacity and broadens the application range of the data processing device.
  • the sorting process in the second round and after the second round is the same as the sorting process in the first round, and will not be repeated here.
  • When the capacity of the data storage device is limited, the original data can also be written into the data storage device in batches, and each batch written into the data storage device is pre-screened and sorted; in this way, a larger amount of data can be sorted through a data storage device with a smaller capacity, avoiding sorting failures caused by insufficient capacity of the data storage device.
  • The data processing device further includes a second cache unit configured to cache the original data obtained from the data storage device and to send the cached original data to the plurality of heap adjustment units; the plurality of heap adjustment units are used to sort the original data obtained from the second cache unit together with the data in the plurality of heap storage units.
  • the second cache unit may obtain one or more original data from the data storage device each time, and cache the obtained original data.
  • the first caching unit may obtain one or more original data from the preprocessing unit each time, and cache the obtained original data.
  • the first buffer unit and the second buffer unit may be FIFO (First In First Out) buffer units.
  • FIG. 4 is a schematic diagram of a data processing device according to other embodiments of the present disclosure.
  • the data processing device includes n+1 heap storage units 201, n heap adjustment units 202, 1 first cache unit 203, and 4 preprocessing units 204.
  • Each heap storage unit is used to store the data of one layer of nodes of the heap.
  • the heap adjustment unit i is used to access the i-th heap storage unit and the (i+1)th heap storage unit.
  • Under the top-k sorting task, the data path is as follows.
  • The original data enters in parallel (assuming a parallelism of 4); before the heap is full, the original data directly enters the first cache unit 203.
  • After the heap is full, each pre-processing unit compares the input original data with the data at the top of the current heap (that is, the root node of the heap); when the heap is a min-heap, the data greater than the top of the heap is output to the first cache unit 203, and the data less than or equal to the top of the heap is written back to an external data storage device (not shown in the figure) through the first output terminal for multiple rounds of sorting.
  • Heap adjustment unit 1 fetches data from the first cache unit 203, and the multiple heap adjustment units perform heap adjustment in parallel to adjust the data in the heap into a min-heap; the data squeezed out of the heap can be written back to the data storage device through the second output terminal for multiple rounds of sorting. The above process is repeated until all the original data has been put into the heap.
  • the commands executed by the device in this example are as follows.
  • each heap storage unit may include flag bits, used to indicate whether the data at the corresponding positions in the heap storage unit is valid.
  • the heap storage unit mem1 includes the flag bit of the data of node P11, shown as the black square of flg1 in the figure; the heap storage unit mem2 includes the flag bits of the data of node P21 and node P22, shown as flg2 in the figure, where the flag of P21 is represented by a black square and the flag of P22 by a gray square, and so on.
  • in the case where a storage unit can store N data items, the storage unit can include N flag bits.
  • the data in the heap storage unit being valid means that the data needs to be sorted; the data in the heap storage unit being invalid means that the data does not need to be sorted.
  • when the data in the heap storage unit is valid, the flag bit is a first value; when the data is invalid, the flag bit is a second value. For example, the first value may be "1" and the second value may be "0".
  • a common heap sorting method initializes the data in each heap storage unit, and as the depth of the heap increases, the initialization time also increases.
  • the embodiments of the present disclosure instead use flag bits: before writing data to a heap storage unit, only the flag bits in that unit are initialized, so the data itself need not be initialized. Since the bit length of a flag bit is smaller than that of the original data (for example, a flag bit can be 1 bit), in some cases only 1 clock cycle is needed to initialize the flag bits of all heap storage units.
  • initializing the flag bits takes less time than initializing the data in the heap storage units, thereby improving the efficiency of data processing.
  • each time a valid datum is written into the heap storage unit, the flag bit of that datum can be updated, that is, set from invalid to valid, so that whether the data in the heap storage unit is valid can be determined from the data's flag bit.
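A minimal software sketch of the flag-bit scheme follows (the class name `HeapLayer` is ours; in hardware the flag word would be a register cleared in a single cycle):

```python
class HeapLayer:
    # One heap storage unit with per-slot validity flags. Clearing the
    # flag word is a single assignment, so the data array itself never
    # needs to be initialized.
    def __init__(self, n_slots):
        self.data = [None] * n_slots
        self.flags = 0                 # one bit per slot, all invalid

    def initialize(self):
        self.flags = 0                 # a single write clears every flag

    def write(self, pos, value):
        self.data[pos] = value
        self.flags |= 1 << pos         # mark this slot valid

    def is_valid(self, pos):
        return bool(self.flags >> pos & 1)
```

Only slots whose flag is set participate in sorting; an invalid slot is simply overwritten by incoming data.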
  • each of the plurality of heap adjustment units is further used to: when the flag bits of the first heap storage unit indicate that the data at the corresponding positions is all valid, sort the original data input to the heap adjustment unit against the valid data; and when the flag bits of the first heap storage unit indicate that the data at the corresponding positions includes any invalid data, write the original data input to the heap adjustment unit into the position of that invalid data.
  • the first heap storage unit is, of the at least two heap storage units accessed by the heap adjustment unit, the one closer to the root node.
  • when the heap storage unit accessed by the heap adjustment unit contains multiple invalid data items, the data input to the heap adjustment unit is written into the leftmost position corresponding to invalid data, in left-to-right order.
  • the heap-exit process of the embodiments of the present disclosure is handled similarly to the heap-entry process: one piece of data is input to the multiple heap adjustment units, and the multiple heap adjustment units then sort the input data against the data stored in the heap storage units.
  • each heap adjustment unit of the plurality of heap adjustment units can access at least two heap storage units and sort the obtained designated data against the data stored in the at least two heap storage units,
  • so that the data stored in the at least two heap storage units exits the heap.
  • the heap-exit process is similar to the sorting process and is also executed in parallel.
  • one designated datum can be input each time.
  • in a min-heap scenario, the value of the designated data can be greater than every datum stored in the multiple heap storage units;
  • for example, the designated data may be data with a value of +∞.
  • the so-called +∞ data can be the maximum value in the data format of the original data; for example, for a 16-bit floating-point number, 0x7C00 can represent +∞.
  • in a max-heap scenario, the value of the designated data may be smaller than every datum stored in the multiple heap storage units; for example, the designated data may be data with a value of −∞.
  • the so-called −∞ data can be the minimum value in the data format of the original data; for example, for a 16-bit floating-point number, 0xFC00 can represent −∞.
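The two bit patterns named above can be checked against IEEE 754 half precision with Python's `struct` module (`'e'` is the half-float format code):

```python
import struct

# Decode the 16-bit patterns from the text as IEEE 754 half-precision
# values: 0x7C00 should decode to +inf and 0xFC00 to -inf.
pos_inf = struct.unpack('<e', (0x7C00).to_bytes(2, 'little'))[0]
neg_inf = struct.unpack('<e', (0xFC00).to_bytes(2, 'little'))[0]
```

This is only a verification aid; the device itself would compare the raw 16-bit words directly.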
  • the above-mentioned initialization, stacking, and stacking processes can be controlled by different instructions respectively.
  • in a conventional heapsort scheme, the entire sorting process is completed by one instruction, and once the parameters are fixed, the versatility of the data processing device is poor.
  • in the embodiments of the present disclosure, one round of sorting is divided into the three processes of initialization, heap entry, and heap exit, corresponding to three kinds of instructions respectively.
  • there can be multiple heap-entry instructions in a single round of sorting (the original data can be input in multiple batches). This eliminates the data storage device's limitation on the quantity of original data and enables the heap adjustment units and the preprocessing units to run in parallel, which is more flexible in use.
  • the instructions for initialization, heap entry, and heap exit can be sent by an upper-level controller to the heap control unit in the data processing device and carried out under the control of the heap control unit.
  • the device further includes a heap control unit configured to perform at least any one of the following operations: upon receiving an initialization instruction, controlling the multiple heap storage units to initialize in the same clock cycle; upon receiving a heap-entry instruction, reading original data from the data storage device and transmitting the read original data to the plurality of heap adjustment units, so that the plurality of heap adjustment units sort the original data against the data in the multiple heap storage units; and upon receiving a heap-exit instruction, controlling the multiple heap adjustment units to output the data in the multiple heap storage units from the heap top in a specific order.
  • the heap control unit may send an initialization signal to the heap storage unit to initialize each flag bit in the heap storage unit.
  • upon receiving a heap-entry instruction, the heap control unit can read the original data from the data storage device and output it to the preprocessing unit, and the preprocessing unit determines whether the original data needs pre-screening. If so, the preprocessing unit directly deletes the original data that does not need to enter the heap or returns it to the data storage device, and outputs the data that needs to enter the heap to the first cache unit; if pre-screening is not required, the original data is output directly to the first buffer unit.
  • the heap adjustment unit receives the original data from the first cache unit and adjusts the data in the heap storage units level by level according to the magnitude of the original data, until all the original data that needs to be sorted has been processed.
  • upon receiving a heap-exit instruction, the heap control unit outputs the designated data to the heap adjustment units; the heap adjustment units receive the designated data and adjust the data in the heap storage units level by level. After each designated datum enters the heap, one datum in the heap storage units (that is, the data of the heap's root node) is squeezed out of the heap, and the heap control unit outputs the squeezed-out data in turn to the data output port of the data processing device.
  • 5A to 5F are schematic diagrams of node data changes during the sorting process of the embodiments of the present disclosure.
  • This embodiment takes the smallest heap as an example for description.
  • the sorting process of the largest heap is similar to that of the smallest heap, and will not be repeated here.
  • the depth of the heap is 6, that is, the heap includes 6 layers of nodes; the data of the nodes in each layer is stored in an independent heap storage unit, and the data of all child nodes of the same node is stored at the same address of the same heap storage unit.
  • the heap storage unit corresponding to the i-th layer node is heap storage unit i
  • the heap adjustment unit that accesses heap storage unit i and heap storage unit i+1 is heap adjustment unit i
  • each node of the i-th layer is denoted Pij, 1 ≤ j ≤ 2^(i−1), where i is a positive integer.
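Under this labelling, a 1-based level-order heap index maps to the pair (i, j) as follows (the helper name `layer_of` is ours, added for illustration):

```python
def layer_of(index):
    # Map a 1-based level-order heap index to the document's (i, j)
    # labelling Pij, with layer i counted from the root and
    # 1 <= j <= 2**(i-1) nodes in layer i.
    i = index.bit_length()          # layer number; the root (index 1) is layer 1
    j = index - (1 << (i - 1)) + 1  # position within the layer
    return i, j
```

For example, index 7 is the fourth node of layer 3, i.e. P34.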
  • the heap at the initial time t0 is as shown in Figure 5A.
  • at time t1, the original datum "70" enters the heap,
  • the datum "8" of node P11 is squeezed out of heap storage unit 1,
  • heap adjustment unit 1 reads the data of node P21 and node P22 from heap storage unit 2 and compares them,
  • and heap adjustment unit 1 writes the data of node P21 into heap storage unit 1 corresponding to node P11 and outputs the original datum "70" to heap adjustment unit 2, as shown in FIG. 5B.
  • at time t2, heap adjustment unit 2 reads the data of node P31 and node P32 from heap storage unit 3 and compares them; heap adjustment unit 2 writes the data of node P31 into heap storage unit 2 corresponding to node P21 and outputs the original datum "70" to heap adjustment unit 3, as shown in FIG. 5C.
  • at time t3, heap adjustment unit 3 reads the data of node P41 and the data of node P42 from heap storage unit 4; at the same time, the original datum "75" enters the heap, and the datum "12" of node P11 is squeezed out of heap storage unit 1.
  • heap adjustment unit 1 reads the data of node P21 and the data of node P22 from heap storage unit 2, as shown in FIG. 5D.
  • at time t4, heap adjustment unit 3 writes the data of node P41 into heap storage unit 3 corresponding to node P31 and outputs the original datum "70" to heap adjustment unit 4, and heap adjustment unit 4 reads the data of node P51 and the data of node P52 from heap storage unit 5.
  • at the same time, heap adjustment unit 1 writes the data of node P22 into heap storage unit 1 corresponding to node P11 and outputs the original datum "75" to heap adjustment unit 2,
  • and heap adjustment unit 2 reads the data of node P31 and the data of node P32 from heap storage unit 3.
  • heap adjustment unit 4 writes the data of node P51 into the heap storage unit corresponding to node P41 and outputs the original datum "70" to heap adjustment unit 5, as shown in FIG. 5E.
  • the heap adjustment unit 5 reads the data of the node P61 and the data of the node P62 from the heap storage unit 6.
  • at time t5, heap adjustment unit 2 writes the data of node P34 into heap storage unit 2 corresponding to node P22 and outputs the original datum "75" to heap adjustment unit 3, and heap adjustment unit 3 reads the data of node P47 and the data of node P48 from heap storage unit 4; at the same time, the original datum "80" enters the heap, as shown in FIG. 5F.
  • in some examples, since an adjustment by a heap adjustment unit takes two cycles, the starting moments of t1 and t2 are separated by at least two cycles, and the starting moments of t2 and t3 are separated by at least two cycles.
  • the parallel heap sorting method of the embodiment of the present disclosure can shorten the sorting time to about 1/3 of the original. The greater the depth of the heap, the greater the number of heap adjustment units working at the same time, that is, the higher the degree of parallelism, the more time can be shortened.
  • FIG. 6 is a schematic diagram of the data flow process when the depth of the heap is 8.
  • d1, d2, etc. represent input raw data
  • t1, t2, etc. represent time
  • adj1, adj2, etc. represent heap adjustment units.
  • during heapsort, the embodiments of the present disclosure merge the two processes of heap construction and heap adjustment into a unified top-down heap adjustment process: the data of two adjacent layers of the heap is adjusted by one heap adjustment unit, the multiple heap adjustment units form an array, and the input data flows through each heap adjustment unit in a pipeline. At different times, multiple heap adjustment units can execute in parallel, and from time t6 onward the maximum degree of parallelism, 4, is reached.
  • for example, at time t7, heap adjustment unit 1, heap adjustment unit 3, heap adjustment unit 5, and heap adjustment unit 7 all work at the same time.
  • since a next-stage heap adjustment unit may modify the data stored in a heap storage unit required by the previous-stage heap adjustment unit, to avoid data read/write conflicts, the heap-entry times of two adjacent original data must be separated by one stage; that is, only when the m-th original datum itself, or the datum displaced from a heap storage unit by the m-th original datum, is being sorted by adj3 can the (m+1)-th original datum be sorted by adj1.
  • each unit in the data processing device of the embodiments of the present disclosure may be implemented on the basis of an FPGA (Field Programmable Gate Array), a PLD (Programmable Logic Device), an ASIC (Application Specific Integrated Circuit) controller, a microcontroller, a microprocessor, or other electronic components.
  • the data processing device realizes parallel heap sorting and improves data processing efficiency.
  • there is no need to initialize the data in the heap storage unit and only the flag bit needs to be initialized, which improves the initialization efficiency.
  • pre-screening processing can be performed, which reduces the number of times that raw data enters the heap, and further improves the efficiency of data processing.
  • multiple rounds of sorting can be performed, supporting both sorting the original data in the data storage device multiple times and writing the original data into the data storage device in batches before sorting it against the same batch of data in the heap storage units.
  • the sorting process is not limited by the size of the heap storage unit and the data storage device, and it has strong versatility.
  • an embodiment of the present disclosure also provides an integrated circuit, which includes the data processing device described in any of the embodiments.
  • the integrated circuit further includes a controller configured to send at least any one of the following instructions to the data processing device: an initialization instruction, used to instruct the plurality of heap storage units to initialize; a heap-entry instruction, used to instruct the plurality of heap adjustment units to obtain original data and to sort the original data against the data stored in the plurality of heap storage units; and a heap-exit instruction, used to instruct the plurality of heap adjustment units to output the data stored in the plurality of heap storage units in a specific order.
  • the initialization instruction, the heap-entry instruction, and the heap-exit instruction may be different instructions.
  • one round of sorting is divided into the three processes of initialization, heap entry, and heap exit, corresponding to three kinds of instructions respectively.
  • there can be multiple heap-entry instructions in a single round of sorting (the original data can be input in multiple batches). This eliminates the data storage device's limitation on the quantity of original data and enables the heap adjustment units and the preprocessing units to run in parallel, which is more flexible in use.
  • the instructions for initialization, heap entry, and heap exit can be sent by the controller of the integrated circuit to the heap control unit in the data processing device and carried out under the control of the heap control unit.
  • an embodiment of the present disclosure also provides an AI (Artificial Intelligence) accelerator, and the AI accelerator includes the integrated circuit described in any of the embodiments.
  • the writing order of the steps does not imply a strict execution order and does not constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible inner logic.


Abstract

Embodiments of the present disclosure provide an apparatus for data processing, an integrated circuit, and an AI accelerator. The apparatus for data processing includes a plurality of heap storage units, each heap storage unit being used to store data of a group of nodes of a heap, the group of nodes including at least some of the nodes in a same layer of the heap; and a plurality of heap adjustment units, each heap adjustment unit being used to access at least two heap storage units so as to sort input original data against the data stored in the at least two heap storage units.

Description

[Title of the invention established by the ISA under Rule 37.2] Data processing apparatus, integrated circuit, and AI accelerator — Technical Field
The present disclosure relates to the field of data processing technology, and in particular to a data processing apparatus, an integrated circuit, and an artificial intelligence (AI) accelerator.
Background
In many algorithms and models, sorting problems often need to be handled, and heapsort is widely used for this purpose. Heapsort is a sorting method designed around the heap data structure.
Summary
The present disclosure provides a data processing apparatus, an integrated circuit, and an AI accelerator.
According to a first aspect of the embodiments of the present disclosure, a data processing apparatus is provided. The apparatus includes: a plurality of heap storage units, each heap storage unit being used to store data of a group of nodes of a heap, the group of nodes including at least some of the nodes in a same layer of the heap; and a plurality of heap adjustment units, each heap adjustment unit being used to access at least two heap storage units so as to sort input original data against the data stored in the at least two heap storage units.
According to a second aspect of the embodiments of the present disclosure, an integrated circuit is provided, the integrated circuit including the data processing apparatus of the first aspect.
According to a third aspect of the embodiments of the present disclosure, an AI accelerator is provided, the AI accelerator including the integrated circuit of the second aspect.
In the embodiments of the present disclosure, the data of the individual nodes of the heap is stored in a plurality of heap storage units whose data can be read and written independently. While one piece of data is being sorted by the plurality of heap adjustment units, the next piece of data can enter the heap, so that sorting can proceed concurrently with heap construction, which improves sorting efficiency.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.
Brief Description of the Drawings
The drawings here are incorporated into and form a part of this specification; they illustrate embodiments consistent with the present disclosure and, together with the specification, serve to explain its technical solutions.
FIG. 1A is a schematic diagram of a heap according to some embodiments.
FIG. 1B is a schematic diagram of a heapsort process according to some embodiments.
FIG. 2 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
FIG. 3A and FIG. 3B are schematic diagrams of data storage layouts according to embodiments of the present disclosure.
FIG. 4 is a schematic diagram of a data processing apparatus according to other embodiments of the present disclosure.
FIG. 5A to FIG. 5F are schematic diagrams of data changes during the heapsort process of an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of a data flow process of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the disclosure. The singular forms "a", "the", and "said" used in the present disclosure and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, such information should not be limited by these terms, which are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "while", or "in response to determining".
In order to help those skilled in the art better understand the technical solutions in the embodiments of the present disclosure, and to make the above objectives, features, and advantages of the embodiments more apparent and easier to understand, the technical solutions in the embodiments of the present disclosure are further described in detail below with reference to the drawings.
In many algorithms and models (for example, neural network models), sorting problems often need to be handled, in particular the top-k (k being a positive integer) sorting problem, that is, selecting the k largest or smallest values from a set of data to be sorted. Heapsort is widely used for this purpose. Heapsort is a sorting method designed around the heap data structure. As shown in FIG. 1A, a heap is an approximately complete binary tree; when the heap is a min-heap, the data of every node is always less than or equal to that of its child nodes, and when the heap is a max-heap, the data of every node is always greater than or equal to that of its child nodes.
In one heapsort scheme, a single complete memory unit may be used to store the whole heap, that is, the data of all nodes of the heap is stored in the same memory unit. Because of read/write conflicts, only the data of one node and its children can be sorted at a time. FIG. 1B shows a heap of 5 nodes whose data is all stored in the same memory unit, mem. Sorting starts from the bottom of the heap: by comparison, the data of node 4 is first swapped with that of node 1, then node 1 with node 0, and then node 4 with node 1, yielding the sorted max-heap shown in the lower-left diagram. The data at the heap top (i.e., the root node) is written out of the memory unit, and the remaining data repeats the above sorting process until the data of every node has been written out. It can be seen that this heapsort scheme has low sorting efficiency.
On this basis, an embodiment of the present disclosure provides a data processing apparatus. As shown in FIG. 2, the apparatus may include a plurality of heap storage units 201 and a plurality of heap adjustment units 202.
Each of the plurality of heap storage units 201 is used to store the data of a group of nodes of a heap, the group of nodes including at least some of the nodes in a same layer of the heap.
Each of the plurality of heap adjustment units 202 is used to access at least two heap storage units so as to sort input original data against the data stored in the at least two heap storage units.
In one alternative heapsort scheme, data enters the heap from the bottom and sorting then starts from the top; heap construction and sorting therefore proceed independently, and no parallel sorting can take place while data is entering the heap. In the embodiments of the present disclosure, the data of the individual nodes of the heap is stored in a plurality of heap storage units 201 whose data can be read and written independently. While one piece of data is being sorted by the plurality of heap adjustment units 202, the next piece of data can enter the heap, so that sorting proceeds concurrently with heap construction, which improves sorting efficiency.
It should be noted that although the last heap adjustment unit, heap adjustment unit n in FIG. 2, is also connected to two heap storage units (heap storage unit n and heap storage unit n+1), no adjustment unit writes data into heap storage unit n+1, so heap adjustment unit n cannot actually read data from heap storage unit n+1. In practice, heap storage unit n+1 may be a virtual storage unit, or a storage unit similar to the other heap storage units.
In addition, although FIG. 2 schematically shows the data flow direction in which the heap adjustment units access the heap storage units during sorting, in practical applications the present disclosure does not restrict heap adjustment unit i to only writing data to heap storage unit i and/or only reading data from heap storage unit i+1.
FIG. 3A is a schematic diagram of a heap with 4 layers of nodes and the storage layout of the data of each node. As shown in FIG. 3A, the i-th heap storage unit may be used to store the data of all nodes in the i-th layer of the heap; for example, the 1st heap storage unit stores the data of node P11 in layer 1, the 2nd heap storage unit stores the data of nodes P21 and P22 in layer 2, and so on.
It should be noted that the embodiment shown in FIG. 3A is only one possible implementation of the present disclosure, and the present disclosure is not limited thereto. In practical applications, the data of all nodes in any one layer of the heap may also be stored across multiple heap storage units. For example, the heap storage unit storing the data of nodes P31 and P32 may differ from the one storing the data of nodes P33 and P34.
In some embodiments, the at least two heap storage units accessed by each heap adjustment unit are used to store the data of nodes in adjacent layers of the heap. Optionally, each heap adjustment unit may access two heap storage units that store the data of some or all of the nodes in two adjacent layers. For example, in the embodiment shown in FIG. 3A, heap adjustment unit 1 may access the 1st and 2nd heap storage units, heap adjustment unit 2 the 2nd and 3rd, heap adjustment unit 3 the 3rd and 4th, and so on. Optionally, in other embodiments, each heap adjustment unit may also access more than two heap storage units in order to sort the data in those units, where that data may belong to some or all of the nodes in two adjacent layers, or in three or more adjacent layers.
In other embodiments, each heap adjustment unit may also sort at least part of the data of any two or more non-adjacent layers of the heap to meet the sorting requirements of different application scenarios, which will not be detailed here.
In some embodiments, at least two of the plurality of heap adjustment units may sort in parallel, thereby improving data processing efficiency. In other embodiments, the plurality of heap adjustment units may also sort the data in the plurality of heap storage units serially.
To avoid data conflicts, the heap storage units accessed by any heap adjustment units that sort in parallel are mutually distinct. For example, in the embodiment shown in FIG. 3A, heap adjustment unit 2 accesses the 2nd and 3rd heap storage units, and heap adjustment unit 3 accesses the 3rd and 4th; since both access the 3rd heap storage unit, units 2 and 3 do not sort in parallel. Heap adjustment unit 1, however, accesses the 1st and 2nd heap storage units, while unit 3 accesses the 3rd and 4th; the heap storage units they access are entirely distinct, i.e., no heap storage unit is shared between them. Therefore, heap adjustment units 1 and 3 can sort in parallel.
As one specific implementation for resolving read/write conflicts, the two heap storage units respectively accessed by two adjacent heap adjustment units share exactly one heap storage unit. For example, heap adjustment unit 1 accesses the 1st and 2nd heap storage units, heap adjustment unit 2 accesses the 2nd and 3rd, and so on. While unit 1 accesses the 2nd heap storage unit, unit 2 can access the 3rd; while unit 2 accesses the 2nd, unit 1 can access the 1st, thereby avoiding read/write conflicts. As another example, unit 1 may access the 1st through 3rd heap storage units and unit 2 the 3rd through 5th, and so on; similarly, while unit 1 accesses the 3rd heap storage unit, unit 2 can access the 4th or 5th.
As another specific implementation for resolving read/write conflicts, any two heap adjustment units that sort in parallel are separated by at least one heap adjustment unit. For example, heap adjustment unit 1 (accessing the 1st and 2nd heap storage units) and heap adjustment unit 3 (accessing the 3rd and 4th) are separated by heap adjustment unit 2 (accessing the 2nd and 3rd), so units 1 and 3 can sort in parallel.
During sorting, one piece of data enters the heap at a time, and the plurality of heap adjustment units may sort the entering data against the data stored in the plurality of heap storage units. As a further implementation for resolving read/write conflicts, the heap-entry times of two adjacent pieces of data are separated by at least the processing time of two heap storage units. For example, as shown in FIG. 6, if data d1 enters the heap at the start of t1, the next piece of data d2 may enter at the start of t3, where t1, t2, t3, ... denote the processing times of the heap storage units.
Besides the above approaches, other means may also be used to resolve read/write conflicts so that multiple heap adjustment units can sort in parallel; details are omitted here. Since multiple heap storage units are used, the read/write process of any one unit does not affect the others. Therefore, multiple heap adjustment units accessing different heap storage units can sort in parallel, which improves sorting efficiency.
During sorting, each of the plurality of heap adjustment units may obtain data and sort the obtained data against the data in at least one of the at least two heap storage units it accesses.
Based on the structure of the heap storage units of the embodiments of the present disclosure, during heap construction, heap adjustment, and heap exit, input data can enter from the heap top and be adjusted in a top-down manner. For ease of understanding, the scheme of the embodiments of the present disclosure is described below by taking as an example the case where each heap storage unit stores the data of all nodes in one layer of the heap, and the heap storage units accessed by each heap adjustment unit store the data of two adjacent layers. Assume that heap adjustment unit i accesses the i-th and (i+1)-th heap storage units, i being a positive integer. Sorting in other cases is similar and is not detailed here. In this embodiment, the data adjustment operation for two adjacent layers of the heap is encapsulated in one heap adjustment unit, and ceil(log2 k) heap adjustment units form a heap adjustment pipeline, where ceil denotes rounding up and k is the total number of ordered data items to be obtained, i.e., the k in the aforementioned top-k sorting problem.
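The pipeline sizing rule above, ceil(log2 k) adjustment units for a top-k task, can be computed directly (a sketch; the function name `pipeline_units` is ours):

```python
import math

def pipeline_units(k):
    # Number of heap adjustment units in the pipeline: one unit per
    # pair of adjacent heap layers, i.e. ceil(log2(k)) per the text.
    return math.ceil(math.log2(k))
```

For k = 8 this gives 3 units, matching a heap deep enough to hold 8 values.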
When sorting, original data d1 is first input to heap adjustment unit 1, which sorts d1 against the previously stored data in at least one of the 1st and 2nd heap storage units and, according to the result, outputs data d1' to heap adjustment unit 2, where d1' may be the original data d1 or one piece of data from the 2nd heap storage unit. Likewise, d1' is input as original data to heap adjustment unit 2, which sorts it against the data in at least one of the 2nd and 3rd heap storage units and outputs d1'' to heap adjustment unit 3 according to the result, and so on.
When the heap is a min-heap and is full, heap adjustment unit 1 first compares the original data d1 with the data of the two children of root node P11 and writes the smallest of them (say, the data of the left child P21) into the heap storage unit corresponding to the root. Then d1 becomes the original data of heap adjustment unit 2, which compares d1 with the data of the two children of node P21 and writes the smallest (say, the data of P21's left child P31) into the heap storage unit corresponding to P21, and so on.
Further, when the heap is a min-heap and is full, heap adjustment unit 1 first compares d1 with the data of the two children of root node P11; if d1 is smaller than both, d1 may be further compared with the data of root P11 itself. If d1 is less than or equal to the root's data, d1 is discarded directly; if d1 is greater, d1 is stored in the 1st heap storage unit and the subsequent heap adjustment units need not be activated. In this case, heap adjustment unit 1 can read the data of the 1st heap storage unit.
When the heap is a max-heap and is full, heap adjustment unit 1 first compares d1 with the data of the two children of root node P11 and writes the largest of them (say, the data of the left child P21) into the heap storage unit corresponding to the root. Then d1 becomes the original data of heap adjustment unit 2, which compares d1 with the data of the two children of node P21 and writes the largest (say, the data of P21's left child P31) into the heap storage unit corresponding to P21, and so on.
In some embodiments, the data of all child nodes of the same node of the heap is stored at the same address of the same heap storage unit. For example, if the data word length is n bits, the data of a node's left child may be stored in the low n bits of the corresponding address and the data of its right child in the high n bits. In this case, the heap storage unit's word width is twice the data word length. As shown in FIG. 3B, the data of node P11 is stored in heap storage unit mem1; the data of P11's two children (i.e., P21 and P22) is stored at the same address of mem2; the data of P21's two children (i.e., P31 and P32) is stored at one address of mem3 (e.g., the first row of mem3), and the data of P22's two children (i.e., P33 and P34) at another address of mem3 (e.g., the second row). By storing the data of a node's children at the same storage address, the data of both children can be read from the same address of the same storage unit in one clock cycle, reducing the number of data reads and improving data processing efficiency.
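The packed sibling layout just described (left child in the low n bits, right child in the high n bits of one double-width word) can be sketched with plain bit operations, assuming n = 16:

```python
N = 16  # assumed data word length in bits

def pack_children(left, right):
    # One memory word holds both children of a node: the left child
    # occupies the low N bits and the right child the high N bits.
    return (right << N) | left

def unpack_children(word):
    # Both children come back from a single read of one address.
    return word & ((1 << N) - 1), word >> N
```

This is only an illustration of the addressing scheme; the hardware would read the double-width word in one cycle.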
In some embodiments, the apparatus may further include a preprocessing unit for pre-screening the original data obtained from a data storage device. Pre-screened data is input into the subsequent heap adjustment units. Pre-screening means filtering out of the original data the data that does not need to enter the heap. Pre-screening reduces the number of heap entries and thus improves data processing efficiency; the larger the input data volume, the more noticeable the benefit, especially in the aforementioned top-k data sorting scenario.
The data storage device may be a memory located outside the apparatus provided by the present disclosure and connected to the data processing apparatus. The present disclosure does not limit the type of external memory: it may be a volatile memory such as RAM (Random Access Memory), SDRAM (Synchronous Dynamic RAM), or DDR (Double Data Rate) SDRAM, or a non-volatile memory such as a hard disk, portable hard disk, or magnetic disk.
Optionally, the preprocessing unit may pre-screen newly obtained original data when the data stored in the heap storage units has reached a preset quantity. Optionally, when the data stored in the heap storage units has not reached the preset quantity, the preprocessing unit may output the original data directly to the plurality of heap adjustment units. The preset quantity may equal the total number of data items the heap storage units can store; that is, pre-screening of newly obtained original data begins only once the heap storage units are full. In some embodiments, the number of enabled heap storage units may be determined from the amount of original data, and pre-screening of newly obtained original data begins only once the enabled units are full. For example, when the amount of original data is less than the total capacity of all heap storage units, only some units are enabled so that the enabled capacity equals the amount of original data. As another example, when the amount of original data is greater than or equal to the total capacity of all heap storage units, all heap storage units may be enabled.
In some embodiments, the preprocessing unit may pre-screen the original data by comparing it with the data of the heap's root node, so as to determine in advance whether the original data needs to enter the heap.
For example, when the heap is a min-heap, the root's data is less than or equal to that of every other node. If a piece of original data is smaller than the root's data, it is necessarily also smaller than the data of every other node, so it need not be sorted by the heap adjustment units; only original data greater than the root's data needs to be sorted. Therefore, if the obtained original data is less than or equal to the root's data, it is determined that it need not enter the heap; otherwise, it must. Similarly, when the heap is a max-heap, original data greater than or equal to the root's data need not enter the heap; otherwise, it must.
In the application scenario of determining the top k largest values of the original data, using a min-heap effectively improves data processing efficiency. Similarly, in the scenario of determining the top k smallest values, using a max-heap effectively improves data processing efficiency.
In some embodiments, there may be a plurality of preprocessing units, which pre-screen the obtained original data in parallel. Pre-screening means that part of the original data need not enter the heap; in the top-k scenario, especially when the amount of original data greatly exceeds k, a considerable portion of the data need not enter. Therefore, pre-screening in parallel with multiple preprocessing units effectively improves pre-screening efficiency and prevents the heap adjustment units from idling for long periods.
Optionally, when it is determined that original data needs to enter the heap, the preprocessing unit may transmit it to a first cache unit or to a heap adjustment unit. Original data that needs to enter the heap may first be transmitted to the first cache unit and then output in turn to the plurality of heap adjustment units for sorting; alternatively, the preprocessing unit may output such data directly and sequentially to the plurality of heap adjustment units.
Optionally, when it is determined that the original data need not enter the heap, the preprocessing unit may delete it. Optionally, the preprocessing unit may also return data that need not enter the heap to the data storage device, and the heap adjustment units return original data squeezed out during sorting to the data storage device, thereby removing the heap storage units' limit on the amount of ordered output data and improving the apparatus's versatility. Deleting data that need not enter the heap saves storage space; returning data not selected in this round of sorting to the data storage device allows it to be reused in subsequent processing. For example, the plurality of heap adjustment units may sort the data returned to the data storage device again once all data in the heap storage units has been sorted. Once the hardware parameters of the data processing apparatus are fixed, the amount of ordered data output in one round of sorting is limited by the heap capacity, for example by the number of heap layers, the number of heap adjustment units, and the size of the heap storage units, so a sufficient quantity of ordered data may not be obtainable in one round. The apparatus provided by the embodiments of the present disclosure supports writing original data not selected during sorting (such as data that never entered the heap and data squeezed out after entering) back to the data storage device so that multiple rounds of sorting can be performed, which improves the apparatus's versatility.
In some embodiments, a first round of sorting may be performed on the data entering the heap, followed by a next round on the data not selected in the first round. Further, the second round may proceed in the same way as the first, including pre-screening again. In this way, multiple rounds of sorting can be performed until some stop condition is met; the stop condition may be that all original data to be sorted has been sorted, or that the number of sorted data items has reached the required quantity. Multi-round sorting allows a data processing apparatus with a small heap capacity to sort a large amount of original data, avoiding sort failures caused by insufficient heap capacity and widening the apparatus's applicability. The second and subsequent rounds proceed as the first and are not detailed here.
In some embodiments, when the capacity of the data storage device is limited, the original data may also be written into the data storage device in batches, each batch being pre-screened and sorted separately, so that a large amount of data can be sorted using a data storage device of smaller capacity, avoiding sort failures caused by insufficient storage capacity.
In some embodiments, the data processing apparatus further includes a second cache unit for caching the original data obtained from the data storage device; the second cache unit sends the cached original data to the plurality of heap adjustment units, which sort the original data obtained from the second cache unit against the data in the plurality of heap storage units. Optionally, the second cache unit may obtain one or more pieces of original data from the data storage device at a time and cache them. Optionally, the first cache unit may obtain one or more pieces of original data from the preprocessing unit at a time and cache them. The first and second cache units may be FIFO (First In First Out) cache units.
FIG. 4 is a schematic diagram of a data processing apparatus according to other embodiments of the present disclosure. In this example, assume the data processing apparatus includes n+1 heap storage units 201, n heap adjustment units 202, one first cache unit 203, and four preprocessing units 204, where each heap storage unit stores the data of one layer of the heap, and heap adjustment unit i accesses the i-th and (i+1)-th heap storage units. For a top-k data sorting task, the data path is as follows.
(1) The original data passes in parallel (assuming a parallelism of 4) through the four preprocessing units 204. If the heap does not yet hold k data items, the original data enters the first cache unit 203 directly; once the heap holds k items, each preprocessing unit compares each input original datum with the data at the current heap top (i.e., the heap's root node). When the heap is a min-heap, original data greater than the heap top is output to the first cache unit 203, and data less than or equal to the heap top is written back through the first output terminal to the external data storage device (not shown in the figure) so that multiple rounds of sorting can be performed.
(2) Heap adjustment unit 1 fetches data from the first cache unit 203. The heap adjustment units perform heap adjustment in parallel, adjusting the data in the heap into a min-heap; data squeezed out of the heap can be written back to the data storage device through the second output terminal for multi-round sorting. The above process is repeated until all original data has entered the heap.
The instructions executed by the apparatus in this example are as follows.
(1) Execute an initialization instruction to initialize the flag bits in the n+1 heap storage units 201.
(2) Execute heap-entry instructions to select k original data items to form a min-heap through the parallel pre-screening process and the parallel heap adjustment process; multiple heap-entry instructions may be used.
(3) Execute a heap-exit instruction: through parallel heap adjustment, data of the largest value is input into the heap, and the k valid data items in the heap are replaced out in turn through the second output terminal; the k replaced items are the required top-k data.
In some embodiments, each heap storage unit may include flag bits indicating whether the data at corresponding positions in the unit is valid. As shown in FIG. 3B, heap storage unit mem1 includes the flag bit of node P11's data, shown as the black square of flg1 in the figure; mem2 includes the flag bits of the data of nodes P21 and P22, shown as flg2 in the figure, with P21's flag represented by a black square and P22's by a gray square, and so on. When a storage unit can store N data items, it may include N flag bits. Valid data in a heap storage unit means the data needs to be sorted; invalid data means it does not. In some embodiments, the flag bit takes a first value when the data in the heap storage unit is valid and a second value when it is invalid; for example, the first value may be "1" and the second value "0".
A common heapsort scheme initializes the data in each heap storage unit, and the initialization time grows as the heap's depth increases. By using flag bits, the embodiments of the present disclosure can initialize only the flag bits of a heap storage unit before writing data into it, so the data itself need not be initialized. Since the bit length of a flag bit is smaller than that of the original data (for example, a flag bit may be 1 bit), in some examples only one clock cycle is needed to initialize the flag bits of all heap storage units; initializing the flag bits takes less time than initializing the data in the heap storage units, which improves data processing efficiency. Each time a valid datum is written into a heap storage unit, its flag bit is updated, that is, set from invalid to valid, so that whether the data in the unit is valid can be determined from its flag bit.
When flag bits are provided, each of the plurality of heap adjustment units is further used to: when the flag bits of a first heap storage unit indicate that the data at the corresponding positions is all valid, sort the original data input to the heap adjustment unit against the valid data; and when the flag bits of the first heap storage unit indicate that the data at the corresponding positions includes any invalid data, write the original data input to the heap adjustment unit into the position of the invalid data. The first heap storage unit is, of the at least two heap storage units accessed by the heap adjustment unit, the one closer to the root node.
In some embodiments, when the heap storage unit accessed by the heap adjustment unit contains multiple invalid data items, the data input to the heap adjustment unit is written into the leftmost invalid position, in left-to-right order.
In other words, only valid data participates in sorting; invalid data is directly replaced by the original data input to the corresponding heap adjustment unit. In this way, the heap-entry process of the original data is realized while invalid data is prevented from affecting the sorting of valid data.
After the data in the plurality of heap storage units has been sorted, the data in the heap storage units must exit the heap. The heap-exit process in the embodiments of the present disclosure is handled similarly to the heap-entry process: one datum is input to the plurality of heap adjustment units, which then sort the input datum against the data already stored in the heap storage units.
Specifically, during heap exit, each of the plurality of heap adjustment units may access at least two heap storage units and sort obtained designated data against the data stored in those units, so that the stored data exits the heap. Heap exit is similar to the sorting process and is also executed in parallel. During heap exit, one designated datum may be input at a time. In a min-heap application scenario, the value of the designated datum may be greater than every datum stored in the heap storage units; for example, it may be a datum of value +∞, which may be the maximum value in the data format of the original data (for a 16-bit floating-point number, 0x7C00 can represent +∞). In a max-heap application scenario, the value of the designated datum may be smaller than every stored datum, for example a datum of value −∞, which may be the minimum value in the data format of the original data (for a 16-bit floating-point number, 0xFC00 can represent −∞). After the designated datum is input, the data of the heap's root node can exit the heap, and the plurality of heap adjustment units then sort the designated datum and the other data in the heap storage units in parallel.
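In software terms, draining a min-heap by repeatedly inserting +∞ at the top behaves like `heapq.heapreplace` with infinity; the sketch below illustrates the idea and is not the hardware flow (the function name `drain` is ours):

```python
import heapq

def drain(heap, k):
    # Heap-exit sketch for a min-heap: each +inf fed in from the top
    # displaces the current root, so k insertions emit the k stored
    # values in ascending order, reusing the same top-down adjustment
    # flow as heap entry.
    out = []
    for _ in range(k):
        out.append(heapq.heapreplace(heap, float('inf')))
    return out
```

Because entry, sorting, and exit all use the same top-down replace-at-the-root operation, the three phases share one datapath.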
In this way, the three processes of heap entry, data sorting, and heap exit are realized uniformly through the same flow, which facilitates parallel data processing.
The above initialization, heap-entry, and heap-exit processes may each be controlled by different instructions. In a conventional heapsort scheme, the whole sorting process is completed by a single instruction; once the parameters are fixed, the versatility of the data processing apparatus is poor. In the embodiments of the present disclosure, one round of sorting is divided into three processes, initialization, heap entry, and heap exit, corresponding to three kinds of instructions, and one round may contain multiple heap-entry instructions (the original data can be input in multiple batches). This both removes the data storage device's limit on the amount of original data and allows the heap adjustment units and the preprocessing units to run in parallel, making the apparatus more flexible to use. The instructions for initialization, heap entry, and heap exit may be sent by a higher-level controller to a heap control unit in the data processing apparatus and carried out under the control of that unit.
In some embodiments, the apparatus further includes a heap control unit used to perform at least any one of the following operations: upon receiving an initialization instruction, controlling the plurality of heap storage units to initialize within the same clock cycle; upon receiving a heap-entry instruction, reading original data from the data storage device and transmitting the read original data to the plurality of heap adjustment units so that they sort the original data against the data in the plurality of heap storage units; and upon receiving a heap-exit instruction, controlling the plurality of heap adjustment units to output the data in the plurality of heap storage units from the heap top in a specific order.
Specifically, upon receiving an initialization instruction, the heap control unit may send an initialization signal to the heap storage units to initialize their flag bits. Upon receiving a heap-entry instruction, the heap control unit may read original data from the data storage device and output it to the preprocessing unit, which determines whether pre-screening is needed. If so, the preprocessing unit directly deletes original data that need not enter the heap or returns it to the data storage device, and outputs data that must enter the heap to the first cache unit; if pre-screening is not needed, the original data is output directly to the first cache unit. A heap adjustment unit receives the original data from the first cache unit and adjusts the data in the heap storage units level by level according to the magnitude of the original data, until all original data to be sorted has been processed.
Upon receiving a heap-exit instruction, the heap control unit outputs designated data to the heap adjustment units, which receive the designated data and adjust the data in the heap storage units level by level. After each designated datum enters the heap, one datum in the heap storage units (i.e., the data of the heap's root node) is squeezed out of the heap, and the heap control unit outputs the squeezed-out data in turn to the data output port of the data processing apparatus.
FIGS. 5A to 5F are schematic diagrams of node data changes during the sorting process of an embodiment of the present disclosure. This embodiment takes a min-heap as an example; the sorting process of a max-heap is similar and will not be repeated here. Assume the heap depth is 6, i.e., the heap includes 6 layers of nodes; the data of the nodes in each layer is stored in an independent heap storage unit, and the data of all children of the same node is stored at the same address of the same heap storage unit. The heap storage unit corresponding to the i-th layer is heap storage unit i, the heap adjustment unit accessing heap storage units i and i+1 is heap adjustment unit i, and each node of the i-th layer is denoted Pij, 1 ≤ j ≤ 2^(i−1), i being a positive integer.
Assume the heap at the initial time t0 is as shown in FIG. 5A. At time t1, the original datum "70" enters the heap; the datum "8" of node P11 is squeezed out of heap storage unit 1; heap adjustment unit 1 reads the data of nodes P21 and P22 from heap storage unit 2 and compares them, writes the data of node P21 into heap storage unit 1 corresponding to node P11, and outputs the original datum "70" to heap adjustment unit 2, as shown in FIG. 5B.
At time t2, heap adjustment unit 2 reads the data of nodes P31 and P32 from heap storage unit 3 and compares them, writes the data of node P31 into heap storage unit 2 corresponding to node P21, and outputs the original datum "70" to heap adjustment unit 3, as shown in FIG. 5C.
At time t3, heap adjustment unit 3 reads the data of nodes P41 and P42 from heap storage unit 4; at the same time, the original datum "75" enters the heap, the datum "12" of node P11 is squeezed out of heap storage unit 1, and heap adjustment unit 1 reads the data of nodes P21 and P22 from heap storage unit 2, as shown in FIG. 5D.
At time t4, heap adjustment unit 3 writes the data of node P41 into heap storage unit 3 corresponding to node P31 and outputs the original datum "70" to heap adjustment unit 4, which reads the data of nodes P51 and P52 from heap storage unit 5; at the same time, heap adjustment unit 1 writes the data of node P22 into heap storage unit 1 corresponding to node P11 and outputs the original datum "75" to heap adjustment unit 2, which reads the data of nodes P31 and P32 from heap storage unit 3; heap adjustment unit 4 writes the data of node P51 into the heap storage unit corresponding to node P41 and outputs the original datum "70" to heap adjustment unit 5, as shown in FIG. 5E.
At time t5, heap adjustment unit 5 reads the data of nodes P61 and P62 from heap storage unit 6; at the same time, heap adjustment unit 2 writes the data of node P34 into heap storage unit 2 corresponding to node P22 and outputs the original datum "75" to heap adjustment unit 3, which reads the data of nodes P47 and P48 from heap storage unit 4; at the same time, the original datum "80" enters the heap, as shown in FIG. 5F.
In some examples, since an adjustment by a heap adjustment unit takes two cycles, the starting moments of t1 and t2 are separated by at least two cycles, as are the starting moments of t2 and t3.
It can be seen that from time t3 onward, 2 heap adjustment units work simultaneously; likewise, 3 work simultaneously from time t5, 3 from time t7, and so on. Compared with a non-parallel sorting scheme in which only one heap adjustment unit runs at any moment, the parallel heapsort of the embodiments of the present disclosure can shorten the sorting time to about 1/3 of the original. The greater the depth of the heap, the more heap adjustment units work simultaneously, that is, the higher the degree of parallelism and the more time is saved.
FIG. 6 is a schematic diagram of the data flow process when the heap depth is 8, where d1, d2, etc. denote input original data, t1, t2, etc. denote times, and adj1, adj2, etc. denote heap adjustment units. It can be seen that during heapsort the embodiments of the present disclosure merge the two processes of heap construction and heap adjustment into a unified top-down heap adjustment process: the data of two adjacent layers of the heap is adjusted by one heap adjustment unit, multiple heap adjustment units form an array, and the input data flows through each heap adjustment unit in a pipeline, with multiple heap adjustment units able to execute in parallel at different times. From time t6 onward, the maximum degree of parallelism, 4, is reached; for example, at time t7, heap adjustment units 1, 3, 5, and 7 all work simultaneously. It should be noted that, since a next-stage heap adjustment unit may modify the data stored in a heap storage unit required by the previous stage, to avoid read/write conflicts the heap-entry times of two adjacent original data must be separated by one stage; that is, only when the m-th original datum itself, or the datum it displaced from a heap storage unit, is being sorted by adj3 can the (m+1)-th original datum be sorted by adj1.
Each unit in the data processing apparatus of the embodiments of the present disclosure may be implemented on the basis of an FPGA (Field Programmable Gate Array), a PLD (Programmable Logic Device), an ASIC (Application Specific Integrated Circuit) controller, a microcontroller, a microprocessor, or other electronic components.
The data processing apparatus provided by the present disclosure realizes parallel heapsort and improves data processing efficiency. In some embodiments, the data in the heap storage units need not be initialized and only the flag bits need be initialized, which improves initialization efficiency. In some embodiments, pre-screening can be performed, which reduces the number of heap entries and further improves data processing efficiency. In some embodiments, multiple rounds of sorting can be performed, supporting both sorting the original data in the data storage device multiple times and writing the original data into the data storage device in batches before sorting it against the same batch of data in the heap storage units; the sorting process is not limited by the sizes of the heap storage units or the data storage device, giving strong versatility.
Correspondingly, an embodiment of the present disclosure further provides an integrated circuit, which includes the data processing device of any of the above embodiments.
In some embodiments, the integrated circuit further includes a controller configured to send at least any one of the following instructions to the data processing device: an initialization instruction, instructing the plurality of heap storage units to initialize; a push instruction, instructing the plurality of heap adjustment units to acquire raw data and to sort the raw data together with the data stored in the plurality of heap storage units; and a pop instruction, instructing the plurality of heap adjustment units to output the data stored in the plurality of heap storage units in a specific order.
The initialization instruction, the push instruction and the pop instruction may be different instructions. In the embodiments of the present disclosure, one sorting run is divided into three processes, namely initialization, push and pop, corresponding to the three kinds of instructions. A single sorting run may contain multiple push instructions (the raw data may be input in several batches), which both removes the limit that the data storage device places on the amount of raw data and allows the heap adjustment units and the pre-processing unit to run in parallel, making usage quite flexible. In one example, the instructions for the initialization, push and pop processes may be sent by the controller of the integrated circuit to the heap control unit in the data processing device and carried out under the control of the heap control unit.
Correspondingly, an embodiment of the present disclosure further provides an AI (Artificial Intelligence) accelerator, which includes the integrated circuit of any of the above embodiments.
Those skilled in the art can understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of the steps should be determined by their functions and possible internal logic.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure set forth herein. The present disclosure is intended to cover any variations, uses or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present disclosure being indicated by the following claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.
The above are merely preferred embodiments of the present disclosure and are not intended to limit it; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present disclosure shall fall within its scope of protection.
The above description of the embodiments tends to emphasize the differences between them; for their identical or similar aspects, the embodiments may be referred to one another, and for brevity they are not repeated here.

Claims (21)

  1. A device for data processing, characterized in that the device comprises:
    a plurality of heap storage units, each heap storage unit being configured to store data of a group of nodes of a heap, the group of nodes including at least some of the nodes in a same level of the heap; and
    a plurality of heap adjustment units, each heap adjustment unit being configured to access at least two heap storage units so as to sort input raw data together with the data stored in the at least two heap storage units.
  2. The device according to claim 1, characterized in that
    the at least two heap storage units accessed by each heap adjustment unit are configured to store data of adjacent levels of nodes of the heap; and/or
    each of the plurality of heap adjustment units is configured to acquire the input raw data and to sort the acquired raw data together with the data in at least one of the at least two heap storage units it accesses.
  3. The device according to claim 1 or 2, characterized in that
    the two heap storage units respectively accessed by two adjacent heap adjustment units of the plurality of heap adjustment units include one heap storage unit in common; and/or
    at least two of the plurality of heap adjustment units perform sorting in parallel, the heap storage units accessed by the at least two heap adjustment units being different from one another.
  4. The device according to any one of claims 1 to 3, characterized in that the push times of two consecutive data are separated by the processing cycles of two heap storage units.
  5. The device according to any one of claims 1 to 4, characterized in that the data of the child nodes of a same node of the heap is stored at a same address in a same heap storage unit.
  6. The device according to any one of claims 1 to 5, characterized in that the device further comprises:
    a pre-processing unit configured to perform pre-screening on raw data acquired from a data storage device, the pre-screened data being input to the plurality of heap adjustment units.
  7. The device according to claim 6, characterized in that the pre-processing unit is configured to perform the pre-screening on newly acquired raw data when the data stored in the heap storage units reaches a preset amount.
  8. The device according to claim 6 or 7, characterized in that the pre-processing unit is configured to perform the pre-screening on the raw data by comparing the raw data with the data of the root node of the heap, so as to determine in advance whether the raw data needs to enter the heap.
  9. The device according to any one of claims 6 to 8, characterized in that there are a plurality of the pre-processing units, the plurality of pre-processing units being configured to perform the pre-screening on the acquired raw data in parallel.
  10. The device according to any one of claims 6 to 9, characterized in that the pre-processing unit is configured to:
    when it is determined that the raw data needs to enter the heap, transmit the raw data to a cache unit or to the plurality of heap adjustment units; and
    when it is determined that the raw data does not need to enter the heap, delete the raw data or return it to the data storage device.
  11. The device according to claim 10, characterized in that the plurality of heap adjustment units are further configured to:
    return raw data pushed out during the sorting process to the data storage device; and
    when the data in the plurality of heap storage units has all been sorted, sort the raw data returned to the data storage device once more.
  12. The device according to any one of claims 6 to 11, characterized in that the device further comprises:
    a first cache unit configured to cache the pre-screened raw data acquired from the pre-processing unit,
    the plurality of heap adjustment units being configured to sort the raw data acquired from the first cache unit together with the data in the plurality of heap storage units.
  13. The device according to any one of claims 1 to 5, characterized in that the device further comprises:
    a second cache unit configured to cache raw data acquired from a data storage device,
    the plurality of heap adjustment units being configured to sort the raw data acquired from the second cache unit together with the data in the plurality of heap storage units.
  14. The device according to any one of claims 1 to 13, characterized in that each of the heap storage units includes flag bits, wherein a flag bit is used to indicate whether the data at the corresponding position in the heap storage unit is valid.
  15. The device according to claim 14, characterized in that the heap storage unit is further configured to:
    initialize each flag bit in the heap storage unit; and/or
    update a flag bit when it is determined that valid data has been written to the position corresponding to that flag bit.
  16. The device according to claim 14 or 15, characterized in that each of the plurality of heap adjustment units is further configured to:
    when the flag bits of a first heap storage unit accessed by the heap adjustment unit indicate that the data at the corresponding positions is all valid, sort the raw data input to the heap adjustment unit together with the valid data, wherein the first heap storage unit is, of the at least two heap storage units accessed by the heap adjustment unit, the one closer to the root node; and
    when the flag bits of the first heap storage unit indicate that the data at the corresponding positions includes any invalid data, write the raw data input to the heap adjustment unit to the position corresponding to the invalid data.
  17. The device according to any one of claims 1 to 16, characterized in that each of the plurality of heap adjustment units is configured to:
    read the data stored in at least one of at least two heap storage units;
    sort the raw data input to the heap adjustment unit together with the read data; and
    according to the sorting requirement, write the larger or smaller data of the sorting result into another one of the at least two heap storage units, wherein the other heap storage unit and the at least one heap storage unit are not the same heap storage unit.
  18. The device according to any one of claims 1 to 17, characterized in that the device further comprises a heap control unit configured to perform at least any one of the following operations:
    upon receiving an initialization instruction, controlling the plurality of heap storage units to initialize within a same clock cycle;
    upon receiving a push instruction, reading raw data from a data storage device and transmitting the read raw data to the plurality of heap adjustment units, so that the plurality of heap adjustment units sort the raw data together with the data in the plurality of heap storage units; and
    upon receiving a pop instruction, controlling the plurality of heap adjustment units to output the data in the plurality of heap storage units from the top of the heap in a specific order.
  19. An integrated circuit, characterized in that the integrated circuit comprises the data processing device according to any one of claims 1 to 18.
  20. The integrated circuit according to claim 19, characterized in that the integrated circuit further comprises a controller configured to send at least any one of the following instructions to the data processing device:
    an initialization instruction for instructing the plurality of heap storage units to initialize;
    a push instruction for instructing the plurality of heap adjustment units to acquire raw data and to sort the raw data together with the data stored in the plurality of heap storage units; and
    a pop instruction for instructing the plurality of heap adjustment units to output the data stored in the plurality of heap storage units in a specific order.
  21. An artificial intelligence (AI) accelerator, characterized in that the AI accelerator comprises the integrated circuit according to claim 19 or 20.
PCT/CN2020/136960 2020-03-31 2020-12-16 Data processing device, integrated circuit and AI accelerator WO2021196745A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021557465A JP2022531075A (ja) 2020-03-31 2020-12-16 データ処理
KR1020217031349A KR20210129715A (ko) 2020-03-31 2020-12-16 데이터 처리

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010244150.2 2020-03-31
CN202010244150.2A CN113467702A (zh) 2020-03-31 2020-03-31 数据处理装置、集成电路和ai加速器

Publications (1)

Publication Number Publication Date
WO2021196745A1 true WO2021196745A1 (zh) 2021-10-07

Family

ID=77865417

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136960 WO2021196745A1 (zh) 2020-03-31 2020-12-16 数据处理装置、集成电路和ai加速器

Country Status (5)

Country Link
JP (1) JP2022531075A (zh)
KR (1) KR20210129715A (zh)
CN (1) CN113467702A (zh)
TW (1) TWI773051B (zh)
WO (1) WO2021196745A1 (zh)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060004897A1 (en) * 2000-11-28 2006-01-05 Paul Nadj Data structure and method for sorting using heap-supernodes
US20060095444A1 (en) * 2000-11-28 2006-05-04 Paul Nadj Data structure and method for pipeline heap-sorting
US20140181126A1 (en) * 2001-08-16 2014-06-26 Altera Corporation System and Method for Scheduling and Arbitrating Events in Computing and Networking
CN107402741A (zh) * 2017-08-04 2017-11-28 电子科技大学 一种适宜于fpga实现的排序方法
CN108319454A (zh) * 2018-03-27 2018-07-24 武汉中元华电电力设备有限公司 一种基于硬件fpga快速实现最优二叉树的方法
CN109375989A (zh) * 2018-09-10 2019-02-22 中山大学 一种并行后缀排序方法及系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6116327A (ja) * 1984-07-03 1986-01-24 Agency Of Ind Science & Technol デ−タ処理装置
JPS6154536A (ja) * 1984-08-24 1986-03-18 Hitachi Ltd デ−タ整順化回路
JPS61150055A (ja) * 1984-12-25 1986-07-08 Panafacom Ltd Dmaデ−タ転送方式
US10268410B2 (en) * 2014-10-20 2019-04-23 Netapp, Inc. Efficient modification of storage system metadata
US10761979B2 (en) * 2016-07-01 2020-09-01 Intel Corporation Bit check processors, methods, systems, and instructions to check a bit with an indicated check bit value
CN110825440B (zh) * 2018-08-10 2023-04-14 昆仑芯(北京)科技有限公司 指令执行方法和装置


Also Published As

Publication number Publication date
CN113467702A (zh) 2021-10-01
TWI773051B (zh) 2022-08-01
TW202138994A (zh) 2021-10-16
KR20210129715A (ko) 2021-10-28
JP2022531075A (ja) 2022-07-06


Legal Events

Date Code Title Description
ENP Entry into the national phase: Ref document number: 2021557465; Country of ref document: JP; Kind code of ref document: A
ENP Entry into the national phase: Ref document number: 20217031349; Country of ref document: KR; Kind code of ref document: A
121 Ep: the epo has been informed by wipo that ep was designated in this application: Ref document number: 20929152; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase: Ref country code: DE
122 Ep: pct application non-entry in european phase: Ref document number: 20929152; Country of ref document: EP; Kind code of ref document: A1
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established: Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05.04.2023)