WO2022179074A1 - Data processing apparatus, method, computer device, and storage medium - Google Patents

Data processing apparatus, method, computer device, and storage medium

Info

Publication number
WO2022179074A1
WO2022179074A1 PCT/CN2021/115780 CN2021115780W WO2022179074A1 WO 2022179074 A1 WO2022179074 A1 WO 2022179074A1 CN 2021115780 W CN2021115780 W CN 2021115780W WO 2022179074 A1 WO2022179074 A1 WO 2022179074A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
storage unit
data processing
control signal
read
Prior art date
Application number
PCT/CN2021/115780
Inventor
周军
常亮
周亮
赵能
Original Assignee
成都商汤科技有限公司
电子科技大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 成都商汤科技有限公司, 电子科技大学
Publication of WO2022179074A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a data processing apparatus, method, computer device, and storage medium.
  • AI accelerator hardware architecture is used to complete the realization of image processing algorithms.
  • the commonly used AI accelerator hardware architecture mainly includes storage unit, computing unit, control unit, etc.
  • the core computing unit is generally composed of a two-dimensional processing engine (Processing Engine, PE) array and a register array (local register file);
  • the storage unit can consist of caches at different levels of the hierarchy, including storage spaces such as double data rate synchronous dynamic random access memory (Double Data Rate, DDR), static random access memory (Static Random Access Memory, SRAM), registers, and caches.
  • the input data stream is buffered and transferred in different storage spaces, and enters the register array corresponding to the PE array.
  • Embodiments of the present disclosure provide at least a data processing apparatus, method, computer device, and storage medium.
  • an embodiment of the present disclosure provides a data processing apparatus, including a plurality of first storage units and a computing unit; the computing unit includes a processing engine (PE) array; the plurality of first storage units are respectively connected to the PEs in the PE array; the PEs are used to perform read/write access to the connected first storage units; the plurality of first storage units are used to store the data transmitted by the connected PEs during read/write access.
  • the PEs in the PE array are divided into multiple PE groups, and the multiple first storage units are used to connect to different PE groups in the PE array respectively.
  • each of the first storage units is connected to the PEs in one PE group.
  • a PE group in the multiple PE groups includes multiple PEs in the PE array that have a physical connection relationship, and the multiple PEs are located in the same row, in the same half row, or in the same block in the hardware layout.
  • the number of data transmission channels can be increased, so that more data can be transmitted when the PEs perform read/write access to the first storage units, which improves the efficiency of data transmission; at the same time, the flexibility of the data processing apparatus is also increased, so that it can adapt to different data processing requirements.
  • the PE is used for at least one of the following: in the first processing cycle, performing read access to the connected first storage unit to obtain the first data corresponding to the PE; in the second processing cycle, performing write access to the connected first storage unit and storing the second data generated by the PE in the connected first storage unit.
  • the PEs are used for at least one of the following: different PEs in each PE group perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle; in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; wherein one processing cycle includes at least one clock cycle.
  • PEs with the same relative position perform read/write access to the connected first storage unit in the same clock cycle.
  • multiple PEs can synchronously perform read/write access to different first storage units, thereby improving the efficiency of data transmission.
  • a control unit is further included; the control unit is configured to generate a first control signal based on a data processing instruction, and transmit the first control signal to the PE; the PE is configured to, in response to the first control signal, read the first data to be processed by the PE from the first storage unit connected to the PE.
  • the control unit is further configured to generate a second control signal based on the data processing instruction, and transmit the second control signal to the PE; the PE is configured to, in response to the second control signal, write the second data generated by the PE into the first storage unit connected to the PE.
  • a data scheduler is further included; the control unit is further configured to generate a third control signal based on the data processing instruction, and transmit the third control signal to the data scheduler; the data scheduler is configured to perform write access to the first storage unit based on the third control signal.
  • the apparatus further includes a second storage unit; the data scheduler is configured to read the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, store the to-be-processed data in the corresponding first storage unit; wherein the to-be-processed data includes the first data that each PE needs to read from its connected first storage unit.
  • the control unit is further configured to generate a fourth control signal based on the data processing instruction, and transmit the fourth control signal to the data scheduler; the data scheduler is further configured to read result data from the plurality of first storage units based on the fourth control signal, and store the result data in the second storage unit; wherein the result data includes the second data generated by the PEs and stored in the first storage units to which they are connected.
  • an embodiment of the present disclosure further provides a data processing method, which is applied to a data processing apparatus, where the data processing apparatus includes a plurality of first storage units and a computing unit; the computing unit includes a PE array; the plurality of first storage units are respectively connected to the PEs in the PE array; the data processing method includes: the PEs perform read/write access to the connected first storage units; the plurality of first storage units store the data transmitted by the connected PEs during read/write access.
  • the PEs in the PE array are divided into multiple PE groups, and the multiple first storage units are used to connect to different PE groups in the PE array respectively.
  • each first storage unit is connected to the PEs in one PE group.
  • a PE group in the multiple PE groups includes multiple PEs in the PE array that have a physical connection relationship, and the multiple PEs are located in the same row, in the same half row, or in the same block in the hardware layout.
  • the PE performs read/write access to the connected first storage unit, including at least one of the following: in the first processing cycle, the PE performs read access to the connected first storage unit to obtain the first data corresponding to the PE; in the second processing cycle, the PE performs write access to the connected first storage unit and stores the second data generated by the PE in the connected first storage unit.
  • the PE performs read/write access to the connected first storage unit, including at least one of the following: different PEs in each PE group perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle; in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; wherein one processing cycle includes at least one clock cycle.
  • PEs with the same relative position perform read/write access to the connected first storage unit in the same clock cycle.
  • the data processing apparatus further includes a control unit; the data processing method further includes: the control unit generates a first control signal based on a data processing instruction, and transmits the first control signal to the PE; in response to the first control signal, the PE reads the first data to be processed by the PE from the first storage unit connected to the PE.
  • the method further includes: the control unit generates a second control signal based on the data processing instruction, and transmits the second control signal to the PE; in response to the second control signal, the PE writes the second data generated by the PE into the first storage unit connected to the PE.
  • the data processing apparatus further includes a data scheduler; the data processing method further includes: the control unit generates a third control signal based on the data processing instruction, and transmits the third control signal to the data scheduler; the data scheduler performs write access to the first storage unit based on the third control signal.
  • the data processing apparatus further includes a second storage unit; the data scheduler reads the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores the to-be-processed data in the corresponding first storage unit; wherein the to-be-processed data includes the first data that the PE needs to read from its connected first storage unit.
  • the control unit generates a fourth control signal based on the data processing instruction, and transmits the fourth control signal to the data scheduler; based on the fourth control signal, the data scheduler reads result data from the plurality of first storage units and stores the result data in the second storage unit; wherein the result data includes the second data generated by the PE and stored in the first storage unit to which it is connected.
  • an optional implementation manner of the present disclosure further provides a computer device, including: an instruction memory and the data processing apparatus provided in the first aspect of the present disclosure.
  • an optional implementation manner of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run, the steps of the method in the above-mentioned second aspect, or in any possible implementation of the second aspect, are executed.
  • FIG. 1 shows a schematic diagram of a data processing apparatus provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of a PE array provided by an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of an internal structure of a PE provided by an embodiment of the present disclosure
  • FIG. 4a shows a schematic diagram of a connection manner of a first storage unit and a PE array provided by an embodiment of the present disclosure
  • FIG. 4b shows a schematic diagram of another connection manner of the first storage unit and the PE array provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a data processing apparatus provided by an embodiment of the present disclosure when performing data processing
  • FIG. 6 shows a flowchart of a data processing method provided by an embodiment of the present disclosure.
  • after the data is processed, result data is generated, and the generated result data needs to be stored in the external memory; at this time, the result data generated by different PEs in the PE array also needs to be transmitted to the external memory one by one over the bus. As a result, storing the result data in the external memory also takes a long time, which lowers both data transmission efficiency and data processing efficiency.
  • the present disclosure provides a data processing apparatus including a plurality of first storage units; different first storage units are respectively connected to different PEs in the PE array, and each PE in the PE array can perform read/write access to its connected first storage unit; further, PEs connected to different first storage units can access those first storage units in parallel, which improves the efficiency of reading data from the first storage units and of storing data in them, thereby improving data processing efficiency.
  • the apparatus includes a plurality of first storage units (the figure shows a plurality of first storage units, first storage unit 0 to first storage unit 3) and a computing unit; the computing unit includes a processing engine (PE) array; the plurality of first storage units are respectively connected with PEs in the PE array (the figure shows multiple PEs, PE0 to PE15); the PE is used to perform read/write access to the connected first storage unit; the plurality of first storage units are used to store the data transmitted by the connected PEs during read/write access.
  • the computing unit includes at least one PE array; the physical connection relationship between any PE in the PE array and other PEs may be as shown in FIG. 2, where multiple PEs together form a two-dimensional (2D) torus network, and each PE can be connected to the PEs with which it has a physical connection relationship, including those at the up, down, left, and right positions.
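  • As a minimal sketch of the connection pattern described above, the following computes the four neighbours of a PE on a 2D torus. It assumes wraparound at the array edges (the usual meaning of "torus"); the function name `torus_neighbors` and the 4×4 array size are illustrative assumptions, not taken from the disclosure.

```python
def torus_neighbors(row, col, rows, cols):
    """Return the up/down/left/right neighbours of PE (row, col) on a 2D torus."""
    return {
        "up": ((row - 1) % rows, col),       # wraps from the top row to the bottom row
        "down": ((row + 1) % rows, col),     # wraps from the bottom row to the top row
        "left": (row, (col - 1) % cols),     # wraps from the first column to the last
        "right": (row, (col + 1) % cols),    # wraps from the last column to the first
    }


# Hypothetical 4x4 array: the corner PE wraps around in two directions.
corner = torus_neighbors(0, 0, rows=4, cols=4)
```

  • An interior PE simply gets its four adjacent PEs; only PEs on the boundary make use of the wraparound.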
  • the PE array includes PE 22 that specifically completes related computing tasks and PE 21 that is located at the edge of the PE array.
  • PE 21 located at the edge of the PE array is marked as halo in FIG. 2 .
  • PE 22 can complete operations such as multiply-accumulate (MAC) of data;
  • PE 21 forms a peripheral ring array around the PE array: when the PEs 22 in the PE array process data, data may be shifted between different PEs, and the PEs 21 in the peripheral ring array ensure that data is not lost when it moves between different PEs inside the PE array.
  • FIG. 3 shows an internal structure of a PE provided by an embodiment of the present disclosure.
  • the figure includes a PE 22 that specifically completes related computing tasks, and a PE 21 that is connected to it and located at the edge of the PE array.
  • PE 22 includes: a memory access module 33, denoted M1; an arithmetic logic unit (Arithmetic and Logic Unit, ALU) 34, denoted ALU1; an internal register 35, denoted R0_1; and a shift register file 36.
  • the memory access module M1 is used for read/write access to the first storage unit connected to PE 22.
  • the memory access module M1 can transmit the data acquired from the first storage unit to the internal register R0_1 or to ALU1, where it waits to be processed by ALU1, or transmit it to the shift register file corresponding to PE 22, so that the acquired data is transferred to the PE 21 connected with PE 22; when performing write access to the first storage unit, M1 can write the result operation data calculated by ALU1 into the corresponding first storage unit.
  • the arithmetic logic unit ALU1 is used to perform data processing on the received data. Since data processing may involve multiple intermediate calculation steps, the intermediate operation data obtained can be transferred to the internal register in the PE and called in the next calculation for further processing. After the result operation data is obtained, it can be transmitted to the memory access module M1 for output according to the actual data processing instruction; alternatively, the result operation data can be transmitted to the shift register file corresponding to PE 22, where it waits to be transmitted to the PE 21 connected with PE 22.
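  • The data path just described (read through M1, compute in ALU1, keep intermediate data in the internal register, write back through M1) can be sketched as follows. This is a simplified software model under stated assumptions: the class `PE`, its method names, the dictionary used as a first storage unit, and the addresses `x0`, `w0`, `x1`, `w1`, `y` are all hypothetical, and the shift register file is omitted.

```python
class PE:
    """Hypothetical model of PE 22 with M1, ALU1 and internal register R0_1."""

    def __init__(self, storage):
        self.storage = storage  # connected first storage unit, modelled as a dict
        self.r0 = 0             # internal register holding intermediate operation data

    def read(self, addr):
        # M1: read access to the connected first storage unit
        return self.storage[addr]

    def mac(self, a, b):
        # ALU1: multiply-accumulate into the internal register
        self.r0 += a * b
        return self.r0

    def write(self, addr):
        # M1: write the result operation data back to the first storage unit
        self.storage[addr] = self.r0


storage = {"x0": 2, "w0": 3, "x1": 4, "w1": 5}
pe = PE(storage)
pe.mac(pe.read("x0"), pe.read("w0"))  # intermediate result stays in the register
pe.mac(pe.read("x1"), pe.read("w1"))  # the next step reuses the intermediate result
pe.write("y")                         # storage["y"] now holds 2*3 + 4*5
```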
  • PE 21 may not include an ALU internally, which reduces equipment requirements and thus equipment cost; alternatively, an ALU may be present but not actually perform related data operations, which reduces the complexity of device integration.
  • in FIG. 3, for the case where ALU2 exists in PE 21, its possible connection relationships, similar to those of ALU1 in PE 22, are represented by dashed lines.
  • the internal register R0_1 is used to receive and store the data read by M1 from the first storage unit to which PE 22 is connected; to obtain the result operation data and store it; or to transfer the result operation data to M1.
  • since PE 21 may only perform data transmission among multiple PEs, or only receive data transmitted from the first storage unit, it may not include an internal register, which reduces equipment requirements and thus equipment cost; alternatively, an internal register may be present but not perform storage-related tasks, which reduces the complexity of device integration.
  • in FIG. 3, for the case where an internal register R0_2 exists in PE 21, its possible connection relationships, similar to those of the internal register R0_1 in PE 22, are represented by dashed lines.
  • the shift register file is used to transmit the data acquired by the PE to other PEs connected to the PE.
  • the shift register file 36 corresponding to PE 22 can be connected in circuit with the shift register files of the PEs with which PE 22 has a connection relationship in the four directions of up, down, left, and right.
  • PE 21 likewise has a shift register file 37, including the corresponding shift registers R1', R2', R3', R4'.
  • when the shift register R4 of PE 22 has a connection relationship with the shift register R4' of PE 21, then after PE 22 transmits data to R4, the R4' corresponding to PE 21 can receive the data, so that PE 21 receives the data.
  • Different PEs in the PE array may be respectively connected with different first storage units, or connected with the same first storage unit.
  • a PE and a first storage unit that have a connection relationship can transmit data to each other: the PE can read the data in the first storage unit and transmit the processed data to the first storage unit.
  • each first storage unit can be connected to a plurality of corresponding PEs, so the first storage unit can include a plurality of storage spaces corresponding to the number of PEs, which are used to store the read/write data of each of the connected PEs.
  • multiple PEs may be grouped first, and a corresponding first storage unit may then be determined for each PE group.
  • when grouping multiple PEs, multiple PEs with a physical connection relationship can be regarded as a PE group, where the multiple PEs are located in the same row, in the same half row, or in the same block in the hardware layout.
  • a row of PEs may be used as a PE group, as shown in FIG. 4a , which shows a schematic diagram of a connection manner between the first storage unit and the PE array.
  • the first row of PEs is used as a PE group, denoted G0, the second row as a PE group, denoted G1, and so on, until the nth row is used as a PE group, denoted Gn.
  • the corresponding first storage unit 0 is allocated to PE group G0, the corresponding first storage unit 1 is allocated to PE group G1, and so on, until the corresponding first storage unit n is allocated to PE group Gn.
  • FIG. 4b shows a schematic diagram of another connection manner of the first storage unit and the PE array.
  • the PEs in the first row and the second row can be regarded as a PE group, denoted G0, and the PEs in the third row and the fourth row can be regarded as a PE group, denoted G1.
  • in this way, multiple rows of PEs in the PE array can be divided into n different PE groups, and corresponding first storage units are then allocated to the n different PE groups; the first storage units may include, for example, first storage unit 0 to first storage unit n.
  • each PE in the PE array can also be regarded as a PE group, and a corresponding first storage unit is allocated to each PE, that is, each PE in the PE array has a corresponding storage unit.
  • the number of PEs included in each PE group in the multiple PE groups may be the same or different.
  • for example, PEs with a closer physical connection relationship may be used as a PE group, or, in a more specialized device, multiple PEs with stronger computing and processing capabilities may be used as a PE group. That is, the specific way of determining the PE groups can be chosen according to the actual situation and is not limited here.
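  • The row-based grouping options described above can be sketched as follows. This is only an illustration: it assumes PEs are numbered row by row (consistent with PE0 to PE3 sharing first storage unit 0 in the FIG. 1 example) and that group k is served by first storage unit k; the function name `group_by_rows` and the 4×4 array size are hypothetical.

```python
def group_by_rows(rows, cols, rows_per_group=1):
    """Map each PE group index to the list of PE ids it contains.

    PEs are numbered row by row; group k is assumed to be served by
    first storage unit k.
    """
    groups = {}
    for r in range(rows):
        g = r // rows_per_group
        groups.setdefault(g, []).extend(r * cols + c for c in range(cols))
    return groups


one_row = group_by_rows(4, 4)                     # one row per group, as in FIG. 4a
two_rows = group_by_rows(4, 4, rows_per_group=2)  # two rows per group, as in FIG. 4b
```

  • Half-row, per-block, or per-PE groupings would only change how the group index is computed from a PE's position.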
  • in the first processing cycle, the PE can perform read access to the connected first storage unit to obtain the first data corresponding to the PE.
  • in the second processing cycle, the PE can perform write access to the connected first storage unit and store the second data generated by the PE in the connected first storage unit.
  • the processing cycle can be determined according to the actual data processing process. For example, a processing step that multiplies and adds data is relatively simple, so it can include two or three clock cycles; a processing step that performs weighted filtering of data is more complicated, so it can include four or five clock cycles. That is, the number of clock cycles included in a processing cycle is related to the actual processing process, and the number of clock cycles included in different processing cycles may be the same or different.
  • each group of PEs and its corresponding first storage unit are implemented in the form of single instruction, multiple data (Single Instruction Multiple Data, SIMD). Therefore, different PEs of each PE group can perform read/write access to the first storage unit connected to the PE group in different clock cycles of the same processing cycle; and/or, in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; wherein one processing cycle includes at least one clock cycle.
  • the PEs at the same position in each group of PEs can perform read access to their corresponding first storage units in the same clock cycle. Taking the embodiment corresponding to FIG. 1 as an example, there are four first storage units: first storage unit 0, first storage unit 1, first storage unit 2, and first storage unit 3. The PEs connected to first storage unit 0 include PE0, PE1, PE2, and PE3; the PEs connected to first storage unit 1 include PE4, PE5, PE6, and PE7; the PEs connected to first storage unit 2 include PE8, PE9, PE10, and PE11; and the PEs connected to first storage unit 3 include PE12, PE13, PE14, and PE15.
  • in this example, the first processing cycle may include 4 clock cycles. In the first clock cycle, PE0 performs read access to first storage unit 0; at the same time, PE4 performs read access to first storage unit 1, PE8 performs read access to first storage unit 2, and PE12 performs read access to first storage unit 3.
  • PE0, PE4, PE8, and PE12 can store the read data in their corresponding internal memory, so that a PE that includes an arithmetic logic unit can perform arithmetic processing on the read data, and a PE that does not include an arithmetic logic unit can store it and wait for the next processing cycle to shift it or transmit it in another way.
  • in the second clock cycle, PE1 performs read access to first storage unit 0; at the same time, PE5 performs read access to first storage unit 1, PE9 performs read access to first storage unit 2, and PE13 performs read access to first storage unit 3. In the third clock cycle, PE2 performs read access to first storage unit 0; at the same time, PE6 performs read access to first storage unit 1, PE10 performs read access to first storage unit 2, and PE14 performs read access to first storage unit 3. In the fourth clock cycle, PE3 performs read access to first storage unit 0; at the same time, PE7 performs read access to first storage unit 1, PE11 performs read access to first storage unit 2, and PE15 performs read access to first storage unit 3.
  • through the PEs' read access to the first storage units, the first data that is stored in each first storage unit and is waiting to be processed by a PE is transmitted to the internal register of the corresponding PE, where it awaits further processing.
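  • The staggered read pattern of this example can be written out as a small schedule generator. The sketch assumes the FIG. 1 layout (four first storage units, four PEs each, numbered consecutively) and one access per storage unit per clock cycle; the function name `read_schedule` is hypothetical.

```python
def read_schedule(num_groups=4, pes_per_group=4):
    """Return, per clock cycle, the (PE id, first storage unit id) pairs that access in parallel."""
    schedule = []
    for cycle in range(pes_per_group):
        # In each cycle, the PE at the same relative position in every group is active.
        accesses = [(g * pes_per_group + cycle, g) for g in range(num_groups)]
        schedule.append(accesses)
    return schedule


schedule = read_schedule()
for cycle, accesses in enumerate(schedule, start=1):
    print(f"clock cycle {cycle}: " + ", ".join(f"PE{p}->unit {u}" for p, u in accesses))
```

  • Under these assumptions all 16 PEs are served in 4 clock cycles, because each first storage unit handles one PE per cycle while the four units work in parallel.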
  • the control unit in the data processing apparatus when the PE performs read access to the first storage unit, the control unit in the data processing apparatus generates a first control signal based on the data processing instruction, and transmits the first control signal to the PE, and the PE responds to the first control signal, The first data to be processed by the PE is read from the first storage unit connected to the PE.
  • the data processing instructions may include related instructions for controlling the PE to operate on the data in the first storage unit, such as a data transfer instruction (MOV), an addition instruction (ADD), a subtraction instruction (SUB), a logical AND instruction (AND), etc. different instructions.
  • the control unit can generate a first control signal based on the data transfer instruction. The first control signal includes the data address that the receiving PE accesses when performing read access to the first storage unit, and is used to control the receiving PE to read data from the corresponding first storage unit and store the read data in the corresponding internal register.
  • the first control signal transmitted by the control unit to PE0 may include, for example, the address of s0.
  • after PE0 receives the first control signal, it can read the corresponding data from the storage space s0 of the connected first storage unit 0 according to the address of s0 carried in the signal.
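  • A minimal sketch of this read path is given below. The disclosure does not specify how the control signal is encoded, so the `FirstControlSignal` dataclass (carrying only an address), the `PE` class, and the stored values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FirstControlSignal:
    """Hypothetical encoding: the signal carries only the address to read."""
    address: str


class PE:
    def __init__(self, storage_unit):
        self.storage_unit = storage_unit  # connected first storage unit (dict model)
        self.internal_register = None

    def on_first_control_signal(self, signal):
        # Read the addressed storage space and keep the data in the internal register.
        self.internal_register = self.storage_unit[signal.address]


first_storage_unit_0 = {"s0": 11, "s1": 22, "s2": 33, "s3": 44}
pe0 = PE(first_storage_unit_0)
pe0.on_first_control_signal(FirstControlSignal(address="s0"))
```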
  • when the data of the image to be processed is stored in the first storage unit, the following method may be adopted, for example: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler in the data processing apparatus; the data scheduler performs write access to the first storage unit based on the third control signal.
  • the third control signal may carry, for example, a first data storage address, and the first data storage address is used to determine the storage location of the data to be processed stored in the first storage unit.
  • the data processing apparatus further includes a second storage unit, and the second storage unit may include an external memory for storing data such as original images and feature maps to be processed.
  • the embodiments of the present disclosure take the processing of an original image as an example to describe in detail how the data processing apparatus performs data processing. Taking the PE array shown in FIG. 1 as an example, when each PE can process sub-image data composed of 4×4 pixels and the image size (in pixels) is 16×16, the image can be divided evenly so that each PE processes a corresponding 4×4-pixel sub-image.
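  • The division of a 16×16 image into sixteen 4×4 sub-images can be sketched as follows. The row-major assignment of sub-image k to PEk and the function name `split_into_tiles` are assumptions made for illustration; the pixel values are placeholders.

```python
def split_into_tiles(image, tile=4):
    """Split a square image (list of rows) into tile x tile sub-images, row-major."""
    n = len(image)
    tiles = []
    for top in range(0, n, tile):
        for left in range(0, n, tile):
            tiles.append([row[left:left + tile] for row in image[top:top + tile]])
    return tiles


# Placeholder 16x16 image whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles = split_into_tiles(image)  # tiles[k] is the sub-image assumed to go to PEk
```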
  • the data contained in the 16 sub-images obtained can be stored in the second storage unit, waiting for the data scheduler to read it from the second storage unit; since the data stored in the second storage unit is data that the PEs can process directly, moving it from the second storage unit to the first storage units only requires data transmission, without data segmentation or other processing. This reduces the processing tasks of the data processing apparatus during data transmission and improves the efficiency of data transmission; in addition, since the data stored in the second storage unit can be used directly as the to-be-processed data corresponding to the first storage units, it also makes it easier for the first storage units, and the PEs connected to them, to read the to-be-processed data.
  • the data scheduler reads the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores it in the corresponding first storage unit; the to-be-processed data corresponding to each first storage unit includes the data that needs to be read by the PEs connected to that first storage unit.
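  • The data scheduler's write path can be sketched as a simple copy driven by addresses. The form of the third control signal is not specified in the disclosure; here it is assumed, purely for illustration, to be a mapping from a key in the second storage unit to a (first storage unit index, storage space) pair. The function name `scatter` and the placeholder payload strings are hypothetical.

```python
def scatter(second_storage, first_storage_units, third_control_signal):
    """Copy to-be-processed data from the second storage unit into the first storage units.

    third_control_signal maps a key in the second storage unit to the
    (first storage unit index, storage space) where the data should go.
    """
    for key, (unit, address) in third_control_signal.items():
        first_storage_units[unit][address] = second_storage[key]


# 16 sub-images in the second storage unit, 4 first storage units with spaces s0..s3.
second_storage = {f"sub{k}": f"pixels-of-sub-image-{k}" for k in range(16)}
first_storage_units = [{} for _ in range(4)]
signal = {f"sub{k}": (k // 4, f"s{k % 4}") for k in range(16)}
scatter(second_storage, first_storage_units, signal)
```

  • No segmentation happens in `scatter` itself, which mirrors the point above that the transfer is a plain data move.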
  • the PE can wait for the control unit to transmit a control signal; after receiving the first control signal sent by the control unit, the PE can read the corresponding data from the corresponding first storage unit for processing.
  • the processing performed by the PE includes multiple steps, such as weighted summation. During processing there may therefore be multiple pieces of intermediate data, which can, for example, be stored temporarily in the internal memory corresponding to the PE and then called directly from the internal memory in the next processing step, until all data processing tasks for the original image are completed.
  • the intermediate data can also be transmitted to the first storage unit, but since the intermediate data is not the final output result data, further processing is required, so the intermediate data in the first storage unit may not be sent to the second storage unit. output.
  • the control unit may generate a second control signal based on the data processing instruction, and transmit the second control signal to the PE; in response to receiving the second control signal, the PE writes the data it has generated into the first storage unit connected to the PE.
  • the second control signal is similar to the first control signal described above: it includes the data address that the receiving PE accesses when performing write access to the first storage unit, and is used to control the receiving PE to write data to the corresponding first storage unit, so that the first storage unit receives the data written by the corresponding PE and holds it for output to the second storage unit, so that the processing result of the original image can be obtained.
  • the control unit can also generate a fourth control signal and transmit it to the data scheduler; based on the fourth control signal, the data scheduler reads the result data from the plurality of first storage units and stores the result data in the second storage unit; the result data includes the data generated by the PEs connected to the first storage units and stored in the first storage units.
  • the fourth control signal may carry a second data storage address, where the second data storage address is used to indicate a location where the data scheduler stores the result data in the second storage unit.
  • the fourth control signal may not carry the storage address of the second data.
  • the data scheduler may read from first storage unit 0 the result data generated by PE0, PE1, PE2, and PE3 respectively, that is, the result data held in the four data storage spaces s0, s1, s2, and s3 of first storage unit 0, and then store the result data in the second storage unit to obtain the processing result of the original image.
  • the control unit may also control the sequential splicing of the multiple pieces of result data output from the second storage unit, so that the result data obtained from the sub-images into which the original image was divided is restored to the result data corresponding to the original image.
  • the embodiment of the present disclosure also provides a specific example of performing convolution processing on the original image A by using a data processing apparatus.
  • FIG. 5 is a schematic diagram of the data processing apparatus during data processing. As shown in FIG. 5, there are 4 memory units (the first storage units), denoted PE_RAM0 to PE_RAM3, and the PE array includes 16 PEs, denoted PE0 to PE15.
  • PE0 to PE3, PE4 to PE7, PE8 to PE11, and PE12 to PE15 each form a PE group; the groups are denoted G0, G1, G2, and G3 respectively.
  • after the PE sub-arrays are determined, PE_RAM0 is taken as the first storage unit corresponding to G0, PE_RAM1 as the first storage unit corresponding to G1, PE_RAM2 as the first storage unit corresponding to G2, and PE_RAM3 as the first storage unit corresponding to G3.
  • when the data processing apparatus is used to perform a convolution-layer operation, the control unit generates a third control signal C3 based on the data processing instruction and sends it to the data scheduler; the data scheduler performs read access to the second storage unit, which stores the data corresponding to the original image A, and then stores the data needed for the convolution calculation from the second storage unit into the first storage units.
  • the control unit then sends the first control signal C1 to the PEs, and each working PE in the PE array reads its first data to be processed from the corresponding first storage unit and performs the corresponding calculation.
  • C1 controls the following operations: in the first clock cycle, PE0, PE4, PE8, and PE12, which correspond to PE_RAM0 to PE_RAM3 respectively, each read their own first data to be processed; in the second clock cycle, PE1, PE5, PE9, and PE13 each read their own first data to be processed; in the third clock cycle, PE2, PE6, PE10, and PE14 do the same; and in the fourth clock cycle, PE3, PE7, PE11, and PE15 do the same.
  • PE0 to PE15 respectively perform data processing on the corresponding first data to be processed, for example, perform convolution operation processing on the first data to obtain second data.
  • the second data is the result data.
  • after the PEs in the PE array process the first data to obtain the second data, the control unit sends the second control signal C2 to the PEs, and the second data in each PE is written into the first storage unit corresponding to that PE. The control unit then sends a fourth control signal C4 to the data scheduler, so that the data scheduler reads the result data out of the first storage units and stores it in the second storage unit.
  • the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
  • the embodiment of the present disclosure also provides a data processing method corresponding to the data processing apparatus. For the implementation of the method, reference may be made to the implementation of the apparatus; repeated details are omitted.
  • an embodiment of the present disclosure provides a data processing method, and the data processing method is applied to a data processing apparatus; the data processing method includes:
  • the plurality of first storage units store data transmitted by the connected PEs during the read/write access process.
  • the plurality of first storage units are respectively connected to different PE groups in the PE array.
  • each first storage unit is connected to the PEs in one PE group; different PEs belong to different PE groups.
  • the one PE group includes multiple PEs in the PE array that are physically connected, and the multiple PEs are located in the same row, the same half row, or the same block of the hardware layout.
  • the PE performing read/write access to the connected first storage unit includes: in a first processing cycle, the PE performs read access to the connected first storage unit and obtains the first data corresponding to the PE; and/or, in a second processing cycle, the PE performs write access to the connected first storage unit and stores the second data generated by the PE in the connected first storage unit.
  • the PE performing read/write access to the connected first storage unit includes: different PEs connected to the same first storage unit perform read/write access to that first storage unit in different clock cycles of the same processing cycle; and/or, in the PE groups connected to different first storage units, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; one processing cycle includes at least one clock cycle.
  • at least one PE in each of at least two different PE groups performing read/write access to its connected first storage unit in the same clock cycle includes: in the PE groups connected to different first storage units, the PEs at the same relative position perform read/write access to their connected first storage units in the same clock cycle.
  • the data processing apparatus further includes a control unit, and the data processing method further includes: the control unit generates a first control signal based on a data processing instruction and transmits the first control signal to the PE; in response to the first control signal, the PE reads the first data to be processed by the PE from the first storage unit connected to the PE.
  • the control unit generates a second control signal based on the data processing instruction and transmits the second control signal to the PE; in response to the second control signal, the PE writes the second data generated by the PE into the first storage unit connected to the PE.
  • the data processing apparatus further includes a data scheduler, and the data processing method further includes: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler; the data scheduler performs write access to the first storage unit based on the third control signal.
  • the data processing apparatus further includes a second storage unit; the data scheduler reads the data to be processed corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores the data to be processed in the corresponding first storage unit; the data to be processed includes the first data that the PE needs to read from the first storage unit connected to it.
  • the control unit generates a fourth control signal based on the data processing instruction and transmits the fourth control signal to the data scheduler; the data scheduler performs read access to the first storage unit based on the fourth control signal.
  • the data scheduler performing read access to the first storage unit based on the fourth control signal includes: the data scheduler, based on the fourth control signal, reads the result data from the plurality of first storage units and stores the result data in the second storage unit; the result data includes the second data generated by the PE and stored in the first storage unit connected to it.
  • the embodiments of the present disclosure further provide a computer device, including: an instruction memory and the data processing apparatus provided by the embodiments of the present disclosure.
  • the data processing apparatus provided by the embodiments of the present disclosure may include a chip, an AI chip, and the like.
  • the computer device provided by the embodiment of the present disclosure may include a smart terminal such as a mobile phone, or may be other devices, servers, etc. that can be used for data processing, which is not limited here.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the data processing method described in the above method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • the embodiments of the present disclosure further provide a computer program product that carries program code; the instructions included in the program code can be used to execute the steps of the data processing method described in the foregoing method embodiments. For details, refer to the foregoing method embodiments, which are not repeated here.
  • the above-mentioned computer program product can be implemented by means of hardware, software, or a combination thereof.
  • in one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-executable non-volatile computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program codes .

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Processing (AREA)

Abstract

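The present disclosure provides a data processing apparatus, a method, a computer device, and a storage medium. The apparatus includes a plurality of first storage units and a computing unit; the computing unit includes a PE array; the plurality of first storage units are respectively connected to PEs in the PE array; each PE is configured to perform read/write access to the first storage unit to which it is connected; and the plurality of first storage units are configured to store data transmitted by the connected PEs during read/write access.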

Description

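Data processing apparatus, method, computer device, and storage medium
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 202110221038.1, filed on February 26, 2021 and entitled "A data processing apparatus, method, computer device, and storage medium", which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a data processing apparatus, a method, a computer device, and a storage medium.
Background
Image processing algorithms are widely used in fields such as image recognition and object detection, and are usually implemented on artificial intelligence (AI) accelerator hardware architectures. A commonly used AI accelerator hardware architecture mainly includes a storage unit, a computing unit, a control unit, and so on. The core computing unit generally consists of a two-dimensional processing engine (PE) array and a register array (local register file); the storage unit may consist of different levels of cache, including storage such as double data rate synchronous dynamic random access memory (DDR), static random access memory (SRAM), registers, and cache. The input data stream is buffered and moved between the different storage spaces and enters the register array corresponding to the PE array; after the PE array reads the data from the register array, it performs arithmetic (or logical) operations and finally writes the results back to the corresponding storage space. At present, however, the way input data is moved into the register array is inefficient.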
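Summary
Embodiments of the present disclosure provide at least a data processing apparatus, a method, a computer device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a data processing apparatus including a plurality of first storage units and a computing unit; the computing unit includes a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array; each PE is configured to perform read/write access to the connected first storage unit; and the plurality of first storage units are configured to store data transmitted by the connected PEs during read/write access.
In this way, different PEs in the PE array that are connected to different first storage units can access those first storage units in parallel, which improves the efficiency of reading data from the first storage units and of storing data to them, and thus improves data processing efficiency.
In an optional implementation, the PEs in the PE array are divided into a plurality of PE groups, and the plurality of first storage units are respectively connected to different PE groups in the PE array.
In an optional implementation, each first storage unit is connected to the PEs in one PE group.
In an optional implementation, one PE group among the plurality of PE groups includes a plurality of PEs in the PE array that are physically connected, and these PEs are located in the same row, the same half row, or the same block of the hardware layout.
In this way, assigning corresponding first storage units to the PEs in the PE array increases the number of data transmission channels, so that more data can be transferred when the PEs perform read/write access to the first storage units, which improves data transmission efficiency; it also makes the data processing apparatus more flexible, so that it can adapt to different data processing requirements.
In an optional implementation, the PE is configured for at least one of the following: in a first processing cycle, performing read access to the connected first storage unit to obtain first data corresponding to the PE; in a second processing cycle, performing write access to the connected first storage unit to store second data generated by the PE in the connected first storage unit.
In an optional implementation, the PEs are configured for at least one of the following: different PEs in each PE group perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle; in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; where one processing cycle includes at least one clock cycle.
In an optional implementation, in at least two different PE groups, the PEs at the same relative position perform read/write access to their connected first storage units in the same clock cycle.
In this way, multiple PEs can synchronously perform read/write access to different first storage units, which improves data transmission efficiency.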
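In an optional implementation, the apparatus further includes a control unit; the control unit is configured to generate a first control signal based on a data processing instruction and transmit the first control signal to the PE; the PE is configured to read, in response to the first control signal, the first data to be processed by the PE from the first storage unit connected to the PE.
In an optional implementation, the control unit is further configured to generate a second control signal based on the data processing instruction and transmit the second control signal to the PE; the PE is configured to write, in response to the second control signal, the second data generated by the PE into the first storage unit connected to the PE.
In an optional implementation, the apparatus further includes a data scheduler; the control unit is further configured to generate a third control signal based on the data processing instruction and transmit the third control signal to the data scheduler; the data scheduler is configured to perform write access to the first storage unit based on the third control signal.
In this way, using the data scheduler as the medium for data transmission allows large volumes of data to be transferred efficiently and in order, avoiding errors during transmission.
In an optional implementation, the apparatus further includes a second storage unit; the data scheduler is configured to read the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, store the to-be-processed data in the corresponding first storage unit; the to-be-processed data includes the first data that each PE needs to read from the first storage unit connected to it.
In an optional implementation, the control unit is further configured to generate a fourth control signal based on the data processing instruction and transmit the fourth control signal to the data scheduler; the data scheduler is further configured to read, based on the fourth control signal, result data from the plurality of first storage units and store the result data in the second storage unit; the result data includes the second data generated by the PE and stored in the first storage unit connected to it.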
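In a second aspect, an embodiment of the present disclosure further provides a data processing method applied to a data processing apparatus. The data processing apparatus includes a plurality of first storage units and a computing unit; the computing unit includes a PE array; the plurality of first storage units are respectively connected to PEs in the PE array. The data processing method includes: the PE performs read/write access to the connected first storage unit; and the plurality of first storage units store data transmitted by the connected PEs during read/write access.
In an optional implementation, the PEs in the PE array are divided into a plurality of PE groups, and the plurality of first storage units are respectively connected to different PE groups in the PE array.
In an optional implementation, each first storage unit is connected to the PEs in one PE group. In an optional implementation, one PE group among the plurality of PE groups includes a plurality of PEs in the PE array that are physically connected, and these PEs are located in the same row, the same half row, or the same block of the hardware layout.
In an optional implementation, the PE performing read/write access to the connected first storage unit includes at least one of the following: in a first processing cycle, the PE performs read access to the connected first storage unit to obtain first data corresponding to the PE; in a second processing cycle, the PE performs write access to the connected first storage unit to store second data generated by the PE in the connected first storage unit.
In an optional implementation, the PE performing read/write access to the connected first storage unit includes at least one of the following: different PEs in each PE group perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle; in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; where one processing cycle includes at least one clock cycle.
In an optional implementation, at least one PE in each of at least two different PE groups performing read/write access to its connected first storage unit in the same clock cycle includes: in at least two different PE groups, the PEs at the same relative position perform read/write access to their connected first storage units in the same clock cycle.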
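In an optional implementation, the data processing apparatus further includes a control unit, and the data processing method further includes: the control unit generates a first control signal based on a data processing instruction and transmits the first control signal to the PE; in response to the first control signal, the PE reads the first data to be processed by the PE from the first storage unit connected to the PE.
In an optional implementation, the method further includes: the control unit generates a second control signal based on the data processing instruction and transmits the second control signal to the PE; in response to the second control signal, the PE writes the second data generated by the PE into the first storage unit connected to the PE.
In an optional implementation, the data processing apparatus further includes a data scheduler, and the data processing method further includes: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler; the data scheduler performs write access to the first storage unit based on the third control signal.
In an optional implementation, the data processing apparatus further includes a second storage unit; the data scheduler reads the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores the to-be-processed data in the corresponding first storage unit; the to-be-processed data includes the first data that the PE needs to read from the first storage unit connected to it.
In an optional implementation, the method further includes: the control unit generates a fourth control signal based on the data processing instruction and transmits the fourth control signal to the data scheduler; the data scheduler, based on the fourth control signal, reads result data from the plurality of first storage units and stores the result data in the second storage unit; the result data includes the second data generated by the PE and stored in the first storage unit connected to it.
In a third aspect, an optional implementation of the present disclosure further provides a computer device including an instruction memory and the data processing apparatus provided in the first aspect of the present disclosure.
In a fourth aspect, an optional implementation of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when run, the computer program executes the steps of the second aspect or of any possible implementation of the second aspect.
For a description of the effects of the above data processing method, computer device, and computer-readable storage medium, see the description of the data processing apparatus above; it is not repeated here.
To make the above objectives, features, and advantages of the present disclosure clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.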
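Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the embodiments are briefly introduced below. These drawings show embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings show only some embodiments of the present disclosure and should therefore not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
FIG. 1 shows a schematic diagram of a data processing apparatus provided by an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a PE array provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of the internal structure of a PE provided by an embodiment of the present disclosure;
FIG. 4a shows a schematic diagram of one way of connecting the first storage units to the PE array provided by an embodiment of the present disclosure;
FIG. 4b shows a schematic diagram of another way of connecting the first storage units to the PE array provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a data processing apparatus during data processing provided by an embodiment of the present disclosure;
FIG. 6 shows a flowchart of a data processing method provided by an embodiment of the present disclosure.
Detailed description
To make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments generally described and illustrated herein may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments is not intended to limit the scope of the claimed disclosure but merely represents selected embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the scope of protection of the present disclosure.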
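Research has found that, when an AI accelerator hardware architecture is used to process image data, the image data usually has to be transferred from external memory into the registers contained in the PEs of the PE array, so that the computing unit of each PE can read the corresponding image data from its register and process it. The register array formed by the registers of the different PEs shares a single bus, and the bandwidth of that bus is limited. To avoid data conflicts on the bus, the data needed by each register enters the corresponding register from external memory one item at a time. Transferring the image data to the register array therefore takes a large amount of time, which lowers data processing efficiency.
In addition, after the PE array processes the image data it generates result data, which has to be stored in external memory; here too, the bus is used to transfer the result data produced by the different PEs to external memory one by one. Storing the result data in external memory therefore also takes considerable time, which lowers data transmission efficiency and likewise lowers data processing efficiency.
Based on the above research, the present disclosure provides a data processing apparatus that includes a plurality of first storage units. Different first storage units are connected to different PEs in the PE array, and each PE can perform read/write access to the first storage unit connected to it. As a result, different PEs connected to different first storage units can access those units in parallel, which improves the efficiency of reading data from the first storage units and of storing data to them, and thus improves data processing efficiency.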
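The difference between the shared-bus arrangement and the per-group storage units can be illustrated with a rough cycle count. The sketch below is not taken from the patent: it assumes one transfer per clock cycle on each bus or storage unit, and the function names are hypothetical.

```python
import math


def cycles_shared_bus(num_pes: int) -> int:
    """Clock cycles to deliver one item to every PE over a single shared bus."""
    # Assumption: the shared bus carries one transfer per cycle,
    # so the PEs are served one after another.
    return num_pes


def cycles_per_group_storage(num_pes: int, num_storage_units: int) -> int:
    """Clock cycles when every storage unit serves its own PE group in parallel."""
    # Assumption: PEs inside a group still take turns on their storage unit,
    # while the groups proceed concurrently.
    return math.ceil(num_pes / num_storage_units)


if __name__ == "__main__":
    print(cycles_shared_bus(16), cycles_per_group_storage(16, 4))
```

Under these assumptions, the 16-PE, 4-storage-unit arrangement of FIG. 1 needs a quarter of the transfer cycles of a single shared bus.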
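The defects of the above schemes are results obtained by the inventors through practice and careful study. Therefore, the process of discovering the above problems, and the solutions proposed below by the present disclosure for those problems, should all be regarded as contributions made by the inventors to the present disclosure.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be further defined or explained in subsequent drawings.
To facilitate understanding of this embodiment, a data processing apparatus provided by an embodiment of the present disclosure is first described in detail.
Referring to FIG. 1, a schematic diagram of a data processing apparatus provided by an embodiment of the present disclosure, the apparatus includes a plurality of first storage units (the figure shows first storage unit 0 to first storage unit 3) and a computing unit; the computing unit includes a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array (the figure shows PE0 to PE15); the PEs are configured to perform read/write access to the connected first storage units; and the plurality of first storage units are configured to store data transmitted by the connected PEs during read/write access.
For example, the computing unit includes at least one PE array. The physical connections between any PE in the PE array and the other PEs may be as shown in FIG. 2: the PEs together form a 2D (two-dimensional) torus network, and a PE may be connected to the PEs physically located above, below, to the left of, and to the right of it.
The PE array includes PEs 22, which actually carry out the computing tasks, and PEs 21, which are located at the edge of the PE array; for ease of distinction, the PEs 21 at the edge of the PE array are labeled "halo" in FIG. 2. A PE 22 can perform operations such as multiply-accumulate (MAC) on data. The PEs 21 form a peripheral ring around the PE array: because data may be shifted between different PEs while the PEs 22 are processing it, the PEs 21 in the peripheral ring ensure that data is not lost when it moves between PEs inside the PE array.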
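In addition, FIG. 3 shows an internal PE structure provided by an embodiment of the present disclosure. This example includes one PE 22 that carries out computing tasks and, connected to it, one PE 21 located at the edge of the PE array. The PE 22 includes a memory access module 33, denoted M1; an arithmetic and logic unit (ALU) 34, denoted ALU1; an internal register 35, denoted R0_1; and a shift register file 36.
The memory access module M1 performs read/write access to the first storage unit connected to the PE 22. On a read access, M1 may pass the data obtained from the first storage unit to the internal register R0_1 or to ALU1, where it waits to be processed by ALU1, or pass it to the shift register file of the PE 22 so that it can be transferred to the PE 21 connected to the PE 22. On a write access, M1 may write the result data computed by ALU1 into the corresponding first storage unit.
The arithmetic and logic unit ALU1 processes the data it receives. Because processing may involve several intermediate calculation steps, the intermediate data obtained may be passed to the internal register of the PE and fetched from there in the next calculation for further processing. Once the result data is obtained, it may be passed to the memory access module M1 to await output, depending on the actual data processing instruction; alternatively, it may be passed to the shift register file of the PE 22 to await transfer to the PE 21 connected to the PE 22.
Because the PE 21 does not perform data operations, it may contain no ALU, which lowers device requirements and hence device cost; or it may contain an ALU that does not actually perform the related data operations, which reduces the complexity of device integration. In FIG. 3, dashed lines show the connections that an ALU2 in the PE 21, if present, might have, similar to those of ALU1 in the PE 22.
The internal register R0_1 receives and stores the data that M1 reads from the first storage unit connected to the PE 22; or, being connected to ALU1, it stores the intermediate data produced and passes it to ALU1 so that ALU1 obtains the result data, and stores the result data; or it passes the result data to M1.
Because the PE 21 may only transfer data between PEs, or only receive data transferred from the first storage unit, it may contain no internal register, which lowers device requirements and hence device cost; or it may contain an internal register that does not perform any storage task, which reduces the complexity of device integration. In FIG. 3, dashed lines show the connections that an internal register R0_2 in the PE 21, if present, might have, similar to those of the internal register R0_1 in the PE 22.
The shift register file transfers data obtained by a PE to the other PEs connected to it. In FIG. 3, the shift register file 36 of the PE 22 may be electrically connected to the shift register files of the PEs connected to it in the up, down, left, and right directions; correspondingly, the shift register file 36 contains four shift registers R1, R2, R3, and R4. Likewise, the PE 21 contains a shift register file 37 with corresponding shift registers R1’, R2’, R3’, and R4’. Data is transferred from the PE 22 to the PE 21 through a shift register in file 36 that is connected to file 37. For example, when shift register R4 in the PE 22 is connected to shift register R4’ in the PE 21, the PE 22 places the data in R4 and R4’ of the PE 21 receives it, so that the PE 21 receives the data.
The other PEs have an internal structure similar to that described above, so they are not described again here.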
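Different PEs in the PE array may be connected to different first storage units or to the same first storage unit. A PE and a first storage unit that are connected can exchange data: the PE can read data from the first storage unit and transfer processed data to it.
Each first storage unit may be connected to several PEs, so a first storage unit may contain as many storage spaces as it has connected PEs, each holding the data read or written by one of those PEs.
Specifically, when determining how the first storage units are connected to the PEs in the PE array, the PEs may, for example, first be grouped, and one corresponding first storage unit determined for each resulting PE group. When grouping, several physically connected PEs may be taken as one PE group, with those PEs located in the same row, the same half row, or the same block of the hardware layout.
For example, in one possible implementation, one row of PEs may be taken as a PE group; see FIG. 4a, a schematic diagram of one way of connecting the first storage units to the PE array. In FIG. 4a, the first row of PEs is one PE group, denoted G0, the second row is one PE group, denoted G1, and so on, until the n-th row forms a PE group denoted Gn. First storage unit 0 is assigned to PE group G0, first storage unit 1 to PE group G1, and so on, until first storage unit n is assigned to PE group Gn.
In another possible implementation, see FIG. 4b, a schematic diagram of another way of connecting the first storage units to the PE array. In FIG. 4b, two rows of PEs form one PE group: the PEs of the first and second rows form a PE group denoted G0, the PEs of the third and fourth rows form a PE group denoted G1, and so on. The rows of the PE array can thus be divided into n different PE groups, and n corresponding first storage units, for example first storage unit 0 to first storage unit n, are then assigned to these n PE groups.
As a special case, each PE in the PE array may be its own PE group with its own first storage unit, that is, every PE in the PE array has a corresponding storage unit. This divides the first storage units even further and maximizes the throughput of data exchange, so less time is spent on data transmission.
The number of PEs in each PE group may be the same or different. For example, to reduce the complexity of circuit routing, PEs that are closely connected physically may be taken as one PE group; or, in a more specialized device, several PEs with stronger computing capability may be taken as one PE group. The way PE groups are determined can therefore depend on the actual situation and is not limited here.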
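The row-based grouping of FIG. 4a and FIG. 4b can be sketched as a small helper. This is an illustrative assumption rather than the patent's implementation: PEs are numbered in row-major order, and `group_pes_by_rows` is a hypothetical name.

```python
def group_pes_by_rows(num_rows, num_cols, rows_per_group=1):
    """Map each PE group index to the row-major PE ids it contains.

    Group k is meant to share first storage unit k.
    """
    if num_rows % rows_per_group != 0:
        raise ValueError("num_rows must be a multiple of rows_per_group")
    groups = {}
    for row in range(num_rows):
        group_id = row // rows_per_group
        first_pe = row * num_cols
        # All PEs of this row join the group that owns the row.
        groups.setdefault(group_id, []).extend(range(first_pe, first_pe + num_cols))
    return groups


if __name__ == "__main__":
    print(group_pes_by_rows(4, 4))      # one row per group, as in FIG. 4a
    print(group_pes_by_rows(4, 4, 2))   # two rows per group, as in FIG. 4b
```

Choosing `rows_per_group` trades the number of storage units against how many PEs must take turns on each unit.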
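When a PE performs read/write access to a first storage unit, in the case of read access it may, for example, perform read access to the connected first storage unit in a first processing cycle and obtain the first data corresponding to the PE.
In the case of write access, it may, for example, perform write access to the connected first storage unit in a second processing cycle and store the second data generated by the PE in the connected first storage unit.
The processing cycle can be determined by the actual data processing. For a step such as multiply-accumulate, the calculation is relatively simple, so the processing cycle may include two or three clock cycles; for a step such as weighted filtering, the calculation is more complex, so the processing cycle may include four or five clock cycles. In other words, the number of clock cycles in a processing cycle depends on the actual processing, and different processing cycles may include the same or different numbers of clock cycles.
In addition, there are multiple PE groups, and data transmission between each PE group and its corresponding first storage unit is implemented in a single instruction, multiple data (SIMD) manner. Therefore, different PEs in each PE group may perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle; and/or, in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle. One processing cycle includes at least one clock cycle.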
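For example, with the first storage units and PEs shown in FIG. 1, in one processing cycle the PEs at the same position in each PE group can perform read access to their corresponding first storage units. In the embodiment of FIG. 1 there are four first storage units: first storage unit 0, first storage unit 1, first storage unit 2, and first storage unit 3. The PEs connected to first storage unit 0 are PE0, PE1, PE2, and PE3; those connected to first storage unit 1 are PE4, PE5, PE6, and PE7; those connected to first storage unit 2 are PE8, PE9, PE10, and PE11; and those connected to first storage unit 3 are PE12, PE13, PE14, and PE15. In this example the first processing cycle may include four clock cycles. In the first clock cycle, PE0 performs read access to first storage unit 0, PE4 to first storage unit 1, PE8 to first storage unit 2, and PE12 to first storage unit 3.
PE0, PE4, PE8, and PE12 can then store the data they have read in their internal memories, so that a PE containing an arithmetic and logic unit can process the data, or a PE without an arithmetic and logic unit can hold it until it is shifted or otherwise transferred in the next processing cycle.
In the second clock cycle, PE1 performs read access to first storage unit 0 while PE5 performs read access to first storage unit 1, PE9 to first storage unit 2, and PE13 to first storage unit 3. In the third clock cycle, PE2 performs read access to first storage unit 0 while PE6 performs read access to first storage unit 1, PE10 to first storage unit 2, and PE14 to first storage unit 3. In the fourth clock cycle, PE3 performs read access to first storage unit 0 while PE7 performs read access to first storage unit 1, PE11 to first storage unit 2, and PE15 to first storage unit 3. This completes the read access of the PEs to the first storage units in the first processing cycle: the first data that was stored in the first storage units awaiting processing has been transferred to the internal register of each PE, where it awaits further data access.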
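The clock-by-clock read pattern described above can be written out as a schedule. The following is a minimal sketch assuming the FIG. 1 grouping; `read_schedule` and `GROUPS` are hypothetical names introduced only for illustration.

```python
def read_schedule(groups):
    """Return, per clock cycle, the (pe_id, storage_unit_id) pairs that access memory.

    In cycle t, the PE at relative position t in every group accesses the
    storage unit of its own group, so all storage units are busy in parallel.
    """
    cycles = max(len(pes) for pes in groups.values())
    schedule = []
    for cycle in range(cycles):
        accesses = [
            (pes[cycle], unit)
            for unit, pes in sorted(groups.items())
            if cycle < len(pes)
        ]
        schedule.append(accesses)
    return schedule


# Grouping assumed from FIG. 1: storage unit k serves PEs 4k .. 4k+3.
GROUPS = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11], 3: [12, 13, 14, 15]}

if __name__ == "__main__":
    for cycle, accesses in enumerate(read_schedule(GROUPS), start=1):
        print(f"clock cycle {cycle}: {accesses}")
```

Each storage unit sees exactly one access per clock cycle, which is why no two PEs of the same group conflict.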
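In another possible implementation, when the PE array contains many PEs and the image to be processed is small, processing the image may require only some of the PEs in the PE array. Some PEs may therefore not access their first storage units in the first processing cycle and simply wait for the data processing instruction of the next processing cycle.
Specifically, when a PE performs read access to a first storage unit, the control unit in the data processing apparatus generates a first control signal based on a data processing instruction and transmits it to the PE; in response to the first control signal, the PE reads the first data to be processed by the PE from the first storage unit connected to the PE.
The data processing instruction may include instructions that control how the PE operates on the data in the first storage unit, such as a data transfer instruction (MOV), an addition instruction (ADD), a subtraction instruction (SUB), or a logical AND instruction (AND).
Taking the processing of an arbitrary image with the data processing apparatus as an example, after the image has been processed and stored in the first storage unit, the control unit can generate a first control signal based on a data transfer instruction. This first control signal includes the data address that the receiving PE accesses when performing read access to the first storage unit; it controls the receiving PE to read data from the corresponding first storage unit and store the data it reads in its internal register.
For example, first storage unit 0 in FIG. 1 is connected to four PEs, PE0 to PE3, so it may include four corresponding data storage spaces, denoted s0, s1, s2, and s3. The first control signal that the control unit transmits to PE0 may, for example, include the address of s0; after receiving the first control signal, PE0 can use the address of s0 carried in it to read the corresponding data from data storage space s0 in the connected first storage unit 0.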
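A control signal that carries the address to access can be modeled with two small records. The sketch below is only an assumed software analogy for the hardware behavior: `FirstStorageUnit`, `ControlSignal`, and `pe_read` are hypothetical names, and the string addresses `s0` to `s3` stand in for the data storage spaces.

```python
from dataclasses import dataclass, field


@dataclass
class FirstStorageUnit:
    """One first storage unit with one data storage space per connected PE."""
    spaces: dict = field(default_factory=dict)  # address -> stored data

    def write(self, address, data):
        self.spaces[address] = data

    def read(self, address):
        return self.spaces[address]


@dataclass
class ControlSignal:
    """Hypothetical control-signal record: the target PE and the address to access."""
    pe_id: int
    address: str


def pe_read(unit: FirstStorageUnit, signal: ControlSignal):
    """Model a PE answering a first control signal by reading its own space."""
    return unit.read(signal.address)


# First storage unit 0 holds one space for each of PE0 to PE3.
unit0 = FirstStorageUnit()
for i in range(4):
    unit0.write(f"s{i}", f"data for PE{i}")

if __name__ == "__main__":
    print(pe_read(unit0, ControlSignal(pe_id=0, address="s0")))
```

Because the address travels with the signal, the PE itself needs no knowledge of how the storage unit is laid out.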
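The other PEs read data from their first storage units in a way similar to how PE0 reads data from first storage unit 0, so this is not described again here.
In addition, the image to be processed may be processed and stored in the first storage units in, for example, the following way: the control unit generates a third control signal based on the data processing instruction and transmits it to the data scheduler in the data processing apparatus; the data scheduler performs write access to the first storage units based on the third control signal.
The third control signal may, for example, carry a first data storage address, which determines where the to-be-processed data is stored in the first storage unit.
In a specific implementation, the data processing apparatus further includes a second storage unit, which may include external memory and stores data such as the original image to be processed and feature maps. The embodiments of the present disclosure use the processing of an original image as an example to explain in detail how the data processing apparatus processes data. Taking the PE array shown in FIG. 1 as an example, when each PE can process sub-image data consisting of 4×4 pixels and the image size (in pixels) is 16×16, each PE can evenly process its own 4×4 pixels. The data of the 16 resulting sub-images can then be stored in the second storage unit to await reading by the data scheduler. Moreover, because the data stored in the second storage unit is data that the PEs can process directly, moving it from the second storage unit to the first storage units only requires transferring the data, without splitting or other processing, which reduces the processing load of the data processing apparatus during data transmission and improves data transmission efficiency. In addition, because the data stored in the second storage unit can be used directly as the to-be-processed data of the first storage units, it is also easier for the first storage units, and for the PEs connected to them, to read the to-be-processed data.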
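The division of a 16×16 image into sixteen 4×4 sub-images can be sketched in a few lines. This is an illustrative example using plain nested lists; the row-major tile order and the name `split_into_tiles` are assumptions, not details given in the patent.

```python
def split_into_tiles(image, tile_size):
    """Split a 2D list into row-major tile_size x tile_size sub-images."""
    height, width = len(image), len(image[0])
    if height % tile_size or width % tile_size:
        raise ValueError("image dimensions must be multiples of tile_size")
    tiles = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            # Slice tile_size rows, then tile_size columns from each row.
            tiles.append([row[left:left + tile_size] for row in image[top:top + tile_size]])
    return tiles


# A 16x16 test image whose pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]
tiles = split_into_tiles(image, 4)  # tiles[k] is the sub-image intended for PEk

if __name__ == "__main__":
    print(len(tiles), tiles[0])
```

Storing the tiles in this pre-split form is what lets the data scheduler copy them to the first storage units without further processing.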
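Specifically, the data scheduler reads the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores that data in the corresponding first storage unit; the to-be-processed data corresponding to each first storage unit includes the data that the PEs connected to that first storage unit need to read.
Once a first storage unit holds the data that its connected PEs need to read, the PEs can wait for a control signal from the control unit and, after receiving the first control signal, read the corresponding data from their first storage unit for processing. For more complex image processing algorithms, such as convolution of an image, processing includes several steps such as weighted summation, so several pieces of intermediate data may arise. These can, for example, be temporarily stored in the internal memory of each PE and fetched directly in the next processing step, until all data processing tasks for the original image are completed.
Alternatively, the intermediate data may be transferred to the first storage unit; however, because the intermediate data is not the final result data and requires further processing, the intermediate data in the first storage unit need not be output to the second storage unit.
Specifically, the control unit may generate a second control signal based on the data processing instruction and transmit it to the PE; in response to receiving the second control signal, the PE writes the data it has generated into the first storage unit connected to it.
The second control signal is similar to the first control signal: it includes the data address that the receiving PE accesses when performing write access to the first storage unit, and it controls the receiving PE to write data to the corresponding first storage unit, so that the first storage unit receives the data written by the PE and holds it for output to the second storage unit, so as to obtain the processing result of the original image.
After the PEs have completed all processing steps on the data of the original image, the result data for output is available. The control unit can then generate a fourth control signal and transmit it to the data scheduler; based on the fourth control signal, the data scheduler reads the result data from the plurality of first storage units and stores it in the second storage unit. The result data includes the data generated by the PEs connected to a first storage unit and stored in that first storage unit.
Specifically, the fourth control signal may carry a second data storage address, which indicates where in the second storage unit the data scheduler stores the result data. Alternatively, the fourth control signal may not carry the storage address of the second data.
For example, the data scheduler may read from first storage unit 0 the result data generated by PE0, PE1, PE2, and PE3 respectively, that is, the result data held in the four data storage spaces s0, s1, s2, and s3 of first storage unit 0, and then store the result data in the second storage unit to obtain the processing result of the original image.
In one possible implementation, the control unit may also control the sequential splicing of the multiple pieces of result data output from the second storage unit, so that the result data obtained from the sub-images into which the original image was divided is restored to the result data corresponding to the original image.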
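Splicing the per-sub-image results back into one result can be sketched as the inverse of tiling. The example assumes the results are equally sized 2D tiles held in row-major order, which the patent does not specify; `stitch_tiles` is a hypothetical helper.

```python
def stitch_tiles(tiles, tiles_per_row):
    """Reassemble row-major 2D tiles of equal size into one 2D list."""
    if len(tiles) % tiles_per_row:
        raise ValueError("tile count must be a multiple of tiles_per_row")
    image = []
    for start in range(0, len(tiles), tiles_per_row):
        band = tiles[start:start + tiles_per_row]  # tiles sharing the same rows
        for row_idx in range(len(band[0])):
            stitched_row = []
            for tile in band:
                stitched_row.extend(tile[row_idx])
            image.append(stitched_row)
    return image


if __name__ == "__main__":
    demo = [[[1, 2], [3, 4]], [[5, 6], [7, 8]],
            [[9, 10], [11, 12]], [[13, 14], [15, 16]]]
    for row in stitch_tiles(demo, 2):
        print(row)
```

The splicing order has to match the order in which the original image was divided, otherwise the restored result is scrambled.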
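An embodiment of the present disclosure also provides a specific example of performing convolution processing on an original image A with the data processing apparatus.
FIG. 5 is a schematic diagram of the data processing apparatus during data processing. As shown in FIG. 5, there are 4 memory units (the first storage units), denoted PE_RAM0 to PE_RAM3, and the PE array includes 16 PEs, denoted PE0 to PE15.
PE0 to PE3, PE4 to PE7, PE8 to PE11, and PE12 to PE15 each form a PE group; the groups are denoted G0, G1, G2, and G3 respectively.
After the PE sub-arrays are determined, PE_RAM0 is taken as the first storage unit corresponding to G0, PE_RAM1 as the first storage unit corresponding to G1, PE_RAM2 as the first storage unit corresponding to G2, and PE_RAM3 as the first storage unit corresponding to G3.
When the data processing apparatus is used to perform a convolution-layer operation, the control unit generates a third control signal C3 based on the data processing instruction and sends it to the data scheduler. The data scheduler performs read access to the second storage unit, which stores the data corresponding to the original image A, and then stores the data needed for the convolution calculation from the second storage unit into the first storage units.
The control unit then sends the first control signal C1 to the PEs, and each working PE in the PE array reads its first data to be processed from the corresponding first storage unit and performs the corresponding calculation.
C1 controls the following operations: in the first clock cycle, PE0, PE4, PE8, and PE12, which correspond to PE_RAM0 to PE_RAM3 respectively, each read their own first data to be processed; in the second clock cycle, PE1, PE5, PE9, and PE13 each read their own first data to be processed; in the third clock cycle, PE2, PE6, PE10, and PE14 do the same; and in the fourth clock cycle, PE3, PE7, PE11, and PE15 do the same.
PE0 to PE15 then each process their own first data to be processed, for example by performing a convolution operation on the first data, to obtain second data.
Here, the second data is the result data.
After the PEs in the PE array process the first data to obtain the second data, the control unit sends the second control signal C2 to the PEs, and the second data in each PE is written into the first storage unit corresponding to that PE. The control unit then sends a fourth control signal C4 to the data scheduler, so that the data scheduler reads the result data out of the first storage units and stores it in the second storage unit.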
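The C3, C1, compute, C2, C4 sequence of this example can be traced with a small software simulation. It is a sketch under stated assumptions, not the apparatus itself: the `process` callback stands in for the convolution performed by each PE, storage is modeled with Python lists, and `run_pipeline` is a hypothetical name.

```python
def run_pipeline(second_storage, process, pes_per_group=4):
    """Simulate one pass: C3 load, C1 read, compute, C2 write back, C4 collect."""
    if len(second_storage) % pes_per_group:
        raise ValueError("data count must be a multiple of pes_per_group")
    num_units = len(second_storage) // pes_per_group

    # C3: the data scheduler copies the pending data into each PE_RAM.
    pe_ram = [
        list(second_storage[u * pes_per_group:(u + 1) * pes_per_group])
        for u in range(num_units)
    ]

    # C1: in clock cycle t, the t-th PE of every group reads its own slot.
    registers = {}
    for cycle in range(pes_per_group):
        for unit in range(num_units):
            pe_id = unit * pes_per_group + cycle
            registers[pe_id] = pe_ram[unit][cycle]

    # Compute: every PE turns its first data into second data.
    results = {pe_id: process(data) for pe_id, data in registers.items()}

    # C2: every PE writes its second data back into its slot of its PE_RAM.
    for pe_id, value in results.items():
        unit, slot = divmod(pe_id, pes_per_group)
        pe_ram[unit][slot] = value

    # C4: the data scheduler gathers the result data into the second storage unit.
    return [value for unit in pe_ram for value in unit]


if __name__ == "__main__":
    print(run_pipeline(list(range(16)), lambda x: x * 2))
```

The simulation processes all PEs of a cycle in a loop, whereas in the apparatus the accesses within one clock cycle happen in parallel across the storage units.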
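Those skilled in the art will understand that, in the above method of the specific implementations, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure also provide a data processing method corresponding to the data processing apparatus. Because the principle by which the method solves the problem is similar to that of the data processing apparatus described above, the implementation of the method may refer to the implementation of the apparatus, and repeated details are omitted.
As shown in FIG. 6, an embodiment of the present disclosure provides a data processing method applied to a data processing apparatus. The data processing method includes:
S601: the PE performs read/write access to the connected first storage unit;
S602: the plurality of first storage units store data transmitted by the connected PEs during read/write access.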
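In an optional implementation, the plurality of first storage units are respectively connected to different PE groups in the PE array.
In an optional implementation, each first storage unit is connected to the PEs in one PE group; different PEs belong to different PE groups.
In an optional implementation, the one PE group includes a plurality of PEs in the PE array that are physically connected, and these PEs are located in the same row, the same half row, or the same block of the hardware layout.
In an optional implementation, the PE performing read/write access to the connected first storage unit includes: in a first processing cycle, the PE performs read access to the connected first storage unit and obtains the first data corresponding to the PE; and/or, in a second processing cycle, the PE performs write access to the connected first storage unit and stores the second data generated by the PE in the connected first storage unit.
In an optional implementation, the PE performing read/write access to the connected first storage unit includes: different PEs connected to the same first storage unit perform read/write access to that first storage unit in different clock cycles of the same processing cycle; and/or, in the PE groups connected to different first storage units, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle; one processing cycle includes at least one clock cycle.
In an optional implementation, at least one PE in each of at least two different PE groups performing read/write access to its connected first storage unit in the same clock cycle includes: in the PE groups connected to different first storage units, the PEs at the same relative position perform read/write access to their connected first storage units in the same clock cycle.
In an optional implementation, the data processing apparatus further includes a control unit, and the data processing method further includes: the control unit generates a first control signal based on a data processing instruction and transmits the first control signal to the PE; in response to the first control signal, the PE reads the first data to be processed by the PE from the first storage unit connected to the PE.
In an optional implementation, the method further includes: the control unit generates a second control signal based on the data processing instruction and transmits the second control signal to the PE; in response to the second control signal, the PE writes the second data generated by the PE into the first storage unit connected to the PE.
In an optional implementation, the data processing apparatus further includes a data scheduler, and the data processing method further includes: the control unit generates a third control signal based on the data processing instruction and transmits the third control signal to the data scheduler; the data scheduler performs write access to the first storage unit based on the third control signal.
In an optional implementation, the data processing apparatus further includes a second storage unit; the data scheduler reads the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, stores the to-be-processed data in the corresponding first storage unit; the to-be-processed data includes the first data that the PE needs to read from the first storage unit connected to it.
In an optional implementation, the method further includes: the control unit generates a fourth control signal based on the data processing instruction and transmits the fourth control signal to the data scheduler; the data scheduler performs read access to the first storage unit based on the fourth control signal.
In an optional implementation, the data scheduler performing read access to the first storage unit based on the fourth control signal includes: the data scheduler, based on the fourth control signal, reads result data from the plurality of first storage units and stores the result data in the second storage unit; the result data includes the second data generated by the PE and stored in the first storage unit connected to it.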
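An embodiment of the present disclosure further provides a computer device including an instruction memory and the data processing apparatus provided by the embodiments of the present disclosure.
The data processing apparatus provided by the embodiments of the present disclosure may include a chip, an AI chip, and the like. The computer device provided by the embodiments of the present disclosure may include a smart terminal such as a mobile phone, or may be another device, server, or the like that can be used for data processing, which is not limited here.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when run by a processor, the computer program executes the steps of the data processing method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product that carries program code; the instructions included in the program code can be used to execute the steps of the data processing method described in the above method embodiments. For details, refer to the above method embodiments, which are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).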
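Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here. In the several embodiments provided by the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are only specific implementations of the present disclosure, intended to illustrate its technical solutions rather than limit them, and the scope of protection of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the scope of protection of the present disclosure. Therefore, the scope of protection of the present disclosure shall be subject to the scope of protection of the claims.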

Claims (23)

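  1. A data processing apparatus, comprising a plurality of first storage units and a computing unit; the computing unit comprises a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array;
    the PE is configured to perform read/write access to the connected first storage unit;
    the plurality of first storage units are configured to store data transmitted by the connected PEs during read/write access.
  2. The data processing apparatus according to claim 1, wherein the PEs in the PE array are divided into a plurality of PE groups, and the plurality of first storage units are respectively connected to different PE groups in the PE array.
  3. The data processing apparatus according to claim 1 or 2, wherein each first storage unit is connected to the PEs in one PE group.
  4. The data processing apparatus according to claim 3, wherein one PE group among the plurality of PE groups comprises a plurality of PEs in the PE array that are physically connected, and the plurality of PEs are located in the same row, the same half row, or the same block of the hardware layout.
  5. The data processing apparatus according to claim 4, wherein the PE is configured for at least one of the following:
    in a first processing cycle, performing read access to the connected first storage unit to obtain first data corresponding to the PE;
    in a second processing cycle, performing write access to the connected first storage unit to store second data generated by the PE in the connected first storage unit.
  6. The data processing apparatus according to claim 5, wherein the PEs are configured for at least one of the following:
    different PEs in each PE group perform read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle;
    in at least two different PE groups, at least one PE in each group performs read/write access to its connected first storage unit in the same clock cycle;
    wherein one processing cycle comprises at least one clock cycle.
  7. The data processing apparatus according to claim 6, wherein, in at least two different PE groups, the PEs at the same relative position perform read/write access to their connected first storage units in the same clock cycle.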
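  8. The data processing apparatus according to claim 5, further comprising a control unit;
    the control unit is configured to generate a first control signal based on a data processing instruction and transmit the first control signal to the PE;
    the PE is configured to read, in response to the first control signal, the first data to be processed by the PE from the first storage unit connected to the PE.
  9. The data processing apparatus according to claim 8, wherein
    the control unit is further configured to generate a second control signal based on the data processing instruction and transmit the second control signal to the PE;
    the PE is configured to write, in response to the second control signal, the second data generated by the PE into the first storage unit connected to the PE.
  10. The data processing apparatus according to claim 8 or 9, further comprising a data scheduler;
    the control unit is further configured to generate a third control signal based on the data processing instruction and transmit the third control signal to the data scheduler;
    the data scheduler is configured to perform write access to the first storage unit based on the third control signal.
  11. The data processing apparatus according to claim 10, further comprising a second storage unit;
    the data scheduler is configured to read the to-be-processed data corresponding to each first storage unit from the second storage unit and, based on the first data storage address carried in the third control signal, store the to-be-processed data in the corresponding first storage unit;
    wherein the to-be-processed data comprises the first data that the PE needs to read from the first storage unit connected to it.
  12. The data processing apparatus according to claim 10 or 11, wherein
    the control unit is further configured to generate a fourth control signal based on the data processing instruction and transmit the fourth control signal to the data scheduler;
    the data scheduler is further configured to read, based on the fourth control signal, result data from the plurality of first storage units and store the result data in the second storage unit;
    wherein the result data comprises the second data generated by the PE and stored in the first storage unit connected to it.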
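  13. A data processing method, applied to a data processing apparatus, the data processing apparatus comprising a plurality of first storage units and a computing unit; the computing unit comprises a processing engine (PE) array; the plurality of first storage units are respectively connected to PEs in the PE array; the data processing method comprises:
    the PE performing read/write access to the first storage unit connected thereto;
    the plurality of first storage units storing data transmitted by the connected PEs during read/write access.
  14. The data processing method according to claim 13, wherein the PE performing read/write access to the connected first storage unit comprises at least one of the following:
    the PE performing, in a first processing cycle, read access to the connected first storage unit to obtain first data corresponding to the PE;
    the PE performing, in a second processing cycle, write access to the connected first storage unit to store second data generated by the PE in the connected first storage unit.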
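  15. The data processing method according to claim 13 or 14, wherein the PEs in the PE array are divided into a plurality of PE groups, the plurality of first storage units are configured to be respectively connected to different PE groups in the PE array, and each first storage unit is connected to the PEs in the one PE group to which it is connected; the PE performing read/write access to the connected first storage unit comprises at least one of the following:
    different PEs in each PE group performing read/write access to the first storage unit connected to that PE group in different clock cycles of the same processing cycle;
    in each of at least two different PE groups, at least one PE performing read/write access to its connected first storage unit in the same clock cycle;
    wherein one processing cycle comprises at least one clock cycle.
  16. The data processing method according to claim 15, wherein, in each of the at least two different PE groups, at least one PE performing read/write access to its connected first storage unit in the same clock cycle comprises:
    in the at least two different PE groups, PEs having the same relative position performing read/write access to their connected first storage units in the same clock cycle.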
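  17. The data processing method according to claim 14, wherein the data processing apparatus further comprises a control unit; the data processing method further comprises:
    the control unit generating a first control signal based on a data processing instruction and transmitting the first control signal to the PE;
    the PE reading, in response to the first control signal, the first data to be processed by the PE from the first storage unit connected to the PE.
  18. The data processing method according to claim 17, further comprising:
    the control unit generating a second control signal based on the data processing instruction and transmitting the second control signal to the PE;
    the PE writing, in response to the second control signal, the second data generated by the PE into the first storage unit connected to the PE.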
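  19. The data processing method according to claim 17 or 18, wherein the data processing apparatus further comprises a data scheduler; the data processing method further comprises:
    the control unit generating a third control signal based on the data processing instruction and transmitting the third control signal to the data scheduler;
    the data scheduler performing write access to the first storage unit based on the third control signal.
  20. The data processing method according to claim 19, wherein the data processing apparatus further comprises a second storage unit;
    the data scheduler reads, from the second storage unit, data to be processed corresponding to each first storage unit, and stores the data to be processed in the corresponding first storage unit based on a first data storage address carried in the third control signal;
    wherein the data to be processed comprises the first data that the PE needs to read from the first storage unit connected thereto.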
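  21. The data processing method according to claim 19 or 20, further comprising:
    the control unit generating a fourth control signal based on the data processing instruction and transmitting the fourth control signal to the data scheduler;
    the data scheduler reading, based on the fourth control signal, result data from the plurality of first storage units and storing the result data in the second storage unit;
    wherein the result data comprises the second data generated by the PE and stored in the first storage unit connected thereto.
  22. A computer device, comprising an instruction memory and the data processing apparatus according to any one of claims 1 to 12.
  23. A computer-readable storage medium storing a computer program, wherein the computer program, when run by a data processing apparatus, performs the steps of the data processing method according to any one of claims 13 to 21.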
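
The staggered access pattern recited in claims 6, 7, 15 and 16 can be illustrated with a small software simulation. The sketch below is not the claimed hardware and is not taken from the disclosure: the class `FirstStorageUnit`, the functions `build_schedule` and `run`, the address tuples, and the doubling used as a stand-in for the PE computation are all hypothetical names and choices made for illustration. It assumes each PE group shares one first storage unit, that the PE at relative position k in every group is assigned clock cycle k of a processing cycle, and that reads happen in a first processing cycle and writes in a second.

```python
from dataclasses import dataclass, field


@dataclass
class FirstStorageUnit:
    """Hypothetical model of one first storage unit shared by one PE group."""
    cells: dict = field(default_factory=dict)
    accesses_per_clock: dict = field(default_factory=dict)

    def _note_access(self, clock):
        # Count accesses per clock so a conflict (two PEs, one clock) is visible.
        self.accesses_per_clock[clock] = self.accesses_per_clock.get(clock, 0) + 1

    def read(self, clock, addr):
        self._note_access(clock)
        return self.cells[addr]

    def write(self, clock, addr, value):
        self._note_access(clock)
        self.cells[addr] = value


def build_schedule(num_groups, pes_per_group):
    """Return {clock: [(group, pe_position), ...]} for one processing cycle.

    PEs of one group get different clocks; PEs with the same relative
    position in different groups share a clock (assumed policy).
    """
    return {
        clock: [(group, clock) for group in range(num_groups)]
        for clock in range(pes_per_group)
    }


def run(num_groups=2, pes_per_group=4):
    units = [FirstStorageUnit() for _ in range(num_groups)]
    # Stand-in for the data scheduler preloading the first data.
    for group, unit in enumerate(units):
        for pos in range(pes_per_group):
            unit.cells[("in", pos)] = 10 * group + pos

    schedule = build_schedule(num_groups, pes_per_group)
    first_data = {}
    # First processing cycle: every PE reads its first data.
    for clock, slots in schedule.items():
        for group, pos in slots:
            first_data[(group, pos)] = units[group].read(clock, ("in", pos))
    # Second processing cycle: every PE writes back its second data.
    # Doubling is only a placeholder for the real PE computation.
    for clock, slots in schedule.items():
        for group, pos in slots:
            second = 2 * first_data[(group, pos)]
            units[group].write(pes_per_group + clock, ("out", pos), second)
    return units, schedule
```

Calling `run()` with its defaults models two PE groups of four PEs each. Every storage unit then sees at most one access per clock cycle, while both groups proceed in parallel, which is the property the staggered schedule is meant to provide in this simplified model.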
PCT/CN2021/115780 2021-02-26 2021-08-31 Data processing apparatus, method, computer device, and storage medium WO2022179074A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221038.1 2021-02-26
CN202110221038.1A CN112967172A (zh) 2021-02-26 2021-02-26 Data processing apparatus, method, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022179074A1 (zh) 2022-09-01

Family

ID=76275819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/115780 WO2022179074A1 (zh) 2021-02-26 2021-08-31 Data processing apparatus, method, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN112967172A (zh)
WO (1) WO2022179074A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
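CN112967172A (zh) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing apparatus, method, computer device, and storage medium
CN113596472B (zh) * 2021-07-27 2023-12-22 安谋科技(中国)有限公司 Data processing method and apparatus
CN113872752B (zh) * 2021-09-07 2023-10-13 哲库科技(北京)有限公司 Security engine module, security engine apparatus, and communication device
CN116627887A (zh) * 2022-02-14 2023-08-22 华为技术有限公司 Graph data processing method and chip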

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5625836A (en) * 1990-11-13 1997-04-29 International Business Machines Corporation SIMD/MIMD processing memory element (PME)
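CN106502923A (zh) * 2016-09-30 2017-03-15 西安邮电大学 Row-column two-level switching circuit for intra-cluster storage access in an array processor
CN107590085A (zh) * 2017-08-18 2018-01-16 浙江大学 Dynamically reconfigurable array data path with multi-level cache and control method thereof
CN111209249A (zh) * 2020-01-10 2020-05-29 中山大学 Hardware accelerator architecture for the finite-difference time-domain method and implementation method thereof
CN112967172A (zh) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing apparatus, method, computer device, and storage medium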

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2417105B (en) * 2004-08-13 2008-04-09 Clearspeed Technology Plc Processor memory system
JP2012164144A (ja) * 2011-02-07 2012-08-30 Denso Corp Microcomputer
JP5739758B2 (ja) * 2011-07-21 2015-06-24 ルネサスエレクトロニクス株式会社 Memory controller and SIMD processor
CN110892373A (zh) * 2018-07-24 2020-03-17 深圳市大疆创新科技有限公司 Data access method, processor, computer system, and movable device
CN111045727B (zh) * 2018-10-14 2023-09-05 天津大学青岛海洋技术研究院 Processing unit array based on non-volatile in-memory computing and computing method thereof
CN111897579B (zh) * 2020-08-18 2024-01-30 腾讯科技(深圳)有限公司 Image data processing method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN112967172A (zh) 2021-06-15

Similar Documents

Publication Publication Date Title
WO2022179074A1 (zh) Data processing apparatus, method, computer device, and storage medium
US20230222331A1 (en) Deep learning hardware
US11775430B1 (en) Memory access for multiple circuit components
US11500811B2 (en) Apparatuses and methods for map reduce
US20040215677A1 (en) Method for finding global extrema of a set of bytes distributed across an array of parallel processing elements
US10762425B2 (en) Learning affinity via a spatial propagation neural network
US11106261B2 (en) Optimal operating point estimator for hardware operating under a shared power/thermal constraint
US11328169B2 (en) Switchable propagation neural network
US11354570B2 (en) Machine learning network implemented by statically scheduled instructions, with MLA chip
US11315344B2 (en) Reconfigurable 3D convolution engine
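CN111667542B (zh) Decompression techniques for processing compressed data suitable for artificial neural networks
WO2022252568A1 (zh) Method based on a GPGPU reconfigurable architecture, computing system, and apparatus for reconfiguring the architecture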
EP3844610B1 (en) Method and system for performing parallel computation
EP3678037A1 (en) Neural network generator
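WO2023045445A1 (zh) Data processing apparatus, data processing method, and related products
CN112288619A (zh) Techniques for preloading textures when rendering graphics
WO2022179075A1 (zh) Data processing method and apparatus, computer device, and storage medium
CN116775518A (zh) Method and apparatus for efficient access to multidimensional data structures and/or other large data blocks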
Huan et al. A 3d tiled low power accelerator for convolutional neural network
WO2022134873A1 (zh) Data processing apparatus, data processing method, and related products
CN116127685A (zh) Performing simulations using machine learning
US20220036243A1 (en) Apparatus with accelerated machine learning processing
US20210326681A1 (en) Avoiding data routing conflicts in a machine learning accelerator
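CN114692844A (zh) Data processing apparatus, data processing method, and related products
WO2022134872A1 (zh) Data processing apparatus, data processing method, and related products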

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21927501

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21927501

Country of ref document: EP

Kind code of ref document: A1