WO2023236929A1 - Method and device for reading target data in data based on instructions - Google Patents


Info

Publication number
WO2023236929A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
value
nth
step size
dimension
Prior art date
Application number
PCT/CN2023/098497
Other languages
English (en)
French (fr)
Inventor
马绪研
郝勇峥
Original Assignee
上海寒武纪信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海寒武纪信息科技有限公司
Publication of WO2023236929A1


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 — Digital computers in general; Data processing equipment in general
    • G06F15/76 — Architectures of general purpose stored program computers
    • G06F15/78 — Architectures of general purpose stored program computers comprising a single central processing unit
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 — Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • the present invention relates generally to the field of computers. More specifically, the present invention relates to a method of reading target data in data based on instructions and an apparatus thereof.
  • Images are generally stored in a two-dimensional or three-dimensional manner.
  • When image data is stored in memory, there is a fixed stride between two consecutive pixels in the same column.
  • When the processor processes image data, it moves part of the data in advance from general memory, such as DRAM (Dynamic Random Access Memory), to the cache.
  • Cache I/O (input/output) is fast, so retrieving data from the cache for processing improves processing efficiency.
  • A preload instruction is issued by software: it moves data from one memory to another at trigger conditions or times specified by the software. In one case, data is moved from off-chip to on-chip memory.
  • A load instruction is a read-access operation: when the processor needs data to perform a task, the data is taken from on-chip memory and moved to the cache in preparation for operations based on that data.
  • FIG. 1 shows a schematic diagram of how image data is placed in the memory.
  • In this example, the image data is an 8×8 two-dimensional matrix.
  • the thick box in the figure is the local area of interest.
  • the local area is the facial area.
  • When the system wants to perform operations on a local area, as mentioned above, whether it uses preload instructions or load instructions, reads are continuous, so the system reads cache line 1 through cache line 26 to obtain the local area of interest before operating on it.
  • In other words, although the system only needs the data of 8 cache lines, it moves 26 cache lines. Nearly 70% of the I/O is unnecessary, wasting bandwidth.
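The "nearly 70%" figure above can be checked with a short sketch. The counts (26 cache lines moved, 8 needed) come from the Figure 1 example; the helper function itself is illustrative, not from the patent:

```python
# Sketch of the I/O-waste arithmetic from the Figure 1 example.
# The counts (26 lines moved, 8 lines needed) come from the text;
# the helper function name is mine.

def wasted_io_ratio(lines_moved, lines_needed):
    """Fraction of moved cache lines that did not belong to the
    region of interest."""
    return (lines_moved - lines_needed) / lines_moved

ratio = wasted_io_ratio(26, 8)  # (26 - 8) / 26 ≈ 0.69, i.e. "nearly 70%"
```

With 18 of 26 lines wasted, the ratio is about 0.69, matching the text.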
  • the solution of the present invention provides a method and device for reading target data in data based on instructions.
  • the present invention discloses a processor core that is electrically connected to an on-chip memory, and the on-chip memory stores data.
  • the processor core includes a control module and an operation module.
  • The control module is used to: set the first step size according to the first step size value in the instruction; set the first quantity according to the first quantity value in the instruction; and read the target data corresponding to the first step size and the first quantity from the data in the on-chip memory into the cache.
  • The first step size is the reading interval, in units of data fragments, when reading the target data along the first dimension, and the first quantity is the number of reads in units of data fragments.
  • the operation module performs operations based on the target data in the cache.
  • the present invention discloses a computing device including the above-mentioned processor core.
  • the present invention discloses another computing device that is electrically connected to an off-chip memory, and the off-chip memory stores data.
  • the computing device includes an on-chip memory and a processor core.
  • The processor core is used to: set the first step size according to the first step size value in the instruction; set the first quantity according to the first quantity value in the instruction; and read the target data corresponding to the first step size and the first quantity from the off-chip memory into the on-chip memory.
  • The first step size is the reading interval, in units of data fragments, when reading the target data along the first dimension, and the first quantity is the number of reads in units of data fragments.
  • The processor core performs operations based on the target data in the on-chip memory.
  • the present invention discloses an integrated circuit device including the above computing device, and discloses a board card including the above integrated circuit device.
  • The present invention discloses a method for reading target data in data based on an instruction, including: setting the first step size according to the first step size value in the instruction; setting the first quantity according to the first quantity value in the instruction; reading the target data corresponding to the first step size and the first quantity from a first memory to a second memory; and performing an operation based on the target data in the second memory.
  • The first step size is the reading interval, in units of data fragments, when reading the target data along the first dimension, and the first quantity is the number of reads in units of data fragments.
  • The present invention discloses a computer-readable storage medium storing computer program code for the method of reading target data in data based on instructions; when the computer program code is run by a processing device, the above-mentioned method is executed.
  • the present invention discloses a computer program product, including a computer program for reading target data in data based on instructions, and when the computer program is executed by a processor, the steps of the above method are implemented.
  • the present invention discloses a computer device including a memory, a processor and a computer program stored on the memory.
  • the processor executes the computer program to implement the steps of the above method.
  • By setting at least one step size value and at least one quantity value, the present invention determines the step size and quantity for reading data, skipping uninteresting data fragments and fetching only the local data of interest, thereby saving I/O bandwidth.
  • Figure 1 is a schematic diagram showing image data being placed in a memory
  • Figure 2 is a structural diagram showing a board card according to an embodiment of the present invention.
  • FIG. 3 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention.
  • Figure 4 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention.
  • Figure 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention.
  • Figure 6 is a schematic diagram illustrating reading target data from two-dimensional data according to an embodiment of the present invention.
  • Figure 7 is another schematic diagram illustrating reading target data from two-dimensional data according to an embodiment of the present invention.
  • Figure 8 is another schematic diagram illustrating reading target data from two-dimensional data according to an embodiment of the present invention.
  • Figure 9 is another schematic diagram illustrating reading target data from two-dimensional data according to an embodiment of the present invention.
  • Figure 10 is another schematic diagram illustrating reading target data from two-dimensional data according to an embodiment of the present invention.
  • Figure 11 is a flow chart showing another embodiment of the present invention.
  • Figure 12 is a flowchart showing another embodiment of the present invention.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • FIG. 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention.
  • The board 20 includes a chip 201, which is a system-on-chip (SoC) integrating one or more combined processing devices.
  • The combined processing device is an artificial-intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent-processing needs of complex scenarios in fields such as computer vision, speech, natural language processing and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a significant feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage and computing capabilities of the platform.
  • The board 20 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, huge on-chip storage and powerful computing capability.
  • the chip 201 is connected to the external device 203 through the external interface device 202 .
  • the external device 203 is, for example, a server, computer, camera, monitor, mouse, keyboard, network card or wifi interface.
  • the data to be processed can be transferred to the chip 201 from the external device 203 through the external interface device 202 .
  • the calculation results of the chip 201 can be transmitted back to the external device 203 via the external interface device 202 .
  • The external interface device 202 may take different interface forms, such as a PCIe (Peripheral Component Interconnect Express) interface.
  • the board 20 also includes a storage device 204 for storing data, which includes one or more storage units 205 .
  • the memory device 204 performs connection and data transmission with the control device 206 and the chip 201 through a bus.
  • the control device 206 in the board card 20 is configured to control the status of the chip 201 .
  • the control device 206 may include a microcontroller, also known as a Micro Controller Unit (MCU).
  • FIG. 3 is a structural diagram showing the combined processing device in the chip 201 of this embodiment.
  • the combined processing device 30 includes a computing device 301 , an interface device 302 , a processing device 303 and an off-chip memory 304 .
  • The computing device 301 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor performing deep learning or machine learning calculations; it can interact with the processing device 303 through the interface device 302 to work together to complete user-specified operations.
  • the interface device 302 is used to transmit data and control instructions between the computing device 301 and the processing device 303 .
  • the computing device 301 can obtain input data from the processing device 303 via the interface device 302 and write it into an on-chip storage device of the computing device 301 .
  • the computing device 301 can obtain the control instructions from the processing device 303 via the interface device 302 and write them into the control cache on-chip of the computing device 301 .
  • the interface device 302 may also read the data in the storage device of the computing device 301 and transmit it to the processing device 303 .
  • the processing device 303 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 301, and the like.
  • the processing device 303 may be one or more types of a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • Other general-purpose and/or special-purpose processors include, but are not limited to, digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 301 of the present invention can be regarded as having a single-core structure or a homogeneous multi-core structure.
  • the computing device 301 and the processing device 303 are considered together, they are regarded as forming a heterogeneous multi-core structure.
  • The off-chip memory 304 is used to store data to be processed; it is DDR (Double Data Rate) memory, usually 16 GB or larger, and saves data of the computing device 301 and/or the processing device 303.
  • Figure 4 shows a schematic diagram of the internal structure of the computing device 301.
  • the computing device 301 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 301 in the figure adopts a multi-core hierarchical structure design.
  • the computing device 301 serves as a system on a chip and includes multiple clusters. Each cluster also includes multiple processor cores.
  • In other words, the computing device 301 is organized in a system-on-chip / cluster / processor-core hierarchy.
  • the computing device 301 includes an external storage controller 401 , a peripheral communication module 402 , an on-chip interconnection module 403 , a synchronization module 404 and a plurality of clusters 405 .
  • the peripheral communication module 402 is used to receive control signals from the processing device 303 through the interface device 302 and start the computing device 301 to perform tasks.
  • the on-chip interconnection module 403 connects the external storage controller 401, the peripheral communication module 402 and multiple clusters 405 to transmit data and control signals between various modules.
  • the synchronization module 404 is a global synchronization barrier controller (Global Barrier Controller, GBC), used to coordinate the work progress of each cluster and ensure information synchronization.
  • Clusters 405 are the computing cores of the computing device 301. Four clusters are exemplarily shown in the figure; with the development of hardware, the computing device 301 of the present invention may also include 8, 16, 64 or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
  • each cluster 405 includes multiple processor cores (IPU Core) 406 and a storage core (MEM Core) 407.
  • Multiple processor cores 406 are exemplarily shown in the figure, and the present invention does not limit their number. The internal architecture of a processor core is shown in Figure 5: each processor core 406 includes three major modules, a control module 51, an operation module 52 and a storage module 53.
  • The control module 51 is used to coordinate and control the work of the operation module 52 and the storage module 53 to complete deep learning tasks, and includes an instruction fetch unit (Instruction Fetch Unit, IFU) 511 and an instruction decode unit (Instruction Decode Unit, IDU) 512.
  • The instruction fetch unit 511 obtains instructions from the processing device 303, and the instruction decode unit 512 decodes them and sends the decoding results to the operation module 52 and the storage module 53 as control information.
  • the operation module 52 includes a vector operation unit 521 and a matrix operation unit 522.
  • the vector operation unit 521 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 522 is responsible for the core calculations of the deep learning algorithm, namely matrix multiplication and convolution.
  • The storage module 53 is used to store or transport related data, and includes a neuron storage unit (Neuron RAM, NRAM) 531, a weight storage unit (Weight RAM, WRAM) 532, a data cache 533, an input/output direct memory access module (IODMA) 534, and a move direct memory access module (MVDMA) 535.
  • NRAM 531 is used to store the feature map calculated by the processor core 406 and the calculated intermediate results;
  • WRAM 532 is used to store the weights of the deep learning network;
  • the data cache 533 is used to cache data from outside the processor core 406.
  • When the operation module 52 needs to perform operations, it transfers the required data to the NRAM 531 and WRAM 532; the IODMA 534 controls memory access between the data cache 533 and the off-chip memory 304 through the broadcast bus 409; the MVDMA 535 controls memory access between the data cache 533 and the shared memory unit (Static Random-Access Memory, SRAM) 408.
  • The storage core 407 is mainly used for storage and communication, that is, storing data shared between the processor cores 406 and intermediate results, handling communication between the cluster 405 and the off-chip memory 304, communication between clusters 405, communication between processor cores 406, and so on.
  • the storage core 407 has the capability of scalar operations to perform scalar operations.
  • the storage core 407 includes a shared memory unit (SRAM) 408, a broadcast bus 409, a cluster direct memory access module (Cluster Direct Memory Access, CDMA) 420, and a global direct memory access module (Global Direct Memory Access, GDMA) 411.
  • SRAM 408 is an on-chip memory that plays the role of a high-performance data transfer station.
  • Data multiplexed between different processor cores 406 in the same cluster 405 need not be fetched from the off-chip memory 304 by each processor core 406 separately; instead it is transferred between the processor cores 406 through the SRAM 408.
  • The storage core 407 only needs to quickly distribute the multiplexed data from the SRAM 408 to the multiple processor cores 406, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
  • the broadcast bus 409, CDMA 420 and GDMA 411 are respectively used to perform communication between processor cores 406, communication between clusters 405, and data transmission between the cluster 405 and the off-chip memory 304. They will be explained below.
  • the broadcast bus 409 is used to complete high-speed communication between the processor cores 406 in the cluster 405.
  • the broadcast bus 409 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • Unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission
  • multicast is a communication method that transmits a piece of data from the SRAM 408 to several specific processor cores 406
  • broadcast is a communication method that transmits copies of a piece of data from the SRAM 408 to all processor cores 406; it is a special case of multicast.
  • CDMA 420 is used to control memory access of SRAM 408 between different clusters 405 within the same computing device 301.
  • the GDMA 411 cooperates with the external memory controller 401 to control the memory access from the SRAM 408 of the cluster 405 to the off-chip memory 304, or to read data from the off-chip memory 304 to the SRAM 408.
  • the communication between the off-chip memory 304 and the data cache 533 can be achieved through two channels.
  • The first channel directly connects the off-chip memory 304 and the data cache 533 through the IODMA 534; the second channel first transfers data between the off-chip memory 304 and the SRAM 408 through the GDMA 411, and then transfers data between the SRAM 408 and the data cache 533 through the MVDMA 535.
  • Embodiments of the present invention can select a data transmission channel according to their own hardware conditions.
  • the functionality of GDMA 411 and the functionality of IODMA 534 may be integrated into the same component.
  • the present invention treats GDMA 411 and IODMA 534 as different components.
  • the functions of GDMA 411, IODMA 534, CDMA 420, and MVDMA 535 can also be implemented by the same component.
  • this embodiment sends a preload instruction through software settings to move the data from the off-chip memory 304 to the SRAM 408 of the specific cluster 405.
  • The preload instruction of this embodiment is used to drive the computing device 301 to preload the data required for the next operation.
  • the control module 51 moves the data from the SRAM 408 to the data cache 533 for the operation module 52 to perform operations.
  • The load instruction in this embodiment is used to drive the control module 51 to load the data required by the operation module 52 for the current operation.
  • Both the preload instruction and the load instruction in this embodiment include a starting address field, a step size field and a quantity field.
  • The starting address of the target data to be read is obtained from the starting address field, the step size value from the step size field, and the quantity value from the quantity field.
  • The step size value is the reading interval, in units of data fragments, when reading the target data along the first dimension; the quantity value is the number of reads.
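As a sketch, the three fields described above can be modeled as follows. The class and field names are my own; the patent names the fields but does not specify any encoding:

```python
from dataclasses import dataclass

@dataclass
class StridedAccessInstruction:
    """Illustrative model of the field layout described in the text.

    start_addr: where the read begins (the starting address field)
    step:       reading interval in data fragments along the first
                dimension (the step size field)
    count:      number of fragments read (the quantity field)
    """
    start_addr: int
    step: int
    count: int
```

A preload or load instruction in this scheme is then just a triple such as `StridedAccessInstruction(start_addr=0, step=2, count=6)`.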
  • Figure 6 shows a schematic diagram of reading target data from two-dimensional data in this embodiment.
  • In Figure 6, the image data is an 8×8 two-dimensional matrix, stored contiguously along the X direction.
  • Each grid in the figure is a data fragment, that is, the smallest unit of read/write data; in this example it is a 64-byte cache line.
  • If the step size value in the step size field is 0 and the quantity value in the quantity field is 4, the reading interval when reading the target data along the X direction is 0, that is, reading is continuous, and 4 cache lines are read along the X direction.
  • The data set obtained from one step size value and one quantity value is called a sub-data group. Assuming the starting address is the address of cache line 1 in the figure, the sub-data group taken out is cache lines 1 to 4.
  • If the step size value in the step size field is 2 and the quantity value in the quantity field is 6, the reading interval when reading the target data along the X direction is 2, and 6 cache lines are read along the X direction.
  • The sub-data group taken out is cache lines 5 to 10. Since the example in Figure 6 shows only one pair of step size and quantity values, this sub-data group is the target data to be read.
  • FIG. 7 shows another schematic diagram of reading target data from two-dimensional data in this embodiment.
  • If the step size value in the step size field is 7 and the quantity value in the quantity field is 4, the reading interval when reading the target data along the X direction is 7, and 4 cache lines are read along the X direction.
  • Assuming the starting address is the address of cache line 1 in the figure, the sub-data group taken out is cache lines 1 to 4. Since this example also has only one pair of step size and quantity values, the sub-data group is the target data.
  • Since the step size value is the image stride minus 1, the target data happens to be several cache lines read continuously along the second dimension (Y direction), where the image stride is the length of the image in the first dimension (X direction).
  • The above setting reads only the cache lines of interest.
  • If the step size value in the step size field is 6 and the quantity value in the quantity field is 4, the reading interval when reading the target data along the X direction is 6, and 4 cache lines are read along the X direction. Assuming the starting address is the address of cache line 5 in the figure, the sub-data group (target data) taken out is cache lines 5 to 8, which is similar to reading cache lines continuously along a direction of slope 1.
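Under one plausible reading of these examples (my interpretation, not spelled out in the patent): a step size value s means s data fragments are skipped between consecutive reads, so successive fragment indices differ by s + 1. A sketch for the single-level case, using 0-based row-major indices in the 8×8 example:

```python
def read_1d(start, step, count):
    """Fragment indices touched by one (step size, quantity) pair.

    Interpretation (mine): 'step' fragments are skipped between
    consecutive reads, so indices advance by step + 1 each time;
    step == 0 is a contiguous read.
    """
    return [start + i * (step + 1) for i in range(count)]

WIDTH = 8  # row length of the 8x8 example, in cache lines

contiguous = read_1d(0, 0, 4)        # like Figure 6: [0, 1, 2, 3]
column = read_1d(0, WIDTH - 1, 4)    # like Figure 7: [0, 8, 16, 24]
diagonal = read_1d(4, WIDTH - 2, 4)  # step 6: [4, 11, 18, 25]
```

With step = WIDTH − 1 every read lands exactly one row below the previous one, matching the column read of Figure 7; with step = WIDTH − 2 each read shifts one row down and one column over, matching the slope-1 walk described above.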
  • the step value and quantity value in this embodiment may include multiple values.
  • If the preload instruction and the load instruction include two step size fields and two quantity fields, the reading patterns become more diverse.
  • In this case the preload instruction and the load instruction include a starting address field, a first step size field carrying a first step size value, a first quantity field carrying a first quantity value, a second step size field carrying a second step size value, and a second quantity field carrying a second quantity value.
  • The first step size value is the reading interval, in units of data fragments, when reading the target data along the first dimension (X direction); the first quantity value is the number of reads along the first dimension (X direction); and the data set obtained by each execution of the first-step-size/first-quantity operation is a sub-data group.
  • The second step size value is the reading interval, in units of data fragments, along the first dimension after each sub-data group is obtained, after which the next sub-data group is read according to the first step size value and the first quantity value; the second quantity value is the number of sub-data groups read.
  • FIG. 8 shows another schematic diagram of reading target data from two-dimensional data in this embodiment.
  • For example, suppose the first step size value (S1) is 1, the first quantity value (N1) is 3, the second step size value (S2) is 3, the second quantity value (N2) is 4, and the starting address is the address of cache line 1 in the figure.
  • Starting from cache line 1, one cache line is read at every first step size (S1), executed 3 times (N1), to obtain the first sub-data group (cache lines 1 to 3); then, starting from cache line 4, at a second step size (S2) from cache line 3, the second sub-data group, cache lines 4 to 6, is obtained.
  • Since the second quantity value (N2) is 4, the above operation is repeated to obtain 4 sub-data groups, in which the third sub-data group is cache lines 7 to 9 and the fourth is cache lines 10 to 12. These four sub-data groups are the target data.
  • This embodiment also accepts a step size field that is an empty string (Null) or a special character; in that case the second step size value defaults to the image stride minus the first quantity value.
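A minimal sketch of this default. The rule "image stride minus the first quantity value" is from the text; the function name and parameters are my assumptions:

```python
def resolve_step2(field, image_stride, n1):
    """Second step size value, applying the Null/empty default.

    An empty (Null) second step size field defaults to
    image_stride - n1 (my reading of the text), which makes each
    sub-data group start directly below the previous one in a
    row-major layout.
    """
    if field is None or field == "":
        return image_stride - n1
    return int(field)

resolve_step2("", 8, 3)   # 5: the default in the Figure 9 example
resolve_step2("6", 8, 3)  # 6: the explicit value of Figure 10
```

With an 8-wide image and a first quantity of 3, the default of 5 means each 3-fragment group is followed by a 5-fragment skip, landing the next group one full row below.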
  • FIG. 9 shows another schematic diagram of reading target data from two-dimensional data in this embodiment. The two-dimensional data is a photo of the upper body of a portrait, and the bold box is the front of the face; this embodiment wants to extract the local area of the face (the bold box) for a face recognition operation.
  • The starting address field of the instruction is set to the address of cache line 1, the first step size value is set to 0, the first quantity value to 3, the second step size field to an empty string, and the second quantity value to 4.
  • FIG. 10 shows another schematic diagram of reading target data from two-dimensional data in this embodiment. The two-dimensional data is a photo of the upper body of a portrait, and the bold box is the side of the face; this embodiment wants to extract the local area containing the side of the face and straighten it before performing the face recognition operation. The starting address field of the instruction is therefore set to the address of cache line 1, the first step size value is set to 0, the first quantity value to 3, the second step size value to 6, and the second quantity value to 4.
  • This embodiment starts from cache line 1 and reads 3 cache lines continuously without any interval; the first sub-data group is cache lines 1 to 3. After an interval of 6 cache lines, it reads the second sub-data group (cache lines 4 to 6), and so on.
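The two-level reads of Figures 9 and 10 can be sketched as follows. This is my interpretation of the scheme (a step size of s means s fragments are skipped, so within a group reads are spaced s1 + 1 apart and the next group starts s2 + 1 fragments after the last fragment read), using 0-based row-major indices in an 8-wide matrix:

```python
def read_2d(start, s1, n1, s2, n2):
    """Fragment indices for a two-level (step size, quantity) read.

    Interpretation (mine): within a group, reads are spaced s1 + 1
    apart; the next group starts s2 + 1 fragments after the last
    fragment read by the previous group.
    """
    groups, a = [], start
    for _ in range(n2):
        group = [a + i * (s1 + 1) for i in range(n1)]
        groups.append(group)
        a = group[-1] + s2 + 1
    return groups

# Figure 9 (empty second step field -> s2 = 8 - 3 = 5): a 3-wide,
# 4-tall block, one group per row of the 8-wide matrix.
face_front = read_2d(0, 0, 3, 5, 4)
# [[0, 1, 2], [8, 9, 10], [16, 17, 18], [24, 25, 26]]

# Figure 10 (s2 = 6): each group shifts one column right per row,
# tracing the slanted band around the side of the face.
face_side = read_2d(0, 0, 3, 6, 4)
# [[0, 1, 2], [9, 10, 11], [18, 19, 20], [27, 28, 29]]
```

Note that the cache-line numbers in the figures label only the fragments actually read; the indices above are absolute positions in the matrix under this interpretation.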
  • The instruction of this embodiment can also carry N step size fields and N quantity fields, where the Nth step size value is the interval, in units of data fragments, along the first dimension starting from the end of an N−1-dimensional sub-data group, and the Nth quantity value is the number of N−1-dimensional sub-data groups read.
  • In other words, the first step size value and the first quantity value determine a one-dimensional sub-data group along the first dimension.
  • The second step size value and the second quantity value determine a two-dimensional sub-data group composed of several one-dimensional sub-data groups, that is, a matrix sub-data group spanning the first and second dimensions.
  • The third step size value and the third quantity value determine a three-dimensional sub-data group composed of several two-dimensional sub-data groups. This three-dimensional sub-data group is the target data. Those skilled in the art can easily extrapolate to the case of four or more dimensions, so no details are given.
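The N-level generalization can be sketched recursively. This is my interpretation (a step size of s means s fragments are skipped, so indices advance by s + 1; each outer level restarts s + 1 fragments past the last fragment read by the level below), not a definitive encoding from the patent:

```python
def read_nd(start, steps, counts):
    """Flattened fragment indices for an N-level read (illustrative).

    steps[0]/counts[0] are the innermost (first-dimension) pair;
    each outer level k restarts steps[k] + 1 fragments after the
    last fragment read by the level below, counts[k] times.
    """
    if len(steps) == 1:
        return [start + i * (steps[0] + 1) for i in range(counts[0])]
    out, a = [], start
    for _ in range(counts[-1]):
        sub = read_nd(a, steps[:-1], counts[:-1])
        out.extend(sub)
        a = sub[-1] + steps[-1] + 1
    return out
```

With one level, `read_nd(0, [7], [4])` walks a column of an 8-wide matrix; with two levels, `read_nd(0, [0, 5], [3, 4])` reads a 3-wide, 4-tall block, one contiguous row-piece per row.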
  • When executing a preload instruction, the processor core 406 determines the starting data fragment from the starting address field in the instruction, obtains the first step size value from the first step size field and the first quantity value from the first quantity field, and determines a one-dimensional sub-image data group along the first dimension from the first step size value and the first quantity value. If there are no other step size fields and quantity fields in the preload instruction, this one-dimensional sub-image data group is the target image data.
  • If the preload instruction also includes a second step size field and a second quantity field, the second step size value is obtained from the second step size field and the second quantity value from the second quantity field.
  • The second step size value and the second quantity value determine a two-dimensional sub-image data group composed of several one-dimensional sub-image data groups, where the second step size value determines the step size along the first dimension between one-dimensional sub-image data groups and the second quantity value determines the number of reads of the one-dimensional sub-image data group. If there are no other step size fields and quantity fields in the preload instruction, this two-dimensional sub-image data group is the target image data.
  • if the preload instruction also includes an N-th step size field and an N-th quantity field, an N-dimensional sub-image data group is obtained in the aforementioned manner, where the N-th step size value determines the step size along the first dimension between the N-1-dimensional sub-image data groups, and the N-th quantity value determines the number of reads of the N-1-dimensional sub-image data group.
  • this N-dimensional sub-image data group is the target image data.
  • the computing device 301 can completely transfer the target image data from the off-chip memory 304 to the SRAM 408 without reading uninteresting areas of the image.
  • the control module 51 reads the target image data in the N-dimensional image data from the first memory (such as the SRAM 408) to the second memory (such as the data cache 533). It should be noted that the N-dimensional image data here may be the above target image data transferred from the off-chip memory 304 to the SRAM 408.
  • the control module 51 determines the starting data segment according to the starting address field in the load instruction, obtains the first step length value from the first step length field and the first quantity value from the first quantity field, and determines a one-dimensional sub-image data group along the first dimension from the first step length value and the first quantity value. If there are no other step fields and quantity fields in the load instruction, this one-dimensional sub-image data group is the target image data.
  • the second step size value is obtained from the second step size field, and the second quantity value is obtained from the second quantity field.
  • the second step size value and the second quantity value determine a two-dimensional sub-image data group composed of several one-dimensional sub-image data groups, where the second step value determines the step size along the first dimension between the one-dimensional sub-image data groups and the second quantity value determines the number of reads of the one-dimensional sub-image data group. If there are no other step fields and quantity fields in the load instruction, this two-dimensional sub-image data group is the target image data.
  • if the load instruction also includes an N-th step size field and an N-th quantity field, an N-dimensional sub-image data group is obtained in the aforementioned manner, where the N-th step size value determines the step size along the first dimension between the N-1-dimensional sub-image data groups, and the N-th quantity value determines the number of reads of the N-1-dimensional sub-image data group.
  • this N-dimensional sub-image data group is the target image data.
  • control module 51 can completely transfer the target image data from the SRAM 408 to the data cache 533 without reading uninteresting areas of the image.
  • when the first step length value is the image step size minus one, the target data is the first quantity of data fragments read continuously along the second dimension.
  • when the first step length value is the image step size minus two, the target data is the first quantity of data fragments read continuously along the direction with a slope of 1 in the plane formed by the first and second dimensions. It can further be deduced that when the first step length value is the image step size, the target data is the first quantity of data fragments read continuously along the direction with a slope of -1 in the plane formed by the first and second dimensions.
  • the control module 51 transfers the target image data in the N-dimensional image data from the SRAM 408 to the data cache 533 based on the load instruction.
  • when the operation module 52 performs operations, the target data is moved from the data cache 533 to the NRAM 531, and the operation module 52 reads the target data from the NRAM 531 for computation.
  • Another embodiment of the present invention is a method for reading target data in data based on an instruction. More specifically, the instruction of this embodiment only includes a step size field and a quantity field.
  • Figure 11 shows a flow chart of this embodiment.
  • in step 1101, the starting address is determined based on the starting address field in the instruction.
  • in step 1102, the step size is set according to the step size value in the step size field.
  • this embodiment obtains the step value from the step field in the instruction to set the step size.
  • in step 1103, the quantity is set according to the quantity value in the quantity field.
  • this embodiment obtains the quantity value from the quantity field in the instruction to set the read quantity.
  • in step 1104, target data corresponding to the step size and quantity in the data is read from the first memory to the second memory.
  • a one-dimensional sub-data group along the first dimension is determined by the step value and the quantity value.
  • the step value is the reading interval when reading the target data along the first dimension.
  • the quantity value is the number of data fragments read along the first dimension.
  • the resulting sub-data group is the target data to be read.
  • in step 1105, an operation is performed based on the target data in the second memory.
  • the first memory may be an off-chip memory, and the second memory may be an on-chip memory; in another case, the first memory may be an on-chip memory, and the second memory may be a cache.
  • Another embodiment of the present invention is a method for reading target data in data based on instructions. More specifically, the instructions of this embodiment include N step fields and N quantity fields.
  • Figure 12 shows a flow chart of this embodiment.
  • in step 1201, the starting address is determined based on the starting address field in the instruction.
  • in step 1202, N step sizes are set according to the N step size values in the N step size fields. This embodiment obtains the first step value from the first step field, the second step value from the second step field, and the N-th step value from the N-th step field, so as to set the first to N-th step sizes respectively.
  • in step 1203, N quantities are set according to the N quantity values in the N quantity fields. This embodiment obtains the first quantity value from the first quantity field, the second quantity value from the second quantity field, and the N-th quantity value from the N-th quantity field, so as to set the first to N-th quantities respectively.
  • in step 1204, target data corresponding to the N step sizes and N quantities in the data is read from the first memory to the second memory.
  • the one-dimensional sub-image data group along the first dimension is determined by the first step length value and the first quantity value.
  • the second step value is obtained from the second step field, and the second quantity value is obtained from the second quantity field.
  • the second step value and the second quantity value determine a two-dimensional sub-image data group composed of several one-dimensional sub-image data groups, wherein the second step value determines the step size along the first dimension between the one-dimensional sub-image data groups, and the second quantity value determines the number of reads of the one-dimensional sub-image data group.
  • deducing in this manner, the N-dimensional sub-image data group is the target image data.
  • in step 1205, an operation is performed based on the target data in the second memory.
  • the first memory may be an off-chip memory, and the second memory may be an on-chip memory; in another case, the first memory may be an on-chip memory, and the second memory may be a cache.
  • Another embodiment of the present invention is a computer-readable storage medium, on which computer program code for the method of reading target data in data based on an instruction is stored; when the computer program code is run by a processing device, the method shown in Figure 11 or Figure 12 is executed.
  • Another embodiment of the present invention is a computer program product, including a computer program for reading target data in data based on an instruction, wherein when the computer program is executed by a processor, the steps of the method described in Figure 11 or Figure 12 are implemented.
  • Another embodiment of the present invention is a computer device, including a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method described in Figure 11 or Figure 12.
  • the above integrated units can be implemented in the form of software program modules. If implemented in the form of software program modules and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory. Based on this, when the solution of the present invention is embodied in the form of a software product (such as a computer-readable storage medium), the software product can be stored in a memory, and may include a number of instructions to cause a computer device (such as a personal computer, server, or network equipment) to perform some or all of the steps of the method described in the embodiments of the present invention.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, flash disk, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, optical disk, or any other medium that can store program code.
  • the present invention determines the step length and quantity for reading data by setting at least one step value and at least one quantity value, thereby skipping uninteresting data fragments and fetching only the interesting local data, so as to save I/O bandwidth.
  • the electronic equipment or device of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC equipment, Internet of Things terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the means of transportation include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance machines, B-mode ultrasound machines, and/or electrocardiographs.
  • the electronic equipment or device of the present invention can also be applied to the Internet, the Internet of Things, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, medical care, and other fields. Furthermore, the electronic equipment or device of the present invention can also be used in cloud, edge, terminal, and other application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, electronic equipment or devices with high computing power according to the solution of the present invention can be applied to cloud devices (such as cloud servers), while electronic equipment or devices with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that suitable hardware resources of the cloud device can be matched based on the hardware information of the terminal device and/or the edge device.
  • the present invention describes some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Therefore, based on the disclosure or teaching of the present invention, those skilled in the art will understand that certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in the present invention can be regarded as optional embodiments; that is, the actions or modules involved are not necessarily required for the implementation of one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments of the present invention also have different emphases. In view of this, those skilled in the art may refer to the relevant descriptions of other embodiments for the parts not described in detail in a certain embodiment of the present invention.
  • units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units.
  • the aforementioned components or units may be co-located or distributed over multiple network units.
  • some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • multiple units in the embodiment of the present invention may be integrated into one unit or each unit may exist physically separately.
  • the above-mentioned integrated unit can also be implemented in the form of hardware, that is, a specific hardware circuit, which can include digital circuits and/or analog circuits, etc.
  • the physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein can be implemented by appropriate hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any appropriate storage medium (including magnetic storage media or magneto-optical storage media, etc.), and can be, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
  • set the first step length according to the first step length value in the instruction; set the first quantity according to the first quantity value in the instruction; and read the target data corresponding to the first step length and the first quantity in the data from the on-chip memory to the cache; wherein the first step length is the reading interval in units of data fragments when reading the target data along the first dimension, the first quantity is the number of reads in units of data fragments, and the operation module performs operations based on the target data in the cache.
  • Clause A2. The processor core according to Clause A1, wherein the instruction includes a first step length field and a first quantity field, and the control module obtains the first step length value from the first step length field and the first quantity value from the first quantity field.
  • Clause A3. The processor core according to Clause A1, wherein when the first step length value is the image step size minus one, the target data is the first quantity of data fragments read continuously along the second dimension, wherein the image step size is the length of the data along the first dimension.
  • Clause A4. The processor core according to Clause A1, wherein when the first step length value is the image step size minus two, the target data is the first quantity of data fragments read continuously along the direction with a slope of 1 in the plane formed by the first and second dimensions, wherein the image step size is the length of the data along the first dimension.
  • Clause A5. The processor core according to Clause A1, wherein when the first step length value is the image step size, the target data is the first quantity of data fragments read continuously along the direction with a slope of -1 in the plane formed by the first and second dimensions, wherein the image step size is the length of the data along the first dimension.
  • Clause A6. The processor core according to Clause A2, wherein the control module is further used to: set the N-th step size according to the N-th step size value in the instruction; set the N-th quantity according to the N-th quantity value in the instruction; and read the target data corresponding to the first to N-th step sizes and the first to N-th quantities in the data from the on-chip memory to the cache; wherein the N-1-th step size value and the N-1-th quantity value determine the N-1-dimensional sub-data group, the N-th step size is the reading interval in units of data fragments of the N-1-dimensional sub-data group along the first dimension, the N-th quantity value is the number of reads of the N-1-dimensional sub-data group, and N is a positive integer greater than 1.
  • Clause A8. The processor core according to Clause A6, wherein when the first step size value is 0 and the second step size field is an empty string or a special character, the second step size value is the image step size minus the first quantity value, where the image step size is the length of the data along the first dimension.
  • Clause A9. The processor core according to Clause A6, wherein the instruction includes an N-th step size field and an N-th quantity field, and the control module obtains the N-th step size value from the N-th step size field and the N-th quantity value from the N-th quantity field.
  • Clause A10 The processor core of Clause A1, further comprising a storage module, the storage module including the data cache.
  • Clause A11. The processor core according to Clause A10, wherein the storage module further includes a neuron storage unit; when the operation module performs operations, the target data is moved from the data cache to the neuron storage unit, and the operation module reads the target data from the neuron storage unit.
  • Clause A12 The processor core of Clause A1, wherein the instruction is a load instruction.
  • Clause A13 The processor core of Clause A1, wherein the data fragment is a cache line.
  • Clause A14. The processor core according to any one of Clauses A1 to A13, wherein the instruction further includes a starting address field, and the control module obtains a starting address for reading the target data from the starting address field.
  • Clause A15. A computing device, comprising the processor core according to any one of Clauses A1 to A14.
  • Clause A16 An integrated circuit device comprising the computing device according to Clause A15.
  • set the first step length according to the first step length value in the instruction; set the first quantity according to the first quantity value in the instruction; and read the target data corresponding to the first step length and the first quantity in the data from the off-chip memory to the on-chip memory; wherein the first step length is the reading interval in units of data fragments when reading the target data along the first dimension, the first quantity is the number of reads in units of data fragments, and the processor core performs operations based on the target data in the on-chip memory.
  • Clause A19. The computing device of Clause A18, wherein the instruction includes a first step length field and a first quantity field, and the processor core obtains the first step length value from the first step length field and the first quantity value from the first quantity field.
  • Clause A20. The computing device of Clause A18, wherein when the first step length value is the image step size minus one, the target data is the first quantity of data fragments read continuously along the second dimension, wherein the image step size is the length of the data along the first dimension.
  • Clause A21. The computing device of Clause A18, wherein when the first step length value is the image step size minus two, the target data is the first quantity of data fragments read continuously along the direction with a slope of 1 in the plane formed by the first and second dimensions, where the image step size is the length of the data along the first dimension.
  • Clause A22. The computing device of Clause A18, wherein when the first step length value is the image step size, the target data is the first quantity of data fragments read continuously along the direction with a slope of -1 in the plane formed by the first and second dimensions, where the image step size is the length of the data along the first dimension.
  • Clause A23. The computing device of Clause A19, wherein the processor core is further configured to: set the N-th step size according to the N-th step size value in the instruction; set the N-th quantity according to the N-th quantity value in the instruction; and read the target data corresponding to the first to N-th step sizes and the first to N-th quantities in the data from the off-chip memory to the on-chip memory; wherein the N-1-th step size value and the N-1-th quantity value determine the N-1-dimensional sub-data group, the N-th step size is the reading interval in units of data fragments of the N-1-dimensional sub-data group along the first dimension, the N-th quantity value is the number of reads of the N-1-dimensional sub-data group, and N is a positive integer greater than 1.
  • Clause A24. The computing device according to Clause A23, wherein when the following expression is satisfied and the first step length value is 0, the two-dimensional sub-data group is a one-dimensional sub-data group read continuously along the second dimension:
  • the second step value = the image step size - the first quantity value
  • Clause A25. The computing device according to Clause A23, wherein when the first step size value is 0 and the second step size field is an empty string or a special character, the second step size value is the image step size minus the first quantity value, where the image step size is the length of the data along the first dimension.
  • Clause A26. The computing device of Clause A23, wherein the instruction includes an N-th step size field and an N-th quantity field, and the processor core obtains the N-th step size value from the N-th step size field and the N-th quantity value from the N-th quantity field.
  • Clause A28 The computing device of clause A18, wherein the data fragment is a cache line.
  • Clause A29. The computing device of any one of Clauses A18 to A28, wherein the instruction further includes a starting address field, from which the processor core obtains a starting address for reading the target data.
  • Clause A31 A board comprising an integrated circuit device according to Clause A30.
  • Clause A32. A method for reading target data in data based on an instruction, characterized by comprising: setting the first step length according to the first step length value in the instruction; setting the first quantity according to the first quantity value in the instruction; reading the target data corresponding to the first step length and the first quantity in the data from the first memory to the second memory; and performing operations based on the target data in the second memory; wherein the first step length is the reading interval in units of data fragments when reading the target data along the first dimension, and the first quantity is the number of reads in units of data fragments.
  • Clause A33. The method according to Clause A32, further comprising: setting the second to N-th step sizes according to the second to N-th step size values in the instruction; setting the second to N-th quantities according to the second to N-th quantity values in the instruction; and reading the target data corresponding to the second to N-th step sizes and the second to N-th quantities in the data from the first memory to the second memory; wherein the N-1-th step size value and the N-1-th quantity value determine the N-1-dimensional sub-data group, the N-th step size is the reading interval in units of data fragments of the N-1-dimensional sub-data group along the first dimension, the N-th quantity value is the number of reads of the N-1-dimensional sub-data group, and N is a positive integer greater than 1.
  • Clause A34. A computer-readable storage medium having stored thereon computer program code for the method of reading target data in data based on an instruction, wherein when the computer program code is run by a processing device, the method of Clause A32 or A33 is executed.
  • Clause A35 A computer program product comprising a computer program for reading target data in data based on instructions, characterized in that the computer program, when executed by a processor, implements the steps of the method described in Clause A32 or A33.
  • Clause A36 A computer device, comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method described in Clause A32 or A33.
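The diagonal cases in Clauses A4/A5 (and A21/A22) can be illustrated with a small sketch. This is not part of the claims; it assumes one list element stands for one data fragment and that the step value counts the fragments skipped between two consecutive reads (an interpretation consistent with Clause A3, where a step of "image step size minus one" reads down a column):

```python
# Illustrative sketch of the slope-1 and slope -1 reads from clauses A4/A5.
# Assumption: the next fragment index is current + step + 1, i.e. `step`
# fragments are skipped between consecutive reads.

def read_target(data, start, step, count):
    """Read `count` fragments starting at `start`, skipping `step`
    fragments between consecutive reads."""
    return [data[start + i * (step + 1)] for i in range(count)]

image = list(range(64))   # an 8x8 image, one fragment per element
stride = 8                # image step size: length along the first dimension

# step = image step size - 2 -> one row down, one column back each read (slope 1)
down_left = read_target(image, start=6, step=stride - 2, count=4)
# step = image step size     -> one row down, one column forward (slope -1)
down_right = read_target(image, start=1, step=stride, count=4)
print(down_left)    # fragments 6, 13, 20, 27
print(down_right)   # fragments 1, 10, 19, 28
```

In row/column terms, `down_left` visits (0,6), (1,5), (2,4), (3,3) and `down_right` visits (0,1), (1,2), (2,3), (3,4), matching the two diagonal directions described in the clauses.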


Abstract

A method of reading target data in data based on an instruction, and a computing device (301). The computing device (301) is included in an integrated circuit device, and the integrated circuit device includes a universal interconnection interface and other processing devices. The computing device (301) interacts with the other processing devices to jointly complete a computing operation specified by a user. The integrated circuit device further includes a storage device, which is connected to the computing device (301) and the other processing devices respectively, and is used for data storage of the computing device (301) and the other processing devices.

Description

Method for reading target data in data based on an instruction, and device therefor
Cross-reference to related application
This application claims priority to the Chinese patent application No. 202210635541.6, filed on June 6, 2022, and entitled "Method for reading target data in data based on an instruction, and device therefor".
Technical field
The present invention relates generally to the field of computers. More specifically, the present invention relates to a method of reading target data in data based on an instruction, and a device therefor.
Background
In the field of image recognition or image classification, local regions of an image usually have strong correlation; that is, neighboring small patches of pixels are strongly correlated, while pixel regions farther apart are weakly correlated.
Images are generally stored in a two-dimensional or three-dimensional manner. When image data is stored in memory, there is a fixed stride between two consecutive pixels in the same column of the image data. When a processor processes image data, part of the data is moved in advance from general memory, such as DRAM (Dynamic Random Access Memory), to a cache. The I/O (Input/Output) of the cache is fast, and fetching data from the cache for processing can improve processing efficiency. Reading data by the processor involves two kinds of instructions:
1. Prefetch instruction: issued by software. At a trigger condition or time specified in software, data is moved from one memory to another. In one case, data is moved from off-chip to on-chip.
2. Load instruction: a read access operation. When the processor needs to execute a task based on data, the data is fetched from on-chip memory and moved to the cache in preparation for computing on that data.
When data is read with either of the above two instructions, a contiguous data fragment (the minimum transfer unit) is moved each time; that is, data transfer is necessarily contiguous. Figure 1 shows a schematic diagram of image data laid out in memory. The image data is an 8×8 two-dimensional matrix, and the bold box in the figure marks the local region of interest; for example, the image data is a half-length portrait and the local region is the face region. When the system wants to operate on the local region, as mentioned above, both the prefetch instruction and the load instruction read contiguously, so the system must read cache line 1 through cache line 26 to obtain the local region of interest for computation. In other words, although the system only needs the data of 8 cache lines, it moves 26 cache lines; nearly 70% of the I/O is pointless, wasting bandwidth.
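The bandwidth waste in the example above can be checked with a small sketch. The exact cache-line numbers of the region of interest are hypothetical; only the 26-line and 8-line totals come from the text:

```python
# Illustrative sketch (assumed layout): with contiguous-only transfers,
# reaching a region of interest scattered over cache lines 1..26 forces
# all 26 lines to be moved even though only 8 of them are wanted.

needed = {4, 5, 11, 12, 18, 19, 25, 26}   # hypothetical lines of the face region
contiguous = set(range(1, 27))            # lines 1..26 moved by a contiguous read
wasted = len(contiguous - needed) / len(contiguous)
print(f"moved {len(contiguous)} lines, needed {len(needed)}, "
      f"{wasted:.0%} of the I/O wasted")
```

With these numbers the sketch reports 26 lines moved for 8 lines needed, i.e. roughly 70% of the I/O wasted, matching the figure given in the text.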
Summary of the invention
In order to at least partially solve the technical problems mentioned in the background, the solution of the present invention provides a method of reading target data in data based on an instruction, and a device therefor.
In one aspect, the present invention discloses a processor core electrically connected to an on-chip memory, the on-chip memory storing data. The processor core includes a control module and an operation module. The control module is used to: set a first step length according to a first step length value in an instruction; set a first quantity according to a first quantity value in the instruction; and read, from the on-chip memory, the target data in the data corresponding to the first step length and the first quantity into a cache. The first step length is the reading interval, in units of data fragments, when reading the target data along a first dimension, and the first quantity is the number of reads in units of data fragments. The operation module performs operations based on the target data in the cache.
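A minimal sketch of the first-step/first-quantity read described above follows. It is illustrative only: it assumes one list element stands for one data fragment and that the step value counts the fragments skipped between two consecutive reads, which is consistent with the clause stating that a step of "image step size minus one" reads down a column:

```python
# Minimal sketch of the control module's strided read (assumptions above).

def read_target(data, start, step, count):
    """Return `count` fragments, starting at `start`, with `step`
    fragments skipped between consecutive fragments."""
    return [data[start + i * (step + 1)] for i in range(count)]

image = list(range(64))     # an 8x8 image, one fragment per element
row_stride = 8              # image step size along the first dimension

# step = image step size - 1 -> fragments read down a column
column = read_target(image, start=3, step=row_stride - 1, count=4)
print(column)               # [3, 11, 19, 27]
```

Only the four wanted fragments are touched; the fragments in between are skipped rather than transferred.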
In another aspect, the present invention discloses a computing device including the above processor core.
In another aspect, the present invention discloses another computing device electrically connected to an off-chip memory, the off-chip memory storing data. The computing device includes an on-chip memory and a processor core. The processor core is used to: set a first step length according to a first step length value in an instruction; set a first quantity according to a first quantity value in the instruction; and read, from the off-chip memory, the target data in the data corresponding to the first step length and the first quantity into the on-chip memory. The first step length is the reading interval, in units of data fragments, when reading the target data along a first dimension, and the first quantity is the number of reads in units of data fragments. The processor core performs operations based on the target data in the on-chip memory.
In another aspect, the present invention discloses an integrated circuit device including the above computing device, and discloses a board card including the above integrated circuit device.
In another aspect, the present invention discloses a method of reading target data in data based on an instruction, including: setting a first step length according to a first step length value in the instruction; setting a first quantity according to a first quantity value in the instruction; reading, from a first memory, the target data in the data corresponding to the first step length and the first quantity into a second memory; and performing operations based on the target data in the second memory. The first step length is the reading interval, in units of data fragments, when reading the target data along a first dimension, and the first quantity is the number of reads in units of data fragments.
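The two-level form of this method can be sketched as follows. The sketch is illustrative, under the same assumption that a step value counts the fragments skipped between reads; the choice "second step = image step size minus first quantity" for row-by-row reads follows Clause A24:

```python
# Hedged sketch: a first step/quantity builds a 1-D sub-group, and a second
# step/quantity repeats that group to form a 2-D sub-group (the region of
# interest), without touching the rest of the data.

def read_1d(data, start, step, count):
    """One 1-D sub-group: `count` fragments, `step` fragments skipped between."""
    return [data[start + i * (step + 1)] for i in range(count)]

def read_2d(data, start, step1, count1, step2, count2):
    """`count2` one-dimensional groups; `step2` fragments are skipped between
    the end of one group and the start of the next."""
    out = []
    for _ in range(count2):
        out.extend(read_1d(data, start, step1, count1))
        start += count1 * (step1 + 1) + step2
    return out

image = list(range(64))   # 8x8 image, row stride (image step size) = 8

# A 4x2 region of interest: step1 = 0 reads contiguously within a row,
# step2 = 8 - 2 (image step size minus first quantity) jumps to the next row.
roi = read_2d(image, start=19, step1=0, count1=2, step2=6, count2=4)
print(roi)                # [19, 20, 27, 28, 35, 36, 43, 44]
```

Only the 8 fragments of the region are fetched, instead of the 26 contiguous fragments a plain load would move in the background example.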
In another aspect, the present invention discloses a computer-readable storage medium on which computer program code for the method of reading target data in data based on an instruction is stored; when the computer program code is run by a processing device, the above method is executed.
In another aspect, the present invention discloses a computer program product including a computer program for reading target data in data based on an instruction, where the computer program, when executed by a processor, implements the steps of the above method.
In another aspect, the present invention discloses a computer device including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the above method.
By setting at least one step value and at least one quantity value, the present invention determines the step length and quantity for reading data, thereby skipping uninteresting data fragments and fetching only the interesting local data, so as to save I/O bandwidth.
Brief description of the drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present invention are shown by way of example and not limitation, and the same or corresponding reference numerals denote the same or corresponding parts, in which:
Figure 1 is a schematic diagram showing image data laid out in memory;
Figure 2 is a structural diagram showing a board card according to an embodiment of the present invention;
Figure 3 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention;
Figure 4 is a schematic diagram showing the internal structure of a computing device according to an embodiment of the present invention;
Figure 5 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention;
Figure 6 is a schematic diagram showing reading target data from two-dimensional data according to an embodiment of the present invention;
Figure 7 is another schematic diagram showing reading target data from two-dimensional data according to an embodiment of the present invention;
Figure 8 is another schematic diagram showing reading target data from two-dimensional data according to an embodiment of the present invention;
Figure 9 is another schematic diagram showing reading target data from two-dimensional data according to an embodiment of the present invention;
Figure 10 is another schematic diagram showing reading target data from two-dimensional data according to an embodiment of the present invention;
Figure 11 is a flowchart showing another embodiment of the present invention;
Figure 12 is a flowchart showing another embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present invention are used to distinguish different objects rather than to describe a particular order. The terms "include" and "comprise" used in the specification and claims of the present invention indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit the present invention. As used in the specification and claims of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present invention refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted contextually as "when", "once", "in response to determining", or "in response to detecting".
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Figure 2 shows a schematic structural diagram of a board card 20 according to an embodiment of the present invention. As shown in Figure 2, the board card 20 includes a chip 201, which is a system-on-chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms and to meet the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely applied in cloud intelligence; a notable characteristic of cloud intelligence applications is the large amount of input data, which places high demands on the storage and computing capabilities of the platform. The board card 20 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage, and powerful computing capability.
The chip 201 is connected to an external device 203 through an external interface device 202. The external device 203 is, for example, a server, computer, camera, display, mouse, keyboard, network card, or WiFi interface. Data to be processed can be transferred from the external device 203 to the chip 201 through the external interface device 202. Computation results of the chip 201 can be transmitted back to the external device 203 via the external interface device 202. Depending on the application scenario, the external interface device 202 may take different interface forms, such as a PCIe (Peripheral Component Interconnect express) interface.
The board card 20 also includes a storage device 204 for storing data, which includes one or more storage units 205. The storage device 204 is connected to and exchanges data with a control device 206 and the chip 201 through a bus. The control device 206 on the board card 20 is configured to regulate the state of the chip 201. To this end, in one application scenario, the control device 206 may include a microcontroller unit (MCU).
Figure 3 is a structural diagram of the combined processing device in the chip 201 of this embodiment. As shown in Figure 3, the combined processing device 30 includes a computing device 301, an interface device 302, a processing device 303, and an off-chip memory 304.
The computing device 301 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep learning or machine learning computation. It can interact with the processing device 303 through the interface device 302 to jointly complete the user-specified operations.
The interface device 302 is used to transfer data and control instructions between the computing device 301 and the processing device 303. For example, the computing device 301 may obtain input data from the processing device 303 via the interface device 302 and write it into an on-chip storage device of the computing device 301. Further, the computing device 301 may obtain control instructions from the processing device 303 via the interface device 302 and write them into an on-chip control cache of the computing device 301. Alternatively or optionally, the interface device 302 may also read data from the storage device of the computing device 301 and transmit it to the processing device 303.
As a general-purpose processing device, the processing device 303 performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 301. Depending on the implementation, the processing device 303 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components, and their number may be determined according to actual needs. As mentioned above, the computing device 301 of the present invention, taken alone, may be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 301 and the processing device 303 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The off-chip memory 304 is used to store data to be processed. It is a DDR (Double Data Rate) memory, usually 16 GB or larger, used to hold data of the computing device 301 and/or the processing device 303.
Figure 4 shows a schematic diagram of the internal structure of the computing device 301. The computing device 301 processes input data such as computer vision, speech, natural language, and data mining. The computing device 301 in the figure adopts a multi-core hierarchical design. As a system on chip, it includes multiple clusters, each of which in turn includes multiple processor cores. In other words, the computing device 301 is organized at the levels of system-on-chip, cluster, and processor core.
At the system-on-chip level, as shown in Figure 4, the computing device 301 includes an external storage controller 401, a peripheral communication module 402, an on-chip interconnect module 403, a synchronization module 404, and multiple clusters 405.
There may be more than one external storage controller 401, two of which are shown in the figure as an example. They respond to access requests issued by the processor cores to access external storage devices, such as the off-chip memory 304 in Figure 3, so as to read data from or write data to off-chip. The peripheral communication module 402 receives control signals from the processing device 303 through the interface device 302 and starts the computing device 301 to execute tasks. The on-chip interconnect module 403 connects the external storage controller 401, the peripheral communication module 402, and the multiple clusters 405, and transfers data and control signals between the modules. The synchronization module 404 is a global barrier controller (GBC) that coordinates the work progress of the clusters and ensures information synchronization. The multiple clusters 405 are the computing cores of the computing device 301; four are shown in the figure as an example. With the development of hardware, the computing device 301 of the present invention may also include 8, 16, 64, or even more clusters 405. The clusters 405 are used to efficiently execute deep learning algorithms.
At the cluster level, as shown in Figure 4, each cluster 405 includes multiple processor cores (IPU Cores) 406 and one memory core (MEM Core) 407.
处理器核406在图中示例性地展示4个,本发明不限制处理器核406的数量。其内部架构如图5所示。每个处理器核406包括三大模块:控制模块51、运算模块52及存储模块53。
控制模块51用以协调并控制运算模块52和存储模块53的工作,以完成深度学习的任务,其包括取指单元(Instruction Fetch Unit,IFU)511及指令译码单元(Instruction Decode Unit,IDU)512。取指单元511用以获取来自处理装置303的指令,指令译码单元512则将获取的指令进行译码,并将译码结果作为控制信息发送给运算模块52和存储模块53。
运算模块52包括向量运算单元521及矩阵运算单元522。向量运算单元521用以执行向量运算,可支持向量乘、加、非线性变换等复杂运算;矩阵运算单元522负责深度学习算法的核心计算,即矩阵乘及卷积。
存储模块53用来存储或搬运相关数据,包括神经元存储单元(Neuron RAM,NRAM)531、权值存储单元(Weight RAM,WRAM)532、数据缓存533、输入/输出直接内存访问模块(Input/Output Direct Memory Access,IODMA)534、搬运直接内存访问模块(Move Direct Memory Access,MVDMA)535。NRAM 531用以存储供处理器核406计算的特征图及计算后的中间结果;WRAM 532则用以存储深度学习网络的权值;数据缓存533用以缓存来自处理器核406外部的数据,当运算模块52需要进行运算时,将所需的数据搬运至NRAM 531及WRAM 532;IODMA 534通过广播总线409控制数据缓存533与片外内存304的访存;MVDMA 535则用以控制数据缓存533与共享存储单元(Static Random-Access Memory,SRAM)408的访存。
回到图4,存储核407主要用以存储和通信,即存储处理器核406间的共享数据或中间结果、以及执行集群405与片外内存304之间的通信、集群405间彼此的通信、处理器核406间彼此的通信等。在其他实施例中,存储核407具有标量运算的能力,用以执行标量运算。
存储核407包括共享存储单元(SRAM)408、广播总线409、集群直接内存访问模块(Cluster Direct Memory Access,CDMA)420及全局直接内存访问模块(Global Direct Memory Access,GDMA)411。SRAM 408为片上内存，承担高性能数据中转站的角色，在同一个集群405内不同处理器核406之间所复用的数据不需要通过处理器核406各自向片外内存304获得，而是经SRAM 408在处理器核406间中转，存储核407只需要将复用的数据从SRAM 408迅速分发给多个处理器核406即可，以提高核间通讯效率，亦大大减少片上片外的输入/输出访问。
广播总线409、CDMA 420及GDMA 411则分别用来执行处理器核406间的通信、集群405间的通信和集群405与片外内存304的数据传输。以下将分别说明。
广播总线409用以完成集群405内各处理器核406间的高速通信,此实施例的广播总线409支持核间通信方式包括单播、多播与广播。单播是指点对点(即单一处理器核至单一处理器核)的数据传输,多播是将一份数据从SRAM 408传输到特定几个处理器核406的通信方式,而广播则是将一份数据从SRAM 408传输到所有处理器核406的通信方式,属于多播的一种特例。
CDMA 420用以控制在同一个计算装置301内不同集群405间的SRAM 408的访存。
GDMA 411与外部存储控制器401协同，用以控制集群405的SRAM 408到片外内存304的访存，或是将数据自片外内存304读取至SRAM 408中。从前述可知，片外内存304与数据缓存533间的通信可以经由2个渠道来实现。第一个渠道是通过IODMA 534直接联系片外内存304与数据缓存533；第二个渠道是先经由GDMA 411使得数据在片外内存304与SRAM 408间传输，再经过MVDMA 535使得数据在SRAM 408与数据缓存533间传输。虽然表面上看来第二个渠道需要更多的元件参与，数据流较长，但实际上在部分实施例中，第二个渠道的带宽远大于第一个渠道，因此片外内存304与数据缓存533间的通信通过第二个渠道可能更有效率。本发明的实施例可根据本身硬件条件选择数据传输渠道。
在其他实施例中,GDMA 411的功能和IODMA 534的功能可以整合在同一部件中。本发明为了方便描述,将GDMA 411和IODMA 534视为不同部件,对于本领域技术人员来说,只要其实现的功能以及达到的技术效果与本发明类似,即属于本发明的保护范围。进一步地,GDMA 411的功能、IODMA 534的功能、CDMA 420的功能、MVDMA 535的功能亦可以由同一部件来实现。
综上所述,此实施例在进行深度学习运算前,通过软件设定,发送预载指令,将数据从片外内存304搬移至特定集群405的SRAM 408中,换言之,此实施例的预载指令用以驱动计算装置301预先载入下次运算所需的数据。接着根据加载指令,控制模块51将数据从SRAM 408搬移至数据缓存533,供运算模块52进行运算,换言之,此实施例的加载指令用以驱动控制模块51载入运算模块52进行本次运算所需的数据。此实施例的预载指令及加载指令均包括起始地址域、步长域及数量域,欲读取一个数据中的局部区域(目标数据)时,自起始地址域获取读取目标数据的起始地址,自步长域中获取步长值,并自数量域中获取数量值,其中步长值指的是沿第一维方向读取目标数据时以数据片段为单位的读取间隔,数量为读取数量。藉由引入步长值与数量值,此实施例便可仅读取局部区域(目标数据),不需连带读取不感兴趣的其他数据片段。
图6示出此实施例自二维数据读取目标数据的示意图,该图像数据是8×8的二维矩阵,并沿着X方向连续存储,图中每格为数据片段,即读写数据的最小单元,在此实施例中为64个字节的缓存行。
一个示例是步长域里的步长值为0及数量域里的数量值为4，表示沿X方向读取目标数据时的读取间隔为0，也就是连续读取，且沿X方向的读取4个缓存行。在此实施例中，基于一个步长值与数量值所获得的子数据集合称为子数据组。假设起始地址为图中缓存行1的地址，则取出的子数据组为缓存行1至缓存行4。另一个示例是步长域里的步长值为2及数量域里的数量值为6，表示沿X方向读取目标数据时的读取间隔为2，且沿X方向的读取6个缓存行，假设起始地址为图中缓存行5的地址，则取出的子数据组为缓存行5至缓存行10。由于图6的示例仅展示一组步长值与数量值，因此子数据组即为欲读取的目标数据。
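上述"步长+数量"的寻址方式可以用下面的 Python 片段作一个极简示意（仅为帮助理解的假设性演示：以整数索引代表缓存行编号，函数名 read_1d 为本文虚构，并非实际硬件实现）：

```python
def read_1d(start, step, count):
    """自 start 片段起读取 count 个数据片段(缓存行),
    相邻两次读取之间跳过 step 个片段; step=0 即连续读取。"""
    return [start + i * (step + 1) for i in range(count)]

# 示例一: 步长值 0、数量值 4 —— 自起始片段起连续读取 4 个缓存行
print(read_1d(0, 0, 4))  # [0, 1, 2, 3]

# 示例二: 步长值 2、数量值 6 —— 每隔 2 个片段读取 1 个, 共读 6 个
print(read_1d(0, 2, 6))  # [0, 3, 6, 9, 12, 15]
```

两个示例分别对应图6中步长值为0与步长值为2的读取方式，只是此处以片段索引0作为起始地址。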
图7示出此实施例自二维数据中读取目标数据的另一个示意图。一个示例是,步长域里的步长值为7及数量域里的数量值为4,表示沿X方向读取目标数据时的读取间隔为7,且沿X方向的读取4个缓存行,假设起始地址为图中缓存行1的地址,则取出的子数据组为缓存行1至缓存行4。由于此示例亦仅有一组步长值与数量值,因此子数据组即为目标数据,换言之,当步长值为图像步长减1时,目标数据恰好为沿着第二维(Y方向)连续读取数个缓存行,其中图像步长为图像在第一维(X方向)上的长度。如果感兴趣的局部区域为粗线方框的数据,上述的设定方式恰好只读取感兴趣的缓存行。另一个示例是,步长域里的步长值为6及数量域里的数量值为4,表示沿X方向读取目标数据时的读取间隔为6,且沿X方向的读取4个缓存行,假设起始地址为图中缓存行5的地址,则取出的子数据组(目标数据)为缓存行5至缓存行8,类似于沿着斜率为1的方向连续读取缓存行。
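以图7所示8×8的二维数据为例，下面的示意代码（假设行主序连续存储、图像步长为8，函数名为虚构）验证"步长值=图像步长-1时沿第二维连续读取、步长值=图像步长-2时沿斜率为1的方向读取"这一性质：

```python
W = 8  # 图像步长: 数据在第一维(X方向)上的片段数

def read_1d(start, step, count):
    # 每隔 step 个片段读取 1 个, 共读 count 个
    return [start + i * (step + 1) for i in range(count)]

def to_rc(i):
    # 行主序存储下, 将片段索引换算为(行, 列)平面坐标
    return divmod(i, W)

# 步长值 = W - 1 = 7: 沿第二维(Y方向)连续读取同一列
print([to_rc(i) for i in read_1d(0, W - 1, 4)])
# [(0, 0), (1, 0), (2, 0), (3, 0)]

# 步长值 = W - 2 = 6: 沿斜率为 1 的方向读取
print([to_rc(i) for i in read_1d(7, W - 2, 4)])
# [(0, 7), (1, 6), (2, 5), (3, 4)]
```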
此实施例的步长值与数量值可以包括多个。当预载指令与加载指令包括2个步长域及2个数量域时,其读取的方式将更为多样。举例来说,预载指令与加载指令包括起始地址域、载有第一步长值的第一步长域、载有第一数量值的第一数量域、载有第二步长值的第二步长域与载有第二数量值的第二数量域,其中第一步长值为沿第一维方向(X方向)读取目标数据时以数据片段为单位的读取间隔,第一数量值为沿第一维方向(X方向)的读取数量,每执行一次第一步长值与第一数量值的操作所获得的数据集合为子数据组。第二步长值为每获得一次子数据组,沿第一维方向以数据片段为单位的读取间隔,之后再按上述第一步长值与第一数量值读取子数据组,第二数量值则为读取子数据组的次数。
图8示出此实施例自二维数据读取目标数据的另一个示意图。一个示例是,第一步长值(S1)为1、第一数量值(N1)为3、第二步长值(S2)为3、第二数量值(N2)为4,如果起始地址为图中缓存行1的地址,则自缓存行1开始,每隔第一步长(S1)读取1个缓存行,执行3次(N1),以获得第一子数据组(缓存行1至3),接着自缓存行3起隔第二步长(S2)的缓存行4开始,获得第二子数据组,即缓存行4至6。由于第二数量值(N2)为4,因此重复上述操作以获得4组子数据组,其中第三子数据组为缓存行7至9,第四子数据组为缓存行10至12。这四组子数据组即为目标数据。
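两级"步长+数量"的读取次序可以用如下 Python 片段示意（假设性实现，以片段索引代表缓存行，read_2level 为本文虚构的函数名）：

```python
def read_2level(start, s1, n1, s2, n2):
    """先按(第一步长 s1, 第一数量 n1)读出一个一维子数据组;
    每读完一组, 自该组最后一个片段起跳过 s2 个片段,
    再读下一组, 共读取 n2 组。"""
    groups, p = [], start
    for _ in range(n2):
        group = [p + i * (s1 + 1) for i in range(n1)]
        groups.append(group)
        p = group[-1] + s2 + 1
    return groups

# 图8的参数: S1=1、N1=3、S2=3、N2=4, 起始于片段 0
print(read_2level(0, 1, 3, 3, 4))
# [[0, 2, 4], [8, 10, 12], [16, 18, 20], [24, 26, 28]]
```

四个子列表即图8中的四组子数据组：组内间隔由第一步长控制，组间位移由第二步长控制。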
此实施例可以接受步长域为空字符串(Null)或是特殊字符。当第一步长值为0,第二步长域为空字符串或是特殊字符时,第二步长值为图像步长减去第一数量值。图9示出此实施例自二维数据读取目标数据的另一个示意图。如果该二维数据为人像上半身照片,粗体方框部分为人脸正面,现此实施例欲提取人脸的局部区域(粗体方框部分)进行人脸识别运算,指令的起始地址域设定的起始地址为缓存行1的地址,第一步长值设定为0、第一数量值设定为3、第二步长域为空字符串、第二数量值设定为4,处理器核406自缓存行1开始,无间隔连续读取3个缓存行,以获得第一子数据组(缓存行1至3),接着再以图像步长的长度减去第一数量值(8-3=5)来进行位移,接着读取第二子数据组(缓存行4至6),由于第二数量值为4,因此要读取4组子数据组,其中第三子数据组为缓存行7至9,第四子数据组为缓存行10至12。这四组子数据组即为目标数据,也就是人脸的局部区域,无需读取不感兴趣的缓存行。
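第二步长域为空时"第二步长值=图像步长减去第一数量值"的默认规则，恰好使各子数据组在行主序存储中上下对齐，从而框出一个矩形局部区域。以下为一个示意（假设图像步长为8，read_2level 为虚构函数名）：

```python
W = 8  # 图像步长

def read_2level(start, s1, n1, s2, n2):
    # 每读完一个一维子数据组, 自组尾跳过 s2 个片段后读取下一组
    groups, p = [], start
    for _ in range(n2):
        group = [p + i * (s1 + 1) for i in range(n1)]
        groups.append(group)
        p = group[-1] + s2 + 1
    return groups

# 第二步长域为空: 取 S2 = 图像步长 - 第一数量值 = 8 - 3 = 5
print(read_2level(0, 0, 3, W - 3, 4))
# [[0, 1, 2], [8, 9, 10], [16, 17, 18], [24, 25, 26]]
# 即左上角 4 行 x 3 列的矩形区域(图9中的人脸局部), 各组逐行对齐
```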
从图7及图9的例子推演可知,不论是预载指令或是加载指令,当以下表达式满足且第一步长值为0时,二维子数据组恰好为沿着第二维方向连续读取一维子数据组:
第二步长值 = 图像步长 - 第一数量值
图10示出此实施例自二维数据读取目标数据的另一个示意图。如果该二维数据为人像上半身照片，粗体方框部分为人脸侧面，现此实施例欲提取包括人脸侧面的局部区域将其转正后再进行人脸识别运算，则指令的起始地址域设定为缓存行1的地址，第一步长值设定为0、第一数量值设定为3、第二步长值设定为6、第二数量值设定为4。实施例自缓存行1开始，无间隔连续读取3个缓存行，第一子数据组为缓存行1至3，接着间隔6个缓存行后再读取第二子数据组(缓存行4至6)，接着间隔6个缓存行后再读取第三子数据组(缓存行7至9)，接着间隔6个缓存行后再读取第四子数据组(缓存行10至12)，人脸侧面的局部区域(类似平行四边形)便完整被获取，无需读取不感兴趣的缓存行。
随着技术的演进,数据可以采用更高维的形式进行存储,例如三维、四维或更多维。欲读取N维数据的局部区域时,此实施例的指令除了起始地址域外,还可以设定N个步长域及N个数量域,其中第N步长值为自N-1子数据组起算沿第一维以数据片段为单位的间隔数量,第N数量值为读取N-1子数据组的数量。以读取三维数据的目标数据为例,第一步长值与第一数量值决定沿第一维方向的一维子数据组,第二步长值与第二数量值决定由数个一维子数据组所组成的二维子数据组,二维子数据组为沿着第一维与第二维展开的矩阵子数据组,第三步长值与第三数量值决定由数个二维子数据组所组成的三维子数据组,此三维子数据组即为目标数据。本领域技术人员可以轻易的推及四维或更多维的情况,故不赘述。
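N级步长/数量的读取次序可以递归地定义：第一级决定一维子数据组，第N级以第N步长为组间间隔、将第N-1级重复第N数量次。以下 Python 片段是一个示意性实现（read_nd 为虚构函数名；三级示例中假设体数据按 8×8 平面、行主序连续存储，步长参数是为凑出该布局而选取的假设值，并非指令的实际编码）：

```python
def read_nd(start, steps, counts):
    """steps[0]/counts[0] 为第一步长/第一数量, 依此类推。
    返回目标数据全部片段的索引(按读取顺序)。"""
    def group(p, level):
        if level == 0:
            idx = [p + i * (steps[0] + 1) for i in range(counts[0])]
            return idx, idx[-1]
        idx, last = [], p
        for _ in range(counts[level]):
            sub, last = group(p, level - 1)
            idx += sub
            p = last + steps[level] + 1  # 自上一组组尾跳过第 level+1 步长个片段
        return idx, last
    return group(start, len(steps) - 1)[0]

# 三级示例: 在两个相邻的 8x8 平面上各取一个左上角 3x3 子块
print(read_nd(0, [0, 5, 45], [3, 3, 2]))
# [0, 1, 2, 8, 9, 10, 16, 17, 18, 64, 65, 66, 72, 73, 74, 80, 81, 82]
```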
基于上述的操作,当计算装置301欲自第一内存(如片外内存304)读取N维图像数据中的目标图像数据至第二内存(如SRAM 408)时,处理器核406根据预载指令中的起始地址域确定起始数据片段,自第一步长域中获取第一步长值,自第一数量域中获取第一数量值,由第一步长值与第一数量值决定沿第一维方向的一维子图像数据组。如预载指令中并无其他步长域及数量域,则一维子图像数据组即为目标图像数据。
如预载指令中还包括第二步长域及第二数量域,自第二步长域中获取第二步长值,自第二数量域中获取第二数量值,由第二步长值与第二数量值决定由数个一维子图像数据组所组成的二维子图像数据组,其中第二步长值决定一维子图像数据组间沿第一维方向的步长,第二数量值决定一维子图像数据组的读取数量。如预载指令中并无其他步长域及数量域,则二维子图像数据组即为目标图像数据。
如预载指令中还包括第N步长域及第N数量域,依前述方式获取N维子图像数据组,其中第N步长值决定N-1维子图像数据组间沿第一维方向的步长,第N数量值决定N-1维子图像数据组的读取数量。该N维子图像数据组即为目标图像数据。
计算装置301基于上述的预载指令便可完整地将目标图像数据自片外内存304搬运至SRAM 408,无需读取图像中不感兴趣的区域。
当控制模块51欲自第一内存(如SRAM 408)读取N维图像数据中的目标图像数据至第二内存(如数据缓存533)时,需注意的是,这里的N维图像数据可以是上述自片外内存304搬运至SRAM 408的目标图像数据。控制模块51根据加载指令中的起始地址域确定起始数据片段,自第一步长域中获取第一步长值,自第一数量域中获取第一数量值,由第一步长值与第一数量值决定沿第一维方向的一维子图像数据组。如加载指令中并无其他步长域及数量域,则一维子图像数据组即为目标图像数据。
如加载指令中还包括第二步长域及第二数量域,自第二步长域中获取第二步长值,自第二数量域中获取第二数量值,由第二步长值与第二数量值决定由数个一维子图像数据组所组成的二维子图像数据组,其中第二步长值决定一维子图像数据组间沿第一维方向的步长,第二数量值决定一维子图像数据组的读取数量。如加载指令中并无其他步长域及数量域,则二维子图像数据组即为目标图像数据。
如加载指令中还包括第N步长域及第N数量域，依前述方式获取N维子图像数据组，其中第N步长值决定N-1维子图像数据组间沿第一维方向的步长，第N数量值决定N-1维子图像数据组的读取数量。该N维子图像数据组即为目标图像数据。
控制模块51基于上述的加载指令便可完整地将目标图像数据自SRAM 408搬运至数据缓存533,无需读取图像中不感兴趣的区域。
从图7可知,如果指令仅包括第一步长域与第一数量域,当第一步长值为图像步长减一时,目标数据为沿着第二维方向连续读取第一数量值个数据片段。当第一步长值为图像步长减二时,目标数据为在第一维与第二维组成的平面中沿着斜率为1的方向连续读取第一数量值个数据片段。进一步可以推论出,当第一步长值为图像步长时,目标数据为在第一维与第二维组成的平面中沿着斜率为-1的方向连续读取第一数量值个数据片段。
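其中"第一步长值为图像步长时沿斜率为-1的方向读取"的推论，同样可以用片段索引验证（假设图像步长为8的示意代码，函数名为虚构）：

```python
W = 8  # 图像步长

def read_1d(start, step, count):
    # 每隔 step 个片段读取 1 个, 共读 count 个
    return [start + i * (step + 1) for i in range(count)]

# 步长值 = 图像步长: 每次前进 W + 1 个片段, 行号与列号同时加 1
idx = read_1d(0, W, 4)
print(idx)                          # [0, 9, 18, 27]
print([divmod(i, W) for i in idx])  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```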
控制模块51基于加载指令自SRAM 408搬运N维图像数据中的目标图像数据至数据缓存533。当运算模块52进行运算时,目标数据自数据缓存533搬移至NRAM 531,运算模块52自NRAM 531读取目标数据进行运算。
本发明的另一个实施例为一种基于指令读取数据中的目标数据的方法,更详细来说,此实施例的指令仅包括一个步长域与一个数量域。图11示出此实施例的流程图。
在步骤1101中,根据指令中的起始地址域确定起始地址。
在步骤1102中,根据步长域中的步长值设定步长。实施例自指令中的步长域中获取步长值来设定步长。
在步骤1103中,根据数量域中的数量值设定数量。实施例自指令中的数量域中获取数量值来设定读取数量。
在步骤1104中，自第一内存读取数据中对应步长与数量的目标数据至第二内存。具体来说，步长值为沿第一维方向读取目标数据时以数据片段为单位的读取间隔，数量值为沿第一维方向读取数据片段的数量，由步长值与数量值所决定的子数据组即为欲读取的目标数据。
在步骤1105中,基于第二内存中的目标数据进行运算。在一种情况下,第一内存可以是片外内存,第二内存为片上内存;在另一种情况下,第一内存可以是片上内存,第二内存为缓存。
本发明的另一个实施例为一种基于指令读取数据中的目标数据的方法,更详细来说,此实施例的指令包括N个步长域与N个数量域。图12示出此实施例的流程图。
在步骤1201中,根据指令中的起始地址域确定起始地址。
在步骤1202中,根据N个步长域中的N个步长值设定N个步长。此实施例自第一步长域中获取第一步长值,自第二步长域中获取第二步长值,自第N步长域中获取第N步长值,以分别设定第一至第N步长。
在步骤1203中,根据N个数量域中的N个数量值设定N个数量。此实施例自第一数量域中获取第一数量值,自第二数量域中获取第二数量值,自第N数量域中获取第N数量值,以分别设定第一至第N数量。
在步骤1204中,自第一内存读取数据中对应N个步长与N个数量的目标数据至第二内存。此实施例由第一步长值与第一数量值决定沿第一维方向的一维子图像数据组。自第二步长域中获取第二步长值,自第二数量域中获取第二数量值,由第二步长值与第二数量值决定由数个一维子图像数据组所组成的二维子图像数据组,其中第二步长值决定一维子图像数据组间沿第一维方向的步长,第二数量值决定一维子图像数据组的读取数量。依前述方式获取N维子图像数据组,其中第N步长值决定N-1维子图像数据组间沿第一维方向的读取间隔,第N数量值决定N-1维子图像数据组的读取数量。该N维子图像数据组即为目标图像数据。
在步骤1205中,基于第二内存中的目标数据进行运算。在一种情况下,第一内存可以是片外内存,第二内存为片上内存;在另一种情况下,第一内存可以是片上内存,第二内存为缓存。
本发明另一个实施例为一种计算机可读存储介质，其上存储有基于指令读取数据中的目标数据的方法的计算机程序代码，当所述计算机程序代码由处理装置运行时，执行图11或图12所述的方法。本发明另一个实施例为一种计算机程序产品，包括基于指令读取数据中的目标数据的计算机程序，其特征在于，所述计算机程序被处理器执行时实现图11或图12所述方法的步骤。本发明另一个实施例为一种计算机装置，包括存储器、处理器及存储在存储器上的计算机程序，其特征在于，所述处理器执行所述计算机程序以实现图11或图12所述方法的步骤。在一些实现场景中，上述集成的单元可以采用软件程序模块的形式来实现。如果以软件程序模块的形式实现并作为独立的产品销售或使用时，所述集成的单元可以存储在计算机可读取存储器中。基于此，当本发明的方案以软件产品(例如计算机可读存储介质)的形式体现时，该软件产品可以存储在存储器中，其可以包括若干指令用以使得计算机设备(例如个人计算机、服务器或者网络设备等)执行本发明实施例所述方法的部分或全部步骤。前述的存储器可以包括但不限于U盘、闪存盘、只读存储器(Read Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
本发明通过设定至少一个步长值及至少一个数量值,以决定读取数据的步长与数量,进而略过不感兴趣的数据片段,仅取出感兴趣的局部数据,以节省I/O带宽。
根据不同的应用场景,本发明的电子设备或装置可以包括服务器、云端服务器、服务器集群、数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、PC设备、物联网终端、移动终端、手机、行车记录仪、导航仪、传感器、摄像头、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、视觉终端、自动驾驶终端、交通工具、家用电器、和/或医疗设备。所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。本发明的电子设备或装置还可以被应用于互联网、物联网、数据中心、能源、交通、公共管理、制造、教育、电网、电信、金融、零售、工地、医疗等领域。进一步,本发明的电子设备或装置还可以用于云端、边缘端、终端等与人工智能、大数据和/或云计算相关的应用场景中。在一个或多个实施例中,根据本发明方案的算力高的电子设备或装置可以应用于云端设备(例如云端服务器),而功耗小的电子设备或装置可以应用于终端设备和/或边缘端设备(例如智能手机或摄像头)。在一个或多个实施例中,云端设备的硬件信息和终端设备和/或边缘端设备的硬件信息相互兼容,从而可以根据终端设备和/或边缘端设备的硬件信息,从云端设备的硬件资源中匹配出合适的硬件资源来模拟终端设备和/或边缘端设备的硬件资源,以便完成端云一体或云边端一体的统一管理、调度和协同工作。
需要说明的是,为了简明的目的,本发明将一些方法及其实施例表述为一系列的动作及其组合,但是本领域技术人员可以理解本发明的方案并不受所描述的动作的顺序限制。因此,依据本发明的公开或教导,本领域技术人员可以理解其中的某些步骤可以采用其他顺序来执行或者同时执行。进一步,本领域技术人员可以理解本发明所描述的实施例可以视为可选实施例,即其中所涉及的动作或模块对于本发明某个或某些方案的实现并不一定是必需的。另外,根据方案的不同,本发明对一些实施例的描述也各有侧重。鉴于此,本领域技术人员可以理解本发明某个实施例中没有详述的部分,也可以参见其他实施例的相关描述。
在具体实现方面,基于本发明的公开和教导,本领域技术人员可以理解本发明所公开的若干实施例也可以通过本文未公开的其他方式来实现。例如,就前文所述的电子设备或装置实施例中的各个单元来说,本文在考虑了逻辑功能的基础上对其进行拆分,而实际实现时也可以有另外的拆分方式。又例如,可以将多个单元或组件结合或者集成到另一个系统,或者对单元或组件中的一些特征或功能进行选择性地禁用。就不同单元或组件之间的连接关系而言,前文结合附图所讨论的连接可以是单元或组件之间的直接或间接耦合。在一些场景中,前述的直接或间接耦合涉及利用接口的通信连接,其中通信接口可以支持电性、光学、声学、磁性或其它形式的信号传输。
在本发明中,作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元示出的部件可以是或者也可以不是物理单元。前述部件或单元可以位于同一位置或者分布到多个网络单元上。另外,根据实际的需要,可以选择其中的部分或者全部单元来实现本发明实施例所述方案的目的。另外,在一些场景中,本发明实施例中的多个单元可以集成于一个单元中或者各个单元物理上单独存在。
在另外一些实现场景中,上述集成的单元也可以采用硬件的形式实现,即为具体的硬件电路,其可以包括数字电路和/或模拟电路等。电路的硬件结构的物理实现可以包括但不限于物理器件,而物理器件可以包括但不限于晶体管或忆阻器等器件。鉴于此,本文所述的各类装置(例如计算装置或其他处理装置)可以通过适当的硬件处理器来实现,例如中央处理器、GPU、FPGA、DSP和ASIC等。进一步,前述的所述存储单元或存储装置可以是任意适当的存储介质(包括磁存储介质或磁光存储介质等),其例如可以是可变电阻式存储器(Resistive Random Access Memory,RRAM)、动态随机存取存储器(Dynamic Random Access Memory,DRAM)、静态随机存取存储器(Static Random Access Memory,SRAM)、增强动态随机存取存储器(Enhanced Dynamic Random Access Memory,EDRAM)、高带宽存储器(High Bandwidth Memory,HBM)、混合存储器立方体(Hybrid Memory Cube,HMC)、ROM和RAM等。
依据以下条款可更好地理解前述内容:
条款A1.一种处理器核,电性连接至片上内存,所述片上内存存储有数据,所述处理器核包括控制模块及运算模块,其特征在于所述控制模块用以:根据指令中的第一步长值设定第一步长;根据所述指令中的第一数量值设定第一数量;自所述片上内存读取所述数据中对应所述第一步长与所述第一数量的目标数据至缓存;其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量,所述运算模块基于所述缓存中的所述目标数据进行运算。
条款A2.根据条款A1所述的处理器核,其中所述指令包括第一步长域及第一数量域,所述控制模块自所述第一步长域中获取所述第一步长值,并自所述第一数量域中获取第一数量值。
条款A3.根据条款A1所述的处理器核,其中当所述第一步长值为图像步长减一时,所述目标数据为沿着第二维方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A4.根据条款A1所述的处理器核,其中当所述第一步长值为图像步长减二时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A5.根据条款A1所述的处理器核,其中当所述第一步长值为图像步长时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为-1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A6.根据条款A2所述的处理器核,其中所述控制模块还用以:根据所述指令中的第N步长值设定第N步长;根据所述指令中的第N数量值设定第N数量;自所述片上内存读取所述数据中对应所述第一至第N步长与所述第一至第N数量的所述目标数据至所述缓存;其中,由第N-1步长值与第N-1数量值决定N-1维子数据组,所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔,所述第N数量值为所述N-1维子数据组的读取数量,N为大于1的正整数。
条款A7.根据条款A6所述的处理器核,其中当以下表达式满足且所述第一步长值为0时,二维子数据组为沿着第二维方向连续读取一维子数据组:
第二步长值 = 所述图像步长 - 所述第一数量值
条款A8.根据条款A6所述的处理器核,其中当所述第一步长值为0,所述第二步长域为空字符串或是特殊字符时,所述第二步长值为图像步长减去第一数量值,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A9.根据条款A6所述的处理器核,其中所述指令包括第N步长域及第N维数量域,所述控制模块自所述第N步长域中获取所述第N步长值,并自所述第N维数量域中获取第N数量值。
条款A10.根据条款A1所述的处理器核,还包括存储模块,所述存储模块包括所述数据缓存。
条款A11.根据条款A10所述的处理器核,其中所述存储模块还包括神经元存储单元,当所述运算模块进行运算时,所述目标数据自所述数据缓存搬移至所述神经元存储单元,所述运算模块自所述神经元存储单元读取所述目标数据。
条款A12.根据条款A1所述的处理器核,其中所述指令为加载指令。
条款A13.根据条款A1所述的处理器核,其中所述数据片段为缓存行。
条款A14.根据条款A1至A13任一项所述的处理器核,其中所述指令还包括起始地址域,所述控制模块自所述起始地址域获取读取所述目标数据的起始地址。
条款A15.一种计算装置,包括根据条款A1至A14所述的处理器核。
条款A16.一种集成电路装置,包括根据条款A15所述的计算装置。
条款A17.一种板卡,包括根据条款A16所述的集成电路装置。
条款A18.一种计算装置,电性连接至片外内存,所述片外内存存储有数据,所述计算装置包括片上内存与处理器核,其特征在于处理器核用以:根据指令中的第一步长值设定第一步长;根据所述指令中的第一数量值设定第一数量;自所述片外内存读取所述数据中对应所述第一步长与所述第一数量的目标数据至所述片上内存;其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量,所述处理器核基于所述片上内存中的所述目标数据进行运算。
条款A19.根据条款A18所述的计算装置,其中所述指令包括第一步长域及第一数量域,所述控制模块自所述第一步长域中获取所述第一步长值,并自所述第一数量域中获取第一数量值。
条款A20.根据条款A18所述的计算装置,其中当所述第一步长值为图像步长减一时,所述目标数据为沿着第二维方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A21.根据条款A18所述的计算装置,其中当所述第一步长值为图像步长减二时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A22.根据条款A18所述的计算装置,其中当所述第一步长值为图像步长时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为-1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A23.根据条款A19所述的计算装置,其中所述处理器核还用以:根据所述指令中的第N步长值设定第N步长;根据所述指令中的第N数量值设定第N数量;自所述片外内存读取所述数据中对应所述第一至第N步长与所述第一至第N数量的所述目标数据至所述片上内存;其中,由第N-1步长值与第N-1数量值决定N-1维子数据组,所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔,所述第N数量值为所述N-1维子数据组的读取数量,N为大于1的正整数。
条款A24.根据条款A23所述的计算装置,其中当以下表达式满足且所述第一步长值为0时,二维子数据组为沿着第二维方向连续读取一维子数据组:
第二步长值 = 所述图像步长 - 所述第一数量值
条款A25.根据条款A23所述的计算装置，其中当所述第一步长值为0，所述第二步长域为空字符串或是特殊字符时，所述第二步长值为图像步长减去第一数量值，其中所述图像步长为所述数据沿所述第一维方向的长度。
条款A26.根据条款A23所述的计算装置,其中所述指令包括第N步长域及第N维数量域,所述处理器核自所述第N步长域中获取所述第N步长值,并自所述第N维数量域中获取第N数量值。
条款A27.根据条款A18所述的计算装置,其中所述指令为预载指令。
条款A28.根据条款A18所述的计算装置,其中所述数据片段为缓存行。
条款A29.根据条款A18至A28任一项所述的计算装置,其中所述指令还包括起始地址域,所述处理器核自所述起始地址域获取读取所述目标数据的起始地址。
条款A30.一种集成电路装置,包括根据条款A18至A29任一项所述的计算装置。
条款A31.一种板卡,包括根据条款A30所述的集成电路装置。
条款A32.一种基于指令读取数据中的目标数据的方法,其特征在于包括:根据所述指令中的第一步长值设定第一步长;根据所述指令中的第一数量值设定第一数量;自第一内存读取所述数据中对应所述第一步长与所述第一数量的所述目标数据至第二内存;基于所述第二内存中的所述目标数据进行运算;其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量。
条款A33.根据条款A32所述的方法,还包括:根据所述指令中的第二至第N步长值设定第二至第N步长;根据所述指令中的第二至第N数量值设定第二至第N数量;自所述第一内存读取所述数据中对应所述第二至第N步长与所述第二至第N数量的所述目标数据至所述第二内存;其中,由第N-1步长值与第N-1数量值决定N-1维子数据组,所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔,所述第N数量值为所述N-1维子数据组的读取数量,N为大于1的正整数。
条款A34.一种计算机可读存储介质,其上存储有基于指令读取数据中的目标数据的方法的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行条款A32或A33所述的方法。
条款A35.一种计算机程序产品,包括基于指令读取数据中的目标数据的计算机程序,其特征在于,所述计算机程序被处理器执行时实现条款A32或A33所述方法的步骤。
条款A36.一种计算机装置,包括存储器、处理器及存储在存储器上的计算机程序,其特征在于,所述处理器执行所述计算机程序以实现条款A32或A33所述方法的步骤。
以上对本发明实施例进行了详细介绍,本文中应用了具体个例对本发明的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本发明的方法及其核心思想;同时,对于本领域的一般技术人员,依据本发明的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本发明的限制。

Claims (36)

  1. 一种处理器核,电性连接至片上内存,所述片上内存存储有数据,所述处理器核包括控制模块及运算模块,其特征在于所述控制模块用以:
    根据指令中的第一步长值设定第一步长;
    根据所述指令中的第一数量值设定第一数量;
    自所述片上内存读取所述数据中对应所述第一步长与所述第一数量的目标数据至缓存;
    其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量,所述运算模块基于所述缓存中的所述目标数据进行运算。
  2. 根据权利要求1所述的处理器核,其中所述指令包括第一步长域及第一数量域,所述控制模块自所述第一步长域中获取所述第一步长值,并自所述第一数量域中获取第一数量值。
  3. 根据权利要求1所述的处理器核,其中当所述第一步长值为图像步长减一时,所述目标数据为沿着第二维方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  4. 根据权利要求1所述的处理器核,其中当所述第一步长值为图像步长减二时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  5. 根据权利要求1所述的处理器核,其中当所述第一步长值为图像步长时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为-1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  6. 根据权利要求2所述的处理器核,其中所述控制模块还用以:
    根据所述指令中的第N步长值设定第N步长;
    根据所述指令中的第N数量值设定第N数量;
    自所述片上内存读取所述数据中对应所述第一至第N步长与所述第一至第N数量的所述目标数据至所述缓存;
    其中,由第N-1步长值与第N-1数量值决定N-1维子数据组,所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔,所述第N数量值为所述N-1维子数据组的读取数量,N为大于1的正整数。
  7. 根据权利要求6所述的处理器核,其中当以下表达式满足且所述第一步长值为0时,二维子数据组为沿着第二维方向连续读取一维子数据组:
    第2步长值=所述图像步长-所述第一数量值。
  8. 根据权利要求6所述的处理器核,其中当所述第一步长值为0,所述第二步长域为空字符串或是特殊字符时,所述第二步长值为图像步长减去第一数量值,其中所述图像步长为所述数据沿所述第一维方向的长度。
  9. 根据权利要求6所述的处理器核,其中所述指令包括第N步长域及第N维数量域,所述控制模块自所述第N步长域中获取所述第N步长值,并自所述第N维数量域中获取第N数量值。
  10. 根据权利要求1所述的处理器核,还包括存储模块,所述存储模块包括所述数据缓存。
  11. 根据权利要求10所述的处理器核,其中所述存储模块还包括神经元存储单元,当所述运算模块进行运算时,所述目标数据自所述数据缓存搬移至所述神经元存储单元,所述运算模块自所述神经元存储单元读取所述目标数据。
  12. 根据权利要求1所述的处理器核,其中所述指令为加载指令。
  13. 根据权利要求1所述的处理器核,其中所述数据片段为缓存行。
  14. 根据权利要求1至13任一项所述的处理器核,其中所述指令还包括起始地址域,所述控制模块自所述起始地址域获取读取所述目标数据的起始地址。
  15. 一种计算装置,包括根据权利要求1至14所述的处理器核。
  16. 一种集成电路装置,包括根据权利要求15所述的计算装置。
  17. 一种板卡,包括根据权利要求16所述的集成电路装置。
  18. 一种计算装置,电性连接至片外内存,所述片外内存存储有数据,所述计算装置包括片上内存与处理器核,其特征在于处理器核用以:
    根据指令中的第一步长值设定第一步长;
    根据所述指令中的第一数量值设定第一数量;
    自所述片外内存读取所述数据中对应所述第一步长与所述第一数量的目标数据至所述片上内存;
    其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量,所述处理器核基于所述片上内存中的所述目标数据进行运算。
  19. 根据权利要求18所述的计算装置,其中所述指令包括第一步长域及第一数量域,所述控制模块自所述第一步长域中获取所述第一步长值,并自所述第一数量域中获取第一数量值。
  20. 根据权利要求18所述的计算装置,其中当所述第一步长值为图像步长减一时,所述目标数据为沿着第二维方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  21. 根据权利要求18所述的计算装置,其中当所述第一步长值为图像步长减二时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  22. 根据权利要求18所述的计算装置,其中当所述第一步长值为图像步长时,所述目标数据为在所述第一维与第二维组成的平面中沿着斜率为-1的方向连续读取所述第一数量值个数据片段,其中所述图像步长为所述数据沿所述第一维方向的长度。
  23. 根据权利要求19所述的计算装置,其中所述处理器核还用以:
    根据所述指令中的第N步长值设定第N步长;
    根据所述指令中的第N数量值设定第N数量;
    自所述片外内存读取所述数据中对应所述第一至第N步长与所述第一至第N数量的所述目标数据至所述片上内存;
    其中，由第N-1步长值与第N-1数量值决定N-1维子数据组，所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔，所述第N数量值为所述N-1维子数据组的读取数量，N为大于1的正整数。
  24. 根据权利要求23所述的计算装置,其中当以下表达式满足且所述第一步长值为0时,二维子数据组为沿着第二维方向连续读取一维子数据组:
    第2步长值=所述图像步长-所述第一数量值。
  25. 根据权利要求23所述的计算装置,其中当所述第一步长值为0,所述第二步长域为空字符串或是特殊字符时,所述第二步长值为图像步长减去第一数量值,其中所述图像步长为所述数据沿所述第一维方向的长度。
  26. 根据权利要求23所述的计算装置,其中所述指令包括第N步长域及第N维数量域,所述处理器核自所述第N步长域中获取所述第N步长值,并自所述第N维数量域中获取第N数量值。
  27. 根据权利要求18所述的计算装置,其中所述指令为预载指令。
  28. 根据权利要求18所述的计算装置,其中所述数据片段为缓存行。
  29. 根据权利要求18至28任一项所述的计算装置,其中所述指令还包括起始地址域,所述处理器核自所述起始地址域获取读取所述目标数据的起始地址。
  30. 一种集成电路装置,包括根据权利要求18至29任一项所述的计算装置。
  31. 一种板卡,包括根据权利要求30所述的集成电路装置。
  32. 一种基于指令读取数据中的目标数据的方法,其特征在于包括:
    根据指令中的第一步长值设定第一步长;
    根据所述指令中的第一数量值设定第一数量;
    自第一内存读取所述数据中对应所述第一步长与所述第一数量的所述目标数据至第二内存;
    基于所述第二内存中的所述目标数据进行运算;
    其中,所述第一步长为沿第一维方向读取所述目标数据时以数据片段为单位的读取间隔,所述第一数量为以所述数据片段为单位的读取数量。
  33. 根据权利要求32所述的方法,还包括:
    根据所述指令中的第二至第N步长值设定第二至第N步长;
    根据所述指令中的第二至第N数量值设定第二至第N数量;
    自所述第一内存读取所述数据中对应所述第二至第N步长与所述第二至第N数量的所述目标数据至所述第二内存;
    其中,由第N-1步长值与第N-1数量值决定N-1维子数据组,所述第N步长为所述N-1维子数据组沿所述第一维方向以所述数据片段为单位的读取间隔,所述第N数量值为所述N-1维子数据组的读取数量,N为大于1的正整数。
  34. 一种计算机可读存储介质,其上存储有基于指令读取数据中的目标数据的方法的计算机程序代码,当所述计算机程序代码由处理装置运行时,执行权利要求32或33所述的方法。
  35. 一种计算机程序产品，包括基于指令读取数据中的目标数据的计算机程序，其特征在于，所述计算机程序被处理器执行时实现权利要求32或33所述方法的步骤。
  36. 一种计算机装置,包括存储器、处理器及存储在存储器上的计算机程序,其特征在于,所述处理器执行所述计算机程序以实现权利要求32或33所述方法的步骤。
PCT/CN2023/098497 2022-06-06 2023-06-06 基于指令读取数据中的目标数据的方法及其设备 WO2023236929A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210635541.6 2022-06-06
CN202210635541.6A CN117234408A (zh) 2022-06-06 2022-06-06 基于指令读取数据中的目标数据的方法及其设备

Publications (1)

Publication Number Publication Date
WO2023236929A1 true WO2023236929A1 (zh) 2023-12-14

Family

ID=89097233

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/098497 WO2023236929A1 (zh) 2022-06-06 2023-06-06 基于指令读取数据中的目标数据的方法及其设备

Country Status (2)

Country Link
CN (1) CN117234408A (zh)
WO (1) WO2023236929A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100103282A1 (en) * 2008-10-28 2010-04-29 Kabushiki Kaisha Toshiba Image processing apparatus and image processing system
CN111158757A (zh) * 2019-12-31 2020-05-15 深圳芯英科技有限公司 并行存取装置和方法以及芯片
CN111325333A (zh) * 2020-03-02 2020-06-23 Oppo广东移动通信有限公司 数据处理方法、神经网络处理器、存储介质及电子设备
CN111782656A (zh) * 2020-06-30 2020-10-16 北京海益同展信息科技有限公司 数据读写方法及装置
CN113032007A (zh) * 2019-12-24 2021-06-25 阿里巴巴集团控股有限公司 一种数据处理方法及装置
CN113849224A (zh) * 2020-06-27 2021-12-28 英特尔公司 用于移动数据的指令的装置、方法和系统


Also Published As

Publication number Publication date
CN117234408A (zh) 2023-12-15


Legal Events

121: EP — the EPO has been informed by WIPO that EP was designated in this application (ref document number 23819094; country of ref document: EP; kind code of ref document: A1)