WO2020019174A1 - Data access method, processor, computer system, and mobile device - Google Patents

Data access method, processor, computer system, and mobile device Download PDF

Info

Publication number
WO2020019174A1
WO2020019174A1 PCT/CN2018/096904 CN2018096904W
Authority
WO
WIPO (PCT)
Prior art keywords
array
bit width
cache
memory
processor
Prior art date
Application number
PCT/CN2018/096904
Other languages
English (en)
French (fr)
Inventor
杨康 (Yang Kang)
李鹏 (Li Peng)
韩峰 (Han Feng)
Original Assignee
深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 (SZ DJI Technology Co., Ltd.)
Priority to PCT/CN2018/096904 priority Critical patent/WO2020019174A1/zh
Priority to CN201880038925.1A priority patent/CN110892373A/zh
Publication of WO2020019174A1 publication Critical patent/WO2020019174A1/zh
Priority to US17/120,467 priority patent/US20210133093A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/0207 Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0891 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, using clearing, invalidating or resetting means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/14 Protection against unauthorised use of memory or access to memory
    • G06F 12/1458 Protection against unauthorised use of memory or access to memory by checking the subject access rights
    • G06F 12/1483 Protection against unauthorised use of memory or access to memory by checking the subject access rights using an access-table, e.g. matrix or list
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098 Register arrangements
    • G06F 9/3012 Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30138 Extension of register space, e.g. register cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • the present application relates to the field of information technology, and more particularly, to a data access method, a processor, a computer system, and a mobile device.
  • the embodiments of the present application provide a data access method, a processor, a computer system, and a mobile device, which can improve the efficiency of data access.
  • according to a first aspect, a data access method for a processor is provided. The processor includes a computing array and a cache array, and the bit width of each cache in the cache array is equal to the bit width of a data unit processed by the computing array.
  • the method includes: reading M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1; and reading the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache.
  • according to a second aspect, a processor is provided, including a computing array and a cache array, where the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array. The cache array is configured to read M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1. The computing array is configured to read the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache.
  • according to a third aspect, a computer system is provided, including: a memory for storing computer-executable instructions; and a processor for accessing the memory and executing the computer-executable instructions to perform the operations in the method of the first aspect.
  • according to a fourth aspect, a mobile device is provided, including the processor of the second aspect, or the computer system of the third aspect.
  • according to a fifth aspect, a computer storage medium is provided, which stores program code, where the program code may be used to instruct execution of the method of the first aspect.
  • the technical solution of the embodiments of the present application uses, as the intermediate cache for data access, a cache array whose per-cache bit width equals the bit width of the data units processed by the computing array. The required cache-array bit width is low, fewer resources are occupied, and the cache array can be adapted to the data access required by the computing array, which improves the efficiency of data access.
  • FIG. 1a is a schematic diagram of a data processing process of a convolutional neural network.
  • FIG. 1b is a schematic diagram of a data input format of a MAC computing array.
  • FIG. 2 and FIG. 3 are architecture diagrams of applying the technical solution of the embodiment of the present application.
  • FIG. 4 is an exemplary structural diagram of a MAC calculation array according to an embodiment of the present application.
  • FIG. 5 is a schematic architecture diagram of a mobile device according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a data access method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a data input process according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a data output process according to an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a processor according to an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of a computer system according to an embodiment of the present application.
  • the size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • Figure 1a shows a schematic diagram of the data processing process of a convolutional neural network.
  • the processing of the convolutional neural network is that the input feature values of a window in the Input Feature Map (IF) undergo an inner-product operation with weights in the Multiply Accumulate (MAC) computing array, and the obtained result is output to an Output Feature Map (OF).
  • Input feature maps and output feature maps are generally stored in memory, such as Random Access Memory (RAM).
  • in the embodiments of the present application, data access refers to the "fetching" of data from the RAM to the MAC computing array, and the "storing" of data from the MAC computing array back to the RAM after the computation completes.
  • feature maps are generally stored segment-wise contiguously in RAM.
  • for computational efficiency, however, a MAC computing array requires "interleaved" input and output across multiple feature maps or multiple rows of data.
  • for example, as shown in FIG. 1b, the MAC computing array requires data units 1-12 to enter the array in the order {1}, {2,5}, {3,6,9}, {4,7,10}, {8,11}, {12}.
  • an intermediate storage medium such as a cache array, may be used to implement format conversion.
  • FIG. 2 is an architecture diagram to which the technical solution of the embodiment of the present application is applied.
  • the system 200 may include a processor 210 and a memory 220.
  • the memory 220 is configured to store data to be processed, such as an input feature map, and to store data processed by the processor, such as an output feature map.
  • the memory 220 may be the aforementioned RAM, for example, a Static Random Access Memory (Static Random Access Memory, SRAM).
  • the processor 210 is configured to read data from the memory 220 for processing, and store the processed data in the memory 220.
  • the processor 210 may include a computing array 211 and a cache array 212. Based on this design, when inputting data, the data is first read from the memory 220 into the cache array 212, and the computing array 211 then reads the data required for computation from the cache array 212. When outputting data, the computing array 211 first outputs the data to the cache array 212, and the data is then stored from the cache array 212 into the memory 220.
  • the cache array 212 can convert the data access format to meet the needs of the input and output data of the computing array 211, such as the data input format shown in FIG. 1b.
  • the computing array 211 may implement data input and output through corresponding input and output modules.
  • the processor 210 may further include an input module 213 and an output module 214.
  • the calculation array 211 reads data required for calculation from the cache array 212 through the input module 213, and outputs the data to the cache array 212 through the output module 214.
  • the input module 213 may be a network on chip (Network On Chip), and the Network On Chip implements data reading through a corresponding bus design.
  • the output module 214 may be a Partial Sum Memory, which is used to cache the intermediate results of the computing array 211, send the intermediate results back to the computing array 211 for further accumulation, and forward the final computation results obtained by the computing array 211 to the cache array 212. When there is no intermediate result, the Partial Sum Memory is only used to forward the final computation results of the computing array 211.
  • the computing array 211 is a MAC computing array.
  • FIG. 4 shows an exemplary structural diagram of a MAC computing array.
  • the MAC computing array 400 may include a two-dimensional array of MAC computing groups 410 and a MAC control module 420.
  • the MAC calculation group 410 may include a weight register 411 and a plurality of MAC calculation units 412.
  • the MAC calculation unit 412 is configured to buffer the input feature value, and perform a multiply and accumulate operation using the buffered input feature value and the filter weight value buffered in the weight register 411.
  • the system 200 may be provided in a mobile device.
  • the movable device may be an unmanned aerial vehicle, an unmanned ship, an autonomous vehicle, or a robot, which is not limited in the embodiments of the present application.
  • FIG. 5 is a schematic architecture diagram of a mobile device 500 according to an embodiment of the present application.
  • the mobile device 500 may include a power system 510, a control system 520, a sensing system 530, and a processing system 540.
  • the power system 510 is used to power the mobile device 500.
  • the power system of the drone may include an electronic governor (referred to as an ESC), a propeller, and a motor corresponding to the propeller.
  • the motor is connected between the electronic speed controller and the propeller, and the motor and the propeller are arranged on the corresponding arm; the electronic speed controller is configured to receive the driving signal generated by the control system and provide a driving current to the motor according to the driving signal, so as to control the rotating speed of the motor.
  • the motor is used to drive the propellers to rotate, thereby powering the drone's flight.
  • the sensing system 530 can be used to measure the posture information of the mobile device 500, that is, the position information and status information of the mobile device 500 in space, such as three-dimensional position, three-dimensional angle, three-dimensional velocity, three-dimensional acceleration, and three-dimensional angular velocity.
  • the sensing system 530 may include, for example, at least one of a gyroscope, an electronic compass, an Inertial Measurement Unit (IMU), a vision sensor, a Global Positioning System (GPS) receiver, a barometer, and an airspeed meter.
  • the sensing system 530 may also be used for acquiring images, that is, the sensing system 530 includes a sensor for acquiring images, such as a camera.
  • the control system 520 is used to control the movement of the mobile device 500.
  • the control system 520 may control the mobile device 500 according to a preset program instruction.
  • the control system 520 may control the movement of the mobile device 500 according to the posture information of the mobile device 500 measured by the sensing system 530.
  • the control system 520 may also control the mobile device 500 according to a control signal from a remote controller.
  • the control system 520 may be a flight control system (flight control), or a control circuit in the flight control.
  • the processing system 540 may process images acquired by the sensing system 530.
  • the processing system 540 may be an image signal processing (Image Signal Processing, ISP) chip.
  • the processing system 540 may be the system 200 in FIG. 2, or the processing system 540 may include the system 200 in FIG. 2.
  • the mobile device 500 may further include other components not shown in FIG. 5, which is not limited in the embodiment of the present application.
  • in the design of the intermediate storage medium, one implementation is to use a first-in-first-out (FIFO) queue with a large bit width, where the bit width of the FIFO is the combined bit width of the multiple columns of data required for "interleaved" input and output, for example, the bit width of the 4 columns of data in FIG. 1b. However, using such a wide FIFO as the intermediate cache for data input and output wastes considerable storage space, which indirectly increases chip area (cost) and power consumption and affects the efficiency of data access.
  • the embodiments of the present application provide a technical solution to improve the efficiency of data access by improving the design of the intermediate storage medium.
  • the technical solutions of the embodiments of the present application are described in detail below.
  • FIG. 6 shows a schematic flowchart of a data access method 600 according to an embodiment of the present application.
  • the method 600 is performed by a processor, which includes a computing array and a cache array, and a bit width of each cache in the cache array is equal to a bit width of a data unit processed by the computing array.
  • the method 600 includes:
  • in the embodiments of the present application, the bit width of each cache in the cache array serving as the intermediate storage medium is equal to the bit width of the data units processed by the computing array.
  • for example, the bit width of a cache may be the bit width of a feature value in the input feature map: as shown in FIG. 7, if the bit width of the feature values in the input feature map is 8b (bits), a cache array with a per-cache bit width of 8b can be used.
  • the cache array may be a RAM array, a FIFO array, or a register (REG) array, which is not limited in the embodiments of the present application.
  • during the reading of data from the memory to the cache array, N data units can be read at one time and stored into the N input caches. That is, data is read with the first access bit width of N times the cache bit width: M*N data units are read from the memory into the N input caches, and one column of the M*N data units is stored into one of the N input caches.
  • for example, as shown in FIG. 7, 3*4 data units can be read into 4 input caches with an access bit width of 32b.
  • during the reading of data from the cache array to the computing array, data units can be read from each cache with the cache bit width (the second access bit width), to meet the needs of the computing array's data processing.
  • the data units in the N input buffers may be read to the computing array according to the processing order of the computing array with the second access bit width.
  • the data unit is a feature value in a feature map
  • the processing order is a processing order of the convolutional neural network.
  • for example, as shown in FIG. 7, according to the processing order of the MAC computing array, data units 1-12 need to enter the array in the order {1}, {2,5}, {3,6,9}, {4,7,10}, {8,11}, {12}. Because the cache bit width equals the bit width of a data unit, the MAC computing array can read one data unit at a time with the cache's access bit width, so the data units required for computation can be read in the above order.
  • for the output of computation results, the data units processed by the computing array may first be stored into the N output caches in the cache array with the second access bit width; then, the M*N data units in the N output caches may be stored into the memory with the first access bit width.
  • that is, for the output of data from the computing array to the cache array, data units can be output at data-unit granularity with the cache's access bit width; for the output of data from the cache array to the memory, N data units belonging to the same output feature map can be output at one time to the corresponding output feature map with the first access bit width of N times the cache bit width.
  • for example, as shown in FIG. 8, each data unit can first be stored into the corresponding position in the 4 output caches at data-unit granularity (the second access bit width), and then the data units of the same output feature map can be stored into the corresponding output feature map in the memory at a granularity of 4 data units (the first access bit width).
  • the memory may be an on-chip memory or an off-chip memory.
  • the processor may further include the memory.
  • the technical solution of the embodiments of the present application uses, as the intermediate cache for data access, a cache array whose per-cache bit width equals the bit width of the data units processed by the computing array. The required cache-array bit width is low, fewer resources are occupied, and the cache array can be adapted to the data access required by the computing array, which improves the efficiency of data access.
  • the data access method in the embodiments of the present application is described in detail above.
  • the processor, computer system, and mobile device in the embodiments of the present application are described below. It should be understood that the processor, the computer system, and the mobile device in the embodiments of the present application can execute the foregoing methods of the embodiments of the present application; that is, for the specific working processes of the following products, reference may be made to the corresponding processes in the foregoing method embodiments.
  • FIG. 9 shows a schematic block diagram of the processor 900 of the present application.
  • the processor 900 may include a computing array 910 and a cache array 920.
  • the bit width of each cache in the cache array 920 is equal to the bit width of the data unit processed by the computing array 910.
  • the cache array 920 is configured to read M*N data units from the memory into the N input caches in the cache array 920 with a first access bit width, where the first access bit width is N times the bit width of each cache, the data units of one column of the M*N data units are stored into one of the N input caches, and M and N are positive integers greater than 1.
  • the computing array 910 is configured to read data units in the N input buffers to the computing array 910 with a second access bit width, where the second access bit width is a bit width of each cache.
  • the computing array 910 is configured to read the data units in the N input buffers according to the processing order of the computing array 910 with the second access bit width.
  • the data unit is a feature value in a feature map
  • the processing order is a processing order of a convolutional neural network
  • the computing array 910 is further configured to store the data units processed by the computing array 910 into N output caches in the cache array 920 with the second access bit width;
  • the cache array 920 is further configured to store the M*N data units in the N output caches into the memory with the first access bit width.
  • the cache array 920 is a random access memory RAM array, a first-in-first-out FIFO array, or a register REG array.
  • the processor is an on-chip device
  • the memory is an on-chip memory or an off-chip memory.
  • the calculation array 910 is a multiply-accumulate MAC calculation array.
  • the processor 900 further includes the memory.
  • the processor in the embodiments of the present application may be a chip, which may specifically be implemented by circuits, but the embodiments of the present application do not limit the specific implementation form.
  • FIG. 10 shows a schematic block diagram of a computer system 1000 according to an embodiment of the present application.
  • the computer system 1000 may include a processor 1010 and a memory 1020.
  • the computer system 1000 may also include components generally included in other computer systems, such as input-output devices, communication interfaces, and the like, which is not limited in the embodiments of the present application.
  • the memory 1020 is configured to store computer-executable instructions.
  • the memory 1020 may be various types of memory; for example, it may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one magnetic disk memory, which is not limited in the embodiments of the present application.
  • the processor 1010 is configured to access the memory 1020 and execute the computer-executable instructions to perform operations in the data access method of the foregoing various embodiments of the present application.
  • the processor 1010 may include a microprocessor, a Field-Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and the like, which is not limited in the embodiments of the present application.
  • An embodiment of the present application further provides a mobile device, and the mobile device may include a processor or a computer system in the foregoing various embodiments of the present application.
  • the processor, the computer system, and the mobile device in the embodiments of the present application may correspond to the execution subject of the data access method in the embodiments of the present application, and the above and other operations and/or functions of each module in the processor, the computer system, and the mobile device are respectively intended to implement the corresponding processes of the foregoing methods; for brevity, they are not repeated here.
  • the embodiment of the present application further provides a computer storage medium, and the computer storage medium stores program code, and the program code may be used to instruct to execute the data access method in the embodiment of the present application.
  • the term "and/or" is merely an association relationship describing associated objects, and indicates that three relationships may exist.
  • A and/or B may indicate three situations: A alone, both A and B, and B alone.
  • the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
  • the disclosed systems, devices, and methods may be implemented in other ways.
  • the device embodiments described above are only schematic. The division of the units is only a logical function division; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions in the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
  • when the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • based on such an understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
  • the foregoing storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Disclosed are a data access method, a processor, a computer system, and a mobile device. The processor includes a computing array and a cache array, where the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array. The method includes: reading M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1; and reading the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache. The technical solution of the embodiments of the present application can improve the efficiency of data access.

Description

Data access method, processor, computer system, and mobile device
COPYRIGHT NOTICE
The disclosure of this patent document contains material that is subject to copyright protection. The copyright is owned by the copyright owner. The copyright owner has no objection to the reproduction by anyone of the patent document or the patent disclosure as it appears in the official records and files of the Patent and Trademark Office.
TECHNICAL FIELD
The present application relates to the field of information technology, and more particularly, to a data access method, a processor, a computer system, and a mobile device.
BACKGROUND
With the development of the Internet and semiconductor technology, the reliability of deep learning algorithms in some application fields has in recent years reached the threshold for commercial application, but the huge demand for computation limits the application of deep learning to a certain extent. The design of dedicated deep learning processors is therefore crucial.
The most widely used deep learning algorithm at present is the Convolutional Neural Network (CNN), about 90% of whose computation is convolution. One of the important goals in designing a dedicated deep learning processor chip is to provide high-performance convolution computation.
To obtain high-performance computation, on the one hand a large computing array is needed; on the other hand, efficient data access is also critical. Therefore, how to improve the efficiency of data access has become a technical problem to be urgently solved in processor design.
SUMMARY
The embodiments of the present application provide a data access method, a processor, a computer system, and a mobile device, which can improve the efficiency of data access.
In a first aspect, a data access method for a processor is provided. The processor includes a computing array and a cache array, where the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array. The method includes: reading M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1; and reading the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache.
In a second aspect, a processor is provided, including a computing array and a cache array, where the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array. The cache array is configured to read M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1. The computing array is configured to read the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache.
In a third aspect, a computer system is provided, including: a memory for storing computer-executable instructions; and a processor for accessing the memory and executing the computer-executable instructions to perform the operations in the method of the first aspect.
In a fourth aspect, a mobile device is provided, including the processor of the second aspect, or the computer system of the third aspect.
In a fifth aspect, a computer storage medium is provided, which stores program code, where the program code may be used to instruct execution of the method of the first aspect.
In the technical solution of the embodiments of the present application, a cache array whose per-cache bit width equals the bit width of the data units processed by the computing array is used as the intermediate cache for data access. The required cache-array bit width is low, fewer resources are occupied, and the cache array can be adapted to the data access required by the computing array, which improves the efficiency of data access.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1a is a schematic diagram of the data processing of a convolutional neural network.
FIG. 1b is a schematic diagram of the data input format of a MAC computing array.
FIG. 2 and FIG. 3 are architecture diagrams to which the technical solution of the embodiments of the present application is applied.
FIG. 4 is an exemplary structural diagram of a MAC computing array according to an embodiment of the present application.
FIG. 5 is a schematic architecture diagram of a mobile device according to an embodiment of the present application.
FIG. 6 is a schematic flowchart of a data access method according to an embodiment of the present application.
FIG. 7 is a schematic diagram of a data input process according to an embodiment of the present application.
FIG. 8 is a schematic diagram of a data output process according to an embodiment of the present application.
FIG. 9 is a schematic block diagram of a processor according to an embodiment of the present application.
FIG. 10 is a schematic block diagram of a computer system according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be understood that the specific examples herein are only intended to help those skilled in the art better understand the embodiments of the present application, rather than to limit the scope of the embodiments of the present application.
It should also be understood that, in the various embodiments of the present application, the size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
It should also be understood that the various implementations described in this specification may be implemented individually or in combination, which is not limited in the embodiments of the present application.
The technical solutions of the embodiments of the present application can be applied to various deep learning algorithms, such as convolutional neural networks, but the embodiments of the present application are not limited thereto.
FIG. 1a shows a schematic diagram of the data processing of a convolutional neural network.
As shown in FIG. 1a, the processing of the convolutional neural network is that the input feature values of a window in the Input Feature Map (IF) undergo an inner-product operation with weight values (weights) in the Multiply Accumulate (MAC) computing array, and the obtained result is output to the Output Feature Map (OF). The input feature maps and output feature maps (collectively, feature maps) are generally stored in memory, for example, in a Random Access Memory (RAM). In the embodiments of the present application, data access refers to the "fetching" of data from the RAM to the MAC computing array, and the "storing" of data from the MAC computing array back to the RAM after the computation completes.
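As a concrete illustration of the window/weights inner product described above, the following minimal Python sketch computes one output feature value (the array sizes and names are assumptions chosen only for illustration, not taken from the patent):

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
IF = np.arange(36, dtype=np.int32).reshape(6, 6)   # input feature map
W = np.ones((3, 3), dtype=np.int32)                # filter weights

def conv_output_value(feature_map, weights, row, col):
    """Inner product of one IF window with the weights,
    i.e. the multiply-accumulate a MAC array performs."""
    kh, kw = weights.shape
    window = feature_map[row:row + kh, col:col + kw]
    return int(np.sum(window * weights))

# One value of the output feature map (OF):
print(conv_output_value(IF, W, 0, 0))  # 63
```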
Feature maps are generally stored segment-wise contiguously in RAM, whereas, for computational efficiency, the MAC computing array requires "interleaved" input and output across multiple feature maps or multiple rows of data. For example, as shown in FIG. 1b, the MAC computing array requires data units 1-12 to enter the array in the order {1}, {2,5}, {3,6,9}, {4,7,10}, {8,11}, {12}. In some implementations, to resolve this conflict between "storage" and "computation" (use), an intermediate storage medium, such as a cache array, may be used to implement the format conversion.
FIG. 2 is an architecture diagram to which the technical solution of the embodiments of the present application is applied.
As shown in FIG. 2, the system 200 may include a processor 210 and a memory 220.
The memory 220 is configured to store data to be processed, for example, input feature maps, and to store data processed by the processor, for example, output feature maps. The memory 220 may be the aforementioned RAM, for example, a Static Random Access Memory (SRAM).
The processor 210 is configured to read data from the memory 220 for processing and to store the processed data into the memory 220. The processor 210 may include a computing array 211 and a cache array 212. Based on this design, when inputting data, the data is first read from the memory 220 into the cache array 212, and the computing array 211 then reads the data required for computation from the cache array 212; when outputting data, the computing array 211 first outputs the data to the cache array 212, and the data is then stored from the cache array 212 into the memory 220. The cache array 212, serving as the intermediate storage medium, can implement the conversion of the data access format to meet the input and output needs of the computing array 211, for example, the data input format shown in FIG. 1b.
Optionally, the computing array 211 may implement data input and output through corresponding input and output modules. For example, as shown in FIG. 3, the processor 210 may further include an input module 213 and an output module 214. The computing array 211 reads the data required for computation from the cache array 212 through the input module 213, and outputs data to the cache array 212 through the output module 214. For example, the input module 213 may be a Network On Chip, which implements data reading through a corresponding bus design. The output module 214 may be a Partial Sum Memory, which is used to cache the intermediate results of the computing array 211, send the intermediate results back to the computing array 211 for further accumulation, and forward the final computation results obtained by the computing array 211 to the cache array 212. When there is no intermediate result, the Partial Sum Memory is only used to forward the final computation results of the computing array 211.
In one embodiment, the computing array 211 is a MAC computing array. FIG. 4 shows an exemplary structure of a MAC computing array. As shown in FIG. 4, the MAC computing array 400 may include a two-dimensional array of MAC computing groups 410 and a MAC control module 420. A MAC computing group 410 may include a weight register 411 and multiple MAC computing units 412. A MAC computing unit 412 is configured to buffer an input feature value and perform a multiply-accumulate operation using the buffered input feature value and the filter weight value buffered in the weight register 411.
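The per-unit behavior just described can be modeled as follows (a minimal sketch, assuming one weight register shared per group; the class and method names are illustrative, not the patent's hardware interface):

```python
class MacUnit:
    """Toy model of one MAC computing unit 412: it buffers an input
    feature value and multiply-accumulates it against a weight held
    in the group's weight register 411."""

    def __init__(self):
        self.feature = 0   # buffered input feature value
        self.acc = 0       # running partial sum

    def load_feature(self, value: int) -> None:
        self.feature = value

    def mac(self, weight: int) -> None:
        self.acc += self.feature * weight

# A MAC computing group shares one weight register across its units:
units = [MacUnit() for _ in range(4)]
weight_register = 3
for u, f in zip(units, [1, 2, 3, 4]):
    u.load_feature(f)
    u.mac(weight_register)
print([u.acc for u in units])  # [3, 6, 9, 12]
```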
In some embodiments, the system 200 may be provided in a mobile device. The mobile device may be an unmanned aerial vehicle, an unmanned ship, an autonomous vehicle, a robot, or the like, which is not limited in the embodiments of the present application.
FIG. 5 is a schematic architecture diagram of a mobile device 500 according to an embodiment of the present application.
As shown in FIG. 5, the mobile device 500 may include a power system 510, a control system 520, a sensing system 530, and a processing system 540.
The power system 510 is configured to supply power to the mobile device 500.
Taking an unmanned aerial vehicle as an example, its power system may include an electronic speed controller (ESC), propellers, and motors corresponding to the propellers. A motor is connected between the electronic speed controller and a propeller, and the motor and the propeller are arranged on the corresponding arm; the electronic speed controller is configured to receive the driving signal generated by the control system and provide a driving current to the motor according to the driving signal, so as to control the rotating speed of the motor. The motor is configured to drive the propeller to rotate, thereby providing power for the flight of the unmanned aerial vehicle.
The sensing system 530 can be used to measure the attitude information of the mobile device 500, that is, the position information and state information of the mobile device 500 in space, for example, three-dimensional position, three-dimensional angle, three-dimensional velocity, three-dimensional acceleration, and three-dimensional angular velocity. The sensing system 530 may include, for example, at least one of sensors such as a gyroscope, an electronic compass, an Inertial Measurement Unit (IMU), a vision sensor, a Global Positioning System (GPS) receiver, a barometer, and an airspeed meter.
The sensing system 530 may also be used to acquire images; that is, the sensing system 530 includes a sensor for acquiring images, such as a camera.
The control system 520 is configured to control the movement of the mobile device 500. The control system 520 may control the mobile device 500 according to preset program instructions. For example, the control system 520 may control the movement of the mobile device 500 according to the attitude information of the mobile device 500 measured by the sensing system 530. The control system 520 may also control the mobile device 500 according to control signals from a remote controller. For example, for an unmanned aerial vehicle, the control system 520 may be a flight control system (flight controller), or a control circuit in the flight controller.
The processing system 540 may process the images acquired by the sensing system 530. For example, the processing system 540 may be an Image Signal Processing (ISP) chip.
The processing system 540 may be the system 200 in FIG. 2, or the processing system 540 may include the system 200 in FIG. 2.
It should be understood that the above division and naming of the components of the mobile device 500 are only exemplary and should not be understood as a limitation on the embodiments of the present application.
It should also be understood that the mobile device 500 may further include other components not shown in FIG. 5, which is not limited in the embodiments of the present application.
In the design of the intermediate storage medium, one implementation is to use a First Input First Output (FIFO) queue with a large bit width, where the bit width of the FIFO is the combined bit width of the multiple columns of data required for "interleaved" input and output, for example, the bit width of the 4 columns of data in FIG. 1b. However, using a wide FIFO as the intermediate cache for data input and output to the computing array wastes considerable storage space, which indirectly increases the area (cost) and power consumption of the chip, affects the efficiency of data access, and is unfavorable for platforms with high hardware-resource requirements, such as mobile devices.
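A rough back-of-the-envelope illustration of that wasted storage (an assumed model, not from the patent: each wide-FIFO entry is taken to span all 4 interleaved columns, while several cycles of the FIG. 1b schedule need fewer than 4 units):

```python
# Hypothetical illustration: units consumed per cycle for the
# FIG. 1b schedule {1},{2,5},{3,6,9},{4,7,10},{8,11},{12}.
schedule = [1, 2, 3, 3, 2, 1]
unit_width_bits = 8                     # assumed data-unit bit width
fifo_width_bits = 4 * unit_width_bits   # wide FIFO spans 4 columns

used = sum(schedule) * unit_width_bits        # 96 bits of payload
allocated = len(schedule) * fifo_width_bits   # 192 bits of entries
print(f"wide-FIFO utilization: {used / allocated:.0%}")  # 50%
# Four narrow 8b caches hold the same 12 units with no per-entry padding.
```

Under this assumed model only half of the allocated FIFO bits carry payload, whereas four narrow caches store exactly the 12 units.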
In view of this, the embodiments of the present application provide a technical solution that improves the efficiency of data access by improving the design of the intermediate storage medium. The technical solutions of the embodiments of the present application are described in detail below.
FIG. 6 shows a schematic flowchart of a data access method 600 according to an embodiment of the present application. The method 600 is executed by a processor, where the processor includes a computing array and a cache array, and the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array.
As shown in FIG. 6, the method 600 includes:
610: reading M*N data units from a memory into N input caches in the cache array with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1;
620: reading the data units in the N input caches into the computing array with a second access bit width, where the second access bit width is the bit width of each cache.
In the embodiments of the present application, the bit width of each cache in the cache array serving as the intermediate storage medium is equal to the bit width of the data units processed by the computing array. For example, the bit width of a cache may be the bit width of a feature value in the input feature map.
As shown in FIG. 7, if the bit width of the feature values in the input feature map is 8b (bits), a cache array with a per-cache bit width of 8b can be used.
Optionally, the cache array may be a RAM array, a FIFO array, a register (REG) array, or the like, which is not limited in the embodiments of the present application.
During the reading of data from the memory to the cache array, N data units can be read at one time and stored into the N input caches. That is, data is read with the first access bit width of N times the cache bit width: M*N data units are read from the memory into the N input caches, and one column of the M*N data units is stored into one of the N input caches.
For example, as shown in FIG. 7, to facilitate the interleaved input of data into the MAC computing array, 3*4 data units can be read into 4 input caches with an access bit width of 32b.
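The wide-read step can be sketched as follows, assuming 8b data units and the M=3, N=4 layout of the FIG. 7 example (the variable names are illustrative, not the patent's):

```python
import numpy as np

M, N = 3, 4                      # 3*4 data units, as in FIG. 7
memory = np.arange(1, M * N + 1, dtype=np.uint8).reshape(M, N)

# N input caches, each as narrow as one 8b data unit.
input_caches = [[] for _ in range(N)]

# Each 32b (N * 8b) read moves one row of N units; column j of the
# M*N block always lands in input cache j.
for row in memory:               # one first-access-bit-width read per row
    for j, unit in enumerate(row):
        input_caches[j].append(int(unit))

print(input_caches)
# [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]
```

A single 32b read fills one position of all four caches at once; after M reads, cache j holds exactly column j of the block.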
During the reading of data from the cache array to the computing array, data units can be read from each cache with the cache bit width (the second access bit width), to meet the needs of the computing array's data processing.
Optionally, the data units in the N input caches may be read into the computing array according to the processing order of the computing array with the second access bit width.
For example, for a convolutional neural network, the data units are feature values in feature maps, and the processing order is the processing order of the convolutional neural network.
For example, as shown in FIG. 7, according to the processing order of the MAC computing array, data units 1-12 need to enter the MAC computing array in the order {1}, {2,5}, {3,6,9}, {4,7,10}, {8,11}, {12}. Because the bit width of a cache equals the bit width of a data unit, the MAC computing array can read one data unit at a time with the cache's access bit width, and the data units required for computation can therefore be read in the above order.
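Continuing the sketch above, the interleaved order can be reproduced by walking the anti-diagonals of the M*N block, with each narrow (8b) read taking one unit from one cache (again an illustrative sketch; the patent does not prescribe this particular scheduling code):

```python
def interleaved_readout(caches, m):
    """Yield per-cycle groups of data units in the FIG. 1b order
    {1},{2,5},{3,6,9},{4,7,10},{8,11},{12}: cycle d serves every
    element (i, j) of the M*N block with i + j == d."""
    n = len(caches)
    for d in range(m + n - 1):           # one "cycle" per anti-diagonal
        group = []
        for i in range(m):               # row index within the block
            j = d - i                    # which cache to read this cycle
            if 0 <= j < n:
                group.append(caches[j][i])  # one narrow (8b) read
        yield group

print(list(interleaved_readout(input_caches, M)))
# [[1], [2, 5], [3, 6, 9], [4, 7, 10], [8, 11], [12]]
```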
For the output of computation results, a manner corresponding to the input may be adopted. The data units processed by the computing array may first be stored into N output caches in the cache array with the second access bit width; then the M*N data units in the N output caches may be stored into the memory with the first access bit width.
That is, for the output of data from the computing array to the cache array, data units can be output at data-unit granularity with the cache's access bit width; for the output of data from the cache array to the memory, N data units belonging to the same output feature map can be output at one time to the corresponding output feature map with the first access bit width of N times the cache bit width.
For example, as shown in FIG. 8, for data units a-l, each data unit can first be stored into the corresponding position in the 4 output caches at data-unit granularity (the second access bit width), and then the data units of the same output feature map can be stored into the corresponding output feature map in the memory at a granularity of 4 data units (the first access bit width).
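The output path mirrors the input path; a minimal sketch follows (the letters a-l and the 3*4 shape match the FIG. 8 example, while the staging code itself is an assumption for illustration):

```python
import numpy as np

M, N = 3, 4
results = "abcdefghijkl"                 # data units a-l from the array

# Stage 1: narrow (second-access-bit-width) stores, one unit at a time,
# into the slot of the output cache that matches its column position.
output_caches = [[None] * M for _ in range(N)]
for k, unit in enumerate(results):
    i, j = divmod(k, N)                  # position within the M*N block
    output_caches[j][i] = unit

# Stage 2: wide (first-access-bit-width) writes, N units per transfer,
# assembling one full row of the output feature map at a time.
output_feature_map = np.array(
    [[output_caches[j][i] for j in range(N)] for i in range(M)]
)
print(output_feature_map)
# [['a' 'b' 'c' 'd']
#  ['e' 'f' 'g' 'h']
#  ['i' 'j' 'k' 'l']]
```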
It should be understood that, when the processor is an on-chip device, the memory may be an on-chip memory or an off-chip memory. The processor may also include the memory.
In the technical solution of the embodiments of the present application, a cache array whose per-cache bit width equals the bit width of the data units processed by the computing array is used as the intermediate cache for data access. The required cache-array bit width is low, fewer resources are occupied, and the cache array can be adapted to the data access required by the computing array, which improves the efficiency of data access.
The data access method of the embodiments of the present application has been described in detail above; the processor, the computer system, and the mobile device of the embodiments of the present application are described below. It should be understood that the processor, the computer system, and the mobile device of the embodiments of the present application can execute the foregoing methods of the embodiments of the present application; that is, for the specific working processes of the following products, reference may be made to the corresponding processes in the foregoing method embodiments.
FIG. 9 shows a schematic block diagram of a processor 900 of the present application.
As shown in FIG. 9, the processor 900 may include a computing array 910 and a cache array 920.
The bit width of each cache in the cache array 920 is equal to the bit width of the data units processed by the computing array 910.
The cache array 920 is configured to read M*N data units from a memory into N input caches in the cache array 920 with a first access bit width, where the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1.
The computing array 910 is configured to read the data units in the N input caches into the computing array 910 with a second access bit width, where the second access bit width is the bit width of each cache.
Optionally, in an embodiment of the present application, the computing array 910 is configured to read the data units in the N input caches into the computing array according to the processing order of the computing array 910 with the second access bit width.
Optionally, in an embodiment of the present application, the data units are feature values in feature maps, and the processing order is the processing order of a convolutional neural network.
Optionally, in an embodiment of the present application, the computing array 910 is further configured to store the data units processed by the computing array 910 into N output caches in the cache array 920 with the second access bit width;
the cache array 920 is further configured to store the M*N data units in the N output caches into the memory with the first access bit width.
Optionally, in an embodiment of the present application, the cache array 920 is a random access memory (RAM) array, a first-in-first-out (FIFO) array, or a register (REG) array.
Optionally, in an embodiment of the present application, the processor is an on-chip device, and the memory is an on-chip memory or an off-chip memory.
Optionally, in an embodiment of the present application, the computing array 910 is a multiply-accumulate (MAC) computing array.
Optionally, in an embodiment of the present application, the processor 900 further includes the memory.
It should be understood that the processor of the embodiments of the present application may be a chip, which may specifically be implemented by circuits, but the embodiments of the present application do not limit the specific implementation form.
FIG. 10 shows a schematic block diagram of a computer system 1000 according to an embodiment of the present application.
As shown in FIG. 10, the computer system 1000 may include a processor 1010 and a memory 1020.
It should be understood that the computer system 1000 may further include components generally included in other computer systems, for example, input and output devices, communication interfaces, and the like, which is not limited in the embodiments of the present application.
The memory 1020 is configured to store computer-executable instructions.
The memory 1020 may be various types of memory; for example, it may include a high-speed random access memory (RAM), and may also include a non-volatile memory, such as at least one magnetic disk memory, which is not limited in the embodiments of the present application.
The processor 1010 is configured to access the memory 1020 and execute the computer-executable instructions to perform the operations in the data access methods of the various embodiments of the present application described above.
The processor 1010 may include a microprocessor, a Field-Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and the like, which is not limited in the embodiments of the present application.
An embodiment of the present application further provides a mobile device, which may include the processor or the computer system of the various embodiments of the present application described above.
The processor, the computer system, and the mobile device of the embodiments of the present application may correspond to the execution subject of the data access method of the embodiments of the present application, and the above and other operations and/or functions of each module in the processor, the computer system, and the mobile device are respectively intended to implement the corresponding processes of the foregoing methods; for brevity, they are not repeated here.
An embodiment of the present application further provides a computer storage medium, which stores program code, where the program code may be used to instruct execution of the data access methods of the embodiments of the present application described above.
It should be understood that, in the embodiments of the present application, the term "and/or" is merely an association relationship describing associated objects, and indicates that three relationships may exist. For example, A and/or B may indicate three situations: A alone, both A and B, and B alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above generally in terms of functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are only schematic; the division of the units is only a logical function division, and in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place, or may be distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (18)

  1. A data access method for a processor, wherein the processor includes a computing array and a cache array, and the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array;
    the method comprising:
    reading M*N data units from a memory into N input caches in the cache array with a first access bit width, wherein the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1;
    reading the data units in the N input caches into the computing array with a second access bit width, wherein the second access bit width is the bit width of each cache.
  2. The method according to claim 1, wherein the reading the data units in the N input caches into the computing array with a second access bit width comprises:
    reading the data units in the N input caches into the computing array according to the processing order of the computing array with the second access bit width.
  3. The method according to claim 2, wherein the data units are feature values in feature maps, and the processing order is the processing order of a convolutional neural network.
  4. The method according to any one of claims 1 to 3, further comprising:
    storing the data units processed by the computing array into N output caches in the cache array with the second access bit width;
    storing the M*N data units in the N output caches into the memory with the first access bit width.
  5. The method according to any one of claims 1 to 4, wherein the cache array is a random access memory (RAM) array, a first-in-first-out (FIFO) array, or a register (REG) array.
  6. The method according to any one of claims 1 to 5, wherein the processor is an on-chip device, and the memory is an on-chip memory or an off-chip memory.
  7. The method according to any one of claims 1 to 6, wherein the computing array is a multiply-accumulate (MAC) computing array.
  8. The method according to any one of claims 1 to 7, wherein the processor further includes the memory.
  9. A processor, comprising: a computing array and a cache array;
    wherein the bit width of each cache in the cache array is equal to the bit width of the data units processed by the computing array;
    the cache array is configured to read M*N data units from a memory into N input caches in the cache array with a first access bit width, wherein the first access bit width is N times the bit width of each cache, one column of the M*N data units is stored into one of the N input caches, and M and N are positive integers greater than 1;
    the computing array is configured to read the data units in the N input caches into the computing array with a second access bit width, wherein the second access bit width is the bit width of each cache.
  10. The processor according to claim 9, wherein the computing array is configured to read the data units in the N input caches into the computing array according to the processing order of the computing array with the second access bit width.
  11. The processor according to claim 10, wherein the data units are feature values in feature maps, and the processing order is the processing order of a convolutional neural network.
  12. The processor according to any one of claims 9 to 11, wherein the computing array is further configured to store the data units processed by the computing array into N output caches in the cache array with the second access bit width;
    the cache array is further configured to store the M*N data units in the N output caches into the memory with the first access bit width.
  13. The processor according to any one of claims 9 to 12, wherein the cache array is a random access memory (RAM) array, a first-in-first-out (FIFO) array, or a register (REG) array.
  14. The processor according to any one of claims 9 to 13, wherein the processor is an on-chip device, and the memory is an on-chip memory or an off-chip memory.
  15. The processor according to any one of claims 9 to 14, wherein the computing array is a multiply-accumulate (MAC) computing array.
  16. The processor according to any one of claims 9 to 15, wherein the processor further includes the memory.
  17. A computer system, comprising:
    a memory for storing computer-executable instructions;
    a processor for accessing the memory and executing the computer-executable instructions to perform the operations in the method according to any one of claims 1 to 8.
  18. A mobile device, comprising:
    the processor according to any one of claims 9 to 16; or,
    the computer system according to claim 17.
PCT/CN2018/096904 2018-07-24 2018-07-24 Data access method, processor, computer system, and mobile device WO2020019174A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2018/096904 WO2020019174A1 (zh) 2018-07-24 2018-07-24 Data access method, processor, computer system, and mobile device
CN201880038925.1A CN110892373A (zh) 2018-07-24 2018-07-24 Data access method, processor, computer system, and mobile device
US17/120,467 US20210133093A1 (en) 2018-07-24 2020-12-14 Data access method, processor, computer system, and mobile device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/096904 WO2020019174A1 (zh) 2018-07-24 2018-07-24 Data access method, processor, computer system, and mobile device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/120,467 Continuation US20210133093A1 (en) 2018-07-24 2020-12-14 Data access method, processor, computer system, and mobile device

Publications (1)

Publication Number Publication Date
WO2020019174A1 true WO2020019174A1 (zh) 2020-01-30

Family

ID=69181114

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096904 WO2020019174A1 (zh) 2018-07-24 2018-07-24 Data access method, processor, computer system, and mobile device

Country Status (3)

Country Link
US (1) US20210133093A1 (zh)
CN (1) CN110892373A (zh)
WO (1) WO2020019174A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111599389B * 2020-05-13 2022-09-06 芯颖科技有限公司 Data access method, data access circuit, chip, and electronic device
US11175957B1 * 2020-09-22 2021-11-16 International Business Machines Corporation Hardware accelerator for executing a computation task
CN112967172A * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing apparatus and method, computer device, and storage medium
CN112835842B * 2021-03-05 2024-04-30 深圳市汇顶科技股份有限公司 Endianness processing method, circuit, chip, and electronic terminal
CN113448624B * 2021-07-15 2023-06-27 安徽聆思智能科技有限公司 Data access method, apparatus, and system, and AI accelerator
CN117196931B * 2023-11-08 2024-02-09 苏州元脑智能科技有限公司 Sensor-array-oriented data processing method, FPGA, and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077123A * 2013-01-15 2013-05-01 华为技术有限公司 Data writing and reading method and device
CN103902507A * 2014-03-28 2014-07-02 中国科学院自动化研究所 Matrix multiplication computing device and method for a programmable algebra processor
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
US20170270979A1 * 2016-03-15 2017-09-21 Maxlinear, Inc. Methods and systems for parallel column twist interleaving
CN108171317A * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 SoC-based data-reuse convolutional neural network accelerator

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014169480A1 * 2013-04-19 2014-10-23 中国科学院自动化研究所 Parallel filtering method and corresponding apparatus
CN105843589B * 2016-03-18 2018-05-08 同济大学 Memory device applied to a VLIW-type processor
CN108229645B * 2017-04-28 2021-08-06 北京市商汤科技开发有限公司 Convolution acceleration and computation processing method and apparatus, electronic device, and storage medium
CN107451659B * 2017-07-27 2020-04-10 清华大学 Neural network accelerator for bit-width partitioning and implementation method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077123A * 2013-01-15 2013-05-01 华为技术有限公司 Data writing and reading method and device
CN103902507A * 2014-03-28 2014-07-02 中国科学院自动化研究所 Matrix multiplication computing device and method for a programmable algebra processor
US20170270979A1 * 2016-03-15 2017-09-21 Maxlinear, Inc. Methods and systems for parallel column twist interleaving
CN106940815A * 2017-02-13 2017-07-11 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN108171317A * 2017-11-27 2018-06-15 北京时代民芯科技有限公司 SoC-based data-reuse convolutional neural network accelerator

Also Published As

Publication number Publication date
CN110892373A (zh) 2020-03-17
US20210133093A1 (en) 2021-05-06

Similar Documents

Publication Publication Date Title
WO2020019174A1 (zh) Data access method, processor, computer system, and mobile device
WO2019104638A1 (zh) Neural network processing method and apparatus, accelerator, system, and mobile device
WO2018090308A1 (en) Enhanced localization method and apparatus
US11715224B2 (en) Three-dimensional object reconstruction method and apparatus
US11562214B2 (en) Methods for improving AI engine MAC utilization
US11604594B2 (en) Apparatus, system and method for offloading data transfer operations between source and destination storage devices to a hardware accelerator
WO2018218481A1 (zh) Neural network training method and apparatus, computer system, and mobile device
JP6441586B2 (ja) 情報処理装置および情報処理方法
WO2019215907A1 (ja) 演算処理装置
US11482009B2 (en) Method and system for generating depth information of street view image using 2D map
US20160357668A1 (en) Parallel caching architecture and methods for block-based data processing
CN110296717B (zh) 一种事件数据流的处理方法及计算设备
US10377039B1 (en) Tagged robot sensor data
Goldberg et al. Stereo and IMU assisted visual odometry on an OMAP3530 for small robots
US11288768B2 (en) Application processor including reconfigurable scaler and devices including the processor
WO2020155044A1 (zh) Convolution computation apparatus and method, processor, and mobile device
US20200134771A1 (en) Image processing method, chip, processor, system, and mobile device
WO2021102946A1 (zh) Computing apparatus and method, processor, and mobile device
US20210392269A1 (en) Motion sensor in memory
US20200356844A1 (en) Neural network processor for compressing featuremap data and computing system including the same
WO2020073164A1 (zh) Data storage apparatus and method, processor, and mobile device
CN110770699A (zh) Data instruction processing method, memory chip, storage system, and movable platform
US11500802B1 (en) Data replication for accelerator
CN116113926A (zh) 神经网络电路以及神经网络电路的控制方法
JPWO2018109847A1 (ja) 制御装置、撮像装置、移動体、制御方法、およびプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18927310; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 18927310; Country of ref document: EP; Kind code of ref document: A1)