CN116362304A - Data processing device, data processing method and related device - Google Patents

Data processing device, data processing method and related device

Info

Publication number
CN116362304A
Authority
CN
China
Prior art keywords
data
space
cache
sub
annular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111584175.8A
Other languages
Chinese (zh)
Inventor
高迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111584175.8A priority Critical patent/CN116362304A/en
Publication of CN116362304A publication Critical patent/CN116362304A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application provides a data processing device, a data processing method and a related device. The device comprises a neural network processor. First data, representing the data to be stored, is divided into n pieces of first sub-data according to its channel number n, where n is a positive integer less than or equal to M/2; each piece of first sub-data is then written into a circular buffer space formed by INT(M/n) annular buffer spaces, any one circular buffer space storing any one piece of first sub-data. Through this specific storage space architecture, the area of the buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.

Description

Data processing device, data processing method and related device
Technical Field
The present disclosure relates to the technical field of neural network processors, and in particular, to a data processing apparatus, a data processing method, and a related apparatus.
Background
To enhance the artificial intelligence capability of a device, a neural network processor (Neural-network Processing Unit, NPU) is commonly integrated into the system. An NPU typically adopts a data-driven parallel computing architecture to accelerate neural network operations and overcome the low efficiency of conventional chips on such workloads. How to reduce the power consumption of the NPU has therefore become a challenge.
Disclosure of Invention
In view of this, the present application provides a data processing apparatus, a data processing method, and a related apparatus, which can reduce the area of a buffer control circuit in a neural network processor through a specific memory space architecture, and reduce the power consumption of data writing and reading.
In a first aspect, an embodiment of the present application provides a data processing apparatus, including a neural network processor, where the neural network processor includes a processing unit array and M storage modules, where the processing unit array includes M columns of processing unit sets, and M is an even number;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing first data in a distributed mode, and the M-column processing unit set is used for reading the first data stored in the distributed mode from the M storage modules.
In a second aspect, an embodiment of the present application provides a data processing method, which is applied to the data processing apparatus according to the first aspect of the embodiment of the present application, where the method includes:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In a third aspect, an embodiment of the present application provides a system-on-chip including a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In a fourth aspect, embodiments of the present application provide an electronic device, including a memory for storing a program and a processor executing the program stored in the memory, where the processor is configured to execute instructions of steps in the method according to any one of the second aspects of the embodiments of the present application when the program stored in the memory is executed.
In a fifth aspect, embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform a method according to any one of the second aspects of embodiments of the present application.
In a sixth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the second aspect of embodiments of the present application. The computer program product may be a software installation package.
It can be seen that, with the above data processing apparatus, data processing method and related apparatus, the apparatus includes a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the M storage modules are configured to store first data in a distributed manner, and the M columns of processing unit sets are configured to read the distributed first data from the M storage modules. In this way, the area of the buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a memory module according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an annular buffer space according to an embodiment of the present disclosure;
FIG. 4 is an exemplary block diagram of a circular cache space according to an embodiment of the present application;
fig. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 6 is a schematic architecture diagram of a neural network processor according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating power consumption comparison according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a system-on-chip according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that the term "and/or" merely describes an association relationship between the associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In this context, the character "/" indicates that the associated objects before and after it are in an "or" relationship. The term "plurality" as used in the embodiments herein refers to two or more.
The "connection" in the embodiments of the present application refers to various connection manners such as direct connection or indirect connection, so as to implement communication between devices, which is not limited in any way in the embodiments of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following is a description of the background art and related terms of the present application.
The background technology is related:
In NPU architecture design, multiple levels of cache are typically designed to increase the data bandwidth of the system. The basic computational unit in an NPU is typically an arithmetic logic unit (Arithmetic and Logic Unit, ALU) with storage, also called a processing element (Processing Element, PE). The storage inside the PE is referred to as the level-0 cache. The data processed by a neural network is usually divided into a plurality of channels, such as image data comprising a plurality of color channels. In an NPU systolic-array architecture for convolutional neural networks, one common design is to split the buffer into separate small buffers, composed of static random access memories (Static Random Access Memory, SRAM), that correspond to different channels so as to increase the bandwidth available to different PEs, with each SRAM corresponding to one column of PEs. When a certain column of PEs in the PE array needs to access data in an SRAM other than the one corresponding to it, one existing method is to add an extra bus between the buffers of the different SRAMs to support data access across SRAMs, which increases the area and power consumption of the chip; another method is to splice multiple SRAMs into one large SRAM, but because different SRAMs hold overlapping portions of data, the overlapping data has to be fetched from the dynamic random access memory (Dynamic Random Access Memory, DRAM) multiple times, and the power consumption of each DRAM read is about 100 times that of an SRAM access, which greatly increases the power consumption.
In order to solve the above problems, the present application provides a data processing apparatus, a data processing method, and a related apparatus, which can use a new storage space architecture, reduce the area of a cache control circuit in a neural network processor, and reduce the power consumption of data writing and reading.
In the following, a data processing apparatus according to an embodiment of the present application will be described with reference to fig. 1, and fig. 1 is a schematic architecture diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 100 includes a processing unit array 110 and M memory modules 120, and the processing unit array 110 includes M columns of processing unit sets 111, where M is an even number.
Each storage module 120 comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, where the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules 120 are configured to store first data in a distributed manner, and the M-column processing unit set 111 is configured to read the first data stored in a distributed manner from the M storage modules 120.
The storage module 120 may be an SRAM, and the processing unit set 111 may include a plurality of PEs.
For easy understanding, an arbitrary one of the memory modules 120 in the embodiments of the present application is described separately with reference to fig. 2, and fig. 2 is a schematic structural diagram of a memory module provided in the embodiments of the present application, where the memory module includes α Ring buffer spaces, that is, ring buffer spaces 0 to α -1 in the figures, and different Ring buffer spaces correspond to different layers in a neural network.
In one possible embodiment, when the annular buffer space of each storage module 120 is allocated, it can be allocated in units of the space occupied by an even number of rows of image pixels, so as to ensure that the buffer addresses of the different storage modules 120 are aligned within the same layer of the neural network.
Further, the annular buffer space in the embodiment of the present application is described with reference to fig. 3. Fig. 3 is a schematic structural diagram of an annular buffer space provided in the embodiment of the present application, where the annular buffer space includes x lines of cache address space, that is, line 0 to line x-1 in the drawing; when the convolution kernel is N×N, the front N-1 lines may be set as the head address space, and the rear N-1 lines as the tail address space.
Further, the circular buffer space in the embodiment of the present application is described with reference to fig. 4. Fig. 4 is an exemplary structure diagram of a circular buffer space provided in the embodiment of the present application. Suppose that 2 annular buffer spaces form the circular buffer space, each annular buffer space includes 8 line buffer address spaces, that is, line 0 to line 7 in the first annular buffer space and line 8 to line 15 in the second annular buffer space (the middle lines are not shown), and the convolution kernel size is 3×3. It can then be determined that the first 2 lines and the last 2 lines of the first annular buffer space and the first 2 lines and the last 2 lines of the second annular buffer space are mapped to each other: line 6 and line 7 are mapped to the 2-line head shadow space before line 8, line 8 and line 9 are mapped to the 2-line tail shadow space after line 7, and similarly, line 14 and line 15 are mapped to the 2-line head shadow space before line 0, and line 0 and line 1 are mapped to the 2-line tail shadow space after line 15, thus forming the circular buffer space. It will be appreciated that a shadow space does not actually store data; it only represents an address space for which a mapping relationship exists, and when data in a shadow space is required, the data can be retrieved from the source space to which the shadow space maps. For example, when the data of line 7, line 8 and line 9 is needed, the data of line 7 is read from the first annular buffer space, and the data of line 8 and line 9 is read from the second annular buffer space according to the mapping relation.
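As an aid to understanding, the following is a minimal sketch in Python (purely illustrative; the function resolve and the constants X, N and RINGS are assumed names, not part of the patent) of the address redirection described above, using the parameters of fig. 4: 2 annular buffer spaces of 8 lines each and a 3×3 convolution kernel.

X = 8          # lines per annular buffer space (x in the text)
N = 3          # convolution kernel is N x N
RINGS = 2      # annular buffer spaces chained into one circular buffer space

def resolve(ring: int, line: int) -> tuple[int, int]:
    """Map a (ring, line) request, including shadow lines, to the ring and
    physical line that actually hold the data. 0 <= line < X is a real line;
    X <= line < X+N-1 is the tail shadow (first mapping relation: the next
    ring's head lines); -(N-1) <= line < 0 is the head shadow (second mapping
    relation: the previous ring's tail lines)."""
    if 0 <= line < X:
        return ring, line                       # ordinary line, no redirection
    if X <= line < X + (N - 1):
        return (ring + 1) % RINGS, line - X     # tail shadow -> next ring's head
    if -(N - 1) <= line < 0:
        return (ring - 1) % RINGS, X + line     # head shadow -> previous ring's tail
    raise ValueError("line outside the ring and its shadow spaces")

# Reading lines 7, 8 and 9 as in the example: line 7 is a real line of the first
# ring, while lines 8 and 9 fall in its tail shadow and resolve to the first two
# lines of the second ring.
assert resolve(0, 7) == (0, 7)
assert resolve(0, 8) == (1, 0)
assert resolve(0, 9) == (1, 1)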
By the data processing device, the distributed storage of the data can be executed, and the power consumption when the data stored in the distributed storage is read is low.
A data processing method in the embodiment of the present application is described below with reference to fig. 5, and fig. 5 is a schematic flow chart of the data processing method provided in the embodiment of the present application, which specifically includes the following steps:
In step 501, the first data is divided into n pieces of first sub-data according to the number n of channels of the first data.
The first data represents the data to be stored, and n is a positive integer less than or equal to M/2. For example, the first data may be image data having n channels; to ensure that the data of each channel can be evenly distributed and stored across the M storage modules, n is required to be less than or equal to M/2.
Step 502, writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces.
Any one of the circular buffer spaces is used for storing any one piece of the first sub-data.
Specifically, the number of rows y of each first sub-data may be acquired, and each first sub-data may be sequentially written into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
It can be understood that when n is 2 and M is 4, there are 2 first sub-data and 2 circular buffer spaces can be constructed to store them respectively, that is, each first sub-data is written into a circular buffer space formed by INT(4/2) = 2 annular buffer spaces; when n is 2 and M is 9, there are likewise 2 first sub-data and 2 circular buffer spaces can be constructed to store them respectively, that is, each first sub-data is written into a circular buffer space formed by INT(9/2) = 4 annular buffer spaces, so that maximally distributed storage can be ensured.
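For illustration, the channel-to-buffer assignment of steps 501 and 502 can be sketched as follows (Python, with assumed names plan_write, num_channels and num_srams; an illustrative model, not the hardware implementation): each of the n channels is given a circular buffer space built from INT(M/n) annular buffer spaces, one per storage module.

def plan_write(num_channels: int, num_srams: int) -> list[list[int]]:
    """Return, for each channel (each first sub-data), the indices of the
    storage modules whose annular buffer spaces are chained into that
    channel's circular buffer space."""
    assert 1 <= num_channels <= num_srams // 2, "requires n <= M/2"
    rings_per_channel = num_srams // num_channels          # INT(M/n)
    return [
        list(range(c * rings_per_channel, (c + 1) * rings_per_channel))
        for c in range(num_channels)
    ]

# n = 2, M = 4: each channel gets INT(4/2) = 2 annular buffer spaces.
print(plan_write(2, 4))   # [[0, 1], [2, 3]]
# n = 2, M = 9: each channel gets INT(9/2) = 4 annular buffer spaces.
print(plan_write(2, 9))   # [[0, 1, 2, 3], [4, 5, 6, 7]]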
Step 503, reading the first sub-data in each circular cache space by each processing unit set.
Specifically, the INT(M/n)×x lines of the INT(M/n) annular buffer spaces can be read sequentially and cyclically over y rounds to read out each first sub-data.
In one possible embodiment, when the first sub-data in any head shadow space needs to be read, the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space can be read according to the second mapping relation.
In one possible embodiment, when the first sub-data in any tail shadow space needs to be read, the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space is read according to the first mapping relation.
In one possible embodiment, when no head shadow space or tail shadow space needs to be read, the first sub-data of each line is read in turn.
By the above data processing method, distributed storage of the data can be realized. For example, when the number of channels is 2 and the number of SRAMs is 8, four SRAMs can be spliced together for use by one channel, so that the image that can be stored is 4 times the size that could be stored without such distribution. When the data is read, data in a shadow space is read from its corresponding source space according to the mapping relation, without additional accesses to external memory, which greatly reduces the power consumption.
For ease of understanding, the data processing apparatus and the data processing method of the present application are described below with an example. Suppose that 4 storage modules are provided, namely SRAM0, SRAM1, SRAM2 and SRAM3, and the first data to be stored is image data of 2 channels. In the conventional method, the data of one channel is generally stored in SRAM0 and the data of the other channel in SRAM1, while SRAM2 and SRAM3 both remain empty, which is very wasteful of resources.
In this scheme, SRAM0 and SRAM2 are paired so that a mapping relationship exists between them, with the head and tail address spaces of SRAM0 and SRAM2 mapped to each other; likewise, SRAM1 and SRAM3 are paired, with the head and tail address spaces of SRAM1 and SRAM3 mapped to each other.
Assume that the data to be stored is an image, i.e. 2-channel data of 2800 rows. In the conventional method, channel 1 is stored in the annular storage space 1 of the allocated storage module 1, and 20 lines of data can be stored in the annular storage space 1; correspondingly, 20 lines of data can also be stored in the annular storage space 2 of the other storage module 2, so one fetch can retrieve 40 lines of data. Based on the foregoing analysis, in the case where the convolution kernel is 7×7, each fetch of 40 lines contains 24 lines of overlapping data, and a total of 70 rounds are required to process all the image data. The overlapping data that must be additionally fetched over these 70 rounds amounts to 1680 rows, which is 1680/2800 = 0.6 of the total image rows; that is, 60% of the data is overlapping data that has to be transported repeatedly. If the mapping strategy proposed by this application is adopted, the power consumption caused by moving this 60% of the data can be saved.
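The figures quoted in this example can be checked with a few lines of arithmetic (Python, illustrative only; the 24-line overlap per round is taken directly from the example above):

total_rows        = 2800          # rows of the 2-channel image data
lines_per_fetch   = 20 + 20       # annular storage space 1 + annular storage space 2
overlap_per_round = 24            # overlapping lines re-fetched each round (7x7 kernel)

rounds        = total_rows // lines_per_fetch      # 70 rounds
overlap_rows  = rounds * overlap_per_round         # 1680 rows
overlap_ratio = overlap_rows / total_rows          # 0.6, i.e. 60% redundant transfers

print(rounds, overlap_rows, overlap_ratio)         # 70 1680 0.6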
As shown in fig. 7, which is a schematic diagram of power consumption comparison provided in the embodiment of the present application, the hatched portion represents the power consumption of reading the overlapping area; the power consumption of this scheme is clearly lower than that of the existing scheme.
It should be understood that the foregoing is merely illustrative, and the data of one channel may be divided between any two memory modules; for example, the data of one channel may be stored in SRAM0 and SRAM3, or in SRAM1 and SRAM2, etc., which is not limited herein.
The scenario to which this embodiment of the application is applicable is:
n × 2 ≤ M
where n is the number of channels of the data to be stored, and M is the number of storage modules.
By the above data processing device and data processing method, in a neural network processor architecture that does not support access between SRAMs of the same cache level, a plurality of SRAMs are chained together in address space by means of the mapping, so that the space in the SRAMs is fully utilized and the amount of data that can be stored is increased, without increasing the power consumption or the chip area.
A system-on-chip in an embodiment of the present application is described below with reference to fig. 8. The system-on-chip 800 includes a neural network processor 810, where the neural network processor 810 includes a processing unit array and M memory modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor 810 is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In one possible embodiment, in the aspect of writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces, the neural network processor 810 is specifically configured to:
acquiring the number of rows y of each first sub-data;
and sequentially writing each first sub-data into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
In one possible embodiment, after writing each first sub-data into a circular buffer space consisting of INT (M/n) ring buffer spaces, the neural network processor is further configured to:
the first sub-data in each circular cache space is read by each processing unit set.
In one possible embodiment, in reading the first sub-data in each circular cache space by each set of processing units, the neural network processor 810 is specifically configured to:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
In one possible embodiment, in the aspect of sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data, the neural network processor 810 is specifically configured to:
when the first sub-data in any head shadow space needs to be read, reading the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space according to the second mapping relation;
when the first sub-data in any tail shadow space needs to be read, reading the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space according to the first mapping relation;
and when no head shadow space or tail shadow space needs to be read, sequentially reading the first sub-data of each line.
Therefore, the area of a buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.
An electronic device in the embodiment of the present application is described below with reference to fig. 9. Fig. 9 is a schematic structural diagram of an electronic device provided in the embodiment of the present application. As shown in fig. 9, the electronic device 900 includes a processor 901, a communication interface 902 and a memory 903, which are connected to each other. The electronic device 900 may further include a bus 904, through which the processor 901, the communication interface 902 and the memory 903 may be connected to each other; the bus 904 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 904 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus. The memory 903 is used to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform all or part of the method described above with reference to fig. 5.
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the electronic device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
The present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps of any one of the methods described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and the core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A data processing device, characterized by comprising a neural network processor, wherein the neural network processor comprises a processing unit array and M storage modules, the processing unit array comprises M columns of processing unit sets, and M is an even number;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing first data in a distributed mode, and the M-column processing unit set is used for reading the first data stored in the distributed mode from the M storage modules.
2. A data processing method applied to the data processing apparatus of claim 1, the method comprising:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
3. The method of claim 2, wherein the writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces comprises:
acquiring the number of rows y of each first sub-data;
and sequentially writing each first sub-data into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
4. The method of claim 2, wherein after writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces, the method further comprises:
the first sub-data in each circular cache space is read by each processing unit set.
5. The method of any of claims 2-4, wherein the reading, by each set of processing units, the first sub-data in each circular cache space comprises:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
6. The method of claim 5, wherein the sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data comprises:
when the first sub-data in any head shadow space needs to be read, reading the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space according to the second mapping relation;
when the first sub-data in any tail shadow space needs to be read, reading the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space according to the first mapping relation;
and when no head shadow space or tail shadow space needs to be read, sequentially reading the first sub-data of each line.
7. A system-on-chip, characterized by comprising a neural network processor, wherein the neural network processor comprises a processing unit array and M storage modules, the processing unit array comprises M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
8. The system-on-chip of claim 7, wherein the neural network processor is specifically configured to, in reading the first sub-data in each of the circular cache spaces by each of the sets of processing units:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
9. An electronic device, comprising: a memory for storing a program and a processor for executing the program stored in the memory, the processor being configured to execute the data processing method according to any one of claims 2 to 6 when the program stored in the memory is executed.
10. A computer storage medium storing program code for execution by an electronic device, the program code comprising instructions for performing the method of any one of claims 2 to 6.
CN202111584175.8A 2021-12-22 2021-12-22 Data processing device, data processing method and related device Pending CN116362304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111584175.8A CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111584175.8A CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Publications (1)

Publication Number Publication Date
CN116362304A true CN116362304A (en) 2023-06-30

Family

ID=86938968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111584175.8A Pending CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Country Status (1)

Country Link
CN (1) CN116362304A (en)

Similar Documents

Publication Publication Date Title
JP6767660B2 (en) Processor, information processing device and how the processor operates
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN112967172B (en) Data processing device, method, computer equipment and storage medium
CN112416433A (en) Data processing device, data processing method and related product
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN110009103B (en) Deep learning convolution calculation method and device
CN110009644B (en) Method and device for segmenting line pixels of feature map
JP2024516514A (en) Memory mapping of activations for implementing convolutional neural networks
CN118193443A (en) Data loading method for processor, computing device and medium
CN111125628A (en) Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
CN109844774B (en) Parallel deconvolution computing method, single-engine computing method and related products
CN112631955B (en) Data processing method, device, electronic equipment and medium
CN110298441B (en) Data processing method, electronic device and computer readable storage medium
US11868875B1 (en) Data selection circuit
CN116362304A (en) Data processing device, data processing method and related device
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
CN116362303A (en) Data processing device, data processing method and related device
CN112596881B (en) Storage component and artificial intelligence processor
CN115204380A (en) Data storage and array mapping method and device of storage-computation integrated convolutional neural network
CN114996647A (en) Arithmetic unit, related device and method
CN111291884B (en) Neural network pruning method, device, electronic equipment and computer readable medium
US8812813B2 (en) Storage apparatus and data access method thereof for reducing utilized storage space
CN115145842A (en) Data cache processor and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination