CN116362304A - Data processing device, data processing method and related device - Google Patents

Data processing device, data processing method and related device

Info

Publication number
CN116362304A
Authority
CN
China
Prior art keywords
data
space
cache
sub
annular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111584175.8A
Other languages
Chinese (zh)
Inventor
高迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202111584175.8A priority Critical patent/CN116362304A/en
Publication of CN116362304A publication Critical patent/CN116362304A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application provides a data processing device, a data processing method and a related device. The device comprises a neural network processor. First data, representing the data to be stored, is divided into n pieces of first sub-data according to its channel number n, where n is a positive integer less than or equal to M/2; each piece of first sub-data is then written into a circular buffer space formed by INT(M/n) annular buffer spaces, any one circular buffer space storing any one piece of first sub-data. Through this specific storage space architecture, the area of the buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.

Description

Data processing device, data processing method and related device
Technical Field
The present disclosure relates to the technical field of neural network processors, and in particular, to a data processing apparatus, a data processing method, and a related apparatus.
Background
To enhance the artificial intelligence capability of a device, a neural network processor (Neural-network Processing Unit, NPU) is commonly integrated into the system. An NPU typically adopts a data-driven parallel computing architecture to accelerate neural network operations and overcome the low efficiency of conventional chips on such workloads. How to reduce the power consumption of the NPU has therefore become a challenge.
Disclosure of Invention
In view of this, the present application provides a data processing apparatus, a data processing method, and a related apparatus, which can reduce the area of a buffer control circuit in a neural network processor through a specific memory space architecture, and reduce the power consumption of data writing and reading.
In a first aspect, an embodiment of the present application provides a data processing apparatus, including a neural network processor, where the neural network processor includes a processing unit array and M storage modules, where the processing unit array includes M columns of processing unit sets, and M is an even number;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing first data in a distributed mode, and the M-column processing unit set is used for reading the first data stored in the distributed mode from the M storage modules.
In a second aspect, an embodiment of the present application provides a data processing method, which is applied to the data processing apparatus according to the first aspect of the embodiment of the present application, where the method includes:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In a third aspect, an embodiment of the present application provides a system-on-chip including a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In a fourth aspect, embodiments of the present application provide an electronic device, including a memory for storing a program and a processor executing the program stored in the memory, where the processor is configured to execute instructions of steps in the method according to any one of the second aspects of the embodiments of the present application when the program stored in the memory is executed.
In a fifth aspect, embodiments of the present application provide a computer storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform a method according to any one of the second aspects of embodiments of the present application.
In a sixth aspect, embodiments of the present application provide a computer program product, wherein the computer program product comprises a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps described in any of the methods of the second aspect of embodiments of the present application. The computer program product may be a software installation package.
It can be seen that, with the above data processing apparatus, data processing method and related apparatus, the apparatus includes a neural network processor, where the neural network processor includes a processing unit array and M storage modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the M storage modules are configured to store first data in a distributed manner, and the M columns of processing unit sets are configured to read the distributed first data from the M storage modules. In this way, the area of the buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a memory module according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an annular buffer space according to an embodiment of the present disclosure;
FIG. 4 is an exemplary block diagram of a circular cache space according to an embodiment of the present application;
fig. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
fig. 6 is a schematic architecture diagram of a neural network processor according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating power consumption comparison according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of a system-on-chip according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the present application solution better understood by those skilled in the art, the following description will clearly and completely describe the technical solution in the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
It should be understood that the term "and/or" merely describes an association relationship between the associated objects, and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In this context, the character "/" indicates that the associated objects before and after it are in an "or" relationship. The term "plurality" as used in the embodiments herein refers to two or more.
The "connection" in the embodiments of the present application refers to various connection manners such as direct connection or indirect connection, so as to implement communication between devices, which is not limited in any way in the embodiments of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The following is a description of the background art and related terms of the present application.
The background technology is related:
In NPU architecture design, multiple levels of cache are typically designed to increase the data bandwidth of the system. The basic computational unit in an NPU is typically an arithmetic logic unit (Arithmetic and Logic Unit, ALU) with storage, also called a processing element (Processing Element, PE). The storage inside the PE is referred to as the level-0 cache. The data processed by a neural network is usually divided into a plurality of channels, such as image data comprising a plurality of color channels. In an NPU systolic-array architecture for convolutional neural networks, one common design is to split the buffer into separate small buffers, composed of static random access memories (Static Random Access Memory, SRAM), that correspond to different channels so as to increase the bandwidth available to different PEs, with each SRAM corresponding to one column of PEs. When a certain column of PEs in the PE array needs to access data in an SRAM other than the one corresponding to it, one existing method is to add an extra bus between the buffers of the different SRAMs to support data access across SRAMs, which increases the area and power consumption of the chip; another method is to splice multiple SRAMs into one large SRAM, but because different SRAMs hold overlapping portions of data, the overlapping data has to be fetched from the dynamic random access memory (Dynamic Random Access Memory, DRAM) multiple times, and the power consumption of each DRAM read is about 100 times that of an SRAM access, which greatly increases the power consumption.
In order to solve the above problems, the present application provides a data processing apparatus, a data processing method, and a related apparatus, which can use a new storage space architecture, reduce the area of a cache control circuit in a neural network processor, and reduce the power consumption of data writing and reading.
In the following, a data processing apparatus according to an embodiment of the present application will be described with reference to fig. 1, and fig. 1 is a schematic architecture diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 100 includes a processing unit array 110 and M memory modules 120, and the processing unit array 110 includes M columns of processing unit sets 111, where M is an even number.
Each storage module 120 comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, where the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules 120 are configured to store first data in a distributed manner, and the M-column processing unit set 111 is configured to read the first data stored in a distributed manner from the M storage modules 120.
The storage module 120 may be an SRAM, and the processing unit set 111 may include a plurality of PEs.
For easy understanding, an arbitrary one of the memory modules 120 in the embodiments of the present application is described separately with reference to fig. 2, and fig. 2 is a schematic structural diagram of a memory module provided in the embodiments of the present application, where the memory module includes α Ring buffer spaces, that is, ring buffer spaces 0 to α -1 in the figures, and different Ring buffer spaces correspond to different layers in a neural network.
In one possible embodiment, when the annular buffer space of each storage module 120 is allocated, it can be allocated in units of the space occupied by an even number of rows of image pixels, so as to ensure that the buffer addresses of the different storage modules 120 are aligned within the same layer of the neural network.
Further, the annular buffer space in the embodiment of the present application is described with reference to fig. 3. Fig. 3 is a schematic structural diagram of an annular buffer space provided in the embodiment of the present application, where the annular buffer space includes x lines of cache address space, that is, line 0 to line x-1 in the drawing; when the convolution kernel is N×N, the front N-1 lines may be set as the head address space, and the rear N-1 lines as the tail address space.
Further, the circular buffer space in the embodiment of the present application is described with reference to fig. 4. Fig. 4 is an exemplary structure diagram of a circular buffer space provided in the embodiment of the present application. Suppose that 2 annular buffer spaces form the circular buffer space, each annular buffer space includes 8 line buffer address spaces, that is, line 0 to line 7 in the first annular buffer space and line 8 to line 15 in the second annular buffer space (the middle lines are not shown), and the convolution kernel size is 3×3. It can then be determined that the first 2 lines and the last 2 lines of the first annular buffer space and the first 2 lines and the last 2 lines of the second annular buffer space are mapped to each other: line 6 and line 7 are mapped to the 2-line head shadow space before line 8, line 8 and line 9 are mapped to the 2-line tail shadow space after line 7, and similarly, line 14 and line 15 are mapped to the 2-line head shadow space before line 0, and line 0 and line 1 are mapped to the 2-line tail shadow space after line 15, thus forming the circular buffer space. It will be appreciated that a shadow space does not actually store data; it only represents an address space for which a mapping relationship exists, and when data in a shadow space is required, the data can be retrieved from the source space to which the shadow space maps. For example, when the data of line 7, line 8 and line 9 is needed, the data of line 7 is read from the first annular buffer space, and the data of line 8 and line 9 is read from the second annular buffer space according to the mapping relation.
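As an aid to understanding, the following is a minimal sketch in Python (purely illustrative; the function resolve and the constants X, N and RINGS are assumed names, not part of the patent) of the address redirection described above, using the parameters of fig. 4: 2 annular buffer spaces of 8 lines each and a 3×3 convolution kernel.

X = 8          # lines per annular buffer space (x in the text)
N = 3          # convolution kernel is N x N
RINGS = 2      # annular buffer spaces chained into one circular buffer space

def resolve(ring: int, line: int) -> tuple[int, int]:
    """Map a (ring, line) request, including shadow lines, to the ring and
    physical line that actually hold the data. 0 <= line < X is a real line;
    X <= line < X+N-1 is the tail shadow (first mapping relation: the next
    ring's head lines); -(N-1) <= line < 0 is the head shadow (second mapping
    relation: the previous ring's tail lines)."""
    if 0 <= line < X:
        return ring, line                       # ordinary line, no redirection
    if X <= line < X + (N - 1):
        return (ring + 1) % RINGS, line - X     # tail shadow -> next ring's head
    if -(N - 1) <= line < 0:
        return (ring - 1) % RINGS, X + line     # head shadow -> previous ring's tail
    raise ValueError("line outside the ring and its shadow spaces")

# Reading lines 7, 8 and 9 as in the example: line 7 is a real line of the first
# ring, while lines 8 and 9 fall in its tail shadow and resolve to the first two
# lines of the second ring.
assert resolve(0, 7) == (0, 7)
assert resolve(0, 8) == (1, 0)
assert resolve(0, 9) == (1, 1)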
By the data processing device, the distributed storage of the data can be executed, and the power consumption when the data stored in the distributed storage is read is low.
A data processing method in the embodiment of the present application is described below with reference to fig. 5, and fig. 5 is a schematic flow chart of the data processing method provided in the embodiment of the present application, which specifically includes the following steps:
In step 501, the first data is divided into n pieces of first sub-data according to the number n of channels of the first data.
The first data represents the data to be stored, and n is a positive integer less than or equal to M/2. For example, the first data may be image data having n channels; to ensure that the data of each channel can be evenly distributed and stored across the M storage modules, n is required to be less than or equal to M/2.
Step 502, writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces.
Any one of the circular buffer spaces is used for storing any one piece of the first sub-data.
Specifically, the number of rows y of each first sub-data may be acquired, and each first sub-data may be sequentially written into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
It can be understood that when n is 2 and M is 4, there are 2 first sub-data and 2 circular buffer spaces can be constructed to store them respectively, that is, each first sub-data is written into a circular buffer space formed by INT(4/2) = 2 annular buffer spaces; when n is 2 and M is 9, there are likewise 2 first sub-data and 2 circular buffer spaces can be constructed to store them respectively, that is, each first sub-data is written into a circular buffer space formed by INT(9/2) = 4 annular buffer spaces, so that maximally distributed storage can be ensured.
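For illustration, the channel-to-buffer assignment of steps 501 and 502 can be sketched as follows (Python, with assumed names plan_write, num_channels and num_srams; an illustrative model, not the hardware implementation): each of the n channels is given a circular buffer space built from INT(M/n) annular buffer spaces, one per storage module.

def plan_write(num_channels: int, num_srams: int) -> list[list[int]]:
    """Return, for each channel (each first sub-data), the indices of the
    storage modules whose annular buffer spaces are chained into that
    channel's circular buffer space."""
    assert 1 <= num_channels <= num_srams // 2, "requires n <= M/2"
    rings_per_channel = num_srams // num_channels          # INT(M/n)
    return [
        list(range(c * rings_per_channel, (c + 1) * rings_per_channel))
        for c in range(num_channels)
    ]

# n = 2, M = 4: each channel gets INT(4/2) = 2 annular buffer spaces.
print(plan_write(2, 4))   # [[0, 1], [2, 3]]
# n = 2, M = 9: each channel gets INT(9/2) = 4 annular buffer spaces.
print(plan_write(2, 9))   # [[0, 1, 2, 3], [4, 5, 6, 7]]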
Step 503, reading the first sub-data in each circular cache space by each processing unit set.
Specifically, the INT(M/n)×x lines of the INT(M/n) annular buffer spaces can be read sequentially and cyclically over y rounds to read out each first sub-data.
In one possible embodiment, when the first sub-data in any head shadow space needs to be read, the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space can be read according to the second mapping relation.
In one possible embodiment, when the first sub-data in any tail shadow space needs to be read, the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space is read according to the first mapping relation.
In one possible embodiment, when no head shadow space or tail shadow space needs to be read, the first sub-data of each line is read in turn.
By the above data processing method, distributed storage of the data can be realized. For example, when the number of channels is 2 and the number of SRAMs is 8, four SRAMs can be spliced together for use by one channel, so that the image that can be stored is 4 times the size that could be stored without such distribution. When the data is read, data in a shadow space is read from its corresponding source space according to the mapping relation, without additional accesses to external memory, which greatly reduces the power consumption.
For ease of understanding, the data processing apparatus and the data processing method of the present application are described below with an example. Suppose that 4 storage modules are provided, namely SRAM0, SRAM1, SRAM2 and SRAM3, and the first data to be stored is image data of 2 channels. In the conventional method, the data of one channel is generally stored in SRAM0 and the data of the other channel in SRAM1, while SRAM2 and SRAM3 both remain empty, which is very wasteful of resources.
In this scheme, SRAM0 and SRAM2 are paired so that a mapping relationship exists between them, with the head and tail address spaces of SRAM0 and SRAM2 mapped to each other; likewise, SRAM1 and SRAM3 are paired, with the head and tail address spaces of SRAM1 and SRAM3 mapped to each other.
Assume that the data to be stored is an image, i.e. 2-channel data of 2800 rows. In the conventional method, channel 1 is stored in the annular storage space 1 of the allocated storage module 1, and 20 lines of data can be stored in the annular storage space 1; correspondingly, 20 lines of data can also be stored in the annular storage space 2 of the other storage module 2, so one fetch can retrieve 40 lines of data. Based on the foregoing analysis, in the case where the convolution kernel is 7×7, each fetch of 40 lines contains 24 lines of overlapping data, and a total of 70 rounds are required to process all the image data. The overlapping data that must be additionally fetched over these 70 rounds amounts to 1680 rows, which is 1680/2800 = 0.6 of the total image rows; that is, 60% of the data is overlapping data that has to be transported repeatedly. If the mapping strategy proposed by this application is adopted, the power consumption caused by moving this 60% of the data can be saved.
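The figures quoted in this example can be checked with a few lines of arithmetic (Python, illustrative only; the 24-line overlap per round is taken directly from the example above):

total_rows        = 2800          # rows of the 2-channel image data
lines_per_fetch   = 20 + 20       # annular storage space 1 + annular storage space 2
overlap_per_round = 24            # overlapping lines re-fetched each round (7x7 kernel)

rounds        = total_rows // lines_per_fetch      # 70 rounds
overlap_rows  = rounds * overlap_per_round         # 1680 rows
overlap_ratio = overlap_rows / total_rows          # 0.6, i.e. 60% redundant transfers

print(rounds, overlap_rows, overlap_ratio)         # 70 1680 0.6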
As shown in fig. 7, which is a schematic diagram of power consumption comparison provided in the embodiment of the present application, the hatched portion represents the power consumption of reading the overlapping area; the power consumption of this scheme is clearly lower than that of the existing scheme.
It should be understood that the foregoing is merely illustrative, and the data of one channel may be divided between any two memory modules; for example, the data of one channel may be stored in SRAM0 and SRAM3, or in SRAM1 and SRAM2, etc., which is not limited herein.
The scenario to which this embodiment of the application is applicable is:
n × 2 ≤ M
where n is the number of channels of the data to be stored, and M is the number of storage modules.
By the above data processing device and data processing method, in a neural network processor architecture that does not support access between SRAMs of the same cache level, a plurality of SRAMs are chained together in address space by means of the mapping, so that the space in the SRAMs is fully utilized and the amount of data that can be stored is increased, without increasing the power consumption or the chip area.
A system-on-chip in an embodiment of the present application is described below with reference to fig. 8. The system-on-chip 800 includes a neural network processor 810, where the neural network processor 810 includes a processing unit array and M memory modules, the processing unit array includes M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor 810 is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
In one possible embodiment, in the aspect of writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces, the neural network processor 810 is specifically configured to:
acquiring the number of rows y of each first sub-data;
and sequentially writing each first sub-data into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
In one possible embodiment, after writing each first sub-data into a circular buffer space consisting of INT (M/n) ring buffer spaces, the neural network processor is further configured to:
the first sub-data in each circular cache space is read by each processing unit set.
In one possible embodiment, in reading the first sub-data in each circular cache space by each set of processing units, the neural network processor 810 is specifically configured to:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
In one possible embodiment, in the aspect of sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data, the neural network processor 810 is specifically configured to:
when the first sub-data in any head shadow space needs to be read, reading the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space according to the second mapping relation;
when the first sub-data in any tail shadow space needs to be read, reading the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space according to the first mapping relation;
and when no head shadow space or tail shadow space needs to be read, sequentially reading the first sub-data of each line.
Therefore, the area of a buffer control circuit in the neural network processor can be reduced, and the power consumption of data writing and reading is reduced.
An electronic device in the embodiment of the present application is described below with reference to fig. 9. Fig. 9 is a schematic structural diagram of an electronic device provided in the embodiment of the present application. As shown in fig. 9, the electronic device 900 includes a processor 901, a communication interface 902 and a memory 903, which are connected to each other. The electronic device 900 may further include a bus 904, through which the processor 901, the communication interface 902 and the memory 903 may be connected to each other; the bus 904 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 904 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 9, but this does not mean that there is only one bus or only one type of bus. The memory 903 is used to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform all or part of the method described above with reference to fig. 5.
The foregoing description of the embodiments of the present application has been presented primarily in terms of a method-side implementation. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional units of the electronic device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated in one processing unit. The integrated units may be implemented in hardware or in software functional units. It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice.
The present application also provides a computer storage medium storing a computer program for electronic data exchange, the computer program causing a computer to execute some or all of the steps of any one of the methods described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the methods described in the method embodiments above. The computer program product may be a software installation package, said computer comprising an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, such as the above-described division of units, merely a division of logic functions, and there may be additional manners of dividing in actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units described above, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
The embodiments of the present application have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and the core idea of the present application. Meanwhile, those skilled in the art may make changes to the specific implementations and the application scope according to the ideas of the present application. In view of the above, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A data processing device, characterized by comprising a neural network processor, wherein the neural network processor comprises a processing unit array and M storage modules, the processing unit array comprises M columns of processing unit sets, and M is an even number;
each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces;
each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space;
the M storage modules are used for storing first data in a distributed mode, and the M-column processing unit set is used for reading the first data stored in the distributed mode from the M storage modules.
2. A data processing method applied to the data processing apparatus of claim 1, the method comprising:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
3. The method of claim 2, wherein the writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces comprises:
acquiring the number of rows y of each first sub-data;
and sequentially writing each first sub-data into the circular buffer space formed by the INT(M/n) annular buffer spaces, with y/(INT(M/n)×x) rounds of writing being performed to store the first data.
4. The method of claim 2, wherein after writing each first sub-data into a circular buffer space formed by INT(M/n) annular buffer spaces, the method further comprises:
the first sub-data in each circular cache space is read by each processing unit set.
5. The method of any of claims 2-4, wherein the reading, by each set of processing units, the first sub-data in each circular cache space comprises:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
6. The method of claim 5, wherein the sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data comprises:
when the first sub-data in any head shadow space needs to be read, reading the first sub-data in the rear N-1-line tail address space of the previous annular cache space corresponding to that head shadow space according to the second mapping relation;
when the first sub-data in any tail shadow space needs to be read, reading the first sub-data in the front N-1-line head address space of the next annular cache space corresponding to that tail shadow space according to the first mapping relation;
and when no head shadow space or tail shadow space needs to be read, sequentially reading the first sub-data of each line.
7. A system-on-chip, characterized by comprising a neural network processor, wherein the neural network processor comprises a processing unit array and M storage modules, the processing unit array comprises M columns of processing unit sets, and M is an even number; each storage module comprises annular cache spaces in one-to-one correspondence with the layers of the neural network, the M annular cache spaces of each layer form at most M/2 circular cache spaces, and each circular cache space comprises at least 2 and at most M annular cache spaces; each annular cache space comprises x lines of cache address space, where x is a positive integer greater than 2, and the x lines comprise a head address space of the front N-1 lines and a tail address space of the rear N-1 lines, where the convolution kernel has a size of N×N and N is a positive integer greater than 1 and less than x; for any annular cache space forming a circular cache space, a first mapping relation exists between the N-1-line tail shadow space following its tail address space and the front N-1-line head address space of the next annular cache space, and a second mapping relation exists between the N-1-line head shadow space preceding its head address space and the rear N-1-line tail address space of the previous annular cache space; the neural network processor is configured to:
dividing first data into n pieces of first sub data according to the channel number n of the first data, wherein the first data represents data to be stored, and n is a positive integer less than or equal to M/2;
and writing each first sub-data into a circular buffer space formed by INT (M/n) annular buffer spaces, wherein any circular buffer space is used for storing any first sub-data.
8. The system-on-chip of claim 7, wherein the neural network processor is specifically configured to, in reading the first sub-data in each of the circular cache spaces by each of the sets of processing units:
sequentially and cyclically reading the INT(M/n)×x lines of the INT(M/n) annular buffer spaces over y rounds to read each first sub-data.
9. An electronic device, comprising: a memory for storing a program and a processor for executing the program stored in the memory, the processor being configured to execute the data processing method according to any one of claims 2 to 6 when the program stored in the memory is executed.
10. A computer storage medium storing program code for execution by an electronic device, the program code comprising instructions for performing the method of any one of claims 2 to 6.
CN202111584175.8A 2021-12-22 2021-12-22 Data processing device, data processing method and related device Pending CN116362304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111584175.8A CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111584175.8A CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Publications (1)

Publication Number Publication Date
CN116362304A true CN116362304A (en) 2023-06-30

Family

ID=86938968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111584175.8A Pending CN116362304A (en) 2021-12-22 2021-12-22 Data processing device, data processing method and related device

Country Status (1)

Country Link
CN (1) CN116362304A (en)

Similar Documents

Publication Publication Date Title
JP6767660B2 (en) Processor, information processing device and how the processor operates
WO2022037257A1 (en) Convolution calculation engine, artificial intelligence chip, and data processing method
CN110222818B (en) Multi-bank row-column interleaving read-write method for convolutional neural network data storage
CN112905530B (en) On-chip architecture, pooled computing accelerator array, unit and control method
CN112967172B (en) Data processing device, method, computer equipment and storage medium
CN112416433A (en) Data processing device, data processing method and related product
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN110009103B (en) Deep learning convolution calculation method and device
CN110009644B (en) Method and device for segmenting line pixels of feature map
JP2024516514A (en) Memory mapping of activations for implementing convolutional neural networks
CN118193443A (en) Data loading method for processor, computing device and medium
CN111125628A (en) Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
CN109844774B (en) Parallel deconvolution computing method, single-engine computing method and related products
CN112631955B (en) Data processing method, device, electronic equipment and medium
CN110298441B (en) Data processing method, electronic device and computer readable storage medium
US11868875B1 (en) Data selection circuit
CN116362304A (en) Data processing device, data processing method and related device
KR20240036594A (en) Subsum management and reconfigurable systolic flow architectures for in-memory computation
CN116362303A (en) Data processing device, data processing method and related device
CN112596881B (en) Storage component and artificial intelligence processor
CN115204380A (en) Data storage and array mapping method and device of storage-computation integrated convolutional neural network
CN114996647A (en) Arithmetic unit, related device and method
CN111291884B (en) Neural network pruning method, device, electronic equipment and computer readable medium
US8812813B2 (en) Storage apparatus and data access method thereof for reducing utilized storage space
CN115145842A (en) Data cache processor and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination