CN110770763A - Data storage device, method, processor and removable equipment - Google Patents

Data storage device, method, processor and removable equipment Download PDF

Info

Publication number
CN110770763A
CN110770763A (application number CN201880040193.XA)
Authority
CN
China
Prior art keywords
data
unit
units
assembling
data units
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880040193.XA
Other languages
Chinese (zh)
Inventor
韩峰
王耀杰
高明明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SZ DJI Technology Co Ltd
Shenzhen Dajiang Innovations Technology Co Ltd
Original Assignee
Shenzhen Dajiang Innovations Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dajiang Innovations Technology Co Ltd
Publication of CN110770763A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering


Abstract

An apparatus (600), method, processor, and movable device for data storage. The apparatus (600) comprises: an assembling module (610) for obtaining a calculation result produced by a multiply-accumulate unit, wherein the calculation result comprises data units of at least one output feature map, and for assembling the data units of each output feature map in the at least one output feature map into data unit groups of a predetermined size; and a storage module (620) for storing the data unit groups into a memory, wherein the predetermined size is the size of a storage unit in the memory. The efficiency of data storage can thereby be improved.

Description

Data storage device, method, processor and removable equipment
Copyright declaration
The disclosure of this patent document contains material which is subject to copyright protection. The copyright is owned by the copyright owner. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the official records of the patent and trademark office.
Technical Field
The present application relates to the field of information technology, and more particularly, to an apparatus, method, processor, and movable device for data storage.
Background
A Convolutional Neural Network (CNN) is a machine learning algorithm widely applied to computer vision tasks such as target recognition, target detection, and semantic segmentation of images.
The output format of the calculation results of a convolutional neural network differs from the storage format of the memory, such as a Static Random Access Memory (SRAM), so the calculation results must be converted into the memory's storage format before being stored. How to improve the efficiency of data storage has therefore become an urgent technical problem in the design of convolutional neural networks.
Disclosure of Invention
The embodiments of the present application provide a data storage apparatus, a data storage method, a processor, and a movable device, which can improve the efficiency of data storage.
In a first aspect, an apparatus for data storage is provided, including: an assembling module, configured to obtain a calculation result produced by a multiply-accumulate unit, where the calculation result includes data units of at least one output feature map, and to assemble the data units of each output feature map in the at least one output feature map into data unit groups of a predetermined size; and a storage module, configured to store the data unit groups in a memory, where the predetermined size is the size of a storage unit in the memory.
In a second aspect, a method for data storage is provided, including: obtaining a calculation result produced by a multiply-accumulate unit, where the calculation result includes data units of at least one output feature map; assembling the data units of each output feature map in the at least one output feature map into data unit groups of a predetermined size; and storing the data unit groups into a memory, where the predetermined size is the size of a storage unit in the memory.
In a third aspect, a processor is provided that includes the data storage apparatus of the first aspect.
In a fourth aspect, a movable device is provided, comprising the data storage apparatus of the first aspect or the processor of the third aspect.
In a fifth aspect, a computer storage medium is provided, in which program code is stored, the program code being operable to instruct execution of the method of the second aspect.
According to the above technical solution, the data units of each output feature map in the calculation results produced by the multiply-accumulate unit are assembled into data unit groups of a predetermined size and stored in the memory. Because the data units are assembled based on the size of the storage units in the memory, the assembly does not occupy excessive resources and the data unit groups can be conveniently stored in the memory, so the efficiency of data storage can be improved.
Drawings
Fig. 1 is a schematic diagram of a convolution operation process of a convolutional neural network according to an embodiment of the present application.
Fig. 2 is an architecture diagram of a solution to which an embodiment of the present application is applied.
Fig. 3 is a diagram illustrating a calculation result output by the multiply-accumulate unit according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a storage format of a feature map in a memory according to an embodiment of the present application.
Fig. 5 is a schematic configuration diagram of a movable apparatus according to an embodiment of the present application.
FIG. 6 is a schematic diagram of an apparatus for data storage according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a data storage apparatus according to another embodiment of the present application.
Fig. 8 is a schematic diagram of a data storage apparatus according to another embodiment of the present application.
Fig. 9 is a schematic diagram of a data storage apparatus according to another embodiment of the present application.
Fig. 10 is a schematic flow chart of reading data using a polling algorithm according to an embodiment of the present application.
Fig. 11 is a schematic diagram of a data storage apparatus according to another embodiment of the present application.
Fig. 12 is a schematic diagram of data unit distribution according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a data storage apparatus according to another embodiment of the present application.
Fig. 14 is a schematic diagram of data unit assembly according to an embodiment of the present application.
Fig. 15 is a schematic flow chart of a method of data storage according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
It should be understood that the specific examples are provided herein only to assist those skilled in the art in better understanding the embodiments of the present application and are not intended to limit the scope of the embodiments of the present application.
It should also be understood that, in the various embodiments of the present application, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic of the processes, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should also be understood that the various embodiments described in this specification can be implemented individually or in combination, and the examples in this application are not limited thereto.
The technical solution of the embodiment of the present application may be applied to various deep learning algorithms, such as a convolutional neural network, but the embodiment of the present application does not limit this.
Fig. 1 shows a schematic diagram of the convolution operation process of a convolutional neural network.
As shown in fig. 1, the convolution operation of a convolutional neural network takes an Input Feature Map (IFM) and a set of weight values, and outputs an Output Feature Map (OFM). The input weight values are called filters or convolution kernels. The input feature map of a layer is the output feature map of the previous layer, and the output feature map is the feature map obtained after the current layer operates on the input feature map. The convolution kernel and the input and output feature maps can each be represented as a multi-dimensional matrix, and one convolution operation of a convolution layer of the convolutional neural network performs an inner product operation between at least part of the feature values (data units) of the input feature matrix and the weight values of the convolution kernel matrix.
The convolution operation of a convolution layer can adopt a sliding-window mode: with the upper-left corner of the input feature matrix as the starting point and the size of the convolution kernel as the window, the window slides step by step to the lower-right corner of the input feature matrix to generate a complete two-dimensional output feature matrix. After each slide, the convolution calculation device extracts a window-sized block of input feature values from the input feature matrix and performs an inner product operation with the convolution kernel to generate one output feature value. After all the two-dimensional output feature matrices are generated in this manner, the three-dimensional output feature matrix of the convolution layer is obtained.
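The sliding-window procedure above can be sketched in a few lines of Python. This is an illustrative single-channel model; a real convolution layer also sums over input channels and repeats the process for each convolution kernel, and all names here are ours, not the patent's:

```python
def conv2d_sliding_window(ifm, kernel):
    """Slide a Kh x Kw window over a single-channel input feature map
    (a list of lists) and take the inner product with the kernel at
    each position, producing one 2-D output feature matrix."""
    h, w = len(ifm), len(ifm[0])
    kh, kw = len(kernel), len(kernel[0])
    ofm = []
    for m in range(h - kh + 1):          # slide from the top-left corner ...
        row = []
        for n in range(w - kw + 1):      # ... to the bottom-right corner
            acc = 0
            for dm in range(kh):         # inner product of window and kernel
                for dn in range(kw):
                    acc += ifm[m + dm][n + dn] * kernel[dm][dn]
            row.append(acc)
        ofm.append(row)
    return ofm
```

Stacking the two-dimensional matrices produced for every kernel yields the three-dimensional output feature matrix described above.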
Fig. 2 is an architecture diagram of a solution to which an embodiment of the present application is applied.
As shown in fig. 2, system 200 may include convolution calculation device 210 and memory 220.
The memory 220 is used for storing data to be processed, such as input feature maps and weight values, and storing processed data, such as output feature maps. The memory 220 may be an SRAM.
The convolution calculation device 210 includes a Multiply Accumulate Unit (MAU) 211, an IFM input module 212, a weight value input module 213, and an OFM storage module 214. The weight value input module 213 reads the weight values out of the memory 220 and sends them to the MAU 211 in a specific format. The IFM input module 212 reads the input feature map data from the memory 220 and sends it to the MAU 211 for convolution. The MAU 211 may include a systolic array and a buffer for storing intermediate calculation results. In a convolution operation, the MAU 211 first loads the weight values received from the weight value input module 213 into the systolic array; then, as the input feature map data from the IFM input module 212 enters the systolic array, it is multiplied by the previously loaded weight values. If an intermediate result is buffered in the buffer of the MAU 211, the systolic array's output is further accumulated with that intermediate result. If the result of the multiply-accumulate operation is still an intermediate result of the convolution operation, it is stored back into the MAU's buffer; otherwise, it is output to the downstream OFM storage module 214 for subsequent processing. The OFM storage module 214 assembles the convolution calculation results output by the MAU 211 into the data format stored in the memory 220 and then writes them to the memory 220.
The calculation result output by the MAU 211 is shown in fig. 3. In fig. 3, [k, m, n] represents the feature value in the mth row and nth column of the kth feature map in the three-dimensional feature matrix. The systolic array outputs one row of feature values in fig. 3 every cycle. Each column of the systolic array outputs one two-dimensional output feature matrix, corresponding to one output feature map, and the delay between the first valid feature values output by two adjacent columns is greater than or equal to one cycle.
In the memory 220, each feature map is stored consecutively in units of a predetermined size. The storage format is shown in fig. 4, where [k, m, n] represents the feature value in the mth row and nth column of the kth feature map in the three-dimensional feature matrix; the predetermined size illustrated in fig. 4 is the size of 32 feature values.
As can be seen from figs. 3 and 4, the row of feature values output by the MAU 211 in each cycle belongs to multiple different feature maps, whereas the storage format in the memory 220 stores each feature map consecutively in units of the predetermined size. The output format of the calculation result of the MAU 211 therefore differs from the storage format in the memory 220.
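The format mismatch between figs. 3 and 4 can be sketched as follows, assuming the per-cycle interleaved row output of fig. 3 and the per-map grouped layout of fig. 4; the function names and data shapes are ours:

```python
def mau_stream(ofm_values, n_maps):
    """Model the MAU output of fig. 3: each cycle yields one row holding
    the next value of every output feature map (one value per systolic-
    array column). ofm_values[k] is the flattened kth feature map."""
    for t in range(len(ofm_values[0])):
        yield [ofm_values[k][t] for k in range(n_maps)]

def to_memory_format(rows, n_maps, group_size):
    """Regroup the interleaved stream into the layout of fig. 4: each
    feature map stored consecutively, in groups of `group_size` values."""
    per_map = [[] for _ in range(n_maps)]
    for row in rows:
        for k, value in enumerate(row):
            per_map[k].append(value)
    return [per_map[k][i:i + group_size]
            for k in range(n_maps)
            for i in range(0, len(per_map[k]), group_size)]
```

The conversion performed by the OFM storage module corresponds to `to_memory_format` here, executed incrementally in hardware rather than after the whole stream has arrived.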
In view of this, the embodiments of the present application provide a technical solution for data storage that can efficiently assemble convolution calculation results into the data format stored in the memory, so as to improve the efficiency of data storage.
In some embodiments, the technical solution of the embodiments of the present application may be applied to a movable device. The movable device may be an unmanned aerial vehicle, an unmanned ship, a self-driving vehicle, or a robot; the embodiments of the present application are not limited thereto.
Fig. 5 is a schematic architecture diagram of a movable device 500 according to an embodiment of the present application.
As shown in fig. 5, the movable device 500 may include a power system 510, a control system 520, a sensing system 530, and a processing system 540.
The power system 510 is used to power the movable device 500.
Taking an unmanned aerial vehicle as an example, the power system of the unmanned aerial vehicle may include an electronic speed controller (ESC), propellers, and motors corresponding to the propellers. Each motor is connected between the electronic speed controller and a propeller, and the motor and the propeller are arranged on the corresponding arm. The electronic speed controller is used to receive a driving signal generated by the control system and provide a driving current to the motor according to the driving signal, so as to control the rotation speed of the motor. The motor drives the propeller to rotate, thereby providing power for the flight of the unmanned aerial vehicle.
The sensing system 530 may be used to measure attitude information of the movable device 500, i.e., the position and state information of the movable device 500 in space, such as its three-dimensional position, three-dimensional angle, three-dimensional velocity, three-dimensional acceleration, and three-dimensional angular velocity. The sensing system 530 may include at least one of a gyroscope, an electronic compass, an Inertial Measurement Unit (IMU), a vision sensor, a Global Positioning System (GPS) receiver, a barometer, an airspeed meter, and the like.
The sensing system 530 may also be used to capture images, i.e., the sensing system 530 includes a sensor for capturing images, such as a camera.
The control system 520 is used to control the movement of the movable device 500. The control system 520 may control the movable device 500 according to preset program instructions. For example, the control system 520 may control the movement of the movable device 500 based on the attitude information measured by the sensing system 530, and may also control the movable device 500 based on control signals from a remote control. For a drone, for example, the control system 520 may be a flight control system, or a control circuit in the flight control system.
The processing system 540 may process the images acquired by the sensing system 530. For example, the processing system 540 may be an Image Signal Processing (ISP) chip.
Processing system 540 may be system 200 in fig. 2, or processing system 540 may include system 200 in fig. 2.
It should be understood that the above division and naming of the components of the movable device 500 is merely exemplary and should not be construed as limiting the embodiments of the present application.
It should also be understood that the movable device 500 may also include other components not shown in fig. 5; the embodiments of the present application are not limited in this regard.
FIG. 6 shows a schematic diagram of an apparatus 600 for data storage according to an embodiment of the present application. The apparatus 600 may be the OFM storage module 214 of fig. 2.
As shown in fig. 6, the apparatus 600 may include a construction module 610 and a storage module 620.
It should be understood that the various modules in the embodiments of the present application may be implemented by circuits; for example, the assembling module 610 may be an assembling circuit. However, the embodiments of the present application are not limited thereto, and the modules may also be implemented in other manners.
The assembling module 610 is configured to obtain a calculation result produced by the multiply-accumulate unit, where the calculation result includes data units of at least one output feature map, and to assemble the data units of each output feature map in the at least one output feature map into data unit groups of a predetermined size.
The storage module 620 is configured to store the data unit groups in a memory, where the predetermined size is the size of a storage unit in the memory.
In the embodiment of the present application, the format conversion of the calculation result is performed by the assembling module 610, which assembles the data units of each output feature map into data unit groups of the size of a storage unit in the memory. The storage module 620 then stores each assembled data unit group to a storage unit in the memory. Because the data units are assembled based on the size of the storage units in the memory, the assembly does not occupy excessive resources and the data unit groups are convenient to store, so the efficiency of data storage can be improved.
Optionally, in an embodiment of the present application, as shown in fig. 7, the assembling module 610 includes N assembling units 611, where each assembling unit 611 in the N assembling units 611 is configured to assemble data units of one output feature map into a data unit group of the predetermined size, and N is a positive integer greater than 1.
Specifically, in this embodiment, a plurality of assembling units 611 are used to assemble the data units. Each assembling unit 611 is responsible for assembling the data units of one output feature map: a first assembling unit assembles the data units of a first output feature map, a second assembling unit assembles the data units of a second output feature map, and so on. In this way, the N assembling units 611 can assemble the data units of N output feature maps.
If the output data bit width of the multiply-accumulate unit is N data units, the multiply-accumulate unit can output N data units at a time, and the N data units belong to N output feature maps, respectively. As shown in fig. 3, the multiply-accumulate unit outputs one row of data units at a time, where each data unit belongs to one output feature map and the N data units in one row belong to the N output feature maps, respectively. The N assembling units 611 may correspond one-to-one to the N output feature maps.
In this case, optionally, in an embodiment of the present application, as shown in fig. 7, the apparatus 600 further includes:
a distribution module 630, configured to distribute the N data units to the N assembling units 611, respectively.
The distribution module 630 distributes the N data units in one row output by the multiply-accumulate unit to the N assembling units 611, respectively. The data units input consecutively over multiple cycles into each of the N assembling units 611 are assembled into data unit groups of the predetermined size.
That is, each assembling unit 611 receives one data unit of its corresponding output feature map at a time and assembles it with the previously received data units until a data unit group of the predetermined size is assembled.
It should be understood that the size of a storage unit in the memory, i.e., the predetermined size, is generally smaller than the size of one row of the feature map. The size of one row of the feature map may be an integer multiple of the predetermined size; if it is not, the last data unit group of each row contains only the remaining data units, i.e., its size is smaller than the predetermined size.
If the access data bit width of the memory is the size of N data units, i.e., the predetermined size is the size of N data units, then the data units input over N consecutive cycles into each of the N assembling units 611 are assembled into one data unit group of the predetermined size.
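The behavior of one assembling unit and the distribution module can be sketched as follows. This is an illustrative software model of the circuit, with class and method names of our own choosing:

```python
class AssemblingUnit:
    """Model of one assembling unit (611): collects successive data units
    of a single output feature map in a first cache and emits a group
    once the cache holds `group_size` units."""
    def __init__(self, group_size):
        self.group_size = group_size
        self.first_cache = []

    def push(self, data_unit):
        """Return a complete data unit group when one is assembled, else None."""
        self.first_cache.append(data_unit)
        if len(self.first_cache) == self.group_size:
            group, self.first_cache = self.first_cache, []
            return group
        return None

def distribute(row, units):
    """Model of the distribution module (630): the i-th data unit of a
    MAU output row goes to the i-th assembling unit; any groups that
    become complete are collected for storage."""
    stored = []
    for unit, value in zip(units, row):
        group = unit.push(value)
        if group is not None:
            stored.append(group)
    return stored
```

With the access data bit width equal to N data units, each unit emits one group every N cycles, matching the text above.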
Optionally, in an embodiment of the present application, as shown in fig. 8, each of the N assembling units 611 includes a first cache 612.
The size of the first cache 612 may be the predetermined size. Alternatively, the first cache 612 may be implemented by a register. The first cache 612 is used for data unit assembly. Accordingly, the storage module 620 is configured to store the data unit group assembled by the first cache 612 in the memory.
The size of the first cache 612 in the assembling unit 611 must be sufficient to assemble a data unit group of the predetermined size, so its minimum size is the predetermined size. In that minimal case, each time a data unit group is assembled in the first cache 612, the storage module 620 must immediately store the assembled data unit group into the memory.
Optionally, in an embodiment of the present application, as shown in fig. 9, each of the N assembling units 611 includes a first cache 612 and a second cache 613.
The sizes of the first buffer memory 612 and the second buffer memory 613 are both the predetermined size. The first cache 612 is configured to perform data unit assembling, and cache the assembled data unit group to the second cache 613. Accordingly, the storing module 620 is configured to store the assembled group of data units in the second cache 613 into the memory.
In this embodiment, the assembling unit 611 is implemented using a first cache 612 and a second cache 613, both of the predetermined size. Alternatively, the first cache 612 and the second cache 613 may be implemented by registers. The first cache 612 is used for assembling, and the second cache 613 is used for caching the assembled data unit groups.
It should be understood that the first cache 612 and the second cache 613 may be physically separate or integrated. That is, the first cache 612 and the second cache 613 may be two independent caches or may be two parts of one cache, which is not limited in this embodiment of the present invention.
It should also be understood that the sizes of the first cache 612 and the second cache 613 may also be larger than the predetermined size, as long as the assembly and cache of the data unit groups of the predetermined size can be implemented, which is not limited in the embodiment of the present application.
The presence of the second cache 613 makes it convenient for the storage module 620 to store the assembled data unit groups.
For example, the storage module 620 may sequentially read the assembled data unit groups from the second cache 613 of each of the N assembling units 611 according to a polling (round-robin) algorithm and store them in the memory.
Fig. 10 shows a schematic flow chart for reading out data using a polling algorithm. As shown in fig. 10, the storage module 620 may execute 1001, 1002, 1003, 1004, …, 1005, 1006 in a loop, and sequentially read out the assembled data unit group from each assembly unit 611 and store the assembled data unit group in the memory. For example, in 1001, it is determined whether there is a data unit group assembled in the first assembly unit, and if yes, 1002 is executed to read out the data unit group assembled in the first assembly unit. Then, 1003 is executed, whether the second assembling unit has the assembled data unit group is judged, if yes, 1004 is executed, the assembled data unit group in the second assembling unit is read out, and the like.
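The polling loop of fig. 10 can be sketched as follows, modeling each second cache as a queue; the data structures and the function name are assumptions of this sketch, not the patent's circuit:

```python
from collections import deque

def round_robin_readout(second_caches, memory):
    """Visit each assembling unit's second cache in turn (unit 1, unit 2,
    ..., then wrap around); if a cache holds an assembled data unit group,
    read the group out and store it in the memory, then move on."""
    while any(second_caches):
        for cache in second_caches:
            if cache:                        # an assembled group is waiting
                memory.append(cache.popleft())
```

Steps 1001/1003/… in fig. 10 correspond to the `if cache:` check, and steps 1002/1004/… to the read-out and store.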
Optionally, in an embodiment of the present application, as shown in fig. 7, the apparatus 600 may further include:
a control module 640, configured to control the speed at which the multiply-accumulate unit outputs the calculation results.
Specifically, the speed at which the multiply-accumulate unit outputs calculation results may not match the speed at which the apparatus 600 processes data. Therefore, in the embodiment of the present application, the control module 640 controls the speed at which the multiply-accumulate unit outputs the calculation results. For example, when the calculation results arrive too quickly, the control module 640 may assert a backpressure signal to the multiply-accumulate unit; the multiply-accumulate unit stops calculating after receiving the backpressure signal and resumes calculating once the backpressure signal is deasserted.
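A toy model of this backpressure mechanism is sketched below. The capacity threshold and all names are assumptions, since the patent does not specify when the signal is triggered:

```python
class BackpressureControl:
    """Model of the control module (640): when the number of pending
    calculation results reaches the downstream capacity, the backpressure
    signal is asserted and the MAU must pause; it may resume once results
    drain below capacity again."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = 0

    @property
    def backpressure(self):
        return self.pending >= self.capacity

    def accept_result(self):
        # The MAU pushed one calculation result; it must not do so
        # while backpressure is asserted.
        assert not self.backpressure, "MAU must pause under backpressure"
        self.pending += 1

    def result_stored(self):
        # The storage module drained one result to memory.
        self.pending -= 1
```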
In the above embodiment, the data units are assembled by N assembling units. Other manners of assembling the data units, i.e., other implementations of the assembling module 610, may also be used. An implementation according to another embodiment of the present application is described below.
Optionally, in one embodiment of the present application, as shown in fig. 11, the assembling module 610 includes a first assembling unit 616 and a second assembling unit 617.
The first assembling unit 616 is configured to assemble the data units of specific odd rows into data unit groups of the predetermined size, and the second assembling unit 617 is configured to assemble the data units of specific even rows into data unit groups of the predetermined size, where a specific odd row is an odd-numbered row of each of the at least one output feature maps, and a specific even row is an even-numbered row of each of the at least one output feature maps.
Specifically, in this embodiment, the data units are assembled by the first assembling unit 616 and the second assembling unit 617. The first assembling unit 616 assembles the data units of the odd rows of each output feature map, and the second assembling unit 617 assembles the data units of the even rows of each output feature map.
In this case, optionally, in an embodiment of the present application, as shown in fig. 11, the apparatus 600 further includes:
a distribution module 635, configured to distribute the data units of the specific odd rows to the first assembling unit 616, and to distribute the data units of the specific even rows to the second assembling unit 617.
Specifically, the distribution module 635 may count the data units of each feature map and distribute them to different assembling units according to the row number of each input data unit. For example, as shown in fig. 12, the data units of the odd rows of each feature map are distributed to the first assembling unit 616, and the data units of the even rows are distributed to the second assembling unit 617. In fig. 12, [k, m, n] denotes the feature value (data unit) in the mth row and nth column of the kth feature map in the three-dimensional feature matrix; the width of the feature map (the number of data units per row) is 56, and the number of feature maps is 32.
Optionally, in an embodiment of the present application, as shown in fig. 13, the first assembling unit 616 and the second assembling unit 617 each include N first-in-first-out (FIFO) queues. Optionally, each FIFO may be a dual-port FIFO implemented in random access memory (RAM).
The (p × N + i)-th data unit of the specific odd row is input into the i-th FIFO of the first assembling unit 616, and the N data units of the specific odd row in the N FIFOs of the first assembling unit are assembled into the data unit group of the predetermined size;
the (p × N + i)-th data unit of the specific even row is input into the i-th FIFO of the second assembling unit 617, and the N data units of the specific even row in the N FIFOs of the second assembling unit are assembled into the data unit group of the predetermined size, where N is a positive integer greater than 1, i is a positive integer not greater than N, and p is zero or a positive integer.
When the bit width of the output data of the multiply-accumulate unit is N data units, the multiply-accumulate unit can output N data units at a time, the N data units respectively belonging to N output feature maps. As shown in fig. 3, the multiply-accumulate unit outputs one row of data units at a time, where each data unit belongs to one output feature map, and the N data units in one row belong to the N output feature maps respectively.
In this case, the distribution module 635 is configured to distribute the N data units to the corresponding FIFOs respectively, where the (p × N + i)-th data unit of the specific odd row is distributed to the i-th FIFO of the first assembling unit 616, and the (p × N + i)-th data unit of the specific even row is distributed to the i-th FIFO of the second assembling unit 617.
For example, as shown in fig. 14, [0,0,0] is the first data unit of the first row of the first feature map, so [0,0,0] is distributed to the 1st FIFO of the first assembling unit 616; [0,0,1] is the second data unit of the first row of the first feature map, so [0,0,1] is distributed to the 2nd FIFO of the first assembling unit 616; [1,0,0] is the first data unit of the first row of the second feature map, so [1,0,0] is distributed to the 1st FIFO of the first assembling unit 616; [0,0,2] is the third data unit of the first row of the first feature map, so [0,0,2] is distributed to the 3rd FIFO of the first assembling unit 616; [1,0,1] is the second data unit of the first row of the second feature map, so [1,0,1] is distributed to the 2nd FIFO of the first assembling unit 616; [2,0,0] is the first data unit of the first row of the third feature map, so [2,0,0] is distributed to the 1st FIFO of the first assembling unit 616; and so on. After the N-th (N = 32 in fig. 14) data unit [0,0,31] of the first row of the first feature map is distributed to the 32nd FIFO of the first assembling unit 616, the 32 data units of the first row of the first feature map held in the 32 FIFOs of the first assembling unit 616, namely [0,0,0], [0,0,1], …, [0,0,31], are assembled into one data unit group.
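The (p × N + i)-th-unit-to-i-th-FIFO rule, and the packing of each full set of N FIFO heads into one group, can be sketched in Python as follows. This is a hypothetical software simulation, not the patent's hardware; the function name `pack_rows` and the small N are illustrative.

```python
from collections import deque

def pack_rows(row, n_fifos=4):
    """Distribute one row's data units into n_fifos FIFOs by column index
    (the (p*N + i)-th unit goes to FIFO i, i.e. FIFO index col % N),
    then pack each full set of N FIFO heads into one data unit group."""
    fifos = [deque() for _ in range(n_fifos)]
    groups = []
    for col, value in enumerate(row):
        fifos[col % n_fifos].append(value)      # 0-indexed FIFO = col mod N
        if col % n_fifos == n_fifos - 1:        # every N-th unit completes a group
            groups.append([f.popleft() for f in fifos])
    return groups

# An 8-element row with N = 4 yields two groups of 4 consecutive units.
print(pack_rows(list(range(8))))  # → [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The fig. 14 case corresponds to `n_fifos=32` with 32 interleaved rows (one per feature map), each filling its own set of FIFOs in turn.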
Accordingly, in this embodiment, the storage module 620 is configured to store the data unit group assembled in the N FIFOs of the first assembling unit 616 or the second assembling unit 617 into the memory.
The storage module 620 reads the assembled data unit groups from the two assembling units in turn, according to the distribution rule of the distribution module 635, and stores them into the memory. For example, in the above example, after the 32 data units of the first row of the first feature map in the 32 FIFOs of the first assembling unit 616, namely [0,0,0], [0,0,1], …, [0,0,31], are assembled into one data unit group, the storage module 620 reads out the data unit group and stores it into the memory.
Similar to the previous embodiment, in this embodiment, as shown in fig. 11, the apparatus 600 may further include a control module 640 for controlling the speed at which the multiply-accumulate unit outputs the calculation result. For brevity, details described in the foregoing embodiments are not repeated here.
In this embodiment, data unit assembly is implemented with FIFOs, which may in turn be implemented in RAM. Given the structure of the lookup table (LUT) in a field-programmable gate array (FPGA), a RAM requires fewer LUT resources than a register array of the same size, so the technical scheme of this embodiment requires fewer LUT resources.
The data storage apparatus according to the embodiments of the present application has been described above; the data storage method according to the embodiments of the present application is described below. The data storage method implements the technical solutions of the embodiments of the present application by using the data storage apparatus described above, or a device including it; related details can be found in the foregoing embodiments and, for brevity, are not repeated here.
FIG. 15 shows a schematic flow chart diagram of a method 1500 of data storage of an embodiment of the present application.
As shown in fig. 15, the method 1500 includes:
1510, obtaining a calculation result obtained by the multiply-accumulate unit through multiply-accumulate calculation, where the calculation result includes data units of at least one output feature map;
1520, assembling the data units of each of the at least one output feature map into a data unit group of a predetermined size;
1530, storing the data unit group into a memory, where the predetermined size is the size of a storage unit in the memory.
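As a rough software analogy of steps 1510-1530 (all names are hypothetical; the real apparatus is hardware and the patent does not prescribe this code), the flow could look like:

```python
def store_feature_map_outputs(mac_results, memory, group_size):
    """Sketch of method 1500: obtain MAC results (1510), assemble the
    data units of each output feature map into fixed-size groups (1520),
    and write each full group to memory (1530)."""
    buffers = {}  # one assembly buffer per output feature map index k
    for k, value in mac_results:            # 1510: each result tagged with its map
        buf = buffers.setdefault(k, [])
        buf.append(value)                   # 1520: assemble into a group
        if len(buf) == group_size:          # group size = memory storage unit size
            memory.append((k, tuple(buf)))  # 1530: one write per full group
            buf.clear()
    return memory
```

Note that only full groups are written, which is the point of the predetermined size matching the memory storage unit: every write fills exactly one storage unit.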
Optionally, in an embodiment of the present application, the data units of one output feature map are assembled into the data unit group of the predetermined size by each of N assembling units, where N is a positive integer greater than 1.
Optionally, in an embodiment of the present application, N data units output by the multiply-accumulate unit at one time are obtained, where the N data units respectively belong to N output feature maps; the method 1500 further includes: distributing the N data units to the N assembling units respectively.
Optionally, in an embodiment of the present application, data units input multiple consecutive times into each of the N assembling units are assembled into the data unit group of the predetermined size.
Optionally, in an embodiment of the present application, the predetermined size is the size of N data units; and N consecutively input data units in each of the N assembling units are assembled into the data unit group of the predetermined size.
Optionally, in an embodiment of the present application, each of the N assembling units includes a first cache, and the size of the first cache is the predetermined size; the data units of one output feature map are assembled into the data unit group of the predetermined size through the first cache; and the data unit group assembled in the first cache is stored into the memory.
Optionally, in an embodiment of the present application, each of the N assembling units includes a first cache and a second cache, the sizes of which are both the predetermined size; the data units of one output feature map are assembled into the data unit group of the predetermined size through the first cache; the assembled data unit group is buffered from the first cache into the second cache; and the assembled data unit group in the second cache is stored into the memory.
Optionally, in an embodiment of the present application, according to a polling algorithm, the assembled data unit groups are sequentially read from the second cache of each of the N assembling units and stored into the memory.
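A polling (round-robin) readout of this kind can be sketched as follows; this is illustrative Python with plain lists standing in for the second caches, not the patent's implementation.

```python
def drain_round_robin(second_caches):
    """Visit the second cache of each assembling unit in a fixed cyclic
    order, reading out at most one assembled group per visit, until all
    caches are empty. Each pass over the list is one polling round, so
    every unit gets an equal share of the memory write bandwidth."""
    memory = []
    while any(second_caches):           # stop once every cache is drained
        for cache in second_caches:     # fixed visiting order = polling
            if cache:
                memory.append(cache.pop(0))
    return memory
```

For example, with caches holding [1, 4], [2], [3], the first round reads 1, 2, 3 and the second round reads 4, interleaving the units from different assembling units in memory order.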
Optionally, in an embodiment of the present application, data units of a specific odd row are assembled into the data unit group of the predetermined size by a first assembling unit, where the specific odd row represents an odd row of each of the at least one output feature map; and data units of a specific even row are assembled into the data unit group of the predetermined size by a second assembling unit, where the specific even row represents an even row of each of the at least one output feature map.
Optionally, in an embodiment of the present application, the method 1500 further includes: distributing the data units of the specific odd row to the first assembling unit, and distributing the data units of the specific even row to the second assembling unit.
Optionally, in an embodiment of the present application, each of the first assembling unit and the second assembling unit includes N FIFO queues; the (p × N + i)-th data unit of the specific odd row is distributed to the i-th FIFO of the first assembling unit; the N data units of the specific odd row in the N FIFOs of the first assembling unit are assembled into the data unit group of the predetermined size; the (p × N + i)-th data unit of the specific even row is distributed to the i-th FIFO of the second assembling unit; and the N data units of the specific even row in the N FIFOs of the second assembling unit are assembled into the data unit group of the predetermined size, where N is a positive integer greater than 1, i is a positive integer not greater than N, and p is zero or a positive integer.
Optionally, in an embodiment of the present application, N data units output by the multiply-accumulate unit at one time are obtained, where the N data units respectively belong to N output feature maps; the (p × N + i)-th data unit of the specific odd row among the N data units is distributed to the i-th FIFO of the first assembling unit; and the (p × N + i)-th data unit of the specific even row among the N data units is distributed to the i-th FIFO of the second assembling unit.
Optionally, in an embodiment of the present application, the data unit group assembled in the N FIFOs of the first assembling unit or the second assembling unit is stored into the memory.
Optionally, in an embodiment of the present application, the method 1500 further includes: and controlling the speed of the multiply-accumulate unit to output the calculation result.
The embodiment of the application also provides a processor, which comprises a multiply-accumulate unit and the data storage device of the embodiment of the application.
The multiply-accumulate unit is used for performing multiply-accumulate calculation and outputting a calculation result to the data storage device, and the data storage device stores data into the memory by adopting the technical scheme of the embodiment of the application.
For example, the processor may be the convolution calculation device 210 in fig. 2, wherein the OFM storage module 214 may be a device for storing data according to an embodiment of the present application.
The embodiment of the present application further provides a mobile device, which may include the data storage apparatus in the embodiment of the present application; or, include the processors of the embodiments of the present application described above.
The embodiment of the present application further provides a computer storage medium, in which a program code is stored, where the program code may be used to instruct to execute the method for storing data of the embodiment of the present application.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. The foregoing description explains the components and steps of each example in general functional terms to illustrate this interchangeability of hardware and software clearly. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be considered a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the essential part of the technical solution of the present application, or the part contributing over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (30)

1. An apparatus for data storage, comprising:
an assembling module, configured to obtain a calculation result obtained by a multiply-accumulate unit through multiply-accumulate calculation, where the calculation result includes data units of at least one output feature map, and to assemble the data units of each of the at least one output feature map into a data unit group of a predetermined size; and
a storage module, configured to store the group of data units in a memory, where the predetermined size is a size of a storage unit in the memory.
2. The apparatus of claim 1, wherein the assembling module comprises N assembling units, wherein each of the N assembling units is configured to assemble data units of one output feature map into data unit groups of the predetermined size, and wherein N is a positive integer greater than 1.
3. The apparatus of claim 2, wherein the multiply-accumulate unit outputs N data units at a time, and the N data units belong to N output feature maps, respectively;
the apparatus further comprises:
a distribution module, configured to distribute the N data units to the N assembling units, respectively.
4. The apparatus of claim 3, wherein data units input multiple consecutive times into each of the N assembling units are assembled into the data unit group of the predetermined size.
5. The apparatus of claim 4, wherein the predetermined size is a size of N data units, and wherein N consecutively input data units in each of the N assembling units are assembled into a data unit group of the predetermined size.
6. The apparatus of any of claims 2 to 5, wherein each of the N assembling units comprises a first cache, the first cache being of the predetermined size;
the first cache is used for assembling data units;
the storage module is used for storing the data unit group after the first cache assembly into the memory.
7. The apparatus of any of claims 2 to 5, wherein each of the N assembling units comprises a first cache and a second cache, the first cache and the second cache each having a size of the predetermined size;
the first cache is used for assembling data units and caching the assembled data unit groups to the second cache;
the storage module is used for storing the assembled data unit group in the second cache into the memory.
8. The apparatus of claim 7, wherein the storage module is configured to, according to a polling algorithm, sequentially read the assembled data unit groups from the second cache of each of the N assembling units and store them into the memory.
9. The apparatus of claim 1, wherein the assembling module comprises a first assembling unit and a second assembling unit, wherein the first assembling unit is configured to assemble data units of a specific odd row into the data unit group of the predetermined size, and the second assembling unit is configured to assemble data units of a specific even row into the data unit group of the predetermined size, wherein the specific odd row represents an odd row of each of the at least one output feature map, and the specific even row represents an even row of each of the at least one output feature map.
10. The apparatus of claim 9, further comprising:
a distribution module, configured to distribute the data units of the specific odd row to the first assembling unit, and to distribute the data units of the specific even row to the second assembling unit.
11. The apparatus of claim 9 or 10, wherein the first assembling unit and the second assembling unit each comprise N first-in-first-out (FIFO) queues;
the (p × N + i)-th data unit of the specific odd row is input into the i-th FIFO of the first assembling unit, and the N data units of the specific odd row in the N FIFOs of the first assembling unit are assembled into the data unit group of the predetermined size; and
the (p × N + i)-th data unit of the specific even row is input into the i-th FIFO of the second assembling unit, and the N data units of the specific even row in the N FIFOs of the second assembling unit are assembled into the data unit group of the predetermined size, wherein N is a positive integer greater than 1, i is a positive integer not greater than N, and p is zero or a positive integer.
12. The apparatus of claim 11, wherein the multiply-accumulate unit outputs N data units at a time, and the N data units belong to N output feature maps, respectively;
the distribution module is configured to distribute the N data units to the corresponding FIFOs, respectively, wherein the (p × N + i)-th data unit of the specific odd row is distributed to the i-th FIFO of the first assembling unit, and the (p × N + i)-th data unit of the specific even row is distributed to the i-th FIFO of the second assembling unit.
13. The apparatus of claim 11 or 12, wherein the storage module is configured to store the data unit group assembled in the N FIFOs of the first assembling unit or the second assembling unit into the memory.
14. The apparatus of any one of claims 2 to 13, further comprising:
and the control module is used for controlling the speed of the multiply-accumulate unit for outputting the calculation result.
15. A method of data storage, comprising:
obtaining a calculation result obtained by a multiply-accumulate unit through multiply-accumulate calculation, wherein the calculation result comprises data units of at least one output feature map;
assembling the data units of each of the at least one output feature map into a data unit group of a predetermined size; and
storing the group of data cells into a memory, wherein the predetermined size is a size of a storage cell in the memory.
16. The method of claim 15, wherein the assembling the data units of each of the at least one output feature map into a data unit group of a predetermined size comprises:
assembling data units of one output feature map into the data unit group of the predetermined size by each of N assembling units, wherein N is a positive integer greater than 1.
17. The method of claim 16, wherein the obtaining the calculation result obtained by the multiply-accumulate unit through multiply-accumulate calculation comprises:
obtaining N data units output by the multiply-accumulate unit at a time, wherein the N data units respectively belong to N output feature maps;
the method further comprising:
distributing the N data units to the N assembling units, respectively.
18. The method of claim 17, wherein the assembling data units of one output feature map into the data unit group of the predetermined size by each of the N assembling units comprises:
assembling data units input multiple consecutive times into each of the N assembling units into the data unit group of the predetermined size.
19. The method of claim 18, wherein the predetermined size is a size of N data units;
the assembling data units input multiple consecutive times into each of the N assembling units into the data unit group of the predetermined size comprises:
assembling N consecutively input data units in each of the N assembling units into the data unit group of the predetermined size.
20. The method of any of claims 16 to 19, wherein each of the N assembling units comprises a first cache, the first cache having a size of the predetermined size;
the assembling data units of one output feature map into the data unit group of the predetermined size by each of the N assembling units comprises:
assembling data units of one output feature map into the data unit group of the predetermined size through the first cache;
the storing the group of data units into a memory comprises:
and storing the data unit group after the first cache assembly into the memory.
21. The method of any of claims 16 to 19, wherein each of the N assembling units comprises a first cache and a second cache, the first cache and the second cache each having a size of the predetermined size;
the assembling data units of one output feature map into the data unit group of the predetermined size by each of the N assembling units comprises:
assembling data units of one output feature map into the data unit group of the predetermined size through the first cache;
the method further comprises the following steps:
caching the data unit group assembled by the first cache to the second cache;
the storing the group of data units into a memory comprises:
and storing the assembled data unit group in the second cache into the memory.
22. The method of claim 21, wherein storing the assembled group of data units in the second cache into the memory comprises:
and according to a polling algorithm, sequentially reading the assembled data unit groups from the second cache of each assembling unit in the N assembling units and storing the assembled data unit groups into the memory.
23. The method of claim 15, wherein the assembling the data units of each of the at least one output feature map into a data unit group of a predetermined size comprises:
assembling, by a first assembling unit, data units of a specific odd row into the data unit group of the predetermined size, wherein the specific odd row represents an odd row of each of the at least one output feature map; and
assembling, by a second assembling unit, data units of a specific even row into the data unit group of the predetermined size, wherein the specific even row represents an even row of each of the at least one output feature map.
24. The method of claim 23, further comprising:
distributing the data units of the specific odd row to the first assembling unit, and distributing the data units of the specific even row to the second assembling unit.
25. The method of claim 24, wherein the first assembling unit and the second assembling unit each comprise N first-in-first-out (FIFO) queues;
the distributing the data units of the specific odd row to the first assembling unit comprises:
distributing the (p × N + i)-th data unit of the specific odd row to the i-th FIFO of the first assembling unit;
the assembling, by the first assembling unit, the data units of the specific odd row into the data unit group of the predetermined size comprises:
assembling the N data units of the specific odd row in the N FIFOs of the first assembling unit into the data unit group of the predetermined size;
the distributing the data units of the specific even row to the second assembling unit comprises:
distributing the (p × N + i)-th data unit of the specific even row to the i-th FIFO of the second assembling unit; and
the assembling, by the second assembling unit, the data units of the specific even row into the data unit group of the predetermined size comprises:
assembling the N data units of the specific even row in the N FIFOs of the second assembling unit into the data unit group of the predetermined size, wherein N is a positive integer greater than 1, i is a positive integer not greater than N, and p is zero or a positive integer.
26. The method of claim 25, wherein the obtaining the calculation result obtained by the multiply-accumulate unit through multiply-accumulate calculation comprises:
obtaining N data units output by the multiply-accumulate unit at a time, wherein the N data units respectively belong to N output feature maps;
the distributing the (p × N + i)-th data unit of the specific odd row to the i-th FIFO of the first assembling unit comprises:
distributing the (p × N + i)-th data unit of the specific odd row among the N data units to the i-th FIFO of the first assembling unit; and
the distributing the (p × N + i)-th data unit of the specific even row to the i-th FIFO of the second assembling unit comprises:
distributing the (p × N + i)-th data unit of the specific even row among the N data units to the i-th FIFO of the second assembling unit.
27. The method of claim 25 or 26, wherein storing the group of data units in a memory comprises:
and storing the data unit groups assembled in the N FIFOs of the first assembling unit or the second assembling unit into the memory.
28. The method according to any one of claims 16 to 27, further comprising:
and controlling the speed of the multiply-accumulate unit to output the calculation result.
29. A processor comprising a multiply accumulate unit and an apparatus for data storage according to any of claims 1 to 14.
30. A mobile device, comprising:
an apparatus for data storage according to any one of claims 1 to 14; or
the processor of claim 29.
CN201880040193.XA 2018-10-08 2018-10-08 Data storage device, method, processor and removable equipment Pending CN110770763A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109327 WO2020073164A1 (en) 2018-10-08 2018-10-08 Data storage apparatus and method, and processor and removable device

Publications (1)

Publication Number Publication Date
CN110770763A true CN110770763A (en) 2020-02-07

Family

ID=69328581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880040193.XA Pending CN110770763A (en) 2018-10-08 2018-10-08 Data storage device, method, processor and removable equipment

Country Status (2)

Country Link
CN (1) CN110770763A (en)
WO (1) WO2020073164A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120106860A1 (en) * 2010-10-29 2012-05-03 Altek Corporation Image processing device and image processing method
CN108205702A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution accelerates and computation processing method, device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260773B (en) * 2015-09-18 2018-01-12 华为技术有限公司 A kind of image processing apparatus and image processing method
KR102499396B1 (en) * 2017-03-03 2023-02-13 삼성전자 주식회사 Neural network device and operating method of neural network device
CN107844826B (en) * 2017-10-30 2020-07-31 中国科学院计算技术研究所 Neural network processing unit and processing system comprising same

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120106860A1 (en) * 2010-10-29 2012-05-03 Altek Corporation Image processing device and image processing method
CN108229645A (en) * 2017-04-28 2018-06-29 北京市商汤科技开发有限公司 Convolution accelerates and computation processing method, device, electronic equipment and storage medium
CN108205702A (en) * 2017-12-29 2018-06-26 中国人民解放军国防科技大学 Parallel processing method for multi-input multi-output matrix convolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
鲍贤亮 (Bao Xianliang): "Design and Implementation of a High-Performance Dedicated CNN Convolution Accelerator" *

Also Published As

Publication number Publication date
WO2020073164A1 (en) 2020-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200207