CN117908831A - Data processing method, processing array and processing device - Google Patents

Data processing method, processing array and processing device

Info

Publication number
CN117908831A
CN117908831A CN202410089553.2A
Authority
CN
China
Prior art keywords
data
vector
acquired
processing
data element
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410089553.2A
Other languages
Chinese (zh)
Inventor
张瑞凯
孙福海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Eswin Computing Technology Co Ltd filed Critical Beijing Eswin Computing Technology Co Ltd
Priority to CN202410089553.2A priority Critical patent/CN117908831A/en
Publication of CN117908831A publication Critical patent/CN117908831A/en
Pending legal-status Critical Current


Landscapes

  • Complex Calculations (AREA)

Abstract

A data processing method, a processing array, and a processing device for neural network computation are provided. The data processing method is used for performing a multiply-add operation on at least one first vector and a second vector, and comprises the following steps: in each operation cycle, first data elements are acquired from the first vector one by one, and the acquired first data element is multiplied in parallel with a plurality of second data elements in the second vector in response to the first data element acquired from the first vector being non-0 data, or the multiplication operation for the acquired first data element is skipped in response to the first data element acquired from the first vector being 0. The data processing method can make full use of sparse input feature maps and achieves computation acceleration while keeping the design simple.

Description

Data processing method, processing array and processing device
Technical Field
Embodiments of the present disclosure relate to a data processing method, a processing array, and a processing apparatus for neural network computation.
Background
In neural network algorithms, a large number of 0s are typically generated. These 0s have no effect on the result of a multiply-add operation, yet they still consume operation time and occupy computing resources. Therefore, skipping 0s in neural network computation can reduce the amount of computation without decreasing accuracy. How to make full use of the 0s in data so as to increase operation speed and reduce dynamic power consumption has attracted wide attention.
Disclosure of Invention
At least some embodiments of the present disclosure provide a data processing method for neural network computation, for performing a multiply-add operation on at least one first vector and a second vector, the method comprising: in each operation cycle, acquiring first data elements from the first vector one by one, and, in response to the first data element acquired from the first vector being non-0 data, multiplying the acquired first data element in parallel with a plurality of second data elements in the second vector, or, in response to the first data element acquired from the first vector being 0, skipping the multiplication operation for the acquired first data element.
For example, in a data processing method provided in at least some embodiments of the present disclosure, in response to a first data element acquired from the first vector being non-0 data, performing a multiplication operation on the acquired first data element and a plurality of second data elements in the second vector in parallel includes: and in response to the first data element acquired from the first vector being non-0 data, performing a multiplication operation on the acquired first data element and all second data elements in the second vector in parallel.
For example, in a data processing method provided in at least some embodiments of the present disclosure, in response to a first data element acquired from the first vector being non-0 data, performing a multiplication operation on the acquired first data element and a plurality of second data elements in the second vector in parallel includes: in response to the data obtained from the first vector being non-0 data, the non-0 data is sent to a processing module to perform a multiply-add operation of the first vector and the second vector by the processing module, wherein the processing module includes a plurality of multiplication processing subunits, each of which correspondingly receives one second data element to perform a multiply operation.
For example, in a data processing method provided in at least some embodiments of the present disclosure, the first vector is one of the parallel vectors in the input feature map of the neural network computation, and the second vector is one of the parallel vectors in the weight map of the neural network computation.
For example, a data processing method provided in at least some embodiments of the present disclosure further includes: caching each first vector in its entirety at a single time, and acquiring the first data elements from the first vector one by one includes: acquiring the first data elements one by one from the cached first vector.
For example, in one data processing method provided by at least some embodiments of the present disclosure, at least one first vector includes a plurality of first vectors, each of which is multiplied and added in parallel with the second vector, respectively.
For example, at least some embodiments of the present disclosure provide a data processing array for neural network computation, including at least one input module and at least one processing module, wherein the input module is configured to acquire first data elements from the first vector one by one to input the first vector in each operation cycle, and the processing modules are each configured to perform a multiplication operation on the acquired first data element in parallel with a plurality of second data elements in the second vector in response to the first data element acquired from the first vector being non-0 data, or skip the multiplication operation on the acquired first data element in response to the first data element acquired from the first vector being 0 to perform a multiplication-addition operation on at least one first vector and a second vector.
In a data processing array provided by at least some embodiments of the present disclosure, the input module includes a buffer module configured to buffer an entire first vector at a single time.
In a data processing array provided in at least some embodiments of the present disclosure, the processing module includes a plurality of multiplication processing subunits, each configured to multiply-add a corresponding one of the second data elements with the acquired first data element.
In a data processing array provided in at least some embodiments of the present disclosure, at least one processing module includes a plurality of processing modules, at least one input module includes a plurality of input modules respectively corresponding to the plurality of processing modules, the plurality of input modules are configured to respectively input a plurality of first vectors, and the plurality of processing modules are configured to respectively multiply and add the plurality of first vectors with the second vectors in parallel.
At least some embodiments of the present disclosure provide a data processing apparatus for neural network computation, including a processing array provided by embodiments of the present disclosure.
At least some embodiments of the present disclosure provide a data processing apparatus for neural network computation, comprising: a processor and a memory having one or more computer program modules stored thereon; wherein the one or more computer program modules are configured to, when executed by the processor, perform the data processing methods provided by the embodiments of the present disclosure.
At least some embodiments of the present disclosure also provide a non-transitory storage medium that non-transitory stores computer readable instructions, wherein the computer readable instructions, when executed by a computer, perform the data processing method provided by the embodiments of the present disclosure.
At least some embodiments of the present disclosure also provide a computer program product comprising a computer program/instructions, wherein the computer program/instructions, when executed by a processor, implement the data processing method provided by the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1A shows a schematic diagram of a convolution layer performing a multi-channel convolution operation;
FIG. 1B shows a schematic architecture diagram of a neural network processor;
FIG. 1C is a schematic block diagram of a processing unit performing data processing;
FIG. 2 is a flow chart of a data processing method for neural network computation provided in some embodiments of the present disclosure;
FIG. 3 is a schematic diagram of a processing unit provided by some embodiments of the present disclosure;
FIG. 4A is a schematic diagram of a data processing process in an nth operation cycle provided by some embodiments of the present disclosure;
FIG. 4B is a schematic diagram of a data processing process in an (n+1)th operation cycle provided by some embodiments of the present disclosure;
FIG. 4C is a schematic diagram of a data processing process in an (n+2)th operation cycle provided by some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of a data processing array provided in some embodiments of the present disclosure;
FIG. 6 is a schematic structural diagram of an electronic device according to some embodiments of the present disclosure; and
FIG. 7 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
The present disclosure is illustrated by the following several specific examples. Detailed descriptions of known functions and known parts (elements) may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure. When any part (element) of an embodiment of the present disclosure appears in more than one drawing, the part (element) is denoted by the same or similar reference numeral in each drawing.
A neural network is a mathematical computation model inspired by the structure of brain neurons and the principle of neural conduction, and a mode of realizing intelligent computation based on this model is called brain-inspired computing. For example, neural networks include various forms of network structures, such as a back propagation (BP) neural network, a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), etc.; for example, the convolutional neural network may be further subdivided into a fully convolutional network, a deep convolutional network, a U-Net, and the like.
For example, a common convolutional neural network typically includes an input, an output, and a plurality of processing layers. For example, the input is used to receive data to be processed, such as an image to be processed, and the output is used to output a processing result, such as a processed image. The processing layers may include a convolution layer, a pooling layer, a batch normalization (BN) layer, a fully connected layer, and the like, and may include different contents and combinations depending on the structure of the convolutional neural network. After input data is fed into the convolutional neural network, the corresponding output is obtained through the plurality of processing layers; for example, the input data may undergo operations such as convolution, up-sampling, down-sampling, normalization, full connection, and flattening through the plurality of processing layers.
The convolution layer is the core layer of a convolutional neural network; it applies several filters to the input data (an input image or input feature map) to perform various types of feature extraction. The result obtained after applying one filter to the input data is called a feature map, and the number of feature maps is equal to the number of filters. The feature map output by one convolution layer may be input to the next convolution layer for further processing to obtain a new feature map.
FIG. 1A shows a schematic diagram of a convolution layer performing a multi-channel convolution operation. As shown in FIG. 1A, a convolution operation is performed on N input images (or input feature maps) of size H×W with C channels, using M convolution kernels of size R×S with C channels, to obtain N output feature maps of size E×F with M channels, so that the output feature maps collectively span the N, E, F, and M dimensions. The convolution operation has the characteristics of high parallelism and high data reuse; the high parallelism is embodied in that a plurality of convolution kernels can be operated on simultaneously with a plurality of input feature maps.
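As a concrete illustration of these dimensions, the following sketch (ours, not part of the patent; parameter names follow the figure) computes the output feature map shape under the assumptions of unit stride and no padding:

# Illustrative shape bookkeeping for the multi-channel convolution of FIG. 1A,
# assuming unit stride and no padding (these assumptions are ours).
def conv_output_shape(N, C, H, W, M, R, S, stride=1, padding=0):
    """N input maps of C x H x W convolved with M kernels of C x R x S."""
    E = (H + 2 * padding - R) // stride + 1   # output height
    F = (W + 2 * padding - S) // stride + 1   # output width
    return (N, M, E, F)                       # N maps, each with M channels of size E x F

# Example: 4 inputs of 3 x 32 x 32 with 8 kernels of 3 x 3 x 3 -> (4, 8, 30, 30).
# C only has to match between input and kernel; it does not appear in the output shape.
print(conv_output_shape(N=4, C=3, H=32, W=32, M=8, R=3, S=3))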
Since the computational effort of the neural network, especially for convolutional layers with large-sized input feature maps, is very large, it is often necessary to decompose the computational operations of one convolutional layer in the neural network. For example, the convolution operations for different parts of the same convolution layer may be performed independently of each other, and the decomposed tasks are submitted to a plurality of processing units to perform computation in parallel, and then the computation results of the processing units are combined to obtain the computation result of the whole convolution layer, and then the computation result of the network layer may be used as the input of the next convolution layer.
Neural network processors (Neural-network Processing Units, NPUs) are a class of microprocessors or computing systems dedicated to hardware acceleration of artificial intelligence (in particular artificial neural networks, machine vision, machine learning, etc.), and are sometimes referred to as artificial intelligence accelerators (AI accelerators).
FIG. 1B shows a schematic architecture diagram of a neural network processor. As shown in FIG. 1B, the neural network processor 100 includes a processing unit (PE) array 110, a global cache 120, and a memory 130. The processing unit array 110 includes a plurality of rows and columns (e.g., 12 rows by 12 columns) of processing units that are coupled to each other and share the global cache 120 through on-chip interconnects, such as a network on chip (NoC). Each processing unit has a computing function, for example comprising a multiply-accumulator (MAC), and may also have its own local cache, for example a cache or register array for caching input vectors (or matrices). Each PE can access the other PEs around it, its own local cache, and the global cache. The global cache 120 is further coupled to the memory 130 by, for example, a bus.
In operation, data required for computation by a network layer (e.g., a convolution layer), such as the convolution kernels (Flt) and the input feature map (Ifm), is read from the memory 130 into the global cache 120; the convolution kernels (Flt), input feature map (Ifm), etc. are then fed from the global cache 120 to the processing unit array 110 for computation, and the computation tasks for different image pixels are assigned (i.e., mapped) to different processing units. The partial sums (Psum1) generated during the computation are temporarily stored in the global cache; if a further accumulation operation on previously generated partial sums is required in subsequent computation, the required partial sums (Psum2) may be read from the global cache 120 so as to perform the operation on the processing unit array 110. The output feature map (Ofm) obtained on completion of the operations of one convolution layer may be output from the global cache 120 to the memory 130 for storage, e.g., for use in the computation of the next network layer (e.g., the next convolution layer).
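As a rough software analogy of this flow (our toy model, not the patent's dataflow; the function and variable names are invented), each pass over the PE array produces a partial contribution that is accumulated with any previously buffered partial sum:

# Toy model of the partial-sum flow described above (an assumption-laden sketch):
# a layer is split into passes; each pass yields a partial contribution (Psum1)
# that is accumulated with any previously buffered partial sum (Psum2).
def run_layer(passes, buffered_psum=None):
    psum = buffered_psum                      # Psum2: previously buffered partial sum, if any
    for flt, ifm in passes:                   # each pass maps part of the layer onto the PE array
        contribution = sum(f * x for f, x in zip(flt, ifm))  # stand-in for the PE-array result
        psum = contribution if psum is None else psum + contribution  # accumulate Psum1
    return psum                               # final output feature map (Ofm) value

print(run_layer([([1, 2], [3, 4]), ([5, 6], [7, 8])]))  # (1*3 + 2*4) + (5*7 + 6*8) = 94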
FIG. 1C shows a schematic diagram of a processing unit (processing element, PE for short) for a processing layer of a neural network, which is used for data processing. Input data, such as an input feature map (ifmap for short), is fed into the processing unit shown in FIG. 1C, is operated on with, for example, weight data in the processing module, and the computation result is output.
In neural network algorithms, the 0s in the data can be exploited to reduce dynamic power consumption. The inventors of the present disclosure have noted that an enable signal may be provided to the multiply-add unit as a start/enable indication, and a non-0 check may be performed on the data before the multiply-add operation is executed; when the data is detected to be 0, the corresponding enable bit (en bit) in the pipeline is pulled low, so that the multiply-add unit does not toggle while the pipeline continues to operate as usual.
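A behavioral sketch of this enable-gating idea is given below (our simplified model, not the patent's circuit; the pipeline enable bit is modeled as a plain conditional):

# Behavioral model of zero-gating a multiply-accumulate (MAC) step:
# the enable bit is derived from a non-0 check, and the multiplier is
# only exercised when the enable bit is high.
def gated_mac_cycle(acc, a, b):
    en = (a != 0)          # enable bit from the non-0 check on the incoming data
    if en:                 # the multiply-add unit only toggles when enabled
        acc = acc + a * b
    return acc             # the pipeline result advances either way

acc = 0
for a, b in [(3, 2), (0, 5), (4, 1)]:   # the (0, 5) pair is gated off
    acc = gated_mac_cycle(acc, a, b)
print(acc)                               # 3*2 + 4*1 = 10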
Accelerating operations by exploiting the 0s in the data is relatively complex, and very few AI accelerators support this function. Such acceleration can be divided into three approaches:
① Acceleration is realized by using 0 in the weight, for example, doubling of the operation speed can be realized;
② Acceleration is achieved by utilizing 0 in the input feature diagram;
③ Acceleration is achieved with 0 in the input feature map and weights at the same time.
Limited by the sparse and unpredictable nature of the input feature map, achieving acceleration using the 0s in both the input feature map and the weights dramatically increases the complexity of the AI accelerator, so this type of AI accelerator product is also very rare on the market.
The inventors of the present disclosure have attempted to implement, in a neural network processor (Neural Processing Unit, NPU) product, an acceleration method that simultaneously exploits the 0s in the input feature map and in the weights, and have found that the method leads to a significant increase in the complexity of input feature map fetching, weight fetching, and the accumulation relationships.
The embodiment of the disclosure provides a data processing method and a corresponding data processing device for neural network calculation, which are used for multiplying and adding at least one first vector and a second vector, wherein the data processing method comprises the following steps: in each operation period, first data elements are acquired from the first vector one by one, and the acquired first data elements are multiplied in parallel with a plurality of second data elements in the second vector in response to the first data elements acquired from the first vector being non-0 data, or the multiplication operation for the acquired first data elements is skipped in response to the first data elements acquired from the first vector being 0.
The data processing method provided by the embodiments of the present disclosure can make full use of a sparse input feature map, strike a balance between computational complexity and computation speed, and achieve computation acceleration while keeping the design simple.
As shown in fig. 2, the data processing method provided in some embodiments of the present disclosure includes steps S100 to S200.
Step S100, in each operation cycle, the first data elements are acquired from the first vector one by one.
In step S200, in response to the first data element acquired from the first vector being non-0 data, the acquired first data element is multiplied in parallel with the plurality of second data elements in the second vector, or in response to the first data element acquired from the first vector being 0, the multiplication operation for the acquired first data element is skipped.
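A minimal software model of steps S100 and S200 is sketched below (the names and the per-subunit accumulation are our illustrative simplifications, not a definitive statement of the patented hardware):

# Sketch of steps S100-S200: fetch first data elements one by one; skip 0 elements;
# multiply a non-0 element with all second data elements "in parallel" (a loop here).
def multiply_add(first_vector, second_vector):
    partial_sums = [0] * len(second_vector)        # one accumulator per multiplication subunit
    for a in first_vector:                         # S100: acquire first data elements one by one
        if a == 0:
            continue                               # S200: element is 0 -> skip the multiplication
        for j, w in enumerate(second_vector):      # S200: non-0 -> multiply with every second element
            partial_sums[j] += a * w
    return partial_sums

print(multiply_add([2, 0, 3], [1, 2, 3, 4]))       # [5, 10, 15, 20]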
It is noted that "first vector" and "second vector" herein are used to refer to a vector currently being the object of description, and may be any vector to which the embodiments of the present disclosure apply; similarly, the first data element and the second data element are used to refer to data elements in the first vector and the second vector, respectively, that are currently the object of description.
The above steps S100 to S200 will be described below with reference to FIG. 3 and FIGS. 4A to 4C. FIG. 3 illustrates a schematic diagram of a processing unit provided in some embodiments of the present disclosure. The processing unit 10 shown in FIG. 3 comprises an input module 20 and a processing module 30. For convenience of explanation, FIG. 3 is simplified to show only one input module and one processing module; a plurality of input modules and processing modules may be obtained according to the configuration shown in FIG. 3.
In the embodiment illustrated in FIG. 3, in each operation cycle the first data element may be obtained from the first vector by the input module so as to input the first vector to the processing module. The first vector may be one of the parallel vectors in the input feature map of the neural network computation, and the vectors in the input feature map may include the first data elements. The processing module multiplies the acquired first data element in parallel with a plurality of second data elements in the second vector in response to the first data element acquired from the first vector being non-0 data, or skips the multiplication operation for the acquired first data element in response to the first data element acquired from the first vector being 0, so as to perform the multiply-add operation on at least one first vector and the second vector. In this embodiment, the second vector may be one of the parallel vectors in the weight map of the neural network computation.
For example, the processing modules shown in FIG. 3 and FIGS. 4A to 4C include 8 independent multipliers for performing the multiply-add operation on the first vector and the second vector. It should be understood that the number of multipliers in embodiments of the present disclosure may be 16, 24, etc.; the 8 multipliers are merely an example for illustration, and the number of elements in a row or a column of the first vector and the second vector in the computation is not limited to the number in the examples. These should not be construed as limiting the present disclosure.
As shown in FIG. 4A, in the nth operation cycle (n is a positive integer), take as an example a first vector of the input feature map that includes 7 data elements; in the figure, white squares represent data 0 and gray squares represent non-0 data. Embodiments of the present disclosure are not limited as to the data type; for example, the data type may be Int8, BF16, FP32, and the like.
In FIG. 4A, the first data element in the first vector is, for example, the first non-0 data element closest to the multipliers and the second vector in the figure, and step S200 may include step S210:
In step S210, in response to the first data element acquired from the first vector being non-0 data, the acquired first data element and all the second data elements in the second vector are multiplied in parallel.
The parallel multiplication operation is performed on the non-0 first data element and all the second data elements in the second vector according to step S210.
In the embodiment shown in FIG. 4A, the second vector comprises, for example, 8 weight data elements, and the processing module correspondingly comprises, for example, 8 multiplication subunits, each comprising a multiplier and optionally an adder, thereby forming a multiply-accumulator (MAC). Step S210 may further comprise step S211:
In step S211, in response to the data obtained from the first vector being non-0 data, the non-0 data is sent to the processing module, so that the processing module performs a multiply-add operation on the first vector and the second vector, where the processing module includes a plurality of multiplication processing subunits, each of which correspondingly receives one second data element for performing a multiply operation.
According to step S211, the non-0 first data element is sent to the processing module, and a second data element is correspondingly received by each of the 8 multiplication sub-units of the processing module for multiplication.
Next, in the (n+1)th operation cycle shown in FIG. 4B, since the second data element following the first non-0 data element is 0, the multiplication operation for this 0 data element is skipped and, according to step S200, the multiply-add operation is performed on the third data element, which is non-0, and the second vector.
As shown in FIG. 4B, since the fourth and fifth data elements following the non-0 third data element are both 0, the multiplication operations for these two 0 data elements are likewise skipped according to step S200.
In the (n+2)th operation cycle shown in FIG. 4C, the multiply-add operation is performed on the sixth data element, which is non-0, and the second vector.
In order to simplify the 0-skipping operation, the data processing method provided in the above embodiments of the present disclosure sends only one data element in each operation cycle. This is because the inventors of the present disclosure found that, if too many data elements are acquired and fed in one operation cycle, the design complexity becomes higher, and in some scenarios the 0-skipping operation cannot even be achieved, which also results in a decrease in computational efficiency.
The data processing method provided in the above embodiment may further include step S300: caching each first vector in its entirety at a single time. Acquiring the first data elements from the first vector one by one then includes: acquiring the first data elements one by one from the cached first vector.
Some embodiments of the present disclosure further provide a data processing array for neural network computation. The data processing array 200 shown in FIG. 5 includes 3 processing units 10; for example, the data processed by each processing unit may be output for subsequent processing such as addition, and each processing unit includes an input module 20 and a processing module 30.
It should be noted that the 3 processing units in fig. 5 are identical and are only used as an example, and the disclosure is not limited thereto, and the number of processing units, the number of input modules, and the number of processing modules are also only used as an example, and should not be construed as limiting the disclosure.
The data processing array provided in at least one embodiment of the present disclosure includes at least one input module and at least one processing module, wherein the input module is configured to acquire first data elements from first vectors one by one to input the first vectors in each operation cycle, and the processing modules are each configured to perform a multiplication operation on the acquired first data elements in parallel with a plurality of second data elements in the second vectors in response to the first data elements acquired from the first vectors being non-0 data, or to skip the multiplication operation on the acquired first data elements in response to the first data elements acquired from the first vectors being 0 to perform a multiplication-addition operation on the at least one first vectors and the second vectors. For the data processing method performed by the data processing array, reference may be made to the data processing method of a single processing unit shown in fig. 4A to 4C.
The processing array provided by the above embodiments may further include a buffer module (not shown in FIG. 5) configured to buffer an entire first vector at a single time.
In the above processing array, the processing module may include a plurality of multiplication processing subunits, each configured to multiply and add a corresponding one of the second data elements with the acquired first data element, for which reference may be made to fig. 3 or fig. 4A to fig. 4C.
In the data processing array provided in the above embodiments, only one data element is sent to the processing module of a processing unit in each operation cycle, which simplifies the computation of input feature map addresses and the access to the buffer module; meanwhile, the addressing of the second vector, e.g., the weights, depends only on this single input feature map element, which simplifies buffer module access and second vector addressing, so that the width of the second vector is not limited. At the same time, the arrangement order of the second vector is unchanged and the accumulation relationship of the multiply-add units is fixed, so that the design of the multiply-add units remains simple and clear.
In the processing array, the at least one processing module includes a plurality of processing modules, the at least one input module includes a plurality of input modules respectively corresponding to the plurality of processing modules, the plurality of input modules are configured to respectively input a plurality of first vectors, and the plurality of processing modules are configured to respectively perform multiplication and addition operations on the plurality of first vectors and the second vectors in parallel.
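The following sketch (ours; the class and the value choices are illustrative assumptions) mirrors this organization: several processing units, each pairing an input module that streams one first vector while skipping 0s with a processing module that broadcasts each non-0 element to its multiplication subunits, all sharing the same second vector.

# Simplified model of the array of FIG. 5: one ProcessingUnit per first vector,
# all units holding the same second vector (e.g., the weights).
class ProcessingUnit:
    def __init__(self, second_vector):
        self.weights = list(second_vector)          # one second data element per subunit

    def run(self, first_vector):
        acc = [0] * len(self.weights)               # per-subunit accumulators (simplified)
        for a in first_vector:                      # input module: one element per operation cycle
            if a == 0:
                continue                            # 0 elements never reach the multipliers
            for j, w in enumerate(self.weights):    # processing module: parallel multiplications
                acc[j] += a * w
        return acc

second_vector = [1, 2, 3, 4, 5, 6, 7, 8]
first_vectors = [[5, 0, 7], [0, 0, 2], [1, 4, 0]]   # one first vector per processing unit
array = [ProcessingUnit(second_vector) for _ in first_vectors]
print([pe.run(v) for pe, v in zip(array, first_vectors)])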
Some embodiments of the present disclosure provide a data processing apparatus for neural network computation, including a processing array provided by embodiments of the present disclosure.
Some embodiments of the present disclosure provide a data processing apparatus for neural network computation, comprising a processor and a memory, the memory having stored thereon one or more computer program modules; the one or more computer program modules are configured to perform, when executed by the processor, the data processing methods provided by the embodiments of the present disclosure.
Some embodiments of the present disclosure further provide an electronic device, which includes the data processing apparatus for neural network computation.
FIG. 6 shows an electronic device provided in some embodiments of the present disclosure. As shown in FIG. 6, the electronic device 300 is used, for example, to implement a data processing method provided in any embodiment of the present disclosure. For example, the electronic device 300 may be a personal computer, a notebook computer, a tablet computer, a mobile phone, or a terminal device such as a workstation, a server, or a cloud service. It should be noted that the electronic device 300 shown in FIG. 6 is merely an example and does not impose any limitation on the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 300 may include a processing means (e.g., one or more central processing units, one or more graphics processors, etc., the data processing means described above) 310, which may perform various suitable actions and processes according to programs stored in a Read Only Memory (ROM) 320 or programs loaded from a storage 380 into a Random Access Memory (RAM) 330. In the RAM 330, various executable programs and data required for the operation of the electronic device 300 are also stored. The processing device 310, the ROM 320, and the RAM 330 are connected to each other by a bus 340. An input/output (I/O) interface 350 is also connected to bus 340.
In general, the following devices may be connected to the I/O interface 350: input devices 360 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, or gyroscope; an output device 370 including, for example, a Liquid Crystal Display (LCD), a speaker, or a vibrator; storage 380 including, for example, magnetic tape, hard disk, etc.; and a communication device 390. The communication device 390 may allow the electronic apparatus 300 to communicate wirelessly or by wire with other electronic apparatuses to exchange data. While fig. 6 illustrates an electronic device 300 including various means, it is to be understood that not all illustrated means are required to be implemented or provided, and that electronic device 300 may alternatively be implemented or provided with more or fewer means.
For example, according to embodiments of the present disclosure, the above-described data processing method may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program, carried on a non-transitory computer readable medium, the computer program comprising program code for performing the above-described data processing method. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 390, or from storage device 380, or from ROM 320. The functions defined in the data processing method provided by the embodiments of the present disclosure may be performed when the computer program is executed by the processing device 310.
Some embodiments of the present disclosure also provide a storage medium that non-transitorily stores computer program executable code (e.g., computer executable instructions) which, when executed by a computer (e.g., comprising one or more processors), can implement the data processing method of any of the embodiments of the present disclosure.
FIG. 7 is a schematic diagram of a storage medium according to some embodiments of the present disclosure. As shown in FIG. 7, the storage medium 400 non-transitorily stores computer program executable code 401. For example, the computer program executable code 401, when executed by a computer (e.g., comprising one or more processors), may perform a data processing method provided in accordance with embodiments of the present disclosure.
For example, the storage medium 400 may be applied to the above-described data processing apparatus. For another example, the storage medium 400 may be the memory 320 in the electronic device 300 shown in fig. 6. For example, the relevant description of the storage medium 400 may refer to the corresponding description of the memory 320 in the electronic device 300 shown in fig. 6, and will not be repeated here.
While the disclosure has been described in detail with respect to the general description and the specific embodiments thereof, it will be apparent to those skilled in the art that certain modifications and improvements may be made thereto based on the embodiments of the disclosure. Accordingly, such modifications or improvements may be made without departing from the spirit of the disclosure and are intended to be within the scope of the disclosure as claimed.
For the present disclosure, in addition to the above exemplary description, the following points need to be explained:
(1) The drawings of the embodiments of the present disclosure relate only to the structures related to the embodiments of the present disclosure, and other structures may refer to the general design.
(2) In the drawings for describing embodiments of the present disclosure, the thickness of layers or regions is exaggerated or reduced for clarity, i.e., the drawings are not drawn to actual scale.
(3) The embodiments of the present disclosure and features in the embodiments may be combined with each other to arrive at a new embodiment without conflict.
The foregoing is merely specific embodiments of the disclosure, but the scope of the disclosure is not limited thereto, and the scope of the disclosure should be determined by the claims.

Claims (14)

1. A data processing method for neural network computation for multiplying and adding at least one first vector with a second vector, the method comprising:
In each operation cycle, the first data elements are acquired from the first vector one by one, and
And in response to the first data element acquired from the first vector being non-0 data, performing a multiplication operation on the acquired first data element and a plurality of second data elements in the second vector in parallel, or in response to the first data element acquired from the first vector being 0, skipping the multiplication operation on the acquired first data element.
2. The data processing method of claim 1, wherein in response to the first data element acquired from the first vector being non-0 data, multiplying the acquired first data element with a plurality of second data elements in the second vector in parallel comprises:
and in response to the first data element acquired from the first vector being non-0 data, performing a multiplication operation on the acquired first data element and all second data elements in the second vector in parallel.
3. The data processing method of claim 2, wherein in response to the first data element acquired from the first vector being non-0 data, multiplying the acquired first data element with a plurality of second data elements in the second vector in parallel, comprises:
in response to the data obtained from the first vector being non-0 data, the non-0 data is sent to a processing module to perform a multiply-add operation of the first vector and the second vector by the processing module, wherein the processing module includes a plurality of multiplication processing subunits, each of which correspondingly receives one second data element to perform a multiply operation.
4. The data processing method according to claim 1, wherein the first vector is one of vectors juxtaposed in an input feature map calculated by the neural network;
the second vector is one of parallel vectors in the weight graph calculated by the neural network.
5. The data processing method of claim 1, further comprising: caching each first vector in its entirety at a single time,
wherein acquiring the first data elements from the first vector one by one comprises:
the first data elements are acquired from the cached first vector one by one.
6. The data processing method of claim 1, wherein the at least one first vector comprises a plurality of first vectors,
The first vectors are each multiplied and summed in parallel with the second vector, respectively.
7. A data processing array for neural network computation includes at least one input module and at least one processing module, wherein,
The input module is configured to acquire first data elements from the first vector one by one in each operation period to input the first vector, and
The processing modules are each configured to perform a multiplication operation on the acquired first data element in parallel with a plurality of second data elements in the second vector in response to the acquired first data element from the first vector being non-0 data, or skip a multiplication operation on the acquired first data element in response to the acquired first data element from the first vector being 0, to perform a multiplication-addition operation on at least one first vector and a second vector.
8. The processing array of claim 7, wherein the input module comprises a buffer module configured to buffer the first vector whole at a single time.
9. The processing array of claim 8, wherein the processing module comprises a plurality of multiplication processing subunits, each multiplication processing subunit configured to multiply-add a corresponding one of the second data elements with the acquired first data element.
10. The processing array of claim 9, wherein the at least one processing module comprises a plurality of processing modules,
At least one input module comprises a plurality of input modules respectively corresponding to the plurality of processing modules,
The plurality of input modules are configured to input a plurality of first vectors respectively,
The plurality of processing modules are configured to multiply-add the plurality of first vectors in parallel with the second vector, respectively.
11. A data processing apparatus for neural network computation, comprising a processing array as claimed in any one of claims 7 to 10.
12. A data processing apparatus for neural network computation, comprising:
a processor, and
a memory having one or more computer program modules stored thereon;
wherein the one or more computer program modules are configured to, when executed by the processor, perform the data processing method of any of claims 1-6.
13. A non-transitory storage medium storing non-transitory computer readable instructions, wherein the computer readable instructions, when executed by a computer, perform the data processing method of any of claims 1-6.
14. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the data processing method of any of claims 1-6.
CN202410089553.2A 2024-01-22 2024-01-22 Data processing method, processing array and processing device Pending CN117908831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410089553.2A CN117908831A (en) 2024-01-22 2024-01-22 Data processing method, processing array and processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410089553.2A CN117908831A (en) 2024-01-22 2024-01-22 Data processing method, processing array and processing device

Publications (1)

Publication Number Publication Date
CN117908831A true CN117908831A (en) 2024-04-19

Family

ID=90689286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410089553.2A Pending CN117908831A (en) 2024-01-22 2024-01-22 Data processing method, processing array and processing device

Country Status (1)

Country Link
CN (1) CN117908831A (en)

Similar Documents

Publication Publication Date Title
US11698773B2 (en) Accelerated mathematical engine
US12014272B2 (en) Vector computation unit in a neural network processor
US20210224654A1 (en) Batch Processing In A Neural Network Processor
US11656845B2 (en) Dot product calculators and methods of operating the same
KR20220092642A (en) Prefetching weights for use in a neural network processor
US10977002B2 (en) System and method for supporting alternate number format for efficient multiplication
US11915118B2 (en) Method and apparatus for processing computation of zero value in processing of layers in neural network
US10706353B2 (en) Integrated circuit
CN110909527B (en) Text processing model running method and device, electronic equipment and storage medium
CN109711540B (en) Computing device and board card
CN114330689A (en) Data processing method and device, electronic equipment and storage medium
US20200293863A1 (en) System and method for efficient utilization of multipliers in neural-network computations
WO2021081854A1 (en) Convolution operation circuit and convolution operation method
CN117908831A (en) Data processing method, processing array and processing device
CN108229668B (en) Operation implementation method and device based on deep learning and electronic equipment
CN109375952B (en) Method and apparatus for storing data
KR20220083820A (en) 3D Convolution in Neural Network Processors
US11568021B2 (en) Vector-vector multiplication techniques for processing systems
WO2022178791A1 (en) Zero skipping sparsity techniques for reducing data movement
CN115391727B (en) Calculation method, device and equipment of neural network model and storage medium
US20230289291A1 (en) Cache prefetch for neural processor circuit
CN117908830A (en) Data processing device, data processing method, data processing program, computer readable storage medium, and computer data signal
CN116306781A (en) Data processing method and device based on neural network model and electronic equipment
JP2023024960A (en) Optimization of memory usage for efficiently executing neural network
CN117436528A (en) AI network model reasoning unit based on RISC-V and network model reasoning pipeline technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination