CN111126582B - Data processing method and related product - Google Patents

Data processing method and related product

Info

Publication number
CN111126582B
CN111126582B (application CN201911323837.9A)
Authority
CN
China
Prior art keywords
data
leaf structure
neuron
splitting
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911323837.9A
Other languages
Chinese (zh)
Other versions
CN111126582A (en
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201911323837.9A priority Critical patent/CN111126582B/en
Publication of CN111126582A publication Critical patent/CN111126582A/en
Application granted granted Critical
Publication of CN111126582B publication Critical patent/CN111126582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00: Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
    • G06F1/26: Power supply means, e.g. regulation thereof
    • G06F1/32: Means for saving power
    • G06F1/3203: Power management, i.e. event-based initiation of a power-saving mode
    • G06F1/3234: Power saving characterised by the action undertaken
    • G06F1/3243: Power saving in microcontroller unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present application relates to a data processing method and related products. The method comprises the following steps: splitting the first operation data by the central leaf structure to obtain a plurality of groups of first operation sub-data; the central leaf structure distributing the plurality of groups of first operation sub-data to corresponding node leaf structures in each operation period; the central leaf structure sending second operation data to each node leaf structure; and each node leaf structure multiplexing the second operation data and performing convolution operation on the received first operation sub-data to obtain multiple partial sums; wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data. The method can reduce the energy consumption overhead of the processor.

Description

Data processing method and related product
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a data processing method and related products.
Background
With the development of neural network technology, deep learning frameworks such as Caffe have been widely used.
When a Caffe-based neural network model is run on a deep learning processor, data such as images, voice, and text can be processed to obtain the required recognition results, for example, recognizing an image to obtain image features, or recognizing voice to obtain control instructions. With the rapid development of neural networks, the data volume involved in data processing grows ever larger, and the resulting large number of memory accesses imposes a high energy cost on the processor during data processing.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data processing method, apparatus, processing circuit, processor, chip, board, and electronic device that can reduce the power consumption overhead.
In a first aspect, an embodiment of the present application provides a data processing method, where the method includes:
splitting the first operation data by the central leaf structure to obtain a plurality of groups of first operation sub-data;
the central leaf structure distributes the plurality of groups of first operation sub-data to corresponding node leaf structures in each operation period;
the central leaf structure sends second operation data to each node leaf structure;
each node leaf structure multiplexes the second operation data and performs convolution operation on the received first operation sub-data to obtain multiple partial sums;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
In one embodiment, the central leaf structure distributing the plurality of groups of first operation sub-data to corresponding node leaf structures includes:
the central leaf structure distributes the plurality of groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through an interconnection bus;
The central leaf structure sending the second operational data to each of the node leaf structures, comprising:
and the central leaf structure transmits the second operation data to each node leaf structure in a broadcast mode through an interconnection bus.
In one embodiment, the first operation data is neuron data, and the second operation data is weight data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the neuron data to obtain a plurality of groups of neuron sub-data;
and each node leaf structure multiplexing the second operation data and performing convolution operation on the received first operation sub-data to obtain a plurality of partial sums includes:
multiplexing the weight data in a plurality of operation periods, and performing convolution operation on the received different neuron sub-data to obtain the partial-sum data of each operation period.
In one embodiment, the splitting the neuron data by the central leaf structure to obtain multiple sets of neuron sub-data includes:
the central leaf structure splits the neuron data according to a preset neuron data splitting mode to obtain multiple groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting.
In one embodiment, the first operation data is weight data, and the second operation data is neuron data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the weight data to obtain a plurality of groups of weight sub-data;
and each node leaf structure multiplexing the second operation data and performing convolution operation on the received first operation sub-data to obtain a plurality of partial sums includes:
multiplexing the neuron data in a plurality of operation periods, and performing convolution operation on the received different weight sub-data and the neuron data to obtain a plurality of partial sums.
In one embodiment, the neuron data is data representing different pixels of the same output feature map or data representing pixels of different output feature maps.
In a second aspect, embodiments of the present application provide a data processing apparatus, the apparatus including:
the splitting module is used for splitting the first operation data through the central leaf structure to obtain a plurality of groups of first operation sub data;
the distribution module is used for distributing the plurality of groups of first operation sub-data to the corresponding node leaf structures in each operation period through the central leaf structure; each node leaf structure corresponds to one group of the first operation sub-data;
the sending module is used for sending second operation data to each node leaf structure through the central leaf structure;
the processing module is used for performing convolution operation on the received first operation sub-data and second operation data through each node leaf structure to obtain a plurality of partial sums;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
In a third aspect, embodiments of the present application provide a processing circuit comprising a central leaf structure and a plurality of node leaf structures; the central leaf structure and each node leaf structure are connected through an interconnection bus;
The central leaf structure is used for splitting the first operation data to obtain a plurality of groups of first operation sub data;
the central leaf structure is used for distributing the plurality of groups of first operation sub-data to the corresponding node leaf structure in each operation period; each node leaf structure corresponds to one group of the first operation sub-data;
the central leaf structure is used for sending second operation data to each node leaf structure;
each node leaf structure is used for performing convolution operation on the received first operation sub-data and second operation data to obtain a plurality of partial sums;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
In a fourth aspect, embodiments of the present application provide a processor, including the processing circuit of the foregoing embodiments.
In a fifth aspect, embodiments of the present application provide a neural network chip that includes the processor of the embodiments.
In a sixth aspect, embodiments of the present application provide a board card, the board card including: a storage device, a receiving device, a control device, and the neural network chip of the above-described embodiments;
The neural network chip is respectively connected with the storage device, the control device and the receiving device;
the storage device is used for storing data;
the receiving device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip.
In one embodiment, the storage device includes multiple groups of storage units, each group of storage units being connected with the chip through a bus, wherein each storage unit is a DDR SDRAM;
the chip includes a DDR controller for controlling data transmission to and data storage in each storage unit;
the receiving device is a standard PCIE interface.
In a seventh aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the chip in the foregoing embodiment.
According to the data processing method, apparatus, processing circuit, processor, chip, board card, and electronic device described above, the first operation data is split through the central leaf structure to obtain multiple groups of first operation sub-data, and the multiple groups of first operation sub-data are distributed to the corresponding node leaf structures in each operation period; meanwhile, the central leaf structure sends the second operation data to each node leaf structure, so that each node leaf structure multiplexes the second operation data and performs convolution operation on the received first operation sub-data to obtain multiple partial sums. Because each node leaf structure multiplexes the second operation data over multiple operation periods while performing convolution operation on the received first operation sub-data, the volume of data accesses is greatly reduced during data processing, and the energy consumption overhead of the processor is therefore greatly reduced. Because the first operation data or the second operation data includes at least one of voice data, text data, and image data, the method can greatly reduce the access volume of voice data, text data, and image data during voice, text, or image processing, thereby improving data processing efficiency and greatly reducing the energy consumption overhead of the processor.
Drawings
FIG. 1 is an internal block diagram of a computer device in one embodiment;
FIG. 2 is a flow chart of a data processing method according to an embodiment;
FIG. 2a is a computational schematic of convolutional neuron multiplexing provided by one embodiment;
FIG. 2b is a schematic diagram illustrating the calculation of weight multiplexing according to another embodiment;
FIG. 3 is a schematic diagram of a data processing apparatus according to an embodiment;
FIG. 4 is a schematic diagram of a processing circuit according to one embodiment;
FIG. 5 is a schematic structural diagram of a board card according to an embodiment.
Detailed Description
The following describes the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings; it is evident that the embodiments described are some, but not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the description and figures of this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. The terms "comprises" and "comprising" when used in the specification and claims of the present disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the present disclosure and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
The data processing method provided by the embodiments of the present application can be applied to the computer device shown in FIG. 1, and the computer device may include a processor. Optionally, the processor may be an artificial intelligence processor or a deep learning processor; the embodiments of the present application do not limit the type of the processor. It should be noted that the execution subject of the data processing method according to the embodiments of the present application may be a board carrying the processor, or an electronic device including such a board. In the following method embodiments, the execution subject is described as a component of the processor.
It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
The following describes the technical solution of the present application and how the technical solution of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
FIG. 2 is a flow chart of a method for processing data according to one embodiment, which can be applied to a computing platform including a processor. The method comprises the following steps:
s10, splitting the first operation data by the central leaf structure to obtain a plurality of groups of first operation sub-data.
It should be noted that the method can be applied to a processor having a master-slave distributed leaf structure, comprising a central leaf structure (central tile) and a plurality of node leaf structures (leaf tiles), where the core of each leaf structure consists of a neural functional unit (Neural Functional Unit, NFU) and a weight buffer unit (SB) composed of four eDRAMs. Specifically, the central leaf structure splits the first operation data to be processed into multiple groups of first operation sub-data. Optionally, the first operation data may be split along different directions of the convolution kernel, or along the direction of the feature map, or by the size of the feature map; the specific splitting manner may be determined according to the characteristics of the first operation data or the splitting manner of the previous network layer, which is not limited in this embodiment.
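For illustration only, the splitting of step S10 can be sketched as follows. This is a minimal sketch under assumed tensor shapes; the function name and layout are hypothetical, not the patented implementation:

```python
import numpy as np

def split_first_operation_data(data, num_node_tiles, axis=0):
    # S10 sketch: split the first operation data into one group of
    # sub-data per node leaf structure along the chosen direction.
    return np.array_split(data, num_node_tiles, axis=axis)

# Example: 64 input feature maps of 5x5 pixels split across 16 node tiles.
neuron_data = np.random.rand(64, 5, 5).astype(np.float32)
groups = split_first_operation_data(neuron_data, num_node_tiles=16)
assert len(groups) == 16 and groups[0].shape == (4, 5, 5)
```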
And S20, distributing a plurality of groups of first operator data to the corresponding node leaf structure by the central leaf structure in each operation period.
Specifically, in each operation period, the central leaf structure distributes the multiple groups of split first operation sub-data to the corresponding node leaf structures through the interconnection bus, where each group of first operation sub-data is different data.
S30, the central leaf structure sends second operation data to each node leaf structure.
Specifically, the central leaf structure may further send the second operation data to each node leaf structure through the interconnection bus. Alternatively, the node leaf structure may cache the second operation data through its weight cache unit.
S40, each node leaf structure multiplexes the second operation data and performs convolution operation on the received first operation sub-data to obtain multiple partial sums; wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
Specifically, each node leaf structure receives its corresponding first operation sub-data and the second operation data, multiplexes the second operation data, and performs convolution operation on the first operation sub-data with the second operation data using its NFU, thereby obtaining partial-sum data. It should be noted that one of the first operation data and the second operation data may be neuron data, and the neuron data may include at least one of voice data, text data, and image data. The above data processing method may perform feature vector extraction on image data so as to recognize an image, for example, to identify an object in the image or the type of the image; it may also recognize or convert voice data, or recognize or convert text data so as to identify semantics, and so on, which is not limited in the embodiments of the present application.
In this embodiment, the first operation data is split through the central leaf structure to obtain multiple groups of first operation sub-data, the multiple groups of first operation sub-data are distributed to the corresponding node leaf structures in each operation period, and meanwhile the central leaf structure sends the second operation data to each node leaf structure, so that each node leaf structure multiplexes the second operation data and performs convolution operation on the received first operation sub-data to obtain multiple partial sums. Because each node leaf structure multiplexes the second operation data over multiple operation periods while performing convolution operation on the received first operation sub-data, the volume of data accesses is greatly reduced during data processing, and the energy consumption overhead of the processor is therefore greatly reduced. Because the first operation data or the second operation data includes at least one of voice data, text data, and image data, the method can greatly reduce the access volume of voice data, text data, and image data during voice, text, or image processing, thereby improving data processing efficiency and greatly reducing the energy consumption overhead of the processor.
Optionally, on the basis of the above embodiment, one possible implementation of step S20 may include: the central leaf structure distributes the multiple groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through the interconnection bus. Specifically, since the first operation sub-data are sub-data obtained by splitting, each group of sub-data is different and needs to be processed by a different node leaf structure; the central leaf structure therefore distributes the multiple groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through the interconnection bus, that is, each group of first operation sub-data is sent point-to-point to its corresponding node leaf structure. In this embodiment, the central leaf structure distributes the multiple groups of first operation sub-data in a unicast manner, so that each group of first operation sub-data is sent point-to-point to the corresponding node leaf structure, which facilitates data processing.
Optionally, on the basis of the above embodiment, one possible implementation of step S30 may include: the central leaf structure transmits the second operation data to each node leaf structure in a broadcast mode through the interconnection bus, so that the second operation data received by every node leaf structure is the same data. Both transfer modes are sketched below.
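The following sketch illustrates the unicast/broadcast distinction of steps S20 and S30; the class and method names (receive_unicast, receive_broadcast) are hypothetical conveniences, not an API described by the patent:

```python
class CentralTile:
    # Hypothetical sketch of S20/S30 over the interconnection bus:
    # unicast one sub-data group per node tile, broadcast shared data.
    def __init__(self, node_tiles):
        self.node_tiles = node_tiles

    def distribute_unicast(self, sub_data_groups):
        # Point-to-point: group i goes only to node tile i.
        for tile, group in zip(self.node_tiles, sub_data_groups):
            tile.receive_unicast(group)

    def send_broadcast(self, shared_data):
        # Every node tile receives the same second operation data.
        for tile in self.node_tiles:
            tile.receive_broadcast(shared_data)
```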
Optionally, on the basis of the foregoing embodiment, the first operation data is neuron data and the second operation data is weight data. One possible implementation of step S10 may include: splitting the neuron data by the central leaf structure to obtain multiple groups of neuron sub-data. One possible implementation of step S40 may include: each node leaf structure multiplexing the weight data over multiple operation periods and performing convolution operation on the received different neuron sub-data to obtain the partial-sum data of each operation period. In this embodiment, the multiple groups of neuron sub-data obtained by splitting the neuron data through the central leaf structure are distributed to the node leaf structures, and each node leaf structure performs convolution operation on the different received neuron sub-data while multiplexing the weight data over multiple operation periods, thereby obtaining partial-sum data.
Optionally, a possible implementation of the step of splitting the neuron data by the central leaf structure to obtain multiple sets of neuron sub-data may include: the central leaf structure splits the neuron data according to a preset neuron data splitting mode to obtain multiple groups of neuron sub-data, wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting. Specifically, the manner in which the central leaf structure splits the neuron data may be determined according to the network layer at which the data processing is located, or according to the previous network layer. For example, for a convolution layer, splitting along the convolution kernel K_x direction, along the convolution kernel K_y direction, along the input feature map direction, along the output feature map direction, by the input feature map size, or by the output feature map size may be employed; for a fully connected layer, one of input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting may be employed; for a pooling layer, one of feature map direction splitting and feature map size splitting may be employed. Optionally, a network layer may also select a splitting manner by combining the splitting manner of the preceding layer with that of the current layer; when splitting along the feature map direction, the alignment of the feature maps needs to be considered. An illustrative sketch follows.
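The sketch below makes the feature-map splitting modes concrete under an assumed (input maps, Y, X) neuron layout; the names are ours, and kernel-direction splits (K_x, K_y) would act on the weight tensor instead:

```python
import numpy as np

def split_neurons(neurons, mode, parts):
    # Neuron tensor assumed laid out as (input_maps, Y, X).
    axis = {
        "input_map_direction": 0,  # across input feature maps
        "input_map_size_y": 1,     # split each feature map along Y
        "input_map_size_x": 2,     # split each feature map along X
    }[mode]
    return np.array_split(neurons, parts, axis=axis)

# A 32-map, 16x16 input split 4 ways along the input-feature-map direction.
chunks = split_neurons(np.zeros((32, 16, 16)), "input_map_direction", 4)
assert len(chunks) == 4 and chunks[0].shape == (8, 16, 16)
```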
Optionally, the first operation data is weight data and the second operation data is neuron data. One possible implementation of step S10 may include: the central leaf structure splitting the weight data to obtain multiple groups of weight sub-data. One possible implementation of step S40 may include: multiplexing the neuron data over multiple operation periods and performing convolution operation on the received different weight sub-data and the neuron data to obtain multiple partial sums. In this embodiment, the central leaf structure distributes the multiple groups of weight sub-data obtained by splitting the weight data to the node leaf structures, and each node leaf structure performs convolution operation on the different received weight sub-data while multiplexing the neuron data over multiple operation periods, thereby obtaining partial-sum data.
Optionally, the neuron data is data representing different pixels of the same output feature map, or data representing pixels of different output feature maps.
The following describes the data multiplexing of each network layer in detail, so as to derive the power savings of the different data multiplexing methods and the corresponding hardware overhead.
For a convolution layer in a deep learning network, the three-dimensional convolution input-output mapping relation is

out(f_o, x, y) = Σ_{f_i} Σ_{k_y=0}^{K_y−1} Σ_{k_x=0}^{K_x−1} w(f_o, f_i, k_y, k_x) × in(f_i, y×S_y + k_y, x×S_x + k_x).

According to this formula, when the convolution window slides along the X and Y directions of the input feature map with step sizes S_x and S_y, a single input neuron stays in the convolution window ⌈K_x/S_x⌉ times along X and ⌈K_y/S_y⌉ times along Y. At the same time, the output neurons at the same pixel position of the f_o output feature maps all use the same input, i.e. one input neuron can be used to calculate the same pixel point of f_o output feature maps, so each input neuron datum can be multiplexed at most ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ × f_o times. On the other hand, the convolution layer obtains the output feature map by sliding over the input feature map with step sizes S_x and S_y, so the maximum multiplexing count of each weight datum is times_synapse = X_o × Y_o, where X_o and Y_o are the dimensions of the output feature map in the X and Y directions, and K_x and K_y are the dimensions of the convolution kernel in the x and y directions.
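These two bounds can be checked numerically; the following sketch is illustrative only, with helper names of our choosing:

```python
from math import ceil

def max_neuron_reuse(K_x, K_y, S_x, S_y, f_o):
    # Each input neuron stays in ceil(K_x/S_x) * ceil(K_y/S_y) convolution
    # windows and feeds the same pixel of f_o output feature maps.
    return ceil(K_x / S_x) * ceil(K_y / S_y) * f_o

def max_weight_reuse(X_o, Y_o):
    # times_synapse = X_o * Y_o: a weight is used once per output pixel.
    return X_o * Y_o

# Example: 3x3 kernel, stride 2x2, 256 output maps of size 2x2
# (the 5x5-input configuration of FIGS. 2a and 2b).
print(max_neuron_reuse(3, 3, 2, 2, 256))  # 2 * 2 * 256 = 1024
print(max_weight_reuse(2, 2))             # 4
```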
When both the neuron data and the weight data of the variable-data-bit-width deep learning processor (Multi-Width Deep Learning Accelerator, MW-DLA) are 16 bits, the central leaf structure reads 32 bytes of neuron data from the RAM every operation period and sends them to the 16 node leaf structures. Meanwhile, each node leaf structure reads 512 bytes of weight data from its SB every operation period and sends them to its NFU for calculation. In the node leaf structure's NFU calculation, partial sums of 16 output feature map neurons are computed using the 16 input neuron data and a 16 × 16 weight matrix. In this process, one neuron read once by the central leaf structure is used to calculate the partial sums of 16 × 16 = 256 output neurons, while one weight read once by a node leaf structure is used to calculate the partial sum of only its corresponding output neuron; the memory access utilization of neuron data multiplexing is thus 256, while that of weight data multiplexing is 1. In MW-DLA, the read data width of each node leaf structure is 4096 bits and that of the central leaf structure is 256 bits. If the power costs of reading 1 bit of data in the central leaf structure and in a node leaf structure are α and β respectively, the power cost of MW-DLA for reading the neuron data and the weight data is Power_mem = 256 × α + 16 × 4096 × β = 256 × α + 65536 × β. If the weight multiplexing utilization is improved M times and the neuron multiplexing utilization is improved N times, the total memory access power of MW-DLA becomes Power = (256/N) × α + (65536/M) × β, where M ∈ [1, X_o × Y_o] and N ∈ [1, ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ × f_o].
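Plugging these widths into the power model gives, for illustration, the 1/514 and 256/514 figures quoted below (a sketch assuming equal per-bit read costs; the function name is ours):

```python
def mwdla_read_power(alpha, beta, M=1.0, N=1.0):
    # Power_mem with weight reuse improved M times and neuron reuse
    # improved N times; M = N = 1 gives 256*alpha + 65536*beta.
    return 256 * alpha / N + 16 * 4096 * beta / M

alpha = beta = 1.0                        # assume equal per-bit cost
base = mwdla_read_power(alpha, beta)      # 65792
print(1 - mwdla_read_power(alpha, beta, N=2) / base)  # 128/65792 = 1/514
print(1 - mwdla_read_power(alpha, beta, M=2) / base)  # 32768/65792 = 256/514
```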
In a deep learning network there are two neuron multiplexing methods: one is to multiplex one neuron to calculate pixel points of different output feature maps; the other is to multiplex one neuron to calculate different pixel points of the same output feature map. In the MW-DLA implementation, the 16 node leaf structures use the same input data to calculate the pixels of 256 output feature maps. FIG. 2a is a schematic diagram of the computation of convolutional neuron multiplexing provided by one embodiment. In FIG. 2a, taking the calculation sequence of one feature map as an example, the input feature map size is 5 × 5, the convolution kernel size is 3 × 3, and the sliding step is 2 × 2. When neurons are multiplexed, after receiving an input neuron vector from the central leaf structure, the node leaf structure successively uses the weights of ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ output pixel points to calculate different partial sums. For neuron multiplexing, the node leaf structure therefore needs to buffer the partial sums in ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ groups of partial-sum registers.
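The neuron-multiplexing schedule of FIG. 2a can be sketched as follows (5 × 5 input, 3 × 3 kernel, stride 2 × 2); the scatter-style loop nest is our illustration of the reuse pattern, not the NFU's actual datapath:

```python
import numpy as np
from math import ceil

K, S = 3, 2                       # kernel size and stride, as in FIG. 2a
x = np.random.rand(5, 5)          # one 5x5 input feature map
w = np.random.rand(K, K)          # one 3x3 convolution kernel
Yo = Xo = (5 - K) // S + 1        # 2x2 output feature map

n_psum_groups = ceil(K / S) * ceil(K / S)  # 4 partial-sum register groups
psum = np.zeros((Yo, Xo))

for iy in range(5):
    for ix in range(5):
        neuron = x[iy, ix]        # one neuron, read once from the bus...
        for oy in range(Yo):      # ...reused against the weights of every
            for ox in range(Xo):  # output pixel whose window covers it
                ky, kx = iy - oy * S, ix - ox * S
                if 0 <= ky < K and 0 <= kx < K:
                    psum[oy, ox] += neuron * w[ky, kx]
# psum now equals the stride-2 valid convolution of x with w.
```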
When the convolution layer multiplexes the weights, the partial sums of a batch of output neurons are obtained by multiplying the same weight with the input neurons at the corresponding positions of different output neurons. Optionally, FIG. 2b is a schematic diagram of weight multiplexing provided by an embodiment. In FIG. 2b, the input feature map size is 5 × 5, the convolution kernel size is 3 × 3, and the sliding step is 2 × 2; different pixels of different output feature maps are calculated with the same weight data, and only the calculation sequence of one feature map is illustrated. Assume the multiplexing count of a weight is R_x × R_y. The node leaf structure reads a row of weights from the weight buffer unit and then feeds it into the NFU calculation for R_x × R_y consecutive operation periods. Meanwhile, the node leaf structure switches among the input neurons of R_x × R_y output pixel points on the corresponding output feature map to calculate different partial sums. Similarly, when weight data is multiplexed, the node leaf structure buffers the partial sums in R_x × R_y groups of partial-sum registers.
It should be noted that neuron multiplexing requires switching to different weights in each operation period, while weight data multiplexing requires switching to different neurons in each operation period. If neuron multiplexing and weight data multiplexing are performed simultaneously, the node leaf structure must buffer the complete input feature map and convolution kernel (kernel). If the power costs of reading 1 bit of data in the central leaf structure and in a node leaf structure are the same, it can be determined that doubling the neuron multiplexing utilization reduces the RAM access power by only 1/514, whereas doubling the weight multiplexing utilization reduces it by 256/514; weight multiplexing therefore saves far more power. Meanwhile, according to the formula above, the speedup from the neuron multiplexing rate is at most ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ × f_o, while the speedup from the weight multiplexing rate is at most X_o × Y_o; in practice the upper bound of the weight multiplexing rate is much higher than that of the neuron multiplexing rate. From the perspective of calculation control, in the neuron multiplexing method an input pixel near the border of the input feature map contributes to fewer output neurons than an input pixel at the center, and when K_x and K_y are not integer multiples of S_x and S_y, even input points at the center of the input feature map contribute to different numbers of output pixel points depending on their positions. In the weight multiplexing method, every weight is multiplexed the same number of times and contributes to the same number of output neurons, so the control logic is simpler than for neuron multiplexing. A method adopting weight data multiplexing can divide the output feature map data into T_w × T_h tiles and carry out the convolution layer calculation tile by tile: in the calculation process, the node leaf structure reads a row of weights each time, multiplexes it T_w × T_h times by multiplying it with input pixels at different positions, and calculates the output partial sums of T_w × T_h pixel points.
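The weight-multiplexing schedule can likewise be sketched, with one weight read reused across a T_w × T_h tile of output pixels; again an illustrative loop nest under assumed sizes, not the actual hardware schedule:

```python
import numpy as np

K, S = 3, 2
x = np.random.rand(5, 5)
w = np.random.rand(K, K)
Th = Tw = (5 - K) // S + 1        # one 2x2 output tile (T_w x T_h)

psum = np.zeros((Th, Tw))
for ky in range(K):
    for kx in range(K):
        weight = w[ky, kx]        # read one weight, then hold it for
        for oy in range(Th):      # Tw*Th consecutive periods against the
            for ox in range(Tw):  # input pixels of Tw*Th output points
                psum[oy, ox] += weight * x[oy * S + ky, ox * S + kx]
# Every weight is reused exactly Tw*Th times with uniform control logic.
```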
For the fully connected layer in a deep learning network, the input-output mapping relation is out(j) = Σ_{i=0}^{n_i−1} w(j, i) × in(i), for j = 0, …, n_o − 1. Each input neuron in the fully connected layer is used to calculate all n_o output neurons, i.e. the multiplexing count of a neuron is at most n_o, and the multiplexing utilization of an input neuron is n_o.
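A minimal sketch of this fully connected reuse, for illustration only (sizes chosen arbitrarily):

```python
import numpy as np

n_i, n_o = 128, 64
x = np.random.rand(n_i)           # input neurons
W = np.random.rand(n_o, n_i)      # fully connected weights

out = np.zeros(n_o)
for i in range(n_i):
    neuron = x[i]                 # each input neuron is read once...
    out += W[:, i] * neuron       # ...and reused for all n_o outputs
assert np.allclose(out, W @ x)
```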
It should be understood that, although the steps in the flowchart of FIG. 2 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same moment and may be performed at different moments; these sub-steps or stages are not necessarily executed in sequence either, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a data processing apparatus for use in a computing platform including a processor, the apparatus comprising:
the splitting module 100 is configured to split the first operation data through the central leaf structure to obtain multiple groups of first operation sub-data;
A distributing module 200, configured to distribute, through the central leaf structure, the plurality of groups of first operation sub-data to corresponding node leaf structures in each operation period; each node leaf structure corresponds to one group of the first operation sub-data;
a transmitting module 300, configured to transmit second operation data to each of the node leaf structures through the central leaf structure;
a processing module 400, configured to perform convolution operation on the received first operation sub-data and second operation data through each node leaf structure, so as to obtain a plurality of partial sums;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
In one embodiment, the distributing module 200 is specifically configured to distribute, through the central leaf structure, the plurality of groups of first operation sub-data to the corresponding node leaf structures through the interconnection bus in a unicast manner; the sending module 300 is specifically configured to send, through the central leaf structure, the second operation data to each node leaf structure through the interconnection bus in a broadcast manner.
In one embodiment, the first operational data is neuron data and the second operational data is weight data; the splitting module 100 is specifically configured to split the neuron data through the central leaf structure to obtain multiple sets of neuron sub-data; the processing module 400 is specifically configured to multiplex the weight data and perform convolution operation on the received different neuron sub-data over a plurality of operation periods through each node leaf structure, so as to obtain the partial-sum data of each operation period.
In one embodiment, the splitting module 100 is specifically configured to split the neuron data according to a preset neuron data splitting manner through the central leaf structure, so as to obtain multiple groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting.
In one embodiment, the first operational data is weight data and the second operational data is neuron data; the splitting module 100 is specifically configured to split the weight data through the central leaf structure to obtain multiple groups of weight sub-data; the processing module 400 is specifically configured to perform convolution operation on the received different weight sub-data and the neuron data by multiplexing the neuron data over a plurality of operation periods through each node leaf structure, so as to obtain a plurality of partial sums.
In one embodiment, the neuron data is data representing different pixels of the same output feature map or data representing pixels of different output feature maps.
In one embodiment, as shown in FIG. 4, there is also provided a processing circuit, the circuit comprising a central leaf structure and a plurality of node leaf structures; the central leaf structure 500 and each node leaf structure 600 are connected by an interconnection bus 700;
the central leaf structure 500 is used for splitting the first operation data to obtain a plurality of groups of first operation sub-data;
a central leaf structure 500 for distributing a plurality of sets of the first operation sub-data to corresponding node leaf structures in each operation period; each node leaf structure 600 corresponds to one group of the first operation sub-data;
a central leaf structure 500 for transmitting second operation data to each node leaf structure 600;
each node leaf structure 600 is configured to perform convolution operation on the received first operation sub-data and second operation data to obtain a plurality of partial sums;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data.
The number of node leaf structures shown in FIG. 4 is only an example.
In one embodiment, the central leaf structure 500 distributes the plurality of sets of the first operator data to the corresponding node leaf structures 600 in a unicast manner through the interconnection bus 700; the central leaf structure 500 transmits the second operation data to each node leaf structure 600 in a broadcast form through the interconnection bus 700.
In one embodiment, the first operational data is neuron data and the second operational data is weight data; the central leaf structure 500 splits the neuron data to obtain multiple sets of neuron sub-data; and each node leaf structure 600 multiplexes the weight data over a plurality of operation periods and performs convolution operation on the received different neuron sub-data to obtain the partial-sum data of each operation period.
In one embodiment, the central leaf structure 500 splits the neuron data according to a preset neuron data splitting manner to obtain multiple groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting.
In one embodiment, the first operational data is weight data and the second operational data is neuron data; the central leaf structure 500 splits the weight data to obtain a plurality of groups of weight sub-data; each node leaf structure 600 performs convolution operation on the received different weight sub-data and the neuron data by multiplexing the neuron data over a plurality of operation periods, so as to obtain a plurality of partial sums.
In one embodiment, the neuron data is data representing different pixels of the same output feature map or data representing pixels of different output feature maps.
The embodiment of the application also provides a processor, which comprises the processing circuit in the embodiment.
The embodiment of the application also provides a neural network chip, which comprises the processor in the embodiment.
FIG. 5 is a schematic structural diagram of a board card according to an embodiment. The board card may be applied to an electronic device and, in addition to the artificial intelligence processor 389 described above, may include other components, including but not limited to: a storage device 390, a receiving device 391, and a control device 392;
the storage device 390 is connected to the artificial intelligence processor via a bus for storing data. The memory device may include multiple sets of memory cells 393. Each group of storage units is connected with the artificial intelligence processor through a bus. It is understood that each set of memory cells may be DDR SDRAM (English: double Data Rate SDRAM, double Rate synchronous dynamic random Access memory). DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 sets of the memory cells. Each set of the memory cells may include a plurality of DDR4 particles (chips).
In one embodiment, each set of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And setting a controller for controlling DDR in the artificial intelligence processor, wherein the controller is used for controlling data transmission and data storage of each storage unit.
The receiving device is electrically connected with the artificial intelligence processor. The receiving device is used for realizing data transmission between the artificial intelligence processor and an external device, such as a server or a computer. For example, in one embodiment, the receiving device may be a standard PCIE interface, and the data to be processed is transferred from the server to the artificial intelligence processor through the standard PCIE interface to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the receiving device may be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the results of the computation by the artificial intelligence processor are transmitted by the receiving device back to the external device (e.g., the server).
The control device is electrically connected with the artificial intelligence processor. The control device is used for monitoring the state of the artificial intelligence processor. Specifically, the artificial intelligence processor and the control device can be electrically connected through an SPI interface. The control device may comprise a single chip microcomputer (Micro Controller Unit, MCU). The artificial intelligence processor can comprise a plurality of processing chips, a plurality of processing cores or a plurality of processing circuits, and can drive a plurality of loads. Thus, the artificial intelligence processor may be in different operating states such as multi-load and light load. The control device can realize the regulation and control of the working states of a plurality of processing chips, a plurality of processing circuits and/or a plurality of processing circuits in the artificial intelligent processor.
In one embodiment, an electronic device is provided that includes the processor, chip, or board described above.
The electronic device may be a data processor, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a program stored on a non-volatile readable storage medium of an electronic device which, when executed, may include the procedures of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the invention. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of data processing, the method comprising:
splitting the first operation data by the central leaf structure to obtain a plurality of groups of first operation sub-data; if the first operation data is the neuron data, the central leaf structure splits the neuron data according to a preset neuron data splitting mode to obtain a plurality of groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting;
the central leaf structure distributes the plurality of groups of first operation sub-data to corresponding node leaf structures in each operation period;
the central leaf structure sends second operation data to each node leaf structure;
each node leaf structure multiplexes the second operation data and performs convolution operation on the received first operation sub-data to obtain multiple partial sums; neuron data multiplexing requires switching to different weight data in each operation period, and weight data multiplexing requires switching to different neuron data in each operation period; the multiplexing count of the neuron data does not exceed a preset first threshold, and the multiplexing count of the weight data does not exceed a preset second threshold; the first threshold is ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ × f_o, wherein S_x and S_y are the step sizes, K_x and K_y are the dimensions of the convolution kernel in the x direction and the y direction respectively, and f_o is the number of the output feature maps; the second threshold is times_synapse = X_o × Y_o, wherein X_o and Y_o are the dimensions of the output feature map in the X direction and the Y direction respectively; the first operation data is the neuron data or the weight data, and the second operation data is the neuron data or the weight data;
Wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data;
the central leaf structure distributing the plurality of groups of first operation sub-data to corresponding node leaf structures comprises:
the central leaf structure distributes the plurality of groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through an interconnection bus;
the central leaf structure sending the second operational data to each of the node leaf structures, comprising:
the central leaf structure transmits the second operation data to each node leaf structure in a broadcast mode through an interconnection bus;
the first operation data are neuron data, and the second operation data are weight data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the neuron data to obtain a plurality of groups of neuron sub-data;
and each node leaf structure multiplexing the second operation data and performing convolution operation on the received first operation sub-data to obtain a plurality of partial sums comprises:
multiplexing the weight data in a plurality of operation periods, and performing convolution operation on the received different neuron sub-data to obtain the partial-sum data of each operation period.
2. The method of claim 1, wherein the first operational data is weight data and the second operational data is neuron data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the weight data to obtain a plurality of groups of weight sub-data;
and each node leaf structure multiplexing the second operation data and performing convolution operation on the received first operation sub-data to obtain a plurality of partial sums comprises:
multiplexing the neuron data in a plurality of operation periods, and performing convolution operation on the received different weight sub-data and the neuron data to obtain a plurality of partial sums.
3. The method of claim 2, wherein the neuron data is data representing different pixels of the same output feature map or data representing pixels of different output feature maps.
4. A data processing apparatus, the apparatus comprising:
the splitting module is used for splitting the first operation data through the central leaf structure to obtain a plurality of groups of first operation sub-data; if the first operation data is the neuron data, the central leaf structure splits the neuron data according to a preset neuron data splitting mode to obtain a plurality of groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of splitting along the convolution kernel K_x direction, splitting along the convolution kernel K_y direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting;
the distribution module is used for distributing the plurality of groups of first operation sub-data to the corresponding node leaf structures in each operation period through the central leaf structure; each node leaf structure corresponds to one group of the first operation sub-data;
the sending module is used for sending second operation data to each node leaf structure through the central leaf structure;
the processing module is used for performing convolution operation on the received first operation sub-data and second operation data through each node leaf structure to obtain a plurality of partial sums; neuron data multiplexing requires switching to different weight data in each operation period, and weight data multiplexing requires switching to different neuron data in each operation period; the multiplexing count of the neuron data does not exceed a preset first threshold, and the multiplexing count of the weight data does not exceed a preset second threshold; the first threshold is ⌈K_x/S_x⌉ × ⌈K_y/S_y⌉ × f_o, wherein S_x and S_y are the step sizes, K_x and K_y are the dimensions of the convolution kernel in the x direction and the y direction respectively, and f_o is the number of the output feature maps; the second threshold is times_synapse = X_o × Y_o, wherein X_o and Y_o are the dimensions of the output feature map in the X direction and the Y direction respectively; the first operation data is the neuron data or the weight data, and the second operation data is the neuron data or the weight data;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data; the central leaf structure distributing the plurality of groups of first operation sub-data to corresponding node leaf structures comprises:
the central leaf structure distributes the plurality of groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through an interconnection bus;
the central leaf structure sending the second operation data to each of the node leaf structures, comprising:
the central leaf structure transmits the second operation data to each node leaf structure in a broadcast mode through an interconnection bus;
the first operation data are neuron data, and the second operation data are weight data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the neuron data to obtain a plurality of groups of neuron sub-data;
and each node leaf structure multiplexes the second operation data, and performs a convolution operation on the received first operation data to obtain a plurality of pieces of partial-sum data, including:
and multiplexing the weight data over a plurality of operation periods, and performing convolution operations on the received different neuron sub-data to obtain the partial-sum data of each operation period.
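Again for illustration only, a minimal sketch of the distribution path recited above, with unicast of the split neuron sub-data, broadcast of the shared weight data over the interconnection bus, and weight multiplexing at each node leaf structure; the even row-wise split and every size below are assumptions made for the example:

```python
import numpy as np

NUM_NODES = 4  # hypothetical number of node leaf structures

neuron_data = np.random.rand(16, 10)  # illustrative block of input neurons
weight_data = np.random.rand(3, 3)    # broadcast to every node leaf structure

# Central leaf structure: split the neuron data into one group per node
# (an even split along the first axis stands in for one of the claimed
# splitting modes).
neuron_sub_data = np.array_split(neuron_data, NUM_NODES, axis=0)

def node_leaf(neuron_group, weights):
    """Weight multiplexing: the node reuses the broadcast weights while
    sliding over its own unicast neuron sub-data, emitting partial sums."""
    kx, ky = weights.shape
    out = np.empty((neuron_group.shape[0] - kx + 1,
                    neuron_group.shape[1] - ky + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(neuron_group[i:i + kx, j:j + ky] * weights)
    return out

# Unicast: node i receives only neuron_sub_data[i]; broadcast: every node
# receives the same weight_data.
partial_sums = [node_leaf(group, weight_data) for group in neuron_sub_data]
```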
5. A processing circuit comprising a central leaf structure and a plurality of node leaf structures; the central leaf structure and each node leaf structure are connected through an interconnection bus;
the central leaf structure is used for splitting the first operation data to obtain a plurality of groups of first operation sub-data; if the first operation data are the neuron data, the central leaf structure splits the neuron data according to a preset neuron data splitting mode to obtain a plurality of groups of neuron sub-data; wherein the neuron data splitting mode comprises at least one of: splitting along the convolution kernel Kx direction, splitting along the convolution kernel Ky direction, input feature map direction splitting, output feature map direction splitting, input feature map size splitting, and output feature map size splitting;
the central leaf structure is used for distributing the plurality of groups of first operation sub-data to the corresponding node leaf structures in each operation period; each node leaf structure corresponds to one group of the first operation sub-data;
the central leaf structure is used for sending second operation data to each node leaf structure;
each node leaf structure is used for performing a convolution operation on the received first operation data and the received second operation data to obtain a plurality of pieces of partial-sum data; neuron data multiplexing requires switching to different weight data every operation period, and weight data multiplexing requires switching to different neuron data every operation period; the multiplexing frequency of the neuron data does not exceed a preset first threshold, and the multiplexing frequency of the weight data does not exceed a preset second threshold; the first threshold is reuse_neuron = (Kx / Sx) × (Ky / Sy) × fo, wherein Sx and Sy are the step lengths, Kx and Ky are the dimensions of the convolution kernel in the x direction and the y direction, and fo is the number of output feature maps; the second threshold is reuse_synapse = Xo × Yo, wherein Xo and Yo are the dimensions of the output feature map in the X direction and the Y direction, respectively; the first operation data are the neuron data or the weight data, and the second operation data are the neuron data or the weight data;
wherein the first operation data or the second operation data includes at least one of voice data, text data, and image data;
the central leaf structure distributes the plurality of groups of first operation sub-data to the corresponding node leaf structures, including:
the central leaf structure distributes the plurality of groups of first operation sub-data to the corresponding node leaf structures in a unicast mode through an interconnection bus;
the central leaf structure sending the second operation data to each of the node leaf structures, comprising:
the central leaf structure transmits the second operation data to each node leaf structure in a broadcast mode through an interconnection bus;
the first operation data are neuron data, and the second operation data are weight data;
the central leaf structure splits the first operation data to obtain a plurality of groups of first operation sub-data, and the method comprises the following steps:
the central leaf structure splits the neuron data to obtain a plurality of groups of neuron sub-data;
and each node leaf structure multiplexes the second operation data, and performs a convolution operation on the received first operation data to obtain a plurality of pieces of partial-sum data, including:
and multiplexing the weight data over a plurality of operation periods, and performing convolution operations on the received different neuron sub-data to obtain the partial-sum data of each operation period.
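For concreteness, the two multiplexing thresholds recited above can be evaluated numerically as in the short sketch below; the layer parameters are arbitrary examples, and the names reuse_neuron and reuse_synapse simply mirror the notation of the claim:

```python
# Hypothetical layer parameters (all values are illustrative).
Kx, Ky = 3, 3    # convolution kernel dimensions in the x and y directions
Sx, Sy = 1, 1    # step lengths (strides) in the x and y directions
fo = 16          # number of output feature maps
Xo, Yo = 28, 28  # output feature map dimensions in the X and Y directions

# First threshold: upper bound on the neuron-data multiplexing frequency.
# Integer division is used on the assumption that the kernel size is a
# multiple of the step length.
reuse_neuron = (Kx // Sx) * (Ky // Sy) * fo   # 3 * 3 * 16 = 144

# Second threshold: upper bound on the weight-data multiplexing frequency.
reuse_synapse = Xo * Yo                       # 28 * 28 = 784

print(reuse_neuron, reuse_synapse)  # 144 784
```

Intuitively, one input neuron can contribute to at most (Kx/Sx) × (Ky/Sy) sliding-window positions per output feature map across fo maps, while one weight is applied once per output pixel, i.e. Xo × Yo times.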
6. A processor comprising the processing circuit of claim 5.
7. A neural network chip, characterized in that it comprises the processor of claim 6.
8. A board card, characterized in that the board card comprises: a memory device, a receiving device and a control device, and the neural network chip as claimed in claim 7;
the neural network chip is connected with the memory device, the control device and the receiving device, respectively;
the memory device is used for storing data;
the receiving device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip.
9. The board card of claim 8, wherein the memory device comprises a plurality of groups of storage units, each group of storage units is connected with the chip through a bus, and the storage units are DDR SDRAM;
the chip comprises a DDR controller, and the DDR controller is used for controlling data transmission and data storage of each storage unit;
the receiving device is a standard PCIE interface.
10. An electronic device comprising the chip of claim 9.
CN201911323837.9A 2019-12-20 2019-12-20 Data processing method and related product Active CN111126582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911323837.9A CN111126582B (en) 2019-12-20 2019-12-20 Data processing method and related product

Publications (2)

Publication Number Publication Date
CN111126582A CN111126582A (en) 2020-05-08
CN111126582B true CN111126582B (en) 2024-04-05

Family

ID=70500512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911323837.9A Active CN111126582B (en) 2019-12-20 2019-12-20 Data processing method and related product

Country Status (1)

Country Link
CN (1) CN111126582B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018058426A1 (en) * 2016-09-29 2018-04-05 Tsinghua University Hardware neural network conversion method, computing device, compiling method and neural network software and hardware collaboration system
WO2019041251A1 (en) * 2017-08-31 2019-03-07 Beijing Zhongke Cambricon Technology Co., Ltd. Chip device and related product
CN109729734A (en) * 2017-08-31 2019-05-07 Beijing Zhongke Cambricon Technology Co., Ltd. Chip apparatus and related product
CN109902804A (en) * 2017-08-31 2019-06-18 Beijing Zhongke Cambricon Technology Co., Ltd. Convolution operation method and device
CN110245752A (en) * 2017-08-31 2019-09-17 Beijing Zhongke Cambricon Technology Co., Ltd. Fully connected operation method and device
CN110245751A (en) * 2017-08-31 2019-09-17 Beijing Zhongke Cambricon Technology Co., Ltd. GEMM operation method and device
CN107766292A (en) * 2017-10-30 2018-03-06 Institute of Computing Technology, Chinese Academy of Sciences Neural network processing method and processing system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Iterative Deep Neural Network Quantization With Lipschitz Constraint; Yuhui Xu et al.; IEEE Transactions on Multimedia; 2019-10-31; 1-15 *
Partition and Scheduling Algorithms for Neural Network Accelerators; Xiaobing Chen et al.; 13th International Symposium; 2019-08-09; 55-67 *
Research on Heterogeneous Multi-core Acceleration Methods for Convolutional Neural Networks on Reconfigurable Platforms; Gong Lei; China Doctoral Dissertations Full-text Database; 1-119 *

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109522052B (en) Computing device and board card
CN110059797B (en) Computing device and related product
CN109670581B (en) Computing device and board card
CN111047022A (en) Computing device and related product
CN111028136B (en) Method and equipment for processing two-dimensional complex matrix by artificial intelligence processor
CN110059809B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN109711540B (en) Computing device and board card
CN111160542A (en) Integrated circuit chip device and related product
CN111125628A (en) Method and apparatus for processing two-dimensional data matrix by artificial intelligence processor
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN111126582B (en) Data processing method and related product
CN111143766A (en) Method and apparatus for processing two-dimensional complex matrix by artificial intelligence processor
CN111124995A (en) Method and apparatus for processing a one-dimensional complex array by an artificial intelligence processor
CN109146069B (en) Arithmetic device, arithmetic method, and chip
CN111368967A (en) Neural network computing device and method
CN111382853B (en) Data processing device, method, chip and electronic equipment
CN111368987B (en) Neural network computing device and method
CN111738429B (en) Computing device and related product
CN111382856A (en) Data processing device, method, chip and electronic equipment
CN111047024A (en) Computing device and related product
CN111368990A (en) Neural network computing device and method
CN111738428B (en) Computing device, method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant