CN114201727A - Data processing method, processor, artificial intelligence chip and electronic equipment - Google Patents

Data processing method, processor, artificial intelligence chip and electronic equipment

Info

Publication number
CN114201727A
CN114201727A (application CN202111542656.2A)
Authority
CN
China
Prior art keywords: processed, data, sub, convolution operation, subdata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111542656.2A
Other languages
Chinese (zh)
Inventor
裴京
王松
谢天天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202111542656.2A
Publication of CN114201727A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/15 - Correlation function computation including computation of convolution operations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5022 - Mechanisms to release resources

Abstract

The present disclosure relates to a data processing method, a processor, an artificial intelligence chip and an electronic device. The processing method is applied to a computing core of the processor, the processor including a plurality of computing cores, and the processing method includes: generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core; and exchanging the sub-data to be processed with different computing cores and performing the following steps until each computing core has stored each piece of sub-data to be processed: releasing the storage space occupied by the sub-data to be processed before the exchange; and generating and activating a convolution operation result based on the exchanged sub-data to be processed. According to the data processing method provided by the embodiments of the present disclosure, the sub-data to be processed is not divided along the depth dimension, so the convolution operation result occupies less storage space; that is, the storage space occupied by the convolution operation result in a single computing core is reduced, and the data to be processed can still be processed normally when hardware resources are scarce.

Description

Data processing method, processor, artificial intelligence chip and electronic equipment
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data processing method, a processor, an artificial intelligence chip, and an electronic device.
Background
The brain-like computing chip adopts a decentralized many-core parallel processing architecture, and each computing core can operate independently and exchange data with other cores. In general, when the data to be processed is too large, a brain-like computing chip needs to divide the data to be processed into multiple pieces of sub-data to be processed, so how the sub-data to be processed are handled affects both the processing efficiency and the memory occupancy of the brain-like computing chip.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a data processing method applied to a computing core of a processor, where the processor includes a plurality of computing cores, the data processing method including: generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core; exchanging the sub-data to be processed with different computing cores, and performing the following steps until each computing core has stored each piece of sub-data to be processed: releasing the storage space occupied by the sub-data to be processed before the exchange; and generating and activating a convolution operation result based on the exchanged sub-data to be processed; wherein the sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as that of the data to be processed.
In one possible implementation, the processing method further includes: dividing the data to be processed into sub-data to be processed along at least one non-depth dimension; and storing each piece of sub-data to be processed into a different computing core.
In one possible implementation, the generating and activating a convolution operation result includes: generating a convolution operation result based on the weight data stored in each computing core and the sub-data to be processed; and activating the convolution operation result through a preset activation function.
In one possible implementation, the dividing the data to be processed into sub-data to be processed along at least one non-depth dimension includes: dividing the data to be processed along the dimension that is last in the storage order of the data to be processed and is not the depth dimension; wherein the storage order of the dimensions determines the order in which the data to be processed is stored across the different dimensions.
In one possible implementation, the exchanging the sub-data to be processed with different computing cores includes: determining, based on the allocation sequence number of each computing core and the number of exchanges, the number of the dynamic storage space to be occupied by each piece of sub-data to be processed that is to be exchanged, the allocation sequence number representing the order in which the sub-data to be processed are allocated; and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number.
In one possible implementation, the plurality of computing cores perform, in parallel and synchronously, at least one of the following steps: exchanging the sub-data to be processed with different computing cores, releasing the storage space occupied by the sub-data to be processed before the exchange, and generating and activating a convolution operation result.
According to a second aspect of the present disclosure, there is provided a processor including a plurality of computing cores, the processor being configured to perform: generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core; exchanging the sub-data to be processed with different computing cores, and performing the following steps until each computing core has stored each piece of sub-data to be processed: releasing the storage space occupied by the sub-data to be processed before the exchange; and generating and activating a convolution operation result based on the exchanged sub-data to be processed; wherein the sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as that of the data to be processed.
In one possible implementation, the processor is further configured to perform: dividing the data to be processed into sub-data to be processed along at least one non-depth dimension; and storing each piece of sub-data to be processed into a different computing core.
In one possible implementation, the generating and activating a convolution operation result includes: generating a convolution operation result based on the weight data stored in each computing core and the sub-data to be processed; and activating the convolution operation result through a preset activation function.
In one possible implementation, the dividing the data to be processed into sub-data to be processed along at least one non-depth dimension includes: dividing the data to be processed along the dimension that is last in the storage order of the data to be processed and is not the depth dimension; wherein the storage order of the dimensions determines the order in which the data to be processed is stored across the different dimensions.
In one possible implementation, the exchanging the sub-data to be processed with different computing cores includes: determining, based on the allocation sequence number of each computing core and the number of exchanges, the number of the dynamic storage space to be occupied by each piece of sub-data to be processed that is to be exchanged, the allocation sequence number representing the order in which the sub-data to be processed are allocated; and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number.
In one possible implementation, the plurality of computing cores perform, in parallel and synchronously, at least one of the following steps: exchanging the sub-data to be processed with different computing cores, releasing the storage space occupied by the sub-data to be processed before the exchange, and generating and activating a convolution operation result.
According to a third aspect of the present disclosure, there is provided an artificial intelligence chip comprising the processor of any one of the above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the artificial intelligence chip described above.
The embodiments of the present disclosure provide a data processing method applied to a computing core of a processor. The processing method generates and activates a convolution operation result based on the sub-data to be processed stored in the computing core, exchanges the sub-data to be processed with different computing cores, and releases the storage space occupied by the sub-data to be processed before the exchange; a convolution operation result is then generated and activated based on the exchanged sub-data to be processed. Because the sub-data to be processed is not divided along the depth dimension, the convolution operation result occupies less storage space; that is, the storage space occupied by the convolution operation result in a single computing core is reduced, and the data to be processed can still be processed normally when hardware resources are scarce.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a reference diagram illustrating a convolution operation process implemented in a step-by-step summation manner in the related art.
FIG. 2 is a schematic diagram illustrating a convolution operation performed by loading an initial membrane potential in the related art.
Fig. 3 is a reference diagram of an artificial intelligence chip according to an embodiment of the disclosure.
Fig. 4 is a flowchart of a data processing method according to an embodiment of the present disclosure.
Fig. 5 is a reference diagram of a data processing method according to an embodiment of the disclosure.
Fig. 6 is a flowchart of a data processing method according to an embodiment of the present disclosure.
Fig. 7 is a block diagram of an electronic device provided according to an embodiment of the present disclosure.
Fig. 8 is a block diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating a convolution operation process implemented in a step-by-step summation manner in the related art.
As shown in fig. 1, assuming that the total depth of the data to be processed is 128, the data to be processed may be divided into 4 pieces of sub-data along the depth direction, which are sent to computing cores 0 to 3 respectively for convolution operation. After the convolution operation, partial sum data are obtained (because the data to be processed is sliced along the depth dimension, the value obtained by convolving each piece of sub-data is called a partial sum, denoted Sm0 to Sm3 in fig. 1); these are the convolution operation results of the pieces of sub-data to be processed. Computing core 0 stores the partial sum data for depths 0 to 31, computing core 1 for depths 32 to 63, computing core 2 for depths 64 to 95, and computing core 3 for depths 96 to 127. The partial sum data are added step by step through the computation phases Phase1 to Phase3, and the summed partial sum data are then activated in computing core 3 through a preset activation function.
Illustratively, in the first computation phase Phase1, computing core 0 sends its stored partial sum data for depths 0 to 31 to computing core 1. Computing core 1 then adds the received partial sum data for depths 0 to 31 to the partial sum data for depths 32 to 63 stored in computing core 1, obtaining the partial sum data for depths 0 to 63.
In the second computation phase Phase2, computing core 1 sends the stored partial sum data for depths 0 to 63 to computing core 2. Computing core 2 then adds the received partial sum data for depths 0 to 63 to the partial sum data for depths 64 to 95 stored in computing core 2, obtaining the partial sum data for depths 0 to 95.
In the third computation phase Phase3, computing core 2 sends the stored partial sum data for depths 0 to 95 to computing core 3. Computing core 3 then adds the received partial sum data for depths 0 to 95 to the partial sum data for depths 96 to 127 stored in computing core 3, obtaining the partial sum data for depths 0 to 127, which is the convolution operation result of the data to be processed. Computing core 3 then activates the convolution operation result to increase the non-linearity of the data and releases the pre-activation partial sum data for depths 0 to 127, completing the convolution operation of the data to be processed.
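As an illustrative sketch only (numpy; the sizes and the ReLU activation are assumptions made for this example), the step-by-step summation can be modelled as follows:

import numpy as np

rng = np.random.default_rng(0)
H, W, Cout, N_CORES = 8, 8, 16, 4             # illustrative sizes

# Each core holds the partial sum produced from its own 32-channel depth slice.
partial_sums = [rng.standard_normal((H, W, Cout)) for _ in range(N_CORES)]

accumulated = partial_sums[0]
for core in range(1, N_CORES):                # Phase1..Phase3: forward and add step by step
    accumulated = accumulated + partial_sums[core]     # the remaining cores sit idle

result = np.maximum(accumulated, 0.0)         # only the last core activates (ReLU here)
print(result.shape)                           # (8, 8, 16)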
Therefore, in the above step-by-step summation scheme, most of the computing cores are in an idle state; for example, after computing core 1 receives the partial sum data sent by computing core 0 and computes the partial sum for depths 0 to 63, computing cores 2 and 3 are still waiting, and computing core 3 waits longest in the whole step-by-step summation process, which easily wastes computing resources. In addition, the bit width occupied by the partial sum data is large, usually four or sixteen times that of the data to be processed and the weight data; the large storage space occupied by the partial sum data may leave a computing core with insufficient remaining storage space, and the routing transmission delay of the partial sum data is also long.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating a convolution operation process implemented by loading an initial membrane potential in the related art.
As shown in fig. 2, the data to be processed Xn is not sliced in the depth direction, and the weight data W is sliced in the depth direction, that is, the depth direction of each convolution kernel in the weight data W is sliced. The weight data may include N convolution kernels. The weight data W is divided into 4 parts along the depth direction, which are weight data W0, weight data W1, weight data W2, and weight data W3, respectively. For example, if the depth of the weight data W is 128, the weight data may be divided into 4 pieces along the depth direction, W0 represents weight data with a depth of 0 to 31, W1 represents weight data with a depth of 32 to 63, W2 represents weight data with a depth of 64 to 95, and W3 represents weight data with a depth of 96 to 127.
In the T0 calculation cycle, the data to be processed Xn (n = 0 to 3) and the sliced weight data W0 to W3 are respectively sent to the computing cores Core0 to Core3: data X0 and weight data W0 are stored in Core0, data X1 and weight data W1 in Core1, data X2 and weight data W2 in Core2, and data X3 and weight data W3 in Core3.
Each computing core performs a convolution operation on the received data to be processed and the sliced weight data to obtain a convolution operation result (a partial sum). For example, the computing core Core0 performs a convolution operation on the received data X0 and the weight data W0 to obtain the operation result Vout_0, that is, Vout_0 = X0 × W0. Here, the operation result Vout_0 is the partial sum corresponding to the depth of the sliced weight data W0.
After the T0 calculation cycle is completed, the computing cores Core0 to Core3 transmit the weight data W0 to W3 they hold to other computing cores in the next calculation cycle. In each subsequent calculation cycle, each computing core synchronously receives and transmits the corresponding weight data, then performs the convolution operation of the received weight data and its data to be processed, and adds in the operation result of the previous round (calculation cycle) while performing the convolution. The result (partial sum) of the previous round is referred to as the initial membrane potential of the current round.
For example, the computing core Core0 may send the weight data W0 to the computing core Core2 and receive the weight data W1 sent by the computing core Core1. The received weight data W1 is convolved with the data X0 stored in Core0, and the resulting convolution result X0 × W1 is added to the operation result Vout_0 of the previous round (the T0 calculation cycle) to obtain the operation result Vout_1, that is, Vout_1 = X0 × W1 + Vout_0. The operation result Vout_1 is the partial sum covering the depths of the sliced weight data W0 and W1. Core0 can add the previous round's result Vout_0 while performing the convolution operation, and Vout_0 is referred to as the initial membrane potential of the current round of operation.
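As an illustrative sketch (numpy; the sizes, the variable names, and the 1×1-kernel matmul standing in for the convolution are assumptions), the weight-rotation scheme of fig. 2 can be modelled as follows; after N rounds each core holds the full-depth result for its own Xn:

import numpy as np

rng = np.random.default_rng(0)
HW, CIN, COUT, N_CORES = 16, 128, 8, 4        # illustrative sizes
SLICE = CIN // N_CORES                        # 32 depth channels per weight slice

X = [rng.standard_normal((HW, CIN)) for _ in range(N_CORES)]             # X0..X3, full depth
W_full = rng.standard_normal((CIN, COUT))
W_slices = [W_full[j * SLICE:(j + 1) * SLICE] for j in range(N_CORES)]   # W0..W3

Vout = [np.zeros((HW, COUT)) for _ in range(N_CORES)]   # running "membrane potential"
for T in range(N_CORES):                                # calculation cycles T0..T3
    for core in range(N_CORES):
        j = (core + T) % N_CORES                        # weight slice held by this core now
        x_slice = X[core][:, j * SLICE:(j + 1) * SLICE]
        Vout[core] += x_slice @ W_slices[j]             # previous Vout is the initial potential

assert np.allclose(Vout[0], X[0] @ W_full)              # Core0 ends with the full-depth result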
However, the convolution operation based on this method of loading the initial membrane potential still cannot reduce the storage space occupied by the convolution operation result in a single computing core. As in the above example, the final data stored by the computing core Core0 is Vout_3 = X0 × W0 + X0 × W1 + X0 × W2 + X0 × W3 = X0 × (W0 + W1 + W2 + W3), which still occupies a large amount of the storage space of a single computing core before activation.
In view of this, the present disclosure provides a data processing method applied to a computing core of a processor. The method generates and activates a convolution operation result based on the sub-data to be processed stored in the computing core, exchanges the sub-data to be processed with different computing cores, and releases the storage space occupied by the sub-data to be processed before the exchange; a convolution operation result is then generated and activated based on the exchanged sub-data to be processed. Because the sub-data to be processed is not divided along the depth dimension, the convolution operation result occupies only a small storage space (this will be explained later with an example); that is, the storage space occupied by the convolution operation result in a single computing core is reduced, and the data to be processed can still be processed normally when hardware resources are scarce.
Referring to fig. 3, fig. 3 is a schematic reference diagram of an artificial intelligence chip according to an embodiment of the disclosure. As shown in FIG. 3, the artificial intelligence chip may include a plurality of processors.
In one possible implementation, as shown in fig. 3, each processor may include multiple computing cores, with data transfer enabled between computing cores within each processor, and between computing cores of different processors; wherein each computing core includes a storage component for storing data for transmission with other computing cores.
In one possible implementation, as shown in FIG. 3, each compute core may include a processing component and a storage component. The processing means may comprise a dendrite unit, an axon unit, a soma unit, a routing unit. The storage part may include a plurality of storage units.
In a possible implementation manner, a plurality of processors may also be integrated into a brain-like computing chip, which is a neuromorphic integrated circuit that, taking the processing mode of the brain as a reference, improves processing efficiency and reduces power consumption by simulating the transmission and processing of information by neurons in the brain. Each processor can include a plurality of computing cores, and the computing cores can process different tasks independently or process the same task in parallel, thereby improving processing efficiency. Information transmission between the computing cores can be carried out through the routing unit within each computing core.
Within the computing core, a processing component and a storage component may be provided. The processing component may include a dendrite unit, an axon unit, a soma unit and a routing unit. The processing component can simulate the way neurons of the brain process information: the dendrite unit is used for receiving signals, the axon unit is used for sending spike signals, the soma unit is used for the integrated transformation of signals, and the routing unit is used for information transmission with other computing cores. The processing component in the computing core can read from and write to the plurality of storage units of the storage component to exchange data with the storage component within the computing core, and can undertake its own data processing tasks and/or data transmission tasks to obtain data processing results, or communicate with other computing cores. Communicating with other computing cores includes communicating with other computing cores within the same processor as well as with computing cores of other processors.
In one possible implementation manner, the storage component includes a plurality of storage units, where the storage units may be Static Random Access Memories (SRAMs). For example, an SRAM with a read/write width of 16B and a capacity of 12KB may be included. The present disclosure does not limit the capacity and bit width of the storage units.
Referring to fig. 4, fig. 4 is a flowchart of a data processing method according to an embodiment of the disclosure. The processing method comprises the following steps:
Step S100, generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core. The sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as that of the data to be processed. For example, the data to be processed may be images, audio or text data, and is multiplied by the weight data to generate the convolution operation result of the data to be processed. The weight data may include a plurality of convolution kernels for extracting different types of features from the data to be processed, and a researcher may set the relevant parameters of the convolution kernels according to the actual situation. In one example, the data to be processed may be stored in an external memory or in one processor A of the artificial intelligence chip, and the external memory or that processor A may be connected to another processor B to transmit the data to be processed to some or all of the computing cores of processor B.
In one example, step S100 may include: generating a convolution operation result based on the weight data stored in each computing core and the sub-data to be processed, and activating the convolution operation result through a preset activation function. Illustratively, the weight data stored in each computing core may be different. For example, if the weight data includes 64 convolution kernels and there are 4 computing cores, each computing core stores 16 of the 64 convolution kernels, and the data to be processed is also divided into 4 parts and sent to the 4 computing cores. In other words, when the weight data is divided, the number of pieces of sub-data to be processed is the same as the number of weight data groups. If the weight data is not divided, the number of pieces of sub-data to be processed can be chosen according to the actual hardware conditions. The activation function can be any activation function in the related art, such as a ReLU function, a Sigmoid function or a Maxout function, and embodiments of the present disclosure are not limited thereto.
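As an illustrative sketch only (numpy; the sizes, the ReLU choice, and the 1×1-kernel matmul standing in for a full convolution are assumptions), step S100 on one computing core could look as follows:

import numpy as np

def generate_and_activate(sub_data, core_weights):
    # Convolve a full-depth sub-block with this core's share of the kernels,
    # modelled here as 1x1 kernels (a matmul over the depth axis), then apply ReLU.
    h, w, cin = sub_data.shape
    result = sub_data.reshape(-1, cin) @ core_weights             # (h*w, cout_per_core)
    return np.maximum(result, 0.0).reshape(h, w, -1)

rng = np.random.default_rng(0)
sub_data = rng.standard_normal((14, 56, 128))      # e.g. one sub-block with full depth 128
core_weights = rng.standard_normal((128, 16))      # 16 of the 64 kernels in the example
print(generate_and_activate(sub_data, core_weights).shape)        # (14, 56, 16)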
Continuing with fig. 4, in step S200 the sub-data to be processed are exchanged with different computing cores until each computing core has stored each piece of sub-data to be processed. In one possible implementation, step S200 may include: determining, based on the allocation sequence number of each computing core and the number of exchanges, the number of the dynamic storage space to be occupied by each piece of sub-data to be processed that is to be exchanged; and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number. The allocation sequence number represents the order in which the sub-data to be processed are allocated.
Illustratively, the allocation sequence number indicates the order in which the sub-data to be processed are allocated to the different computing cores, that is, the storage sequence number of the sub-data to be processed in the storage dimension order. The number of exchanges indicates how many times the computing core has exchanged data with other computing cores.
For example, the weight data corresponding to each computing core may be stored in a static storage space, and the corresponding sub-data to be processed may be stored in a dynamic storage space (that is, the weight data does not change during the exchange process, whereas the sub-data to be processed in the computing core changes with every exchange). For example, the dynamic storage space can be divided into k regions, each corresponding to a dynamic storage space number; the dynamic storage space numbers used by the sub-data to be processed before and after an exchange are different, so that the two pieces of sub-data do not occupy the same dynamic storage space, which reduces the probability of data being overwritten.
For example, the number of the dynamic memory space corresponding to the above-mentioned computing core in each exchange process may be calculated according to the following Python pseudo code:
for i in range(N):
    for T in range(N):
        Address_Core[i][T] = (i + T) % M
In the above code, i is the allocation sequence number of the computing core, N is the number of pieces into which the data to be processed is divided, that is, the total number of computing cores taking part in the exchange, T is the number of exchanges, which can also be understood as the T-th time period, and M is the number of regions into which the dynamic storage space is divided. Address_Core[i][T] is the number of the dynamic storage space occupied by the i-th computing core in the T-th time period.
For example, there are 4 pieces of sub-data to be processed, A, B, C and D, stored in computing cores 0 to 3 respectively, and the dynamic storage space in each computing core is divided into 4 regions, numbered 0 to 3.
Then, in time period T0 (i.e., the 0th exchange), computing core 0 stores the sub-data A to be processed in the dynamic storage space numbered 0. Computing core 1 stores the sub-data B to be processed in the dynamic storage space numbered 1 (i.e., (i + T) % M = (1 + 0) % 4 = 1). Computing core 2 stores the sub-data C to be processed in the dynamic storage space numbered 2, and computing core 3 stores the sub-data D to be processed in the dynamic storage space numbered 3.
In time period T1 (i.e., the 1st exchange), computing core 0 receives the sub-data B to be processed and stores it in the dynamic storage space numbered 1. Computing core 1 receives the sub-data C to be processed and stores it in the dynamic storage space numbered 2. Computing core 2 receives the sub-data D to be processed and stores it in the dynamic storage space numbered 3. Computing core 3 receives the sub-data A to be processed and stores it in the dynamic storage space numbered 0 (i.e., (i + T) % M = (3 + 1) % 4 = 0).
The following time periods T2 and T3 are not repeated here. The above allocation of dynamic storage space numbers is only an example; the dynamic storage space occupied by the exchanged sub-data to be processed can be chosen arbitrarily, provided that the dynamic storage spaces of the sub-data before and after the exchange are different.
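A runnable check of this numbering rule (plain Python; the sizes follow the example above and the variable names follow the pseudocode):

N = M = 4                                   # 4 computing cores / 4 dynamic regions
Address_Core = [[(i + T) % M for T in range(N)] for i in range(N)]

# Period T0: cores 0..3 use regions 0,1,2,3; period T1: regions 1,2,3,0.
assert [Address_Core[i][0] for i in range(N)] == [0, 1, 2, 3]
assert [Address_Core[i][1] for i in range(N)] == [1, 2, 3, 0]

# On every core, consecutive periods use different regions, so the newly received
# sub-data never lands in the region still holding the sub-data being released.
assert all(Address_Core[i][T] != Address_Core[i][T + 1]
           for i in range(N) for T in range(N - 1))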
With continued reference to fig. 4, steps S210 and S220 are executed within step S200, that is, the computing core executes steps S210 and S220 each time the sub-data to be processed are exchanged. Step S210: releasing the storage space occupied by the sub-data to be processed before the exchange. Step S220: generating and activating a convolution operation result based on the exchanged sub-data to be processed.
For example, if there are n pieces of sub-data to be processed, they are allocated to n computing cores, each computing core exchanges sub-data it has not yet processed with the other computing cores, and after n-1 exchanges every piece of sub-data to be processed has been stored in every computing core. After a computing core receives new sub-data to be processed, it releases the sub-data held before the exchange to increase its available storage space, and then generates and activates a convolution operation result based on the exchanged sub-data. In one example, different computing cores may perform any of the above steps in parallel and synchronously.
For example, suppose the data to be processed is three-dimensional, with a data depth Cin (during convolution this dimension is consumed, i.e., the depth becomes 1 per kernel and then becomes Cout through Cout convolution kernels), a data width W and a data height H. In the related art, the data to be processed is divided into 4 parts in the depth direction, each computing core performs the convolution with the Cout convolution kernels, and the size of the convolved sub-data (i.e., the partial sum) in each computing core is H × W × Cout. The computing cores then exchange the partial sums, so that before activation the storage space occupied by the partial sums in a computing core is H × W × Cout × 4; after activation the data size becomes H × W × Cout.
In the embodiment of the present disclosure, the same data to be processed is divided into 4 parts in a non-depth direction, for example into 4 parts in the height direction. The size of the convolved sub-data in each computing core is then (H/4) × W × Cout, which is activated directly, and after 4 rounds of computation and exchange the accumulated data size becomes H × W × Cout.
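As an illustrative single-process sketch of this flow (numpy; the sizes, the ring-style exchange and the 1×1-kernel matmul standing in for the convolution are all assumptions):

import numpy as np

rng = np.random.default_rng(0)
H, WIDTH, CIN, COUT, N_CORES = 56, 56, 128, 64, 4
ROWS = H // N_CORES                                   # 14 rows per sub-block, as in fig. 5

data = rng.standard_normal((H, WIDTH, CIN))
sub_data = [data[i * ROWS:(i + 1) * ROWS] for i in range(N_CORES)]      # steps S300/S400
weights = [rng.standard_normal((CIN, COUT // N_CORES)) for _ in range(N_CORES)]

outputs = [dict() for _ in range(N_CORES)]            # activated results kept per core
held = list(range(N_CORES))                           # sub-block currently held by each core

for T in range(N_CORES):
    for core in range(N_CORES):
        block = held[core]
        x = sub_data[block].reshape(-1, CIN)                      # full depth is kept
        y = np.maximum(x @ weights[core], 0.0)                    # generate and activate
        outputs[core][block] = y.reshape(ROWS, WIDTH, -1)
    # Exchange: each core passes its sub-block to the next core and releases the old one.
    held = [held[(core - 1) % N_CORES] for core in range(N_CORES)]

# After N_CORES periods every core has processed every sub-block with its own kernels.
assert all(len(outputs[core]) == N_CORES for core in range(N_CORES))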
To illustrate the difference between the two approaches intuitively, the embodiments of the present disclosure enumerate the storage space occupied by each in the different time periods (also referred to as exchange rounds):
in the period T0, the storage space occupation size of the convolution operation result in the related art computation core is H × W × Cout, and the storage space occupation size of the convolution operation result in the computation core according to the embodiment of the present disclosure is (H/4) × W × Cout, which is one fourth of the related art.
In the period T1, the storage space occupation size of the convolution operation result in the related art calculation core is 2 × H × W × Cout, and the storage space occupation size of the convolution operation result in the calculation core according to the embodiment of the present disclosure is (H/2) × W × Cout, which is one fourth of the related art.
At time T2, the storage space occupation size of the convolution operation result in the related art computation core is 3 × H × W × Cout, and the storage space occupation size of the convolution operation result in the computation core of the embodiment of the present disclosure is (3/4 × H) × W × Cout, which is one fourth of the related art.
In the period T3, the storage space occupation size of the convolution operation result in the related art calculation core is 4 × H × W × Cout, and the storage space occupation size of the convolution operation result in the calculation core of the embodiment of the present disclosure is H × W × Cout, which is one fourth of the related art. Then, the correlation technique adds 4 pieces of H × W × Cout and activates, and the data size becomes H × W × Cout.
As shown in the above example, in each of the time periods T0 to T3, the storage space occupied by the convolution operation result in the related-art computing core is 4 times that of the embodiment of the present disclosure, which greatly occupies the available storage space of the computing core.
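The 4× ratio can be checked with a few lines of arithmetic (plain Python; the sizes are illustrative):

H, W, Cout, N = 56, 56, 64, 4       # illustrative sizes

for T in range(N):                  # time periods T0..T3
    related_art = (T + 1) * H * W * Cout           # unactivated partial sums accumulate
    disclosed = (T + 1) * (H // N) * W * Cout      # activated sub-results, rows split by N
    print(f"T{T}: ratio = {related_art / disclosed:.0f}")   # prints 4 for every period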
Referring to fig. 5, fig. 5 is a schematic reference diagram of a data processing method according to an embodiment of the disclosure.
As shown in fig. 5, the pieces of sub-data of the data to be processed are the data of rows 0-13, rows 14-27, rows 28-41 and rows 42-55, each with the full depth of 128. The row data 0-13 are allocated to Core0, the row data 14-27 to Core1, the row data 28-41 to Core2, and the row data 42-55 to Core3, corresponding to the T0 period in fig. 5. Weight data are stored in each computing core, that is, Core0 to Core3 store the weight data W0 to W3 respectively.
In the T0 time period, the row data 0-13 in Core0 are multiplied by W0, and the convolution operation result is generated and activated. The row data 14-27 in Core1 are multiplied by W1, the row data 28-41 in Core2 by W2, and the row data 42-55 in Core3 by W3, and the corresponding convolution operation results are generated and activated.
In the T1 time period, Core0 releases the row data 0-13 and multiplies the exchanged row data 42-55 by W0 to generate and activate a convolution operation result. Core1 releases the row data 14-27 and multiplies the exchanged row data 0-13 by W1. Core2 releases the row data 28-41 and multiplies the exchanged row data 14-27 by W2. Core3 releases the row data 42-55 and multiplies the exchanged row data 28-41 by W3, each generating and activating a convolution operation result. The time periods T2 and T3 proceed in the same way and are not described here.
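The full exchange schedule of fig. 5 can be printed with a short loop (plain Python; the row ranges and weight labels follow the example above):

ROW_BLOCKS = ["rows 0-13", "rows 14-27", "rows 28-41", "rows 42-55"]
N = len(ROW_BLOCKS)

for T in range(N):                       # periods T0..T3
    for core in range(N):
        block = (core - T) % N           # sub-block held by this core in period T
        print(f"T{T}  Core{core}: {ROW_BLOCKS[block]} x W{core}")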
The embodiments of the present disclosure provide a data processing method applied to a computing core of a processor. The processing method generates and activates a convolution operation result based on the sub-data to be processed stored in the computing core, exchanges the sub-data to be processed with different computing cores, and releases the storage space occupied by the sub-data to be processed before the exchange; a convolution operation result is then generated and activated based on the exchanged sub-data to be processed. Because the sub-data to be processed is not divided along the depth dimension, the convolution operation result occupies less storage space; that is, the storage space occupied by the convolution operation result in a single computing core is reduced, and the data to be processed can still be processed normally when hardware resources are scarce.
Referring to fig. 6, fig. 6 is a flowchart illustrating a data processing method according to an embodiment of the disclosure. In a possible implementation, the following steps are performed before step S100:
step S300, dividing the data to be processed into the sub data to be processed on at least one non-depth dimension. For example: and if the data to be processed is three-dimensional data which comprises a depth dimension, a width dimension and a height dimension, selecting at least one dimension from the width dimension and the height dimension to divide the data to be processed into the subdata to be processed. In one example, the selection of the division dimension may be performed according to the following rule: and dividing the data to be processed into the subdata to be processed in the dimension of the final storage sequence of the data to be processed and the dimension of the non-depth. Wherein the storage order dimension is used for determining the storage order of the data to be processed on different dimensions. For example: the storage sequence of the data to be processed is depth dimension, height dimension and width dimension, and the data to be processed is divided in the width dimension so as to keep the continuity of the data to be processed in the dimension and facilitate the ordered distribution of the sub-data to be processed to the computing core. According to the embodiment of the disclosure, the storage pressure of the data to be processed stored in a single computing core is reduced by storing the sub-data to be processed in a plurality of computing cores, and the storage amount of the data stored in each computing core is only 1/N under the condition that N computing cores exist (that is, under the condition that the weight data is divided into N groups). In addition, the repeated utilization rate of each piece of sub-data to be processed is higher, namely, each piece of sub-data to be processed is called when the computing cores exchange data.
Step S400, storing each piece of sub-data to be processed into a different computing core. The embodiment of the present disclosure does not limit the way in which the sub-data to be processed are allocated to the computing cores, and each computing core may hold the weight data corresponding to the data to be processed. For example, steps S300 and S400 may be performed by a processor, a computing core, or a storage medium storing the data to be processed, which then sends the divided sub-data to be processed to the different computing cores.
The steps S100 and S200 may be executed after the above steps, and the embodiment of the disclosure is not described herein again. In the case where the data to be processed has been divided into a plurality of sub-data to be processed, steps S300 and S400 may be omitted.
In one possible implementation, an embodiment of the present disclosure further provides a processor, where the processor includes a plurality of computing cores, and the processor is configured to perform: generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core; exchanging the sub-data to be processed with different computing cores, and performing the following steps until each computing core has stored each piece of sub-data to be processed: releasing the storage space occupied by the sub-data to be processed before the exchange; and generating and activating a convolution operation result based on the exchanged sub-data to be processed; wherein the sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as that of the data to be processed.
In one possible implementation, the processor is further configured to perform: dividing the data to be processed into sub-data to be processed along at least one non-depth dimension; and storing each piece of sub-data to be processed into a different computing core.
In one possible implementation, the generating and activating a convolution operation result includes: generating a convolution operation result based on the weight data stored in each computing core and the sub-data to be processed; and activating the convolution operation result through a preset activation function.
In one possible implementation, the dividing the data to be processed into sub-data to be processed along at least one non-depth dimension includes: dividing the data to be processed along the dimension that is last in the storage order of the data to be processed and is not the depth dimension; wherein the storage order of the dimensions determines the order in which the data to be processed is stored across the different dimensions.
In one possible implementation, the exchanging the sub-data to be processed with different computing cores includes: determining, based on the allocation sequence number of each computing core and the number of exchanges, the number of the dynamic storage space to be occupied by each piece of sub-data to be processed that is to be exchanged, the allocation sequence number representing the order in which the sub-data to be processed are allocated; and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number.
In one possible implementation, the plurality of computing cores perform, in parallel and synchronously, at least one of the following steps: exchanging the sub-data to be processed with different computing cores, releasing the storage space occupied by the sub-data to be processed before the exchange, and generating and activating a convolution operation result.
In some embodiments, functions or modules included in the processor provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, no further description is provided here.
In a possible implementation manner, an embodiment of the present disclosure further provides an artificial intelligence chip, where the chip includes the processor as described above. As shown in fig. 3, the chip may include one or more processors, the processors may include a plurality of computing cores, and the present disclosure does not limit the number of computing cores within the chip.
In one possible implementation manner, the embodiment of the present disclosure provides an electronic device, which includes the artificial intelligence chip as described above.
Referring to fig. 7, fig. 7 is a block diagram of an electronic device 1200 according to an embodiment of the disclosure. As shown in fig. 7, the electronic device 1200 includes a computing processing means 1202 (e.g., an artificial intelligence processor including multiple computing cores as described above), an interface means 1204, other processing means 1206, and a storage means 1208. Depending on the application scenario, one or more computing devices 1210 (e.g., computing cores) may be included in the computing processing device.
In one possible implementation, the computing processing device of the present disclosure may be configured to perform operations specified by a user. In an exemplary application, the computing processing device may be implemented as a single chip artificial intelligence processor or a multi-chip artificial intelligence processor. Similarly, one or more computing devices included within the computing processing device may be implemented as an artificial intelligence chip or as part of a hardware structure of an artificial intelligence chip. When a plurality of computing devices are implemented as artificial intelligence chips or as part of the hardware structure of artificial intelligence chips, the computing processing device of the present disclosure may be considered as having a single chip structure or a homogeneous multi-chip structure.
In an exemplary operation, the computing processing device of the present disclosure may interact with other processing devices through an interface device to collectively perform user-specified operations. Other Processing devices of the present disclosure may include one or more types of general and/or special purpose processors such as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), an artificial intelligence processor, and the like, depending on the implementation. These processors may include, but are not limited to, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic, discrete hardware components, etc., and the number may be determined based on actual needs. As previously mentioned, the computational processing apparatus of the present disclosure may be considered to have a single core structure or a homogeneous multi-core structure only. However, when considered together, a computing processing device and other processing devices may be considered to form a heterogeneous multi-core structure.
In one or more embodiments, the other processing devices may serve as an interface between the computing processing device of the present disclosure (which may be embodied as an artificial intelligence computing device, for example a device related to neural network computation) and external data and control, performing basic control including, but not limited to, data handling and starting and/or stopping of the computing device. In further embodiments, other processing devices may also cooperate with the computing processing device to collectively perform computing tasks.
In one or more embodiments, the interface device may be used to transfer data and control instructions between the computing processing device and other processing devices. For example, the computing processing device may obtain input data from other processing devices via the interface device and write the input data into a storage device (or memory) on the computing processing device. Further, the computing processing device may obtain control instructions from the other processing devices via the interface device and write them into an on-chip control cache of the computing processing device. Alternatively or optionally, the interface device may also read data from the storage device of the computing processing device and transmit the data to the other processing devices.
Additionally or alternatively, the electronic device of the present disclosure may further comprise a storage means. As shown in the figure, the storage means is connected to the computing processing means and the further processing means, respectively. In one or more embodiments, the storage device may be used to hold data for the computing processing device and/or the other processing devices. For example, the data may be data that is not fully retained within internal or on-chip storage of a computing processing device or other processing device.
According to different application scenarios, the artificial intelligence chip disclosed by the disclosure can be used for a server, a cloud server, a server cluster, a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, an intelligent terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle data recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an automatic driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
Referring to fig. 8, fig. 8 is a block diagram of an electronic device according to an embodiment of the disclosure.
For example, the electronic device 1900 may be provided as a terminal device or a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system from Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A data processing method applied to a computing core of a processor, wherein the processor comprises a plurality of computing cores, the data processing method comprising:
generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core;
exchanging the sub-data to be processed with different computing cores, and performing the following steps until each computing core has stored each piece of the sub-data to be processed:
releasing the storage space occupied by the sub-data to be processed before the exchange;
and generating and activating a convolution operation result based on the exchanged sub-data to be processed;
wherein the sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as the depth of the data to be processed.
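As an illustrative (non-claimed) reading of claim 1, the following minimal single-process sketch simulates the exchange loop, assuming a ring-shaped exchange pattern, a 1×1 convolution, and ReLU activation; the helper names (conv_and_activate, run_on_cores) and the per-core dict layout are assumptions made here for illustration, not taken from the patent.

```python
import numpy as np

def conv_and_activate(sub_data, weights):
    # Stand-in for "generate and activate a convolution operation result":
    # a 1x1 convolution over the depth axis followed by ReLU (both assumed).
    out = np.einsum('hwc,co->hwo', sub_data, weights)
    return np.maximum(out, 0.0)

def run_on_cores(cores):
    # `cores` is a list of dicts, one per computing core, each holding
    # 'sub_data' (H x W_i x C, full depth) and its resident 'weights' (C x O).
    num_cores = len(cores)
    partial_results = [[] for _ in cores]
    for step in range(num_cores):
        for i, core in enumerate(cores):
            partial_results[i].append(conv_and_activate(core['sub_data'], core['weights']))
        if step < num_cores - 1:
            # Ring exchange of sub-data between cores; rebinding the reference
            # models releasing the storage occupied by the pre-exchange piece.
            rotated = [cores[(i - 1) % num_cores]['sub_data'] for i in range(num_cores)]
            for i, core in enumerate(cores):
                core['sub_data'] = rotated[i]
    return partial_results
```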
2. The processing method of claim 1, wherein the processing method further comprises:
dividing the data to be processed into sub-data to be processed in at least one non-depth dimension;
and storing each piece of the sub-data to be processed into a different computing core.
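One possible realization of the division in claim 2, assuming the data to be processed is stored as H × W × C with the depth C as the last axis; the function name and the per-core dict are hypothetical.

```python
import numpy as np

def scatter_to_cores(data_hwc, num_cores, axis=1):
    # Split along a non-depth axis (here the width, axis=1) so every piece
    # keeps the full depth C; np.array_split tolerates uneven splits.
    assert axis != data_hwc.ndim - 1, "the depth dimension is deliberately left whole"
    pieces = np.array_split(data_hwc, num_cores, axis=axis)
    return [{'sub_data': piece} for piece in pieces]   # one storage slot per core

# Example: four cores, each receiving an 8 x 4 x 64 slice of an 8 x 16 x 64 input.
cores = scatter_to_cores(np.random.rand(8, 16, 64), num_cores=4)
```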
3. The processing method of claim 2, wherein the generating and activating a convolution operation result comprises:
generating a convolution operation result based on the weight data and the sub-data to be processed stored in each computing core;
and activating the convolution operation result through a preset activation function.
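A rough sketch of claim 3 with a general K × K "valid" convolution over the full depth; the kernel layout (kh × kw × cin × cout), the direct-loop implementation, and the tanh default for the preset activation function are all assumptions.

```python
import numpy as np

def generate_and_activate(sub_data_hwc, kernel, activation=np.tanh):
    # Direct (unoptimized) KxK 'valid' convolution over the full depth,
    # followed by the preset activation function.
    kh, kw, cin, cout = kernel.shape
    h, w, c = sub_data_hwc.shape
    assert c == cin, "sub-data keeps the full depth, matching the kernel depth"
    out = np.empty((h - kh + 1, w - kw + 1, cout))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = sub_data_hwc[y:y + kh, x:x + kw, :]          # kh x kw x cin
            out[y, x, :] = np.tensordot(patch, kernel, axes=3)   # contract kh, kw, cin
    return activation(out)
```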
4. The processing method of claim 2, wherein the dividing the data to be processed into sub-data to be processed in at least one non-depth dimension comprises:
dividing the data to be processed into sub-data to be processed in the last non-depth dimension in the storage order of the data to be processed; wherein the storage order dimension is used for determining the storage order of the data to be processed in different dimensions.
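A small sketch of claim 4 under the assumption that the storage order is supplied as a tuple of dimension names such as ('H', 'W', 'C'); the split axis is then the last entry that is not the depth dimension.

```python
import numpy as np

def split_on_last_non_depth_dim(data, storage_order=('H', 'W', 'C'),
                                depth_name='C', num_cores=4):
    # Choose the last non-depth dimension in the storage order and split there,
    # so the depth of every piece equals the depth of the whole tensor.
    non_depth_axes = [i for i, name in enumerate(storage_order) if name != depth_name]
    axis = non_depth_axes[-1]
    return np.array_split(data, num_cores, axis=axis)
```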
5. The processing method of claim 2, wherein the exchanging the sub-data to be processed with different computing cores comprises:
determining the number of the dynamic storage space to be occupied by each piece of sub-data to be exchanged, based on the distribution sequence number and the number of exchanges of each computing core; wherein the distribution sequence number is used for representing the distribution order of the sub-data to be processed;
and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number.
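One plausible (assumed) reading of the numbering rule in claim 5: each core keeps a small pool of dynamic buffers and picks a slot from its distribution sequence number and the number of exchanges performed so far, so that an incoming piece never lands in the buffer that is still being released. The pool size of two and the modular formula are assumptions, not taken from the patent.

```python
NUM_DYNAMIC_SLOTS = 2   # assumed size of the per-core dynamic buffer pool

def dynamic_slot(distribution_seq: int, exchange_count: int) -> int:
    # Alternate slots based on distribution sequence number and exchange count.
    return (distribution_seq + exchange_count) % NUM_DYNAMIC_SLOTS

# Example: the sub-data with distribution sequence number 3, on its 1st exchange,
# is placed into dynamic storage space number (3 + 1) % 2 = 0.
slot = dynamic_slot(distribution_seq=3, exchange_count=1)
```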
6. The processing method of claim 1, wherein the plurality of computing cores perform at least one of the steps of exchanging the sub-data to be processed with different computing cores, releasing the storage space occupied by the sub-data to be processed before the exchange, and generating and activating a convolution operation result by means of parallel synchronous operation.
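A rough illustration of the parallel synchronous operation in claim 6, with one thread standing in for each computing core and a barrier enforcing lock-step phases; the thread-per-core model and the callback names are assumptions, not the chip's actual synchronization mechanism.

```python
import threading

def core_worker(core_id, num_rounds, barrier, compute, exchange, release):
    for _ in range(num_rounds):
        compute(core_id)
        barrier.wait()   # all cores finish computing before any exchange starts
        exchange(core_id)
        release(core_id)
        barrier.wait()   # all cores finish exchanging before the next round

def run_parallel(num_cores, num_rounds, compute, exchange, release):
    barrier = threading.Barrier(num_cores)
    threads = [threading.Thread(target=core_worker,
                                args=(i, num_rounds, barrier, compute, exchange, release))
               for i in range(num_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```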
7. A processor, wherein the processor comprises a plurality of computing cores, and the processor is configured to perform:
generating and activating a convolution operation result based on the sub-data to be processed stored in the computing core;
exchanging the sub-data to be processed with different computing cores, and performing the following steps until each computing core has stored each piece of the sub-data to be processed:
releasing the storage space occupied by the sub-data to be processed before the exchange;
and generating and activating a convolution operation result based on the exchanged sub-data to be processed;
wherein the sub-data to be processed is a part of the data to be processed, and the depth of the sub-data to be processed is the same as the depth of the data to be processed.
8. The processor of claim 7, wherein the processor is further configured to perform:
dividing the data to be processed into sub-data to be processed in at least one non-depth dimension;
and storing each piece of the sub-data to be processed into a different computing core.
9. The processor of claim 8, wherein the generating and activating a convolution operation result comprises:
generating a convolution operation result based on the weight data and the sub-data to be processed stored in each computing core;
and activating the convolution operation result through a preset activation function.
10. The processor of claim 8, wherein the dividing the data to be processed into sub-data to be processed in at least one non-depth dimension comprises:
dividing the data to be processed into sub-data to be processed in the last non-depth dimension in the storage order of the data to be processed; wherein the storage order dimension is used for determining the storage order of the data to be processed in different dimensions.
11. The processor of claim 8, wherein the exchanging the sub-data to be processed with different computing cores comprises: determining the number of the dynamic storage space to be occupied by each piece of sub-data to be exchanged, based on the distribution sequence number and the number of exchanges of each computing core; wherein the distribution sequence number is used for representing the distribution order of the sub-data to be processed;
and allocating a dynamic storage space to each piece of sub-data to be exchanged according to the dynamic storage space number.
12. The processor of claim 7, wherein the plurality of computing cores perform at least one of the exchanging of the sub-data to be processed with different computing cores, the releasing of the storage space occupied by the sub-data to be processed before the exchange, and the generating and activating of a convolution operation result by means of parallel synchronous operation.
13. An artificial intelligence chip, wherein the artificial intelligence chip comprises a processor according to any one of claims 7 to 12.
14. An electronic device, characterized in that the electronic device comprises an artificial intelligence chip according to claim 13.
CN202111542656.2A 2021-12-16 2021-12-16 Data processing method, processor, artificial intelligence chip and electronic equipment Pending CN114201727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111542656.2A CN114201727A (en) 2021-12-16 2021-12-16 Data processing method, processor, artificial intelligence chip and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111542656.2A CN114201727A (en) 2021-12-16 2021-12-16 Data processing method, processor, artificial intelligence chip and electronic equipment

Publications (1)

Publication Number Publication Date
CN114201727A true CN114201727A (en) 2022-03-18

Family

ID=80654477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111542656.2A Pending CN114201727A (en) 2021-12-16 2021-12-16 Data processing method, processor, artificial intelligence chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN114201727A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858178A (en) * 2023-02-21 2023-03-28 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for resource sharing in convolution calculation
CN115858178B (en) * 2023-02-21 2023-06-06 芯砺智能科技(上海)有限公司 Method, device, medium and equipment for sharing resources in convolution calculation

Similar Documents

Publication Publication Date Title
CN111309486B (en) Conversion method, conversion device, computer equipment and storage medium
CN109284815B (en) Neural network model algorithm compiling method and device and related products
CN109543825B (en) Neural network model algorithm compiling method and device and related products
CN111488205B (en) Scheduling method and scheduling system for heterogeneous hardware architecture
CN116467061B (en) Task execution method and device, storage medium and electronic equipment
CN111767995B (en) Operation method, device and related product
CN114201727A (en) Data processing method, processor, artificial intelligence chip and electronic equipment
CN112799598B (en) Data processing method, processor and electronic equipment
CN112084023A (en) Data parallel processing method, electronic equipment and computer readable storage medium
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
CN111260070B (en) Operation method, device and related product
CN111260046B (en) Operation method, device and related product
CN111767999B (en) Data processing method and device and related products
CN111258641B (en) Operation method, device and related product
CN109542837B (en) Operation method, device and related product
CN112817898A (en) Data transmission method, processor, chip and electronic equipment
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
CN114201443A (en) Data processing method and device, electronic equipment and storage medium
CN112232498B (en) Data processing device, integrated circuit chip, electronic equipment, board card and method
CN111339060B (en) Operation method, device, computer equipment and storage medium
CN111275197B (en) Operation method, device, computer equipment and storage medium
CN112396169B (en) Operation method, device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111353125B (en) Operation method, operation device, computer equipment and storage medium
CN109558564B (en) Operation method, device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination