CN116629320B - Neural network optimization method, device, storage medium and chip - Google Patents
Neural network optimization method, device, storage medium and chip
- Publication number: CN116629320B (application CN202310903079.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- features
- slice
- dimension
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The application relates to the technical field of data processing and provides a neural network optimization method, a device, a storage medium and a chip. The neural network optimization method includes the following steps: extracting a convolution layer from the neural network and converting the data structure of the convolution kernel in the convolution layer to obtain a target convolution kernel; slicing the input features of the convolution layer along a target dimension to obtain a plurality of slice features; inputting one slice feature and the target convolution kernel into a matrix multiplication module to generate the intermediate feature corresponding to that slice feature; and generating output data from the plurality of intermediate features. The technical solution provided by the application can improve the data processing efficiency of the neural network.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for optimizing a neural network, a storage medium, and a chip.
Background
In the related art, convolutional neural networks are widely applied in the field of data processing. When a convolutional neural network processes data, the input data is generally fed directly into the convolution layer for calculation; when the amount of data is large, the calculation speed of the convolution layer is slow, which reduces the data processing efficiency of the convolutional neural network.
Therefore, how to improve the data processing efficiency of the convolutional neural network becomes a technical problem to be solved.
Disclosure of Invention
The present application aims to solve at least one of the technical problems existing in the prior art.
To this end, a first aspect of the application provides a method of optimizing a neural network.
A second aspect of the present application provides an optimizing apparatus for a neural network.
A third aspect of the application provides a data processing method.
A fourth aspect of the application provides a RISC-V device.
A fifth aspect of the present application proposes a readable storage medium.
A sixth aspect of the application proposes a computer program product.
A seventh aspect of the application proposes a chip.
The first aspect of the present application provides a neural network optimization method, including: extracting a convolution layer from the neural network and converting the data structure of the convolution kernel in the convolution layer to obtain a target convolution kernel; slicing the input features of the convolution layer along a target dimension to obtain a plurality of slice features; inputting one slice feature and the target convolution kernel into a matrix multiplication module to generate the intermediate feature corresponding to that slice feature; and generating an output feature from the plurality of intermediate features.
The neural network optimization method can be used to optimize the feature processing procedure of the convolution layer in the neural network, so as to increase the feature processing speed of the convolution layer and, in turn, the overall feature processing speed of the neural network.
Specifically, the matrix multiplication module multiplies the input features of the convolution layer by the convolution kernel of the convolution layer. On the basis of realizing the convolution calculation on the input features, this converts the original scalar calculation into a vector calculation, effectively improving the calculation efficiency and thus the speed at which the convolution layer processes the input features.
Specifically, a convolution layer in the neural network is first extracted, and data structure conversion is performed on the convolution kernel in the convolution layer to obtain the target convolution kernel. Converting the data structure of the convolution kernel makes it suitable for the calculation process of the matrix multiplication module on the feature data, thereby ensuring the accuracy of the features obtained after calculation.
Further, the input features of the convolution layer are sliced along the target dimension to obtain a plurality of slice features. During calculation by the matrix multiplication module, one slice feature and the target convolution kernel are input into the module at a time, yielding the intermediate feature corresponding to that slice feature. Once all slice features have been calculated, a number of intermediate features equal to the number of slice features is obtained, and finally the output feature of the convolution layer is generated from these intermediate features.
Slicing the input features of the convolution layer along the target dimension to obtain a plurality of slice features allows the matrix multiplication module to perform several smaller matrix multiplications, one slice against the target convolution kernel at a time. This avoids the large amount of temporary data, and hence memory, that multiplying all of the data at once would generate, effectively reducing the memory pressure of the neural network and improving the calculation speed. Meanwhile, slicing along the target dimension precisely specifies the slicing rule for the input features, preventing the slice features from becoming disordered, and also provides an integration rule for combining the plurality of intermediate features.
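The slicing-plus-matrix-multiplication flow above can be sketched in NumPy. This is a minimal illustration under assumed conditions — stride 1, no padding, and a hypothetical `slice_rows` block size — not the patented implementation itself:

```python
import numpy as np

def conv2d_sliced_matmul(feat, kernel, slice_rows=4):
    """Stride-1, no-padding convolution expressed as matrix multiplication,
    processing the merged height*width (target) dimension slice by slice.

    feat:   input feature, shape (C_in, H, W)
    kernel: convolution kernel, shape (C_out, C_in, KH, KW)
    """
    c_in, h, w = feat.shape
    c_out, _, kh, kw = kernel.shape
    oh, ow = h - kh + 1, w - kw + 1

    # "Target convolution kernel": the kernel flattened into a 2-D matrix.
    k_mat = kernel.reshape(c_out, c_in * kh * kw)

    # im2col: each output position becomes one row; height and width are
    # merged into a single target dimension of length OH*OW.
    cols = np.empty((oh * ow, c_in * kh * kw), dtype=feat.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = feat[:, i:i + kh, j:j + kw].ravel()

    # Multiply one slice feature at a time against the target kernel, so
    # only a small intermediate feature is live in memory per step.
    intermediates = [cols[s:s + slice_rows] @ k_mat.T
                     for s in range(0, oh * ow, slice_rows)]
    out = np.concatenate(intermediates, axis=0)  # integrate intermediates
    return out.T.reshape(c_out, oh, ow)
```

Because the slices are taken in order along the target dimension, concatenating the intermediate features in the same order reconstructs the output feature exactly — the "integration rule" mentioned above.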
In addition, the matrix multiplication module may be assembled on the basis of the RISC-V Vector extension instruction set (RVV instruction set). Different assembly kernels are implemented with different block sizes such as 4×4 and 4×6; the calculation is parallelized over the output channel dimension and over the target dimension of the slice features, while the multiply-accumulate calculation proceeds along the dimension of size input channels × target convolution kernel height × target convolution kernel width.
Specifically, the RVV instruction set defines two forms of multiply-accumulate instruction, vfmacc.vf and vfmacc.vv, which give rise to two different matrix multiplication assembly microkernels. The vfmacc.vf microkernel uses a vectorized load instruction to load one of the two input matrices (i.e., the slice feature and the target convolution kernel) from memory into registers, and a floating-point load instruction for the other. The vfmacc.vv microkernel uses vectorized load instructions for both matrices and then broadcasts the data with the vrgather.vi instruction defined in the RVV instruction set to complete the multiply-accumulate operation. Each of the two microkernels has its own advantages under different convolution-layer parameter configurations; in the preparation stage, the best-performing matrix multiplication assembly microkernel for each convolution layer can therefore be selected by a pre-run, the selected microkernel function recorded, and then called back in the run stage to multiply the two matrices.
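The pre-run selection can be sketched as follows. The two "microkernels" below are plain NumPy stand-ins for the vfmacc.vf / vfmacc.vv assembly kernels, and the function names and timing scheme are illustrative assumptions, not the patent's actual code:

```python
import time
import numpy as np

def microkernel_vf(a, b):
    """Stand-in for the vfmacc.vf-style assembly microkernel."""
    return a @ b

def microkernel_vv(a, b):
    """Stand-in for the vfmacc.vv-style assembly microkernel
    (elementwise broadcast then reduction along the shared dimension)."""
    return (a[:, :, None] * b[None, :, :]).sum(axis=1)

def select_microkernel(shape_a, shape_b, trials=3):
    """Preparation stage: time each candidate microkernel on this layer's
    matrix shapes and record the faster one for callback at run time."""
    a = np.ones(shape_a, dtype=np.float32)
    b = np.ones(shape_b, dtype=np.float32)
    best, best_time = None, float("inf")
    for kernel in (microkernel_vf, microkernel_vv):
        start = time.perf_counter()
        for _ in range(trials):
            kernel(a, b)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = kernel, elapsed
    return best
```

Whichever kernel wins the pre-run, both compute the same product, so the recorded callback is interchangeable in the run stage.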
According to the neural network optimization method of the application, the matrix multiplication module multiplies the input features of the convolution layer by the convolution kernel of the convolution layer, so that, on the basis of realizing the convolution calculation on the input features, the original scalar calculation is converted into a vector calculation, effectively improving the calculation efficiency and thus the speed at which the convolution layer processes the input features. Meanwhile, the input features of the convolution layer are sliced, and only one slice feature and the target convolution kernel are input into the matrix multiplication module at a time, which prevents the temporary data generated by multiplying a large amount of data at once from occupying a large amount of memory, effectively reducing the memory pressure of the neural network and improving the calculation speed.
According to a second aspect of the present application, there is provided a neural network optimization apparatus, comprising: a conversion unit for extracting a convolution layer from the neural network and converting the data structure of the convolution kernel in the convolution layer to obtain a target convolution kernel; a processing unit for slicing the input features of the convolution layer along a target dimension to obtain a plurality of slice features; and a generating unit for inputting one slice feature and the target convolution kernel into a matrix multiplication module to generate the intermediate feature corresponding to that slice feature, and for generating an output feature from the plurality of intermediate features.
According to a third aspect of the present application, there is provided a data processing method comprising: acquiring data to be processed; and processing the data to be processed with a neural network obtained by the neural network optimization method of any of the above technical solutions, to obtain a processing result.
According to a fourth aspect of the present application, there is provided a RISC-V device supporting the RISC-V Vector extension instruction set, on which is deployed a neural network obtained by the neural network optimization method of any of the above technical solutions.
According to a fifth aspect of the present application, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor performs the steps of: extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel; slicing the input features of the convolution layer according to the target dimension to obtain a plurality of slice features; inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature; an output feature is generated from the plurality of intermediate features.
In a sixth aspect of the application, a computer program product is presented, comprising a computer program or instructions which, when executed by a processor, performs the steps of: extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel; slicing the input features of the convolution layer according to the target dimension to obtain a plurality of slice features; inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature; an output feature is generated from the plurality of intermediate features.
In a seventh aspect of the present application, a chip is provided, including a program or instructions, for implementing the following steps when the chip is running: extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel; slicing the input features of the convolution layer according to the target dimension to obtain a plurality of slice features; inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature; an output feature is generated from the plurality of intermediate features.
Additional aspects and advantages of the application will be set forth in part in the description which follows, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and may be better understood from the following description of embodiments taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic flow chart of a neural network optimization method according to one embodiment of the present application;
FIG. 2 is a schematic flow chart of a neural network optimization method according to another embodiment of the present application;
FIG. 3 is a schematic flow chart of a neural network optimization method according to still another embodiment of the present application;
FIG. 4 is a schematic flow chart of a neural network optimization method according to still another embodiment of the present application;
FIG. 5 is a schematic flow chart of a neural network optimization method according to yet another embodiment of the present application;
FIG. 6 is a block flow diagram of the operation of the matrix multiplication module in the neural network optimization method according to an embodiment of the present application;
FIG. 7 is a block flow diagram of a neural network optimization method according to one embodiment of the present application;
FIG. 8 is a block diagram of a neural network optimization apparatus according to an embodiment of the present application.
The correspondence between the reference numerals and the component names in fig. 8 is:
600 neural network optimization device, 602 conversion unit, 604 processing unit, 606 generation unit.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will be more clearly understood, a more particular description of the application will be rendered by reference to the appended drawings and appended detailed description. It should be noted that, without conflict, the embodiments of the present application and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced otherwise than as described herein, and therefore the scope of the present application is not limited to the specific embodiments disclosed below.
Methods, apparatuses, storage media, and chips for optimizing a neural network according to some embodiments of the present application are described below with reference to fig. 1 to 8.
As shown in fig. 1, according to an embodiment of the present application, there is provided a method for optimizing a neural network, including:
S102, extracting a convolution layer from the neural network, and converting the data structure of the convolution kernel in the convolution layer to obtain a target convolution kernel;
S104, slicing the input features of the convolution layer along a target dimension to obtain a plurality of slice features;
S106, inputting one slice feature and the target convolution kernel into a matrix multiplication module to generate the intermediate feature corresponding to that slice feature;
S108, generating an output feature from the plurality of intermediate features.
The neural network optimization method can be used to optimize the feature processing procedure of the convolution layer in the neural network, so as to increase the feature processing speed of the convolution layer and, in turn, the overall feature processing speed of the neural network.
Specifically, the matrix multiplication module multiplies the input features of the convolution layer by the convolution kernel of the convolution layer. On the basis of realizing the convolution calculation on the input features, this converts the original scalar calculation into a vector calculation, effectively improving the calculation efficiency and thus the speed at which the convolution layer processes the input features.
Specifically, a convolution layer in the neural network is first extracted, and data structure conversion is performed on the convolution kernel in the convolution layer to obtain the target convolution kernel. Converting the data structure of the convolution kernel makes it suitable for the calculation process of the matrix multiplication module on the feature data, thereby ensuring the accuracy of the features obtained after calculation.
Further, the input features of the convolution layer are sliced along the target dimension to obtain a plurality of slice features. During calculation by the matrix multiplication module, one slice feature and the target convolution kernel are input into the module at a time, yielding the intermediate feature corresponding to that slice feature. Once all slice features have been calculated, a number of intermediate features equal to the number of slice features is obtained, and finally the output feature of the convolution layer is generated from these intermediate features.
Slicing the input features of the convolution layer along the target dimension to obtain a plurality of slice features allows the matrix multiplication module to perform several smaller matrix multiplications, one slice against the target convolution kernel at a time. This avoids the large amount of temporary data, and hence memory, that multiplying all of the data at once would generate, effectively reducing the memory pressure of the neural network and improving the calculation speed. Meanwhile, slicing along the target dimension precisely specifies the slicing rule for the input features, preventing the slice features from becoming disordered, and also provides an integration rule for combining the plurality of intermediate features.
In addition, the matrix multiplication module may be assembled on the basis of the RISC-V Vector extension instruction set (RVV instruction set). Different assembly kernels are implemented with different block sizes such as 4×4 and 4×6; the calculation is parallelized over the output channel dimension and over the target dimension of the slice features, while the multiply-accumulate calculation proceeds along the dimension of size input channels × target convolution kernel height × target convolution kernel width.
Specifically, as shown in FIG. 6, the RVV instruction set defines two forms of multiply-accumulate instruction, vfmacc.vf and vfmacc.vv, which give rise to two different matrix multiplication assembly microkernels. The vfmacc.vf microkernel uses a vectorized load instruction to load one of the two input matrices (i.e., the slice feature and the target convolution kernel) from memory into registers, and a floating-point load instruction for the other. The vfmacc.vv microkernel uses vectorized load instructions for both matrices and then broadcasts the data with the vrgather.vi instruction defined in the RVV instruction set to complete the multiply-accumulate operation. Each of the two microkernels has its own performance advantages under different convolution-layer parameter configurations; in the preparation stage, the best-performing matrix multiplication assembly microkernel for each convolution layer can therefore be selected by a pre-run, the selected microkernel function recorded, and then called back in the run stage to multiply the two matrices.
According to the neural network optimization method of the application, the matrix multiplication module multiplies the input features of the convolution layer by the convolution kernel of the convolution layer, so that, on the basis of realizing the convolution calculation on the input features, the original scalar calculation is converted into a vector calculation, effectively improving the calculation efficiency and thus the speed at which the convolution layer processes the input features. Meanwhile, the input features of the convolution layer are sliced, and only one slice feature and the target convolution kernel are input into the matrix multiplication module at a time, which prevents the temporary data generated by multiplying a large amount of data at once from occupying a large amount of memory, effectively reducing the memory pressure of the neural network and improving the calculation speed.
According to an embodiment of the present application, as shown in fig. 2, a method for optimizing a neural network is provided, including:
S202, extracting a convolution layer from the neural network, and converting the data structure of the convolution kernel in the convolution layer to obtain a target convolution kernel;
S204, converting the data structure of the input features into a first target data structure corresponding to the module parameters of the matrix multiplication module, and generating a first intermediate input feature;
S206, combining the first dimension and the second dimension of the first intermediate input feature into a target dimension using a first preset algorithm, to generate a second intermediate input feature;
S208, slicing the second intermediate input feature along the target dimension to obtain a plurality of slice features;
S210, inputting one slice feature and the target convolution kernel into the matrix multiplication module to generate the intermediate feature corresponding to that slice feature;
S212, generating an output feature from the plurality of intermediate features.
The first dimension is the height of the first intermediate input feature, and the second dimension is the width of the first intermediate input feature.
In this embodiment, the specific procedure for slicing the input features of the convolution layer is as follows: first, the data structure of the input features is converted into a first target data structure corresponding to the module parameters of the matrix multiplication module, generating a first intermediate input feature; a second intermediate input feature is then obtained from the first intermediate input feature. This ensures that the data structure of the slice features obtained after slicing the intermediate input feature corresponds to the module parameters of the matrix multiplication module, so that the matrix multiplication module calculates the slice features accurately, i.e., the accuracy of the final output feature is guaranteed.
In particular, for a matrix multiplication module assembled with the RVV instruction set, the data structure of the input features may be converted from NCHW to NC4HW4. In this way, boundary handling in the height and width directions is removed from the slicing of the intermediate input feature, improving slicing efficiency. Meanwhile, the vectorized load instructions in the RVV instruction set can be fully utilized to load the intermediate data from memory into registers.
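As a sketch, the NCHW-to-NC4HW4 repacking can be written as follows; the group size of 4 and zero-padding of the channel tail are assumptions drawn from the layout's name rather than details stated in the text:

```python
import numpy as np

def nchw_to_nc4hw4(x):
    """Repack an NCHW tensor into NC4HW4: channels are grouped in blocks
    of 4 so that one vectorized load fetches 4 channel values that are
    contiguous in memory. The channel count is zero-padded to a multiple
    of 4. Output shape: (N, ceil(C/4), H, W, 4)."""
    n, c, h, w = x.shape
    c4 = (c + 3) // 4
    padded = np.zeros((n, c4 * 4, h, w), dtype=x.dtype)
    padded[:, :c] = x
    # (N, C4*4, H, W) -> (N, C4, 4, H, W) -> (N, C4, H, W, 4)
    return padded.reshape(n, c4, 4, h, w).transpose(0, 1, 3, 4, 2)
```

With the 4 channel values innermost, slicing along height and width never crosses a channel-block boundary, which is consistent with the boundary-removal benefit described above.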
Further, the first dimension and the second dimension of the first intermediate input feature are combined into the target dimension by a first preset algorithm, converting the first intermediate input feature into the second intermediate input feature. That is, the first dimension and the second dimension of the intermediate input feature are arranged sequentially into a single dimension, namely the target dimension. This reduces the dimensionality of the intermediate input feature, further simplifying its dimensions, and slicing of the intermediate input feature can then proceed along the target dimension.
The first dimension is the height of the first intermediate input feature, and the second dimension is its width. The first preset algorithm may be the im2col algorithm: the height and the width of the input feature are combined into the target dimension, while the other dimensions of the input feature (for example, the input channel dimension) remain unchanged.
Further, slicing is performed on the second intermediate input feature according to the target dimension, and a plurality of slice features are obtained.
In particular, for a matrix multiplication module of RVV instruction set assembly, the number of slices may be set to four or six.
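Once the height and width are merged, slicing is a simple row split of the 2-D matrix along the target dimension. Taking four rows per slice — matching the 4-row microkernel block — is an assumption in this sketch:

```python
import numpy as np

def slice_target_dim(feat2d, rows_per_slice=4):
    """Split a (target_dim, C_in*KH*KW) matrix into row slices matching
    the microkernel block height; the tail slice may be shorter."""
    return [feat2d[s:s + rows_per_slice]
            for s in range(0, feat2d.shape[0], rows_per_slice)]
```

Stacking the slices back in order recovers the original matrix, so no reordering information needs to be stored alongside them.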
Further, in the process of converting the data structure of the convolution kernel in the convolution layer to obtain the target convolution kernel, the data structure of the convolution kernel may likewise be converted into a second target data structure corresponding to the module parameters of the matrix multiplication module, generating the target convolution kernel. This ensures that the data structure of the target convolution kernel corresponds to the module parameters of the matrix multiplication module, so that the matrix multiplication module can compute on the target convolution kernel accurately, which in turn guarantees the accuracy of the final output feature.
According to an embodiment of the present application, as shown in fig. 3, a method for optimizing a neural network is provided, including:
S302, extracting a convolution layer in a neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel;
S304, converting the data structure of the input features into a first target data structure corresponding to the module parameters of the matrix multiplication module, and generating first intermediate input features;
S306, combining the first dimension and the second dimension of the first intermediate input feature into a target dimension by using a first preset algorithm to generate a second intermediate input feature;
S308, slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features;
S310, acquiring a convolution kernel parameter of a target convolution kernel;
S312, rearranging the data structure of the slice features according to the convolution kernel parameters;
S314, inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature;
S316, generating output features according to the plurality of intermediate features.
In this embodiment, after the second intermediate input feature is sliced into a plurality of slice features, and before the slice features and the target convolution kernel are input to the matrix multiplication module for multiplication, the convolution kernel parameters of the target convolution kernel can be obtained. The data structure of the slice features is then rearranged according to these parameters, so that the data structure of the intermediate feature produced by multiplying a slice feature with the target convolution kernel is identical to the data structure of the slice feature before rearrangement. This guarantees the accuracy of the intermediate features, and thus the accuracy of the output feature of the convolution layer obtained from them.
According to an embodiment of the present application, as shown in fig. 4, a method for optimizing a neural network is provided, including:
S402, extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel;
S404, converting the data structure of the input features into a first target data structure corresponding to the module parameters of the matrix multiplication module, and generating first intermediate input features;
S406, combining the first dimension and the second dimension of the first intermediate input feature into a target dimension by using a first preset algorithm to generate a second intermediate input feature;
S408, slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features;
S410, acquiring a convolution kernel parameter of a target convolution kernel;
S412, rearranging the data structure of the slice feature according to the convolution kernel parameters;
S414, adjusting the data structure of the slice feature according to the third dimension of the slice feature;
S416, inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature;
S418, generating output features according to the plurality of intermediate features.
In this embodiment, after the second intermediate input feature is sliced into a plurality of slice features, and before the slice features and the target convolution kernel are input to the matrix multiplication module for multiplication, the data structure of the slice features may also be adjusted according to a third dimension of the slice features, which may be their output channel dimension. Specifically, the adjustment may change the number of output channels of the slice feature so that, during the multiplication with the convolution kernel, the number of output channels matches the data structure of the convolution kernel. In other words, every output channel of the slice feature can participate in the multiplication, avoiding the data redundancy, and the resulting loss of calculation efficiency, that would arise if some output channels could not take part in the calculation.
Further, adjusting the data structure of the slice feature according to its third dimension includes: keeping the number of output channels unchanged when it is divisible by a preset value; and, when it is not divisible by the preset value, adding output channels until it is; wherein the data of all other dimensions in each added output channel is set to 0.
Specifically, the data structure of the slice feature is adjusted according to its third dimension, that is, according to the number of output channels of the slice feature. The preset value may be 4, 6, or another value chosen according to the matrix multiplication module.
Specifically, for a matrix multiplication module assembled with the RVV instruction set, the preset value may be set to 4. When the number of output channels of the slice feature is divisible by 4, there is no need to adjust it: during multiplication by the matrix multiplication module, the data of all output channels participates in the calculation, no data redundancy occurs, and the efficiency of the multiplication is unaffected.
Accordingly, when the number of output channels of the slice feature is not divisible by 4, it must be adjusted. Specifically, output channels are added until the count is divisible by 4, so that all output channels of the adjusted slice feature can participate in the multiplication and data redundancy is avoided. After the multiplication, the added output channels are removed from the intermediate feature, restoring its data structure to that of the slice feature before adjustment and thereby preserving the data structure of the output feature of the convolution layer.
Further, the data of all other dimensions in each added output channel is set to 0, so that the added channels cannot affect the result of the multiplication.
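The adaptive channel padding can be sketched as below. The function name `pad_channels` and the list-of-rows representation are assumptions for illustration; the key properties from the text are preserved: counts already divisible by the preset value are left alone, and added channels are all-zero so they cannot perturb the matrix product.

```python
def pad_channels(feature, preset=4):
    """Pad the output-channel axis (outermost list) with all-zero
    channels until its length is divisible by `preset`. Zero channels
    contribute nothing to the multiplication and are dropped again
    after the matmul, restoring the original data structure."""
    n = len(feature)
    if n % preset == 0:
        return feature                   # already aligned, keep as-is
    width = len(feature[0])
    pad = preset - n % preset
    return feature + [[0.0] * width for _ in range(pad)]
```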
According to an embodiment of the present application, as shown in fig. 5, a method for optimizing a neural network is provided, including:
S502, extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel;
S504, converting the data structure of the input features into a first target data structure corresponding to the module parameters of the matrix multiplication module, and generating first intermediate input features;
S506, combining the first dimension and the second dimension of the first intermediate input feature into a target dimension by using a first preset algorithm to generate a second intermediate input feature;
S508, slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features;
S510, acquiring a convolution kernel parameter of a target convolution kernel;
S512, rearranging the data structure of the slice features according to the convolution kernel parameters;
S514, adjusting the data structure of the slice feature according to the third dimension of the slice feature;
S516, inputting a slice feature and a target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature;
S518, converting the data structure of the intermediate feature into the same data structure as the data structure of the input feature;
S520, splicing the plurality of intermediate features according to the first dimension and the second dimension to generate output features.
In this embodiment, after each slice feature has been matrix-multiplied with the convolution kernel by the matrix multiplication module, the output feature of the convolution layer can be generated from the resulting plurality of intermediate features. Since the input feature was split into a plurality of slice features before the matrix multiplication, the plurality of intermediate features obtained afterwards must be spliced together to produce a complete output feature.
Specifically, the plurality of intermediate features can be spliced according to the first dimension and the second dimension of the input feature, so that the generated output feature has the same data structure as the input feature, ensuring the accuracy of the output feature.
Further, after matrix multiplication is performed on each slice feature and the convolution kernel by the matrix multiplication module, output features of the convolution layer can be generated according to the generated plurality of intermediate features.
Further, before the output feature is generated from the plurality of intermediate features, the data structure of the intermediate features must be converted into the same data structure as that of the input feature. Recall that, before slicing, the data structure of the input feature was first converted into the first target data structure corresponding to the module parameters of the matrix multiplication module, to ensure that the slice features match those parameters and can be computed on accurately. Consequently, after a slice feature undergoes matrix multiplication, the resulting intermediate feature is in the first target data structure, and it must be converted from the first target data structure back into the data structure of the input feature. This guarantees that the generated output feature has the same data structure as the input feature, that is, it guarantees the accuracy of the output feature.
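The splicing step, undoing the earlier slicing and folding the merged h*w dimension back into height and width, can be sketched as follows. The function name `splice_outputs` and the nested-list layout are assumptions for illustration only.

```python
def splice_outputs(slices, h_out, w_out):
    """Concatenate the per-slice outputs along the merged h*w row
    dimension (undoing the slicing), then fold that dimension back into
    (h_out, w_out) so the spliced output feature has the same height
    and width structure as the input feature."""
    rows = [r for s in slices for r in s]        # undo the slicing
    assert len(rows) == h_out * w_out, "slices must cover every output position"
    return [rows[i * w_out:(i + 1) * w_out] for i in range(h_out)]
```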
In the process of processing the input features of the convolution layer, as shown in fig. 7, the convolution layer in the neural network is first extracted, and parameters such as the numbers of input and output channels, stride factors, boundary padding parameters and dilation factors of the convolution layer are initialized and configured. Then, in the initialization stage, the convolution kernel of the convolution layer is converted by a C4 layout adjustment module, and the data structures of the input features of the convolution layer and of the convolution kernel are converted from NCHW to NC4HW4. The adjusted input features are then fed into a blocked im2col data rearrangement module, which slices the input features and generates slice features along the target dimension, namely the merged h*w dimension obtained by combining the height and width of the input features, and rearranges the data of each slice feature according to the height and width of the convolution kernel. This yields slices of the input feature along the h*w dimension. The rearranged slice features are then input to the adaptive layout adjustment module, which adjusts the data structure of each slice feature according to whether its number of input channels is divisible by 4. Specifically, when the number of output channels of a slice feature is not divisible by 4, output channels can be added until it is, so that after adjustment all output channels of the slice feature participate in the multiplication and data redundancy is avoided.
When the number of output channels of the slice feature is divisible by 4, there is no need to adjust it. Further, each slice feature and the target convolution kernel are input to an RVV general matrix multiplication module for matrix multiplication, obtaining an intermediate feature. It will be appreciated that there are multiple slice features, so each one undergoes data rearrangement and data structure adjustment according to its number of input channels, finally yielding a plurality of intermediate features. Finally, the data structures of the plurality of intermediate features are adjusted by the C4 layout adjustment module, and all intermediate features are spliced together to obtain the output features of the convolution layer.
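The whole pipeline described above (im2col, per-position dot products against the flattened kernel, then folding the merged dimension back) can be condensed into one toy end-to-end check. This is a single-channel, stride-1, no-padding sketch standing in for the RVV GEMM microkernel; the function name and all simplifications are assumptions, not the patented implementation.

```python
def conv2d_via_matmul(x, h, w, k, kh, kw):
    """Convolve a flat h*w image with a flat kh*kw kernel by building
    each im2col column on the fly, taking its dot product with the
    kernel, and writing the result at the (i, j) output position."""
    h_out, w_out = h - kh + 1, w - kw + 1
    out = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            # the im2col column for output position (i, j)
            col = [x[(i + di) * w + (j + dj)]
                   for di in range(kh) for dj in range(kw)]
            row.append(sum(a * b for a, b in zip(col, k)))
        out.append(row)
    return out
```

Because each output position depends only on its own column, the columns can be partitioned into slices and multiplied independently, which is exactly what makes the per-slice processing of the pipeline valid.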
In some embodiments, optionally, the convolutional layer in the neural network may be optimized based on the technical solution provided by the present application, so as to implement optimization acceleration of the convolutional layer in the neural network.
At present, there is no clear optimal scheme for deploying neural networks based on the RISC-V Vector (RVV) instruction set, and the present application provides a scheme for optimized neural network deployment based on the RISC-V Vector extension instruction set architecture. In some embodiments, the application provides a complete set of optimized deployment schemes for deploying a neural network model on a RISCV device, specifically comprising tensor layout adjustment, sliding-window blocked Im2Col that reduces device memory overhead, adaptive layout adjustment that reduces redundant calculation, a performance-preference mechanism for microkernels based on the different multiply-accumulate instructions of the RVV instruction set, and the like. On a development board supporting the RVV instruction set, the actual acceleration effect on the convolution layer can reach more than 90% of peak computing performance.
In some embodiments, optionally, as shown in fig. 8, an optimization apparatus 600 of a neural network is provided, including: the conversion unit 602 is configured to extract a convolution layer in the neural network, and convert a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel; a processing unit 604, configured to perform slicing processing on the input features of the convolutional layer according to the target dimension, so as to obtain a plurality of slice features; a generating unit 606, configured to input a slice feature and a target convolution kernel to the matrix multiplication module, and generate an intermediate feature corresponding to the slice feature; and generating an output feature from the plurality of intermediate features.
According to the optimization device 600 of the neural network, the matrix multiplication module performs the multiplication of the input features of the convolution layer with the convolution kernel of the convolution layer, so that the original scalar computation of the convolution over the input features is converted into vector computation. This effectively improves the calculation efficiency of the input features and achieves the effect of improving the speed at which the convolution layer processes them. Meanwhile, the input features of the convolution layer are sliced, and during calculation only one slice feature at a time is input to the matrix multiplication module together with the target convolution kernel. This avoids the situation where multiplying a large amount of data simultaneously causes the temporary data generated during calculation to occupy a large amount of memory, effectively reducing the memory pressure of the neural network and improving the data calculation speed.
Further, the processing unit 604 is specifically configured to convert the data structure of the input feature into a first target data structure corresponding to the module parameter of the matrix multiplication module, and generate a first intermediate input feature; combining the first dimension and the second dimension of the first intermediate input feature into a target dimension by using a first preset algorithm to generate a second intermediate input feature; slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features; the first dimension is the height of the first intermediate input feature, and the second dimension is the width of the first intermediate input feature.
Further, the optimizing device of the neural network further comprises a rearrangement unit, wherein the rearrangement unit is used for acquiring the convolution kernel parameters of the target convolution kernel; and rearranging the data structure of the slice characteristics according to the convolution kernel parameters.
Further, the optimization device of the neural network further comprises an adjustment unit, wherein the adjustment unit is used for adjusting the data structure of the slice feature according to the third dimension of the slice feature after performing slice processing on the second intermediate input feature according to the target dimension to obtain a plurality of slice features; the third dimension is the dimension of the output channel of the slice feature.
Further, the adjusting unit is specifically configured to maintain the number of output channels if the number of output channels is divisible by 4; in the case where the number of output channels is not divisible by 4, increasing the output channels so that the number of output channels is divisible by 4; wherein the data of the other dimensions in the added output channel are all set to 0.
Further, the generating unit 606 is specifically configured to splice the plurality of intermediate features according to the first dimension and the second dimension, and generate an output feature.
Further, the conversion unit 602 is further configured to convert the data structure of the intermediate feature into the same data structure as the data structure of the input feature before generating the output feature from the plurality of intermediate features.
In some embodiments, optionally, the present application further provides a RISCV device, where the RISCV device supports the RISC-V Vector extension instruction set, and a neural network obtained by the method for optimizing a neural network in any one of the above technical solutions is deployed on the RISCV device. The neural network may be, for example, a convolutional neural network.
In some embodiments, optionally, the present application further provides a data processing method, comprising: obtaining data to be processed, and processing it based on the neural network obtained by the neural network optimization method of any one of the above, to obtain a processing result. The data to be processed may include voice data, image data, and the like; processing of voice data includes, but is not limited to, voice recognition, voice wakeup, voice noise reduction, and the like, which are not limited herein.
In some embodiments, optionally, a readable storage medium is provided, on which a program or an instruction is stored, which when executed by a processor, implements a method for optimizing a neural network according to any of the above technical solutions.
The readable storage medium provided by the application has a program or an instruction stored thereon, and when the program or the instruction is executed by a processor, the method for optimizing the neural network according to any one of the above technical solutions can be implemented, so that the readable storage medium has all the beneficial effects of the method for optimizing the neural network, and will not be described herein.
In some embodiments, optionally, a computer program product is proposed, comprising a computer program or instructions which, when executed by a processor, implement the steps of the method of optimizing a neural network of any of the above embodiments. The computer program product thus has all the advantages of the above-mentioned method for optimizing a neural network, which is not described in detail here.
In some embodiments, optionally, a chip is proposed, comprising a program or instructions which, when the chip runs, implement the steps of the neural network optimization method of any of the above embodiments. The chip thus has all the advantages of the above-mentioned method for optimizing a neural network, which are not repeated here.
In particular, the chip proposed by the present application may be a chip developed based on the RVV instruction set.
In the description of the present application, the term "plurality" means two or more, unless explicitly defined otherwise, the orientation or positional relationship indicated by the terms "upper", "lower", etc. are based on the orientation or positional relationship shown in the drawings, merely for convenience of description of the present application and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present application; the terms "coupled," "mounted," "secured," and the like are to be construed broadly, and may be fixedly coupled, detachably coupled, or integrally connected, for example; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, the terms "one embodiment," "some embodiments," "particular embodiments," and the like, mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above is only a preferred embodiment of the present application, and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (9)
1. A method of data processing, comprising:
acquiring data to be processed, wherein the data to be processed comprises voice data and/or image data;
processing the data to be processed based on a neural network obtained by an optimization method of the neural network, to obtain a processing result;
The optimization method of the neural network comprises the following steps:
extracting a convolution layer in the neural network, and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel;
slicing the input features of the convolution layer according to the target dimension to obtain a plurality of slice features;
inputting one slice feature and the target convolution kernel into a matrix multiplication module to generate an intermediate feature corresponding to the slice feature;
generating an output feature according to a plurality of the intermediate features;
the optimization method of the neural network before generating the output feature according to the plurality of intermediate features further comprises:
converting the data structure of the intermediate feature into the same data structure as the data structure of the input feature;
the matrix multiplication module comprises a matrix multiplication module which is assembled based on a RISC-V Vector expansion instruction set;
the input features of the convolution layer are sliced according to the target dimension to obtain a plurality of slice features, including:
converting the data structure of the input features into a first target data structure corresponding to module parameters of the matrix multiplication module, and generating first intermediate input features;
Combining the first dimension and the second dimension of the first intermediate input feature into the target dimension by using a first preset algorithm to generate a second intermediate input feature;
slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features;
wherein the first dimension is the height of the first intermediate input feature and the second dimension is the width of the first intermediate input feature;
the matrix multiplication module converts the data structure of the input features from NCHW to NC4HW4.
2. The method according to claim 1, wherein after slicing the second intermediate input feature according to the target dimension to obtain the plurality of slice features, the method further comprises:
acquiring a convolution kernel parameter of the target convolution kernel;
and rearranging the data structure of the slice characteristic according to the convolution kernel parameter.
3. The method according to claim 1, wherein after slicing the second intermediate input feature according to the target dimension to obtain the plurality of slice features, the method further comprises:
Adjusting the data structure of the slice feature according to the third dimension of the slice feature;
wherein the third dimension is an output channel dimension of the slice feature.
4. A data processing method according to claim 3, wherein said adjusting the data structure of the slice feature according to the third dimension of the slice feature comprises:
maintaining the number of output channels if the number of output channels is divisible by a preset value;
increasing the output channels so that the number of the output channels can be divided by a preset value in the case that the number of the output channels cannot be divided by the preset value;
wherein the data of other dimensions in the added output channel are all set to 0.
5. The data processing method of claim 1, wherein generating an output feature from a plurality of the intermediate features comprises:
and splicing the plurality of intermediate features according to the first dimension and the second dimension to generate the output features.
6. A data processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring data to be processed, and the data to be processed comprises voice data and/or image data;
The first processing unit is used for processing the data to be processed based on the neural network obtained by the neural network optimization method to obtain a processing result;
an optimizing apparatus of a neural network, the optimizing apparatus of the neural network comprising:
the conversion unit is used for extracting a convolution layer in the neural network and converting a data structure of a convolution kernel in the convolution layer to obtain a target convolution kernel;
the second processing unit is used for slicing the input features of the convolution layer according to the target dimension to obtain a plurality of slice features;
the generating unit is used for inputting one slice feature and the target convolution kernel into the matrix multiplication module to generate an intermediate feature corresponding to the slice feature; and
converting the data structure of the intermediate features into the same data structure as the data structure of the input features, and then generating output features according to a plurality of the intermediate features;
the matrix multiplication module comprises a matrix multiplication module which is assembled based on a RISC-V Vector expansion instruction set;
the input features of the convolution layer are sliced according to the target dimension to obtain a plurality of slice features, including:
Converting the data structure of the input features into a first target data structure corresponding to module parameters of the matrix multiplication module, and generating first intermediate input features;
combining the first dimension and the second dimension of the first intermediate input feature into the target dimension by using a first preset algorithm to generate a second intermediate input feature;
slicing the second intermediate input feature according to the target dimension to obtain a plurality of slice features;
wherein the first dimension is the height of the first intermediate input feature and the second dimension is the width of the first intermediate input feature;
the matrix multiplication module converts the data structure of the input features from NCHW to NC4HW4.
7. A RISCV device supporting a RISC-V Vector extended instruction set, characterized in that the RISCV device has means for implementing a data processing method according to any of claims 1 to 5 deployed thereon.
8. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the data processing method according to any of claims 1 to 5.
9. A chip, characterized in that the chip comprises a program or instructions which, when the chip runs, implement the steps of the data processing method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310903079.8A CN116629320B (en) | 2023-07-21 | 2023-07-21 | Neural network optimization method, device, storage medium and chip |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310903079.8A CN116629320B (en) | 2023-07-21 | 2023-07-21 | Neural network optimization method, device, storage medium and chip |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116629320A (en) | 2023-08-22
CN116629320B (en) | 2023-11-28
Family
ID=87602925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310903079.8A (CN116629320B, Active) | Neural network optimization method, device, storage medium and chip | 2023-07-21 | 2023-07-21
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116629320B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844827A (en) * | 2017-11-28 | 2018-03-27 | Beijing Horizon Information Technology Co., Ltd. | Method and apparatus for performing convolutional layer operations in a convolutional neural network
CN109359726A (en) * | 2018-11-27 | 2019-02-19 | Huazhong University of Science and Technology | A convolutional neural network optimization method based on the Winograd algorithm
CN111091181A (en) * | 2019-12-09 | 2020-05-01 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Convolution processing unit, neural network processor, electronic device and convolution operation method
CN112990458A (en) * | 2021-04-14 | 2021-06-18 | Beijing Lynxi Technology Co., Ltd. | Compression method and device for a convolutional neural network model
CN114356836A (en) * | 2021-11-29 | 2022-04-15 | Shandong Lingneng Electronic Technology Co., Ltd. | RISC-V based three-dimensionally interconnected many-core processor architecture and working method thereof
CN115983348A (en) * | 2023-02-08 | 2023-04-18 | Tianjin University | RISC-V accelerator system supporting convolutional neural network extended instructions
CN116105737A (en) * | 2023-02-13 | 2023-05-12 | University of Electronic Science and Technology of China | RISC-V-based pedestrian autonomous positioning method
- 2023-07-21: CN application CN202310903079.8A filed; granted as CN116629320B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN116629320A (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108416434B (en) | Circuit structure for accelerating convolutional layer and full-connection layer of neural network | |
EP3477556A1 (en) | Method and apparatus for performing operations in convolutional neural network | |
CN110533022B (en) | Target detection method, system, device and storage medium | |
EP3637327B1 (en) | Computing device and method | |
CN116629320B (en) | Neural network optimization method, device, storage medium and chip | |
CN111667052B (en) | Standard and nonstandard convolution consistency transformation method of special neural network accelerator | |
CN117312388B (en) | Artificial intelligence model control system | |
CN117521752A (en) | Neural network acceleration method and system based on FPGA | |
CN117196000A (en) | Edge side model reasoning acceleration method for containerized deployment | |
CN111626410B (en) | Sparse convolutional neural network accelerator and calculation method | |
CN108090865B (en) | Optical satellite remote sensing image on-orbit real-time streaming processing method and system | |
US6161056A (en) | Placement method and apparatus | |
CN113407904B (en) | Winograd processing method, system and medium compatible with multi-dimensional convolutional neural network | |
CN115131579A (en) | Lightweight residual neural network model | |
JPH0590141A (en) | Data forming equipment for charged beam writing | |
CN114239814A (en) | Training method of convolution neural network model for image processing | |
CN115735224A (en) | Non-extraction image processing method and device | |
JP4996972B2 (en) | Mask data generation method and mask data generation system | |
JP3916955B2 (en) | Optimal power flow calculation method for power system | |
Krecinic et al. | Data path development for multiple electron beam maskless lithography | |
US5984505A (en) | Block exposure of semiconductor wafer | |
CN118211658A (en) | Method, device and equipment for accelerating diffusion model based on sparse matrix | |
US8136061B2 (en) | Method of logic circuit synthesis and design using a dynamic circuit library | |
US8347069B2 (en) | Information processing device, information processing method and computer readable medium for determining a processing sequence of processing elements | |
CN114563915A (en) | Overlay mark mask-based optimization method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||