CN118333127A - Data processing method and device and data processing chip - Google Patents

Data processing method and device and data processing chip

Info

Publication number
CN118333127A
CN118333127A (Application No. CN202410742115.1A)
Authority
CN
China
Prior art keywords
data
target
target data
matrix
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410742115.1A
Other languages
Chinese (zh)
Inventor
陈勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dingdao Zhixin Shanghai Semiconductor Co ltd
Original Assignee
Dingdao Zhixin Shanghai Semiconductor Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingdao Zhixin Shanghai Semiconductor Co ltd filed Critical Dingdao Zhixin Shanghai Semiconductor Co ltd
Priority to CN202410742115.1A priority Critical patent/CN118333127A/en
Publication of CN118333127A publication Critical patent/CN118333127A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The application discloses a data processing method, a data processing device and a data processing chip, belonging to the field of artificial intelligence. The data processing chip comprises a plurality of available hardware processing channels and a plurality of second operation components. Each available hardware processing channel comprises a first operation component, which is used for acquiring corresponding first target data in at least one first target data sub-object, acquiring second target data corresponding to that first target data from a second data object corresponding to a target data processing channel, and performing first operation processing on the acquired first and second target data; the first target data in the at least one first target data sub-object comprise all valid data of the first data object corresponding to the target data processing channel. Each second operation component is used for performing second operation processing on the first operation result output by the corresponding available hardware processing channel. The number of first target data sub-objects is smaller than the number of first data sub-objects in the first data object.

Description

Data processing method and device and data processing chip
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a data processing method, a data processing device and a data processing chip.
Background
A neural network (NN) is a complex network system formed by a large number of simple, widely interconnected processing units called neurons; a Transformer network is a neural network based on the self-attention mechanism. At present, data processing based on neural networks such as Transformer networks suffers from problems including low computational performance, high resource requirements for storage and transmission, and low resource utilization, and solving at least some of these problems has become a technical difficulty in the art.
Disclosure of Invention
Therefore, the application discloses the following technical scheme:
a data processing chip, comprising: a plurality of available hardware processing channels and a plurality of second arithmetic components;
Each available hardware processing channel comprises a first operation component. The first operation component is used for acquiring corresponding first target data in at least one first target data sub-object, where the first target data in the at least one first target data sub-object comprise all valid data of each first data sub-object in the first data object corresponding to the target data processing channel, and each first target data corresponds to position information indicating the original position of that first target data in the first data object; acquiring, according to the position information corresponding to the corresponding first target data, second target data corresponding to that first target data from the second data object corresponding to the target data processing channel; and performing first operation processing on the acquired first target data and second target data to obtain a first operation result;
Each second operation component is used for carrying out second operation processing on the first operation result output by the corresponding available hardware processing channel to obtain a second operation result;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
Optionally, the first data object includes a first data matrix having a plurality of data to be processed;
the first data sub-object is a column in the first data matrix;
The at least one first target data sub-object comprises: moving the effective data of the corresponding columns in the first data matrix to the positions of the ineffective data of other columns except the corresponding columns in the first data matrix, and obtaining a first target column at least comprising the effective data; the data of the same column in the first data matrix are in the same first target column after the movement is completed, the data of the same row are in different first target columns after the movement is completed, and the number of the first target columns is smaller than that of columns contained in the first data matrix;
The second data object includes a second data matrix having a plurality of data to be processed.
Optionally, the first operation processing includes a multiplication operation, where the multiplication operation includes a multiplication operation involved in matrix multiplying the first data matrix and the second data matrix; the second operation processing includes an accumulation operation;
the first operation component is specifically configured to, when performing a first operation process on the obtained first target data and second target data: performing multiplication operation on the first target data and the corresponding second target data in the corresponding first target column to obtain a multiplication operation result;
the second operation component is specifically configured to, when performing a second operation process on the first operation result output by the corresponding available hardware processing channel: and accumulating the multiplication operation results output by the corresponding available hardware processing channels to obtain accumulation operation results.
Optionally, the data processing chip further comprises a routing component;
each of the available hardware processing channels further comprises an independent storage component for storing first target data allocated to the corresponding available hardware processing channel;
The routing component is used for sending the multiplication operation result output by the available hardware processing channel to the corresponding second operation component for accumulation processing according to the position information corresponding to the first target data and the second target data in the available hardware processing channel;
The position information corresponding to the second target data is used for indicating the position of the second target data in the second data object; and the multiplication operation results corresponding to the first target data in the same row in the first data matrix are sent to the same second operation component for accumulation processing.
A data processing method, comprising:
Acquiring at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object;
Acquiring second target data corresponding to each first target data from a second data object corresponding to the target data processing channel according to the position information corresponding to each first target data in the at least one first target data sub-object;
performing data processing on each first target data and the corresponding second target data;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
Optionally, the method for forming the first target data sub-object includes:
acquiring a first data object corresponding to a target data processing channel in a model;
Moving the valid data of the corresponding first data sub-object in the first data object to the positions of invalid data in other first data sub-objects, so as to reduce the number of first data sub-objects contained in the first data object;
Wherein the at least one first target data sub-object comprises: the first data sub-objects, each containing at least valid data, obtained after the movement is completed; the other first data sub-objects include the first data sub-objects of the first data object other than the corresponding first data sub-object.
Optionally, the first data object includes a first data matrix with a plurality of data to be processed, and the first data sub-object is a column in the first data matrix;
the moving the valid data of the corresponding first data sub-object in the first data object to the position of the invalid data of other first data sub-objects includes:
Moving the valid data of the corresponding columns in the first data matrix to the positions of invalid data of other columns except the corresponding columns in the first data matrix so as to reduce the number of columns contained in the first data matrix;
wherein the at least one first target data sub-object comprises: the first target columns, each containing at least valid data, obtained after the movement is completed; data from the same column of the first data matrix lie in the same first target column after the movement is completed, and data from the same row lie in different first target columns.
Optionally, the method for forming the first target data sub-object further includes:
And performing data-order adjustment on the data in each first target column, so that the valid data originally belonging to the same column of the first data matrix are arranged contiguously in the first target column where they are located after the movement.
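The order adjustment described above can be sketched in software as follows, assuming (as an illustrative convention not specified by the application) that each entry of a first target column is kept as a (value, row index, column index) triple:

```python
def reorder_target_column(target_col):
    """Stable-sort a packed first target column by original column index,
    so valid data that came from the same column of the first data matrix
    end up contiguous; within one column, the original row order is kept."""
    return sorted(target_col, key=lambda entry: entry[2])  # entry = (value, row, col)

# Entries from original columns 2 and 0, interleaved after the move:
packed = [(5, 0, 2), (1, 1, 0), (7, 3, 2), (4, 2, 0)]
# After reordering, column-0 entries precede column-2 entries, each in row order
print(reorder_target_column(packed))
```

Python's `sorted` is stable, which is what preserves the relative order of entries within each original column.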
Optionally, the second data object includes a second data matrix having a plurality of data to be processed;
The obtaining, according to the location information corresponding to each first target data in the at least one first target data sub-object, second target data corresponding to each first target data from the second data object corresponding to the target data processing channel includes:
Acquiring second target data corresponding to each first target data from a second target column of the second data matrix according to the position information of each first target data in each first target column; the second target column is a column currently to be processed in the second data matrix.
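In the matrix-product setting, a first target datum that originally sat in column k of the first data matrix is paired with the k-th entry of the second target column currently being processed. A minimal sketch of this gather step, again assuming illustrative (value, row, col) triples:

```python
def gather_second_targets(target_col, second_target_col):
    """For each packed weight (value, row, col), fetch the feature value at
    index `col` of the second target column currently being processed."""
    return [(value, row, second_target_col[col]) for value, row, col in target_col]

packed = [(1, 0, 0), (4, 3, 0), (2, 1, 1), (3, 2, 2)]
feature_col = [10, 20, 30]   # the column currently to be processed in the second matrix
print(gather_second_targets(packed, feature_col))
```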
Optionally, the data processing for each first target data and the corresponding second target data includes:
according to the number of available hardware processing channels capable of being processed in parallel, distributing a corresponding number of first target data to each available hardware processing channel, and carrying out data processing on the distributed first target data and the corresponding second target data by utilizing the available hardware processing channels;
the number of available hardware processing channels required for processing the first target data is smaller than the number of available hardware processing channels required for processing the data to be processed in the first data object.
Optionally, the valid data originally belonging to the same column in the first data matrix are continuously arranged in a first target column where the valid data is located after moving; the data processing comprises multiply-accumulate processing, wherein the multiply operation in the multiply-accumulate processing comprises multiply operation involved in matrix multiplication of the first data matrix and the second data matrix;
the data processing for each first target data and the corresponding second target data comprises the following steps:
the first target data of the same group in each first target column are respectively distributed to consecutive available hardware processing channels in the channel array; the first target data of the same group comprise the first target data in the first target column that belong to the same column of the first data matrix;
Determining second target data corresponding to the first target data of the same group according to the position information corresponding to the first target data of the same group; the first target data in the same group correspond to the same second target data;
The determined second target data are distributed to available hardware processing channels where the corresponding groups of first target data are respectively located, and the first operation components provided by the available hardware processing channels are used for carrying out multiplication operation on the distributed first target data and second target data to obtain multiplication operation results;
According to the position information corresponding to each first target data, the multiplication operation results corresponding to the first target data belonging to the same row in the first data matrix are sent to the same second operation assembly for accumulation processing, and an accumulation operation result is obtained;
And the position information corresponding to the second target data is used for indicating the position of the second target data in the second data object.
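The routing rule above, under which multiplication results of first target data from the same row are accumulated together, can be modeled in software as follows (a simplified sketch of the channel-to-accumulator mapping, not the actual chip logic):

```python
def route_and_accumulate(products, num_rows):
    """products: (partial product, original row index) pairs emitted by the
    parallel channels; each product is routed to the accumulator of its row."""
    accumulators = [0] * num_rows
    for partial, row in products:
        accumulators[row] += partial   # same original row -> same accumulator
    return accumulators

# Two partial products for row 0 and one for row 2 of a 3-row weight matrix:
print(route_and_accumulate([(10, 0), (40, 2), (60, 0)], 3))
```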
A data processing apparatus comprising:
A first acquisition module for acquiring at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object;
the second acquisition module is used for acquiring second target data corresponding to each first target data from the second data object corresponding to the target data processing channel according to the position information corresponding to each first target data in the at least one first target data sub-object;
The data processing module is used for carrying out data processing on each first target data and the corresponding second target data;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
An electronic device comprising at least:
A memory for storing a set of computer instructions;
a processor for implementing a data processing method as provided in any one of the above by executing the set of computer instructions.
A readable storage medium having stored thereon a set of computer instructions for invocation and execution by a processor to implement a data processing method as claimed in any one of the preceding claims.
And, there is also provided a computer program product comprising a computer program/instruction which, when executed by a processor, implements a data processing method as claimed in any preceding claim.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments or by the description of the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is an example of a sparse weight matrix and sparse feature map provided by the present application;
FIG. 2 is a schematic flow chart of a data processing method according to the present application;
FIG. 3 is an example of a matrix multiplication operation provided by the present application;
FIG. 4 is a flowchart of a method for forming a first target data sub-object provided by the present application;
FIG. 5 is an example of compressing a first data object from a row direction provided by the present application;
FIG. 6 is an example of compressing a first data object from a column direction provided by the present application;
FIG. 7 is an example of a first target column resulting from compression of a first data object from row and column directions provided by the present application;
FIG. 8 is a schematic flow chart of another method for processing data according to the present application;
FIG. 9 is a schematic diagram of the internal structure of a PE according to the present application;
FIG. 10 is a schematic flow chart of data processing on each first target data and corresponding second target data according to the present application;
FIG. 11 is a schematic diagram of indexing the valid second target data for each group provided by the present application;
FIG. 12 is a diagram of a mapping connection between PE arrays and an accumulator in accordance with the present application;
FIG. 13 is a block diagram showing the constitution of a data processing apparatus according to the present application;
FIG. 14 is a block diagram showing the constitution of a data processing chip according to the present application;
Fig. 15 is a component configuration diagram of an electronic device provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
At present, in the data processing based on the neural network model, a series of problems such as low calculation performance, high resource requirements for storage and transmission, low resource utilization rate and the like exist.
Specifically, in a neural network model, quantization and pruning operations are often performed on the network-layer weights, so that the weight matrix contains a large number of 0 values; meanwhile, because of the ReLU (activation) operation, the feature map also contains a large number of 0 values. For example, there are a large number of 0 values in the weight matrix and feature map shown in FIG. 1 (where gray blocks represent non-0 values and white blocks represent 0 values). This phenomenon of widespread 0 values in a network is called sparsification. In particular, in a Transformer network, 0 values (sparsity) are even more common owing to the local relevance of tokens (a token is a minimum unit with independent semantics; each token represents an independent unit, carries a certain semantic meaning, and can be modeled).
The applicant has found that the operations in a neural network consist mainly of multiplication and addition (such as the multiply-add operations involved in the matrix multiplication of a weight matrix in a Transformer network), and that a 0 value makes no contribution to the final calculation result. If only valid values are transmitted and stored during data transmission and storage, the bandwidth required for transmission and storage can be greatly reduced; and if 0 values are skipped during calculation, the computational performance of the system can be greatly improved and its resource utilization raised.
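As a toy illustration of the saving from skipping 0 values (a software sketch, not the hardware scheme of the application), a multiply-accumulate over one sparse weight row gives an identical result with fewer multiplications:

```python
def dense_mac(weights, features):
    """Multiply-accumulate over every weight, zeros included."""
    return sum(w * x for w, x in zip(weights, features))

def sparse_mac(weights, features):
    """Multiply-accumulate that skips 0-value weights entirely."""
    return sum(w * features[i] for i, w in enumerate(weights) if w != 0)

weights = [0, 3, 0, 0, 5]     # a sparse weight row
features = [7, 2, 9, 4, 1]
assert dense_mac(weights, features) == sparse_mac(weights, features)  # same result
print(sparse_mac(weights, features))  # 2 multiplications instead of 5
```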
However, the related hardware currently responsible for data processing of neural network models, such as related commercial chips, does not support weight sparsification processing: 0-value weights still participate in processing and occupy calculation time. How to improve computational performance, reduce data storage and bandwidth demands, and raise resource utilization and calculation efficiency by exploiting the sparsity of the weights in a model network has therefore become a difficulty.
Based on the above, the application provides a data processing method, a data processing device and a data processing chip, aimed mainly at the matrix multiplication of weight matrices in Transformer and other neural networks. Exploiting the fact that the weight matrix is fixed and known, it provides an inner-product multiplication structure that optimizes the multiplication of sparse matrices for efficiency, so as to solve the various problems in the prior art.
The data processing method, the data processing device and the data processing chip provided by the application can be applied to, but not limited to, personal computers or servers and other electronic equipment.
Referring to the flow chart of the data processing method shown in fig. 2, the provided data processing method at least comprises the following processing steps:
Step 201, obtaining at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object.
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
The method provided by the application can be applied to various fields such as natural language processing, image processing, video processing, voice recognition, industrial detection (such as equipment defect detection) and the like.
The embodiments of the present application mainly take the data processing of a neural network model (such as a Transformer-based neural network model) as an example to explain the scheme.
The target data processing channel can be, but is not limited to, an input channel of a network layer of the neural network model. For example, for image processing based on a neural network model, the target data processing channel may specifically include the R, G and B primary-color input channels of the model network layer and any one or more of a texture input channel and a semantic input channel.
Valid data in the first data object are, specifically, the data contained in the first data object that contribute to the data processing of the first data object; data contained in the first data object that make no such contribution are regarded as non-valid or invalid data of the first data object.
Optionally, the first data object is a data matrix including a plurality of data to be processed, and each first data sub-object is a column of that matrix. For the data processing scenario of a neural network model, the first data object may specifically be the weight matrix corresponding to a given input channel of a network layer of the model, each first data sub-object may be a column of that weight matrix, and the valid data in a first data sub-object may be the non-0 values in the corresponding column of the weight matrix: a non-0 weight is regarded as valid data because it contributes to the operation of the model network, whereas a 0-value weight, which contributes nothing to that operation, is regarded as invalid data.
In order to improve the calculation performance of a system, reduce the storage and transmission bandwidth required by the system and improve the resource utilization rate of the system, the embodiment of the application provides a technical idea for compressing data in a first data object (such as a weight matrix of a neural network model network layer) so as to reduce the number of first data sub-objects contained in the first data object, thereby optimizing the data processing (such as matrix multiplication operation of optimizing a sparse matrix) of the first data object, and solving various problems existing in the prior art based on the technical idea.
Specifically, for the case where the first data object is a data matrix and each first data sub-object is a column of that matrix, the data in the first data object can be compressed along the row direction, aggregating all valid data in the original columns of the data matrix into a subset of those columns, thereby reducing the number of first data sub-objects contained in the first data object and optimizing its data processing.
More specifically, by letting the valid data of a given original column of the data matrix occupy the positions of invalid data (such as 0-value weights) in one or more other original columns, all valid data of the data matrix are gathered into a subset of its columns: this subset contains at least all the valid data, while the remaining columns contain no valid data at all. The columns containing no valid data can then be removed directly, compressing the data matrix of the first data object and reducing the number of first data sub-objects (columns) it contains.
The at least one first target data sub-object is the set of first data sub-objects, each containing at least valid data, obtained by compressing the data of the first data sub-objects of the first data object in this way, for example, the columns containing at least valid data obtained by gathering all the valid data of the original columns of the data matrix. The first data sub-objects (columns) containing no valid data are eliminated and do not participate in subsequent data processing of the first data object.
It is emphasized that, to achieve effective compression, the number of first target data sub-objects obtained after compression must be smaller than the number of first data sub-objects in the first data object, and the first target data in the at least one first target data sub-object must include at least all the valid data of each first data sub-object in the first data object. For example, under this compression idea, the number of columns containing at least valid data obtained after aggregating the valid data of the data matrix is smaller than the number of original columns of the data matrix; and those columns contain at least all the valid data of the original columns, beyond which they may also contain some invalid data or, depending on the actual situation, none at all.
That is, the compression removes at least part of the invalid data in the first data object while retaining all the valid data, so that the amount of data to be processed is reduced without affecting the data processing result of the first data object, ensuring the accuracy of that result.
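One software realization of this compression idea is a greedy first-fit packing that merges original columns whose non-zero row sets are disjoint; this keeps each original column intact in a single target column and guarantees that entries from the same row never share a target column. The (value, row, col) bookkeeping below is an illustrative assumption:

```python
def compress_columns(matrix):
    """Pack the non-zero entries of `matrix` into fewer target columns.
    A target column holds whole original columns whose non-zero row sets
    are disjoint, and every entry keeps its (value, row, col) position info."""
    rows, cols = len(matrix), len(matrix[0])
    targets = []    # packed (value, row, col) triples per target column
    occupied = []   # row indices already used in each target column
    for c in range(cols):
        entries = {r: matrix[r][c] for r in range(rows) if matrix[r][c] != 0}
        if not entries:
            continue              # a column with no valid data is dropped
        for used, target in zip(occupied, targets):
            if used.isdisjoint(entries):   # first fit: no same-row clash
                used.update(entries)
                target.extend((v, r, c) for r, v in entries.items())
                break
        else:
            occupied.append(set(entries))
            targets.append([(v, r, c) for r, v in entries.items()])
    return targets

sparse = [[1, 0, 0],
          [0, 2, 0],
          [0, 0, 3],
          [4, 0, 0]]
packed = compress_columns(sparse)
print(len(packed))  # the three original columns fit into one target column
```

Greedy first-fit is only one possible packing policy; it will not always produce the minimum number of target columns, but it respects both placement constraints stated above.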
For the case where the first data object is the weight matrix of a neural network model network layer, in practical applications the weight matrix of the model network layer may be compressed based on the above idea once model training is completed, and the at least one first target data sub-object obtained from the compression (for example, the columns each containing at least valid data) is stored; when data processing with the model is required, the stored data of the at least one first target data sub-object can be read directly and given the required processing. Alternatively, in other embodiments, the weight matrices of the network layers of the model can be compressed in real time whenever data processing with the model is required, and the at least one first target data sub-object obtained after compression given the required processing; which option to use can be determined according to the actual application requirements.
Each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object. The location information corresponding to the first target data may specifically include a row index and a column index, which are respectively used to indicate an original row and an original column where the first target data is located in the first data object.
Step 202, according to the position information corresponding to each first target data in the at least one first target data sub-object, obtaining second target data corresponding to each first target data from the second data object corresponding to the target data processing channel.
Alternatively, the second data object may equally be a data matrix comprising a plurality of data to be processed. For the data processing scenario of the neural network model, the second data object may specifically be a feature map corresponding to a corresponding input channel of the neural network model network layer, and the second target data may then be a corresponding feature value in the feature map.
The second data object may also include valid data and invalid data, where the valid data of the second data object refers specifically to data included in the second data object and having a contribution value to data processing of the second data object, and the data included in the second data object and not having a contribution value to data processing of the second data object is regarded as non-valid data or invalid data of the second data object. Taking the second data object as a feature map as an example, determining a non-0 feature value in the feature map as effective data of the feature map based on the characteristic that whether the feature value has a contribution value to data operation of the feature map, and considering the 0-value feature value as ineffective data.
The feature map may be derived from, but is not limited to, various types of data to be processed, such as images, speech, etc., depending on the specific application scenario.
Each data to be processed in the first data object corresponds to corresponding data to be processed in the second data object, so as to be matched into a data pair to be processed between the first data object and the second data object; required data processing is then carried out on the data pair, for example, performing a multiplication operation on the two data to be processed contained in the pair, and accumulating the multiplication results of the corresponding different data pairs.
Whether certain to-be-processed data in the first data object is matched with certain to-be-processed data in the second data object (namely, whether the to-be-processed data is matched into corresponding to-be-processed data pairs) or not depends on the positions of the two to-be-processed data in the data objects, and the data in the matched positions between the first data object and the second data object are correspondingly matched to be the matched to-be-processed data. Further, the matching location between the first data object and the second data object is dependent on the data processing rules for the first data object and the second data object.
Based on this, for each first target data (essentially the corresponding data to be processed in the first data object) in the at least one first target data sub-object, second target data corresponding to the first target data may be obtained from the second data object corresponding to the target data processing channel according to the location information corresponding to the first target data.
Specifically, according to the data processing rules for the first data object and the second data object, the position information in the second data object that matches the position information corresponding to the first target data may be determined, and the data to be processed at the position indicated by the matched position information in the second data object may be acquired as the second target data corresponding to the first target data. More specifically, for the case that the first and second data objects are data matrices, according to the data processing rules for the first data object and the second data object, a row index and a column index in the second data object matching the row index and column index corresponding to the first target data may be determined, and the data to be processed at the row-column position indicated by the matched row index and column index in the second data object acquired as the second target data corresponding to the first target data.
The data processing of the first data object and the second data object in the embodiment of the application mainly refers to matrix multiplication of the first data object and the second data object, such as matrix multiplication of a weight matrix and a feature map; the application provides an efficient inner-product multiplication structure that exploits the characteristic that the weight matrix is fixed to optimize the multiplication of a sparse matrix. The data processing rule for the first data object and the second data object is, in particular, the inner-product-based matrix multiplication rule.
The inner-product-based matrix multiplication is applicable to, but not limited to, matrix multiplication of the weight matrix and the feature map in a large language model (Large Language Model, LLM) network layer, for example, matrix multiplication of the weight matrix and the feature map in a network layer of a large language model based on the Transformer network, and the like.
For inner-product-based matrix multiplication, the rows of the data matrix of the first data object are multiplied with the columns of the data matrix of the second data object. Specifically, the data in each row of the data matrix of the first data object are put in one-to-one correspondence, in order, with the data in each column of the data matrix of the second data object; the data pairs formed by data at corresponding positions are multiplied, and the multiplication results obtained from the same row of the first data object and the same column of the second data object are accumulated.
Wherein, for rows in the first data object, the precedence refers to a left-to-right order, and for columns in the second data object, the precedence refers to a top-to-bottom order.
It will be readily appreciated that the inner-product-based matrix multiplication described above essentially requires matching data whose column index in the row of the first data object currently participating in the operation (i.e. the row to be processed) equals the row index in the column of the second data object currently participating in the operation (i.e. the column to be processed) into pairs of data to be processed. For example, in the example of fig. 3, assume that the first data object is Matrix A, the second data object is Matrix B, the row currently participating in the operation in Matrix A is its first row, and the column currently participating in the operation in Matrix B is its first column. In the inner-product operation for the first row of Matrix A and the first column of Matrix B, data whose column index in the first row of Matrix A equals the row index in the first column of Matrix B must be matched into pairs of data to be processed, such as the pairs represented by (11, 1), (41, 4) and (71, 7) in this example. In Matrix A and Matrix B, white boxes represent invalid data and gray boxes represent valid data.
Based on this, for the inner-product-based matrix multiplication, in this step the data at the row position corresponding to the column index may be obtained, according to the column index corresponding to the first target data, from the column of the second data object currently participating in the operation, and used as the second target data corresponding to the first target data. The second target data is the data to be processed matched with the first target data, to be paired with it into a data pair to be processed that participates in subsequent data processing.
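The matching rule above can be sketched in a few lines of Python (an illustration under the assumption that the valid data of the current row of the first matrix are given together with their column indexes; this is not the patent's hardware structure):

```python
def sparse_row_dot(row_entries, b_column):
    """Inner product of one sparse row of the first matrix with one column
    of the second matrix.

    row_entries: list of (value, col_index) pairs, the valid data of the
                 current row of the first data matrix.
    b_column:    the current column of the second data matrix, as a list.
    """
    acc = 0
    for value, j in row_entries:
        # Second target data: the datum at row position j of the B column,
        # fetched by the column index carried by the first target datum.
        acc += value * b_column[j]
    return acc

# Row [11, 0, 0, 41, 0, 0, 71, 0] has valid data at columns 0, 3, 6
# (0-indexed), mirroring the (11, 1), (41, 4), (71, 7) pairs of fig. 3.
row = [(11, 0), (41, 3), (71, 6)]
col_b = [1, 0, 0, 4, 0, 0, 7, 0]
assert sparse_row_dot(row, col_b) == 11 * 1 + 41 * 4 + 71 * 7
```

Only the valid data participate; the 0-value entries of the row never enter the loop, which is the computational saving the compression aims at.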
It should be noted that data processing between the weight matrix and the feature map in a neural network mainly involves two kinds of operation: convolution and matrix multiplication (such as the inner-product-based matrix multiplication in the present application). Currently popular large language models, such as those based on the Transformer network, adopt matrix multiplication between the weight matrix and the feature map. In practical application, a convolution operation can be converted into matrix-multiplication form through a corresponding conversion rule, and the data processing method provided by the embodiment of the application can then be used to implement the convolution operation.
The neural network model may perform one-dimensional, two-dimensional or three-dimensional convolution on the feature map; this is not limited and can be determined according to actual requirements. For example, for a one-dimensional convolution kernel of size 1x3, a one-dimensional convolution may be performed on the feature map based on a 1x3 weight matrix, and for a two-dimensional convolution kernel of size 3x3, a corresponding two-dimensional convolution may be performed on the feature map based on a 3x3 weight matrix.
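The convolution-to-matrix-multiplication conversion mentioned above can be illustrated with a tiny one-dimensional example (a generic im2col-style sketch, not a rule taken from the patent): each sliding window of the input becomes one row of an unfolded matrix, and the convolution result is the product of that matrix with the kernel vector.

```python
def conv1d_as_matmul(kernel, signal):
    """Compute a 1D convolution (no padding, stride 1) by first unfolding
    the signal into windows, then taking one dot product per window."""
    k, n = len(kernel), len(signal)
    # im2col-style unfold: each window is one row of the unfolded matrix.
    windows = [signal[i:i + k] for i in range(n - k + 1)]
    # Matrix-vector product: dot each row with the kernel vector.
    return [sum(w * x for w, x in zip(kernel, win)) for win in windows]

# 1x3 edge-detector kernel over a length-5 signal.
assert conv1d_as_matmul([1, 0, -1], [3, 1, 4, 1, 5]) == [-1, 0, -1]
```

Once in this form, the unfolded matrix plays the role of the feature map and the kernel the role of the weight matrix, so the inner-product method of the application applies unchanged.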
And 203, performing data processing on each first target data and the corresponding second target data.
Alternatively, the data processing performed on the respective first target data and the corresponding second target data may include multiply-accumulate processing: each first target data and its corresponding second target data are multiplied, and the corresponding multiplication results are then accumulated. The application is not limited thereto, however, and the data processing to be executed can be determined according to the actual application requirements.
After performing the multiplication operation on each first target data and the corresponding second target data, the multiplication operation results corresponding to the first target data in the same row in the first data object may be accumulated for the corresponding column of the second data object currently participating in the processing, and the accumulated result may be used as one result data of the data processing results of the first data object and the second data object.
In summary, according to the data processing method provided by the embodiment of the application, based on the sparsification characteristic of the data in the first data object, the data contained in the first data object is compressed: the first data sub-objects in the first data object are compressed into at least one first target data sub-object, so that the number of first target data sub-objects is smaller than the number of first data sub-objects. This effectively reduces the data processing amount for the first data object, improves the computing performance of the system, reduces the resource requirements for storage, transmission, operation and the like, and improves system resource utilization and operation efficiency. For application scenarios such as natural language processing, image processing, video processing, voice recognition and industrial detection, the processing efficiency of the corresponding applications can accordingly be improved, along with the utilization of system resources.
Meanwhile, the first target data in the at least one first target data sub-object obtained after compression include all the valid data of each first data sub-object in the first data object, so the compression only removes at least part of the invalid data in the first data object while all valid data are retained; the data processing result of the first data object is therefore not affected, and the accuracy of the result is ensured. Moreover, because the application removes at least part of the invalid data in the first data object in software, that is, before the first data object is sent to hardware for processing, the data processing method provided by the application remains applicable to existing hardware that does not support weight sparsification (where 0-value weights still participate in processing and occupy computation time), such as related commercial chips.
In an alternative embodiment, referring to the flowchart of the forming method of the first target data sub-object shown in fig. 4, based on the compression concept described above, the forming method of the first target data sub-object by compressing data in the first data object includes:
step 401, a first data object corresponding to a target data processing channel in a model is acquired.
The model here is a neural network model, which may be, but is not limited to, a large language model based on the Transformer network. The target data processing channel may be, but is not limited to, an input channel of a neural network model network layer.
In this embodiment, the first data object is a first data matrix having a plurality of data to be processed, and the first data sub-objects in the first data object are columns in the first data matrix. For the data processing scene of the neural network model, the first data object may specifically be a weight matrix corresponding to the input channel of the model network layer.
In this step, a weight matrix corresponding to an input channel of a neural network model network layer can be acquired as the first data object when training of the neural network model is completed, so as to realize compression of the data in the first data object in combination with the subsequent steps. However, the method is not limited thereto: when data processing needs to be performed using the model after training is completed, a weight matrix corresponding to the input channel of the network layer may instead be acquired in real time as the first data object, so as to perform the required data compression on it.
Step 402, moving the valid data of the corresponding first data sub-object in the first data object to the position of the invalid data of other first data sub-objects, so as to reduce the number of first data sub-objects included in the first data object.
In this embodiment, the effective data in the first data object is aggregated by moving the effective data of the corresponding first data sub-object in the first data object to the position where the ineffective data of other first data sub-objects is located, and the effective data is aggregated from each original first data sub-object into a part of first data sub-objects, so as to reduce the number of the first data sub-objects included in the first data object.
Wherein the at least one first target data sub-object comprises: the first data sub-objects that contain at least valid data after the movement is completed. The other first data sub-objects include the first data sub-objects in the first data object other than the corresponding first data sub-object.
For the case that the first data object is a first data matrix, such as a weight matrix, the data in the first data matrix is compressed at least from the row direction; that is, the valid data of a corresponding column in the first data matrix is moved to the positions of the invalid data of the other columns in the first data matrix, so as to reduce the number of columns included in the first data matrix. In this case, the at least one first target data sub-object correspondingly comprises: the first target columns that contain at least valid data, obtained after the movement is completed.
When moving the valid data of the corresponding column in the first data matrix to the positions of the invalid data of the other columns, optionally, based on a leftward compression mode, the valid data of the corresponding column may be moved to the positions of the invalid data (such as 0-value weights) of the other columns on the left of that column, so that the valid data occupy those positions. However, the method is not limited thereto: the valid data of the corresponding column may instead be moved, based on a rightward compression mode, to the positions of the invalid data (such as 0-value weights) of the other columns on the right of that column. Through either leftward or rightward compression, the valid data are gathered into a subset of the columns of the first data matrix, so that this subset of columns contains at least the valid data while the remaining columns contain no valid data.
Preferably, after the movement is completed, the data in the same column of the first data matrix are in the same first target column, and the data in the same row are in different first target columns.
The following examples are given:
Referring to fig. 5, assume that the first data object is the sparse Matrix A in fig. 5, and the first data sub-objects are the columns of Matrix A. Matrix A contains 8 columns of data, with column indexes 1, 2, 3, ..., 8 from left to right; blank squares indicate 0-value data in Matrix A, i.e. invalid data, and non-blank (gray) squares indicate non-0-value data, i.e. valid data. As shown in fig. 5, Matrix A is compressed leftward in the row direction: the valid data "25" in the 2nd column and "38" in the 3rd column are moved to positions of invalid data in the 1st column; the valid data "52" and "55" in the 5th column and "69" in the 6th column are moved to positions of invalid data in the 4th column; and the valid data "86" and "89" in the 8th column are moved to positions of invalid data in the 7th column. The at least one first target data sub-object obtained after compressing Matrix A in this way consists of the three first target columns that contain at least valid data, so that the number of first target data sub-objects is smaller than the number of columns originally contained in Matrix A.
In practical application, the data of the matrix a may be compressed from the row direction by adopting a rightward compression manner, which is not limited. Different compression modes to the left and right will generally result in different compression results, for example, if the matrix a is compressed to the right, the valid data "52" and "55" in the 5 th column may be compressed into the last column, i.e. the 8 th column. It should be noted that, although the compression results of the matrix a are different due to different compression modes to the left and right, since the data processing is performed on the first target data by the position information (for indicating the original position of the first target data in the matrix a) corresponding to each first target data after compression, the position information corresponding to each first target data is fixed, so that the data processing result of the matrix a is not affected, and both modes can ensure the accuracy of the data processing result of the matrix a.
The first target data sub-object in this example is the first target column described above. After the movement is completed, the data in the same column of Matrix A are in the same first target column, and the data in the same row are in different first target columns; for example, the data "52" and "55" in the 5th column are in the same target column after the movement, the data "86" and "89" in the 8th column are in the same target column, and the data "11", "41" and "71" in the first row are in different first target columns.
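The leftward row-direction compression illustrated above can be sketched as follows. This is a simplification for illustration (each row's valid data simply slide left while keeping their original column indexes; when the rows' sparsity patterns align as in the figure, data from the same original column land in the same target column):

```python
def compress_left(matrix):
    """Row-direction leftward compression: in each row, non-zero (valid)
    data slide left into the positions of 0-value (invalid) data.
    Each cell of the result is (value, original_column_index), or None
    for a position that holds no valid data after the move."""
    n_cols = len(matrix[0])
    out = []
    for row in matrix:
        packed = [(v, c) for c, v in enumerate(row) if v != 0]
        out.append(packed + [None] * (n_cols - len(packed)))
    return out

# Two rows of an 8-column sparse matrix (0-indexed columns).
A = [[11, 0, 0, 41, 0, 0, 71, 0],
     [0, 25, 0, 0, 52, 0, 0, 86]]
packed = compress_left(A)
assert packed[0][:3] == [(11, 0), (41, 3), (71, 6)]
assert packed[1][:3] == [(25, 1), (52, 4), (86, 7)]
```

After the move, only the first three columns of the result hold valid data; the remaining columns are entirely `None` and need not be processed at all.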
According to the embodiment, the data in the first data object is gathered from the original first data sub-objects into the first data sub-objects with partial quantity by compressing the data in the first data object, so that the quantity of the first data sub-objects contained in the first data object is reduced, and the data quantity to be processed of the first data object is correspondingly reduced, thereby improving the computing performance of the system, reducing the resource demands of storage, transmission, operation and the like, and improving the resource utilization rate of the system.
In addition, for the case that the first data object is the first data matrix, in this embodiment, by controlling the data in the same column of the first data matrix to be in the same first target column after the movement is completed, and the data in the same row to be in different first target columns, the subsequent data processing performed by hardware on the compressed data (the first target data in the at least one first target data sub-object / first target column) is facilitated, reducing the data selection logic and the hardware wiring complexity.
In an alternative embodiment, the method for forming the first target data sub-object may further include the following processing: and carrying out data sequence adjustment processing on the data in each first target column so as to enable the effective data originally belonging to the same column in the first data matrix to be continuously arranged in the first target column where the effective data is located after moving.
In this embodiment, a technical idea of further compressing the first data matrix from the column direction is provided for the compression result obtained by compressing the first data matrix from the row direction in the previous embodiment.
Optionally, the invalid data in each first target column may be removed first; then the data in each first target column are subjected to movement-based sequence adjustment according to their corresponding column indexes. Through this invalid-data removal and sequence adjustment, valid data originally belonging to the same column of the first data matrix are arranged contiguously in the first target column where they are located after moving, thereby further compressing the first data matrix from the column direction.
As shown in the example of fig. 6, for the three first target columns obtained by compressing Matrix A from the row direction, the 0-value data in the first target columns are first removed, yielding three first target columns containing no 0-value data. On this basis, according to the column index of each data in the first target columns, the data are sequence-adjusted based on position movement, finally yielding first target columns in which valid data originally belonging to the same column are arranged contiguously. For example, the data "11", "14" and "17" in the first target column are data from the 1st column of Matrix A; after the movement-based sequence adjustment is completed, these three data are arranged contiguously in their first target column.
In practical applications, the column-direction compression is not limited to removing the invalid data first and then adjusting the sequence. In other embodiments, based on the contiguous-arrangement target described above (valid data from the same column of the first data matrix arranged contiguously in the first target column where they are located after moving), the data in each first target column may first be sequence-adjusted based on position according to their corresponding column indexes, with the invalid data such as 0 values in the sequence-adjusted first target columns removed afterwards; the combination of data sequence adjustment and invalid-data removal likewise achieves the contiguous-arrangement target.
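The column-direction compression just described (remove invalid data, then order entries so that data from the same original column are contiguous) can be sketched as below. The `(value, original_row, original_column)` cell layout is an assumption carried over from the row-direction sketch, not the patent's storage format; a stable sort on the original column index realizes the contiguous arrangement:

```python
def compress_column_direction(target_col):
    """Column-direction compression of one first target column.
    target_col: list of (value, orig_row, orig_col) cells, with None for
    invalid (0-value) positions. Invalid cells are removed, then valid
    data from the same original column are made contiguous by a stable
    sort on the original column index."""
    cells = [c for c in target_col if c is not None]
    return sorted(cells, key=lambda cell: cell[2])

# One target column after row-direction compression: data from original
# columns 0, 1 and 2 interleaved with two invalid positions.
col = [(11, 0, 0), None, (25, 1, 1), (14, 3, 0), None, (17, 6, 0), (38, 2, 2)]
out = compress_column_direction(col)
# "11", "14", "17" (all from original column 0) end up contiguous.
assert [c[0] for c in out] == [11, 14, 17, 25, 38]
```

Doing the sort first and dropping the `None` cells afterwards produces the same final arrangement, matching the two orderings of the steps described above.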
According to the embodiment, the row direction compression result of the first data matrix is further compressed from the column direction, invalid data in a first target column obtained after the first data matrix is compressed from the row direction is removed, the data processing amount of a first data object can be further reduced, the calculation performance of a system can be correspondingly further improved, the resource requirements of storage, transmission, operation and the like are reduced, and the system resource utilization rate and the operation efficiency are improved.
In addition, when performing data compression in the column direction, this embodiment arranges valid data originally belonging to the same column contiguously in the first target column where they are located after moving, which facilitates the subsequent hardware data processing on the compressed data (the first target data in each first target column, containing only valid data after row- and column-direction compression), reducing the data selection logic and the hardware wiring complexity. It can also reduce the amount of index data for the first target column: for a run of contiguously arranged valid data originally belonging to the same column, only the full position information (such as column index and row index) of the first valid datum is recorded, together with the number of data in the run; the other data in the run need record only their row index and no column index, reducing the index amount and saving storage space. Exploiting this characteristic further, for such a run of contiguously arranged valid data originally belonging to the same column, the whole (originally one) column of valid data can be read at once according to the recorded number, over consecutive beats, with the corresponding channels and count input simultaneously, so that the computation for one original column of data is completed in one beat.
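The index-amount reduction described above amounts to a run-length encoding of the column indexes. The sketch below illustrates the idea (the record layout is hypothetical; the patent does not fix one): the full position is stored only for the first datum of each same-column run, plus the run length, while later data keep only a row index.

```python
def run_length_index(entries):
    """entries: list of (value, row, col) with same-column data contiguous.
    Produce one record per run: the run's column index and first row are
    stored once, together with the run length; the remaining data
    contribute only their row indexes."""
    runs = []
    i = 0
    while i < len(entries):
        _, row0, col0 = entries[i]
        j = i
        while j < len(entries) and entries[j][2] == col0:
            j += 1                      # extend the run while column matches
        runs.append({"col": col0, "first_row": row0, "count": j - i,
                     "rows": [e[1] for e in entries[i:j]]})
        i = j
    return runs

# "11", "14", "17" from original column 0 form one run; "25" its own run.
entries = [(11, 0, 0), (14, 3, 0), (17, 6, 0), (25, 1, 1)]
runs = run_length_index(entries)
assert runs[0]["col"] == 0 and runs[0]["count"] == 3
assert runs[1] == {"col": 1, "first_row": 1, "count": 1, "rows": [1]}
```

The `count` field is what lets hardware stream a whole original column of data in consecutive beats without re-reading a column index per datum.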
In an alternative embodiment, matching the first data object, which may be a first data matrix containing a plurality of data to be processed (e.g., a weight matrix in the neural network), the second data object corresponding to the target data processing channel may be a second data matrix containing a plurality of data to be processed, for example a feature map corresponding to a particular input channel of the neural network.
Step 202 in the data processing method provided by the present application, namely, according to the position information corresponding to each first target data in the at least one first target data sub-object, obtaining second target data corresponding to each first target data from the second data object corresponding to the target data processing channel, where the corresponding may be implemented as:
Acquiring second target data corresponding to each first target data from a second target column of a second data matrix according to the position information of each first target data in each first target column; the second target column is a column currently to be processed in the second data matrix.
For the inner-product-based matrix multiplication in the embodiment of the application, the essence is that data whose column index in the row of the first data object / first data matrix currently participating in the operation equals the row index in the column of the second data object / second data matrix currently participating in the operation must be matched into pairs of data to be processed. Therefore, when second target data corresponding to each first target data is acquired from the second target column of the second data matrix according to the position information of each first target data in each first target column, the datum whose row index in the second target column corresponds to the column index of the first target data in the first data matrix is taken as the second target data corresponding to that first target data; this second target data is the data matched with the first target data, to be paired with it into a data pair to be processed that participates in subsequent data processing.
For example, continuing with Matrix A and Matrix B from the example of fig. 3, consider the three first target columns in fig. 7 obtained by compressing Matrix A in the row and column directions, and assume the second target column currently to be processed in Matrix B is its first column. Taking the first target column in fig. 7 as an example: for the data "11", "14" and "17" in that target column, the datum with row index "1" (i.e., the datum represented by "1" in the second target column) may be taken out of the second target column according to their corresponding column index "1", as the second target data corresponding to "11", "14" and "17". Similarly, for the datum "25" in the first target column, the datum with row index "2" (i.e., the datum represented by "2" in the second target column) may be taken out according to its corresponding column index "2", as the second target data corresponding to "25"; and for the datum "38", the datum with row index "3" (i.e., the datum represented by "3" in the second target column) may be taken out according to its corresponding column index "3", as the second target data corresponding to "38".
According to the embodiment, the second target data corresponding to each first target data are acquired from the second target columns of the second data matrix according to the position information of each first target data in each first target column, so that the second target data matched with the first target data can be conveniently and accurately positioned and acquired from the second data object, the data to be processed and the first target data are formed to participate in subsequent data processing, and the high efficiency and the accuracy of the data processing of the first data object and the second data object are ensured.
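Putting the fetch and the multiply-accumulate together, the processing of one first target column against one second target column can be sketched as follows (a software illustration under the assumed `(value, orig_row, orig_col)` cell layout, not the chip's datapath):

```python
def process_target_column(first_target_col, b_column):
    """Process one compressed first target column against the current
    second target column.

    first_target_col: list of (value, orig_row, orig_col) first target data.
    b_column:         the second target column, as a list indexed by row.

    Each first target datum fetches its second target datum at row
    position orig_col of the B column, multiplies, and the product is
    accumulated into the result for its original row."""
    results = {}
    for value, orig_row, orig_col in first_target_col:
        second_target = b_column[orig_col]      # fetch by column index
        results[orig_row] = results.get(orig_row, 0) + value * second_target
    return results

# The first target column of the fig. 7-style example: data from original
# columns 0, 1, 2 (0-indexed); B's first column holds 1, 2, 3 in rows 0-2.
col1 = [(11, 0, 0), (14, 3, 0), (17, 6, 0), (25, 1, 1), (38, 2, 2)]
b_col = [1, 2, 3, 0, 0, 0, 0, 0]
out = process_target_column(col1, b_col)
assert out[0] == 11 * 1 and out[1] == 25 * 2 and out[2] == 38 * 3
```

Note that "11", "14" and "17" all fetch the same B datum (row 0), which is exactly why a contiguous same-column run can share one fetched second target value across consecutive beats.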
In an alternative embodiment, referring to the flowchart of the data processing method shown in fig. 8, step 203 in the data processing method provided by the present application, that is, performing data processing on each first target data and the corresponding second target data, may be implemented as:
Step 801, according to the number of available hardware processing channels capable of parallel processing, distributing a corresponding number of first target data to each available hardware processing channel, and performing data processing on the distributed first target data and the corresponding second target data by using the available hardware processing channels.
An available hardware processing channel is a hardware computing channel in the system of an electronic device, such as a personal computer or a server, that is not occupied and can be scheduled to execute the required operations on data, for example a computing channel formed by arithmetic units and registers. It may comprise the required number of arithmetic units and/or registers, and may also comprise other required hardware.
Optionally, in an embodiment of the present application, the available hardware processing channel includes at least a first operation component, where the first operation component may be a multiplier that can be used to multiply data, and may further include a register to register data.
Further, each available hardware processing channel may be a PE (Processing Element); referring to the PE internal structure diagram shown in fig. 9, each PE includes a register (Reg) and a multiplier (Mul), for registering data and performing multiplication operations on the data, respectively.
For the case that the first data object is the first data matrix and the second data object is the second data matrix, in this step 801, a corresponding number of first target data in the first target columns obtained by compressing the first data matrix (compressing in the row direction or in the column direction) are specifically allocated, one to one, to the available hardware processing channels according to the number of channels that can be processed in parallel. For example, for three first target columns obtained by compressing the matrix A from the column direction, assuming that the number of currently available PEs is 5, the 5 first target data contained in the first of the three first target columns may be allocated to the 5 PEs one to one in order and stored in the registers of the PEs to which they are allocated.
In practical applications, optionally, in the case that the neural network model is known, appropriate hardware resource configuration may be performed on the neural network model, for example, the number of hardware channels to be used for model data processing is configured as the number of rows of the weight matrix corresponding to the input channels of the neural network model network layer, or as the number of valid data contained in the column with the largest valid data amount in each column of the weight matrix, or the like.
The first target data corresponds to corresponding position information, for example a row index and a column index, indicating the original position (original row, original column, and the like) of the first target data in the first data object (the first data matrix). After the corresponding number of first target data are distributed to the available PEs, the data on the row corresponding to the column index of the first target data allocated to each PE is obtained from the second target column currently to be processed of the second data object (the second data matrix) and sent to that PE as the second target data corresponding to the first target data, and the PE performs a multiplication operation on the obtained first target data and second target data using its multiplier.
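As a non-authoritative illustration, the per-channel fetch-and-multiply step described above can be sketched as follows. All names are hypothetical; each first target data is assumed to be stored as a (value, row index, column index) triple, with 0-based indices.

```python
# Sketch (hypothetical names): the column index of each first target datum
# selects the matching row of the second target column (column n of matrix B),
# which is the second target data for that PE.

def pe_multiply(allocated, B, n):
    """allocated: (value, row_idx, col_idx) triples, one per PE.
    B: second data matrix as a list of rows; n: index of the second target column."""
    results = []
    for value, row_idx, col_idx in allocated:
        b = B[col_idx][n]                     # second target data for this PE
        results.append((value * b, row_idx))  # keep row index for routing later
    return results

# "11", "14", "17" all came from column 0 of the first data matrix.
allocated = [(11, 0, 0), (14, 3, 0), (17, 6, 0)]
B = [[2, 0], [0, 1], [3, 5]]
print(pe_multiply(allocated, B, 0))  # [(22, 0), (28, 3), (34, 6)]
```

The row index is carried alongside each product because the subsequent routing stage uses it to pick an accumulator.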
Besides the available hardware processing channels such as PEs, the hardware used for performing data processing on the first target data and the corresponding second target data may further include a plurality of second operation components and at least one routing component. The second operation components may be accumulators used for performing accumulation processing (sum operations) on data. The routing component is configured to transmit the multiplication result of the first operation component in an available hardware processing channel to the corresponding second operation component according to the row index of the corresponding first target data, so that, for the second target column of the second data matrix currently participating in processing, multiplication results whose first target data share the same row index are accumulated in the same second operation component. This accumulates, for that second target column, the multiplication results corresponding to the data (first target data) of the same row of the first data matrix.
The number of the second operation components is not lower than the number of rows of the first data matrix. For example, for the case that the first data matrix is the matrix A, 9 accumulators are set, corresponding one to one to the original 9 rows of the first data matrix. After the multiplier of each PE multiplies the obtained first target data by the corresponding second target data, the multiplication result and the row index of the first target data are sent to the routing component, and the routing component routes the received multiplication result to the corresponding accumulator based on that row index. Each accumulator thus receives, for the second target column of the second data matrix currently participating in processing, the multiplication results of the first target data of one same row of the first data matrix, so that the multiplication results corresponding to the data (first target data) of the same row of the first data matrix are accumulated in that accumulator. This is consistent with the rule of inner-product-based matrix multiplication and meets its requirement.
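The routing-and-accumulation stage can be sketched, under the same hypothetical (product, row index) representation, as follows:

```python
# Sketch (hypothetical names): one accumulator per original row of the first
# data matrix; the routing component sends each product to the accumulator
# selected by the row index of its first target datum, so products belonging
# to the same row accumulate together.

def route_and_accumulate(products, num_rows):
    """products: (product, row_idx) pairs output by the PEs (0-based rows)."""
    acc = [0] * num_rows            # e.g. 9 accumulators for a 9-row matrix A
    for p, row_idx in products:
        acc[row_idx] += p           # same row index -> same accumulator
    return acc

products = [(22, 0), (28, 3), (34, 6), (5, 0)]
print(route_and_accumulate(products, 9))
# [27, 0, 0, 28, 0, 0, 34, 0, 0]
```

Each accumulator slot ends up holding the partial inner product of one row of the first data matrix with the second target column currently being processed.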
Because the first target data participating in data processing are the data in the at least one first target data sub-object obtained after data compression of the first data object, at least part of the invalid data in the first data object has been removed compared with the first data object, so the removed invalid data can be prevented from participating in operations and the amount of data processing for the first data object is reduced. On this basis, the number of available hardware processing channels required for processing the first target data is smaller than the number required for processing the data to be processed in the first data object, so the computing performance of the system can be correspondingly improved, the resource demands for storage, transmission, operation and the like are reduced, and the resource utilization rate and operation efficiency of the system are improved. Meanwhile, the designed hardware structure and the manner in which it is used are consistent with the rule of inner-product-based matrix multiplication, so the requirement of inner-product-based matrix multiplication is met and the accuracy of the inner-product-based matrix multiplication result can be ensured.
In an alternative embodiment, the valid data of the same column of the first data matrix/first data object are arranged continuously in the first target column where they are located after the movement; that is, the first target column in this embodiment is a column obtained by compressing the first data matrix/first data object from the row and column directions and then adjusting the order (so that valid data of the same column are arranged continuously).
The data processing performed on the first target data and the corresponding second target data in this embodiment likewise includes multiply-accumulate processing, and the multiplication operations in the multiply-accumulate processing include those involved in matrix-multiplying the first data matrix by the second data matrix.
In this embodiment, referring to the flowchart of fig. 10, step 203 in the method provided by the present application, namely, performing data processing on each first target data and the corresponding second target data, may be specifically implemented as:
step 1001, respectively distributing first target data in the same group in each first target column to continuous available hardware processing channels in the channel array; the first target data of the same group comprises first target data belonging to the same column in the first data matrix in the first target column.
The channel array is an array formed by each available hardware processing channel, such as a PE array formed by each PE.
In this embodiment, the data in a first target column is grouped according to the characteristic that valid data in the same column of the first data matrix/first data object are continuously arranged in the first target column where they are located after the movement. Specifically, first target data in the first target column that belong to the same column of the first data matrix are grouped into one group, and first target data that belong to different columns of the first data matrix are correspondingly placed into different groups.
The data contained in each group are continuously arranged in the first target column. In view of this feature, when the first target data are allocated to the available hardware processing channels, the first target data of the same group in each first target column are specifically allocated to consecutive available hardware processing channels in the channel array; for example, the first target data of the same group are allocated to consecutive PEs in the PE array and stored in the registers contained in the PEs to which they are allocated.
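Under the contiguity property described above, recovering the groups reduces to a run grouping by column index. A minimal sketch with hypothetical names:

```python
from itertools import groupby

# Sketch (hypothetical names): valid data of the same original column are
# contiguous within a first target column, so grouping consecutive triples by
# column index recovers the groups; each group then maps to consecutive PEs.

def group_by_original_column(target_column):
    """target_column: (value, row_idx, col_idx) triples in stored order."""
    return [list(g) for _, g in groupby(target_column, key=lambda t: t[2])]

tc = [(11, 0, 0), (14, 3, 0), (17, 6, 0), (5, 1, 2), (9, 4, 2)]
groups = group_by_original_column(tc)
print([[v for v, _, _ in g] for g in groups])  # [[11, 14, 17], [5, 9]]
```

Because each run is contiguous in storage, assigning its members to consecutive PEs needs no selection logic beyond the run boundaries.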
Step 1002, determining second target data corresponding to the first target data in the same group according to the position information corresponding to the first target data in the same group.
The column indexes corresponding to the first target data in the same group are the same, and the characteristics of matrix multiplication operation based on inner products are combined, so that the first target data in the same group correspond to the same second target data in the second data matrix.
The step may specifically determine, according to a column index corresponding to the first target data of the same group, data at a row position corresponding to the column index from a second target column to be currently processed in the second data matrix, as second target data corresponding to the first target data of the same group.
For example, for the first of the three first target columns obtained by compressing the matrix A from the row and column directions, taking its first group as an example, for the data "11", "14", "17" contained in that group, the data on the row indicated by the index "1" may be determined from the second target column currently to be processed in the second data matrix according to the column index "1" shared by "11", "14", "17", as the second target data corresponding to "11", "14", "17"; for instance, the data on row 1 may be determined from the first column currently to be processed of the matrix B as the second target data corresponding to "11", "14", "17".
And step 1003, distributing the determined second target data to available hardware processing channels where the corresponding groups of first target data are respectively located, and performing multiplication operation on the distributed first target data and second target data by using a first operation component provided by the available hardware processing channels to obtain multiplication operation results.
The first operation component may be a multiplier included in PE, such as multiplier Mul in fig. 9.
After grouping the first target data in each first target column and determining, for each group, the same second target data corresponding to all first target data in the group, that second target data can be obtained and distributed to the available hardware processing channels where the first target data of the corresponding group are respectively located; on this basis, the available hardware processing channels can perform multiplication operations on the obtained first target data and second target data using their first operation components.
For example, the same second target data corresponding to the first target data "11", "14", and "17" contained in the first group of the first target column is read, the read second target data is allocated to the PEs where "11", "14", and "17" are located, and the multipliers in those PEs perform multiplication operations on the obtained first target data and second target data.
Optionally, for the case that the second target data read is invalid data (e.g. a value of 0), the invalid second target data does not need to be input into the corresponding PE, and the first target data currently allocated in that PE does not need to be operated on; that is, the multiplication result corresponding to the first target data currently allocated in the PE can be directly regarded as an empty result. This accords with the characteristic that invalid data contributes nothing to the result of a matrix multiplication and does not affect the overall matrix multiplication result.
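The zero-skipping behavior for a group sharing one second target data can be sketched as follows (hypothetical names, same triple representation as before):

```python
# Sketch (hypothetical names): if the shared second target data read for a
# group is 0 (invalid), the group's PEs receive no input and produce an empty
# result, matching the fact that zero operands contribute nothing to the
# matrix product.

def pe_multiply_skip_zero(group, b):
    """group: (value, row_idx, col_idx) triples; b: shared second target data."""
    if b == 0:
        return []                   # empty result: no operation performed
    return [(value * b, row_idx) for value, row_idx, _ in group]

group = [(11, 0, 0), (14, 3, 0)]
print(pe_multiply_skip_zero(group, 0))  # []
print(pe_multiply_skip_zero(group, 3))  # [(33, 0), (42, 3)]
```

Skipped groups simply feed nothing into the accumulators, so the accumulated result is unchanged from the full multiplication.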
Step 1004, according to the position information corresponding to each first target data, sending the multiplication result corresponding to the first target data in the same row in the first data matrix to the same second operation component for accumulation processing, so as to obtain an accumulation operation result.
And the position information corresponding to the second target data is used for indicating the position of the second target data in the second data object.
The second operation component may be an accumulator used for performing accumulation processing (sum operations) on data, and the number of second operation components is not less than the number of rows of the first data matrix. For example, for the first data matrix being the matrix A described above, 9 accumulators may be provided, corresponding one to one to the original 9 rows of the matrix A; this is not limiting, and more than 9 accumulators may be provided in practical applications.
After the multiplication of the allocated first target data and second target data is completed in the corresponding available hardware processing channels such as PEs, each PE sends the multiplication result and the row index of the corresponding first target data to the routing component, and the routing component routes the received multiplication result to the corresponding accumulator based on that row index. Each accumulator thus receives, for the second target column currently to be processed in the second data matrix, the multiplication results of the first target data of one same row of the first data matrix, so that the multiplication results corresponding to the data (first target data) of the same row of the first data matrix are accumulated in that accumulator, which accords with the rule of inner-product-based matrix multiplication and meets its requirement.
The following examples are given:
With the example of fig. 3, for the three first target columns obtained by compressing the matrix A in the row and column directions, as shown in fig. 11, assume that the number of currently available PEs in the PE array is not less than the total number of first target data in the three first target columns. Each first target datum in the three first target columns may then be directly allocated, one to one, to different PEs; specifically, first target data belonging to the same column of the first data matrix in each first target column may be allocated, group by group, to consecutive PEs in the PE array. In addition, the data of the second target column currently to be processed in the matrix B (such as the first-column data shown for the matrix B in fig. 3) may be simplified: invalid data in that column may be removed, and the valid second target data may be associated with each group by way of an index, as shown in fig. 11.
Then, the second target data corresponding to each group can be read according to the established association, and the read second target data is directly distributed to the PEs where the first target data of the corresponding group are located; each PE multiplies the distributed first target data and second target data using its multiplier. In this example, the multiplication of each first target datum in the three first target columns with its corresponding second target data can be completed in the PEs in one beat based on parallel processing; for the matrix B, one column of its data can be sent to the corresponding PEs in one beat, so all operations on one column of the matrix B can be processed in one beat. The first target data obtained by compressing the matrix A (such as the compressed weight data) is stored in the PEs in advance, the matrix B data is connected directly to the PE array, and all PE computation results are obtained in one beat. The data of the column of the matrix B currently involved in the operation can be divided into a plurality of groups, each group of data being multiplied by the data of the same first target column obtained by compressing the matrix A, without crossing different columns, which reduces data selection logic, lowers the complexity of hardware wiring, and discards invalid data in the matrix B.
In the embodiment of the present application, a beat specifically refers to one cycle in which a PE performs a multiplication operation on the obtained data pair to be processed.
Referring further to fig. 12, each PE is connected to 9 accumulators (adders) through a routing component (crossbar), and the multiplication results of the PEs in the same column are sent to different accumulators through the crossbar, so there is no conflict; if the PE array has x columns (the number of columns of PEs in the PE array, that is, the number of first target columns) operating simultaneously, each accumulator has at most x inputs.
Assume that the second target column currently to be processed in the second data matrix is the n-th column of the matrix, the first target data obtained in a PE is W_ij, and the second target data is B_mn (that is, the corresponding data in the n-th column of the second data matrix), where i and j respectively denote the row index and column index of the first target data, m and n respectively denote the row index and column index of the second target data, and i, j, m, n are each integers not less than 1. If 9 accumulators are set, numbered 1 to 9, the multiplication result of W_ij and B_mn in the PE is sent into the accumulator numbered i based on the row index i of the first target data W_ij. In the same manner, for the second target column currently to be processed in the second data matrix, the multiplication results corresponding to all first target data with row index i are sent into the accumulator numbered i, so that the multiplication results of the data of the same row i of the first data matrix against that column of the second data matrix are accumulated in that accumulator.
In the embodiment of the application, inner-product-based matrix multiplication essentially multiplies the data of each row of the first data matrix one to one, in order, with the data of each column of the second data matrix and then accumulates the multiplication results to obtain multiply-accumulate results; the result of the inner-product-based matrix multiplication is still a matrix, and the multiply-accumulate result of the data of one row of the first data matrix with the data of one column of the second data matrix is one value in the final result matrix. Based on this, in practical applications, an accumulator may optionally be set for each row-column combination formed by a row of the first data matrix and a column of the second data matrix. After a PE multiplies the obtained first target data W_ij by the second target data B_mn, it may send the result, based on the row index i corresponding to W_ij and the column index n corresponding to B_mn, into the accumulator at position R_in, where the accumulator at position R_in is the accumulator corresponding to the row-column combination formed by row i of the first data matrix and column n of the second data matrix. Based on this routing manner, for the second target column currently to be processed in the second data matrix, the multiplication results corresponding to the data of the same row of the first data matrix are likewise accumulated in the corresponding accumulator, meeting the requirement of the matrix multiplication operation.
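The per-combination routing variant can be sketched as follows, with hypothetical names and 0-based indices; the accumulator grid here stands in for the R_in accumulators:

```python
# Sketch (hypothetical names): one accumulator R_in per (row i of the first
# matrix, column n of the second matrix); the accumulated value at (i, n) is
# entry (i, n) of the final product matrix.

def route_per_cell(products, num_rows, num_cols):
    """products: (product, i, n) triples output by the PEs (0-based here)."""
    acc = [[0] * num_cols for _ in range(num_rows)]
    for p, i, n in products:
        acc[i][n] += p              # accumulator R_in for row i, column n
    return acc

products = [(22, 0, 0), (5, 0, 0), (28, 3, 1)]
print(route_per_cell(products, 4, 2))
# [[27, 0], [0, 0], [0, 0], [0, 28]]
```

Compared with the per-row accumulators, this variant needs more accumulators but lets several columns of the second matrix be processed without reusing any accumulator.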
According to this embodiment, based on the characteristic that valid data of the same column of the first data matrix/first data object are continuously arranged in the first target column after compression, the data in the first target column are grouped. Grouping makes it convenient to determine the same second target data corresponding to all first target data of one group in a first target column, and the determined second target data can be sent, based on a single read operation, to the available hardware processing channels (such as PEs) where the first target data of the group are located. This further simplifies data selection logic, reduces the amount of data read and the bandwidth requirement, lowers the complexity of hardware wiring, and improves data operation efficiency; compared with the outer-product approach, memory accesses by partial sums are also reduced.
Corresponding to the above data processing method, an embodiment of the present application further provides a data processing apparatus, whose constituent structure is shown in fig. 13, including:
A first obtaining module 1301, configured to obtain at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object;
A second obtaining module 1302, configured to obtain, according to the location information corresponding to each first target data in the at least one first target data sub-object, second target data corresponding to each first target data from a second data object corresponding to the target data processing channel;
The data processing module 1303 is configured to perform data processing on the first target data and the second target data;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
In an alternative embodiment, the apparatus further includes a preprocessing apparatus for forming the first target data sub-object through preprocessing, and the process by which the preprocessing apparatus forms the first target data sub-object includes:
acquiring a first data object corresponding to a target data processing channel in a model;
Moving the valid data of the corresponding first data sub-object in the first data object to the positions of invalid data of other first data sub-objects, so as to reduce the number of first data sub-objects contained in the first data object;

Wherein the at least one first target data sub-object comprises: the first data sub-objects that contain at least valid data after the movement is completed; the other first data sub-objects include the first data sub-objects of the first data object other than the corresponding first data sub-object.
In an alternative embodiment, the first data object includes a first data matrix having a plurality of data to be processed, and the first data sub-object is a column in the first data matrix;
The preprocessing device is specifically configured to, when moving valid data of a corresponding first data sub-object in the first data object to a position where invalid data of other first data sub-objects is located:
Moving the valid data of the corresponding columns in the first data matrix to the positions of invalid data of other columns except the corresponding columns in the first data matrix so as to reduce the number of columns contained in the first data matrix;
wherein the at least one first target data sub-object comprises: the first target columns that contain at least valid data, obtained after the movement is completed; after the movement is completed, data of the same column of the first data matrix are in the same first target column, and data of the same row are in different first target columns.
In an alternative embodiment, the process of forming the first target data sub-object by the preprocessing device further includes:
And performing data order adjustment processing on the data in each first target column, so that valid data originally belonging to the same column of the first data matrix are continuously arranged in the first target column where they are located after the movement.
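A simplified, non-authoritative sketch of this preprocessing is given below. It collects the valid (nonzero) entries of each column as (value, row, col) triples and packs them into fewer target columns; the packing policy is a placeholder, and constraints such as keeping same-row data in different target columns are omitted from this sketch.

```python
# Simplified sketch (hypothetical names): nonzero entries of each column of
# the first data matrix are kept as (value, row, col) triples and packed into
# num_targets target columns, keeping the entries of one original column
# contiguous so they can later be grouped and assigned to consecutive PEs.

def compress_columns(A, num_targets):
    columns = []
    for c in range(len(A[0])):
        col = [(A[r][c], r, c) for r in range(len(A)) if A[r][c] != 0]
        if col:
            columns.append(col)     # only columns with valid data survive
    targets = [[] for _ in range(num_targets)]
    for i, col in enumerate(columns):
        targets[i % num_targets].extend(col)  # same-column entries stay together
    return targets

A = [[1, 0, 2],
     [0, 3, 0],
     [4, 0, 0]]
print(compress_columns(A, 2))
# [[(1, 0, 0), (4, 2, 0), (2, 0, 2)], [(3, 1, 1)]]
```

Each stored triple carries the original row and column index, which is the position information the later steps rely on for fetching second target data and for routing products to accumulators.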
In an alternative embodiment, the second data object comprises a second data matrix having a plurality of data to be processed;
The second obtaining module 1302 is specifically configured to: acquiring second target data corresponding to each first target data from a second target column of the second data matrix according to the position information of each first target data in each first target column; the second target column is a column currently to be processed in the second data matrix.
In an alternative embodiment, the data processing module 1303 is specifically configured to:
according to the number of available hardware processing channels capable of being processed in parallel, distributing a corresponding number of first target data to each available hardware processing channel, and carrying out data processing on the distributed first target data and the corresponding second target data by utilizing the available hardware processing channels;
the number of available hardware processing channels required for processing the first target data is smaller than the number of available hardware processing channels required for processing the data to be processed in the first data object.
In an optional implementation manner, the valid data originally belonging to the same column in the first data matrix are continuously arranged in a first target column where the valid data is located after moving; the data processing comprises multiply-accumulate processing, wherein the multiply operation in the multiply-accumulate processing comprises multiply operation involved in matrix multiplication of the first data matrix and the second data matrix;
The data processing module 1303 is specifically configured to:
the first target data of the same group in each first target column are respectively distributed to continuous available hardware processing channels in the channel array; the first target data of the same group comprise first target data belonging to the same column in the first data matrix in the first target column;
Determining second target data corresponding to the first target data of the same group according to the position information corresponding to the first target data of the same group; the first target data in the same group correspond to the same second target data;
The determined second target data are distributed to available hardware processing channels where the corresponding groups of first target data are respectively located, and the first operation components provided by the available hardware processing channels are used for carrying out multiplication operation on the distributed first target data and second target data to obtain multiplication operation results;
According to the position information corresponding to each first target data, the multiplication operation results corresponding to the first target data belonging to the same row in the first data matrix are sent to the same second operation assembly for accumulation processing, and an accumulation operation result is obtained;
And the position information corresponding to the second target data is used for indicating the position of the second target data in the second data object.
The embodiment of the present application further provides a data processing chip, referring to the component structure diagram shown in fig. 14, where the data processing chip includes a plurality of available hardware processing channels 1401 and a plurality of second arithmetic components 1402.
Wherein each of the available hardware processing channels 1401 comprises a first operation component 1403, configured to obtain corresponding first target data in at least one first target data sub-object, where the first target data in the at least one first target data sub-object includes all valid data of each first data sub-object in the first data object corresponding to the target data processing channel, and each first target data corresponds to corresponding location information, and is configured to indicate an original location of the first target data in the first data object; and the second target data corresponding to the corresponding first target data is acquired from the second data object corresponding to the target data processing channel according to the position information corresponding to the corresponding first target data; performing first operation processing on the acquired first target data and second target data to obtain a first operation result;
Each second operation module 1402 is configured to perform a second operation on the first operation result output by the corresponding available hardware processing channel, to obtain a second operation result;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
In an alternative embodiment, the first data object includes a first data matrix having a plurality of data to be processed;
the first data sub-object is a column in the first data matrix;
The at least one first target data sub-object comprises: the first target columns, containing at least valid data, obtained by moving the valid data of corresponding columns of the first data matrix to the positions of invalid data of other columns of the first data matrix; after the movement is completed, data of the same column of the first data matrix are in the same first target column, data of the same row are in different first target columns, and the number of first target columns is smaller than the number of columns contained in the first data matrix;
The second data object includes a second data matrix having a plurality of data to be processed.
In an alternative embodiment, the first operation process includes a multiplication operation, the multiplication operation including a multiplication operation involved in matrix multiplying the first data matrix and the second data matrix; the second operation processing includes an accumulation operation;
The first operation component 1403 is specifically configured to, when performing a first operation process on the acquired first target data and second target data: performing multiplication operation on the first target data and the corresponding second target data in the corresponding first target column to obtain a multiplication operation result;
the second operation module 1402 is specifically configured to, when performing a second operation on the first operation result output by the corresponding available hardware processing channel: and accumulating the multiplication operation results output by the corresponding available hardware processing channels to obtain accumulation operation results.
In an alternative embodiment, the data processing chip further includes a routing component;
Each of the available hardware processing channels 1401 further comprises a separate storage component for storing first target data allocated to the corresponding available hardware processing channel;
The routing component is used for sending the multiplication operation result output by the available hardware processing channel to the corresponding second operation component for accumulation processing according to the position information corresponding to the first target data and the second target data in the available hardware processing channel;
The position information corresponding to the second target data is used for indicating the position of the second target data in the second data object; and the multiplication operation results corresponding to the first target data in the same row in the first data matrix are sent to the same second operation component for accumulation processing.
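The routing rule above — multiplication results whose first target data share a row in the first data matrix are delivered to the same accumulator — can be illustrated with a minimal sketch. The tuple format of the channel outputs is an assumption for illustration only.

```python
def route_results(channel_outputs, num_rows):
    """Routing sketch: each available hardware processing channel emits a
    (product, row position info) pair; the routing component delivers all
    products carrying the same original row to the same second operation
    component (accumulator)."""
    accumulators = [0] * num_rows  # one second operation component per row
    for product, row in channel_outputs:
        accumulators[row] += product  # same row -> same accumulator
    return accumulators
```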
In an alternative embodiment, the available hardware processing channel 1401 is a PE, the second operation component 1402 is an accumulator, the first operation component 1403 is a multiplier in the PE, and the storage component is a register.
The data processing chip provided in this embodiment corresponds to the data processing method disclosed in the method embodiments, and implements that method based on the hardware structure of the chip and the functions of its components. For more detailed functions of each component of the data processing chip, and the process of implementing data processing based on those components, reference may be made to the description of the method embodiments above, which is not repeated here.
The embodiment of the application also discloses an electronic device, whose composition, as shown in fig. 15, at least comprises:
A memory 10 for storing a set of computer instructions;
The set of computer instructions may be implemented in the form of a computer program.
A processor 20 for implementing a data processing method as disclosed in any of the method embodiments above by executing a set of computer instructions.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC), a digital signal processor (Digital Signal Processor, DSP), a field-programmable gate array (Field-Programmable Gate Array, FPGA), a neural network processor (Neural Network Processor, NPU), a deep learning processor (Deep Learning Processor, DPU), or another programmable logic device, etc.
The electronic device is provided with a display device and/or a display interface, and can be connected to an external display device.
Optionally, the electronic device further includes a camera assembly, and/or is connected to an external camera assembly.
In addition, the electronic device may include communication interfaces, communication buses, and the like. The memory, processor and communication interface communicate with each other via a communication bus.
The communication interface is used for communication between the electronic device and other devices. The communication bus may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc., and may be classified as an address bus, a data bus, a control bus, etc.
In addition, the embodiment of the present application further provides a readable storage medium having stored thereon a set of computer instructions for being invoked and executed by a processor to implement a data processing method as provided in any of the above method embodiments.
And, a computer program product is also provided, comprising a computer program/instruction which, when executed by a processor, implements a data processing method as provided by any of the method embodiments above.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another.
For convenience of description, the above system or apparatus is described as being functionally divided into various modules or units. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in the embodiments, or in some parts of the embodiments, of the present application.
Finally, it is further noted that relational terms such as first, second, third, and fourth are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the protection scope of the present application.

Claims (12)

1. A data processing chip, comprising: a plurality of available hardware processing channels and a plurality of second arithmetic components;
Each available hardware processing channel comprises a first operation component, a second operation component and a first processing component, wherein the first operation component is configured to: acquire corresponding first target data in at least one first target data sub-object, the first target data in the at least one first target data sub-object comprising all valid data of each first data sub-object in the first data object corresponding to the target data processing channel, and each first target data corresponding to position information indicating the original position of that first target data in the first data object; acquire, according to the position information corresponding to the corresponding first target data, second target data corresponding to the corresponding first target data from a second data object corresponding to the target data processing channel; and perform first operation processing on the acquired first target data and second target data to obtain a first operation result;
Each second operation component is used for carrying out second operation processing on the first operation result output by the corresponding available hardware processing channel to obtain a second operation result;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
2. The data processing chip of claim 1, the first data object comprising a first data matrix having a plurality of data to be processed;
the first data sub-object is a column in the first data matrix;
The at least one first target data sub-object comprises: a first target column containing at least valid data, obtained by moving the valid data of corresponding columns in the first data matrix to the positions of invalid data of columns other than the corresponding columns in the first data matrix; after the movement is completed, data of the same column in the first data matrix are in the same first target column, data of the same row are in different first target columns, and the number of first target columns is smaller than the number of columns contained in the first data matrix;
The second data object includes a second data matrix having a plurality of data to be processed.
3. The data processing chip of claim 2, wherein the first operation processing comprises the multiplication operations involved in matrix-multiplying the first data matrix and the second data matrix; and the second operation processing comprises an accumulation operation;
the first operation component is specifically configured to, when performing first operation processing on the acquired first target data and second target data: perform a multiplication operation on the first target data in the corresponding first target column and the corresponding second target data to obtain a multiplication operation result;
the second operation component is specifically configured to, when performing second operation processing on the first operation result output by the corresponding available hardware processing channel: accumulate the multiplication operation results output by the corresponding available hardware processing channels to obtain an accumulation operation result.
4. The data processing chip of claim 3, further comprising a routing component;
each of the available hardware processing channels further comprises an independent storage component for storing first target data allocated to the corresponding available hardware processing channel;
The routing component is used for sending the multiplication operation result output by the available hardware processing channel to the corresponding second operation component for accumulation processing according to the position information corresponding to the first target data and the second target data in the available hardware processing channel;
The position information corresponding to the second target data is used for indicating the position of the second target data in the second data object; and the multiplication operation results corresponding to the first target data in the same row in the first data matrix are sent to the same second operation component for accumulation processing.
5. A data processing method, comprising:
Acquiring at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object;
Acquiring second target data corresponding to each first target data from a second data object corresponding to the target data processing channel according to the position information corresponding to each first target data in the at least one first target data sub-object;
performing data processing on each first target data and the corresponding second target data;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
6. The data processing method according to claim 5, wherein the method of forming the first target data sub-object comprises:
acquiring a first data object corresponding to a target data processing channel in a model;
moving the valid data of a corresponding first data sub-object in the first data object to positions of invalid data of other first data sub-objects, so as to reduce the number of first data sub-objects contained in the first data object;
wherein the at least one first target data sub-object comprises: the first data sub-objects containing at least valid data after the movement is completed; and the other first data sub-objects comprise the first data sub-objects, in the first data object, other than the corresponding first data sub-object.
7. The data processing method of claim 6, the first data object comprising a first data matrix having a plurality of data to be processed, the first data sub-object being a column in the first data matrix;
the moving the valid data of the corresponding first data sub-object in the first data object to the position of the invalid data of other first data sub-objects includes:
Moving the valid data of the corresponding columns in the first data matrix to the positions of invalid data of other columns except the corresponding columns in the first data matrix so as to reduce the number of columns contained in the first data matrix;
wherein the at least one first target data sub-object comprises: completing a first target column which at least contains effective data and is obtained after the movement; the data of the same column in the first data matrix are in the same first target column after the movement is completed, and the data of the same row are in different first target columns after the movement is completed.
8. The data processing method of claim 7, the method of forming the first target data sub-object, further comprising:
performing data sequence adjustment processing on the data in each first target column, so that the valid data originally belonging to the same column in the first data matrix are arranged contiguously in the first target column in which they are located after the movement.
9. A data processing method according to claim 7 or 8, the second data object comprising a second data matrix having a plurality of data to be processed;
The obtaining, according to the location information corresponding to each first target data in the at least one first target data sub-object, second target data corresponding to each first target data from the second data object corresponding to the target data processing channel includes:
Acquiring second target data corresponding to each first target data from a second target column of the second data matrix according to the position information of each first target data in each first target column; the second target column is a column currently to be processed in the second data matrix.
10. The data processing method according to claim 5, wherein the data processing of the respective first target data and the corresponding second target data includes:
according to the number of available hardware processing channels capable of being processed in parallel, distributing a corresponding number of first target data to each available hardware processing channel, and carrying out data processing on the distributed first target data and the corresponding second target data by utilizing the available hardware processing channels;
the number of available hardware processing channels required for processing the first target data is smaller than the number of available hardware processing channels required for processing the data to be processed in the first data object.
11. The data processing method according to claim 9, wherein the valid data originally belonging to the same column in the first data matrix are arranged contiguously in the first target column in which they are located after the movement; the data processing comprises multiply-accumulate processing, wherein the multiplication operations in the multiply-accumulate processing comprise the multiplication operations involved in matrix-multiplying the first data matrix and the second data matrix;
the data processing for each first target data and the corresponding second target data comprises the following steps:
distributing the first target data of the same group in each first target column to consecutive available hardware processing channels in a channel array; the first target data of the same group comprise the first target data, in the first target column, belonging to the same column of the first data matrix;
Determining second target data corresponding to the first target data of the same group according to the position information corresponding to the first target data of the same group; the first target data in the same group correspond to the same second target data;
The determined second target data are distributed to available hardware processing channels where the corresponding groups of first target data are respectively located, and the first operation components provided by the available hardware processing channels are used for carrying out multiplication operation on the distributed first target data and second target data to obtain multiplication operation results;
According to the position information corresponding to each first target data, the multiplication operation results corresponding to the first target data belonging to the same row in the first data matrix are sent to the same second operation assembly for accumulation processing, and an accumulation operation result is obtained;
And the position information corresponding to the second target data is used for indicating the position of the second target data in the second data object.
12. A data processing apparatus comprising:
A first acquisition module for acquiring at least one first target data sub-object containing first target data; the first target data in the at least one first target data sub-object at least comprises all valid data of each first data sub-object in the first data object corresponding to the target data processing channel; each first target data corresponds to corresponding position information for indicating an original position of the first target data in the first data object;
the second acquisition module is used for acquiring second target data corresponding to each first target data from the second data object corresponding to the target data processing channel according to the position information corresponding to each first target data in the at least one first target data sub-object;
The data processing module is used for carrying out data processing on each first target data and the corresponding second target data;
Wherein the number of first target data sub-objects is less than the number of first data sub-objects in the first data object.
CN202410742115.1A 2024-06-07 2024-06-07 Data processing method and device and data processing chip Pending CN118333127A (en)


Publication: CN118333127A, published 2024-07-12.


