CN110780849A

CN110780849A - Matrix processing method, device, equipment and computer readable storage medium

Info

Publication number: CN110780849A
Application number: CN201911034863.XA
Authority: CN
Inventors: 闯小明; 杨龚轶凡; 郑瀚寻; 曾昭仁; 张伊达
Original assignee: Shenzhen Xinying Technology Co Ltd
Current assignee: Zhonghao Xinying (Hangzhou) Technology Co.,Ltd.
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-02-11
Anticipated expiration: 2039-10-29
Also published as: CN110780849B

Abstract

The embodiment of the invention discloses a matrix processing method and a matrix processing device, which are used for realizing the position movement of elements in a matrix. An execution module, channel configuration data, and an input matrix are provided. An execution channel is constructed by sending a group of fields including the target serial number in the channel configuration data to an execution module; and inputting the input matrix into an execution channel, and performing selective recombination according to the target sequence number to obtain an output matrix with the element positions transformed. The method can support various data types and different data bit widths, and can adopt a single-channel mode or a multi-channel mode to carry out matrix element position moving processing. The time delay in the matrix processing process can be reduced, the processing efficiency is improved, and the selection space of the input data type and bit width is increased.

Description

Matrix processing method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a matrix processing method, apparatus, device, and computer-readable storage medium.

Background

Artificial neural network algorithms are widely used in modern artificial intelligence technology. Such as convolutional neural networks in computer vision applications and recurrent neural networks in natural language processing applications. In the application of artificial neural networks, it is often necessary to shift the positions of elements within a matrix. In many application scenarios of the artificial neural network, the matrix to be processed is mostly an ultra-large-scale matrix, the number of internal elements is large, the data types of the elements in the matrix are various, and the data bit width is different. When the bit width of the element data of the input matrix is changed, corresponding implementation devices are often required to be replaced, and the problems of hardware redundancy, various lines, high space occupation and the like are caused. For the position shift of the matrix elements, only the patterned shift operation, such as matrix transposition, matrix rotation, etc., can be performed according to a preset rule, and the shift rule cannot be flexibly set.

Disclosure of Invention

In view of the above, the present invention provides a matrix processing method, a matrix processing apparatus, a processing device, and a computer readable storage medium, so as to solve the problems of hardware redundancy, a plurality of lines, high space occupation, and lack of flexibility in the element position moving process of a matrix.

In a first aspect, an embodiment of the present invention provides a matrix processing method. Providing a data selection array comprising P rows and Q columns of data selection cells, the method comprising:

and acquiring channel configuration data, wherein the channel configuration data comprises L groups of fields, and at least one group of fields in the L groups of fields comprises the target sequence number.

Acquiring an execution instruction, and sending at least one group of fields comprising target sequence numbers to a data selection array according to the execution instruction, wherein one data selection unit receives one group of fields; and dividing at least one column of data selection units comprising the same group of fields into one execution channel according to the execution instruction.

Acquiring an input matrix, inputting the input matrix into a data selection unit in the execution channel according to columns, and selecting by the data selection unit according to the received target serial number to obtain a selection result; and recombining each selection result to obtain an output vector after element position transformation, wherein a plurality of output vectors form an output matrix.

Based on the matrix processing method provided by the embodiment of the invention, one execution channel or a plurality of execution channels can be selectively constructed in the data selection array according to actual requirements and the type and bit width of input data to carry out position moving operation on elements in the input matrix. When the data bit width of the element of the input matrix is equal to the maximum bit width supported by the data selection array, constructing the data selection array into a single channel according to the execution instruction and the channel configuration data for efficient execution; and when the data bit width of the element of the input matrix is smaller than the maximum bit width supported by the data selection array, constructing the Soviet selection array into multiple channels for parallel execution according to the execution instruction. The method can support various data types and different data bit widths through flexible channel configuration, adopts a single-channel mode or a multi-channel mode to carry out matrix element position moving processing, reduces time delay in the matrix processing process, improves processing efficiency, increases selection space of input data types and bit widths, and enables the matrix position moving processing to have higher flexibility. For the device or equipment for carrying out element position movement processing by using the method, the hardware redundancy in related processing circuits can be reduced, the hardware wiring is simplified, and the available space of hardware is saved.

Further, the execution channel includes an N-channel mode, a value range of N includes an M power of 2, and a value range of M includes [0, log [ ] ₂Q]Q is a non-negative integer power of 2, where L is equal to Q. The execution channels in multiple modes can support different data bit widths, and when the data type or the data bit width of the input matrix changes, the execution channel mode can be changed correspondingly to adapt to the data type of the input matrix, so that the flexibility of input data type selection is improved.

Furthermore, the execution instruction includes a first execution instruction, and the N-channel mode includes a first channel mode corresponding to the first execution instruction. Sending the fields of the relevant groups in the channel configuration data to a data selection unit corresponding to the execution instruction through a first execution instruction, and constructing a corresponding execution channel in a data selection array; the execution channel of the first channel mode is configured according to the first execution instruction, so that the construction process of the execution channel can be ensured to be orderly and efficient.

Further, the input matrix comprises column vectors, elements in the column vectors comprise bit sections, one column vector is divided into a plurality of sub-vectors according to the bit sections, and the plurality of sub-vectors comprise a first vector; the data selection array comprises a first data selection unit, and is used for receiving the target sequence number and the first sub-vector, selecting the target element from the first sub-vector according to the target sequence number, and splicing all the target elements in one row in one execution channel to obtain a selection result. By splitting a single element into a plurality of bit segments, the matrix element position shifting can be still realized by the way that a plurality of data selection units respectively calculate partial bit segments when the maximum bit width supported by a single data selection unit is smaller than the minimum bit width of the elements of the input matrix; and meanwhile, the execution efficiency can be ensured.

In a second aspect, an embodiment of the present invention provides a matrix processing apparatus, which includes an execution module, where the execution module includes a data selection array, and the data selection array includes P rows and Q columns of data selection units. The execution module is used for receiving channel configuration data; the channel configuration data comprises a plurality of groups of fields, and at least one group of fields in the plurality of groups of fields comprises a target sequence number. The execution module at least comprises an execution channel, the execution channel comprises at least one column of data selection units, and the data selection units in the same execution channel all comprise the same group of fields. The execution module is also used for receiving an input matrix, the input matrix enters the data selection unit according to the columns, and the data selection unit selects according to the target serial number to obtain a selection result. And combining the selection results in the same execution channel to obtain output vectors, wherein the output vectors form an output matrix.

According to the matrix processing device provided by the embodiment of the invention, one execution channel or a plurality of execution channels can be selectively constructed in the data selection array to perform the position shifting operation of the elements in the input matrix. When the data bit width of the element of the input matrix is equal to the maximum bit width supported by the data selection array, constructing the data selection array into a single channel according to the execution instruction and the channel configuration data for efficient execution; and when the data bit width of the element of the input matrix is smaller than the maximum bit width supported by the data selection array, constructing the Soviet selection array into multiple channels for parallel execution according to the execution instruction. Corresponding execution channels are constructed according to the bit width of the input data for matrix processing, a special processing circuit is not needed to be arranged for a certain data type or data bit width, hardware redundancy in related processing equipment is reduced, hardware wiring is simplified, and the available space of hardware is saved. The method can support various data types and different data bit widths through flexible channel configuration, adopts a single-channel mode or a multi-channel mode to carry out matrix element position moving processing, reduces time delay in the matrix processing process, improves processing efficiency, increases selection space of input data types and bit widths, enlarges application range and enables matrix position moving processing to have higher flexibility.

Further, the lane configuration data includes L sets of fields, L including a ratio of a highest bit width to a lowest bit width of data supported by the data selection array.

Furthermore, the input matrix comprises a column vector, elements in the column vector comprise bit sections, the column vector is divided into a plurality of sub-vectors according to the bit sections, and the plurality of sub-vectors comprise a first sub-vector; the data selection array comprises a first data selection unit, a second data selection unit and a third data selection unit, wherein the first data selection unit is used for receiving a target sequence number and a first sub-vector and selecting a target element from the first sub-vector according to the target sequence number; the selection result includes a target element in a row within an execution lane. By splitting a single element into a plurality of bit segments, the matrix element position shifting can still be carried out in a manner that a plurality of data selection units respectively calculate partial bit segments when the maximum bit width supported by a single data selection unit is smaller than the minimum bit width of the elements of the input matrix; and meanwhile, the execution efficiency can be ensured.

Further, the value of the column number Q of the data selection array comprises a non-negative integer power of 2; the number of field sets L in the channel configuration data is equal to the number of columns Q of the data selection array. The group number of the channel configuration data is the same as the column number of the data selection array, so that the maximum value of the number of the executable channels can reach Q when the channel configuration is carried out, and more executable channels can be constructed to continue parallel processing when the data bit width of the input matrix is small, thereby accelerating the execution efficiency and reducing the execution time.

Furthermore, the execution module is further configured to receive an execution instruction, where the execution instruction is used to send the channel configuration data to a corresponding data selection unit in the execution module; the execution channel comprises a plurality of channel modes, and the plurality of channel modes comprise a first channel mode; the execution instruction comprises a first execution instruction, and the first execution instruction corresponds to the first channel mode. Sending the fields of the relevant groups in the channel configuration data to a data selection unit corresponding to the execution instruction through a first execution instruction, and constructing a corresponding execution channel in a data selection array; the execution channel of the first channel mode is configured according to the first execution instruction, so that the construction process of the execution channel can be ensured to be orderly and efficient.

In a third aspect, an embodiment of the present invention provides a processing device. The device comprises a memory for storing calculations and programs; the apparatus further comprises a processor comprising the matrix processing device as disclosed above; the processor is adapted to carry out the steps of the aforementioned matrix processing method when executing the computer program stored in the memory.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by an execution processor, the computer program implements the steps of the above matrix processing method.

The invention can be further combined to provide more implementation modes on the basis of the implementation modes provided by the aspects.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic flowchart of a matrix processing method according to an embodiment of the present invention;

fig. 2 is an information schematic diagram of channel configuration data according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a single channel mode provided by embodiments of the present invention;

FIG. 4 is a schematic diagram of a dual channel mode provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a multi-channel mode provided by an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a matrix processing apparatus 600 according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of a data selecting unit 602 according to an embodiment of the present invention;

FIG. 8 is a block diagram of an execution module with 4 rows and 4 columns according to an embodiment of the present invention;

FIG. 9 is an information diagram of an input matrix X and an output matrix Y according to an embodiment of the present invention;

FIG. 10 is a diagram illustrating information of another channel configuration data according to an embodiment of the present invention;

FIG. 11 is a schematic diagram of another single-channel execution module provided in embodiments of the invention;

FIG. 12 is a diagram illustrating information of an input vector according to an embodiment of the present invention;

FIG. 13 is an information diagram of a target element and an output vector according to an embodiment of the present invention;

FIG. 14 is a diagram illustrating information of another channel configuration data according to an embodiment of the present invention;

FIG. 15 is a schematic diagram of another dual channel execution module provided by embodiments of the present invention;

fig. 16 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It will be understood that when an element is referred to as being "connected to" another element, or "coupled" to one or more other elements, it can be directly connected to the other element or be indirectly connected to the other element.

Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

The following specifically describes embodiments of the present invention.

Fig. 1 is a schematic flow chart of a matrix processing method according to an embodiment of the present invention. The method can be used for realizing the position movement of the matrix elements, wherein the data types of the matrix elements can be integer types or floating point types; the data bit width of the matrix elements may be a non-negative integer power of 2. The method provides a data selection array, which comprises P rows and Q columns of data selection units, wherein the bit width of data supported by one data selection unit can be a non-negative integer power of 2 and is also the minimum bit width supported by the data selection array, and the sum of the bit widths of all the data selection units in one row in the data selection array is the maximum bit width supported by the data selection array. The aforementioned maximum bit width or minimum bit width may also be referred to as a maximum bit or a minimum bit. As illustrated in fig. 1, the method comprises the steps of:

step S101: channel configuration data, execution instructions, and an input matrix are obtained. The input matrix comprises A rows and W columns of elements; the channel configuration data includes L sets of fields, each set of fields includes K bits, that is, the bit width of each set of fields is K. The number of sets L of fields of the channel configuration data is equal to the number of columns Q of the data selection array. In the channel configuration data, at least one set of fields includes a target sequence number. The number of sets L of fields of the channel configuration data is equal to the number of columns Q of the data selection array. The value of the row number A in the input matrix is not more than the row number P of the data selection array, and the column number W is not limited. In case the number of rows a of the input matrix is smaller than the data selection array P, only the a row data selection unit may be activated for processing. In some preferred embodiments, the data selection array comprises 64 rows and 8 columns of data selection cells, then the maximum number of rows a of the input matrix is equal to 64; when the number of rows of the input matrix is 48, only the first 48 rows of the data selection array are activated for matrix processing.

Fig. 2 is a schematic diagram of channel configuration data according to an embodiment of the present invention. As shown in fig. 2, the channel configuration data includes L sets of fields, at least one set of fields in the sets of fields includes a target sequence number, and the data selection unit selects a corresponding target element according to the target sequence number. One set of fields comprises K bits, K log ₂A; wherein, the bit in the channel configuration data is in the interval [ nK-1, (n-1) K]Is set as the nth group field. I.e., group 1 fields consisting of [ K-1, 0]]Is composed of bits of (1), the field of the 2 nd group is composed of [2K-1, K]And so on. The maximum value of the target sequence number is the maximum value which can be represented by K bits, namely the row number A of the input matrix. In some embodiments, the number of rows A of the input matrix is 128, and the channel configuration data isEach group of fields includes 7 bits, then the 7 bits can represent 128 target sequence numbers, that is, 128 channel configuration data are required to process one input matrix of 128 rows, the number of groups of the 128 channel configuration data is equal to the number of bits of each group, and the difference is that the 128 channel configuration data includes different target sequence numbers.

Step S102: and constructing a corresponding execution channel according to the execution instruction and the channel configuration data. The execution instructions comprise instructions corresponding to multiple channel modes, such as a single-channel execution instruction, a double-channel execution instruction, a four-channel execution instruction, an eight-channel execution instruction, a sixteen-channel execution instruction and the like. And each different type of instruction sends at least one group of fields including the target sequence number in the channel configuration data to the data selection array, and at least one column of data selection units receiving the same group of fields in the data selection array is constructed into an execution channel. In a preferred embodiment, the data selection array includes 128 rows and 8 columns of data selection units, the data bit width supported by each data selection unit is 8 bits, and the maximum data bit width supported by the data selection array is 64 bits, so that the data selection array can be constructed as an execution channel through a single-channel execution instruction, the data type of the input matrix supported by the execution channel can be a floating point number or an integer, and the data bit width is 64 bits; the data selection array can be further constructed into two execution channels through a two-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, and the data bit width is 32 bits; the data selection array can be further constructed into four execution channels through a four-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, and the data bit width is 16 bits; the data selection array can be further constructed into eight execution channels through an eight-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, and the data bit width is 8 bits. The embodiment of the invention can also be expanded upwards, when the data bit width supported by the data selection array is 128 bits, the data selection array can be constructed into an execution channel through a single-channel execution instruction, the data type of the input matrix supported by the execution channel can be a floating point number or an integer, and the data bit width is 128 bits; the data selection array can be further constructed into two execution channels through a two-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, and the data bit width is 64 bits; the data selection array can be further constructed into four execution channels through a four-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, and the data bit width is 32 bits; the data selection array can be further constructed into eight execution channels through an eight-channel execution instruction, the data type of an input matrix supported by the execution channels can be a floating point number or an integer, the data bit width is 16 bits, the data selection array can be further constructed into sixteen execution channels according to a sixteen-channel execution instruction, and the data bit width supported by the execution channels is 8 bits.

In one embodiment, a single channel execute instruction sends the set 1 field in the channel configuration data to all data select units in the data select array. Fig. 3 is a schematic diagram of a single channel mode according to an embodiment of the present invention. As shown in fig. 3, the fields received by each data selection unit in each column in the data selection array are all the 1 st group of fields, i.e. the bit intervals in the channel configuration data are all [ K-1, 0], so that the entire data selection array is divided into one execution channel. Of particular note, each row includes the same target sequence number in group 1 field and each column includes a different target sequence number in group 1 field. The difference between each channel configuration data in fig. 3 is that the target sequence numbers contained therein are different.

In another embodiment, the dual lane execute instruction sends the set 1 and set 2 fields of the lane configuration data to the data select array. Fig. 4 is a schematic diagram of a dual channel mode according to an embodiment of the present invention. As shown in fig. 4, when the first Q/2 column data selecting unit in the data selecting array receives the 1 st group of fields in the channel configuration data, and the second Q/2 column data selecting unit receives the 2 nd group of fields in the channel configuration data, the first Q/2 column data selecting unit in the data selecting array forms the execution channel 1, and the second Q/2 column data selecting unit in the data selecting array forms the execution channel 2. It should be noted that the aforementioned dual-channel execution instruction for sending the 1 st and 2 nd groups of fields to the data selection array is only an exemplary illustration, and not only the 1 st and 2 nd groups of fields to the data selection array, but also the two groups of fields including the target sequence number may be used to send any two different groups of fields including the target sequence number to the data selection array in the channel configuration instruction.

In another embodiment, an n-lane execution instruction sends n fields of lane configuration data to the data selection array, the n fields including the target sequence number. Fig. 5 is a schematic diagram of a multi-channel mode according to an embodiment of the present invention. As shown in fig. 5, each column of data selecting units in the data selecting array receives a different set of fields in the configuration data of each channel, and the data selecting array is divided into n execution channels, where n has a value ranging from 1 to the number of columns Q in the data selecting array. In another embodiment, the number of columns Q in the data selection array is equal to the number of sets L of channel configuration data. It is noted that the same group field in each row includes the same target sequence number, and the same group field in each column includes a different target sequence number.

Step S103: and carrying out element selection recombination on the input matrix through an execution channel to obtain an output matrix with transformed element positions. The input matrix enters the data selection units in the execution channel according to columns, and each data selection unit selects a selection result corresponding to the target sequence number from the received column of vectors according to the target sequence number. A row in an execution channel will get an element in a column vector, i.e. the selection result. And forming the selection result of each line in one execution channel into an output vector, and forming an output matrix by a plurality of output vectors. In a preferred embodiment of the element selection reorganization, the selection results output by each line in the execution channel are sequentially arranged according to the number of lines in the execution channel, that is, the selection result output by the first line in the execution channel is the first element in the output vector, the selection result output by the second line in the execution channel is the second element in the output vector, and so on, the selection result output by the I-th line in the execution channel is the I-th element in the output vector. Where the maximum value of I is equal to the maximum value of the column vector of the input matrix.

In another embodiment, an implementation is provided for transposing elements within an input matrix. The elements in a column of vectors in the input matrix are divided into a plurality of segments according to different bits, and the same bit segment in all the elements in one column of vectors is divided into one sub-vector. Inputting the sub-vectors into each data selection unit in a column in the data selection array, wherein one data selection unit selects one element in the sub-vectors as a target element according to the target sequence number; combining and splicing target elements output by a row of data selection units in one execution channel into a selection result; and combining the selection results of each row in the execution channel into one output vector, and forming an output matrix by a plurality of output vectors.

Fig. 6 is a schematic structural diagram of a matrix processing apparatus 600 according to an embodiment of the present invention. As shown in fig. 6, the matrix processing apparatus 600 includes an execution module 601, where the execution module 601 is formed by a data selection array with P rows and Q columns, and includes P × Q data selection units 602. The execution module 601 is used for receiving channel configuration data, execution instructions, and input matrices. The execution modules can be divided into execution channels of different channel modes according to different types of execution instructions and channel configuration data. The data selection unit in the execution channel comprises target serial numbers acquired from channel configuration data, and when a column of vectors in the input matrix is received, the execution channel selects a selection result according to each target serial number. One execution lane may be used to perform vector element position transformation on one of the columns of vectors in the input matrix according to the target sequence number in the lane configuration instruction. One row in one execution channel can obtain one selection result, all the selection results in one execution channel form an output vector, and the output vector forms an output matrix. In one embodiment, the number of columns Q in the data selection array in the execution block 601 may take a value that is a non-negative integer power of 2. The maximum number of execution channels that can be constructed is the ratio of the highest bit width to the lowest bit width of the data supported by the data selection array. In one embodiment, the highest bit width of the data supported by the data selection array in the execution module is 64 bits, and the lowest bit width supported by the data selection array is 8 bits, then the execution module may be constructed as an execution channel, and the bit width of the input matrix element supported by the execution channel is 64 bits; eight execution channels can be constructed, and the bit width of the input matrix element supported by each execution channel in the eight execution channels is 8 bits. Obviously, in the case that the number Q of columns in the data selection array in the execution module is fixed, the larger the number of execution channels that can be constructed, the smaller the bit width of the element of the input matrix supported by each execution channel.

Fig. 7 is a schematic structural diagram of a data selecting unit 602 according to an embodiment of the present invention. As shown in fig. 7, the data selection unit 602 includes a data selector 701 and a register 702. The register 702 is configured to receive and store a set of fields in the channel configuration data, where the fields include a target sequence number; the data selector 701 is configured to receive input data, where the input data includes a column of vectors in an input matrix, and the column of vectors includes a plurality of elements. The data selector 701 outputs a target element according to the target sequence number in the register 702. The target element may be an element in the input matrix or may be a partial bit width of an element in the input matrix.

For better understanding of the present invention, a specific input matrix is taken as an example and is specifically described in combination with the matrix processing method and apparatus provided by the embodiment of the present invention. Fig. 8 is a schematic structural diagram of an execution module with 4 rows and 4 columns according to an embodiment of the present invention.

An execution block 801 is provided, as shown in fig. 8, the execution block 801 includes a data selection array consisting of 16

data selection units

802, 4 rows and 4 columns. Wherein, the maximum bit width of the data supported by the single data selection unit is 4 bits, the minimum bit width supported by the data selection array is 4 bits, and the maximum bit width supported by the data selection array is 16 bits.

An input matrix X is provided, as shown in fig. 9(a), the input matrix X includes 4 rows and 4 columns of elements, the data type in the input matrix is a floating point number, and the bit width is 16 bits. Now, the data matrix is shifted to obtain an output matrix Y as shown in fig. 9 (b). The bit width of the input matrix element is 16 bits, so the execution module shown in fig. 8 can be configured to perform the matrix element position shifting operation in a single channel mode, and the corresponding channel configuration data is shown in fig. 10(a), where the 1 st field includes the target sequence number.

Specifically, the execution module receives a single-channel execution instruction and four pieces of channel configuration data, and respectively sends the 1 st group of fields of the channel configuration data 1 to 4 to each data selection unit in the rows 1 to 4 in the data selection array according to the instruction, wherein the 1 st group of fields received by the first row of data selection units are all from the channel configuration data 1, and the included target serial numbers are all 11; the 1 st group of fields received by the second row of data selection unit are all from channel configuration data 2, and the included target sequence numbers are all 10; the 1 st group of fields received by the third row of data selection units are all from channel configuration data 3, and the included target sequence numbers are all 01; the fourth row data selection unit receives the 1 st group of fields from the channel configuration data 4, and the target sequence numbers are 00. The construction of the execution channel is completed after the target sequence number is received, and the single-channel execution module shown in fig. 11 is obtained.

Inputting the first column of vectors in the input matrix into the execution channel, wherein the column of vectors includes 4 elements each including 16 bits, as shown in fig. 12(a), each element may be divided into 4 bit segments each including 4 bits. Constructing the same bit segment of each element into one sub-vector, four sub-vectors are obtained as shown in fig. 12(b), in which sub-vector 1 is composed of bit segments of 0 to 3 for each element in the column vector, sub-vector 2 is composed of bit segments of 4 to 7 for each element in the column vector, sub-vector 3 is composed of bit segments of 7 to 11 for each element in the column vector, and sub-vector 4 is composed of bit segments of 12 to 15 for each element in the column vector.

The specific position moving process is as follows: inputting the sub-vector 1 into each data selection unit in a first column in a single execution channel in the execution module, inputting the sub-vector 2 into each data selection unit in a second column in the execution module, inputting the sub-vector 3 into each data selection unit in a third column in the execution module, and inputting the sub-vector 4 into each data selection unit in a fourth column in the execution module; as shown in fig. 13(a), each data selection data in each column selects one target element according to the target sequence number, and the 4-row and 4-column data selection units output 16 target elements in total. As shown in fig. 13(b), 4 selection results can be obtained by stitching and combining the target elements in each row in the execution channel into one selection result. The four selection results are combined to form an output vector obtained by shifting the element position of the first column vector in the input matrix, the execution process of the second, third and fourth column vectors in the input matrix is the same as the above process, and finally four output vectors with shifted element positions are obtained, and the four vectors form the output matrix shown in fig. 9 (b). In another embodiment, providing the channel configuration data as shown in fig. 10(b), and the target sequence numbers received by each row of the data selection array after the channel is executed are 11, 01, 10, 00, respectively, the output matrix as shown in fig. 9(c) can be obtained through the aforementioned processing procedure. It should be understood that the present invention can flexibly set the target serial number required in practical application through the channel configuration data, and is not limited to the target serial number mentioned in this embodiment.

Similarly, in another embodiment, the execution module shown in FIG. 8 may also be used to process an input matrix whose data type is integer and whose data bit width is 8 bits. Because the maximum bit width of the input matrix element is one half of the maximum bit width supported by the execution module, the execution module can be constructed into a dual-channel mode for parallel processing. Corresponding channel configuration data referring to fig. 14, as shown in fig. 14, the 1 st group field and the 3 rd group field in the channel configuration data include a target sequence number. In some other embodiments, the target sequence number in each set of fields may be set according to specific requirements. After receiving the dual-channel execution instruction, the execution module sends the 1 st group of fields to the first and second rows of data selection units in the execution module, and sends the 3 rd group of fields to the third and fourth rows of data selection units in the execution module, so as to obtain the execution module including two execution channels as shown in fig. 15. The position moving process in the dual-channel mode is as follows: the data bit width supported by a single data selection unit is 4 bits, 0-3 bit sections of each element in a first column in an input matrix are sent to a first column data selection unit in a channel 1, and 4-7 bit sections of each element in the first column in the input matrix are sent to a second column data selection unit in the channel 1; the 0 to 3 bits of each element in the second column in the input matrix are sent to the first column data selection unit in lane 2, and the 4 to 7 bits of each element in the second column in the input matrix are sent to the second column data selection unit in lane 2. The subsequent processes are the same as those described above and are not described herein again. It should be noted that, in the single channel mode, the execution module can only execute the processing of one column of vectors in the input matrix at the same time, and in the dual channel mode, the execution module can execute the processing of two columns of vectors in the input matrix at the same time. In another embodiment, the execution module shown in fig. 8 may be further configured to process a matrix with an integer data type and a data bit width of 4 bits, at this time, the execution module may be divided into four execution channels, and four columns of vectors in the input matrix are executed in parallel, where the specific execution process refers to the foregoing process. It should be understood that the foregoing data bit widths are merely exemplary illustrations of the present embodiment, and do not represent that the present invention supports only the foregoing data bit widths. The data bit width supported by the data selection unit in the execution module provided by the invention can be a nonnegative integer power of 2, and the specific value can be set according to the actual requirement. The value of the channel number can be 1 to the non-negative integer power of 2 in the interval of the ratio of the maximum bit number and the minimum bit number supported by the execution module.

Based on the matrix processing method or device provided by the embodiment of the invention, one or more execution channels can be constructed through different types of execution instructions according to the specific data types and data bit widths of the input matrix while realizing the position movement of the elements in the matrix. In addition to processing the same matrix with the same data bit width as the maximum bit width supported by the execution module in a single channel manner, the matrix with the data bit width smaller than the maximum bit width supported by the execution module can be processed in parallel in a multi-channel manner. The element position movement of the matrix with the data type of floating point number can be realized in the same execution module, and the element position movement of the matrix with the data type of integer can also be realized. Corresponding execution channels are constructed according to the bit width of the input data for matrix processing, a special processing circuit is not needed to be arranged for a certain data type or data bit width, hardware redundancy in related processing equipment is reduced, hardware wiring is simplified, and the available space of hardware is saved. The matrix of the element data type with high bit width and high precision is decomposed into a plurality of matrixes of different data types with low bit width for parallel processing, so that the high-efficiency execution of the data with high bit width and high precision in a single channel can be ensured, the parallel execution of the data with low bit width in multiple channels can be ensured, and the high throughput is kept; besides improving the flexibility of input data type selection, the efficiency of the whole execution process is improved, and the time delay of the execution circuit is reduced.

Fig. 16 is a schematic structural diagram of a processing apparatus according to an embodiment of the present invention. The processing device 1600 shown in fig. 16 includes one or more processors 1601, a communication interface 1602 and a memory 1603, and the processors 1601, the communication interface 1602 and the memory 1603 may be connected by a bus or may be communicated by other means such as wireless transmission. The embodiment of the present invention is illustrated as being connected via the bus 1604. The memory 1603 is used for storing instructions, and the processor 1601 comprises the matrix processing apparatus disclosed in the above embodiments, and is used for executing the instructions stored in the memory 1603. The memory 1603 stores a program code, and the processor 1601 may call the program code stored in the memory 1603 to implement the related functions of the matrix processing apparatus 600, which may be referred to in the related descriptions of the foregoing embodiments specifically, and is not described herein again.

It should be understood that in the embodiment of the present invention, the Processor 1601 may be a Central Processing Unit (CPU), and the Processor may be other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The communication interface 1602 may be a wired interface (e.g., an ethernet interface) or a wireless interface (e.g., a cellular network interface or using a wireless local area network interface) for communicating with other modules or equipment devices. For example, the communication interface 1602 in the embodiment of the present application may be specifically configured to receive input data input by a user; or receive data from an external device, etc.

Memory 1603 may include Volatile Memory (Volatile Memory), such as Random Access Memory (RAM); the Memory may also include a Non-volatile Memory (Non-volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, HDD), or a Solid-State Drive (SSD); the memory may also comprise a combination of memories of the kind described above. The memory may be configured to store a set of program codes to facilitate the processor to invoke the program codes stored in the memory to implement the functionality associated with the matrix processing device 600 as previously described.

It should be noted that fig. 16 is only one possible implementation manner of the embodiment of the present invention, and in practical applications, the processing device may further include more or less components, which is not limited herein. For the content that is not shown or described in the embodiment of the present invention, reference may be made to the relevant explanation in the foregoing method embodiment, which is not described herein again.

An embodiment of the present invention further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the computer-readable storage medium is run on a processor, the matrix processing method flow is implemented. The storage medium includes a ROM/RAM, a magnetic disk, an optical disk, and the like.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described in a functional general in the foregoing description for the purpose of illustrating clearly the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal device and the unit described above may refer to corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of matrix processing, wherein a data selection array is provided, the data selection array comprising P rows and Q columns of data selection cells, the method comprising:

acquiring channel configuration data, wherein the channel configuration data comprises L groups of fields, and at least one group of fields in the L groups of fields comprises a target sequence number;

acquiring an execution instruction, and sending at least one group of fields including the target sequence number to the data selection array according to the execution instruction, wherein one data selection unit receives one group of fields; dividing at least one column of the data menu elements comprising the same group of fields into an execution channel according to the execution instruction;

acquiring an input matrix, inputting the input matrix into the data selection unit in the execution channel in a row, and selecting by the data selection unit according to the target serial number to obtain a selection result; and recombining each selection result to obtain an output vector after element position transformation, wherein the output vector forms an output matrix.

2. The matrix processing method according to claim 1, wherein the execution channel comprises an N-channel mode, wherein a value of N comprises 2 to M power, and wherein M has a value in a range of [0, log [ ] ₂Q](ii) a The value of Q comprises a non-negative integer power of 2; the L is equal to the Q.

3. The matrix processing method of claim 2, wherein the execution instruction comprises a first execution instruction, wherein the N-channel mode comprises a first channel mode, and wherein the first execution instruction corresponds to the first channel mode.

4. The matrix processing method according to claim 1, wherein the input matrix comprises column vectors, wherein elements in the column vectors comprise bit segments, and wherein one of the column vectors is divided into a plurality of sub-vectors according to the bit segments, and wherein the plurality of sub-vectors comprises a first sub-vector;

the data selection array comprises a first data selection unit, a second data selection unit and a third data selection unit, wherein the first data selection unit is used for receiving the target sequence number and the first sub-vector and selecting a target element from the first sub-vector according to the target sequence number; and splicing the target elements in one row in one execution channel to obtain the selection result.

5. A matrix processing apparatus, the apparatus comprising an execution module, the execution module comprising a data selection array, the data selection array comprising P rows and Q columns of data selection units;

the execution module is used for receiving channel configuration data; the channel configuration data comprises a plurality of groups of fields, and at least one group of fields in the plurality of groups of fields comprises a target sequence number;

the execution module comprises at least one execution channel, the execution channel comprises at least one column of the data selection units, and the data selection units in the same execution channel comprise the same group of the fields;

the execution module is further used for receiving an input matrix, the input matrix enters the data selection unit in a row, and the data selection unit selects according to the target serial number to obtain a selection result; and combining the selection results in the same execution channel to obtain an output vector, wherein the output vector forms an output matrix.

6. The matrix processing apparatus of claim 5 wherein the lane configuration data includes L sets of fields, the L selecting a ratio of a highest bit width to a lowest bit width of data supported by the array for the data.

7. The apparatus according to claim 5 or 6, wherein the input matrix comprises a column vector, wherein elements in the column vector comprise bit segments, and wherein the column vector is divided into a plurality of sub-vectors according to the bit segments, and wherein the plurality of sub-vectors comprises a first sub-vector;

the data selection array comprises a first data selection unit, a second data selection unit and a third data selection unit, wherein the first data selection unit is used for receiving the target sequence number and the first sub-vector and selecting a target element from the first sub-vector according to the target sequence number; the selection result includes the target element in a row within one of the execution lanes.

8. The matrix processing apparatus of claim 6, wherein the value of Q comprises a non-negative integer power of 2; the L is equal to the Q.

9. The matrix processing apparatus according to claim 8, wherein the execution module is further configured to receive an execution instruction, and the execution instruction is configured to send the channel configuration data to the corresponding data selection unit in the execution module; the execution channel comprises a plurality of channel modes, including a first channel mode; the execution instruction comprises a first execution instruction, and the first execution instruction corresponds to the first channel mode.

10. A processing device, comprising:

a memory for storing a computer program;

a processor comprising at least the matrix processing apparatus of any of claims 5 to 9; the processor is adapted to carry out the steps of the matrix processing method according to any one of claims 1 to 4 when executing the computer program.

11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the matrix processing method according to any one of claims 1 to 4.