CN112230993A

CN112230993A - Data processing method and device and electronic equipment

Info

Publication number: CN112230993A
Application number: CN202011048741.9A
Authority: CN
Inventors: 薛大庆
Original assignee: Haiguang Information Technology Co Ltd
Current assignee: Haiguang Information Technology Co Ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-01-15

Abstract

A data processing method, a data processing device and an electronic device are provided. The data processing method comprises the following steps: obtaining a first vector, a second vector and an immediate operand, the first vector comprising a plurality of first elements; selecting one of the plurality of first elements as a broadcast element based on the immediate operand; and performing an arithmetic operation on the second vector and the broadcast element to obtain an operation result. The data processing method can effectively reduce the amount of code, provides flexibility for vector-vector arithmetic operations, can improve operation efficiency, and facilitates high-performance, energy-saving designs.

Description

Data processing method and device and electronic equipment

Technical Field

The embodiment of the disclosure relates to a data processing method and device and electronic equipment.

Background

With the development of instruction set architecture technology, vector computation is increasingly used in the design of Central Processing Units (CPUs), and plays an important role in high-performance computing and in the training and inference of neural-network-based Artificial Intelligence (AI) models. A CPU that supports vector computation can perform arithmetic operations with vectors as operands, thereby processing a plurality of different values in a vector with a single instruction and improving processing efficiency.

Disclosure of Invention

At least one embodiment of the present disclosure provides a data processing method, including: obtaining a first vector, a second vector, and an immediate operand, wherein the first vector includes a plurality of first elements; selecting one of the plurality of first elements as a broadcast element based on the immediate operand; and performing arithmetic operation on the second vector and the broadcast element to obtain an operation result.

For example, in a data processing method provided by an embodiment of the present disclosure, the immediate operand has a plurality of different values, and the different values of the immediate operand are in one-to-one correspondence with the plurality of first elements.

For example, in a data processing method provided by an embodiment of the present disclosure, selecting one of the plurality of first elements as the broadcast element based on the immediate operand includes: selecting the first element corresponding to the immediate operand as the broadcast element according to the correspondence between the numerical value of the immediate operand and the first elements.

For example, in a data processing method provided in an embodiment of the present disclosure, selecting, as the broadcast element, a first element corresponding to the immediate operand according to a correspondence between a numerical value of the immediate operand and the first element includes: using a multiplexer, the immediate operand being an input to a selection control terminal of the multiplexer, the first elements being inputs to input terminals of the multiplexer, and the multiplexer being caused to output a first element corresponding to the immediate operand as the broadcast element.

For example, in a data processing method provided in an embodiment of the present disclosure, the immediate operand is an immediate, and the immediate is a scalar value.

For example, in a data processing method provided by an embodiment of the present disclosure, the plurality of first elements in the first vector are sequentially arranged and sequentially numbered from 0 to N, where N is an integer greater than 0, and the broadcast element is the P-th first element of the plurality of first elements, where P is 0, ..., or N.

For example, in a data processing method provided in an embodiment of the present disclosure, the first element is a scalar value.

For example, in a data processing method provided by an embodiment of the present disclosure, the arithmetic operation includes at least one of an addition operation, a subtraction operation, a multiplication operation, a division operation, a multiplication-addition operation, and a comparison operation.

For example, in the data processing method provided in an embodiment of the present disclosure, the operation result is a vector result.

For example, in a data processing method provided by an embodiment of the present disclosure, the second vector includes a plurality of second elements, and performing an arithmetic operation on the second vector and the broadcast element to obtain the operation result includes: performing an arithmetic operation on each of the plurality of second elements and the broadcast element to obtain the operation result.

For example, in the data processing method provided by an embodiment of the present disclosure, the operation result includes a third vector, and the third vector includes a plurality of third elements, and the number of the plurality of third elements is the same as the number of the plurality of second elements.

For example, in a data processing method provided by an embodiment of the present disclosure, the data processing method is applied to an apparatus based on an instruction set architecture.

At least one embodiment of the present disclosure also provides a data processing apparatus including a multiplexer, a plurality of arithmetic units, and a plurality of registers; the multiplexer comprises a selection control end, an output end and a plurality of input ends; a plurality of input ends of the multiplexer are connected with the first register; the plurality of arithmetic units are connected with the second register, and each of the plurality of arithmetic units is connected with the output end of the multiplexer; the first register is configured to store a first vector and the second register is configured to store a second vector; the multiplexer is configured to select and output one of a plurality of first elements of the first vector stored in the first register as a broadcast element based on an immediate operand acquired by the selection control terminal; the plurality of operation units are configured to perform arithmetic operations on the second vector and the broadcast element stored in the second register according to the acquired operation instruction to obtain an operation result.

For example, an embodiment of the present disclosure provides the data processing apparatus, further including a controller, where the controller is connected to the plurality of arithmetic units and the selection control terminal of the multiplexer, and the controller is configured to perform instruction decoding, provide the decoded immediate operand to the selection control terminal of the multiplexer, and provide the decoded operation instruction to the plurality of arithmetic units.

For example, in the data processing apparatus provided in an embodiment of the present disclosure, the number of the plurality of arithmetic units is equal to the number of the plurality of second elements of the second vector stored in the second register, and the plurality of arithmetic units are in one-to-one correspondence with the plurality of second elements.

For example, in the data processing apparatus provided in an embodiment of the present disclosure, the plurality of registers further include a third register, the third register is connected to the plurality of arithmetic units, and the third register is configured to store the arithmetic results obtained by the plurality of arithmetic units.

For example, in the data processing apparatus provided in an embodiment of the present disclosure, the operation result includes a third vector, the third vector includes a plurality of third elements, the number of the plurality of operation units is equal to the number of the plurality of third elements, and the plurality of operation units are in one-to-one correspondence with the plurality of third elements.

At least one embodiment of the present disclosure also provides a data processing apparatus, including: an obtaining unit configured to obtain a first vector, a second vector, and an immediate operand, wherein the first vector includes a plurality of first elements; a selection unit configured to select one of the plurality of first elements as a broadcast element based on the immediate operand; an arithmetic unit configured to perform an arithmetic operation on the second vector and the broadcast element to obtain an operation result.

At least one embodiment of the present disclosure also provides an electronic device including: a processor; a memory including one or more computer program modules; wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, the one or more computer program modules comprising instructions for implementing the data processing method of any embodiment of the disclosure.

Drawings

To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.

Fig. 1 is a schematic flow chart of a data processing method according to some embodiments of the present disclosure;

Fig. 2A is a schematic diagram of a first vector used in a data processing method according to some embodiments of the present disclosure;

Fig. 2B is a schematic diagram of a second vector used in a data processing method according to some embodiments of the present disclosure;

Fig. 3 is an example of a correspondence between an immediate operand and a first element in a data processing method according to some embodiments of the present disclosure;

Fig. 4 is an example of selecting a first element using a multiplexer in a data processing method according to some embodiments of the present disclosure;

Fig. 5 is a schematic logical structure diagram of a data processing method according to some embodiments of the present disclosure;

Fig. 6A is a schematic diagram of an instruction format of an application example of a data processing method according to some embodiments of the present disclosure;

Fig. 6B is a block diagram illustrating a logic structure corresponding to the instruction shown in Fig. 6A;

Fig. 7 is a schematic block diagram of a data processing apparatus provided in some embodiments of the present disclosure;

Fig. 8 is a schematic logical structure diagram of a data processing apparatus according to some embodiments of the present disclosure;

Fig. 9 is a schematic block diagram of another data processing apparatus provided in some embodiments of the present disclosure;

Fig. 10 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure; and

Fig. 11 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.

Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. Also, the use of the terms "a," "an," or "the" and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.

Some instruction set architecture based devices support vector instructions, which involve a large number of element-by-element vector/vector and vector/scalar arithmetic operations. Flexible and efficient element-by-element encoding and selection schemes are crucial to improving the latency and throughput of the underlying applications. The Advanced Vector Extensions (AVX) instruction set under the X86 architecture and the Scalable Vector Extension (SVE) instruction set under the ARM architecture both contain vector instructions. Both instruction set architectures provide element-based arithmetic operations, such as the multiply-add, add and subtract operations denoted by opcodes such as MADD/ADD/SUB. The AVX instruction set also provides scalar arithmetic operations on two vector operands; however, these scalar arithmetic operations are performed only on the first scalar element of each source vector operand. In addition, a super multiply-add (super MADD) instruction operates on two source vector operands (e.g., V1 and V2) and three source scalars (e.g., a, b, c), and obtains the operation result a × V1 + b × V2 + c.

Arithmetic instructions in the current AVX instruction set are limited to either element-by-element arithmetic operations or arithmetic operations on the first pair of elements of the source vector operands. For example, let the symbol × denote an arithmetic operation, such as addition, subtraction, multiply-add (ADD/SUB/MADD) or another related operation. For two given vector operands V1 and V2, there are generally two operation modes for obtaining the result vector V0. The first operation mode is V0[i] = V1[i] × V2[i], where i ranges over all elements of the corresponding vectors; that is, the operation is a vector operation in which all elements of the vectors are operated on element by element. The second operation mode is V0[0] = V1[0] × V2[0], with V0[1..max] = 0 or holding its original value; that is, this mode implements a partial scalar operation in which only the first pair of elements is arithmetically operated on, and no corresponding arithmetic operation is performed on the remaining elements. In this mode, only broadcasting of the first element is realized.
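The two existing operation modes described above can be sketched as follows. This is a minimal Python illustration rather than AVX code: the function names `elementwise` and `first_pair_only`, the `dest` argument holding the prior contents of V0, and the `zero_rest` flag are all hypothetical, and multiplication merely stands in for the generic operation.

```python
import operator


def elementwise(v1, v2, op=operator.mul):
    """First mode: V0[i] = V1[i] op V2[i] for every index i."""
    if len(v1) != len(v2):
        raise ValueError("source vectors must have the same length")
    return [op(a, b) for a, b in zip(v1, v2)]


def first_pair_only(v1, v2, dest, op=operator.mul, zero_rest=True):
    """Second mode: only V0[0] is computed.

    The remaining elements are either zeroed or keep the values that were
    already in dest (the hypothetical prior contents of V0).
    """
    rest = [0] * (len(dest) - 1) if zero_rest else list(dest[1:])
    return [op(v1[0], v2[0])] + rest


v1 = [1, 2, 3, 4]
v2 = [10, 20, 30, 40]
print(elementwise(v1, v2))                                          # [10, 40, 90, 160]
print(first_pair_only(v1, v2, dest=[7, 7, 7, 7]))                   # [10, 0, 0, 0]
print(first_pair_only(v1, v2, dest=[7, 7, 7, 7], zero_rest=False))  # [10, 7, 7, 7]
```

Neither mode lets an element other than V1[0] act on every element of V2, which is the gap the disclosed method addresses.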

The above two operation modes are only examples and are not limited to two source operands V1 and V2. The same applies to a super multiply-add (super MADD) operation with three source operands: the operation may be performed on all elements element by element, or only on the first pair of elements (the first element is broadcast and operated on). The super MADD operation is designed for specific applications and lacks flexibility.

Current arithmetic instructions cannot readily broadcast an arbitrary scalar element of a source vector operand; that is, they cannot broadcast an arbitrary scalar element of one source vector operand and apply a basic operation element-wise to another source operand. As a result, more code is needed to implement a vector-vector arithmetic function when writing a program, program design is insufficiently flexible, and operation efficiency is low.

At least one embodiment of the disclosure provides a data processing method and device and electronic equipment. The data processing method can broadcast any scalar element of one source vector operand and apply a basic operation element-wise to another source operand, so that the amount of code can be effectively reduced, flexibility is provided for vector-vector arithmetic operations, operation efficiency can be improved, and high-performance, energy-saving designs are facilitated.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. It should be noted that the same reference numerals in different figures will be used to refer to the same elements that have been described.

At least one embodiment of the present disclosure provides a data processing method, including: obtaining a first vector, a second vector and an immediate operand, the first vector comprising a plurality of first elements; selecting one of the plurality of first elements as a broadcast element based on the immediate operand; and performing an arithmetic operation on the second vector and the broadcast element to obtain an operation result.

Fig. 1 is a schematic flow chart of a data processing method according to some embodiments of the present disclosure. As shown in fig. 1, the data processing method includes the following operations.

Step S10: obtaining a first vector, a second vector and an immediate operand, wherein the first vector comprises a plurality of first elements;

Step S20: selecting one of the plurality of first elements as a broadcast element based on the immediate operand;

Step S30: performing an arithmetic operation on the second vector and the broadcast element to obtain an operation result.

For example, the above steps S10-S30 are all performed by a device having data processing capability, such as a CPU, to implement vector operations.
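Steps S10 to S30 can be sketched end to end as follows. This is a minimal Python sketch under stated assumptions, not the disclosed hardware: the function name `broadcast_arith` is hypothetical, and the immediate is assumed here to be a zero-based index into the first vector, which is only one possible encoding.

```python
import operator


def broadcast_arith(v1, v2, imm, op=operator.add):
    """Sketch of steps S10-S30 (hypothetical helper)."""
    # S10: v1 (first vector), v2 (second vector) and imm (immediate) are obtained.
    # S20: select one first element as the broadcast element.
    #      Assumption: imm is a zero-based index into v1.
    if not 0 <= imm < len(v1):
        raise ValueError("immediate does not correspond to any first element")
    broadcast = v1[imm]
    # S30: apply the operation to every second element and the broadcast element.
    return [op(y, broadcast) for y in v2]


print(broadcast_arith([1, 2, 3, 4], [10, 20, 30, 40], imm=2))  # [13, 23, 33, 43]
```

A single call replaces the separate extract, broadcast and arithmetic steps that a program would otherwise have to spell out.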

For example, in step S10, the first vector is a vector (also referred to as a one-dimensional tensor) stored in a first register, for example a vector register used for storing vector data. The first vector may be the result of other operations performed by the CPU before the data processing method is executed, for example a result obtained by decoding an instruction or by performing other operations, which is not limited by the embodiments of the present disclosure.

For example, the first vector includes a plurality of first elements, and the plurality of first elements are sequentially stored in the first register, that is, in a segment of storage addresses corresponding to the first register. For example, as shown in fig. 2A, in some examples, the first vector V1 includes 4 first elements, namely X0, X1, X2, and X3, which are stored in the first register in sequence. X0 is the first element in the first vector V1, X1 is the second element in the first vector V1, and so on. It should be noted that, in the embodiment of the present disclosure, the number of first elements is not limited to 4 and may be any number, that is, the first elements included in the first vector V1 may be X0, X1, ..., Xn, where n is any positive integer. The number of first elements may be determined according to actual requirements, and the embodiment of the disclosure is not limited thereto.

For example, each first element is a scalar value, and the data type of the first element may be a double-precision floating point number, a single-precision floating point number, an integer number, a long integer number, or the like, and may also be any other data type, which is not limited in this respect by the embodiments of the present disclosure. The plurality of first elements form a first vector, and the first elements are scalar elements in the first vector.

For example, the second vector is a vector (also referred to as a one-dimensional tensor) stored in a second register, for example a vector register used for storing vector data. The second vector may be the result of other operations performed by the CPU before the data processing method is executed, for example a result obtained by decoding an instruction or by performing other operations, which is not limited by the embodiments of the present disclosure.

For example, the second vector includes a plurality of second elements, and the plurality of second elements are sequentially stored in the second register, that is, in a segment of storage addresses corresponding to the second register. For example, as shown in fig. 2B, in some examples, the second vector V2 includes 4 second elements, namely Y0, Y1, Y2, and Y3, which are stored in the second register in sequence. Y0 is the first element in the second vector V2, Y1 is the second element in the second vector V2, and so on. It should be noted that, in the embodiment of the present disclosure, the number of second elements is not limited to 4 and may be any number, that is, the second elements included in the second vector V2 may be Y0, Y1, ..., Ym, where m is any positive integer. The number of second elements may be determined according to actual requirements, and embodiments of the present disclosure are not limited thereto.

For example, each second element is a scalar value, and the data type of the second element may be a double-precision floating point number, a single-precision floating point number, an integer number, a long integer number, or the like, and may also be any other data type, which is not limited in this respect by the embodiments of the present disclosure. The plurality of second elements form a second vector, and the second elements are scalar elements in the second vector.

It should be noted that, in the embodiment of the present disclosure, the number of the first elements and the number of the second elements may be the same or different, and the data type of the first elements and the data type of the second elements may also be the same or different, which may be determined according to actual needs, and the embodiment of the present disclosure is not limited thereto.

For example, the immediate operand is an immediate, which is a scalar value given by immediate addressing. For example, in some examples, the immediate operand is obtained from instruction decoding and can be used directly after decoding without being stored in a register. For example, the immediate operand is obtained from the decoder and forwarded to the execution unit (e.g., to the multiplexer described below), so no additional read port in the physical register file is needed to obtain it. For example, when code is written in a computer language (e.g., assembly language, C++, etc.), the immediate operand may be assigned directly in the code as needed. Of course, the embodiments of the present disclosure are not limited thereto: the immediate operand may also be stored in a register as needed, and its value may also be derived by a certain operation rule, which may be determined according to actual requirements and is not limited by the embodiments of the present disclosure.

For example, the immediate operand has a plurality of different values, and the different values of the immediate operand are in one-to-one correspondence with the plurality of first elements. That is, the immediate operand may be given different values so as to be mapped to corresponding first elements according to the value of the immediate operand. For example, in some examples, as shown in FIG. 3, the immediate operand Imm is a 3-bit binary number, and different numerical values of the immediate operand Imm have a one-to-one correspondence with the plurality of first elements X0-X3. When the immediate operand Imm is 001, the corresponding first element is X0; when the immediate operand Imm is 010, the corresponding first element is X1; when the immediate operand Imm is 011, the corresponding first element is X2; when the immediate operand Imm is 100, the corresponding first element is X3. Thus, from the value of the immediate operand Imm, the corresponding first element can be determined.

It should be noted that, in the embodiment of the present disclosure, the immediate operand is not limited to a 3-bit binary number and may be a binary number of any width, for example, a 6-bit binary number, an 8-bit binary number, and the like, which may be determined according to actual requirements and is not limited by the embodiment of the present disclosure. The more bits the immediate operand has, the more values it can represent, and thus the more first elements it can correspond to. For example, when the first vector contains more first elements, the immediate operand may be given more bits, so that every first element corresponds to a value of the immediate operand.
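The relation between the number of first elements and the width of the immediate can be checked with a short calculation. This is a sketch under an assumption: codes are taken to be consecutive integers starting at `first_code`, as in the Fig. 3 example where the codes start at 001. The helper name `imm_bits_needed` is hypothetical.

```python
def imm_bits_needed(num_elements, first_code=1):
    """Smallest immediate width that holds one code per first element.

    Assumption: codes are the consecutive integers
    first_code, first_code + 1, ..., first_code + num_elements - 1.
    """
    if num_elements < 1 or first_code < 0:
        raise ValueError("need at least one element and a non-negative first code")
    largest_code = first_code + num_elements - 1
    return max(1, largest_code.bit_length())


print(imm_bits_needed(4))                # 3: codes 001..100, as in the Fig. 3 example
print(imm_bits_needed(4, first_code=0))  # 2: codes 00..11 under a zero-based encoding
print(imm_bits_needed(63))               # 6: a 6-bit immediate covers codes 1..63
```

Under the Fig. 3 style of encoding, four first elements need three bits because the code 100 is in use; a zero-based encoding would need only two.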

For example, as shown in fig. 1, in step S20, one of the plurality of first elements is selected as a broadcast element based on the immediate operand. Since different values of the immediate operand correspond to the first elements one-to-one, step S20 can be implemented as: selecting the first element corresponding to the immediate operand as the broadcast element according to the correspondence between the numerical value of the immediate operand and the first elements.

For example, in some examples, when the corresponding relationship of the immediate operand Imm and the first element is the case shown in fig. 3, the broadcast element may be determined according to the value of the immediate operand Imm. For example, when the immediate operand Imm is 001, the corresponding first element is X0, thus taking X0 as the broadcast element; when the immediate operand Imm is 010, the corresponding first element is X1, thus X1 is taken as the broadcast element; when the immediate operand Imm is 011, the corresponding first element is X2, thus taking X2 as the broadcast element; when the immediate operand Imm is 100, the corresponding first element is X3, thus X3 is taken as the broadcast element. It should be noted that the correspondence relationship between the immediate operand and the first element is only exemplary and not limiting, and in practical applications, the correspondence relationship between the immediate operand and the first element may be defined according to the requirements of program design.
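The example correspondence of Fig. 3 can be written as a lookup table. This is a minimal Python sketch: the names `FIG3_MAP` and `select_broadcast` are hypothetical, and raising an error for immediate values with no corresponding element (000 and 101 to 111) is an assumption, since the text does not say how such values are handled.

```python
# Immediate value -> index of the first element, following the Fig. 3 example.
FIG3_MAP = {0b001: 0, 0b010: 1, 0b011: 2, 0b100: 3}


def select_broadcast(v1, imm, mapping=FIG3_MAP):
    """Return the first element that corresponds to the immediate value."""
    try:
        index = mapping[imm]
    except KeyError:
        # Assumption: unmapped immediates are rejected.
        raise ValueError(f"immediate {imm:03b} has no corresponding first element") from None
    return v1[index]


v1 = ["X0", "X1", "X2", "X3"]
print(select_broadcast(v1, 0b011))  # X2
```

Because the mapping is passed as a parameter, a different correspondence can be substituted without changing the selection logic, in line with the remark that the correspondence may be defined according to program design requirements.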

Fig. 4 is an example of selecting a first element by a multiplexer in a data processing method according to some embodiments of the present disclosure. For example, in some examples, the first element may be selected by using a multiplexer MUX. When the data processing method is executed by a CPU, only the multiplexer MUX needs to be added to the CPU and the other hardware structures do not need to be modified, so the change to the hardware structure is small and the implementation is convenient.

For example, as shown in fig. 4, with the multiplexer MUX, the immediate operand Imm is input to the selection control terminal Con of the multiplexer MUX, and the plurality of first elements X0, X1, X2 and X3 are input to the plurality of input terminals I0, I1, I2 and I3 of the multiplexer MUX, so that the multiplexer MUX outputs the first element corresponding to the immediate operand Imm as the broadcast element Bro. For example, the plurality of first elements X0, X1, X2, and X3 are connected with the plurality of input terminals I0, I1, I2, and I3 of the multiplexer MUX in one-to-one correspondence. That is, the address of the first register G1 storing the first element X0 is connected to the input terminal I0, the address of the first register G1 storing the first element X1 is connected to the input terminal I1, the address of the first register G1 storing the first element X2 is connected to the input terminal I2, and the address of the first register G1 storing the first element X3 is connected to the input terminal I3. The immediate operand Imm is connected to the selection control terminal Con of the multiplexer MUX, which, based on the value of the immediate operand Imm, will connect one of the input terminals I0, I1, I2 and I3 to the output terminal OT, so that the corresponding first element is output as the broadcast element Bro.

For example, the correspondence between the immediate operand Imm and the first elements shown in fig. 3 is again used for illustration. When the immediate operand Imm is 001, the input I0 is connected to the output OT, so that the first element X0 is output, the broadcast element Bro at this time being X0; when the immediate operand Imm is 010, the input I1 is connected to the output OT, so that the first element X1 is output, the broadcast element Bro at this time being X1; when the immediate operand Imm is 011, the input I2 is connected to the output OT, so that the first element X2 is output, the broadcast element Bro at this time being X2; when the immediate operand Imm is 100, the input I3 is connected to the output OT, so that the first element X3 is output, the broadcast element Bro at this time being X3.

By employing the multiplexer MUX, the plurality of first elements can be selected and output based on the value of the immediate operand, whereby any one of the plurality of first elements can be selected as a broadcast element. The method has the advantages of small change on the hardware structure, convenient realization, and easy control and expansion.
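The behaviour of the multiplexer can be modelled in software as an AND-OR selection over fixed-width words. This is an illustrative sketch, not the disclosed circuit: the function name `mux`, the 32-bit default width and the choice to output 0 when no select code matches are all assumptions.

```python
def mux(inputs, select_codes, con, width=32):
    """AND-OR model of a multiplexer on unsigned words.

    inputs:       words present on the input terminals I0, I1, ...
    select_codes: the value of the select control Con that enables each input
    con:          the value applied to the select control terminal
    """
    mask = (1 << width) - 1
    out = 0
    for word, code in zip(inputs, select_codes):
        # Decoded select line, replicated across every bit of the word.
        enable = mask if con == code else 0
        out |= (word & mask) & enable
    return out  # Assumption: 0 when con matches no code.


inputs = [0x11, 0x22, 0x33, 0x44]     # X0..X3 on I0..I3
codes = [0b001, 0b010, 0b011, 0b100]  # select codes from the Fig. 3 example
print(hex(mux(inputs, codes, 0b010)))  # 0x22
```

Supporting more first elements only means extending `inputs` and `select_codes`, which mirrors the point that the multiplexer approach is easy to extend.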

For example, any one of the plurality of first elements may be selected as a broadcast element based on an immediate operand. That is, the plurality of first elements in the first vector are sequentially arranged and numbered from 0 to N, where N is an integer greater than 0, and the broadcast element may be the P-th first element of the plurality of first elements, where P is 0, ..., or N. For example, P may be equal to 0, may be equal to 1, may be equal to 2, and so on, depending on the value of the immediate operand and the correspondence between the values of the immediate operand and the plurality of first elements. A general arithmetic operation instruction can only operate on source vector operands element by element or only on the first element (scalar element), and cannot realize vector operations after broadcasting an arbitrary element (scalar element) of the source vector operands. In contrast, the data processing method provided by the embodiments of the present disclosure can broadcast any element (scalar element) of the source vector operand, not only the first element (scalar element), which greatly improves flexibility.

For example, as shown in fig. 1, in step S30, an arithmetic operation is performed on the second vector and the broadcast element to obtain an operation result. For example, the arithmetic operation may include at least one of an addition operation (ADD), a subtraction operation (SUB), a multiplication operation (MUL), a division operation (DIV), a multiply-ADD operation (MADD), and a comparison operation (CMP). Of course, the embodiments of the present disclosure are not limited thereto, and the arithmetic operation may also include any other suitable operation, which may be determined according to actual needs.

For example, the second vector includes a plurality of second elements, and when the arithmetic operation is performed on the second vector and the broadcast element, each of the plurality of second elements is operated on with the broadcast element to obtain the operation result. For example, in some examples, when an addition operation is performed, the broadcast element is added to each of the plurality of second elements; when a subtraction operation is performed, a subtraction is carried out between each of the plurality of second elements and the broadcast element; and so on.

For example, the operation result is a vector result. For example, the operation result includes a third vector including a plurality of third elements. In some examples, performing the arithmetic operation on one second element and the broadcast element yields one third element, so performing the arithmetic operation on each of the plurality of second elements and the broadcast element yields a plurality of third elements, and the number of third elements is the same as the number of second elements. The plurality of third elements form the third vector, which is the operation result.
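As a minimal software sketch of steps S10 to S30, the following Python function models the behavior described above. The function name `broadcast_op`, the use of Python lists as vectors, passing the selected position directly as `index`, and the operand order `second element, broadcast element` are illustrative assumptions and are not taken from the disclosure.

```python
import operator
from typing import Callable, List, Sequence


def broadcast_op(first: Sequence[float], second: Sequence[float], index: int,
                 op: Callable[[float, float], float]) -> List[float]:
    """Pick first[index] as the broadcast element and apply op to every second element."""
    if not 0 <= index < len(first):
        raise IndexError("index does not correspond to any first element")
    bro = first[index]                     # step S20: select the broadcast element
    return [op(y, bro) for y in second]    # step S30: one third element per second element


# Hypothetical values: broadcast the element numbered 1 (value 20) and subtract it.
third = broadcast_op([10, 20, 30, 40], [1, 2, 3, 4], 1, operator.sub)
print(third)
```

The result has the same number of elements as the second vector, which matches the description of the third vector.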

For example, the data processing method provided by the embodiment of the disclosure is applied to a device based on an instruction set architecture, and the data processing method can realize vector operation. For example, the apparatus may be a CPU or other device having data processing capabilities. The CPU may be based on an X86 architecture, such as 32-bit X86 or 64-bit X86, an ARM architecture, or other applicable instruction set architectures, which are not limited in this respect by the embodiments of the present disclosure.

Fig. 5 is a schematic logical structure diagram of a data processing method according to some embodiments of the present disclosure. For example, as shown in fig. 5, in some examples, a first vector V1 is stored in the first register G1, the first vector V1 includes first elements X0, X1, X2, and X3, a second vector V2 is stored in the second register G2, the second vector V2 includes second elements Y0, Y1, Y2, and Y3, and the first vector V1 and the second vector V2 are both source operands for performing arithmetic operations.

The first elements X0, X1, X2, and X3 are connected to input terminals I0, I1, I2, and I3 of a multiplexer MUX in one-to-one correspondence, a selection control terminal Con of the multiplexer MUX is connected to the immediate operand Imm, and the multiplexer MUX selects one of the first elements X0, X1, X2, and X3 to be output from an output terminal OT as a broadcast element Bro based on a value of the immediate operand Imm.

The plurality of arithmetic logic units ALU0, ALU1, ALU2, and ALU3 are respectively connected with a plurality of second elements Y0, Y1, Y2, and Y3 in a one-to-one correspondence, and the second elements Y0, Y1, Y2, and Y3 are respectively used as inputs of different arithmetic logic units. The output OT of the multiplexer MUX is connected to each arithmetic logic unit, with the broadcast element Bro as an input to each arithmetic logic unit.

The plurality of arithmetic logic units ALU0, ALU1, ALU2, and ALU3 perform arithmetic operations on the second elements Y0, Y1, Y2, and Y3 and the broadcast element Bro, which may be addition operations (ADD), subtraction operations (SUB), multiplication operations (MUL), division operations (DIV), multiply-add operations (MADD), comparison operations (CMP), or other suitable operations.

For example, the arithmetic logic unit ALU0 performs an arithmetic operation on the second element Y0 and the broadcast element Bro to obtain a third element Z0; the arithmetic logic unit ALU1 performs arithmetic operation on the second element Y1 and the broadcast element Bro to obtain a third element Z1; the arithmetic logic unit ALU2 performs arithmetic operation on the second element Y2 and the broadcast element Bro to obtain a third element Z2; the arithmetic logic unit ALU3 performs an arithmetic operation on the second element Y3 and the broadcast element Bro to obtain a third element Z3. The third register G3 is connected with outputs of a plurality of arithmetic logic units ALU0, ALU1, ALU2 and ALU3, and the third elements Z0, Z1, Z2 and Z3 constitute a third vector V3 and are stored in the third register G3.

For example, in some examples, assume that the immediate operand Imm = 011, the first vector V1 = [2,4,1,5], the second vector V2 = [0,3,2,0], and the arithmetic logic units need to perform an addition operation (ADD). Then, according to Imm = 011, the broadcast element Bro output by the multiplexer MUX is the third element "1" in V1 (i.e., corresponding to X2 described above), and the arithmetic logic units perform an addition operation on the second vector V2 and the element "1", thereby obtaining the result V3 = [1,4,3,1].

For example, in some other examples, still assume that the immediate operand Imm = 011, the first vector V1 = [2,4,1,5], and the second vector V2 = [0,3,2,0], but the arithmetic logic units now need to perform a multiplication operation (MUL). Then, according to Imm = 011, the broadcast element Bro output by the multiplexer MUX is the third element "1" in V1 (i.e., corresponding to X2 described above), and the arithmetic logic units perform a multiplication operation on the second vector V2 and the element "1", thereby obtaining the result V3' = [0,3,2,0].
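The two numerical examples can be reproduced with a short Python sketch. The table `IMM_TO_INDEX` is an assumption chosen only so that Imm = 011 selects X2 as stated in the examples; the actual correspondence between immediate values and first elements is whatever the design defines. The helper name `mux` is likewise hypothetical.

```python
import operator

# Assumed correspondence between immediate values and element positions,
# chosen so that Imm = 011 selects X2 (the third element), as in the examples.
IMM_TO_INDEX = {0b001: 0, 0b010: 1, 0b011: 2, 0b100: 3}


def mux(inputs, imm):
    """Model of the multiplexer MUX: output the input selected by the immediate."""
    return inputs[IMM_TO_INDEX[imm]]


V1 = [2, 4, 1, 5]   # first vector (register G1)
V2 = [0, 3, 2, 0]   # second vector (register G2)

bro = mux(V1, 0b011)                            # broadcast element Bro
V3_add = [operator.add(y, bro) for y in V2]     # ADD example
V3_mul = [operator.mul(y, bro) for y in V2]     # MUL example

print(bro, V3_add, V3_mul)
```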

The logical structure shown in FIG. 5 demonstrates a micro-architectural implementation of selecting a first element to broadcast in a first vector V1 for arithmetic operations. At the hardware design level, for current designs that already support the AVX instruction set, only one multiplexer MUX needs to be added and connected to all first elements in the first vector V1 so that a certain first element can be indexed in the first vector V1 for broadcast based on the immediate operand Imm. The change of the hardware structure is small, and the realization is convenient.

Fig. 6A is a schematic instruction format diagram of an application example of a data processing method according to some embodiments of the present disclosure. For example, in this example, an instruction is written to implement the data processing method provided by the embodiments of the present disclosure based on the X86 architecture. As shown in fig. 6A, the instruction for implementing the data processing method includes 6 fields: prefix, opcode, dest/src3, src1, src2, and imm8. prefix represents a prefix byte; opcode represents the basic arithmetic operation, which may be, for example, an addition (ADD), a subtraction (SUB), a multiplication (MUL), a division (DIV), a multiply-add (MADD), a comparison (CMP), or another suitable operation; dest/src3 represents the result register and/or the third source register in the multiply-add (MADD) operation, i.e., dest/src3 may be the third register G3 in fig. 5; src1 represents a source register and/or memory operand, i.e., src1 may be the second register G2 in fig. 5; src2 represents a source register and/or memory operand, i.e., src2 may be the first register G1 in fig. 5; imm8 represents an immediate operand, such as an 8-bit immediate operand, which may be the immediate operand Imm in fig. 5 and is used to index the scalar element in src2 to be broadcast. The design rules for the instruction format based on the X86 architecture may refer to conventional designs and will not be described in detail herein.
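To make the roles of the six fields concrete, the following Python sketch models an instruction record and a toy interpreter. It is only an illustration: the class name `Instr`, the string mnemonics, the register file as a dictionary, the placeholder prefix value, and the assumption that imm8 is used directly as a zero-based index into src2 are all hypothetical and do not describe an actual encoding.

```python
import operator
from dataclasses import dataclass


@dataclass
class Instr:
    prefix: int        # prefix byte (placeholder value in this sketch)
    opcode: str        # hypothetical mnemonic: "ADD", "SUB", "MUL" or "MADD"
    dest_src3: str     # result register; also the third source for MADD
    src1: str          # source vector operated on element by element
    src2: str          # source vector that supplies the broadcast element
    imm8: int          # assumed here to index the scalar in src2 directly


BINARY_OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def execute(instr: Instr, regs: dict) -> None:
    """Apply the instruction to a register file given as name -> list of lanes."""
    src1 = regs[instr.src1]
    bro = regs[instr.src2][instr.imm8]           # element selected by imm8
    if instr.opcode == "MADD":
        acc = regs[instr.dest_src3]              # dest doubles as third source
        regs[instr.dest_src3] = [x * bro + a for x, a in zip(src1, acc)]
    else:
        fn = BINARY_OPS[instr.opcode]
        regs[instr.dest_src3] = [fn(x, bro) for x in src1]


regs = {"G1": [2, 4, 1, 5], "G2": [0, 3, 2, 0], "G3": [1, 1, 1, 1]}
execute(Instr(0, "MADD", "G3", "G2", "G1", 3), regs)
print(regs["G3"])
```

In the multiply-add case the destination is read before it is overwritten, which is why the field is described as dest/src3.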

Fig. 6B is a logic diagram corresponding to the instruction shown in fig. 6A. The logical structure shown in fig. 6B is substantially the same as that shown in fig. 5 except for the component labels. As shown in fig. 6B, src1 is the second register G2, src2 is the first register G1, imm8 is the immediate operand Imm, and dest is the third register G3. For a detailed description of the logical structure, reference is made to the foregoing description, which is not repeated herein.

The data processing method provided by the embodiments of the present disclosure can be applied to linear algebra operations, and provides a new arithmetic instruction that can broadcast any element in a source vector operand. A typical use case of this data processing method is matrix multiplication. For example, a conventional operation method and an operation method based on the data processing method of the embodiments of the present disclosure are briefly described below, taking 4 × 4 matrix multiplication as an example. For example, the following operation needs to be performed: C = A × B, where A and B are known 4 × 4 matrices and C is the 4 × 4 matrix to be calculated. For example, the matrices A, B, and C are column-based, that is, the operation works on column vectors and the intermediate results are computed as column vectors.

Based on the AVX256 instruction set, the kernel code of a typical open-source Basic Linear Algebra Subprograms (BLAS) library is as follows.

For example, in the comments of the above code, a00 to a33 denote the 16 elements of matrix A, and b00 to b33 denote the 16 elements of matrix B. Here, aij denotes the element in row i and column j of matrix A, and bij denotes the element in row i and column j of matrix B. Line 1 of the code reads the 1st column of matrix A and stores it in register ymm0, line 2 reads the 2nd column of matrix A and stores it in register ymm1, and so on. By executing lines 1 to 4, the 4 columns of matrix A are stored in registers ymm0, ymm1, ymm2, and ymm3, respectively.

Line 5 of the code reads element b00 of matrix B and stores it in register ymm12. Note that element b00 is broadcast here, so that 4 copies of b00 are stored in register ymm12, i.e., ymm12 = [b00, b00, b00, b00]. Similarly, line 6 reads element b10 of matrix B and stores it in register ymm13, line 7 reads element b20 and stores it in register ymm14, and line 8 reads element b30 and stores it in register ymm15. Thus, the 1st column elements of matrix B are stored in 4 registers.

Line 9 of the code performs a multiply-add operation on ymm0 and ymm12, that is, ymm4 = ymm0 × ymm12 + ymm4; the specific operation result can be seen in the comments of the code. By executing lines 9 to 12, the 1st element in ymm4 becomes the sum of the products of row 1 of matrix A with the corresponding elements of column 1 of matrix B, i.e., a00b00 + a01b10 + a02b20 + a03b30; the 2nd element in ymm4 becomes the sum of the products of row 2 of matrix A with the corresponding elements of column 1 of matrix B, i.e., a10b00 + a11b10 + a12b20 + a13b30; and so on. Thus, the 4 elements in ymm4 are the column 1 elements of matrix C.

Lines 13 to 16 perform operations similar to those of lines 5 to 8, so that the column 2 elements of matrix B are stored in 4 registers. Similarly, lines 17 to 20 perform operations similar to those of lines 9 to 12, so that ymm5 holds the column 2 elements of matrix C. The calculation continues in the same way to obtain the column 3 and column 4 elements of matrix C. This completes the multiplication of matrix A and matrix B and yields matrix C.

The above code uses 36 instructions to implement the multiplication of 4 × 4 matrices; the instruction count is large and the operation efficiency is low.
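The instruction count of the conventional approach can be checked with a small Python model of the steps described above. This is only a sketch of the described sequence (4 column loads of A, then for each column of C, 4 broadcast loads and 4 multiply-adds); it is not the library's kernel code, and the function name `conventional_kernel` and the row-list matrix layout are assumptions made for illustration.

```python
def conventional_kernel(A, B):
    """Model the conventional 4x4 kernel. A and B are lists of rows.

    Returns the product C (list of rows) and the number of modeled instructions.
    """
    n = 4
    count = 0

    # Load the 4 columns of A into vector registers, one load each.
    a_cols = []
    for k in range(n):
        a_cols.append([A[i][k] for i in range(n)])
        count += 1

    c_cols = []
    for j in range(n):
        # Broadcast-load each element of column j of B into its own register.
        b_bcast = []
        for k in range(n):
            b_bcast.append([B[k][j]] * n)
            count += 1
        # Four multiply-adds accumulate column j of C.
        acc = [0] * n
        for k in range(n):
            acc = [a * b + c for a, b, c in zip(a_cols[k], b_bcast[k], acc)]
            count += 1
        c_cols.append(acc)

    C = [[c_cols[j][i] for j in range(n)] for i in range(n)]
    return C, count


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C, count = conventional_kernel(A, B)
print(count)
```

The modeled count is 4 + 4 × (4 + 4) = 36, matching the figure stated above.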

Based on the AVX256 instruction set, the kernel code to which the data processing method provided by the embodiments of the present disclosure is applied is as follows.

Line 5 of the code reads the 1st column of matrix B and stores it in register ymm12, line 6 reads the 2nd column of matrix B and stores it in register ymm13, and so on. By executing lines 5 to 8, the 4 columns of matrix B are stored in registers ymm12, ymm13, ymm14, and ymm15, respectively.

Line 9 of the code performs scalar broadcast using the data processing method of the embodiments of the present disclosure and performs a multiply-add operation based on the broadcast scalar. Here, "xvfmadd231pd" is the instruction name. Line 9 broadcasts the 1st element in ymm12 and performs a multiply-add operation on ymm0 and the broadcast element Bro, that is, ymm4 = ymm0 × Bro + ymm4. Here, the immediate operand is 0 and is assigned directly in the code. The 1st element in ymm12 is the element in row 1, column 1 of matrix B, i.e., b00. The specific operation result of line 9 can be seen in the comments of the code.

Similarly, line 10 of the code broadcasts the 2nd element in ymm12 and performs a multiply-add operation on ymm1 and the broadcast element Bro, that is, ymm4 = ymm1 × Bro + ymm4. Here, the immediate operand is 1 and is assigned directly in the code. The 2nd element in ymm12 is the element in row 2, column 1 of matrix B, i.e., b10. The specific operation result of line 10 can be seen in the comments of the code.

By executing lines 9 to 12, the 1st element in ymm4 becomes a00b00 + a01b10 + a02b20 + a03b30, i.e., the sum of the products of row 1 of matrix A with the corresponding elements of column 1 of matrix B; the 2nd element in ymm4 becomes a10b00 + a11b10 + a12b20 + a13b30, i.e., the sum of the products of row 2 of matrix A with the corresponding elements of column 1 of matrix B; and so on. Thus, the 4 elements in ymm4 are the column 1 elements of matrix C.

The operations performed by lines 13 to 16 are similar to those performed by lines 9 to 12, and the resulting 4 elements in ymm5 are the column 2 elements of matrix C. Similarly, after lines 17 to 20 are executed, the 4 elements in ymm6 are the column 3 elements of matrix C, and after lines 21 to 24 are executed, the 4 elements in ymm7 are the column 4 elements of matrix C. This completes the multiplication of matrix A and matrix B and yields matrix C.

It should be noted that "xvfmadd231pd" in the above code is the instruction name. After the instruction is decoded by the CPU, the CPU obtains the immediate operand, performs scalar broadcast by applying the data processing method of the embodiments of the present disclosure, and then performs a multiply-add operation based on the broadcast scalar. As a kernel operator in matrix multiplication, the instruction may play an important role in high-performance computing libraries such as Basic Linear Algebra Subprograms (BLAS) and in AI model training and inference.

The above code uses 24 instructions to implement the multiplication of 4 × 4 matrices. Compared with the 36 instructions required by the conventional method, the instruction count is effectively reduced, flexibility is provided for vector-vector arithmetic operations, the operation efficiency can be improved, and high performance is easier to achieve. In this example, the number of kernel code instructions is reduced by 33% by broadcasting scalar elements in matrix B. This largely eliminates data-dependency bubbles in the floating-point multiply-add (FMA) engine and makes full use of the FMA engine to improve the efficiency of matrix multiplication. Furthermore, fewer instructions mean less use of the dispatch and scheduling queues, which typically consume considerable power, so the data processing method of the embodiments of the present disclosure helps achieve an energy-saving design.
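The count of 24 and the 33% reduction can be checked with a Python model of the sequence described above: 4 column loads of A, 4 column loads of B, and 16 multiply-adds that each broadcast one element of a B column. This is a sketch under stated assumptions; the function name `broadcast_fma_kernel`, the row-list matrix layout, and the use of the immediate as a zero-based lane index are illustrative and are not the actual kernel code.

```python
def broadcast_fma_kernel(A, B):
    """Model the 4x4 kernel that uses a broadcasting multiply-add.

    A and B are lists of rows. Returns C (list of rows) and the modeled
    instruction count.
    """
    n = 4
    count = 0

    a_cols = []
    for k in range(n):                       # load the 4 columns of A
        a_cols.append([A[i][k] for i in range(n)])
        count += 1

    b_cols = []
    for j in range(n):                       # load the 4 columns of B
        b_cols.append([B[k][j] for k in range(n)])
        count += 1

    c_cols = []
    for j in range(n):
        acc = [0] * n
        for k in range(n):
            bro = b_cols[j][k]               # immediate k selects element k
            acc = [a * bro + c for a, c in zip(a_cols[k], acc)]
            count += 1                       # one broadcasting multiply-add
        c_cols.append(acc)

    C = [[c_cols[j][i] for j in range(n)] for i in range(n)]
    return C, count


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C, count = broadcast_fma_kernel(A, B)

baseline = 4 + 4 * (4 + 4)                   # conventional count from the text
reduction = (baseline - count) / baseline
print(count, round(reduction * 100))
```

The saving comes entirely from the 16 separate broadcast loads being replaced by 4 ordinary column loads of B.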

At least one embodiment of the present disclosure also provides a data processing apparatus which, when performing a vector operation, can broadcast any scalar element of one source vector operand and apply a basic operation element-wise to another source operand, thereby effectively reducing the instruction count, providing flexibility for vector-vector arithmetic operations, improving operation efficiency, and contributing to a high-performance and energy-saving design.

Fig. 7 is a schematic block diagram of a data processing apparatus according to some embodiments of the present disclosure. For example, as shown in fig. 7, the data processing apparatus 100 includes a multiplexer 110, a plurality of arithmetic units 120, and a plurality of registers 130. For example, by using the multiplexer 110, a plurality of first elements of the first vector can be selected and output based on the value of the immediate operand, whereby any one of the plurality of first elements can be selected as a broadcast element. Compared with the conventional device supporting vector calculation, the data processing device 100 only adds one multiplexer 110, has small modification on the hardware structure, is convenient to implement, and is easy to control and expand.

Fig. 8 is a schematic logical structure diagram of a data processing apparatus according to some embodiments of the present disclosure. Referring to fig. 7 and 8, the plurality of registers 130 includes at least a first register G1 and a second register G2, and the multiplexer 110 includes a selection control terminal Con, an output terminal OT, and a plurality of input terminals I0, I1, I2, I3. The inputs I0, I1, I2, I3 of the multiplexer 110 are connected to the first register G1. The plurality of arithmetic units 120 (i.e., ALU0, ALU1, ALU2, ALU3) are connected to the second register G2, and each of the plurality of arithmetic units 120 is connected to the output terminal OT of the multiplexer 110.

The first register G1 is configured to store a first vector V1 and the second register G2 is configured to store a second vector V2. The multiplexer 110 is configured to select one first element among the plurality of first elements X0, X1, X2, X3 of the first vector V1 stored in the first register G1, based on the immediate operand Imm acquired by the selection control terminal Con, and to output it as the broadcast element Bro. The plurality of arithmetic units 120 are configured to perform an arithmetic operation on the second vector V2 stored in the second register G2 and the broadcast element Bro, according to the acquired operation instruction OP, to obtain an operation result.

For example, in some examples, as shown in fig. 8, the data processing apparatus 100 further includes a controller 140. For example, the controller 140 is connected to the plurality of arithmetic units 120 and to the selection control terminal Con of the multiplexer 110. The controller 140 is configured to perform instruction decoding, supply the decoded immediate operand Imm to the selection control terminal Con of the multiplexer 110, and supply the decoded operation instruction OP to the plurality of arithmetic units 120. The multiplexer 110 selects one of the first elements based on the obtained immediate operand Imm and outputs it as the broadcast element Bro, and the plurality of arithmetic units 120 perform an arithmetic operation on the second vector V2 and the broadcast element Bro based on the operation instruction OP. For example, the operation instruction OP may be an instruction representing an addition operation, a subtraction operation, a multiplication operation, a division operation, a multiply-add operation, a comparison operation, or another arithmetic operation.

For example, the number of the plurality of operation units 120 is equal to the number of the plurality of second elements Y0, Y1, Y2, Y3 of the second vector V2 stored in the second register G2, and the plurality of operation units 120 are in one-to-one correspondence with the plurality of second elements Y0, Y1, Y2, Y3.

For example, the plurality of registers 130 further includes a third register G3, the third register G3 is connected to the plurality of arithmetic units 120, and the third register G3 is configured to store the arithmetic results obtained by the plurality of arithmetic units 120. For example, the operation result may be a third vector V3, the third vector V3 includes a plurality of third elements Z0, Z1, Z2, and Z3, the number of the plurality of operation units 120 is equal to the number of the plurality of third elements Z0, Z1, Z2, and Z3, and the plurality of operation units 120 are in one-to-one correspondence with the plurality of third elements Z0, Z1, Z2, and Z3. For example, in the example shown in fig. 8, the number of the arithmetic units 120, the second elements, and the third elements is 4, so as to realize the one-to-one correspondence.
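The cooperation of the registers, the multiplexer, the arithmetic units, and the controller can be pictured with a small behavioral model in Python. This is a sketch only: the class names `Multiplexer` and `Apparatus`, the string operation names, and the assumption that the immediate is used directly as a zero-based input index are hypothetical and are not hardware descriptions from the disclosure.

```python
import operator


class Multiplexer:
    """Behavioral model of multiplexer 110: the select value picks one input."""

    def select(self, inputs, con):
        return inputs[con]


class Apparatus:
    """Behavioral model of data processing apparatus 100 with one unit per lane."""

    OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}

    def __init__(self, lanes=4):
        self.lanes = lanes
        self.g1 = [0] * lanes      # first register G1 (first vector)
        self.g2 = [0] * lanes      # second register G2 (second vector)
        self.g3 = [0] * lanes      # third register G3 (operation result)
        self.mux = Multiplexer()

    def run(self, op_name, imm):
        # Controller 140: supplies Imm to the multiplexer and OP to every unit.
        op = self.OPS[op_name]
        bro = self.mux.select(self.g1, imm)
        # One arithmetic unit per second element; each writes one third element.
        self.g3 = [op(y, bro) for y in self.g2]
        return self.g3


dev = Apparatus()
dev.g1 = [2, 4, 1, 5]
dev.g2 = [0, 3, 2, 0]
print(dev.run("SUB", 1))
```

Because every lane receives the same broadcast element, the number of arithmetic units, second elements, and third elements stays equal, as in the one-to-one correspondence described above.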

It should be noted that, in the embodiment of the present disclosure, when the data processing apparatus 100 is implemented as a CPU, the register 130 may be implemented as a general register (e.g., a vector register), and the controller 140 may be implemented as a general control unit including, for example, an instruction register, an instruction decoder, an operation controller, and the like. The arithmetic unit 120 may be implemented as a general Arithmetic Logic Unit (ALU) that performs a corresponding arithmetic operation upon receiving a command of the controller 140. The arithmetic unit 120, the register 130, and the controller 140 are connected to each other through a CPU internal bus. The detailed description of the arithmetic unit 120, the register 130 and the controller 140 can refer to conventional designs, and will not be described in detail here.

It should be noted that, in the embodiment of the present disclosure, compared to a general CPU, the data processing apparatus 100 adds the multiplexer 110, and the multiplexer 110 is connected to the arithmetic unit 120, the register 130, and the controller 140.

It should be noted that, in the embodiment of the present disclosure, the data processing apparatus 100 may further include more components, such as a bus unit, a prefetch unit, a data cache, and the like, to achieve more comprehensive functions, which may be determined according to actual needs, and the embodiment of the present disclosure is not limited thereto.

Fig. 9 is a schematic block diagram of another data processing apparatus provided in some embodiments of the present disclosure. For example, as shown in fig. 9, in some examples, the data processing apparatus 200 includes an acquisition unit 210, a selection unit 220, and an operation unit 230. The acquisition unit 210 is configured to acquire a first vector, a second vector, and an immediate operand. For example, the first vector includes a plurality of first elements. The acquisition unit 210 may execute, for example, step S10 in the data processing method shown in fig. 1. The selection unit 220 is configured to select one of the plurality of first elements as a broadcast element based on the immediate operand. The selection unit 220 may, for example, execute step S20 in the data processing method shown in fig. 1. The operation unit 230 is configured to perform an arithmetic operation on the second vector and the broadcast element to obtain an operation result. The operation unit 230 may execute, for example, step S30 in the data processing method shown in fig. 1.

For example, the acquisition unit 210, the selection unit 220, and the operation unit 230 may be hardware, software, firmware, or any feasible combination thereof. For example, the acquisition unit 210, the selection unit 220, and the operation unit 230 may be dedicated or general-purpose circuits, chips, devices, or the like, or may be a combination of a processor and a memory. The embodiments of the present disclosure do not limit the specific implementation forms of the above units.

It should be noted that, in the embodiment of the present disclosure, each unit of the data processing apparatus 200 corresponds to each step of the foregoing data processing method, and for a specific function of the data processing apparatus 200, reference may be made to the related description about the data processing method, which is not described herein again. The components and configuration of data processing device 200 shown in FIG. 9 are exemplary only, and not limiting, and data processing device 200 may include other components and configurations as desired.

At least one embodiment of the present disclosure also provides an electronic device comprising a processor and a memory, the memory including one or more computer program modules. The one or more computer program modules are stored in the memory and configured to be executed by the processor, and include instructions for implementing the data processing method provided by any of the embodiments of the present disclosure. When the electronic device performs a vector operation, it can broadcast any scalar element of one source vector operand and apply a basic operation element-wise to the other source operand, so that the instruction count can be effectively reduced, flexibility is provided for vector-vector arithmetic operations, the operation efficiency can be improved, and a high-performance and energy-saving design is facilitated.

Fig. 10 is a schematic block diagram of an electronic device provided in some embodiments of the present disclosure. As shown in fig. 10, the electronic device 300 includes a processor 310 and a memory 320. Memory 320 is used to store non-transitory computer readable instructions (e.g., one or more computer program modules). The processor 310 is configured to execute non-transitory computer readable instructions, which when executed by the processor 310 may perform one or more of the steps of the data processing method described above. The memory 320 and the processor 310 may be interconnected by a bus system and/or other form of connection mechanism (not shown).

For example, the processor 310 may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or other form of processing unit having data processing capabilities and/or program execution capabilities. For example, the Central Processing Unit (CPU) may be an X86 or ARM architecture or the like. The processor 310 may be a general-purpose processor or a special-purpose processor that may control other components in the electronic device 300 to perform desired functions.

For example, memory 320 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. Non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer program modules may be stored on the computer-readable storage medium and executed by processor 310 to implement various functions of electronic device 300. Various applications, as well as various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.

It should be noted that, in the embodiment of the present disclosure, reference may be made to the above description on the data processing method for specific functions and technical effects of the electronic device 300, and details are not described here.

Fig. 11 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure. The electronic device 400 is, for example, suitable for implementing the data processing method provided by the embodiments of the present disclosure. The electronic device 400 may be a terminal device or the like. It should be noted that the electronic device 400 shown in fig. 11 is only an example, and does not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.

As shown in fig. 11, electronic device 400 may include a processing means (e.g., central processing unit, graphics processor, etc.) 410 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)420 or a program loaded from a storage device 480 into a Random Access Memory (RAM) 430. In the RAM 430, various programs and data necessary for the operation of the electronic apparatus 400 are also stored. The processing device 410, the ROM 420, and the RAM 430 are connected to each other by a bus 440. An input/output (I/O) interface 450 is also connected to bus 440.

Generally, the following devices may be connected to the I/O interface 450: input devices 460 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 470 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, or the like; storage 480 including, for example, magnetic tape, hard disk, etc.; and a communication device 490. The communication device 490 may allow the electronic device 400 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 11 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided, and that the electronic device 400 may alternatively be implemented or provided with more or less means.

For example, according to an embodiment of the present disclosure, the above-described data processing method may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program comprising program code for performing the above-described data processing method. In such embodiments, the computer program may be downloaded and installed from a network through communication device 490, or installed from storage device 480, or installed from ROM 420. When executed by the processing device 410, the computer program may implement the functions defined in the data processing method provided by the embodiments of the present disclosure.

The following points need to be explained:

(1) The drawings of the embodiments of the disclosure relate only to the structures related to the embodiments of the disclosure; other structures can refer to common designs.

(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.

The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims

1. A method of data processing, comprising:

obtaining a first vector, a second vector, and an immediate operand, wherein the first vector includes a plurality of first elements;

selecting one of the plurality of first elements as a broadcast element based on the immediate operand;

and performing arithmetic operation on the second vector and the broadcast element to obtain an operation result.

2. The data processing method of claim 1, wherein the immediate operand has a plurality of different values, the different values of the immediate operand having a one-to-one correspondence with the plurality of first elements.

3. The data processing method of claim 2, wherein selecting one of the plurality of first elements as the broadcast element based on the immediate operand comprises:

and selecting a first element corresponding to the immediate operand as the broadcast element according to the corresponding relation between the numerical value of the immediate operand and the first element.

4. The data processing method of claim 3, wherein selecting a first element corresponding to the immediate operand as the broadcast element according to the correspondence of the value of the immediate operand to the first element comprises:

using a multiplexer, with the immediate operand as the input to a selection control terminal of the multiplexer and the plurality of first elements as the inputs to the input terminals of the multiplexer, so that the multiplexer outputs the first element corresponding to the immediate operand as the broadcast element.

5. The data processing method according to any of claims 1-4, wherein the immediate operand is an immediate, the immediate being a scalar value.

6. The data processing method according to any of claims 1-4, wherein the plurality of first elements in the first vector are arranged sequentially and numbered sequentially from 0 to N, N being an integer greater than 0,

the broadcast element is the P-th first element of the plurality of first elements, P being 0, …, or N.

7. The data processing method of any of claims 1 to 4, wherein the first element is a scalar value.

8. The data processing method of any of claims 1 to 4, wherein the arithmetic operation comprises at least one of an addition operation, a subtraction operation, a multiplication operation, a division operation, a multiply-add operation, and a comparison operation.

9. The data processing method according to any of claims 1 to 4, wherein the operation result is a vector result.

10. The data processing method of any of claims 1-4, wherein the second vector comprises a plurality of second elements,

performing an arithmetic operation on the second vector and the broadcast element to obtain the operation result, comprising:

and performing an arithmetic operation on each of the plurality of second elements and the broadcast element to obtain the operation result.

11. The data processing method according to claim 10, wherein the operation result includes a third vector including a plurality of third elements, the number of the plurality of third elements being the same as the number of the plurality of second elements.

12. The data processing method according to any of claims 1 to 4, wherein the data processing method is applied in an apparatus based on an instruction set architecture.

13. A data processing apparatus, comprising a multiplexer, a plurality of arithmetic units, and a plurality of registers;

the multiplexer comprises a selection control end, an output end and a plurality of input ends;

a plurality of input ends of the multiplexer are connected with the first register;

the plurality of arithmetic units are connected with the second register, and each of the plurality of arithmetic units is connected with the output end of the multiplexer;

the first register is configured to store a first vector and the second register is configured to store a second vector;

the multiplexer is configured to select and output one of a plurality of first elements of the first vector stored in the first register as a broadcast element based on an immediate operand acquired by the selection control terminal;

the plurality of arithmetic units are configured to perform, according to the acquired operation instruction, an arithmetic operation on the second vector stored in the second register and the broadcast element to obtain an operation result.

14. The data processing apparatus of claim 13, further comprising a controller,

wherein the controller is connected to the plurality of arithmetic units and the selection control terminal of the multiplexer,

the controller is configured to decode an instruction, provide the decoded immediate operand to a selection control terminal of the multiplexer, and provide the decoded operation instruction to the plurality of arithmetic units.

15. The data processing apparatus according to claim 13, wherein a number of the plurality of arithmetic units is equal to a number of a plurality of second elements of the second vector stored in the second register, the plurality of arithmetic units being in one-to-one correspondence with the plurality of second elements.

16. The data processing apparatus according to claim 13, wherein the plurality of registers further comprises a third register,

the third register is connected to the plurality of arithmetic units, and the third register is configured to store the operation result obtained by the plurality of arithmetic units.

17. The data processing apparatus according to claim 16, wherein the operation result includes a third vector including a plurality of third elements, the number of the plurality of arithmetic units being equal to the number of the plurality of third elements, the plurality of arithmetic units being in one-to-one correspondence with the plurality of third elements.

18. A data processing apparatus comprising:

an obtaining unit configured to obtain a first vector, a second vector, and an immediate operand, wherein the first vector includes a plurality of first elements;

a selection unit configured to select one of the plurality of first elements as a broadcast element based on the immediate operand;

an arithmetic unit configured to perform an arithmetic operation on the second vector and the broadcast element to obtain an operation result.

19. An electronic device, comprising:

a processor;

a memory including one or more computer program modules;

wherein the one or more computer program modules are stored in the memory and configured to be executed by the processor, the one or more computer program modules comprising instructions for implementing the data processing method of any of claims 1-12.
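For illustration only, the apparatus recited in claims 13 to 17 (registers, multiplexer, arithmetic units and controller) can be modeled behaviorally in Python as sketched below. This is not a hardware description and forms no part of the claims. The names `Instruction`, `mux` and `VectorUnit`, the instruction fields `opcode` and `imm`, and the restriction to two operations are hypothetical and are introduced only for this sketch.

```python
from dataclasses import dataclass, field
import operator


@dataclass
class Instruction:
    # Hypothetical encoding: an operation name plus the immediate operand.
    opcode: str
    imm: int


def mux(inputs, select):
    """Multiplexer model: forward the input chosen by the selection control terminal."""
    return inputs[select]


@dataclass
class VectorUnit:
    first_register: list    # stores the first vector
    second_register: list   # stores the second vector
    third_register: list = field(default_factory=list)  # receives the operation result

    def execute(self, instr):
        # Controller: decode the instruction into an operation and an immediate operand.
        fn = {"add": operator.add, "mul": operator.mul}[instr.opcode]
        # The immediate operand drives the selection control terminal of the multiplexer.
        broadcast_element = mux(self.first_register, instr.imm)
        # One arithmetic unit per second element, each fed by the multiplexer output.
        self.third_register = [fn(x, broadcast_element) for x in self.second_register]
        return self.third_register
```

With `unit = VectorUnit([1, 2, 3, 4], [10, 20, 30, 40])`, calling `unit.execute(Instruction("add", 3))` selects the element `4` and leaves `[14, 24, 34, 44]` in `unit.third_register`.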