WO2020073923A1 - Operation method and device, computer equipment, and storage medium

Info

Publication number: WO2020073923A1
Application number: PCT/CN2019/110146
Authority: WO
Inventors: 苏振宇; 周晓勇; 张定飞; 孟小甫
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2018-10-09
Filing date: 2019-10-09
Publication date: 2020-04-16

Abstract

An operation method and device, computer equipment, and a storage medium. A combined processing device comprises a machine learning operation device, a universal interconnection interface, and other processing devices. The machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device further comprises a storage device. The storage device is connected to the machine learning operation device and to the other processing devices, respectively, and stores data of the machine learning operation device and of the other processing devices. The operation method and device, the computer equipment, and the storage medium have a wide range of applications, high operation processing efficiency, and fast processing speed.

Description

Calculation method, device, computer equipment and storage medium

Technical field

The present disclosure relates to the field of computer technology, and in particular, to an arithmetic method, device, computer equipment, and storage medium.

Background art

With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is becoming more and more widely used, and has been applied successfully in image recognition, speech recognition, natural language processing, and other fields. However, as neural network algorithms grow more complex, the types and number of data operations involved keep increasing. In the related art, operations or processing such as data selection, counting, fully connected operations, convolution, maximum pooling, activation, filling, matrix transposition, average pooling, scalar calculation, scalar type conversion, address fetching, scalar data migration, instruction flow jump control, vector calculation, loop vector calculation, vector data migration, synchronization control, and interrupt storage are performed inefficiently and slowly.

Summary of the invention

Based on this, it is necessary to provide an arithmetic method, device, computer equipment, and storage medium that can solve the above-mentioned technical problems.

According to an aspect of the present disclosure, an activation instruction processing apparatus is provided, the apparatus including:

The control module is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;

The operation module is used to perform an activation operation on the data to be operated on to obtain an operation result, and to store the operation result in the target address,

Wherein the operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be operated on and the target address.

According to another aspect of the present disclosure, a machine learning computing device is provided, the device including:

One or more of the above-mentioned activation instruction processing devices, used to obtain the data to be operated on and control information from other processing devices, perform the specified machine learning operations, and pass the execution results to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the activation instruction processing devices, the plurality of activation instruction processing devices may be connected to one another and transmit data through a specific structure;

Wherein the plurality of activation instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of activation instruction processing devices share the same control system or have their own respective control systems; the plurality of activation instruction processing devices share memory or have their own respective memories; and the plurality of activation instruction processing devices may be interconnected in any interconnection topology.

According to another aspect of the present disclosure, a combined processing device is provided, the device comprising:

The above-mentioned machine learning computing device, universal interconnection interface and other processing devices;

The machine learning computing device interacts with the other processing devices to jointly complete the computing operation specified by the user.

According to another aspect of the present disclosure, a machine learning chip is provided, the machine learning chip including the above machine learning computing device or the above combined processing device.

According to another aspect of the present disclosure, there is provided a machine learning chip packaging structure including the above machine learning chip.

According to another aspect of the present disclosure, there is provided a board card including the above machine learning chip packaging structure.

According to another aspect of the present disclosure, there is provided an electronic device including the aforementioned machine learning chip or the aforementioned board card.

According to another aspect of the present disclosure, an activation instruction processing method is provided. The method is applied to an activation instruction processing device. The method includes:

Using a control module to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;

Using an operation module to perform an activation operation on the data to be operated on to obtain an operation result, and to store the operation result in the target address.

According to another aspect of the present disclosure, there is provided a non-volatile computer-readable storage medium having computer program instructions stored thereon, the computer program instructions implementing the above activation instruction processing method when executed by a processor.

Embodiments of the present disclosure provide an activation instruction processing method, device, and related products. The device includes a control module and an operation module. The control module is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction. The operation module is used to perform an activation operation on the data to be operated on to obtain an operation result, and to store the operation result in the target address. The activation instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, process activation instructions efficiently and quickly, and perform activation operations efficiently and quickly.

In some embodiments, the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a means of transport, a household appliance, and/or a medical device.

In some embodiments, the means of transport includes an airplane, a ship, and/or a road vehicle; the household appliance includes a TV, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; and the medical device includes an MRI scanner, a B-mode ultrasound scanner, and/or an electrocardiograph.

Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief description of the drawings

The drawings included in the specification and forming a part of the specification together with the specification show exemplary embodiments, features, and aspects of the present disclosure, and are used to explain the principles of the present disclosure.

FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure.

FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 1-2a and 1-2b show block diagrams of an activation instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 1-3 shows a schematic diagram of an application scenario of an activation instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure.

FIG. 2-1 shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.

FIGS. 2-2a and 2-2b show block diagrams of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.

FIG. 2-3 shows a schematic diagram of an application scenario of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.

FIG. 2-4 shows a flowchart of a linear rectification function activation instruction processing method according to an embodiment of the present disclosure.

FIG. 3-1 shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 3-2a and 3-2b show block diagrams of an S-shaped growth curve function activation instruction processing device according to an embodiment of the present disclosure.

FIG. 3-3 shows a schematic diagram of an application scenario of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 3-4 shows a flowchart of an S-shaped growth curve function activation instruction processing method according to an embodiment of the present disclosure.

FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 4-2a and 4-2b show block diagrams of an exponential function activation instruction processing device according to an embodiment of the present disclosure.

FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing device according to an embodiment of the present disclosure.

FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure.

FIG. 5-1 shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 5-2a and 5-2b show block diagrams of a selection instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 5-3 shows a schematic diagram of an application scenario of a selection instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure.

FIG. 6-1 shows a block diagram of a counting instruction processing device according to an embodiment of the present disclosure.

FIGS. 6-2a and 6-2b show block diagrams of a counting instruction processing device according to an embodiment of the present disclosure.

FIG. 6-3 shows a schematic diagram of an application scenario of a counting instruction processing device according to an embodiment of the present disclosure.

FIG. 6-4 shows a flowchart of a counting instruction processing method according to an embodiment of the present disclosure.

FIG. 7-1 shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 7-2a and 7-2b show block diagrams of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 7-3 shows a schematic diagram of an application scenario of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 7-4 shows a flowchart of a fully connected instruction processing method according to an embodiment of the present disclosure.

FIG. 8-1 shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure.

FIGS. 8-2a and 8-2b show block diagrams of a convolution instruction processing device according to an embodiment of the present disclosure.

FIG. 8-3 shows a schematic diagram of an application scenario of a convolution instruction processing device according to an embodiment of the present disclosure.

FIG. 8-4 shows a flowchart of a convolution instruction processing method according to an embodiment of the present disclosure.

FIG. 9-1 shows a block diagram of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 9-2a and 9-2b show block diagrams of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 9-3 shows a schematic diagram of an application scenario of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 9-4 shows a flowchart of a maximum pooling instruction processing method according to an embodiment of the present disclosure.

FIG. 10-1 shows a block diagram of a filling instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 10-2a and 10-2b show block diagrams of a filling instruction processing device according to an embodiment of the present disclosure.

FIG. 10-3 shows a schematic diagram of an application scenario of a filling instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 10-4 shows a flowchart of a filling instruction processing method according to an embodiment of the present disclosure.

FIG. 11-1 shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 11-2a and 11-2b show block diagrams of a matrix transposition instruction processing device according to an embodiment of the present disclosure.

FIG. 11-3 shows a schematic diagram of an application scenario of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 11-4 shows a flowchart of a matrix transposition instruction processing method according to an embodiment of the present disclosure.

FIG. 12-1 shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 12-2a and 12-2b show block diagrams of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 12-3 shows a schematic diagram of an application scenario of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 12-4 shows a flowchart of an average pooling instruction processing method according to an embodiment of the present disclosure.

FIG. 13-1 shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure.

FIGS. 13-2a and 13-2b show block diagrams of a scalar instruction processing device according to an embodiment of the present disclosure.

FIGS. 13-3a and 13-3b show schematic diagrams of application scenarios of a scalar instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 13-4 shows a flowchart of a scalar instruction processing method according to an embodiment of the present disclosure.

FIG. 14-1 shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.

FIGS. 14-2a and 14-2b show block diagrams of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.

FIG. 14-3 shows a schematic diagram of an application scenario of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.

FIG. 14-4 shows a flowchart of a scalar type conversion instruction processing method according to an embodiment of the present disclosure.

FIG. 15-1 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 15-2 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 15-3a and 15-3b show schematic diagrams of application scenarios of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 15-4 shows a flowchart of an address fetch instruction processing method according to an embodiment of the present disclosure.

FIG. 16-1 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 16-2 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 16-3 shows a schematic diagram of an application scenario of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 16-4 shows a flowchart of a scalar data migration instruction processing method according to an embodiment of the present disclosure.

FIG. 17-1 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 17-2 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 17-3 shows a schematic diagram of an application scenario of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 17-4 shows a flowchart of a scalar control flow instruction processing method according to an embodiment of the present disclosure.

FIG. 18-1 shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 18-2a and 18-2b show block diagrams of a vector instruction processing device according to an embodiment of the present disclosure.

FIG. 18-3 shows a schematic diagram of an application scenario of a vector instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 18-4 shows a flowchart of a vector instruction processing method according to an embodiment of the present disclosure.

FIG. 19-1 shows a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 19-2a and 19-2b show block diagrams of a loop vector instruction processing device according to an embodiment of the present disclosure.

FIG. 19-3 shows a schematic diagram of an application scenario of a loop vector instruction processing device according to an embodiment of the present disclosure.

FIG. 19-4 shows a flowchart of a loop vector instruction processing method according to an embodiment of the present disclosure.

FIG. 20-1 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 20-2 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 20-3 shows a schematic diagram of an application scenario of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 20-4 shows a flowchart of a vector data migration instruction processing method according to an embodiment of the present disclosure.

FIG. 21-1a shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 21-1b shows a schematic structural diagram of a module cluster in a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 21-2 shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 21-3 shows a schematic diagram of an application scenario of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 21-4 shows a flowchart of a synchronization control instruction processing method according to an embodiment of the present disclosure.

FIG. 22-1 shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 22-2a and 22-2b show block diagrams of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 22-3a and 22-3b show schematic diagrams of application scenarios of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.

FIG. 22-4 shows a flowchart of an interrupt storage instruction processing method according to an embodiment of the present disclosure.

FIGS. 23a-23d show block diagrams of an arithmetic module according to an embodiment of the present disclosure.

FIG. 23e shows a block diagram of a control module according to an embodiment of the present disclosure.

FIGS. 24a and 24b show block diagrams of a combined processing device according to an embodiment of the present disclosure.

FIG. 25 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.

Detailed description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present disclosure.

It should be understood that the terms "first", "second", "zeroth", etc. in the claims, specification, and drawings of the present disclosure are used to distinguish different objects, not to describe a specific order. The terms "comprising" and "including" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or combinations thereof.

It should also be understood that the terminology used in the specification of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to limit the disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly dictates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.

As used in this specification and claims, the term "if" may be interpreted as "when", "once", "in response to a determination", or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, to mean "once determined", "in response to a determination", "once [described condition or event] is detected", or "in response to detection of [described condition or event]".

The present disclosure provides instruction processing methods and devices corresponding to different operations or processing, together with the computer equipment and storage media corresponding to each instruction processing method and device. The instruction processing methods and devices corresponding to the different operations or processing include: a selection instruction processing method and device, a counting instruction processing method and device, a fully connected instruction processing method and device, a convolution instruction processing method and device, a maximum pooling instruction processing method and device, a linear rectification function activation instruction processing method and device, an S-shaped growth curve function activation instruction processing method and device, an activation instruction processing method and device, a filling instruction processing method and device, a matrix transposition instruction processing method and device, an average pooling instruction processing method and device, an exponential function activation instruction processing method and device, a scalar instruction processing method and device, a scalar type conversion instruction processing method and device, an address fetch instruction processing method and device, a scalar data migration instruction processing method and device, a scalar control flow instruction processing method and device, a vector instruction processing method and device, a loop vector instruction processing method and device, a vector data migration instruction processing method and device, a synchronization control instruction processing method and device, and an interrupt storage instruction processing method and device. The instruction processing method and instruction processing device described below may be any of the instruction processing methods and devices listed above.

The instruction processing method according to the embodiments of the present disclosure may be applied to a processor, which may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and so on. Machine learning operations include neural network operations, k-means operations, support vector machine operations, etc. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and a Field-Programmable Gate Array (FPGA) chip. The present disclosure does not limit the specific type of processor.

In a possible implementation, the processor mentioned in the present disclosure may include multiple processing units, and each processing unit may independently run the various tasks assigned to it, such as convolution operation tasks, pooling tasks, or fully connected tasks. The present disclosure does not limit the processing units or the tasks they execute.

FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the processor 100 includes a plurality of processing units 101 and a storage unit 102. The plurality of processing units 101 are used to execute an instruction sequence, and the storage unit 102 is used to store data and may include a random access memory (RAM) and a register file. The multiple processing units 101 in the processor 100 may share part of the storage space, for example part of the RAM storage space and the register file, and may at the same time each have their own storage space.

FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-1, the device includes a control module 8-11 and an arithmetic module 8-12.

The control module 8-11 is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction. The operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be operated on and the target address.

The operation module 8-12 is used to perform the activation operation on the data to be operated on to obtain the operation result, and to store the operation result in the target address.

In this embodiment, the activation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware, so the control module first needs to compile it. Once the compiled activation instruction is obtained, it can be parsed. The compiled activation instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain the data to be operated on from the address of the data to be operated on. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
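The compile, parse, fetch, operate, and store flow described above can be illustrated with a minimal Python sketch. This is not the disclosed implementation: the textual instruction syntax "ACTIVE dst src size", the size field, the Python list standing in for memory, and all function names are assumptions made for illustration, and ReLU merely stands in for whichever activation function is actually used.

```python
def compile_instruction(source):
    """Hypothetical compile step: 'ACTIVE 4 0 3' -> ('ACTIVE', 4, 0, 3)."""
    name, dst, src, size = source.split()
    return (name.upper(), int(dst), int(src), int(size))


def control_module(source, memory):
    """Compile, parse, and fetch the data to be operated on."""
    # Parsing yields the operation code and the operation domain fields.
    opcode, dst, src, size = compile_instruction(source)
    if opcode != "ACTIVE":
        raise ValueError(f"not an activation instruction: {opcode}")
    data = memory[src:src + size]  # read from the data address (size is assumed)
    return data, dst


def operation_module(data, dst, memory, activation):
    """Apply the activation and store the result at the target address."""
    result = [activation(x) for x in data]
    memory[dst:dst + len(result)] = result
    return result


def relu(x):
    # Stand-in activation function for this sketch only.
    return max(0.0, x)


def execute(source, memory):
    """Run one activation instruction end to end."""
    data, dst = control_module(source, memory)
    return operation_module(data, dst, memory, relu)
```

For example, with memory = [-1.0, 2.0, -3.0, 0.0, 0.0, 0.0, 0.0], calling execute("ACTIVE 4 0 3", memory) reads three values starting at address 0 and writes the activated values starting at address 4.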

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), that specifies the operation to be performed in a computer program; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. The data required to execute the corresponding instruction includes the data to be operated on, the corresponding operation method, and so on. An activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be operated on and the target address.
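To make the split between operation code and operation domain concrete, the following Python sketch packs both into one 32-bit word and unpacks them again. The layout (8-bit operation code, two 12-bit address fields), the opcode value 0x21, and the names used are all hypothetical; the disclosure leaves the instruction format open.

```python
from dataclasses import dataclass

OPCODE_ACTIVE = 0x21  # hypothetical opcode value meaning "activation operation"


@dataclass(frozen=True)
class ActivationInstruction:
    opcode: int    # operation code: which operation to perform
    src_addr: int  # operation domain: address of the data to be operated on
    dst_addr: int  # operation domain: target address for the result


def encode(instr):
    """Pack the fields into a 32-bit word: [opcode:8][src_addr:12][dst_addr:12]."""
    if not (0 <= instr.opcode < 1 << 8):
        raise ValueError("opcode does not fit in 8 bits")
    if not (0 <= instr.src_addr < 1 << 12 and 0 <= instr.dst_addr < 1 << 12):
        raise ValueError("address does not fit in 12 bits")
    return (instr.opcode << 24) | (instr.src_addr << 12) | instr.dst_addr


def decode(word):
    """Parse a 32-bit word back into operation code and operation domain."""
    return ActivationInstruction(
        opcode=(word >> 24) & 0xFF,
        src_addr=(word >> 12) & 0xFFF,
        dst_addr=word & 0xFFF,
    )
```

Under this assumed layout, decode(encode(instr)) returns the original fields, and a parser can check the opcode field against OPCODE_ACTIVE before reading the addresses.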

It should be understood that those skilled in the art can set the instruction format of the activation instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, that control module may receive a linear rectification function activation instruction and control one or more processing modules to perform the linear rectification function activation operation. When the device includes multiple control modules, the multiple control modules may each receive linear rectification function activation instructions and control the corresponding one or more processing modules to perform linear rectification function activation operations.
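One way a single control module might spread work over several processing modules is sketched below in Python. The disclosure does not say how data is divided among modules, so the contiguous, near-equal partitioning and the function names here are assumptions; each inner list comprehension simply models the work one processing module would do.

```python
def split_evenly(data, n_modules):
    """Split data into n_modules contiguous chunks whose sizes differ by at most 1."""
    if n_modules < 1:
        raise ValueError("at least one processing module is required")
    base, extra = divmod(len(data), n_modules)
    chunks, start = [], 0
    for i in range(n_modules):
        size = base + (1 if i < extra else 0)  # first 'extra' chunks get one more item
        chunks.append(data[start:start + size])
        start += size
    return chunks


def dispatch(data, n_modules, activation):
    """Model one control module handing chunks to n_modules processing modules."""
    chunks = split_evenly(data, n_modules)
    # Each inner list models the elementwise work of one processing module.
    partial_results = [[activation(x) for x in chunk] for chunk in chunks]
    # Concatenate the partial results in order to restore the original layout.
    return [y for part in partial_results for y in part]
```

Because an activation operation of this kind is elementwise, the concatenated result is the same regardless of how many modules the data is split across.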

An embodiment of the present disclosure provides an activation instruction processing apparatus. The apparatus includes a control module and an arithmetic module. The control module is used to compile the acquired activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain the data to be calculated and the target address required to execute the activation instruction according to the operation code and operation domain. The arithmetic module is used to perform an activation operation on the data to be calculated to obtain the operation result, and to store the operation result in the target address. The activation instruction processing device provided by the embodiments of the present disclosure has a wide application range, processes activation instructions with high efficiency and speed, and performs activation operations with high efficiency and speed.

In a possible implementation, the activation function used by the activation operation may include at least one of the following: a linear rectification function (Rectified Linear Unit, ReLU), an S-shaped growth curve function (Sigmoid function), a hyperbolic tangent function (tanh function), a leaky linear rectification function (Leaky ReLU, a variant of the ReLU function), a maximum-value function (maxout function, which outputs the maximum value), and a power function.

In this implementation, the activation function used for the activation operation may also be another function that is non-linear, continuously differentiable, as unsaturated as possible over its range, monotonic, approximately linear near the origin, and so on. This disclosure does not limit the function used for the activation operation.
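As an illustration only, the listed activation functions can be written as scalar functions. The following is a minimal sketch; the function names, the leak slope of 0.01, and the default exponent of 2 are assumptions chosen for the example and are not specified by this disclosure.

```python
import math
from typing import Sequence


def relu(x: float) -> float:
    """Linear rectification function: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Leaky ReLU: negative inputs are scaled by a small slope (assumed 0.01)."""
    return x if x >= 0 else slope * x


def sigmoid(x: float) -> float:
    """S-shaped growth curve, written so that math.exp never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh_act(x: float) -> float:
    """Hyperbolic tangent function."""
    return math.tanh(x)


def maxout(candidates: Sequence[float]) -> float:
    """Maxout: outputs the maximum of the candidate values."""
    return max(candidates)


def power(x: float, exponent: float = 2.0) -> float:
    """Power function with an assumed default exponent of 2."""
    return x ** exponent
```

Each function takes one datum (or, for maxout, one group of candidate values), so an activation operator could apply it element by element to the data to be calculated.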

In a possible implementation manner, the control module 8-11 may also be used to obtain an activation parameter table according to the operation code and / or operation domain.

The operation module 8-12 can also be used to perform the activation operation on the data to be calculated according to the activation parameter table to obtain the operation result.

The activation parameter table may include an activation table and a constant table.

In this implementation, the activation parameter table address may be included in the operation domain, so that the control module obtains the activation parameter table from the activation parameter table address. Alternatively, when the control module determines according to the operation code that the activation parameter table needs to be obtained, it may directly obtain the activation parameter table corresponding to the activation instruction from a predetermined storage address of the activation parameter table. A person skilled in the art may set the acquisition method of the activation parameter table according to actual needs, which is not limited in the present disclosure.

In a possible implementation manner, the control module can also obtain an activation function corresponding to the activation instruction, so that the operation module can perform activation calculation on the operation data according to the activation function and the corresponding operator.

It should be noted that, those skilled in the art may set the manner in which the calculation module implements the activation calculation according to actual needs, which is not limited in the present disclosure.

In this implementation manner, an activation table and a constant table required for activation operations using different activation functions can be predetermined. The activation table and constant table corresponding to different activation functions are different.
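The disclosure does not state how the activation table and the constant table are laid out or used. One common technique for table-driven activation is piecewise-linear approximation, and the following minimal sketch illustrates it under that assumption: the activation table is assumed to hold one (slope, intercept) pair per segment, and the constant table is assumed to hold the range bounds, the segment width, and the saturation values. All names and the table layout are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple


def build_tables(
    fn: Callable[[float], float], lo: float, hi: float, segments: int
) -> Tuple[List[Tuple[float, float]], Dict[str, float]]:
    """Precompute hypothetical tables for one activation function."""
    step = (hi - lo) / segments
    active_table = []
    for i in range(segments):
        x0 = lo + i * step
        x1 = x0 + step
        slope = (fn(x1) - fn(x0)) / step
        intercept = fn(x0) - slope * x0
        active_table.append((slope, intercept))
    const_table = {
        "lo": lo, "hi": hi, "step": step,
        "below": fn(lo),  # output used for inputs at or below the range
        "above": fn(hi),  # output used for inputs at or above the range
    }
    return active_table, const_table


def activate_with_tables(
    x: float, active_table: List[Tuple[float, float]], const_table: Dict[str, float]
) -> float:
    """Evaluate the approximation for one datum using only the two tables."""
    if x <= const_table["lo"]:
        return const_table["below"]
    if x >= const_table["hi"]:
        return const_table["above"]
    index = int((x - const_table["lo"]) / const_table["step"])
    index = min(index, len(active_table) - 1)  # guard against rounding at the edge
    slope, intercept = active_table[index]
    return slope * x + intercept
```

Under this assumption, a different activation function only requires a different pair of tables, which is consistent with the statement that the tables corresponding to different activation functions are different.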

FIG. 1-2a shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 1-2a, the arithmetic module 8-12 may include a plurality of activation operators 8-120. The plurality of activation operators 8-120 are used to perform the activation operation on the data to be calculated.

In this implementation, the arithmetic module may also include a single activation operator. The number of activation operators can be set according to the amount of data on which the activation operation is to be performed and the required processing speed and efficiency of the activation operation, which is not limited in the present disclosure.

FIG. 1-2b shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 1-2b, the operation module 8-12 may include a master operation sub-module 8-121 and multiple slave operation sub-modules 8-122, and the master operation sub-module 8-121 includes multiple activation operators 8-120 (not shown in the figure).

The master operation sub-module 8-121 is used to perform the activation operation on the data to be calculated by using the plurality of activation operators, obtain the operation result, and store the operation result in the target address.
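The use of several activation operators on one batch of data can be pictured as splitting the data into chunks and handing one chunk to each operator. The following is a minimal software sketch of that idea, assuming ReLU as the activation function and a thread pool as a stand-in for the hardware operators; the function names and the chunking policy are hypothetical and are not taken from this disclosure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def relu_chunk(chunk: List[float]) -> List[float]:
    """Work done by one (simulated) activation operator on its chunk."""
    return [max(0.0, v) for v in chunk]


def split(data: List[float], parts: int) -> List[List[float]]:
    """Split data into at most `parts` contiguous chunks of near-equal size."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def master_activate(data: List[float], num_operators: int = 4) -> List[float]:
    """Distribute the data over the operators and reassemble results in order."""
    if not data:
        return []
    chunks = split(data, num_operators)
    with ThreadPoolExecutor(max_workers=num_operators) as pool:
        results = list(pool.map(relu_chunk, chunks))
    return [value for chunk in results for value in chunk]
```

Because `pool.map` returns results in submission order, the reassembled output keeps the order of the input data regardless of which operator finishes first.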

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. The control module 8-11 is also used to obtain the read-in amount and obtain a plurality of data to be calculated according to the read-in amount. The data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.

In this implementation manner, the read-in amount may be the data amount of the acquired plurality of data to be calculated, and may be the size of the acquired data to be calculated. When the operation field directly contains the specific value of the read-in amount, the value can be determined as the read-in amount. When the storage address of the read-in amount is included in the operation domain, the read-in amount can be obtained from the storage address.

In a possible implementation manner, when the read-in amount is not included in the operation domain, a plurality of data to be calculated may be obtained according to a preset default read-in amount. The acquired data amount of the plurality of data to be calculated may be less than or equal to the default read-in amount.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be guaranteed, and the device can execute the activation instruction.
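The three ways of determining the read-in amount described above (a value carried in the operation domain, a storage address carried in the operation domain, or a preset default) can be sketched as follows. This is an illustrative model only: the dictionary-based memory, the key names `size` and `size_addr`, and the default value of 64 are assumptions made for the example.

```python
from typing import Dict, List

DEFAULT_READ_IN = 64  # hypothetical preset default read-in amount


def resolve_read_in(operation_domain: Dict[str, int], memory: Dict[int, object]) -> int:
    """Return the read-in amount from the operation domain, or the default."""
    if "size" in operation_domain:        # value given directly
        return operation_domain["size"]
    if "size_addr" in operation_domain:   # value stored at an address
        return int(memory[operation_domain["size_addr"]])
    return DEFAULT_READ_IN                # not present in the operation domain


def fetch_operands(memory: Dict[int, object], src0: int, read_in: int) -> List[float]:
    """Read at most `read_in` data items starting at the data address."""
    data = memory.get(src0, [])
    return list(data[:read_in])
```

Slicing with `[:read_in]` returns fewer items when less data is stored, which matches the statement that the acquired data amount may be less than or equal to the read-in amount.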

In a possible implementation manner, as shown in FIGS. 1-2a and 1-2b, the device may further include a storage module 8-13. The storage module 8-13 is used to store the data to be calculated. The storage module 8-13 can also be used to store the activation table and the constant table.

In this implementation manner, the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed temporary storage cache. The data to be calculated, the activation table and the constant table can be stored in the cache and / or register of the storage module as needed, and the disclosure does not limit this.

In a possible implementation, the device may further include a direct memory access module, which is used to read or store data from the storage module.

In a possible implementation manner, the control module 8-11 may also be used to generate an assembly file according to the activation instruction and translate the assembly file into a binary file, where the binary file is a compiled activation instruction.
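The two-stage compilation (activation instruction to assembly file, assembly file to binary file) can be sketched in software as follows. The disclosure does not define a binary encoding, so everything about the encoding here is an assumption: the opcode number 0x01, the one-byte opcode followed by five little-endian 32-bit operands, and the helper names are all hypothetical.

```python
import struct
from typing import List, Tuple

OPCODES = {"active": 0x01}  # hypothetical opcode numbering
LAYOUT = "<B5I"             # assumed: 1-byte opcode + five 32-bit operands
RECORD_SIZE = struct.calcsize(LAYOUT)


def to_assembly(opcode: str, *operands: int) -> str:
    """Stage 1: render one instruction as a line of assembly text."""
    return opcode + " " + " ".join(str(v) for v in operands) + "\n"


def assemble(assembly: str) -> bytes:
    """Stage 2: translate assembly text into a binary image."""
    out = bytearray()
    for line in assembly.splitlines():
        if not line.strip():
            continue
        mnemonic, *operands = line.split()
        out += struct.pack(LAYOUT, OPCODES[mnemonic], *(int(v) for v in operands))
    return bytes(out)


def decode(binary: bytes) -> List[Tuple[int, ...]]:
    """Parse the binary image back into (opcode, operands...) tuples."""
    return [
        struct.unpack_from(LAYOUT, binary, offset)
        for offset in range(0, len(binary), RECORD_SIZE)
    ]
```

For example, `assemble(to_assembly("active", 500, 100, 200, 300, 64))` yields one fixed-size record that `decode` turns back into the opcode number and the five operands.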

In a possible implementation manner, the instruction format of the activation instruction may be:

active dst src0 active_table const_table size

Here, active is the operation code of the activation instruction, and dst, src0, active_table, const_table, and size are the operation domains of the activation instruction: dst is the target address, src0 is the address of the data to be calculated, active_table is the activation table address, const_table is the constant table address, and size is the read-in amount.
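To make the roles of the fields concrete, the following minimal sketch splits an instruction written in the above format into its operation code and operation domain. The textual form with optional commas, the class name, and the function name are assumptions for illustration; the field names follow the format above.

```python
from dataclasses import dataclass


@dataclass
class ActivationInstruction:
    opcode: str        # operation code, expected to be "active"
    dst: int           # target address
    src0: int          # address of the data to be calculated
    active_table: int  # activation table address
    const_table: int   # constant table address
    size: int          # read-in amount


def parse_activation(text: str) -> ActivationInstruction:
    """Split 'active dst src0 active_table const_table size' into fields."""
    tokens = text.replace(",", " ").split()
    if not tokens or tokens[0] != "active":
        raise ValueError("not an activation instruction: " + repr(text))
    operands = [int(t) for t in tokens[1:]]
    if len(operands) != 5:
        raise ValueError("expected 5 operands, got " + str(len(operands)))
    return ActivationInstruction(tokens[0], *operands)
```

Calling `parse_activation("active 500, 100, 200, 300, 64")` gives a target address of 500, a data address of 100, table addresses of 200 and 300, and a read-in amount of 64.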

In a possible implementation, the instruction format of the activation instruction may also be:

active dst src0 size

Here, active is the operation code of the activation instruction, and dst, src0, and size are the operation domains of the activation instruction: dst is the target address, src0 is the address of the data to be calculated, and size is the read-in amount.

It should be understood that those skilled in the art can set the operation code of the activation instruction, the position of the operation code and the operation field in the instruction format as needed, and the disclosure does not limit this.

In a possible implementation manner, the device may be provided in a graphics processing unit (GPU), a central processing unit (CPU), or an embedded neural-network processing unit (NPU).

It should be noted that although the above-mentioned embodiment is taken as an example to introduce the activation instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, with "using an activation instruction processing device to perform an activation operation" as an exemplary application scenario, to facilitate understanding of the flow of the activation instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 1-3 is a schematic diagram illustrating an application scenario of an activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-3, the activation instruction processing device processes the activation instruction as follows:

Example 1

The control module 8-11 compiles the acquired activation instruction 1 to obtain the compiled activation instruction 1 (for example, activation instruction 1 is active 500, 100, 200, 300, 64). It then parses the compiled activation instruction 1 to obtain the operation code and operation domain of activation instruction 1. The operation code of activation instruction 1 is active, the target address is 500, the address of the data to be calculated is 100, the activation table address is 200, the constant table address is 300, and the read-in amount is 64. The control module 8-11 obtains the data to be calculated, with a data amount of 64 (the read-in amount), from data address 100, the activation table from activation table address 200, and the constant table from constant table address 300.

The operation module 8-12 performs activation calculation on the operation data according to the activation table and the constant table, obtains the operation result, and stores the operation result in the target address 500.

Example 2

The difference from Example 1 is that the activation instruction 1 is active 500, 100. Assuming that the activation operation needs to be performed according to the activation parameter table, the control module 8-11 needs to obtain the activation parameter table (see the above description for the specific implementation process).

For the working process of the above modules, please refer to the relevant description above.

In this way, the activation instruction processing device can process the activation instruction efficiently and quickly, and realize the efficient and rapid processing of the activation operation.

FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1-4, this method is applied to the above-mentioned activation instruction processing apparatus. The method includes step S51-8 and step S52-8. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-8 and S52-8.

In step S51-8, the control module is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain the data to be calculated and the target address required to execute the activation instruction according to the operation code and the operation domain. The operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be calculated and the target address.

In step S52-8, the arithmetic module is used to perform the activation operation on the data to be calculated to obtain the operation result, and the operation result is stored in the target address.
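The two steps can be pictured with a small software model. The following is a minimal sketch, not the disclosed hardware flow: the compilation stage is omitted, memory is modeled as a dictionary from addresses to lists, the instruction uses only a target address and a data address, and the activation function is passed in by the caller. All function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Memory = Dict[int, List[float]]


def step_s51(instruction: str, memory: Memory) -> Tuple[List[float], int]:
    """Parse the instruction, then fetch the data and the target address."""
    tokens = instruction.replace(",", " ").split()
    opcode, dst, src0 = tokens[0], int(tokens[1]), int(tokens[2])
    if opcode != "active":
        raise ValueError("unsupported operation code: " + opcode)
    return memory[src0], dst


def step_s52(
    data: List[float], dst: int, memory: Memory, fn: Callable[[float], float]
) -> List[float]:
    """Apply the activation function and store the result at the target address."""
    result = [fn(v) for v in data]
    memory[dst] = result
    return result


def run(instruction: str, memory: Memory, fn: Callable[[float], float]) -> List[float]:
    """Execute both steps in order for one instruction."""
    data, dst = step_s51(instruction, memory)
    return step_s52(data, dst, memory, fn)
```

For instance, with `memory = {100: [-1.0, 0.5]}` and a ReLU function, `run("active 500 100", memory, ...)` leaves the rectified data at address 500 and the source data at address 100 unchanged.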

In a possible implementation manner, the method may further include:

Obtain the activation parameter table according to the operation code and / or operation field;

Among them, the operation module is used to activate the operation data to obtain the operation result, including:

According to the activation parameter table, perform the activation operation on the data to be calculated to obtain the operation result.

In a possible implementation manner, the operation module is used to activate the operation data to obtain the operation result, which may include:

Use multiple activation operators to perform activation operations on the data to be calculated.

In a possible implementation manner, the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators,

Wherein, the operation module is used to activate the operation data to obtain the operation result, which may include:

Use multiple activation operators in the main operation sub-module to perform activation operation on the operation data to obtain the operation result, and store the operation result in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Among them, obtaining the data to be calculated, the activation table, the constant table, and the target address required to execute the activation instruction according to the operation code and the operation domain may include:

Obtain the read-in amount, and obtain multiple data to be calculated according to the read-in amount.

In a possible implementation manner, the method may further include: storing data to be calculated.

In a possible implementation manner, parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction may include:

Store the compiled activation instruction;

Analyze the compiled activation instruction to obtain the opcode and operation domain of the activation instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the compiled activation instructions.

In a possible implementation manner, the method may further include: when it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after it is determined that the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
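The overlap test that defines the association relationship can be sketched as follows. This is an illustrative model only: address intervals are assumed to be half-open ranges [start, end), and the helper that checks every earlier instruction in the queue (rather than a single zeroth instruction) is a generalization made for the example. The names are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # assumed half-open address range [start, end)


def overlaps(first: Interval, zeroth: Interval) -> bool:
    """True if the two storage address intervals share at least one address."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def dependencies(queue: List[Tuple[str, Interval]]) -> Dict[str, List[str]]:
    """For each queued instruction, list the earlier instructions it must wait for."""
    waits: Dict[str, List[str]] = {}
    for i, (name, interval) in enumerate(queue):
        waits[name] = [
            prev_name
            for prev_name, prev_interval in queue[:i]
            if overlaps(interval, prev_interval)
        ]
    return waits
```

An instruction with a non-empty wait list would be cached until the listed instructions complete, while an instruction with an empty list can be issued without waiting.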

In a possible implementation manner, the control module is used to compile the obtained activation instruction to obtain a compiled activation instruction, including:

The assembly file is generated according to the activation instruction, and the assembly file is translated into a binary file. Among them, the binary file is the activation instruction after compilation.

In a possible implementation manner, the activation function utilized by the activation operation may include at least one of the following:

Linear rectification function, S-shaped growth curve function, hyperbolic tangent function, linear rectification function with leakage, maximum function and power function.

It should be noted that, although the above embodiment is taken as an example to introduce the activation instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The method for processing an activation instruction provided by the embodiments of the present disclosure has a wide application range, and has a high processing efficiency and a fast processing speed for the activation instruction, and a high processing efficiency and a fast processing speed for performing the activation operation.

The foregoing can be better understood in light of the following clauses:

Clause A1. An activation instruction processing device, the device comprising:

A control module, which is used to compile the obtained activation instruction to obtain the compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain the data to be calculated and the target address required to execute the activation instruction according to the operation code and the operation domain;

Clause A2. The device according to Clause A1,

The control module is also used to obtain an activation parameter table according to the operation code and / or the operation domain;

The calculation module is also used to perform activation calculation on the data to be calculated according to the activation parameter table to obtain an operation result,

Wherein, the activation parameter table includes an activation table and a constant table.

Clause A3. The device according to Clause A1, the arithmetic module includes:

A plurality of activation operators, which are used to perform the activation operation on the data to be calculated.

Clause A4. The device according to Clause A3, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The master operation sub-module is configured to perform the activation operation on the data to be calculated by using the plurality of activation operators to obtain an operation result, and store the operation result in the target address.

Clause A5. The device according to Clause A1, the operation domain includes a read-in amount or a storage address of the read-in amount,

Wherein, the control module is also used to obtain the read-in amount, and obtain the data to be calculated according to the read-in amount.

Clause A6. The device according to Clause A1, the device further comprising:

The storage module is used for storing the data to be calculated.

Clause A7. The device according to Clause A1, the control module includes:

An instruction storage sub-module for storing the compiled activation instruction;

An instruction processing sub-module, which is used to parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction;

A queue storage sub-module is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled activation instructions.

Clause A8. The device according to Clause A7, the control module, further comprising:

A dependency processing sub-module, which is used to, when it is determined that there is an association relationship between the first instruction to be executed among the plurality of instructions to be executed and the zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the arithmetic module,

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.

Clause A9. The device according to Clause A1,

The control module is also used to generate an assembly file according to the activation instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled activation instruction.

Clause A10. The device according to any one of Clause A1 to Clause A9, the activation function utilized by the activation operation includes at least one of the following:

Clause A11. A machine learning computing device, the device comprising:

One or more activation instruction processing devices as described in any one of Clause A1 to Clause A10, which are used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to the other processing devices through an I/O interface;

Clause A12. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause A11, a general interconnection interface, and other processing devices;

The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,

Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.

Clause A13. A machine learning chip, the machine learning chip comprising:

The machine learning computing device according to Clause A11 or the combined processing device according to Clause A12.

Clause A14. An electronic device, the electronic device comprising:

The machine learning chip as described in Clause A13.

Clause A15. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip as described in Clause A13;

Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;

The storage device is used for storing data;

The interface device is used to realize data transmission between the machine learning chip and an external device;

The control device is used for monitoring the state of the machine learning chip.

Clause A16. An activation instruction processing method, the method is applied to an activation instruction processing device, the method includes:

Using a control module to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain the data to be calculated and the target address required to execute the activation instruction according to the operation code and the operation domain;

Clause A17. The method according to Clause A16, the method further comprising:

Obtaining an activation parameter table according to the operation code and / or the operation domain;

Wherein, the operation module is used to activate the operation data to obtain the operation result, including:

Performing an activation operation on the data to be calculated according to the activation parameter table to obtain an operation result,

Clause A18. According to the method described in Clause A16, an operation module is used to activate the data to be operated to obtain an operation result, including:

Clause A19. The method according to Clause A18, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

Use multiple activation operators in the main operation sub-module to perform activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.

Clause A20. The method according to Clause A16, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, obtaining the data to be operated, the activation table, the constant table and the target address required to execute the activation instruction according to the operation code and the operation domain include:

Acquiring the read-in amount, and acquiring the data to be calculated according to the read-in amount.

Clause A21. The method according to Clause A16, the method further comprising:

Store the data to be calculated.

Clause A22. According to the method described in Clause A16, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, including:

Storing the compiled activation instruction;

Parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled activation instructions.

Clause A23. The method according to Clause A22, the method further comprising:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and, after it is determined that the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,

Clause A24. According to the method described in Clause A16, use the control module to compile the obtained activation instruction to obtain the compiled activation instruction, including:

Generate an assembly file according to the activation instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled activation instruction.

Clause A25. The method according to any one of Clause A16 to Clause A24, the activation function utilized by the activation operation includes at least one of the following:

FIG. 2-1 shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 2-1, the device includes a control module 6-11 and an arithmetic module 6-12.

The control module 6-11 is used to compile the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction, parse the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain the data to be calculated and the target address required to execute the linear rectification function activation instruction according to the operation code and operation domain.

The operation code is used to indicate that the activation operation performed by the linear rectification function activation instruction on the data is a linear rectification function activation operation. The operation domain includes the address of the data to be calculated and the target address.

The operation module 6-12 is configured to perform a linear rectification function activation operation on the data to be operated, obtain an operation result, and store the operation result in the target address.

In this embodiment, the linear rectification function activation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the linear rectification function activation instruction (uncompiled). After the compiled linear rectification function activation instruction is obtained, the compiled linear rectification function activation instruction can be analyzed. The compiled linear rectification function activation instruction is a hardware instruction that can be directly executed by the hardware.

In this embodiment, the compiled linear rectification function activation instruction may be an instruction that corresponds only to the linear rectification function activation instruction. Alternatively, the compiled instruction may be one shared by all activation operation instructions, so that when the device processes the linear rectification function activation instruction, it can perform the linear rectification function activation operation without distinguishing the activation function corresponding to the instruction, which simplifies processing.

In this embodiment, the control module can obtain the data to be calculated from the data address to be calculated. The control module may determine the data required to perform the linear rectification function activation operation according to the operation code of the linear rectification function activation instruction. The control module can obtain instructions and data through the data input and output unit, which can be one or more data I / O interfaces or I / O pins.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) that specifies the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as a corresponding address. All data required to execute the corresponding instruction includes the data to be operated on, the corresponding operation method, and so on. A linear rectification function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.

It should be understood that those skilled in the art may set the instruction format of the linear rectification function activation instruction, as well as the included operation codes and operation domains as needed, which is not limited in this disclosure.

An embodiment of the present disclosure provides a linear rectification function activation instruction processing device. The device includes a control module and an operation module. The control module is used to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parse the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and operation domain, the data to be operated on and the target address required to execute the linear rectification function activation instruction. The operation module is used to perform the linear rectification function activation operation on the data to be operated on to obtain the operation result and store the operation result in the target address. The linear rectification function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of application, processes the linear rectification function activation instruction with high efficiency and speed, and performs the linear rectification function activation operation with high efficiency and speed.

In a possible implementation manner, the control module 6-11 may also be used to obtain a linear rectification activation function parameter table according to the operation code and / or operation domain.

The operation module 6-12 can also be used to perform a linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table to obtain an operation result.

The linear rectification activation function parameter table may include a linear rectification activation function activation table and a linear rectification activation function constant table.

In this implementation, the operation domain may include the address of the linear rectification activation function parameter table, so that the control module obtains the linear rectification activation function parameter table from that address. Alternatively, when the control module determines from the operation code that executing the linear rectification function activation instruction requires a linear rectification activation function parameter table, it may obtain the table directly from the predetermined storage address of the linear rectification activation function parameter table. As a further alternative, when the control module determines from the operation code that the linear rectification function activation instruction requires a linear rectification activation function parameter table, it may obtain the linear rectification activation function parameter table corresponding to the instruction directly from the storage address of a predetermined parameter table. A person skilled in the art can set the acquisition method of the linear rectification activation function parameter table according to actual needs, which is not limited in the present disclosure.

In a possible implementation manner, the control module may also obtain an activation function corresponding to the linear rectification function activation instruction, so that the operation module may perform linear rectification function activation operation on the operation data according to the activation function and the corresponding operator.

It should be noted that, those skilled in the art can set the manner in which the calculation module implements the linear rectification function activation calculation according to actual needs, and the disclosure does not limit this.

FIG. 2-2a shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 2-2a, the operation module 6-12 may include multiple activation operators 6-120. The multiple activation operators 6-120 are used to perform the linear rectification function activation operation on the data to be operated on.

In this implementation, the operation module may also include a single activation operator. The number of activation operators can be set according to the amount of data on which the linear rectification function activation operation is to be performed and the required processing speed and efficiency of that operation, which is not limited in the present disclosure.

FIG. 2-2b shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 2-2b, the operation module 6-12 may include a master operation submodule 6-121 and a plurality of slave operation submodules 6-122, and the master operation submodule 6-121 includes multiple activation operators 6-120 (not shown in the figure).

The master operation submodule 6-121 is used to perform the linear rectification function activation operation on the data to be calculated using the plurality of activation operators to obtain the operation result, and to store the operation result in the target address.
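The division of work among several activation operators can be sketched as follows. This is only an illustrative software model, not the hardware described in this disclosure: the names `activation_operator`, `split_evenly`, and `master_relu`, the use of a thread pool to stand in for parallel operators, and the even chunking policy are all assumptions introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def activation_operator(chunk: List[float]) -> List[float]:
    """One hypothetical activation operator: applies max(0, x) to its chunk."""
    return [max(0.0, v) for v in chunk]


def split_evenly(data: List[float], parts: int) -> List[List[float]]:
    """Split data into at most `parts` contiguous chunks (assumed policy)."""
    step = -(-len(data) // parts)  # ceiling division
    return [data[i:i + step] for i in range(0, len(data), step)]


def master_relu(data: List[float], num_operators: int = 4) -> List[float]:
    """Model of a master submodule dispatching chunks to several operators."""
    if not data:
        return []
    chunks = split_evenly(data, num_operators)
    with ThreadPoolExecutor(max_workers=num_operators) as pool:
        # map() returns results in input order, so the output stays aligned.
        parts = list(pool.map(activation_operator, chunks))
    return [v for part in parts for v in part]


result = master_relu([-2.0, -1.0, 0.0, 1.0, 2.0], num_operators=2)
```

Increasing `num_operators` in this model corresponds to trading more operators for higher throughput, which is the kind of sizing decision the text leaves to the implementer.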

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. The control module 6-11 is also used to obtain the read-in amount and to obtain a plurality of data to be calculated according to the read-in amount. The data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.

In the above manner, the amount and size of the data to be operated on can be limited, which helps guarantee the accuracy of the operation result and ensures that the device can execute the linear rectification function activation instruction.
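How a read-in amount bounds the fetched data can be illustrated with a minimal sketch. It assumes, purely for illustration, that memory is a Python list addressed by element index; the function name `read_operands` is hypothetical.

```python
from typing import List


def read_operands(memory: List[float], src0: int, read_in_amount: int) -> List[float]:
    """Fetch at most `read_in_amount` elements starting at address `src0`.

    Slicing stops at the end of memory, so the amount of data actually
    returned is less than or equal to the read-in amount.
    """
    if src0 < 0 or read_in_amount < 0:
        raise ValueError("address and read-in amount must be non-negative")
    return memory[src0:src0 + read_in_amount]


memory = [float(i) for i in range(10)]
full = read_operands(memory, src0=2, read_in_amount=4)   # four elements available
short = read_operands(memory, src0=8, read_in_amount=4)  # only two elements remain
```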

In a possible implementation manner, as shown in FIGS. 2-2a and 2-2b, the device may further include a storage module 6-13. The storage module 6-13 is used to store the data to be calculated. The storage module 6-13 can also be used to store the linear rectification activation function parameter table.

In this implementation manner, the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed temporary storage cache. The data to be calculated and the parameter table of the linear rectification activation function can be stored in the cache and / or register of the storage module as needed, and the disclosure does not limit this.

In a possible implementation, as shown in FIGS. 2-2a and 2-2b, the control module 6-11 can also be used to generate an assembly file according to the linear rectification function activation instruction and translate the assembly file into a binary file. The binary file is the compiled linear rectification function activation instruction.
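The two-stage compilation (instruction to assembly file, assembly file to binary file) can be sketched as below. The disclosure does not specify an assembly syntax or a binary encoding, so everything concrete here is a hypothetical assumption made for illustration: the comma-separated assembly line, the numeric opcode `0x01`, and the little-endian layout of one opcode byte followed by three 32-bit fields.

```python
import struct

# Hypothetical values: the source does not define any binary encoding.
OPCODE_IDS = {"active.relu": 0x01}
ENCODING = "<BIII"  # 1-byte opcode id, then dst, src0, size as 32-bit unsigned


def assemble(instruction: str) -> str:
    """Produce a one-line 'assembly file' from the uncompiled instruction."""
    opcode, dst, src0, size = instruction.split()
    return f"{opcode} {int(dst)}, {int(src0)}, {int(size)}\n"


def translate(assembly: str) -> bytes:
    """Translate the assembly text into a binary form (the compiled instruction)."""
    opcode, operands = assembly.strip().split(" ", 1)
    dst, src0, size = (int(x) for x in operands.split(","))
    return struct.pack(ENCODING, OPCODE_IDS[opcode], dst, src0, size)


binary = translate(assemble("active.relu 500 100 64"))
decoded = struct.unpack(ENCODING, binary)
```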

In a possible implementation manner, the instruction format of the linear rectification function activation instruction may be:

active.relu dst src0 size

Here, active.relu is the operation code of the linear rectification function activation instruction, and dst, src0, and size are the operation domains of the linear rectification function activation instruction: dst is the target address, src0 is the data address to be calculated, and size is the read-in amount.

Alternatively, the instruction format of the linear rectification function activation instruction may be:

active.relu dst src0 src1 size

Here, active.relu is the operation code of the linear rectification function activation instruction, and dst, src0, src1, and size are the operation domains of the linear rectification function activation instruction: dst is the target address, src0 is the data address to be calculated, src1 is the address of the linear rectification activation function parameter table, and size is the read-in amount.
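Parsing these two formats into an operation code and operation domain can be sketched as follows. The sketch assumes a whitespace-separated textual form with decimal addresses; the `ReluInstruction` class and `parse_relu_instruction` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReluInstruction:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter table address (second format only)


def parse_relu_instruction(text: str) -> ReluInstruction:
    """Split an instruction into its operation code and operation domain."""
    opcode, *fields = text.split()
    if opcode != "active.relu":
        raise ValueError(f"unexpected operation code: {opcode}")
    values = [int(f) for f in fields]
    if len(values) == 3:
        dst, src0, size = values
        return ReluInstruction(opcode, dst, src0, size)
    if len(values) == 4:
        dst, src0, src1, size = values
        return ReluInstruction(opcode, dst, src0, size, src1)
    raise ValueError("expected 3 or 4 operation-domain fields")


short_form = parse_relu_instruction("active.relu 500 100 64")
long_form = parse_relu_instruction("active.relu 500 100 200 64")
```

The field count is what distinguishes the two formats in this sketch; a real encoding could equally use a flag or a distinct operation code, which the text leaves open.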

It should be understood that those skilled in the art can set the operation code of the linear rectification function activation instruction, the position of the operation code and the operation domain in the instruction format according to need, and this disclosure does not limit this.

It should be noted that, although the above embodiment is taken as an example to introduce the linear rectification function activation instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, taking “performing an activation operation using the linear rectification function activation instruction processing device” as an exemplary application scenario, so as to facilitate understanding of the processing flow of the linear rectification function activation instruction processing device. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered a limitation on the embodiments of the present disclosure.

FIG. 2-3 shows a schematic diagram of an application scenario of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 2-3, the linear rectification function activation instruction processing device processes the linear rectification function activation instruction as follows:

As shown in FIG. 2-3, the control module 6-11 compiles the obtained linear rectification function activation instruction 1 to obtain the compiled linear rectification function activation instruction 1 (for example, the linear rectification function activation instruction 1 is active.relu 500 100 64), and parses the compiled linear rectification function activation instruction 1 to obtain the operation code and operation domain of the linear rectification function activation instruction 1. The operation code of the linear rectification function activation instruction 1 is active.relu, the target address is 500, the data address to be calculated is 100, and the read-in amount is 64. The control module 6-11 acquires the data to be calculated with a data amount of 64 (the read-in amount) from the data address to be calculated 100. Assuming that the activation calculation needs to be performed according to the linear rectification activation function parameter table, the control module 6-11 also needs to obtain the linear rectification activation function parameter table (see the above description for the specific implementation process).

The operation module 6-12 performs the linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table, obtains the operation result, and stores the operation result in the target address 500.

In this way, the linear rectification function activation instruction processing device can efficiently and quickly process the linear rectification function activation instruction, and realize the efficient and rapid processing of the linear rectification function activation operation.
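The application example above can be traced end to end with a small software model. It is a sketch under stated assumptions: memory is modelled as a dictionary mapping an address to one element, addresses advance by one per element, the input values are made up, and the activation is computed directly as max(0, x) rather than through a parameter table. The function name `execute_active_relu` is hypothetical.

```python
from typing import Dict


def relu(x: float) -> float:
    """Linear rectification function: x for positive inputs, otherwise 0."""
    return x if x > 0 else 0.0


def execute_active_relu(memory: Dict[int, float], dst: int, src0: int, size: int) -> None:
    """Model of executing 'active.relu dst src0 size' on a dictionary memory."""
    # Control module: fetch `size` elements starting at address src0.
    data = [memory[src0 + i] for i in range(size)]
    # Operation module: apply the activation and store the result at dst.
    for i, value in enumerate(data):
        memory[dst + i] = relu(value)


# Made-up input data: 64 values from -32.0 to 31.0 stored at addresses 100..163.
memory = {100 + i: float(i - 32) for i in range(64)}
execute_active_relu(memory, dst=500, src0=100, size=64)
```

After the call, addresses 500 to 563 hold the activated values while the source region at 100 to 163 is left unchanged.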

FIG. 2-4 shows a flowchart of a method for processing a linear rectification function activation instruction according to an embodiment of the present disclosure. As shown in FIG. 2-4, the method is applied to the above linear rectification function activation instruction processing device, and the method includes step S51-6 and step S52-6. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as the following steps S51-6 and S52-6.

In step S51-6, the control module is used to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parse the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and operation domain, the data to be calculated and the target address required to execute the linear rectification function activation instruction. The operation code is used to indicate that the activation operation performed on the data by the linear rectification function activation instruction is the linear rectification function activation operation, and the operation domain includes the data address to be operated and the target address.

In step S52-6, the operation module performs the linear rectification function activation operation on the data to be calculated to obtain the operation result, and stores the operation result in the target address.

In a possible implementation manner, the method may further include:

Obtain the linear rectification activation function parameter table according to the operation code and / or operation domain;

Wherein, using the operation module to perform the linear rectification function activation operation on the operation data to obtain the operation result includes: performing the linear rectification function activation operation on the operation data according to the linear rectification activation function parameter table to obtain the operation result. The linear rectification activation function parameter table may include a linear rectification activation function activation table and a linear rectification activation function constant table.

In a possible implementation manner, using the operation module to perform a linear rectification function activation operation on the operation data to obtain an operation result may include: using multiple activation operators to perform a linear rectification function activation operation on the operation data.

In a possible implementation manner, the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators.

The plurality of activation operators in the master operation sub-module are used to perform the linear rectification function activation operation on the data to be operated on to obtain an operation result.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Wherein, obtaining the data to be operated and the target address required to execute the linear rectification function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and obtaining the data to be operated according to the read-in amount.

In a possible implementation manner, parsing the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction may include:

Storing the compiled linear rectification function activation instruction;

Parsing the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed include the compiled linear rectification function activation instruction.
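The three steps above (storing the compiled instruction, parsing it, and keeping an execution-ordered queue) can be sketched with a first-in, first-out queue. The `ControlModule` class and its method names are hypothetical, and the compiled instruction is represented as a text string purely for readability.

```python
from collections import deque
from typing import Deque, List, Tuple


class ControlModule:
    """Illustrative model of instruction storage, parsing, and queueing."""

    def __init__(self) -> None:
        self.instruction_store: List[str] = []  # instruction storage
        self.queue: Deque[str] = deque()        # instructions in execution order

    def store(self, compiled: str) -> None:
        """Store a compiled instruction and append it to the instruction queue."""
        self.instruction_store.append(compiled)
        self.queue.append(compiled)

    def next_parsed(self) -> Tuple[str, List[int]]:
        """Take the next instruction in order and split opcode from operation domain."""
        compiled = self.queue.popleft()
        opcode, *domain = compiled.split()
        return opcode, [int(field) for field in domain]


control = ControlModule()
control.store("active.relu 500 100 64")
control.store("active.relu 600 200 32")
first = control.next_parsed()
second = control.next_parsed()
```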

In a possible implementation manner, compiling the obtained linear rectification function activation instruction by the control module to obtain a compiled linear rectification function activation instruction may include: generating an assembly file according to the linear rectification function activation instruction, and translating the assembly file into a binary file. The binary file is the compiled linear rectification function activation instruction.

It should be noted that although the above embodiment is taken as an example to introduce the linear rectification function activation instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The linear rectification function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of application, and has high processing efficiency and fast processing speed for the linear rectification function activation instruction, and high processing efficiency and fast processing speed for performing the linear rectification function activation operation.

The foregoing can be better understood based on the following clauses:

Clause B1. A linear rectification function activation instruction processing device, the device comprising:

The control module is used to compile the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction, parse the compiled linear rectification function activation instruction to obtain the operation code and the operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated and the target address required to execute the linear rectification function activation instruction;

The operation module is used to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the activation operation performed by the linear rectification function activation instruction on the data is a linear rectification function activation operation, and the operation domain includes the data address to be operated and the target address.

Clause B2, the device according to Clause B1,

The control module is further configured to obtain a linear rectification activation function parameter table according to the operation code and / or the operation domain;

The operation module is further configured to perform a linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table to obtain an operation result,

Wherein, the linear rectification activation function parameter table includes a linear rectification activation function activation table and a linear rectification activation function constant table.

Clause B3. The device according to Clause B1, the operation module includes:

A plurality of activation operators are used to perform a linear rectification function activation operation on the data to be operated.

Clause B4. The device according to Clause B3, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The master operation sub-module is configured to use the plurality of activation operators to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.

Clause B5. The device according to Clause B1, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Clause B6. The device according to Clause B1, the device further comprising:

The storage module is used for storing the data to be calculated.

Clause B7. The device according to Clause B1, the control module includes:

An instruction storage sub-module for storing the compiled linear rectification function activation instruction;

An instruction processing sub-module, which is used to analyze the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled linear rectification function activation instruction.

Clause B8. The device according to Clause B7, the control module, further comprising:

Clause B9. The device according to Clause B1,

The control module is also used to generate an assembly file according to the linear rectification function activation instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled linear rectification function activation instruction.

Clause B10. A machine learning computing device, the device comprising:

One or more linear rectification function activation instruction processing devices as described in any one of Clauses B1 to B9, used to obtain data to be calculated and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the linear rectification function activation instruction processing devices, the plurality of linear rectification function activation instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of linear rectification function activation instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of linear rectification function activation instruction processing devices share the same control system or have their own control systems; the plurality of linear rectification function activation instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of linear rectification function activation instruction processing devices is any interconnection topology.

Clause B11. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause B10;

Clause B12. A machine learning chip, the machine learning chip includes:

The machine learning computing device described in Clause B10 or the combined processing device described in Clause B11.

Clause B13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause B12.

Clause B14, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause B12;

The storage device is used for storing data;

Clause B15. A method for processing a linear rectification function activation instruction, the method being applied to a linear rectification function activation instruction processing device, the method including:

Using a control module to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parse the compiled linear rectification function activation instruction to obtain the operation code and the operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated and the target address required to execute the linear rectification function activation instruction;

Using an operation module to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Clause B16. The method according to Clause B15, the method further comprising:

Obtaining a linear rectification activation function parameter table according to the operation code and / or the operation domain;

Wherein, the operation module is used to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, including:

Performing a linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table to obtain an operation result,

Clause B17. According to the method described in Clause B15, the operation module is used to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, including:

Clause B18. The method according to Clause B17, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The plurality of activation operators in the master operation sub-module are used to perform a linear rectification function activation operation on the data to be operated to obtain an operation result.

Clause B19. The method according to Clause B15, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, obtaining the data to be operated and the target address required to execute the linear rectification function activation instruction according to the operation code and the operation domain includes:

Clause B20. The method according to Clause B15, the method further comprising:

Store the data to be calculated.

Clause B21. According to the method described in Clause B15, analyze the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, including:

Storing the compiled linear rectification function activation instruction;

Parse the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled linear rectification function activation instruction.

Clause B22. The method according to Clause B21, the method further comprising:

Clause B23. According to the method described in Clause B15, use the control module to compile the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction, including:

Generate an assembly file according to the linear rectification function activation instruction, and translate the assembly file into a binary file,

FIG. 3-1 shows a block diagram of an S-shaped growth curve function activation instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 3-1, the device includes a control module 7-11 and an operation module 7-12.

The control module 7-11 is used to compile the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, parse the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction.

The operation code is used to indicate that the activation operation performed on the data by the S-shaped growth curve function activation instruction is the S-shaped growth curve function activation operation. The operation domain includes the address of the data to be calculated and the target address.

The operation module 7-12 is used to perform S-shaped growth curve function activation operation on the data to be calculated, obtain the operation result, and store the operation result in the target address.
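As an illustration of what the operation module computes, the following sketch assumes that the S-shaped growth curve function is the standard logistic sigmoid, 1 / (1 + e^(-x)). The source does not give the formula, so this is an assumption, as are the dictionary memory model, the made-up addresses, and the function names.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written so that math.exp never receives a large positive value."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def execute_sigmoid(memory: Dict[int, float], dst: int, src0: int, size: int) -> None:
    """Read `size` elements from src0, apply the sigmoid, store results at dst."""
    for i in range(size):
        memory[dst + i] = sigmoid(memory[src0 + i])


# Hypothetical addresses and data, chosen only for the example.
memory = {100: -1000.0, 101: 0.0, 102: 1000.0}
execute_sigmoid(memory, dst=500, src0=100, size=3)
```

The two-branch form avoids overflow for inputs of large magnitude, which a direct evaluation of 1 / (1 + exp(-x)) would hit for very negative x.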

In this embodiment, the S-shaped growth curve function activation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the (uncompiled) S-shaped growth curve function activation instruction. After the compiled S-shaped growth curve function activation instruction is obtained, the compiled S-shaped growth curve function activation instruction can be parsed. The compiled S-shaped growth curve function activation instruction is a hardware instruction that can be directly executed by the hardware.

In this embodiment, the compiled S-shaped growth curve function activation instruction may be the only instruction corresponding to the S-shaped growth curve function activation instruction. The compiled S-shaped growth curve function activation instruction may also cover all activation operation instructions, so that when the device processes the S-shaped growth curve function activation instruction, it can perform the S-shaped growth curve function activation operation without having to distinguish which activation function the instruction corresponds to, which simplifies instruction processing.

In this embodiment, the control module can obtain the data to be calculated from the data address to be calculated. The control module may determine the data required for the S-shaped growth curve function activation operation according to the operation code of the S-shaped growth curve function activation instruction. The control module can obtain instructions and data through the data input and output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), specified in a computer program to perform an operation; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as a corresponding address. All data required to execute the corresponding instruction includes the data to be operated on and the corresponding operation method. An S-shaped growth curve function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.

It should be understood that, those skilled in the art can set the instruction format of the S-shaped growth curve function activation instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module may receive an S-shaped growth curve function activation instruction and control one or more arithmetic modules to perform the S-shaped growth curve function activation operation. When the device includes multiple control modules, the control modules may each receive an S-shaped growth curve function activation instruction and control the corresponding one or more arithmetic modules to perform the S-shaped growth curve function activation operation.

An embodiment of the present disclosure provides an S-shaped growth curve function activation instruction processing device. The device includes a control module and an arithmetic module. The control module is used to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, parse the compiled instruction to obtain its operation code and operation domain, and obtain, according to the operation code and operation domain, the data to be calculated and the target address required to execute the instruction. The arithmetic module is used to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result, and to store the operation result in the target address. The S-shaped growth curve function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of application, processes the S-shaped growth curve function activation instruction with high efficiency and speed, and performs the S-shaped growth curve function activation operation with high efficiency and speed.

In a possible implementation, the control module 7-11 can also be used to obtain the S-shaped growth curve activation function parameter table according to the operation code and/or operation domain.

The operation module can also be used to perform S-type growth curve function activation calculation on the data to be calculated according to the S-type growth curve activation function parameter table to obtain the operation result.

The S-type growth curve activation function parameter table may include an S-type growth curve activation function activation table and an S-type growth curve activation function constant table.

In this implementation, the address of the S-shaped growth curve activation function parameter table may be included in the operation domain, so that the control module obtains the parameter table from that address. Alternatively, the control module may determine from the operation code that the S-shaped growth curve activation function parameter table is required to execute the S-shaped growth curve function activation instruction, and may obtain the table directly from its predetermined storage address. As a further alternative, the control module may determine from the operation code that the parameter table is required, and may obtain, from the storage address of a predetermined parameter table, the S-shaped growth curve activation function parameter table corresponding to the S-shaped growth curve function activation instruction. Those skilled in the art can set the acquisition method of the S-shaped growth curve activation function parameter table according to actual needs, which is not limited in the present disclosure.
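The disclosure does not specify what the activation table and the constant table contain. Purely as an illustration, the following sketch assumes that the activation table holds per-segment slopes and the constant table holds per-segment intercepts of a piecewise-linear approximation of the sigmoid. The input range, the segment count, and every function name are hypothetical and are not taken from the disclosure.

```python
import math

# Hypothetical layout: the input range [LO, HI] is split into SEGMENTS equal
# pieces; activation_table[i] is the slope and constant_table[i] the intercept
# of the line used on segment i. None of this is specified by the disclosure.
LO, HI, SEGMENTS = -8.0, 8.0, 64


def sigmoid(x: float) -> float:
    """Reference S-shaped growth curve (logistic) function."""
    return 1.0 / (1.0 + math.exp(-x))


def build_parameter_table():
    """Build (activation_table, constant_table) by interpolating the sigmoid."""
    step = (HI - LO) / SEGMENTS
    activation_table, constant_table = [], []
    for i in range(SEGMENTS):
        x0, x1 = LO + i * step, LO + (i + 1) * step
        slope = (sigmoid(x1) - sigmoid(x0)) / (x1 - x0)
        activation_table.append(slope)
        constant_table.append(sigmoid(x0) - slope * x0)
    return activation_table, constant_table


def table_sigmoid(x, activation_table, constant_table):
    """Approximate sigmoid(x) with one table lookup and one multiply-add."""
    if x <= LO:
        return 0.0  # saturate below the tabulated range
    if x >= HI:
        return 1.0  # saturate above the tabulated range
    step = (HI - LO) / SEGMENTS
    i = min(int((x - LO) / step), SEGMENTS - 1)
    return activation_table[i] * x + constant_table[i]
```

Under these assumptions, a lookup replaces the exponential and the division with an index computation, one multiplication, and one addition, at the cost of a small approximation error that shrinks as the segment count grows.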

In a possible implementation, the control module can also obtain an activation function corresponding to the S-shaped growth curve function activation instruction, so that the operation module can perform the S-shaped growth curve function activation operation on the data to be calculated according to the activation function and the corresponding operator.

It should be noted that those skilled in the art can set the manner in which the calculation module implements the S-shaped growth curve function activation calculation according to actual needs, and the disclosure does not limit this.

FIG. 3-2a shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 3-2a, the arithmetic module 7-12 may include multiple activation operators 7-120. The multiple activation operators 7-120 are used to perform the S-shaped growth curve function activation operation on the data to be calculated.

In this implementation, the arithmetic module may also include a single activation operator. The number of activation operators can be set according to the amount of data to be processed and the required processing speed and efficiency of the S-shaped growth curve function activation operation, and the disclosure does not limit this.
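How several activation operators might share the work can be sketched as follows. This is only an assumed software analogy: the data to be calculated is split into contiguous chunks, each chunk is handled by one worker standing in for an activation operator, and the partial results are merged in order. The names `activation_operator` and `run_activation_operators` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def activation_operator(chunk):
    """One hypothetical activation operator: applies sigmoid to its chunk."""
    return [sigmoid(x) for x in chunk]


def run_activation_operators(data, num_operators):
    """Split data into contiguous chunks, one per operator, and merge in order."""
    if num_operators < 1:
        raise ValueError("num_operators must be at least 1")
    chunk_size = max(1, -(-len(data) // num_operators))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_operators) as pool:
        # map() yields results in the same order as the input chunks
        results = list(pool.map(activation_operator, chunks))
    return [y for part in results for y in part]
```

Threads are used here only to model independent operators; hardware activation operators would run truly in parallel.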

FIG. 3-2b shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 3-2b, the operation module 7-12 may include a master operation sub-module 7-121 and a plurality of slave operation sub-modules 7-122, and the master operation sub-module 7-121 includes multiple activation operators 7-120 (not shown in the figure).

The main operation sub-module 7-121 is used to perform S-shaped growth curve function activation calculation on the data to be calculated by using a plurality of activation operators to obtain operation results, and store the operation results in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. The control module 7-11 is also used to obtain the read-in amount and to obtain a plurality of data to be calculated according to the read-in amount. The data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be ensured, and the device can execute the S-shaped growth curve function activation instruction.

In a possible implementation manner, as shown in FIGS. 3-2a and 3-2b, the device may further include a storage module 7-13. The storage module 7-13 is used to store data to be calculated. The storage module 7-13 can also be used to store the S-shaped growth curve activation function parameter table.

In this implementation manner, the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed temporary storage cache. The to-be-calculated data and the S-shaped growth curve activation function parameter table can be stored in the storage module cache and / or register as needed, and the disclosure does not limit this.

In a possible implementation, the control module 7-11 is also used to generate an assembly file according to the S-shaped growth curve function activation instruction and translate the assembly file into a binary file, where the binary file is the compiled S-shaped growth curve function activation instruction.
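The two compilation stages (instruction to assembly text, assembly text to binary) can be illustrated with the following sketch. The disclosure does not define the assembly syntax or the binary layout, so the encoding below (a 1-byte opcode followed by three unsigned 32-bit little-endian operands) and the opcode value 0x01 are assumptions chosen only for illustration.

```python
import struct

# Hypothetical encoding: 1-byte opcode followed by three unsigned 32-bit
# little-endian operands (dst, src0, size). Not defined by the disclosure.
OPCODES = {"active.sigmoid": 0x01}
LAYOUT = "<BIII"


def to_assembly(opcode, dst, src0, size):
    """Emit one line of (hypothetical) assembly text for the instruction."""
    return f"{opcode} {dst} {src0} {size}"


def assemble(assembly_line):
    """Translate one assembly line into its (hypothetical) binary form."""
    mnemonic, *operands = assembly_line.split()
    dst, src0, size = (int(tok) for tok in operands)
    return struct.pack(LAYOUT, OPCODES[mnemonic], dst, src0, size)


def disassemble(binary):
    """Decode the binary form back into (mnemonic, dst, src0, size)."""
    code, dst, src0, size = struct.unpack(LAYOUT, binary)
    mnemonic = next(name for name, value in OPCODES.items() if value == code)
    return mnemonic, dst, src0, size
```

In this sketch the binary produced by `assemble` plays the role of the compiled instruction that the hardware would parse.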

In a possible implementation manner, the instruction format of the S-shaped growth curve function activation instruction may be:

active.sigmoid dst src0 size

Here, active.sigmoid is the operation code of the S-shaped growth curve function activation instruction, and dst, src0, and size form its operation domain: dst is the target address, src0 is the address of the data to be calculated, and size is the read-in amount. When the address of the parameter table is also supplied, the instruction format may be:

active.sigmoid dst src0 src1 size

Here, active.sigmoid is the operation code of the S-shaped growth curve function activation instruction, and dst, src0, src1, and size form its operation domain: dst is the target address, src0 is the address of the data to be calculated, src1 is the address of the S-shaped growth curve activation function parameter table, and size is the read-in amount.
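As a minimal sketch of how the two instruction formats above could be split into an operation code and an operation domain, the following parser accepts both the three-operand and the four-operand form. It assumes that operands are whitespace-separated decimal integers; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SigmoidInstruction:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter table address, if present


def parse_instruction(text: str) -> SigmoidInstruction:
    """Split an instruction into its operation code and operation domain."""
    opcode, *fields = text.split()
    if opcode != "active.sigmoid":
        raise ValueError(f"unexpected operation code: {opcode}")
    values = [int(f) for f in fields]
    if len(values) == 3:
        dst, src0, size = values
        return SigmoidInstruction(opcode, dst, src0, size)
    if len(values) == 4:
        dst, src0, src1, size = values
        return SigmoidInstruction(opcode, dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(values)}")
```

For example, `parse_instruction("active.sigmoid 500 100 64")` yields a target address of 500, a data address of 100, a read-in amount of 64, and no parameter table address.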

It should be understood that those skilled in the art can set the operation code of the S-shaped growth curve function activation instruction, the position of the operation code and the operation field in the instruction format according to needs, and this disclosure does not limit this.

It should be noted that although the above-mentioned embodiment is taken as an example to introduce the S-shaped growth curve function activation instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, taking "performing an activation operation with the S-shaped growth curve function activation instruction processing device" as an exemplary application scenario, so as to facilitate understanding of the processing flow of the S-shaped growth curve function activation instruction processing device. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as a limitation on the embodiments of the present disclosure.

FIG. 3-3 shows a schematic diagram of an application scenario of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3-3, the S-shaped growth curve function activation instruction processing device processes the S-shaped growth curve function activation instruction as follows:

As shown in FIG. 3-3, the control module 7-11 compiles the acquired S-shaped growth curve function activation instruction 1 to obtain the compiled S-shaped growth curve function activation instruction 1 (for example, S-shaped growth curve function activation instruction 1 is active.sigmoid50050010064), and parses the compiled instruction to obtain the operation code and operation domain of S-shaped growth curve function activation instruction 1. The operation code of S-shaped growth curve function activation instruction 1 is active.sigmoid, the target address is 500, the address of the data to be calculated is 100, and the read-in amount is 64. The control module 7-11 acquires the data to be calculated, with a data amount of 64 (the read-in amount), from address 100. If the activation operation is to be performed according to the S-shaped growth curve activation function parameter table, the control module 7-11 also needs to obtain the S-shaped growth curve activation function parameter table (see the description above for the specific implementation process).

The operation module 7-12 performs the S-type growth curve function activation operation on the data to be calculated according to the S-type growth curve activation function parameter table, obtains the operation result, and stores the operation result in the target address 500.

In this way, the S-shaped growth curve function activation instruction processing device can efficiently and quickly process the S-shaped growth curve function activation instruction, and the S-shaped growth curve function activation operation has high processing efficiency and fast processing speed.
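The flow of this application example can be mimicked in software as follows. This is a sketch under stated assumptions: memory is modelled as a flat Python list, the read-in amount is interpreted as a number of elements, the instruction text is written with space-separated operands (target address 500, data address 100, read-in amount 64, as stated in the example), and the sigmoid is computed directly rather than through a parameter table. The function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def control_module(instruction, memory):
    """Parse the instruction and fetch the data to be calculated."""
    opcode, dst, src0, size = instruction.split()
    dst, src0, size = int(dst), int(src0), int(size)
    data = memory[src0:src0 + size]  # read `size` elements starting at src0
    return opcode, dst, data


def operation_module(opcode, dst, data, memory):
    """Apply the activation and store the result at the target address."""
    if opcode != "active.sigmoid":
        raise ValueError(f"unsupported operation code: {opcode}")
    result = [sigmoid(x) for x in data]
    memory[dst:dst + len(result)] = result  # same-length slice, size unchanged
    return result


# Usage: 64 inputs at address 100, results written to address 500.
memory = [0.0] * 1024
memory[100:164] = [i / 8.0 - 4.0 for i in range(64)]
opcode, dst, data = control_module("active.sigmoid 500 100 64", memory)
operation_module(opcode, dst, data, memory)
```

After the call, `memory[500:564]` holds the 64 activated values and the rest of the memory is untouched.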

FIG. 3-4 shows a flowchart of an S-shaped growth curve function activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, the method is applied to the above S-shaped growth curve function activation instruction processing device and includes steps S51-7 and S52-7. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as steps S51-7 and S52-7 below.

In step S51-7, the control module is used to compile the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction. The operation code indicates that the activation operation performed on the data by the S-shaped growth curve function activation instruction is the S-shaped growth curve function activation operation, and the operation domain includes the address of the data to be calculated and the target address.

In step S52-7, an S-shaped growth curve function activation operation is performed on the data to be operated by the operation module to obtain the operation result, and the operation result is stored in the target address.

In a possible implementation manner, the method may further include:

Obtain the S-shaped growth curve activation function parameter table according to the operation code and / or operation domain;

Wherein, using the operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result includes:

Performing the S-shaped growth curve function activation operation on the data to be calculated according to the S-shaped growth curve activation function parameter table to obtain the operation result,

wherein the S-shaped growth curve activation function parameter table includes an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.

In a possible implementation manner, using the operation module to perform an S-type growth curve function activation operation on the operation data to obtain the operation result may include:

Use multiple activation operators to perform S-shaped growth curve function activation calculation on the data to be calculated.

Wherein, using the operation module to perform the S-type growth curve function activation operation on the operation data to obtain the operation result may include: performing the S-type growth curve function activation operation using multiple activation operators in the main operation sub-module to obtain the operation result.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Obtaining the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and obtaining the data to be calculated according to the read-in amount.

In a possible implementation manner, the control module is used to parse the obtained S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, which may include:

Storing the compiled S-shaped growth curve function activation instruction;

Analyze the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, arranged in execution order, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.
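The three steps above (storing the compiled instruction, parsing it, and keeping parsed instructions in an execution-ordered queue) can be sketched with a first-in, first-out queue. The class name and the text-based instruction representation are hypothetical; the disclosure does not prescribe a data structure.

```python
from collections import deque


class ControlModuleSketch:
    """Hypothetical model of instruction storage, parsing, and queueing."""

    def __init__(self):
        self.instruction_store = []       # compiled instructions awaiting parsing
        self.instruction_queue = deque()  # parsed instructions in execution order

    def store(self, compiled_instruction):
        """Store one compiled instruction (modelled here as text)."""
        self.instruction_store.append(compiled_instruction)

    def parse_all(self):
        """Parse stored instructions into (opcode, operation domain) pairs."""
        for text in self.instruction_store:
            opcode, *domain = text.split()
            self.instruction_queue.append((opcode, [int(v) for v in domain]))
        self.instruction_store.clear()

    def next_instruction(self):
        """Return the next instruction to execute, or None if the queue is empty."""
        return self.instruction_queue.popleft() if self.instruction_queue else None
```

A `deque` is used so that instructions leave the queue in the same order in which they were enqueued.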

In a possible implementation manner, compiling the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction may include:

The assembly file is generated according to the activation instruction of the S-shaped growth curve function, and the assembly file is translated into a binary file. Among them, the binary file is the compiled S-shaped growth curve function activation instruction.

It should be noted that although the above embodiment is taken as an example to introduce the processing method of the S-shaped growth curve function activation instruction as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The S-shaped growth curve function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of application, processes the S-shaped growth curve function activation instruction with high efficiency and speed, and performs the S-shaped growth curve function activation operation with high efficiency and speed.

The foregoing can be better understood based on the following clauses:

Clause C1, an S-shaped growth curve function activation instruction processing device, the device comprising:

A control module, configured to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, parse the compiled S-shaped growth curve function activation instruction to obtain an operation code and an operation domain of the S-shaped growth curve function activation instruction, and according to the operation code and the operation domain, obtain the data to be operated and the target address required to execute the S-shaped growth curve function activation instruction;

An operation module, configured to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the activation operation performed by the S-shaped growth curve function activation instruction on the data is an S-shaped growth curve function activation operation, and the operation domain includes the data address to be operated and the target address.

Clause C2, the device according to Clause C1,

The control module is further configured to obtain an S-shaped growth curve activation function parameter table according to the operation code and / or the operation domain;

The operation module is further configured to perform an S-type growth curve function activation operation on the data to be calculated according to the S-type growth curve activation function parameter table to obtain an operation result,

Wherein, the S-type growth curve activation function parameter table includes an S-type growth curve activation function activation table and an S-type growth curve activation function constant table.

Clause C3. The device according to Clause C1, the operation module includes:

A plurality of activation operators, used to perform the S-shaped growth curve function activation operation on the data to be operated.

Clause C4. The device according to Clause C3, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The main operation sub-module is configured to use the plurality of activation operators to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.

Clause C5. The device according to Clause C1, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Clause C6. The device according to Clause C1, the device further comprising:

The storage module is used for storing the data to be calculated.

Clause C7. The device according to Clause C1, the control module includes:

An instruction storage sub-module for storing the compiled S-shaped growth curve function activation instruction;

An instruction processing sub-module, which is used to analyze the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged in order of execution, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.

Clause C8. The device according to Clause C7, the control module, further comprising:

Clause C9, the device according to Clause C1,

The control module is also used to generate an assembly file according to the S-shaped growth curve function activation instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled S-shaped growth curve function activation instruction.

Clause C10. A machine learning computing device, the device comprising:

One or more S-shaped growth curve function activation instruction processing devices as described in any one of Clauses C1-C9, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and transfer the execution result to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of the S-shaped growth curve function activation instruction processing devices, the plurality of the S-shaped growth curve function activation instruction processing devices can be connected and transmit data through a specific structure;

Wherein, a plurality of the S-shaped growth curve function activation instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; a plurality of the S-shaped growth curve function activation instruction processing devices share the same control system or have their own control systems; a plurality of the S-shaped growth curve function activation instruction processing devices share memory or have their own memories; the interconnection method of the plurality of the S-shaped growth curve function activation instruction processing devices is any interconnection topology.

Clause C11. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause C10;

Clause C12. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause C10 or the combined processing device according to clause C11.

Clause C13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause C12.

Clause C14, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause C12;

The storage device is used for storing data;

Clause C15. An S-shaped growth curve function activation instruction processing method. The method is applied to an S-shaped growth curve function activation instruction processing device. The method includes:

Using a control module to compile the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, parse the compiled S-shaped growth curve function activation instruction to obtain an operation code and an operation domain of the S-shaped growth curve function activation instruction, and according to the operation code and the operation domain, obtain the data to be operated and the target address required to execute the S-shaped growth curve function activation instruction;

Using an operation module to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Clause C16. The method according to Clause C15, the method further comprising:

Obtaining an S-shaped growth curve activation function parameter table according to the operation code and / or the operation domain;

Wherein, using an operation module to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result includes:

Performing an S-type growth curve function activation operation on the data to be calculated according to the S-type growth curve activation function parameter table to obtain an operation result,

Clause C17. According to the method described in Clause C15, an operation module is used to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, including:

Clause C18. The method according to Clause C17, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

A plurality of activation operators in the main operation sub-module are used to perform S-shaped growth curve function activation operation to obtain an operation result.

Clause C19. The method according to Clause C15, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, obtaining the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction according to the operation code and the operation domain includes:

Clause C20. The method according to Clause C15, the method further comprising:

Store the data to be calculated.

Clause C21. According to the method described in Clause C15, analyze the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, including:

Storing the compiled S-shaped growth curve function activation instruction;

Parse the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.

Clause C22. The method according to Clause C21, the method further comprising:

Clause C23. According to the method described in Clause C15, use the control module to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, including:

Generate an assembly file according to the S-shaped growth curve function activation instruction, and translate the assembly file into a binary file,

FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 4-1, the device includes a control module 12-11 and an arithmetic module 12-12.

The control module 12-11 is used to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction, and parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction , And obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and operation domain.

The operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is the exponential function activation operation. The operation domain includes the data address and target address to be calculated.

The operation module 12-12 is used to perform an exponential function activation operation on the data to be operated, obtain the operation result, and store the operation result in the target address.
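The division of labour between the control module 12-11 and the operation module 12-12 can be sketched as below. The mnemonic of the exponential function activation instruction is not given in this passage, so the sketch starts from an already parsed operation domain (target address, data address, and an assumed element count). Memory is modelled as a flat Python list, and all function names are hypothetical.

```python
import math


def fetch_data(memory, src0, size):
    """Control-module step: read the data to be calculated from its address."""
    return memory[src0:src0 + size]


def exponential_activation(data):
    """Operation-module step: apply e**x to every element."""
    return [math.exp(x) for x in data]


def execute(memory, dst, src0, size):
    """Fetch, activate, and store the result at the target address."""
    result = exponential_activation(fetch_data(memory, src0, size))
    memory[dst:dst + len(result)] = result  # same-length slice, size unchanged
    return result
```

Note that `math.exp` raises `OverflowError` for very large inputs; how real hardware would saturate or signal such cases is outside the scope of this sketch.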

In this embodiment, the exponential function activation instruction obtained by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module first needs to compile the (uncompiled) exponential function activation instruction. After the compiled exponential function activation instruction is obtained, it can be parsed. The compiled exponential function activation instruction is a hardware instruction that can be directly executed by the hardware.

In this embodiment, the compiled exponential function activation instruction may be an instruction that corresponds only to the exponential function activation instruction. The compiled exponential function activation instruction may also include all activation operation instructions, so that when the device processes the exponential function activation instruction, it can perform the exponential function activation operation without distinguishing the activation function corresponding to the instruction, which simplifies instruction processing.

In this embodiment, the control module can obtain the data to be calculated from the data address to be calculated. The control module may determine the data required to perform the exponential function activation operation according to the operation code of the exponential function activation instruction. The control module can obtain instructions and data through the data input and output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of the instruction, or the field (usually represented by a code), that specifies the operation to be performed; it is the instruction identifier that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. The data required to execute the corresponding instruction include the data to be calculated and the corresponding operation method. An exponential function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.

It should be understood that, those skilled in the art can set the instruction format of the exponential function activation instruction, as well as the included operation codes and operation domains as required, which is not limited in this disclosure.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module may receive an exponential function activation instruction and control one or more arithmetic modules to perform an exponential function activation operation. When the device includes multiple control modules, the multiple control modules may respectively receive exponential function activation instructions and control the corresponding one or more arithmetic modules to perform exponential function activation operations.

An embodiment of the present disclosure provides an exponential function activation instruction processing device. The device includes a control module and an arithmetic module. The control module is used to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, and obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and operation domain. The arithmetic module is used to perform the exponential function activation operation on the data to be calculated to obtain the operation result, and store the operation result in the target address. The exponential function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of application, and it processes the exponential function activation instruction and performs the exponential function activation operation with high efficiency and speed.

In a possible implementation manner, the control module 12-11 may also be used to obtain an exponential activation function parameter table according to the operation code and / or operation domain.

The operation module 12-12 can also be used to perform the exponential function activation operation on the data to be calculated according to the exponential activation function parameter table, to obtain the operation result.

The exponential activation function parameter table may include an exponential activation function activation table and an exponential activation function constant table.

In this implementation manner, the address of the exponential activation function parameter table may be included in the operation domain, so that the control module obtains the exponential activation function parameter table from that address. Alternatively, when the control module determines according to the operation code that an exponential activation function parameter table is required to execute the exponential function activation instruction, it may directly obtain the exponential activation function parameter table corresponding to the exponential function activation instruction from a predetermined storage address of the parameter table. A person skilled in the art can set the acquisition method of the exponential activation function parameter table according to actual needs, which is not limited in this disclosure.
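The two acquisition paths described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `locate_parameter_table` and the constant `PREDETERMINED_TABLE_ADDR` are hypothetical, and the predetermined address value is chosen arbitrarily.

```python
from typing import Optional

# Hypothetical predetermined storage address of the parameter table.
# The disclosure does not specify a concrete value.
PREDETERMINED_TABLE_ADDR = 0x1000


def locate_parameter_table(opcode: str, table_addr: Optional[int]) -> Optional[int]:
    """Return the address from which the parameter table should be read.

    Path 1: the operation domain carries the table address explicitly.
    Path 2: the opcode alone implies that a table is needed, so the
            predetermined storage address is used.
    """
    if table_addr is not None:
        return table_addr
    if opcode == "active.exps":
        return PREDETERMINED_TABLE_ADDR
    return None  # this opcode needs no parameter table in this sketch
```

For example, `locate_parameter_table("active.exps", 900)` uses the address from the operation domain, while `locate_parameter_table("active.exps", None)` falls back to the predetermined address.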

In a possible implementation manner, the control module can also obtain an activation function corresponding to the exponential function activation instruction, so that the operation module can perform the exponential function activation operation on the data to be calculated according to the activation function and the corresponding operator.

FIG. 4-2a shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 4-2a, the operation module 12-12 may include multiple activation operators 12-120. The multiple activation operators 12-120 are used to perform the exponential function activation operation on the data to be calculated.

In this implementation, the operation module may also include a single activation operator. The number of activation operators can be set according to requirements such as the amount of data required for the exponential function activation operation and the processing speed and efficiency of the exponential function activation operation, which is not limited in the present disclosure.
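As a rough illustration of how the number of activation operators relates to the amount of data, the sketch below partitions the data to be calculated into one slice per operator and applies the activation to each slice. It assumes, for illustration only, that the exponential function activation is the elementwise e^x; the disclosure does not fix the formula here, and the names `split_for_operators` and `activate` are hypothetical. The slices are processed one after another in software, whereas hardware operators would work in parallel.

```python
import math
from typing import List, Sequence


def split_for_operators(data: Sequence[float], num_operators: int) -> List[Sequence[float]]:
    """Partition the data into at most num_operators contiguous slices."""
    if num_operators < 1:
        raise ValueError("at least one activation operator is required")
    # Ceiling division; max(1, ...) keeps the step valid for empty input.
    chunk = max(1, -(-len(data) // num_operators))
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def activate(data: Sequence[float], num_operators: int) -> List[float]:
    """Apply the assumed activation e^x slice by slice, preserving order."""
    result: List[float] = []
    for part in split_for_operators(data, num_operators):
        result.extend(math.exp(x) for x in part)
    return result
```

With five values and two operators, the first slice holds three values and the second holds two.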

FIG. 4-2b shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 4-2b, the operation module 12-12 may include a master operation sub-module 12-121 and a plurality of slave operation sub-modules 12-122, and the master operation sub-module 12-121 includes multiple activation operators 12-120 (not shown in the figure).

The master operation sub-module 12-121 is used to perform the exponential function activation operation on the data to be calculated by using the plurality of activation operators to obtain an operation result, and to store the operation result in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. The control module 12-11 is also used to obtain the read-in amount and to obtain a plurality of data to be calculated according to the read-in amount. The data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be ensured, and the device can execute the exponential function activation instruction.
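The effect of the read-in amount can be illustrated with a short sketch. It models memory as a flat Python list and uses a hypothetical helper `read_operands`; both are assumptions made for illustration. The returned data amount never exceeds the read-in amount, and it is smaller when fewer elements are available.

```python
from typing import List, Sequence


def read_operands(memory: Sequence[float], src0: int, read_in_amount: int) -> List[float]:
    """Read at most read_in_amount elements starting at address src0."""
    if src0 < 0 or read_in_amount < 0:
        raise ValueError("address and read-in amount must be non-negative")
    # Slicing stops at the end of memory, so fewer elements may be returned.
    return list(memory[src0:src0 + read_in_amount])
```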

In a possible implementation manner, as shown in FIGS. 4-2a and 4-2b, the device may further include a storage module 12-13. The storage module 12-13 is used to store the data to be calculated. The storage module 12-13 can also be used to store the exponential activation function parameter table.

In this implementation manner, the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed temporary storage cache. The data to be calculated and the parameter table of the exponential activation function can be stored in the cache and / or register of the storage module as needed, and the disclosure does not limit this.

In a possible implementation, as shown in FIGS. 4-2a and 4-2b, the control module 12-11 can also be used to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.
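The two-stage compilation (instruction to assembly text, assembly text to binary) can be pictured with the toy assembler below. The disclosure does not specify an encoding, so everything here is hypothetical: the opcode number `0x01`, the 13-byte layout of one opcode byte followed by three 32-bit little-endian fields, and the function names.

```python
import struct
from typing import Tuple

# Hypothetical numeric code for the active.exps opcode (not from the disclosure).
OPCODES = {"active.exps": 0x01}
# Hypothetical layout: 1-byte opcode, then dst, src0, size as 32-bit unsigned ints.
LAYOUT = "<BIII"


def to_assembly(dst: int, src0: int, size: int) -> str:
    """Stage 1: emit one line of assembly text for the instruction."""
    return f"active.exps {dst} {src0} {size}"


def assemble(assembly_line: str) -> bytes:
    """Stage 2: translate the assembly line into its binary form."""
    mnemonic, *operands = assembly_line.split()
    dst, src0, size = (int(tok) for tok in operands)
    return struct.pack(LAYOUT, OPCODES[mnemonic], dst, src0, size)


def disassemble(binary: bytes) -> Tuple[str, int, int, int]:
    """Inverse of assemble, used here only to check the round trip."""
    code, dst, src0, size = struct.unpack(LAYOUT, binary)
    mnemonic = next(name for name, value in OPCODES.items() if value == code)
    return mnemonic, dst, src0, size
```

A fixed-width layout is used only because it keeps the round trip easy to verify; a real instruction encoding would be defined by the hardware.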

In a possible implementation manner, the instruction format of the exponential function activation instruction may be:

active.exps dst src0 size

Here, active.exps is the operation code of the exponential function activation instruction, and dst, src0, and size make up its operation domain: dst is the target address, src0 is the data address to be calculated, and size is the read-in amount.

In a possible implementation manner, the instruction format of the exponential function activation instruction may also be:

active.exps dst src0 src1 size

Here, active.exps is the operation code of the exponential function activation instruction, and dst, src0, src1, and size make up its operation domain: dst is the target address, src0 is the data address to be calculated, src1 is the address of the exponential activation function parameter table, and size is the read-in amount.
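A small parser makes the two instruction formats concrete. The sketch below is an illustration under stated assumptions: the class `ExpActivationInstruction` and the function `parse_instruction` are hypothetical names, operands are taken to be decimal integers, and both spaces and commas are accepted as separators.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpActivationInstruction:
    opcode: str
    dst: int                    # target address
    src0: int                   # data address to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter table address, if present


def parse_instruction(text: str) -> ExpActivationInstruction:
    """Split an instruction into its operation code and operation domain."""
    tokens = text.replace(",", " ").split()
    opcode, operands = tokens[0], [int(tok) for tok in tokens[1:]]
    if opcode != "active.exps":
        raise ValueError(f"unexpected opcode: {opcode}")
    if len(operands) == 3:   # active.exps dst src0 size
        dst, src0, size = operands
        return ExpActivationInstruction(opcode, dst, src0, size)
    if len(operands) == 4:   # active.exps dst src0 src1 size
        dst, src0, src1, size = operands
        return ExpActivationInstruction(opcode, dst, src0, size, src1)
    raise ValueError("expected 3 or 4 operands")
```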

It should be understood that those skilled in the art can set the positions of the operation code and the operation domain in the instruction format of the exponential function activation instruction as needed, which is not limited in this disclosure.

It should be noted that, although the above embodiment is taken as an example to introduce the exponential function activation instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure, taking "performing an activation operation using the exponential function activation instruction processing device" as an exemplary application scenario, to make the processing flow of the device easier to understand. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be considered a limitation of those embodiments.

FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 4-3, the exponential function activation instruction processing device processes the exponential function activation instruction as follows:

As shown in FIG. 4-3, the control module 12-11 compiles the obtained exponential function activation instruction 1 to obtain the compiled exponential function activation instruction 1 (for example, the exponential function activation instruction 1 is active.exps 500, 100, 64), and parses the compiled exponential function activation instruction 1 to obtain the operation code and operation domain of the exponential function activation instruction 1. The operation code of the exponential function activation instruction 1 is active.exps, the target address is 500, the data address to be calculated is 100, and the read-in amount is 64. The control module 12-11 acquires data to be calculated with a data amount of 64 (the read-in amount) from the data address to be calculated, 100. Assuming that the activation operation needs to be performed according to the exponential activation function parameter table, the control module 12-11 also needs to obtain the exponential activation function parameter table (see the above description for the specific implementation process).

The operation module 12-12 performs the exponential function activation operation on the data to be calculated according to the exponential activation function parameter table, obtains the operation result, and stores the operation result in the target address 500.
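The flow of this application example can be simulated in a few lines. The sketch rests on several assumptions that are not part of the disclosure: memory is a flat Python list indexed by address, the activation is taken to be the elementwise e^x, and `math.exp` stands in for the parameter-table lookup. The function name `execute_active_exps` is hypothetical.

```python
import math
from typing import List


def execute_active_exps(memory: List[float], dst: int, src0: int, size: int) -> List[float]:
    """Simulate active.exps dst, src0, size on a flat memory list."""
    operands = memory[src0:src0 + size]            # control module: fetch the data
    results = [math.exp(x) for x in operands]      # operation module: activate
    memory[dst:dst + len(results)] = results       # store at the target address
    return results


# Instruction 1 from the example: active.exps 500, 100, 64
memory = [0.0] * 1024
memory[100:164] = [i / 64 for i in range(64)]      # illustrative input values
execute_active_exps(memory, dst=500, src0=100, size=64)
```

After the call, addresses 500 through 563 hold the 64 results and the rest of the memory is unchanged.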

In this way, the exponential function activation instruction processing device can process the exponential function activation instruction efficiently and quickly, and realize the efficient and rapid processing of the exponential function activation operation.

FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, this method is applied to the above exponential function activation instruction processing device. The method includes steps S51-12 and S52-12. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-12 and S52-12.

In step S51-12, the control module is used to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and the operation domain of the exponential function activation instruction, and obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain. The operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is the exponential function activation operation, and the operation domain includes the data address to be calculated and the target address.

In step S52-12, an arithmetic module is used to perform an exponential function activation operation on the operation data to obtain an operation result, and the operation result is stored in the target address.

In a possible implementation manner, the method may further include:

Obtaining the exponential activation function parameter table according to the operation code and / or operation domain;

Wherein using the operation module to perform the exponential function activation operation on the data to be calculated to obtain the operation result may include:

Performing the exponential function activation operation on the data to be calculated according to the exponential activation function parameter table to obtain the operation result.

In a possible implementation manner, using the operation module to perform the exponential function activation operation on the data to be calculated to obtain the operation result may include: performing the exponential function activation operation on the data to be calculated using a plurality of activation operators.

Where the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of activation operators, this may include: using the plurality of activation operators in the master operation sub-module to perform the exponential function activation operation on the data to be calculated to obtain the operation result, and storing the operation result in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Wherein, obtaining the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and acquiring the data to be calculated according to the read-in amount.

In a possible implementation manner, parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction may include:

Storing the compiled exponential function activation instruction;

Parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.
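The three steps above (store, parse, queue) can be sketched as a small class. This is an illustration only: the class name `ControlModuleSketch` is hypothetical, the compiled instruction is represented as a text line rather than a binary, and a first-in, first-out queue is assumed to model the execution order.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Parsed = Tuple[str, Tuple[int, ...]]  # (operation code, operation domain)


class ControlModuleSketch:
    def __init__(self) -> None:
        self.stored: List[str] = []            # instruction storage
        self.queue: Deque[Parsed] = deque()    # instruction queue, execution order

    def store(self, compiled: str) -> None:
        """Step 1: store the compiled instruction."""
        self.stored.append(compiled)

    @staticmethod
    def parse(compiled: str) -> Parsed:
        """Step 2: split into operation code and operation domain."""
        opcode, *fields = compiled.replace(",", " ").split()
        return opcode, tuple(int(f) for f in fields)

    def enqueue_all(self) -> None:
        """Step 3: place parsed instructions in the queue in arrival order."""
        while self.stored:
            self.queue.append(self.parse(self.stored.pop(0)))

    def next_instruction(self) -> Optional[Parsed]:
        """Return the next instruction to execute, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```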

In a possible implementation, using the control module to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction may include: generating an assembly file according to the exponential function activation instruction, and translating the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.

It should be noted that although the above embodiment is taken as an example to introduce the processing method of the exponential function activation instruction as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The exponential function activation instruction processing method provided by the embodiment of the present disclosure has a wide range of application, and it processes the exponential function activation instruction and performs the exponential function activation operation with high efficiency and speed.

The foregoing can be better understood based on the following clauses:

Clause D1, an exponential function activation instruction processing device, the device comprising:

The control module is used to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, and parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, And obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain;

The operation module is used to perform an exponential function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is an exponential function activation operation, and the operation domain includes the data address to be operated and the target address.

Clause D2, the device according to Clause D1,

The control module is further configured to obtain an exponential activation function parameter table according to the operation code and / or the operation domain;

The calculation module is further configured to perform an exponential function activation operation on the data to be calculated according to the exponential activation function parameter table to obtain an operation result,

Wherein, the exponential activation function parameter table includes an exponential activation function activation table and an exponential activation function constant table.

Clause D3. The device according to Clause D1, the calculation module includes:

A plurality of activation calculators are used to perform exponential function activation calculation on the data to be calculated.

Clause D4. The device according to Clause D3, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The main operation sub-module is used to perform an exponential function activation operation on the data to be operated by using the plurality of activation operators to obtain an operation result, and store the operation result in the target address.

Clause D5. The device according to Clause D1, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Clause D6. The device according to Clause D1, the device further comprising:

The storage module is used for storing the data to be calculated.

Clause D7. The device according to Clause D1, the control module includes:

An instruction storage sub-module for storing the compiled exponential function activation instruction;

An instruction processing sub-module, which is used to parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction;

A queue storage sub-module, which is used to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.

Clause D8. The device according to Clause D7, the control module, further comprising:

Clause D9, the device according to Clause D1,

The control module is also used to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled exponential function activation instruction.

Clause D10. A machine learning computing device, the device comprising:

One or more exponential function activation instruction processing devices as described in any one of Clauses D1 to D9, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through the I / O interface;

When the machine learning operation device includes a plurality of the exponential function activation instruction processing devices, the plurality of the exponential function activation instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of the exponential function activation instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of the exponential function activation instruction processing devices share the same control system or have their own control systems; the plurality of the exponential function activation instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of the exponential function activation instruction processing devices is any interconnection topology.

Clause D11. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause D10;

Clause D12. A machine learning chip, the machine learning chip includes:

The machine learning computing device described in Clause D10 or the combined processing device described in Clause D11.

Clause D13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause D12.

Clause D14, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause D12;

The storage device is used for storing data;

Clause D15. An exponential function activation instruction processing method. The method is applied to an exponential function activation instruction processing device. The method includes:

Using a control module to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, and acquiring the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain;

Use an arithmetic module to perform an exponential function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Clause D16. The method according to Clause D15, the method further comprising:

Obtaining an exponential activation function parameter table according to the operation code and / or the operation domain;

Wherein, the operation module performs an exponential function activation operation on the data to be operated to obtain an operation result, including:

Performing an exponential function activation operation on the data to be calculated according to the exponential activation function parameter table to obtain an operation result,

Clause D17. According to the method described in Clause D15, an arithmetic module is used to perform an exponential function activation operation on the data to be calculated to obtain an operation result, including:

Multiple activation operators are used to perform an exponential function activation operation on the data to be operated.

Clause D18. The method according to Clause D17, the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule includes the plurality of activation operators,

Use a plurality of activation operators in the main operation sub-module to perform exponential function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.

Clause D19. The method according to Clause D15, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, obtaining the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain includes:

Clause D20. The method according to Clause D15, the method further comprising:

Store the data to be calculated.

Clause D21. The method according to Clause D15, wherein parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction includes:

Storing the compiled exponential function activation instruction;

Parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.

Clause D22. The method according to Clause D21, the method further comprising:

Clause D23. The method according to Clause D15, wherein using the control module to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction includes:

Generating an assembly file according to the exponential function activation instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled exponential function activation instruction.

With the extensive use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. A selection operation is a processing operation that selects data according to selection conditions. Programming languages are varied, and in the related art there is at this stage no selection instruction that can be widely applied across programming languages. To implement a selection operation in a given language environment, technicians need to customize multiple instructions for that programming language environment, which makes the selection operation inefficient and slow. The present disclosure provides a selection instruction processing method, device, computer equipment, and storage medium, in which a selection operation can be implemented with only one instruction, which can significantly improve the efficiency and speed of the selection operation.

FIG. 5-1 shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-1, the device includes a control module 1-11 and an arithmetic module 1-12.

The control module 1-11 is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain the multiple index data, the multiple data to be calculated, and the target address required to execute the selection instruction according to the operation code and operation domain. The operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation domain includes the data address to be calculated, the index data address, and the target address.

The operation module 1-12 is used to sequentially determine whether a plurality of index data meets the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions in the target address.

In this embodiment, the selection instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the selection instruction (uncompiled). After the compiled selection instruction is obtained, the compiled selection instruction can be parsed. The compiled selection instructions are hardware instructions that can be directly executed by the hardware. The control module may obtain a plurality of data to be calculated and a plurality of index data from the data address to be calculated and the index data address, respectively. The control module may obtain a selection instruction, a plurality of data to be calculated, and a plurality of index data through a data input / output unit. The data input / output unit may be one or more data I / O interfaces or I / O pins.

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), specified in a computer program to indicate the operation to perform. It is the instruction sequence number that informs the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, such as a corresponding address. The data required to execute the corresponding instruction includes parameter data, the data to be calculated, the corresponding operation method, and so on. A selection instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be calculated, the index data address, and the target address.

It should be understood that, those skilled in the art can set the instruction format of the selection instruction, as well as the included operation codes and operation fields as required, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive a selection instruction and control one or more arithmetic modules to perform a selection operation. When the device includes multiple control modules, the multiple control modules may respectively receive selection instructions and control the corresponding one or more arithmetic modules to perform selection operations.

The selection instruction processing device provided by the embodiment of the present disclosure includes a control module and an arithmetic module. The control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain and execute the selection instruction according to the operation code and operation domain The required multiple index data, multiple data to be calculated and the target address. The operation module is used to sequentially determine whether a plurality of index data satisfy the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data satisfying the storage conditions into the target address. The selection instruction processing device provided by the embodiment of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the selection instruction, and high processing efficiency and fast processing speed for performing the selection operation.

In a possible implementation, the storage condition may be that the index data is not zero.

In this implementation manner, when the index data is not zero, the data to be operated corresponding to the non-zero index data is sequentially stored to the target address. The storage condition may also be that the index data is not a specified value, and the specified value may be a value such as 1. Those skilled in the art can set the storage conditions according to actual needs, and this disclosure does not limit this.

In this implementation manner, the storage condition or the index data can be set so that the required data among the data to be calculated is stored to the target address. For example, to select the data to be calculated according to different selection needs, different storage conditions or different index data may be set.
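The effect of choosing a different storage condition can be illustrated with a minimal software sketch. It is not the disclosed hardware; the function name `select` and the representation of the storage condition as a Python predicate are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence

def select(data: Sequence[int], index: Sequence[int],
           condition: Callable[[int], bool] = lambda i: i != 0) -> List[int]:
    """Keep data[k], in order, whenever index[k] satisfies the storage condition.

    The default condition models "index data is not zero".
    """
    return [d for d, i in zip(data, index) if condition(i)]

data = [1, 5, 6, 7, 3]
index = [1, 8, 0, 6, 9]

# Storage condition: index data is not 0.
nonzero = select(data, index)
# Storage condition: index data is not the specified value 1.
not_one = select(data, index, condition=lambda i: i != 1)
```

With the same data and index data, the first call keeps 1, 5, 7, 3 and the second keeps 5, 6, 7, 3, which shows how the storage condition alone changes the selection.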

FIG. 5-2a shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 5-2a, the operation module 1-12 may include a plurality of comparators 1-120, which are used to sequentially determine whether the plurality of index data meet the storage condition.

For example, taking the storage condition as "index data is not 0" as an example, the comparator may sequentially compare the index data with 0 to determine whether the index data meets the storage condition. Furthermore, the operation module can store the data to be operated corresponding to the index data other than 0 into the target address in sequence. The number of comparators can be set according to the amount of data to be compared, the processing speed, efficiency, and other requirements of the comparison, which is not limited in the present disclosure.

FIG. 5-2b shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 5-2b, the operation module 1-12 may include a master operation sub-module 1-121 and a plurality of slave operation sub-modules 1-122. The master operation sub-module 1-121 may include a plurality of comparators 1-120 (not shown in the figure).

The master operation sub-module 1-121 is used to sequentially determine, using the multiple comparators, whether the multiple index data satisfy the storage condition, determine the data to be calculated corresponding to the index data satisfying the storage condition, and sequentially store the data to be calculated corresponding to the index data satisfying the storage condition in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Among them, the control module 1-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount. Among them, the data amount of the multiple data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the multiple index data.

In a possible implementation manner, when the read-in amount is not included in the operation domain, a plurality of data to be calculated may be obtained according to a preset default read-in amount. The acquired data amount of the plurality of data to be calculated is less than or equal to the default read-in amount, and the default read-in amount is less than or equal to the data amount of multiple index data.

In this implementation, the amount of data to be calculated, the amount of index data, and the amount of data that can be stored at the target address can be the same, and can all be equal to the read-in amount or the default read-in amount. In this way, the operation module can sequentially store the data to be calculated corresponding to the index data that meets the storage condition in the target address, which avoids problems such as insufficient target addresses and wasted target addresses.
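The relationship between the read-in amount, the default read-in amount, and the amount of index data can be sketched as follows. This is only an illustration under stated assumptions: the name `read_operands`, the value of `DEFAULT_READ_IN`, and the choice to raise an error when the constraint is violated are all hypothetical and are not specified by the disclosure.

```python
from typing import List, Optional, Sequence, Tuple

DEFAULT_READ_IN = 4  # hypothetical preset default read-in amount

def read_operands(data_region: Sequence[int], index_region: Sequence[int],
                  read_in: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Fetch the data to be calculated and the index data for one selection.

    If the operation domain carries no read-in amount (read_in is None),
    the default read-in amount is used instead.
    """
    size = DEFAULT_READ_IN if read_in is None else read_in
    # The read-in amount must not exceed the amount of index data.
    if size > len(index_region):
        raise ValueError("read-in amount exceeds the amount of index data")
    # Slicing guarantees that at most `size` data items are fetched.
    return list(data_region[:size]), list(index_region[:size])

with_size = read_operands([1, 5, 6, 7, 3], [1, 8, 0, 6, 9], read_in=5)
with_default = read_operands([1, 5, 6, 7, 3], [1, 8, 0, 6, 9])
```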

In a possible implementation manner, as shown in FIGS. 5-2a and 5-2b, the device may further include a storage module 1-13. The storage module 1-13 is used to store the multiple index data, the multiple data to be calculated, and the storage condition.

In this implementation, the storage module may include memory, such as one or more of a cache and a register. The cache may include a high-speed temporary cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store data to be calculated and pooled cores, and the register can be used to store scalar data in the data to be calculated.

In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the foregoing neuron random access memory, can be used to store neuron data in the data to be calculated, and the neuron data can include neuron vector data.

In a possible implementation manner, the control module 1-11 may also be used to generate an assembly file according to the selection instruction and translate the assembly file into a binary file, where the binary file is the compiled selection instruction.

In a possible implementation, the instruction format of the selection instruction may be:

select dst src0 src1 size

Where, select is the opcode of the selection instruction, and dst, src0, src1, and size are the operation fields of the selection instruction. dst is the target address, src0 is the data address to be calculated, src1 is the index data address, and size is the read-in amount.

It should be understood that those skilled in the art can set the operation code of the selection instruction, the position of the operation code and the operation field in the instruction format according to need, and the disclosure does not limit this.
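As a rough illustration of how the operation code and the operation fields dst, src0, src1, and size might be separated, the following sketch parses a textual form of the instruction. The textual encoding, the class name `SelectInstruction`, and the function name `parse_select` are assumptions for illustration; the disclosure leaves the actual instruction format open.

```python
from typing import NamedTuple

class SelectInstruction(NamedTuple):
    opcode: str  # operation code, expected to be "select"
    dst: int     # target address
    src0: int    # data address to be calculated
    src1: int    # index data address
    size: int    # read-in amount

def parse_select(text: str) -> SelectInstruction:
    """Split a textual selection instruction into opcode and operation fields."""
    tokens = text.replace(",", " ").split()
    if len(tokens) != 5 or tokens[0] != "select":
        raise ValueError(f"not a selection instruction: {text!r}")
    dst, src0, src1, size = (int(t) for t in tokens[1:])
    return SelectInstruction(tokens[0], dst, src0, src1, size)

instruction = parse_select("select 500 100 200 5")
```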

In a possible implementation, the device may be provided in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and a Neural-network Processing Unit (NPU).

It should be noted that although the above-mentioned embodiment is taken as an example to introduce the selection instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "data selection using a selection instruction processing device" as an exemplary application scenario, so as to facilitate understanding of a flow of a selection instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 5-3 shows a schematic diagram of an application scenario of a selection instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 5-3, the selection instruction processing device processes the selection instruction as follows:

The control module 1-11 compiles the obtained selection instruction 1 to obtain the compiled selection instruction 1, and parses the compiled selection instruction 1 (for example, selection instruction 1 is select 500 100 200 5) to obtain the operation code and operation domain of selection instruction 1. The operation code of selection instruction 1 is select, the target address is 500, the data address to be calculated is 100, the index data address is 200, and the read-in amount is 5. According to the read-in amount of 5, the control module 1-11 obtains a plurality of data to be calculated from the data address to be calculated 100 and a plurality of index data from the index data address 200.

Assume that the obtained plurality of data to be calculated include 1, 5, 6, 7, and 3. The multiple index data includes 1, 8, 0, 6, and 9. The storage condition is that the index data is not 0.

The operation module 1-12 sequentially determines whether each of the multiple index data "1, 8, 0, 6, 9" is 0 and, for each index data that is not 0, stores the corresponding data to be calculated into the target address 500 in order. Since only the third index data is 0, the third data to be calculated "6" is skipped, and "1, 5, 7, 3" are sequentially stored in the target address 500. For the working process of the above modules, please refer to the relevant description above.
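The application example can be reproduced with a small software model. The sketch assumes a word-addressed memory modeled as a Python dictionary, with consecutive elements at consecutive addresses; the function name `execute_select` and this memory layout are illustrative assumptions, not part of the disclosure.

```python
from typing import Dict

def execute_select(memory: Dict[int, int], dst: int, src0: int,
                   src1: int, size: int) -> int:
    """Model of the selection operation with the condition "index data is not 0".

    Returns the number of data items stored starting at the target address.
    """
    data = [memory[src0 + k] for k in range(size)]    # data to be calculated
    index = [memory[src1 + k] for k in range(size)]   # index data
    out = dst
    for d, i in zip(data, index):
        if i != 0:            # storage condition
            memory[out] = d   # store in order at the next free target slot
            out += 1
    return out - dst

memory: Dict[int, int] = {}
for k, value in enumerate([1, 5, 6, 7, 3]):
    memory[100 + k] = value   # data to be calculated at address 100
for k, value in enumerate([1, 8, 0, 6, 9]):
    memory[200 + k] = value   # index data at address 200

stored = execute_select(memory, dst=500, src0=100, src1=200, size=5)
result = [memory[500 + k] for k in range(stored)]
```

Under these assumptions, `result` holds 1, 5, 7, 3, matching the example above.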

In this way, the selection instruction processing device can process the selection instruction efficiently and quickly, and the selection operation has high processing efficiency and fast processing speed.

FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 5-4, the method is applied to the above selection instruction processing apparatus, and the method includes step S51-1 and step S52-1. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following step S51-1 and step S52-1.

In step S51-1, the control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain, according to the operation code and operation domain, the multiple index data, the multiple data to be calculated, and the target address required to execute the selection instruction. The operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation domain includes the data address to be calculated, the index data address, and the target address.

In step S52-1, the operation module is used to sequentially determine whether the plurality of index data meet the storage condition and, when index data meets the storage condition, sequentially store the data to be calculated corresponding to the index data that meets the storage condition in the target address.

In a possible implementation manner, the operation module may include multiple comparators.

Using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when index data meets the storage condition, sequentially store the data to be calculated corresponding to the index data that meets the storage condition into the target address may include:

Using the multiple comparators in the operation module to sequentially determine whether the multiple index data meet the storage condition.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes the multiple comparators. In this case, the method may include:

Using the multiple comparators to sequentially determine whether the multiple index data meet the storage condition, determining the data to be calculated corresponding to the index data that meets the storage condition, and sequentially storing the data to be calculated corresponding to the index data that meets the storage condition in the target address.

In a possible implementation manner, the operation domain may further include a read-in amount or a storage address of the read-in amount. Step S51-1 may include: acquiring the read-in amount, and acquiring a plurality of data to be calculated according to the read-in amount. Among them, the data amount of the multiple data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the multiple index data.

In a possible implementation manner, the method may further include: using the storage module of the device to store the multiple index data, the multiple data to be calculated, and the storage condition,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the plurality of index data, the plurality of data to be calculated, and the storage condition, and the cache includes at least one neuron cache (NRAM);

The register is used to store scalar data among the plurality of index data, the plurality of data to be calculated, and the storage condition;

The neuron cache is used to store neuron data among the plurality of index data, the plurality of data to be calculated, and the storage condition, and the neuron data includes neuron vector data.

In a possible implementation manner, step S51-1 may include:

Storing the compiled selection instruction;

Parsing the compiled selection instruction to obtain the operation code and operation domain of the selection instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, which are arranged according to the execution order, and the plurality of instructions to be executed include the compiled selection instruction.

In a possible implementation manner, the method may further include:

When it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction that precedes the first to-be-executed instruction, caching the first to-be-executed instruction and, after determining that the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,
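The caching behavior described here can be sketched in software as follows. The sketch is hypothetical throughout: the class name `DependencyQueue`, the tuple representation of an instruction, and in particular the association test (the later instruction reads from the address the earlier one writes to) are assumptions chosen only so the example runs; the text above does not define the association criterion.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Instr = Tuple[str, int, int]  # hypothetical form: (name, target address, source address)

class DependencyQueue:
    """Hold back an instruction while an earlier associated one is unfinished."""

    def __init__(self, associated: Callable[[Instr, Instr], bool]) -> None:
        self.associated = associated        # predicate(later, earlier) -> bool
        self.in_flight: List[Instr] = []    # issued but not yet completed
        self.cached: Deque[Instr] = deque() # waiting on an earlier instruction
        self.issued: List[Instr] = []       # issue order, for inspection

    def submit(self, instr: Instr) -> None:
        pending = self.in_flight + list(self.cached)
        if any(self.associated(instr, earlier) for earlier in pending):
            self.cached.append(instr)       # cache the later instruction
        else:
            self.in_flight.append(instr)
            self.issued.append(instr)

    def complete(self, instr: Instr) -> None:
        self.in_flight.remove(instr)
        waiting, self.cached = self.cached, deque()
        for later in waiting:               # re-check cached instructions in order
            self.submit(later)

zeroth = ("select_0", 500, 100)
first = ("select_1", 600, 500)   # reads what select_0 writes (assumed association)

queue = DependencyQueue(lambda later, earlier: later[2] == earlier[1])
queue.submit(zeroth)
queue.submit(first)
issued_before = list(queue.issued)  # only the zeroth instruction has been issued
queue.complete(zeroth)
issued_after = list(queue.issued)   # the first instruction is issued afterwards
```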

In a possible implementation manner, using the control module to compile the obtained selection instruction to obtain the compiled selection instruction may include:

Generating an assembly file according to the selection instruction, and translating the assembly file into a binary file, where the binary file is the compiled selection instruction.

In a possible implementation, the storage condition may include that the index data is not zero.

It should be noted that although the above-mentioned embodiment is used as an example to introduce the selection instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The selection instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the selection instruction, and high processing efficiency and fast processing speed for performing the selection operation.

The foregoing can be better understood based on the following clauses:

Clause E1. A selection instruction processing device, the device comprising:

The control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain, according to the operation code and the operation domain, the multiple index data, the multiple data to be calculated, and the target address required to execute the selection instruction;

The operation module is used to sequentially determine whether the plurality of index data meet the storage condition and, when the index data meets the storage condition, sequentially store the data to be calculated corresponding to the index data meeting the storage condition into the target address,

Wherein, the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation field includes a data address to be operated, an index data address, and the target address.

Clause E2. The device according to Clause E1, the operation module includes:

A plurality of comparators are used to sequentially determine whether the plurality of index data satisfy the storage condition.

Clause E3. The device according to Clause E2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,

The master operation sub-module is used to sequentially determine, using the plurality of comparators, whether the plurality of index data satisfy the storage condition, determine the data to be calculated corresponding to the index data satisfying the storage condition, and sequentially store the data to be calculated corresponding to the index data satisfying the storage condition in the target address.

Clause E4. The device according to Clause E1, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, the control module is also used to obtain the read-in amount, and obtain the plurality of data to be calculated according to the read-in amount,

Wherein, the data amount of the plurality of data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the plurality of index data.

Clause E5. The device according to Clause E1, the device further comprising:

A storage module, configured to store the plurality of index data, the plurality of data to be calculated, and the storage conditions,

Wherein, the storage module includes at least one of a register and a cache,

Clause E6. The device according to Clause E1, the control module includes:

An instruction storage sub-module for storing the compiled selection instruction;

An instruction processing sub-module, which is used to parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include a plurality of compiled selection instructions.

Clause E7. The device according to Clause E6, the control module, further comprising:

Clause E8. The device according to Clause E1,

The control module is also used to generate an assembly file according to the selection instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled selection instruction.

Clause E9. The device according to any one of Clause E1 to Clause E8, the storage condition includes that the index data is not zero.

Clause E10. A machine learning computing device, characterized in that the device includes:

One or more selection instruction processing devices as described in any one of Clause E1 to Clause E9, used to obtain data to be calculated and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the selection instruction processing devices, the plurality of the selection instruction processing devices can be connected and transmit data through a specific structure;

Wherein, a plurality of the selection instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; a plurality of the selection instruction processing devices share the same control system or have their own respective control systems; a plurality of the selection instruction processing devices share memory or have their own memories; the interconnection mode of the plurality of selection instruction processing devices is any interconnection topology.

Clause E11. A combined processing device, characterized in that the combined processing device includes:

The machine learning computing device as described in Clause E10, a universal interconnection interface, and other processing devices;

Clause E12. A machine learning chip, characterized in that the machine learning chip includes:

The machine learning arithmetic device according to clause E10 or the combined processing device according to clause E11.

Clause E13. An electronic device, characterized in that the electronic device includes:

The machine learning chip as described in Clause E12.

Clause E14. A board card, characterized in that the board card includes: a storage device, an interface device and a control device, and the machine learning chip as described in Clause E12;

The storage device is used for storing data;

Clause E15. A method for processing a selection instruction, characterized in that the method is applied to a selection instruction processing device, the device includes a control module and an operation module, and the method includes:

Using the control module to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain, according to the operation code and the operation domain, the multiple index data, the multiple data to be calculated, and the target address required to execute the selection instruction;

The operation module is used to sequentially determine whether the plurality of index data meet the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions into the target address,

Clause E16. The method according to Clause E15, characterized in that the arithmetic module includes the plurality of comparators,

Wherein, using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when the index data meets the storage condition, sequentially store the data to be calculated corresponding to the index data that meets the storage condition into the target address includes:

A plurality of comparators in the arithmetic module are used to sequentially determine whether the plurality of index data satisfy the storage condition.

Clause E17. The method according to Clause E16, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of comparators,

Wherein, using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when the index data meets the storage condition, sequentially store the data to be calculated corresponding to the index data that meets the storage condition into the target address includes:

Using the plurality of comparators to sequentially determine whether the plurality of index data satisfy the storage condition, determining the data to be calculated corresponding to the index data that satisfies the storage condition, and sequentially storing the data to be calculated corresponding to the index data that satisfies the storage condition in the target address.

Clause E18. The method according to Clause E15, wherein the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, acquiring a plurality of index data, a plurality of data to be calculated and a target address required to execute the selection instruction according to the operation code and the operation domain includes:

Acquiring the read-in amount, and acquiring the plurality of data to be calculated according to the read-in amount,

Clause E19. The method according to Clause E15, characterized in that the method further comprises:

Using the storage module of the device to store the plurality of index data, the plurality of data to be calculated, and the storage conditions,

Wherein, the storage module includes at least one of a register and a cache,

Clause E20. The method according to Clause E15, characterized in that the compiled selection instruction is parsed to obtain the operation code and operation domain of the selection instruction, including:

Storing the compiled selection instruction;

Parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include a plurality of compiled selection instructions.

Clause E21. The method according to Clause E20, characterized in that the method further comprises:

Clause E22. The method according to Clause E15, wherein the control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, including:

Generate an assembly file according to the selection instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled selection instruction.

Clause E23. The method according to any one of Clause E15 to Clause E22, wherein the storage condition includes that the index data is not zero.

With the extensive use of neural network algorithms and the continuous improvement of the computing capability of computer hardware, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is at this stage no counting instruction that can be widely applied across programming languages. To implement counting statistics in a given language environment, technicians therefore need to combine multiple instructions of that programming language environment, or create a corresponding counting instruction for each programming language environment, which results in low efficiency and slow speed of counting statistics. The present disclosure provides a counting instruction processing method, device, computer equipment, and storage medium, with which counting statistics can be realized with only one instruction, which can significantly improve the efficiency and speed of counting statistics.

FIG. 6-1 shows a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 6-1, the device includes a control module 2-11 and an operation module 2-12.

The control module 2-11 is used to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and operation domain, the multiple data to be calculated and the target address required to execute the counting instruction. The operation code is used to indicate that the operation performed by the counting instruction on the data is a counting statistical operation, and the operation domain includes the data address to be calculated and the target address.

The operation module 2-12 is used to determine the number of data to be operated satisfying the counting condition among the plurality of data to be operated, and store the number of data in the target address.

In this embodiment, the counting instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware, so the control module first needs to compile the uncompiled counting instruction. After the compiled counting instruction is obtained, it can be parsed. The compiled counting instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain the multiple data to be calculated from the data address to be calculated.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, where all data required to execute the corresponding instruction include parameter data, data to be calculated, the corresponding operation method, and so on. A counting instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be calculated and the target address.

It should be understood that, those skilled in the art can set the instruction format of the counting instruction, as well as the included operation codes and operation fields as required, and the disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive counting instructions and control one or more arithmetic modules to perform counting statistics. When the device includes multiple control modules, the multiple control modules can respectively receive counting instructions and control the corresponding one or more arithmetic modules to perform counting statistics.

The counting instruction processing device provided by the embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, the multiple data to be calculated and the target address required to execute the counting instruction. The operation module is used to determine the number of data to be calculated that satisfy the counting condition among the plurality of data to be calculated, and store that number in the target address. The counting instruction processing device provided by the embodiments of the present disclosure has a wide application range, and it processes counting instructions and performs counting statistics with high efficiency and speed.

In a possible implementation, the counting condition may be that the data to be calculated is not zero.

In this implementation manner, the data number of the data to be operated which is not 0 among the plurality of data to be operated is counted, and the data number is stored to the target address. The counting condition may also be that the data to be calculated is not a specified value, and the specified value may be a value such as 1. Those skilled in the art can set the counting conditions according to actual needs, and this disclosure does not limit this.
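The counting condition can be illustrated with a minimal software sketch. The function name `count_matching`, the sample values, and the representation of the counting condition as a Python predicate are assumptions for illustration only.

```python
from typing import Callable, Sequence

def count_matching(data: Sequence[int],
                   condition: Callable[[int], bool] = lambda x: x != 0) -> int:
    """Count the data to be calculated that satisfy the counting condition.

    The default condition models "data to be calculated is not zero".
    """
    return sum(1 for x in data if condition(x))

data = [1, 0, 6, 0, 3]  # hypothetical sample values

nonzero_count = count_matching(data)                    # condition: not 0
not_one_count = count_matching(data, lambda x: x != 1)  # condition: not the specified value 1
```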

FIG. 6-2a shows a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-2a, the arithmetic module 2-12 may include multiple counters 2-120. The multiple counters 2-120 are used to count the data to be calculated that satisfy the counting condition, so as to obtain the number of data satisfying the counting condition among the plurality of data to be calculated.

In this implementation, the arithmetic module may alternatively include a single counter. The number of counters can be set according to the amount of data to be calculated and the required processing speed and efficiency of the counting statistical operation, which is not limited in the present disclosure.
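One way to picture several counters cooperating is to give each counter a contiguous share of the data and add up the partial counts. The disclosure does not state how work is divided among the counters, so the chunking scheme and the names `split_among_counters` and `count_with_counters` below are assumptions made for this sketch.

```python
def split_among_counters(data, num_counters):
    """Split `data` into at most `num_counters` contiguous chunks (assumed scheme)."""
    chunk = max(1, -(-len(data) // num_counters))  # ceiling division
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]

def count_with_counters(data, num_counters, condition=lambda x: x != 0):
    """Each simulated counter counts its own chunk; the partial counts are summed."""
    partial = [sum(1 for x in chunk if condition(x))
               for chunk in split_among_counters(data, num_counters)]
    return sum(partial), partial

total, per_counter = count_with_counters([3, 0, 0, 8, 1, 0, 4], num_counters=3)
```

With `num_counters=1` the same function models the single-counter variant.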

FIG. 6-2b shows a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-2b, the operation module 2-12 may include a main operation submodule 2-121 and a plurality of slave operation submodules 2-122, and the main operation submodule 2-121 includes multiple counters 2-120 (not shown in the figure).

The main operation submodule 2-121 is used to count, using the multiple counters 2-120, the data to be calculated that satisfy the counting condition, so as to determine the number of data satisfying the counting condition among the plurality of data to be calculated, and to store that number in the target address.

In a possible implementation, the operation domain may further include a read-in amount or a storage address of the read-in amount. The control module 2-11 is also used to obtain the read-in amount and to obtain the plurality of data to be calculated according to the read-in amount, where the data amount of the plurality of data to be calculated is less than or equal to the read-in amount.

In a possible implementation manner, when the read-in amount is not included in the operation domain, a plurality of data to be calculated may be obtained according to a preset default read-in amount. The data amount of the acquired multiple data to be calculated is less than or equal to the default read-in amount.

In the above manner, the data amount of a plurality of data to be calculated can be limited, to ensure the accuracy of the counted data number, and also to ensure that the device can run the counting instruction.
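The three ways of determining the read-in amount (an immediate value, a storage address, or a preset default) can be sketched as follows. This is an illustration under stated assumptions: memory is modeled as a Python dictionary from address to value, one item per address, and `DEFAULT_READ_IN`, `resolve_read_in`, and `fetch_operands` are hypothetical names; the disclosure does not give a default value.

```python
DEFAULT_READ_IN = 4  # hypothetical default; the disclosure names no value

def resolve_read_in(memory, size=None, size_addr=None):
    """Prefer an immediate read-in amount, then a stored one, then the default."""
    if size is not None:
        return size
    if size_addr is not None:
        return memory[size_addr]
    return DEFAULT_READ_IN

def fetch_operands(memory, src_addr, read_in):
    """Read at most `read_in` consecutive items starting at `src_addr`."""
    return [memory[a] for a in range(src_addr, src_addr + read_in) if a in memory]

memory = {100: 1, 101: 0, 102: 7, 200: 2}
explicit = fetch_operands(memory, 100, resolve_read_in(memory, size=3))
indirect = fetch_operands(memory, 100, resolve_read_in(memory, size_addr=200))
fallback = fetch_operands(memory, 100, resolve_read_in(memory))
```

In every case the number of items fetched is less than or equal to the read-in amount, which is the property the text relies on.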

In a possible implementation, as shown in FIGS. 6-2a and 6-2b, the device may further include a storage module 2-13. The storage module 2-13 is used to store the plurality of data to be calculated and the counting condition.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The register may be used to store scalar data among the plurality of data to be calculated and the counting condition.

In a possible implementation, the neuron cache may be used to store neuron data among the plurality of data to be calculated and the counting condition, where the neuron data includes neuron vector data.

In a possible implementation manner, the control module 2-11 may also be used to generate an assembly file according to the counting instruction and translate the assembly file into a binary file, where the binary file is a compiled counting instruction.

In a possible implementation, the instruction format of the counting instruction may be:

count dst src0 size

Here, count is the operation code of the counting instruction, and dst, src0, and size are the operation domain of the counting instruction: dst is the target address, src0 is the address of the data to be calculated, and size is the read-in amount.

It should be understood that those skilled in the art can set the operation code of the counting instruction, the position of the operation code and the operation field in the instruction format according to need, and the disclosure does not limit this.
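The split of an instruction into operation code and operation domain can be illustrated with a small parser for the textual form `count dst src0 size`. This is a sketch only: the device described above parses a compiled (binary) instruction, whereas this example works on text, and `CountInstruction` and `parse_count` are hypothetical names. Treating `size` as optional is an assumption based on the default read-in amount described earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CountInstruction:
    opcode: str
    dst: int              # target address
    src0: int             # address of the data to be calculated
    size: Optional[int]   # read-in amount; None means "use the default"

def parse_count(text: str) -> CountInstruction:
    """Split a textual count instruction into opcode and operation domain."""
    fields = text.split()
    if not fields or fields[0] != "count":
        raise ValueError(f"not a count instruction: {text!r}")
    if len(fields) not in (3, 4):
        raise ValueError("expected: count dst src0 [size]")
    dst, src0 = int(fields[1]), int(fields[2])
    size = int(fields[3]) if len(fields) == 4 else None
    return CountInstruction("count", dst, src0, size)

parsed = parse_count("count 500 100 5")
```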

It should be noted that, although the counting instruction processing apparatus is described above by taking the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses "counting statistics using a count instruction processing device" as an exemplary application scenario to give an application example according to an embodiment of the present disclosure to facilitate understanding of the flow of the count instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 6-3 shows a schematic diagram of an application scenario of a counting instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 6-3, the counting instruction processing device processes the counting instruction as follows:

The control module 2-11 compiles the obtained counting instruction 1 (for example, counting instruction 1 is count 500 100 5) to obtain the compiled counting instruction 1, and parses the compiled counting instruction 1 to obtain the operation code and operation domain of counting instruction 1. The operation code of counting instruction 1 is count, the target address is 500, the address of the data to be calculated is 100, and the read-in amount is 5. The control module 2-11 acquires, starting from address 100, a plurality of data to be calculated whose data amount equals the read-in amount of 5.

Assume that the obtained plurality of data to be calculated include 1, 5, 0, 7, and 3. The counting condition is that the data to be calculated is not 0.

The operation module 2-12 counts the number of data to be operated which is not 0 among the plurality of data to be operated, and stores the number of data in the target address 500. For the working process of the above modules, please refer to the relevant description above.
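The application example above can be traced in a few lines of Python. This is a sketch under assumptions: memory is modeled as a dictionary with one item of data per address, and `run_count` is a hypothetical helper, not part of the disclosure.

```python
def run_count(memory, dst, src0, size):
    """Read `size` items from `src0`, count the non-zero ones, store the count at `dst`."""
    operands = [memory[src0 + i] for i in range(size)]
    memory[dst] = sum(1 for x in operands if x != 0)
    return memory[dst]

# Data to be calculated 1, 5, 0, 7, 3 placed at addresses 100..104 (assumed layout).
memory = {100: 1, 101: 5, 102: 0, 103: 7, 104: 3}
result = run_count(memory, dst=500, src0=100, size=5)
```

Of the five items only the third is zero, so the value stored at address 500 in this model is 4.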

In this way, the counting instruction processing device can process the counting instruction efficiently and quickly, and the processing efficiency for counting statistics is high and the processing speed is fast.

FIG. 6-4 shows a flowchart of a counting instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-2 and S52-2. As shown in FIG. 6-4, the method is applied to the above-mentioned counting instruction processing device, and the method includes step S51-2 and step S52-2.

In step S51-2, the control module is used to compile the obtained counting instruction to obtain a compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, a plurality of data to be calculated and a target address required for executing the counting instruction. The operation code is used to indicate that the operation performed by the counting instruction on the data is a counting statistical operation, and the operation domain includes the address of the data to be calculated and the target address.

In step S52-2, the operation module is used to determine the number of data to be operated satisfying the counting condition among the plurality of data to be operated, and the number of data is stored in the target address.

In a possible implementation manner, determining the number of data to be calculated among the plurality of data to be calculated that meets the counting condition may include:

Multiple counters in the operation module are used to count the data to be calculated that satisfy the counting condition, so as to obtain the number of data satisfying the counting condition among the plurality of data to be calculated.

In a possible implementation, the operation module includes a main operation submodule and multiple slave operation submodules, and the main operation submodule includes multiple counters,

Among them, determining the number of data to be calculated among the plurality of data to be calculated satisfying the counting condition, and storing the number of data in the target address includes:

Use the multiple counters in the main operation submodule to count the data to be calculated that satisfy the counting condition, determine the number of data satisfying the counting condition among the plurality of data to be calculated, and store that number in the target address.

In a possible implementation manner, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Among them, obtaining a plurality of data to be calculated and a target address required to execute the counting instruction according to the operation code and the operation domain may include:

Obtain the read-in amount, and obtain multiple data to be calculated according to the read-in amount,

Among them, the data amount of the multiple data to be calculated is less than or equal to the read-in amount.

In a possible implementation manner, the method may further include: using the storage module of the device to store a plurality of data to be calculated and counting conditions,

Wherein, the storage module includes at least one of a register and a cache,

Cache, used to store multiple data to be calculated and counting conditions, the cache includes at least one neuron cache NRAM;

Registers, used to store multiple data to be calculated and scalar data in counting conditions;

The neuron cache is used to store a plurality of data to be operated and neuron data in counting conditions. The neuron data includes neuron vector data.

In a possible implementation manner, analyzing the compiled counting instruction to obtain the operation code and operation domain of the counting instruction may include:

Store the compiled count instruction;

Analyze the count instruction after compilation to obtain the opcode and operation domain of the count instruction;

Store an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the compiled counting instruction.

In a possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of instructions to be executed is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and executing the first to-be-executed instruction after the zeroth to-be-executed instruction has finished executing.
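The effect of holding back an associated instruction can be illustrated with a toy issue schedule. The text here does not define what "associated" means, so this sketch assumes a read-after-write relationship (the later instruction reads an address the earlier one writes). The per-instruction `latency` values, the one-instruction-per-cycle issue rule, and the function name `schedule` are likewise hypothetical.

```python
def schedule(instructions):
    """Return {name: (start, finish)} for in-order issue with dependency stalls."""
    finish_of_writer = {}   # address -> cycle at which its latest writer finishes
    times = {}
    cycle = 0               # earliest cycle at which the next instruction may issue
    for ins in instructions:
        # An associated instruction is held until the instruction it depends on completes.
        start = max([finish_of_writer.get(a, 0) for a in ins["srcs"]] + [cycle])
        finish = start + ins["latency"]
        finish_of_writer[ins["dst"]] = finish
        times[ins["name"]] = (start, finish)
        cycle = start + 1
    return times

program = [
    {"name": "i0", "dst": 500, "srcs": [100], "latency": 3},
    {"name": "i1", "dst": 600, "srcs": [500], "latency": 2},  # reads what i0 writes
    {"name": "i2", "dst": 700, "srcs": [200], "latency": 1},  # independent of i0 and i1
]
times = schedule(program)
```

In this model `i1` cannot start until `i0` finishes at cycle 3, while the independent `i2` issues one cycle after `i1` without waiting for it to complete.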

In a possible implementation manner, compiling the obtained counting instruction to obtain the compiled counting instruction may include:

Generate assembly files according to counting instructions, and translate the assembly files into binary files,

Among them, the binary file is the count instruction after compilation.

In a possible implementation, the counting condition may include that the data to be calculated is not zero.

It should be noted that although the counting instruction processing method is described above using the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The counting instruction processing method provided by the embodiments of the present disclosure has a wide application range, high processing efficiency and fast processing speed for counting instructions, and high processing efficiency and fast processing speed for counting statistics.

The foregoing can be better understood based on the following clauses:

Clause F1, a counting instruction processing device, the device comprising:

The control module is used to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, a plurality of data to be calculated and a target address required to execute the counting instruction;

The operation module is used to determine the number of data of the plurality of data to be operated that satisfy the counting condition and store the number of data in the target address,

Wherein, the operation code is used to indicate that the operation performed by the counting instruction on the data is a counting statistical operation, and the operation domain includes the data address to be operated and the target address.

Clause F2. The device according to Clause F1, the calculation module includes:

A plurality of counters, used to count the data to be operated that satisfy the counting condition, so as to obtain the number of data satisfying the counting condition among the plurality of data to be operated.

Clause F3. The device according to Clause F2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of counters,

The main operation sub-module is configured to use the plurality of counters to count the data to be operated that satisfy the counting condition, determine the number of data satisfying the counting condition among the plurality of data to be operated, and store the number of data in the target address.

Clause F4. The device according to Clause F1, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, the data amount of the plurality of data to be calculated is less than or equal to the read-in amount.

Clause F5. The device according to Clause F1, the device further comprising:

A storage module for storing the plurality of data to be calculated and the counting condition,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the plurality of data to be calculated and the counting condition, and the cache includes at least one neuron cache NRAM;

The register is used to store the plurality of data to be operated and the scalar data in the counting condition;

The neuron cache is used to store the plurality of data to be operated and neuron data in the counting condition, and the neuron data includes neuron vector data.

Clause F6. The device according to Clause F1, the control module includes:

An instruction storage sub-module for storing the compiled counting instruction;

The first instruction processing sub-module is used to parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction;

The queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled counting instructions.

Clause F7. The device according to Clause F6, the control module, further comprising:

Clause F8. The device according to Clause F1,

The control module is also used to generate an assembly file according to the counting instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled counting instruction.

Clause F9. The device according to any one of Clause F1 to Clause F8, the counting condition includes that the data to be calculated is not zero.

Clause F10. A machine learning computing device, the device comprising:

One or more counting instruction processing devices as described in any one of clauses F1 to F9, used to obtain data to be operated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the counting instruction processing devices, the plurality of counting instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the counting instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; a plurality of the counting instruction processing devices share the same control system or have their own respective control systems; a plurality of the counting instruction processing devices share memory or have their own memories; the interconnection method of the plurality of counting instruction processing devices is any interconnection topology.

Clause F11. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause F10;

Clause F12. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause F10 or the combined processing device according to clause F11.

Clause F13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause F12.

Clause F14, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause F12;

The storage device is used for storing data;

Clause F15. A counting instruction processing method. The method is applied to a counting instruction processing apparatus. The apparatus includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained counting instruction to obtain a compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, a plurality of data to be calculated and the target address required to execute the counting instruction;

Using an operation module to determine the number of data of the plurality of data to be calculated that satisfy the counting condition and storing the number of data in the target address,

Clause F16. According to the method described in Clause F15, determining the number of data to be calculated among the plurality of data to be calculated satisfying the counting condition includes:

Using a plurality of counters in the operation module to count the data to be calculated that satisfy the counting condition, so as to obtain the number of data satisfying the counting condition among the plurality of data to be calculated.

Clause F17. The method according to Clause F16, the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule includes the plurality of counters,

Wherein, determining the number of data to be calculated among the plurality of data to be calculated satisfying the counting condition, and storing the number of data in the target address includes:

Using the plurality of counters in the master operation submodule to count the data to be calculated that satisfy the counting condition, so as to determine the number of data satisfying the counting condition among the plurality of data to be calculated, and storing the number of data in the target address.

Clause F18. The method according to Clause F15, the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, acquiring a plurality of data to be calculated and a target address required to execute the counting instruction according to the operation code and the operation domain includes:

Clause F19. The method according to Clause F15, the method further comprising:

Using the storage module of the device to store the plurality of data to be calculated and the counting condition,

Wherein, the storage module includes at least one of a register and a cache,

Clause F20. The method according to Clause F15, wherein parsing the compiled counting instruction to obtain the operation code and operation domain of the counting instruction includes:

Store the compiled counting instruction;

Parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled counting instruction.

Clause F21. The method according to Clause F20, the method further comprising:

Clause F22. The method according to Clause F15, wherein compiling the obtained counting instruction to obtain the compiled counting instruction includes:

Generate an assembly file according to the counting instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled counting instruction.

Clause F23. The method according to any one of Clause F15 to Clause F22, the counting condition includes that the data to be calculated is not zero.

Clause F24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clauses F15 to F23.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware capability, the types and number of data operations involved in practical applications keep increasing. Programming languages are varied, and in the related art there is at this stage no fully connected instruction that can be widely applied across programming languages. To implement a fully connected operation in a given language environment, technicians must therefore customize multiple instructions for that environment, which makes fully connected operations inefficient and slow. The present disclosure provides a fully connected instruction processing method, device, computer equipment, and storage medium, with which a fully connected operation can be realized using only one instruction, significantly improving the efficiency and speed of fully connected operations.

FIG. 7-1 shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 7-1, the device includes a control module 3-11 and an arithmetic module 3-12.

The control module 3-11 is used to compile the obtained fully connected instruction to obtain the compiled fully connected instruction, parse the compiled fully connected instruction to obtain the operation code and operation domain of the fully connected instruction, and obtain, according to the operation code and the operation domain, the first data, the second data, the weight data, and the target address required to execute the fully connected instruction. The operation code is used to indicate that the operation performed by the fully connected instruction on the data is a fully connected operation, and the operation domain includes a first data address, a second data address, a weight data address, and a target address.

The operation module 3-12 is configured to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result in the target address.

In this embodiment, the fully-connected instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the fully-connected instruction (uncompiled). After the compiled fully-connected instruction is obtained, the compiled fully-connected instruction can be parsed. The compiled fully-connected instructions are hardware instructions that can be directly executed by the hardware. The control module may obtain the first data, the second data, and the weight data from the first data address, the second data address, and the weight data address, respectively.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) that specifies the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including the first data, the second data, the weight data, the corresponding calculation methods, and so on. A fully connected instruction must include an operation code and an operation domain, where the operation domain includes at least a first data address, a second data address, a weight data address, and a target address.

It should be understood that those skilled in the art can set the instruction format of the fully connected instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive a fully connected instruction and control one or more arithmetic modules to perform fully connected operations. When the device includes a plurality of control modules, the plurality of control modules can respectively receive a fully connected instruction and control the corresponding one or more arithmetic modules to perform a fully connected operation.

The fully connected instruction processing device provided by the embodiments of the present disclosure includes a control module and an arithmetic module. The control module is used to compile the obtained fully connected instruction to obtain the compiled fully connected instruction, parse the compiled fully connected instruction to obtain the operation code and operation domain of the fully connected instruction, and obtain, according to the operation code and the operation domain, the first data, second data, weight data, and target address required to execute the fully connected instruction. The arithmetic module is used to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and to store the operation result in the target address. The fully connected instruction processing device provided by the embodiments of the present disclosure has a wide application range and processes fully connected instructions, and hence fully connected operations, with high efficiency and speed.

FIG. 7-2a shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 7-2a, the arithmetic module 3-12 may include multiple multipliers 3-120 and multiple adders 3-120'. The multiple multipliers 3-120 are used to perform the multiplication operations in the fully connected operation. The multiple adders 3-120' are used to perform the addition operations in the fully connected operation.

In this implementation manner, the operation module may further include one adder and one multiplier, or one adder and multiple multipliers, or multiple adders and one multiplier. The number of multipliers and adders can be set according to the data amount of the fully-connected operation, the processing speed, and the processing efficiency of the fully-connected operation, which is not limited in the present disclosure.
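How multiplications and additions combine in a fully connected operation can be sketched as follows. The disclosure does not spell out the arithmetic or the roles of the first and second data, so this example assumes the common form out[i] = sum_j Weight[i][j] * A[j] + B[i], with the first data as the input vector and the second data as a bias vector. That form, and the function name `fully_connected`, are assumptions made for illustration only.

```python
def fully_connected(weight, first, second):
    """Assumed form: out[i] = sum_j weight[i][j] * first[j] + second[i]."""
    out = []
    for row, bias in zip(weight, second):
        acc = 0
        for w, x in zip(row, first):
            acc += w * x            # one multiplication followed by one addition
        out.append(acc + bias)      # final addition of the second data (assumed bias)
    return out

weight = [[1, 2], [3, 4], [5, 6]]   # 3 rows (height) by 2 columns (width)
result = fully_connected(weight, first=[10, 1], second=[0, 1, 2])
```

Each inner-loop step corresponds to work that a multiplier and an adder could perform; with more multipliers and adders, several such steps could proceed at once.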

FIG. 7-2b shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 7-2b, the operation module 3-12 may include a main operation sub-module 3-121 and multiple slave operation sub-modules 3-122, and the slave operation sub-module 3-122 includes multiple multipliers 3-120 and multiple adders 3-120' (not shown).

The control module 3-11 is also used to parse the compiled fully connected instruction to obtain a plurality of operation instructions, and send the first data, the second data, and the plurality of operation instructions to the main operation submodule 3-121.

The main operation sub-module 3-121 is used to perform pre-processing on the first data and the second data, and to transmit data and operation instructions with a plurality of slave operation sub-modules 3-122.

The slave operation sub-module 3-122 is used to perform intermediate operations in parallel, using the multiple multipliers 3-120 and multiple adders 3-120', according to the data and operation instructions transmitted from the main operation sub-module 3-121, so as to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main operation sub-module 3-121.

The main operation submodule 3-121 is also used to perform subsequent processing on a plurality of intermediate results, obtain operation results, and store the operation results in the target address.
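The division of labor between the main and slave operation sub-modules can be sketched with a thread pool standing in for the slaves. Everything specific here is an assumption: the pre-processing is modeled as dealing weight rows out in contiguous blocks, the subsequent processing as concatenating the intermediate results and adding the second data, and the fully connected form as Weight x A + B. The names `slave_compute` and `master_compute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def slave_compute(rows, first):
    """Slave: multiply-accumulate its share of weight rows against `first`."""
    return [sum(w * x for w, x in zip(row, first)) for row in rows]

def master_compute(weight, first, second, num_slaves, memory, dst):
    # Pre-processing (assumed): split the weight rows into contiguous blocks.
    step = max(1, -(-len(weight) // num_slaves))  # ceiling division
    blocks = [weight[i:i + step] for i in range(0, len(weight), step)]
    # Slaves work in parallel; pool.map returns results in block order.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediate = list(pool.map(lambda b: slave_compute(b, first), blocks))
    # Subsequent processing (assumed): concatenate, add the second data, store.
    flat = [v for block in intermediate for v in block]
    memory[dst] = [v + b for v, b in zip(flat, second)]
    return memory[dst]

memory = {}
stored = master_compute([[1, 2], [3, 4], [5, 6]], [10, 1], [0, 1, 2],
                        num_slaves=2, memory=memory, dst=900)
```

Splitting by rows keeps each slave's work independent, so the main sub-module only needs to gather the intermediate results in order.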

In a possible implementation manner, the operation domain may further include a weight width and a weight height. Among them, the control module 3-11 is also used to obtain the weight data from the weight data address according to the weight height and weight width.

In this implementation, the weight width and weight height may limit the amount of weight data acquired. The weight width and weight height included in the operation domain may be specific numerical values, and may also be storage addresses storing the weight width and weight height. When the specific value of the weight width and weight height is directly included in the operation domain, the specific value is determined as the corresponding weight width and weight height. When the storage address of the weight width and the weight height is included in the operation domain, the weight height and the weight width can be obtained from the storage addresses of the weight width and the weight height, respectively.

In a possible implementation manner, when the weight height and / or weight width are not included in the operation domain, weight data may be obtained according to the preset default weight height and default weight width.

In this way, the amount of weight data can be limited to ensure the accuracy of the calculation results.
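The three cases described above (a value given directly, a value given by storage address, and a missing value replaced by a default) can be sketched as follows. This is a minimal illustration, assuming memory is modelled as a Python dictionary from address to value and that the weight data is laid out contiguously in row-major order; the tagged-tuple encoding, the default values, and the names `resolve_field` and `fetch_weights` are all hypothetical.

```python
DEFAULT_WEIGHT_WIDTH = 1   # hypothetical preset default
DEFAULT_WEIGHT_HEIGHT = 1  # hypothetical preset default


def resolve_field(field, memory, default):
    """Resolve a field that is absent (None), ("imm", value) or ("addr", address)."""
    if field is None:
        return default              # not in the operation domain: use the default
    kind, payload = field
    if kind == "imm":
        return payload              # specific value carried in the operation domain
    if kind == "addr":
        return memory[payload]      # storage address holding the value
    raise ValueError(f"unknown field kind: {kind!r}")


def fetch_weights(memory, weight_addr, width_field=None, height_field=None):
    """Read height x width weights starting at weight_addr (row-major, assumed)."""
    width = resolve_field(width_field, memory, DEFAULT_WEIGHT_WIDTH)
    height = resolve_field(height_field, memory, DEFAULT_WEIGHT_HEIGHT)
    flat = [memory[weight_addr + i] for i in range(width * height)]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


memory = {300 + i: i for i in range(6)}
memory[40] = 3  # the weight width is stored at address 40
weights = fetch_weights(memory, 300, ("addr", 40), ("imm", 2))
```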

In a possible implementation manner, as shown in FIGS. 7-2a and 7-2b, the device may further include a storage module 3-13. The storage module 3-13 is used to store the first data, the second data and the weight data.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store the first data, the second data, and the weight data, and the register is used to store the scalar data in the first data, the second data, and the weight data.

In a possible implementation manner, the neuron cache is used to store the neuron data in the first data, the second data, and the weight data, and the neuron data includes neuron vector data.

In a possible implementation manner, the control module 3-11 can also be used to generate an assembly file according to the fully-connected instruction and translate the assembly file into a binary file, where the binary file is a fully-connected instruction after compilation.

In a possible implementation, the command format of the fully connected command may be:

mlp dst A B Weight Weight.width Weight.height

Among them, mlp is the opcode of the fully connected instruction. dst, A, B, Weight, Weight.width, Weight.height are the operation domains of fully connected instructions. Among them, dst is the target address, A is the first data address, B is the second data address, Weight is the weight data address, Weight.width is the weight width, Weight.height is the weight height.

It should be understood that those skilled in the art can set the operation code of the fully connected instruction, the position of the operation code and the operation field in the instruction format as required, and this disclosure does not limit this.
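As an illustration of how such an instruction might be split into its operation code and operation domain, the following Python sketch parses a textual form of the instruction. It assumes a whitespace-separated text representation with integer operands, and it assumes the weight width and weight height are either both present or both absent; neither assumption comes from the disclosure. The names `MlpInstruction` and `parse_mlp` are hypothetical.

```python
from typing import NamedTuple, Optional


class MlpInstruction(NamedTuple):
    dst: int                      # target address
    a: int                        # first data address
    b: int                        # second data address
    weight: int                   # weight data address
    weight_width: Optional[int]   # None when absent from the operation domain
    weight_height: Optional[int]  # None when absent from the operation domain


def parse_mlp(text):
    """Split an instruction into its opcode and operation-domain fields."""
    tokens = text.split()
    if not tokens or tokens[0] != "mlp":
        raise ValueError("not a fully connected (mlp) instruction")
    operands = [int(token) for token in tokens[1:]]
    if len(operands) not in (4, 6):
        raise ValueError("expected 4 or 6 operands after the opcode")
    dst, a, b, weight = operands[:4]
    width, height = operands[4:] if len(operands) == 6 else (None, None)
    return MlpInstruction(dst, a, b, weight, width, height)


instruction = parse_mlp("mlp 500 100 200 300 5 9")
```

The operand values used here are the ones from the application example in this disclosure (target address 500, weight width 5, weight height 9).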

It should be noted that although the above-mentioned embodiment is taken as an example to introduce the fully-connected instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, taking "using a fully-connected instruction processing device to perform a fully-connected operation" as an exemplary application scenario, so as to facilitate understanding of the flow of the fully-connected instruction processing device. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered a limitation on the embodiments of the present disclosure.

FIG. 7-3 shows a schematic diagram of an application scenario of a fully connected instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 7-3, the fully connected instruction processing apparatus processes the fully connected instruction as follows:

The control module 3-11 compiles the obtained fully-connected instruction 1 to obtain the compiled fully-connected instruction 1 (for example, the fully-connected instruction 1 is mlp 500 100 200 300 5 9), and parses the compiled fully-connected instruction 1 to obtain its operation code and operation domain. The operation code of the fully-connected instruction 1 is mlp, the target address is 500, the first data address is 100, the second data address is 200, the weight data address is 300, the weight width is 5, and the weight height is 9. The control module 3-11 acquires the first data from the first data address 100, the second data from the second data address 200, and the weight data with a weight width of 5 and a weight height of 9 from the weight data address 300.

The operation module 3-12 performs full connection operation on the first data and the second data according to the weight data to obtain an operation result, and stores the operation result in the target address 500. For the working process of the above modules, please refer to the relevant description above.

In this way, the fully connected instruction processing device can process the fully connected instruction efficiently and quickly, and the fully connected operation is performed with high efficiency and speed.

FIG. 7-4 shows a flowchart of a fully connected instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as the following step S51-3 and step S52-3. As shown in FIG. 7-4, the method is applied to the above-mentioned fully-connected instruction processing apparatus and includes step S51-3 and step S52-3.

In step S51-3, the control module is used to compile the obtained fully-connected instruction to obtain the compiled fully-connected instruction, parse the compiled fully-connected instruction to obtain the operation code and operation domain of the fully-connected instruction, and obtain the first data, the second data, the weight data and the target address required to execute the fully connected instruction according to the operation code and the operation domain. The operation code is used to indicate that the operation performed by the fully connected instruction on the data is a fully connected operation, and the operation domain includes a first data address, a second data address, a weight data address, and a target address.

In step S52-3, the operation module is used to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result in the target address.

In a possible implementation manner, performing a fully connected operation on the first data and the second data according to the weight data may include: using multiple multipliers in the operation module to perform the multiplication operation in the fully connected operation, and using multiple adders in the operation module to perform the addition operation in the fully connected operation.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the slave operation sub-module includes multiple multipliers and multiple adders,

Among them, the method may further include:

Use the control module to parse the compiled fully-connected instructions to obtain multiple operation instructions;

Among them, the first data and the second data are fully connected according to the weight data to obtain the operation result, and the operation result is stored in the target address, including:

Use the main operation sub-module to perform pre-processing on the first data and the second data, and to transmit data and operation instructions;

Based on multiple multipliers and multiple adders in the slave operation sub-module, performing intermediate operations in parallel according to the transmitted data and operation instructions to obtain multiple intermediate results;

Use the main operation sub-module to perform subsequent processing on multiple intermediate results to obtain the operation results, and store the operation results in the target address.

In a possible implementation manner, the operation domain may further include a weight width and a weight height. Wherein, obtaining the first data, the second data, the weight data and the target address required to execute the fully-connected instruction according to the operation code and the operation domain may include: obtaining the weight from the weight data address according to the weight height and weight width Value data.

In a possible implementation manner, the method may further include: using the storage module of the device to store the first data, the second data, and the weight data,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the first data, the second data, and the weight data. The cache includes at least one neuron cache NRAM;

The register is used to store the scalar data in the first data, the second data and the weight data;

The neuron cache is used to store the neuron data in the first data, the second data, and the weight data. The neuron data includes neuron vector data.

In a possible implementation manner, parsing the obtained fully connected instruction to obtain the operation code and operation domain of the fully connected instruction may include:

Store the compiled fully connected instructions;

Analyze the fully-connected instructions after compilation to obtain the operation codes and operation domains of the fully-connected instructions;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include a fully-connected instruction after compilation.
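The three steps above (storing the compiled instruction, parsing it, and keeping an instruction queue in execution order) can be sketched as a small Python class. This is an illustrative sketch only: for readability it assumes the compiled instruction is represented as whitespace-separated text with integer fields, whereas the disclosure describes the compiled instruction as a binary file. The class name `ControlModule` and its method names are hypothetical.

```python
from collections import deque


class ControlModule:
    """Toy model of the storing, parsing and queueing steps."""

    def __init__(self):
        self.instruction_store = []  # stores the compiled instructions
        self.queue = deque()         # instructions to be executed, in execution order

    def submit(self, compiled_instruction):
        """Store a compiled instruction and append it to the instruction queue."""
        self.instruction_store.append(compiled_instruction)
        self.queue.append(compiled_instruction)

    def issue_next(self):
        """Parse the oldest queued instruction into (opcode, operation domain)."""
        if not self.queue:
            return None
        opcode, *fields = self.queue.popleft().split()
        return opcode, [int(field) for field in fields]


control = ControlModule()
control.submit("mlp 500 100 200 300 5 9")
control.submit("mlp 600 110 210 310 5 9")
first = control.issue_next()
```

A first-in, first-out queue is used here because the text states that the instructions are arranged according to their execution order.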

In a possible implementation manner, compiling the obtained fully-connected instruction to obtain the compiled fully-connected instruction may include:

The assembly file is generated according to the full connection instruction, and the assembly file is translated into a binary file. Among them, the binary file is a fully connected instruction after compilation.
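The two-stage compilation (instruction to assembly text, assembly text to binary) can be illustrated with the following Python sketch. The binary layout is purely hypothetical: a one-byte opcode followed by six little-endian 32-bit fields, with opcode number 0x01 chosen arbitrarily for mlp. The actual encoding used by the device is not specified in this disclosure, and the names `to_assembly`, `assemble` and `disassemble` are illustrative.

```python
import struct

MLP_OPCODE = 0x01      # hypothetical opcode number for mlp
ENCODING = "<B6I"      # hypothetical layout: 1-byte opcode + six 32-bit fields


def to_assembly(dst, a, b, weight, width, height):
    """Stage 1: generate one line of assembly text for the instruction."""
    return f"mlp {dst} {a} {b} {weight} {width} {height}"


def assemble(line):
    """Stage 2: translate the assembly line into a binary encoding."""
    mnemonic, *fields = line.split()
    if mnemonic != "mlp" or len(fields) != 6:
        raise ValueError("expected 'mlp' followed by six fields")
    return struct.pack(ENCODING, MLP_OPCODE, *(int(field) for field in fields))


def disassemble(blob):
    """Inverse of assemble, used here only to check the round trip."""
    opcode, *fields = struct.unpack(ENCODING, blob)
    if opcode != MLP_OPCODE:
        raise ValueError("unknown opcode")
    return fields


binary = assemble(to_assembly(500, 100, 200, 300, 5, 9))
```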

It should be noted that although the above embodiment is taken as an example to introduce the method for processing the fully connected command as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The fully connected instruction processing method provided by the embodiments of the present disclosure has a wide range of application, and has high processing efficiency and fast processing speed for the fully connected instruction, and high processing efficiency and speed for performing the fully connected operation.

The foregoing can be better understood based on the following clauses:

Clause G1, a fully connected command processing device, the device comprising:

The control module is used to compile the obtained fully-connected instruction to obtain the compiled fully-connected instruction, parse the compiled fully-connected instruction to obtain the operation code and operation domain of the fully-connected instruction, and obtain, according to the operation code and the operation domain, the first data, the second data, the weight data and the target address required to execute the fully connected instruction;

An operation module, configured to perform a fully connected operation on the first data and the second data according to the weight data, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed on the data by the fully connected instruction is a fully connected operation, and the operation domain includes a first data address, a second data address, a weight data address, and the target address.

Clause G2. The device according to Clause G1, the calculation module includes:

Multiple multipliers for performing the multiplication operation in the fully connected operation;

A plurality of adders are used to perform the addition operation in the fully connected operation.

Clause G3. The device according to Clause G2, the operation module includes a master operation submodule and a plurality of slave operation submodules, the slave operation submodule includes the plurality of multipliers and the plurality of adders,

The control module is further configured to parse the compiled fully-connected instruction to obtain a plurality of operation instructions, and send the first data, the second data, and the plurality of operation instructions to the main operation submodule;

The master operation sub-module is used to perform pre-processing on the first data and the second data, and transmit data and operation instructions with the plurality of slave operation sub-modules;

The slave operation sub-module is configured to execute intermediate operations in parallel, using the multiple multipliers and the multiple adders, according to the data and operation instructions transmitted from the master operation sub-module, to obtain multiple intermediate results, and to transmit the plurality of intermediate results to the main operation submodule;

The main operation sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain an operation result, and store the operation result in the target address.

Clause G4. The device according to Clause G1, the operation domain further includes a weight width and a weight height,

Wherein, the control module is further configured to obtain the weight data from the weight data address according to the weight height and the weight width.

Clause G5. The device according to Clause G1, the device further comprising:

A storage module, configured to store the first data, the second data, and the weight data,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the first data, the second data, and the weight data, and the cache includes at least one neuron cache NRAM;

The register is used to store the scalar data in the first data, the second data, and the weight data;

The neuron cache is used to store neuron data in the first data, the second data, and the weight data, and the neuron data includes neuron vector data.

Clause G6. The device according to Clause G1, the control module includes:

An instruction storage submodule, used to store the compiled fully connected instruction;

Instruction processing sub-module, which is used to parse the compiled fully connected instruction to obtain the operation code and operation domain of the fully connected instruction;

The queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the compiled fully-connected instructions.

Clause G7. The device according to Clause G6, the control module, further comprising:

Clause G8, the device according to Clause G1,

The control module is also used to generate an assembly file according to the fully connected instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled fully connected instruction.

Clause G9. A machine learning computing device, the device comprising:

One or more fully connected instruction processing devices as described in any one of clauses G1 to G8, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the fully connected instruction processing devices, the plurality of fully connected instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the fully connected instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; a plurality of the fully connected instruction processing devices share the same control system or have their own control systems; a plurality of the fully connected instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of fully connected instruction processing devices is an arbitrary interconnection topology.

Clause G10. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in Clause G9;

Clause G11. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause G9 or the combined processing device according to clause G10.

Clause G12. An electronic device, the electronic device comprising:

Machine learning chip as described in clause G11.

Clause G13, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause G11;

The storage device is used for storing data;

Clause G14. A method for processing a fully connected command. The method is applied to a device for processing a fully connected command. The device includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained fully-connected instruction to obtain the compiled fully-connected instruction, parse the compiled fully-connected instruction to obtain the operation code and operation domain of the fully-connected instruction, and obtain, according to the operation code and the operation domain, the first data, the second data, the weight data and the target address required to execute the fully connected instruction;

Using an operation module to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result in the target address,

Clause G15. According to the method of Clause G14, performing a fully connected operation on the first data and the second data according to the weight data includes:

A plurality of multipliers in the operation module are used to perform the multiplication operation in the fully connected operation, and a plurality of adders in the operation module are used to perform the addition operation in the fully connected operation.

Clause G16. The method according to Clause G15, the operation module includes a master operation submodule and a plurality of slave operation submodules, the slave operation submodule includes a plurality of multipliers and a plurality of adders,

Wherein, the method further includes:

Use the control module to parse the compiled fully connected instruction to obtain multiple operation instructions;

Wherein, performing a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and storing the operation result in the target address includes:

Using the main operation sub-module to perform pre-processing on the first data and the second data, and to transmit data and operation instructions;

Based on the multiple multipliers and multiple adders in the slave operation submodule, performing intermediate operations in parallel according to the transmitted data and operation instructions to obtain multiple intermediate results;

The main operation sub-module is used to perform subsequent processing on the plurality of intermediate results to obtain an operation result, and the operation result is stored in the target address.

Clause G17. The method according to Clause G14, the operation domain further includes a weight width and a weight height,

Wherein, obtaining the first data, the second data, the weight data and the target address required to execute the fully connected instruction according to the operation code and the operation domain includes:

Obtain the weight data from the weight data address according to the weight height and the weight width.

Clause G18. The method according to Clause G14, the method further comprising:

Using the storage module of the device to store the first data, the second data, and the weight data,

Wherein, the storage module includes at least one of a register and a cache,

Clause G19. According to the method described in Clause G14, parse the obtained fully-connected instruction to obtain the operation code and operation domain of the fully-connected instruction, including:

Store the compiled fully connected instruction;

Parse the compiled fully connected instruction to obtain the operation code and operation domain of the fully connected instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled fully-connected instructions.

Clause G20. The method according to Clause G19, the method further comprising:

Clause G21. According to the method described in Clause G14, compile the obtained fully-connected instruction to obtain the compiled fully-connected instruction, including:

Generate an assembly file according to the fully connected instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled fully connected instruction.

Clause G22. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause G14 to Clause G21.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. The convolution operation is a mathematical operator that generates a third function from two functions f and g, and characterizes the overlap between f and g after one of them has been flipped and translated. Programming languages are diverse, and in the related art there is at this stage no convolution instruction that can be widely applied across programming languages, so technicians must customize multiple instructions for their own programming language environment to implement the convolution operation, which makes the convolution operation inefficient and slow. The present disclosure provides a convolution instruction processing method, device, computer equipment, and storage medium, with which a convolution operation can be implemented with only one instruction, significantly improving the efficiency and speed of performing convolution operations.
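For reference, the standard mathematical definition of convolution referred to above can be written as follows, in its continuous form and in the discrete form that applies to sampled data. The symbols t, τ, n and m are the usual textbook variables and do not appear elsewhere in this disclosure.

```latex
(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau
\qquad
(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]
```

The argument t − τ (or n − m) expresses the flipping and translation of g: g is reversed, shifted by t, multiplied pointwise with f, and the product is integrated or summed.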

FIG. 8-1 shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-1, the device includes a control module 4-11 and an arithmetic module 4-12.

The control module 4-11 is used to compile the obtained convolution instruction to obtain the compiled convolution instruction, analyze the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and according to the operation The code and operation domain obtain the data to be operated, the convolution kernel and the target address required to execute the convolution instruction. The operation code is used to instruct the operation performed by the convolution instruction on the data to be a convolution operation, and the operation domain includes the data address to be operated, the convolution kernel address, and the target address.

The operation module 4-12 is used to perform a convolution operation on the data to be calculated according to the convolution kernel, obtain the operation result, and store the operation result in the target address.

In this embodiment, the convolution instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module needs to first compile the convolution instruction (uncompiled). After the compiled convolution instruction is obtained, the compiled convolution instruction can be analyzed. The compiled convolutional instructions are hardware instructions that can be directly executed by hardware. The control module can obtain the data to be operated and the convolution kernel from the address of the data to be operated and the address of the convolution kernel, respectively.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the data to be operated, parameters such as convolution kernels, and the corresponding operation methods. A convolution instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be operated, the convolution kernel address, and the target address.

It should be understood that, those skilled in the art can set the instruction format of the convolution instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive the convolution instruction and control one or more arithmetic modules to perform the convolution operation. When the device includes multiple control modules, the multiple control modules may respectively receive convolution instructions and control the corresponding one or more arithmetic modules to perform convolution operations.

A convolution instruction processing device provided by an embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained convolution instruction to obtain a compiled convolution instruction, parse the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and obtain, according to the operation code and operation domain, the data to be operated, the convolution kernel and the target address required to execute the convolution instruction. The operation module is used to perform a convolution operation on the data to be operated according to the convolution kernel, obtain the operation result, and store the operation result in the target address. The convolution instruction processing device provided by the embodiments of the present disclosure has a wide application range, processes convolution instructions with high efficiency and speed, and performs convolution operations with high efficiency and speed.

FIG. 8-2a shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 8-2a, the arithmetic module 4-12 may include multiple multipliers 4-120 and multiple adders 4-120'. The multiple multipliers 4-120 are used to perform the multiplication operations in the convolution operation. The multiple adders 4-120' are used to perform the addition operations in the convolution operation.

In this implementation manner, the operation module may further include one adder and one multiplier, or one adder and multiple multipliers, or multiple adders and one multiplier. The number of multipliers and adders can be set according to the amount of data required for the convolution operation, the processing speed of the convolution operation, the processing efficiency, etc., and the disclosure does not limit this.

FIG. 8-2b shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 8-2b, the operation module 4-12 may include a master operation submodule 4-121 and a plurality of slave operation submodules 4-122, and the slave operation submodule 4-122 includes multiple multipliers 4-120 and multiple adders 4-120' (not shown).

The control module 4-11 is also used to parse the compiled convolution instruction to obtain a plurality of operation instructions, and send the data to be operated, the convolution kernel and the plurality of operation instructions to the main operation sub-module 4-121.

The main operation sub-module 4-121 is used to perform pre-processing on the data to be operated and the convolution kernel, and to transmit data and operation instructions with a plurality of slave operation sub-modules 4-122.

The slave operation sub-module 4-122 is used to execute intermediate operations in parallel, using the multiple multipliers 4-120 and the multiple adders 4-120', according to the data and operation instructions transmitted from the main operation sub-module 4-121, to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main operation sub-module 4-121.

The main operation submodule 4-121 is also used to perform subsequent processing on a plurality of intermediate results, obtain operation results, and store the operation results in the target address.

In a possible implementation manner, the operation domain may further include an input height and an input width.

The control module is also used to obtain, from the address of the data to be operated, the data to be operated corresponding to the input width and the input height.

In this implementation, the input height and input width can define the data amount and size of the obtained data to be calculated. The input height and input width included in the operation domain may be specific numerical values, and may also be a storage address that stores the input height and input width. When the specific values of the input height and input width are directly included in the operation domain, the specific values are determined as the corresponding input height and input width. When the storage addresses of the input height and the input width are included in the operation domain, the input height and the input width can be obtained from the storage addresses of the input height and the input width, respectively.

In a possible implementation manner, when the input height and / or input width are not included in the operation domain, the data to be calculated may be obtained according to the preset default input height and default input width.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be ensured, and the device can execute the convolution instruction.

In a possible implementation manner, the operation domain may further include a convolution kernel height and a convolution kernel width.

The control module 4-11 is also used to obtain the convolution kernel from the convolution kernel address according to the height of the convolution kernel and the width of the convolution kernel.

In a possible implementation manner, the operation domain may further include a first step. The operation module 4-12 is also used to move the convolution kernel in the X direction of the data to be calculated according to the first step.

In a possible implementation manner, the operation domain may further include a second step. The operation module 4-12 is also used to move the convolution kernel in the Y direction of the data to be calculated according to the second step.

In a possible implementation manner, when one or more of the convolution kernel height, the convolution kernel width, the first step, and the second step are not included in the operation domain, the preset default convolution kernel height, default convolution kernel width, default first step, and default second step can be used, so that the control module and the operation module can execute the convolution instruction.
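How the two steps move the convolution kernel can be illustrated by listing the top-left position of every kernel placement. The sketch below assumes that the X direction runs along the input width, that the Y direction runs along the input height, that no padding is applied, and that a missing step defaults to 1; the function name `window_origins` and these defaults are illustrative assumptions, not part of the disclosure.

```python
def window_origins(src_h, src_w, kernel_h, kernel_w, stride_x=1, stride_y=1):
    """List the (y, x) top-left corners visited by the kernel (no padding assumed)."""
    origins = []
    # Second step: move the kernel along Y (rows).
    for y in range(0, src_h - kernel_h + 1, stride_y):
        # First step: move the kernel along X (columns).
        for x in range(0, src_w - kernel_w + 1, stride_x):
            origins.append((y, x))
    return origins


# A 4 x 5 input, a 2 x 2 kernel, first step 2, second step 1.
positions = window_origins(4, 5, 2, 2, stride_x=2, stride_y=1)
```

With these values the kernel takes three positions along Y and two along X, so `positions` holds six placements.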

In a possible implementation manner, the operation domain may further include the number of convolution kernels. The operation module 4-12 is also used to perform the convolution operation on the data to be calculated through a plurality of convolution kernels whose number is the number of convolution kernels.

In this implementation, the number of convolution kernels corresponds to the data to be calculated. For example, when the number of convolution kernels is 5, it can be determined that the data to be calculated can be divided into five parts, and five convolution kernels are required to perform convolution operations on the five parts of the data to be calculated.

In this implementation manner, when the operation domain does not include the number of convolution kernels, it can be determined that only one convolution kernel is needed for the data to be calculated to implement the convolution operation.

In a possible implementation manner, the operation domain may further include the number of channels. The operation module 4-12 is also used to perform the convolution operation on the data to be calculated through the corresponding channels according to the number of channels to obtain the operation result.

For example, when the number of channels is 3, the data to be calculated can be convolved on the three channels to obtain the operation result.

In a possible implementation manner, as shown in FIGS. 8-2a and 8-2b, the device may further include a storage module 4-13. The storage module 4-13 is used to store the data to be calculated and the convolution kernel.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store the data to be calculated and the convolution kernel, and the register is used to store the scalar data in the data to be calculated.

In a possible implementation, the cache may include a neuron cache. The neuron cache is used to store the neuron data in the data to be calculated, and the neuron data includes neuron vector data.
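The division of labor between the register and the neuron cache can be pictured with a small model. The class below is a hypothetical sketch: the name `StorageModuleSketch`, the dictionary-based storage, and the rule "Python numbers are scalars, everything else is neuron vector data" are assumptions made for illustration only.

```python
class StorageModuleSketch:
    """Toy model: scalars go to registers, vector data goes to the neuron cache."""

    def __init__(self):
        self.registers = {}     # scalar data in the data to be calculated
        self.neuron_cache = {}  # neuron vector data and convolution kernels

    def store(self, address, value):
        if isinstance(value, (int, float)):
            self.registers[address] = value
        else:
            # Copy so later changes to the caller's list do not alter the cache.
            self.neuron_cache[address] = list(value)

    def load(self, address):
        if address in self.registers:
            return self.registers[address]
        return self.neuron_cache[address]


storage = StorageModuleSketch()
storage.store(1, 3)                 # a scalar
storage.store(100, [0.5, 1.5, 2.5])  # a neuron vector
```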

In a possible implementation manner, the control module 4-11 may also be used to generate an assembly file according to the convolution instruction and translate the assembly file into a binary file, where the binary file is a compiled convolution instruction.

In a possible implementation, the instruction format of the convolution instruction may be:

conv dst src0 kernel srcChannel srcHeigh srcWidth kernelHeight kernelWidth strideX strideY dstChannel

Here, conv is the operation code of the convolution instruction, and dst, src0, kernel, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, strideX, strideY, and dstChannel are the operation domain of the convolution instruction. dst is the target address, src0 is the address of the data to be calculated, kernel is the convolution kernel address, srcChannel is the number of convolution kernels, srcHeigh is the input height of the data to be calculated, srcWidth is the input width of the data to be calculated, kernelHeight is the height of the convolution kernel, kernelWidth is the width of the convolution kernel, strideX is the first step, strideY is the second step, and dstChannel is the number of channels.

It should be understood that those skilled in the art can set the operation code of the convolution instruction, the position of the operation code and the operation domain in the instruction format according to needs, and the disclosure does not limit this.
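Splitting such an instruction into its operation code and operation domain can be sketched as below. The sketch assumes a textual form in which the operation code comes first and the operands are separated by whitespace or commas; the function name `parse_conv` is hypothetical, and fields missing at the end of the instruction are simply left out of the result.

```python
import re

# Field order taken from the instruction format described above.
FIELDS = ("dst", "src0", "kernel", "srcChannel", "srcHeigh", "srcWidth",
          "kernelHeight", "kernelWidth", "strideX", "strideY", "dstChannel")


def parse_conv(text):
    """Split a textual convolution instruction into (opcode, operation domain)."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    if not tokens or tokens[0] != "conv":
        raise ValueError("not a convolution instruction")
    operands = tokens[1:]
    if len(operands) > len(FIELDS):
        raise ValueError("too many operands")
    # zip stops at the shorter sequence, so trailing fields may be absent.
    domain = dict(zip(FIELDS, (int(t) for t in operands)))
    return tokens[0], domain


opcode, domain = parse_conv("conv 500, 100, 200, 5, 64, 32, 2, 1, 2, 3, 3")
```

After this call `opcode` is `"conv"` and `domain` maps each field name to its integer value, for example `domain["dst"]` is 500.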

It should be noted that, although the convolution instruction processing apparatus is described above by taking the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses “convolution instruction processing device for convolution operation” as an exemplary application scenario to give an application example according to an embodiment of the present disclosure, in order to facilitate understanding of the flow of the convolution instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 8-3 shows a schematic diagram of an application scenario of a convolution instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-3, the convolution instruction processing device processes the convolution instruction as follows:

The control module 4-11 compiles the obtained convolution instruction 1 to obtain the compiled convolution instruction 1 (for example, the convolution instruction 1 is conv 500, 100, 200, 5, 64, 32, 2, 1, 2, 3, 3). The compiled convolution instruction 1 is parsed to obtain the operation code and operation domain of the convolution instruction 1. The operation code of convolution instruction 1 is conv, the target address is 500, the address of the data to be calculated is 100, the convolution kernel address is 200, the number of convolution kernels is 5, the input height is 64, the input width is 32, the convolution kernel height is 2, the convolution kernel width is 1, the first step width is 2, the second step width is 3, and the number of channels is 3. The control module 4-11 acquires the 64 × 32 data to be calculated from the address 100 of the data to be calculated, and acquires a 2 × 1 convolution kernel from the convolution kernel address 200.

The operation module 4-12 performs convolution operation on the data to be calculated according to the number of convolution kernels 5, the first step width 2, the second step width 3, and the number of channels 3, obtains the operation result, and stores the operation result in the target address 500.
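The data flow of this example (fetch from address 100 and address 200, convolve, store at address 500) can be sketched for a single kernel and a single channel. The sketch is an illustration under stated assumptions: memory is modeled as a dictionary, the X direction is taken to be the width, the Y direction the height, no padding is applied, and the number of convolution kernels (5) and the number of channels (3) are not modeled. The all-ones data and kernel are invented for the example.

```python
def conv2d(data, kernel, stride_x, stride_y):
    """Single-channel 2-D convolution (cross-correlation form, no padding)."""
    src_h, src_w = len(data), len(data[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for y in range(0, src_h - k_h + 1, stride_y):
        row = []
        for x in range(0, src_w - k_w + 1, stride_x):
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += data[y + i][x + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical memory contents: 64 x 32 data at address 100, 2 x 1 kernel at 200.
memory = {
    100: [[1] * 32 for _ in range(64)],
    200: [[1], [1]],
}
memory[500] = conv2d(memory[100], memory[200], stride_x=2, stride_y=3)
out_h, out_w = len(memory[500]), len(memory[500][0])
```

Under these assumptions the result stored at address 500 has (64 - 2) // 3 + 1 = 21 rows and (32 - 1) // 2 + 1 = 16 columns.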

In this way, the convolution instruction processing device can efficiently and quickly process the convolution instruction, and the processing efficiency of the convolution operation is high and the speed is fast.

FIG. 8-4 shows a flowchart of a convolution instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-4 and S52-4. As shown in FIG. 8-4, the method is applied to the above-mentioned convolution instruction processing device. The method includes step S51-4 and step S52-4.

In step S51-4, the control module is used to compile the acquired convolution instruction to obtain the compiled convolution instruction, parse the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the convolution kernel and the target address required to execute the convolution instruction. The operation code is used to indicate that the operation performed by the convolution instruction on the data is a convolution operation, and the operation domain includes the address of the data to be operated, the convolution kernel address, and the target address.

In step S52-4, the operation module is used to perform convolution operation on the data to be operated according to the convolution kernel to obtain the operation result, and the operation result is stored in the target address.

In a possible implementation manner, performing convolution operation on the data to be calculated according to the convolution kernel may include:

Multiple multipliers are used to perform multiplication operations in convolution operations, and multiple adders are used to perform addition operations in convolution operations.
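The split between multipliers and adders can be sketched as two stages: one multiplication per kernel element, followed by a pairwise reduction of the products. Arranging the adders as a tree is an assumption made for this illustration (the text only states that multiple adders perform the additions), and the function names are hypothetical.

```python
def multiply_stage(window, weights):
    """One multiplier per element: pairwise products of data and kernel values."""
    return [a * b for a, b in zip(window, weights)]


def adder_tree(values):
    """Sum the products by repeatedly adding neighbouring pairs."""
    values = list(values)
    if not values:
        return 0
    while len(values) > 1:
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            # An odd element is carried to the next level unchanged.
            paired.append(values[-1])
        values = paired
    return values[0]


# One kernel placement: flattened window values and flattened kernel weights.
result = adder_tree(multiply_stage([1, 2, 3], [4, 5, 6]))
```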

Wherein, the method may further include: parsing the compiled convolution instruction using the control module to obtain multiple operation instructions;

Wherein, step S52-4 may include:

Use the main operation sub-module to perform pre-processing on the operation data and convolution kernel, as well as the transmission of data and operation instructions;

In a possible implementation manner, the operation domain may further include an input height and an input width. Obtaining the data to be operated, the convolution kernel and the target address required to execute the convolution instruction according to the operation code and the operation domain may include:

Obtain the data to be calculated corresponding to the input width and input height from the address of the data to be calculated.

In a possible implementation manner, the operation domain may further include a convolution kernel height and a convolution kernel width. Obtaining the data to be operated, the convolution kernel and the target address required to execute the convolution instruction according to the operation code and the operation domain may include: obtaining the convolution kernel from the convolution kernel address according to the height of the convolution kernel and the width of the convolution kernel.

In a possible implementation manner, the operation domain may further include a first step, where performing the convolution operation on the data to be calculated according to the convolution kernel to obtain the operation result includes: moving the convolution kernel in the X direction of the data to be calculated according to the first step.

In a possible implementation manner, the operation domain may further include a second step, where performing the convolution operation on the data to be calculated according to the convolution kernel to obtain the operation result includes: moving the convolution kernel in the Y direction of the data to be calculated according to the second step.

In a possible implementation, the operation domain may also include the number of convolution kernels. Wherein, performing convolution operation on the data to be calculated according to the convolution kernel to obtain the operation result may include:

Perform the convolution operation on the data to be calculated through a plurality of convolution kernels whose number is the number of convolution kernels.

In a possible implementation, the operation domain may also include the number of channels. Wherein, performing convolution operation on the data to be calculated according to the convolution kernel to obtain the operation result may include:

According to the number of channels, perform convolution operation on the data to be calculated through the corresponding channel to obtain the operation result.

In a possible implementation manner, the method may further include: using the storage module of the device to store the data to be calculated and the convolution kernel,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be calculated and the convolution kernel. The cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

The neuron cache is used to store neuron data in the data to be calculated, and the neuron data includes neuron vector data.

In a possible implementation manner, parsing the obtained convolution instruction to obtain the operation code and operation domain of the convolution instruction may include:

Store the compiled convolution instruction;

Analyze the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include compiled convolutional instructions.
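The three sub-steps listed above (store, parse, queue in execution order) can be modeled in a few lines. This is a toy sketch: the compiled instruction is represented as text rather than as a binary, and the class name `ControlModuleSketch` and its methods are hypothetical.

```python
from collections import deque


class ControlModuleSketch:
    """Toy model: store each compiled instruction, parse it, and queue it."""

    def __init__(self):
        self.stored = []      # instruction storage
        self.queue = deque()  # instructions to be executed, in execution order

    def accept(self, compiled_instruction):
        self.stored.append(compiled_instruction)
        # Parse into operation code and operands (textual form assumed).
        opcode, *operands = compiled_instruction.replace(",", " ").split()
        self.queue.append((opcode, [int(v) for v in operands]))

    def next_instruction(self):
        """Return the next instruction to execute, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


control = ControlModuleSketch()
control.accept("conv 500, 100, 200")
control.accept("conv 600 100 200")
first = control.next_instruction()
```

Instructions leave the queue in the order in which they were accepted, so `first` is the instruction that targets address 500.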

In a possible implementation manner, the method may further include:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and execute The first instruction to be executed,

In a possible implementation manner, compiling the obtained convolution instruction to obtain the compiled convolution instruction may include:

The assembly file is generated according to the convolution instruction, and the assembly file is translated into a binary file. Among them, the binary file is a compiled convolution instruction.

It should be noted that although the above embodiment is taken as an example to introduce the convolution instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The convolution instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for convolution instructions, and high processing efficiency and speed for performing convolution operations.

The foregoing can be better understood based on the following clauses:

Clause H1. A convolution instruction processing device, the device comprising:

The control module is used to compile the obtained convolution instruction to obtain the compiled convolution instruction, analyze the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and according to the The operation code and the operation domain obtain the data to be operated, the convolution kernel and the target address required to execute the convolution instruction;

An operation module, configured to perform a convolution operation on the data to be operated according to the convolution kernel, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the convolution instruction on the data is a convolution operation, and the operation domain includes the data address to be operated, the convolution kernel address, and the target address.

Clause H2. The device according to Clause H1, the arithmetic module includes:

Multiple multipliers for performing the multiplication operation in the convolution operation;

A plurality of adders are used to perform the addition operation in the convolution operation.

Clause H3. The device according to Clause H2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of multipliers and the plurality of adders,

The control module is further configured to parse the compiled convolution instruction to obtain a plurality of operation instructions, and send the data to be operated, the convolution kernel, and the plurality of operation instructions to the master operation sub-module;

The master operation sub-module is used to perform pre-processing on the data to be operated and the convolution kernel, and to transmit data and operation instructions with the plurality of slave operation sub-modules;

Clause H4. The device according to Clause H1, the operation domain further includes an input height and an input width,

Wherein, the control module is also used to obtain the data to be calculated corresponding to the input width and the input height from the address of the data to be calculated.

Clause H5. The device according to Clause H1, the operation domain further includes a convolution kernel height and a convolution kernel width,

Wherein, the control module is further configured to obtain the convolution kernel from the convolution kernel address according to the height of the convolution kernel and the width of the convolution kernel.

Clause H6. The device according to Clause H1, the operation domain further includes a first step,

The calculation module is also used to move the convolution kernel in the X direction of the data to be calculated according to the first step.

Clause H7. The device according to Clause H1, the operation domain further includes a second step,

The calculation module is also used to move the convolution kernel in the Y direction of the data to be calculated according to a second step.

Clause H8. The device according to Clause H1, the operation domain further includes the number of convolution kernels,

Wherein, the operation module is also used to perform convolution operation on the data to be operated through a plurality of convolution kernels whose number is the number of the convolution kernels.

Clause H9. The device according to Clause H1, the operation domain further includes the number of channels,

Wherein, the operation module is also used to perform convolution operation on the data to be operated through the corresponding channel according to the number of channels to obtain an operation result.

Clause H10. The device according to Clause H1, the device further comprising:

A storage module for storing the data to be calculated and the convolution kernel,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be operated and the convolution kernel, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

The neuron cache is used to store neuron data in the data to be operated, and the neuron data includes neuron vector data.

Clause H11. The device according to Clause H1, the control module includes:

An instruction storage sub-module for storing the compiled convolution instruction;

Instruction processing sub-module, which is used to parse the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled convolutional instructions.

Clause H12. The device according to Clause H11, the control module, further comprising:

Clause H13. The device according to Clause H1,

The control module is also used to generate an assembly file according to the convolution instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled convolution instruction.

Clause H14. A machine learning computing device, the device comprising:

One or more convolution instruction processing devices as described in any one of Clause H1 to Clause H13, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to the other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the convolution instruction processing devices, the plurality of the convolution instruction processing devices can be connected and transmit data through a specific structure;

Wherein, the plurality of the convolution instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of the convolution instruction processing devices share the same control system or have their own control systems; the plurality of the convolution instruction processing devices share memory or have their own memories; and the plurality of the convolution instruction processing devices may be interconnected in any interconnection topology.

Clause H15. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause H14;

Clause H16. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause H14 or the combined processing device according to clause H15.

Clause H17. An electronic device, the electronic device comprising:

Machine learning chip as described in clause H16.

Clause H18. A board card, the board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause H16;

The storage device is used for storing data;

Clause H19. A convolution instruction processing method. The method is applied to a convolution instruction processing device. The device includes a control module and an operation module. The method includes:

Using the control module to compile the obtained convolution instruction to obtain the compiled convolution instruction, parse the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the convolution kernel and the target address required to execute the convolution instruction;

Using the operation module to perform a convolution operation on the data to be operated according to the convolution kernel to obtain an operation result, and store the operation result in the target address,

Clause H20. The method according to Clause H19, wherein performing the convolution operation on the data to be calculated according to the convolution kernel includes:

The multiplication operation in the convolution operation is performed using multiple multipliers, and the addition operation in the convolution operation is performed using multiple adders.

Clause H21. The method according to Clause H19, the operation module includes a master operation submodule and a plurality of slave operation submodules, the slave operation submodule includes the plurality of multipliers and the plurality of adders,

Wherein, the method further includes:

Use the control module to parse the compiled convolution instruction to obtain multiple operation instructions;

Wherein, performing convolution operation on the data to be operated according to the convolution kernel to obtain an operation result, and storing the operation result in the target address includes:

Using the main operation sub-module to perform pre-processing on the data to be operated and the convolution kernel, and to transmit data and operation instructions;

Based on the plurality of multipliers and the plurality of adders in the slave operation submodule, performing intermediate operations in parallel according to the transmitted data and operation instructions to obtain multiple intermediate results;

Clause H22. The method according to Clause H19, the operation domain further includes an input height and an input width,

Wherein, obtaining the data to be operated, the convolution kernel and the target address required to execute the convolution instruction according to the operation code and the operation domain includes:

Obtain the data to be calculated corresponding to the input width and the input height from the address of the data to be calculated.

Clause H23. The method according to Clause H19, the operation domain further includes a convolution kernel height and a convolution kernel width,

Obtain the convolution kernel from the convolution kernel address according to the height of the convolution kernel and the width of the convolution kernel.

Clause H24. The method according to Clause H19, the operation domain further includes a first step,

Wherein, performing convolution operation on the data to be calculated according to the convolution kernel to obtain an operation result includes:

Move the convolution kernel according to the first step in the X direction of the data to be calculated.

Clause H25. The method according to Clause H19, the operation domain further includes a second step,

The convolution kernel is moved in the Y direction of the data to be calculated according to a second step.

Clause H26. The method according to Clause H19, the operation domain further includes the number of convolution kernels,

The convolution operation is performed on the data to be operated through a plurality of convolution kernels whose number is the number of the convolution kernels.

Clause H27. The method according to Clause H19, the operation domain further includes the number of channels,

According to the number of channels, perform convolution operation on the data to be calculated through the corresponding channel to obtain an operation result.

Clause H28. The method according to Clause H19, the method further comprising:

Using the storage module of the device to store the data to be calculated and the convolution kernel,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause H29. The method according to Clause H19, wherein parsing the obtained convolution instruction to obtain the operation code and operation domain of the convolution instruction includes:

Storing the compiled convolution instruction;

Analyzing the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled convolutional instructions.

Clause H30. The method according to Clause H29, the method further comprising:

Clause H31. The method according to Clause H19, wherein compiling the obtained convolution instruction to obtain the compiled convolution instruction includes:

Generate an assembly file according to the convolution instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled convolution instruction.

Clause H32. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause H19 to Clause H31.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing capability, the types and number of data operations involved in practical applications continue to increase. Max-pooling is a method of obtaining the maximum value of all data in a local area. Programming languages are diverse, and in the related technology there is at this stage no maximum pooling instruction that can be widely applied to various programming languages. To implement the maximum pooling operation in a given language environment, technical staff need to define multiple instructions corresponding to that programming language environment, which results in low efficiency and slow speed for the maximum pooling operation. The present disclosure provides a maximum pooling instruction processing method, device, computer equipment, and storage medium. The maximum pooling operation can be realized with only one instruction, which can significantly improve the efficiency and speed of performing the maximum pooling operation.

FIG. 9-1 shows a block diagram of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 9-1, the device includes a control module 5-11 and an arithmetic module 5-12.

The control module 5-11 is used to compile the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction, parse the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the pooling core and the target address required to execute the maximum pooling instruction. The operation code is used to indicate that the operation performed by the maximum pooling instruction on the data is the maximum pooling operation, and the operation domain includes the address of the data to be calculated, the pooling core address, and the target address.

The operation module 5-12 is used for performing the maximum pooling operation on the data to be calculated according to the pooling core, obtaining the operation result, and storing the operation result in the target address.
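The maximum pooling operation itself can be sketched as taking the largest value inside each region covered by the pooling core. In this illustration the pooling core is described only by its height and width, memory is modeled as a dictionary, and the step between regions defaults to the core size; these choices, the addresses, and the name `max_pool2d` are assumptions rather than details given in the text.

```python
def max_pool2d(data, pool_h, pool_w, stride_x=None, stride_y=None):
    """Maximum of every pool_h x pool_w region of a 2-D input (no padding)."""
    # Assumed default: regions do not overlap.
    stride_x = pool_w if stride_x is None else stride_x
    stride_y = pool_h if stride_y is None else stride_y
    src_h, src_w = len(data), len(data[0])
    out = []
    for y in range(0, src_h - pool_h + 1, stride_y):
        row = []
        for x in range(0, src_w - pool_w + 1, stride_x):
            window = [data[y + i][x + j]
                      for i in range(pool_h) for j in range(pool_w)]
            row.append(max(window))
        out.append(row)
    return out


# Hypothetical addresses: data at 10, result stored at target address 30.
memory = {10: [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]}
memory[30] = max_pool2d(memory[10], 2, 2)
```

For this 4 × 4 input and a 2 × 2 pooling core, the result stored at address 30 is `[[6, 8], [14, 16]]`.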

In this embodiment, the maximum pooling instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module needs to first compile the (uncompiled) maximum pooling instruction. After the compiled maximum pooling instruction is obtained, it can be parsed. The compiled maximum pooling instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain the data to be calculated and the pooling core from the address of the data to be calculated and the pooling core address, respectively. The control module can obtain instructions and data through a data input and output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction include parameters such as the data to be calculated and the pooling core, as well as the corresponding operation method. A maximum pooling instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated, the pooling core address, and the target address.

It should be understood that, those skilled in the art can set the instruction format of the maximum pooling instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive the maximum pooling instruction and control one or more arithmetic modules to perform the maximum pooling operation. When the device includes multiple control modules, the multiple control modules may respectively receive the maximum pooling instruction and control the corresponding one or more arithmetic modules to perform the maximum pooling operation.

The maximum pooling instruction processing device provided by an embodiment of the present disclosure includes a control module and an arithmetic module. The control module is used to compile the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction, parse the compiled maximum pooling instruction to obtain its operation code and operation domain, and, according to the operation code and operation domain, obtain the data to be calculated, the pooling core, and the target address required to execute the maximum pooling instruction. The arithmetic module is used to perform the maximum pooling operation on the data to be calculated according to the pooling core, obtain the operation result, and store the operation result in the target address. The device has a wide application range, processes the maximum pooling instruction efficiently and quickly, and performs the maximum pooling operation efficiently and quickly.

FIG. 9-2a shows a block diagram of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 9-2a, the arithmetic module 5-12 may include multiple comparators 5-120. The comparators 5-120 are used to perform comparison operations on the multiple data to be calculated in the area corresponding to the pooling core and obtain the operation result.

In this implementation, the arithmetic module may also include only one comparator. The number of comparators can be set according to the amount of data to be compared, the required processing speed and efficiency of the comparison operation, and the like, which is not limited in the present disclosure.
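As an illustration of how several comparators could reduce the data in one pooling-core area to a single maximum, the following is a minimal sketch. It assumes each comparator takes two inputs and outputs the larger one, and that comparisons are arranged in rounds; the disclosure does not specify the comparator arrangement, and the names comparator and window_max are hypothetical.

```python
def comparator(a, b):
    """Model of one two-input comparator: returns the larger input (assumed behavior)."""
    return a if a >= b else b


def window_max(values):
    """Reduce one pooling-core area to its maximum with rounds of pairwise comparisons."""
    level = list(values)
    if not level:
        raise ValueError("window must contain at least one value")
    while len(level) > 1:
        # Each round pairs up neighbours; with enough comparators a round can run in parallel.
        nxt = [comparator(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value passes through to the next round
        level = nxt
    return level[0]
```

With a single comparator the same rounds would simply run sequentially, which matches the remark that the number of comparators trades hardware for speed.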

FIG. 9-2b shows a block diagram of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 9-2b, the operation module 5-12 may include a master operation submodule 5-121 and a plurality of slave operation submodules 5-122, and the master operation submodule 5-121 includes multiple comparators.

The master operation submodule 5-121 is used to perform comparison operations, by using the multiple comparators, on the multiple data to be calculated in the area corresponding to the pooling core to obtain the operation result, and to store the operation result in the target address.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be ensured, and the device can execute the maximum pooling instruction.

In a possible implementation, the operation domain may further include pooled core height and pooled core width.

Among them, the control module 5-11 is also used to obtain the pooled core from the pooled core address according to the pooled core height and the pooled core width.

In a possible implementation, the operation domain may further include a first step. The arithmetic module 5-12 can also be used to move the pooling core in the x direction according to the first step.

In a possible implementation, the operation domain may further include a second step. The arithmetic module 5-12 can also be used to move the pooling core in the y direction according to the second step.

In this implementation, the step of the maximum pooling operation is the distance the pooling core moves each time during the maximum pooling operation. The first step may be the distance the pooling core moves in the x direction, and the second step may be the distance the pooling core moves in the y direction.

It should be noted that this disclosure takes a two-dimensional pooling core only as an example when describing the parameters required for the maximum pooling operation, such as the height and width of the pooling core, the first step, and the second step. If the pooling core is multi-dimensional, its parameters include the size and the step of each dimension.

In a possible implementation, when the first step and the second step are not given in the operation domain of the maximum pooling instruction, the arithmetic module may use the height and width of the pooling core as the steps of their corresponding dimensions, so that the maximum pooling operation can proceed normally. For example, the arithmetic module 5-12 can also be used to move the pooling core over the data to be calculated without overlap, and compare the multiple data to be calculated in the area corresponding to the pooling core to obtain the operation result.
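The default-step rule can be sketched as follows. This is a minimal illustration, assuming that the x direction corresponds to the pooling core width and the y direction to the pooling core height (the text says only "their corresponding dimensions"); the function name resolve_steps and the use of None for an absent field are hypothetical.

```python
def resolve_steps(core_height, core_width, sx=None, sy=None):
    """Return (first step, second step), falling back to the core size when a step is absent.

    Assumption: x pairs with the core width and y with the core height.
    """
    step_x = core_width if sx is None else sx
    step_y = core_height if sy is None else sy
    if step_x <= 0 or step_y <= 0:
        raise ValueError("steps must be positive")
    return step_x, step_y
```

When both steps fall back to the core size, consecutive core positions touch without overlapping, which is the non-overlapping movement described above.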

In a possible implementation, when the operation domain does not include the pooling core height and the pooling core width, a preset default pooling core height and default pooling core width can be obtained, so that the control module and the arithmetic module can execute the maximum pooling instruction.

In a possible implementation, the operation domain may further include the number of pooling cores. The arithmetic module 5-12 is also used to perform the maximum pooling operation on the data to be calculated using as many pooling cores as the number of pooling cores specifies.

In this implementation, the number of pooled cores corresponds to the data to be calculated. For example, when the number of pooling cores is 5, it can be determined that the data to be calculated can be divided into five parts, and five pooling cores are required to perform the maximum pooling operation on the five parts of the data to be calculated, respectively.

In this implementation manner, when the operation domain does not include the number of pooled cores, it can be determined that only one pooled core is needed for the data to be calculated to achieve the maximum pooled operation.

In a possible implementation, the calculation module 5-12 is further used to, when the size of the data to be calculated is a non-integer multiple of the size of the pooling core, perform the maximum pooling operation on the part of the data to be calculated whose size is an integer multiple of the size of the pooling core. The size of the data to be calculated being a non-integer multiple of the size of the pooling core may include at least one of the following: the input width of the data to be calculated is a non-integer multiple of the width of the pooling core, and the input height of the data to be calculated is a non-integer multiple of the height of the pooling core.

In this implementation, the maximum pooling operation may be skipped for the remaining part of the data to be calculated, that is, the part left over beyond an integer multiple of the pooling core size.
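The handling of a non-integer multiple can be sketched as a crop that keeps only the region covered by whole pooling cores. This is a minimal illustration, assuming the data to be calculated is a two-dimensional list of rows and that the leftover rows and columns lie at the bottom and right edges; neither assumption is stated in the disclosure, and crop_to_core_multiple is a hypothetical name.

```python
def crop_to_core_multiple(data, core_height, core_width):
    """Keep the top-left region whose height and width are integer multiples of the core size."""
    usable_height = (len(data) // core_height) * core_height
    usable_width = (len(data[0]) // core_width) * core_width
    # Rows and columns beyond the usable region are left out of the pooling operation.
    return [row[:usable_width] for row in data[:usable_height]]
```

For instance, a 3 × 5 input with a 2 × 2 pooling core keeps a 2 × 4 region, and the last row and last column are not pooled.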

In a possible implementation, as shown in FIGS. 9-2a and 9-2b, the device may further include a storage module 5-13. The storage module 5-13 is used to store the data to be calculated and the pooling core.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store data to be calculated and pooled cores, and the register can be used to store scalar data in the data to be calculated.

In a possible implementation manner, the control module 5-11 may also be used to generate an assembly file according to the maximum pooling instruction and translate the assembly file into a binary file, where the binary file is the compiled maximum pooling instruction.

In a possible implementation manner, the instruction format of the maximum pooling instruction may be:

maxpool dst src0 srcChannel srcHeigh srcWidth kernelHeight kernelWidth sx sy

Among them, maxpool is the operation code of the maximum pooling instruction, and dst, src0, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, sx, and sy are the operation domain of the maximum pooling instruction. Here dst is the target address, src0 is the to-be-calculated data address, srcChannel is the number of pooling cores, srcHeigh is the input height, srcWidth is the input width, kernelHeight is the pooling core height, kernelWidth is the pooling core width, sx is the first step, by which the pooling core moves in the x direction, and sy is the second step, by which the pooling core moves in the y direction.

It should be understood that those skilled in the art can set the positions of the operation code and the operation domain of the maximum pooling instruction in the instruction format as needed, and this disclosure does not limit this.
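To make the split between operation code and operation domain concrete, the following minimal sketch parses a textual instruction laid out in the nine-field format given above. It assumes a whitespace-separated text encoding with integer operands, which the disclosure does not prescribe; parse_maxpool and FIELD_NAMES are hypothetical names, while the field names themselves are taken from the format line.

```python
FIELD_NAMES = (
    "dst", "src0", "srcChannel", "srcHeigh", "srcWidth",
    "kernelHeight", "kernelWidth", "sx", "sy",
)


def parse_maxpool(text):
    """Split an instruction string into (operation code, operation-domain dict)."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty instruction")
    opcode, operands = tokens[0], tokens[1:]
    if opcode != "maxpool":
        raise ValueError(f"unexpected operation code: {opcode}")
    if len(operands) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} operands, got {len(operands)}")
    # Operands are assumed to be decimal integers (addresses, sizes, steps).
    return opcode, dict(zip(FIELD_NAMES, (int(tok) for tok in operands)))
```

A different field order or additional fields would only require changing FIELD_NAMES, in line with the statement that the positions of the fields are not limited.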

It should be noted that, although the above embodiment is used as an example to introduce the maximum pooling instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following takes "performing a maximum pooling operation with the maximum pooling instruction processing device" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to facilitate understanding of the flow of the maximum pooling instruction processing device. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be considered a limitation on them.

FIG. 9-3 shows a schematic diagram of an application scenario of a maximum pooling instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 9-3, the maximum pooling instruction processing device processes the maximum pooling instruction as follows:

The control module 5-11 compiles the obtained maximum pooling instruction 1 to obtain the compiled maximum pooling instruction 1 (for example, the maximum pooling instruction 1 is maxpool 500 100 200 5 64 32 2 1 2 1), and parses the compiled maximum pooling instruction 1 to obtain its operation code and operation domain. Among them, the operation code of the maximum pooling instruction 1 is maxpool, the target address is 500, the to-be-calculated data address is 100, the pooling core address is 200, the number of pooling cores is 5, the input height is 64, the input width is 32, the pooling core height is 2, the pooling core width is 1, the first step is 2, and the second step is 1. The control module 5-11 obtains the 64 × 32 data to be calculated from the to-be-calculated data address 100 and the 2 × 1 pooling core from the pooling core address 200.

The calculation module 5-12 uses 5 pooling cores to perform maximum pooling operation on the data to be calculated, obtains the calculation result, and stores the calculation result in the target address 500.

In this way, the maximum pooling instruction can be processed efficiently and quickly, and the efficiency and speed of the maximum pooling operation are also significantly improved.
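The arithmetic in this application example can be followed with a small software model. The sketch below is an assumption-laden illustration rather than the device's implementation: it treats the data to be calculated as one two-dimensional list, pairs the first step with the x (width) direction and the second step with the y (height) direction, and ignores the division of the data among the five pooling cores. The name max_pool2d is hypothetical.

```python
def max_pool2d(data, core_height, core_width, sx, sy):
    """Slide a core_height x core_width window over data and keep the maximum of each area."""
    in_height, in_width = len(data), len(data[0])
    result = []
    for top in range(0, in_height - core_height + 1, sy):       # second step: y direction
        row = []
        for left in range(0, in_width - core_width + 1, sx):    # first step: x direction
            area = [data[top + i][left + j]
                    for i in range(core_height) for j in range(core_width)]
            row.append(max(area))
        result.append(row)
    return result


# Parameters from the application example: 64 x 32 input, 2 x 1 core, steps 2 and 1.
example_input = [[r * 32 + c for c in range(32)] for r in range(64)]
example_output = max_pool2d(example_input, core_height=2, core_width=1, sx=2, sy=1)
```

Under these assumptions the example yields a 63 × 16 result, since the core fits at 63 vertical positions with step 1 and at 16 horizontal positions with step 2.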

FIG. 9-4 shows a flowchart of a maximum pooling instruction processing method according to an embodiment of the present disclosure. The method can be applied to a computer device including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as the following steps S51-5 and S52-5. As shown in FIG. 9-4, the method is applied to the above-mentioned maximum pooling instruction processing device and includes step S51-5 and step S52-5.

In step S51-5, the control module is used to compile the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction, parse the compiled maximum pooling instruction to obtain the operation code and the operation domain, and, according to the operation code and the operation domain, obtain the data to be calculated, the pooling core, and the target address required to execute the maximum pooling instruction. The operation code is used to indicate that the operation performed on the data by the maximum pooling instruction is the maximum pooling operation, and the operation domain includes the to-be-calculated data address, the pooling core address, and the target address.

In step S52-5, the operation module is used to perform the maximum pooling operation on the data to be calculated according to the pooling core to obtain the operation result, and the operation result is stored in the target address.
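The division of work between steps S51-5 and S52-5 can be sketched in software as two functions that share a memory. This is a minimal sketch under several assumptions not found in the disclosure: memory is modeled as a Python dict keyed by address, the pooling core is stored as a (height, width) pair, the core moves without overlap, and compilation is omitted. The names step_s51, step_s52 and core_addr are hypothetical.

```python
def step_s51(fields, memory):
    """Control-module step: fetch the operands named by the (already parsed) operation domain."""
    data = memory[fields["src0"]]       # data to be calculated
    core = memory[fields["core_addr"]]  # assumed to hold (core height, core width)
    return data, core, fields["dst"]


def step_s52(data, core, dst, memory):
    """Operation-module step: max-pool the data and store the result at the target address."""
    core_height, core_width = core
    rows = range(0, len(data) - core_height + 1, core_height)    # non-overlapping in y
    cols = range(0, len(data[0]) - core_width + 1, core_width)   # non-overlapping in x
    result = [
        [max(data[r + i][c + j] for i in range(core_height) for j in range(core_width))
         for c in cols]
        for r in rows
    ]
    memory[dst] = result
    return result
```

Keeping operand fetching separate from the computation mirrors the control module and operation module split, so either side could be replaced without touching the other.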

In a possible implementation manner, performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain the operation result may include:

A plurality of comparators in the operation module are used to perform comparison operation on a plurality of data to be operated in the area corresponding to the pooled core, and the operation result is obtained.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple comparators,

Among them, the maximum pooling operation is performed on the data to be calculated according to the pooling core to obtain the operation result, and the operation result is stored in the target address, including:

A plurality of comparators are used to perform comparison operations on a plurality of data to be operated in the area corresponding to the pooled core, to obtain operation results, and store the operation results in the target address.

In a possible implementation manner, the operation domain may further include an input height and an input width. Among them, obtaining the data to be calculated, the pooling core, and the target address required to execute the maximum pooling instruction according to the operation code and the operation domain may include:

In a possible implementation, the operation domain may further include pooled core height and pooled core width. Among them, obtaining the data to be calculated, the pooling core, and the target address required to execute the maximum pooling instruction according to the operation code and the operation domain may include:

Obtain the pooled core from the pooled core address according to the pooled core height and the pooled core width.

Move the pooled cores non-overlapping on the data to be calculated, and compare multiple data to be calculated in the area corresponding to the pooled cores to obtain the calculation result.

When the size of the data to be calculated is a non-integer multiple of the size of the pooled core, the maximum pooling operation is performed on the data to be calculated that is an integer multiple of the size of the pooled core,

The size of the data to be calculated being a non-integer multiple of the size of the pooling core may include at least one of the following: the input width of the data to be calculated is a non-integer multiple of the width of the pooling core, and the input height of the data to be calculated is a non-integer multiple of the height of the pooling core.

In a possible implementation, the operation domain may further include the number of pooled cores. Among them, performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain the operation result may include:

Through multiple pooling cores with the number of pooling cores, the maximum pooling operation is performed on the data to be calculated.

In a possible implementation, the method may further include: using the storage module of the device to store the data to be calculated and the pooling core. The storage module may include at least one of a register and a cache. The cache is used to store the data to be calculated and the pooling core, and may include at least one neuron cache NRAM; the register is used to store the scalar data in the data to be calculated; the neuron cache is used to store the neuron data in the data to be calculated, and the neuron data may include neuron vector data.

In a possible implementation manner, parsing the obtained maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction may include:

Store the compiled maximum pooled instruction;

Analyze the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include a compiled maximum pooled instruction.
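The three sub-steps above (store the compiled instruction, parse it, keep an ordered instruction queue) can be sketched as follows. This is a hedged illustration only: it assumes compiled instructions can be represented as whitespace-separated strings and that execution order equals submission order, and the class name ControlModuleSketch and its methods are hypothetical.

```python
from collections import deque


class ControlModuleSketch:
    """Toy model of instruction storage, parsing, and an execution-ordered queue."""

    def __init__(self):
        self.instruction_store = []  # stores compiled instructions
        self.queue = deque()         # instructions to be executed, in execution order

    def submit(self, compiled_instruction):
        self.instruction_store.append(compiled_instruction)
        opcode, *operands = compiled_instruction.split()  # operation code, operation domain
        self.queue.append((opcode, operands))

    def next_instruction(self):
        """Return the next instruction to execute, or None when the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A first-in, first-out deque is used here only because the text says the queued instructions are arranged in execution order; a real device could reorder entries by other criteria.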

In a possible implementation manner, compiling the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction may include:

An assembly file is generated according to the maximum pooling instruction, and the assembly file is translated into a binary file. Among them, the binary file is the compiled maximum pooling instruction.

It should be noted that, although the above embodiment is used as an example to introduce the maximum pooling instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The maximum pooling instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the maximum pooling instruction, and high efficiency and speed for performing the maximum pooling operation.

The foregoing can be better understood based on the following clauses:

Clause I1, a maximum pooling instruction processing device, the device comprising:

The control module is used to compile the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction, and parse the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction, And obtain the data to be operated, the pooling core and the target address required to execute the maximum pooling instruction according to the operation code and the operation domain;

An operation module, configured to perform a maximum pooling operation on the data to be calculated according to the pooling core, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the maximum pooling instruction on the data is the maximum pooling operation, and the operation domain includes the data address to be operated, the pooling core address, and the target address.

Clause I2. The device according to Clause I1, the calculation module includes:

A plurality of comparators are used to perform comparison operations on a plurality of data to be operated in the area corresponding to the pooled core to obtain operation results.

Clause I3. The device according to Clause I2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,

The master operation sub-module is configured to use the plurality of comparators to perform comparison operations on a plurality of data to be operated in the area corresponding to the pooled core, obtain operation results, and store the operation results in the target address.

Clause I4. The device according to Clause I1, the operation domain further includes an input height and an input width,

Clause I5. The device according to Clause I1, the operation domain further includes a pooled core height and a pooled core width,

Wherein, the control module is further configured to obtain the pooled core from the pooled core address according to the pooled core height and the pooled core width.

Clause I6. The device according to Clause I1, the operation domain further includes a first step,

Wherein, the arithmetic module is also used to move the pooling core in the x direction according to the first step.

Clause I7. The device according to Clause I1, the operation domain further includes a second step,

Wherein, the calculation module is also used to move the pooling core in the y direction according to the second step.

Clause I8. The device according to Clause I1,

The calculation module is also used to move the pooled core on the data to be calculated non-overlapping, and compare a plurality of data to be calculated in the area corresponding to the pooled core to obtain the calculation result.

Clause I9, the device according to Clause I1,

The calculation module is also used to, when the size of the data to be calculated is a non-integer multiple of the size of the pooled core, perform the maximum pooling operation on the part of the data to be calculated whose size is an integer multiple of the size of the pooled core,

Wherein, the size of the data to be calculated is a non-integer multiple of the size of the pooled core, including at least one of the following: The input width of the data to be calculated is a non-integer multiple of the width of the pooled core. The input height of the data to be calculated is a non-integer multiple of the height of the pooled core.

Clause I10. The device according to Clause I1, the operation domain further includes the number of pooled cores,

Wherein, the calculation module is also used to perform a maximum pooling operation on the data to be calculated through a plurality of pooling cores whose number is the number of the pooling cores.

Clause I11. The device according to Clause I1, the device further comprising:

A storage module for storing the data to be calculated and the pooled core,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be operated and the pooled core, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause I12. The device according to Clause I1, the control module includes:

An instruction storage submodule, used to store the compiled maximum pooled instruction;

Instruction processing sub-module, which is used to parse the compiled maximum pooled instruction to obtain the operation code and operation domain of the maximum pooled instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled maximum pooled instruction.

Clause I13. The device according to Clause I12, the control module, further comprising:

Clause I14, the device according to Clause I1,

The control module is also used to generate an assembly file according to the maximum pooling instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled maximum pooling instruction.

Clause I15. A machine learning computing device, the device comprising:

One or more maximum pooling instruction processing devices as described in any one of Clause I1 to Clause I14, used to obtain the data to be calculated and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the maximum pooled instruction processing devices, the plurality of maximum pooled instruction processing devices may be connected and transmit data through a specific structure;

Among them, the plurality of maximum pooling instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of maximum pooling instruction processing devices share the same control system or have their own control systems; the plurality of maximum pooling instruction processing devices share memory or have their own memories; the interconnection method of the plurality of maximum pooling instruction processing devices is any interconnection topology.

Clause I16. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnect interfaces and other processing devices as described in clause I15;

Clause I17. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause I15 or the combined processing device according to clause I16.

Clause I18. An electronic device, the electronic device comprising:

Machine learning chip as described in clause I17.

Clause I19. A board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause I17;

The storage device is used for storing data;

Clause I20. A method for processing maximum pooled instructions. The method is applied to a device for processing maximum pooled instructions. The device includes a control module and an arithmetic module. The method includes:

Use the control module to compile the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction, analyze the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction, and according to The operation code and the operation domain obtain the data to be operated, the pooling core, and the target address required to execute the maximum pooling instruction;

Using the operation module to perform a maximum pooling operation on the data to be calculated according to the pooling core, to obtain an operation result, and to store the operation result in the target address,

Clause I21. The method according to Clause I20, wherein performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain an operation result includes:

A plurality of comparators in the operation module are used to perform comparison operations on a plurality of data to be operated in the area corresponding to the pooled core to obtain operation results.

Clause I22. The method according to Clause I21, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,

Wherein, performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain an operation result, and storing the operation result in the target address includes:

The plurality of comparators are used to perform comparison operations on a plurality of data to be operated in the area corresponding to the pooled core to obtain operation results, and the operation results are stored in the target address.

Clause I23. The method according to Clause I20, the operation domain further includes an input height and an input width,

Wherein, obtaining the data to be operated, the pooling core and the target address required to execute the maximum pooling instruction according to the operation code and the operation domain includes:

Clause I24. The method according to Clause I20, the operation domain further includes a pooled core height and a pooled core width,

Wherein, obtaining the data to be operated, the pooling core and the target address required to execute the maximum pooling instruction according to the operation code and the operation domain include:

Clause I25. The method according to Clause I20, the operation domain further includes a first step,

Wherein, performing the maximum pooling operation on the data to be calculated according to the pooling core includes:

The pooling core is moved in the x direction according to the first step.

Clause I26. The method according to Clause I20, the operation domain further includes a second step,

The pooling core is moved in the y direction according to the second step.

Clause I27. The method according to Clause I20, wherein performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain an operation result includes:

Clause I28. The method according to Clause I20, wherein performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain an operation result includes:

When the size of the data to be calculated is a non-integer multiple of the size of the pooled core, perform a maximum pooling operation on the data to be calculated that is an integer multiple of the size of the pooled core,

Clause I29. The method according to Clause I20, the operation domain further includes the number of pooled cores,

Wherein, performing the maximum pooling operation on the data to be calculated according to the pooling core to obtain the operation result includes:

The maximum pooling operation is performed on the data to be calculated through a plurality of pooling cores whose number is the number of the pooling cores.

Clause I30. The method according to Clause I20, the method further comprising:

Using the storage module of the device to store the data to be calculated and the pooled core,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

The neuron buffer is used to store neuron data in the data to be operated, and the neuron data includes neuron vector data.

Clause I31. The method according to Clause I20, wherein parsing the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction includes:

Storing the compiled maximum pooling instruction;

Analyzing the compiled maximum pooling instruction to obtain the operation code and operation domain of the maximum pooling instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled maximum pooled instruction.

Clause I32. The method according to Clause I31, the method further comprising:

Clause I33. The method according to Clause I20, wherein compiling the obtained maximum pooling instruction to obtain the compiled maximum pooling instruction includes:

Generate an assembly file according to the maximum pooling instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled maximum pooling instruction.

Clause I34. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause I20 to Clause I33.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is at this stage no filling instruction that can be widely applied across programming languages. To implement a filling operation in a given language environment, technicians need to customize one or more instructions for that programming language environment, which makes the filling operation inefficient and slow. The present disclosure provides a filling instruction processing method, device, computer equipment, and storage medium. The filling operation can be realized with only one instruction, which can significantly improve the efficiency and speed of performing the filling operation.

FIG. 10-1 shows a block diagram of a filling instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 10-1, the device includes a control module 9-11 and an arithmetic module 9-12.

The control module 9-11 is used to compile the obtained filling instruction to obtain the compiled filling instruction, parse the compiled filling instruction to obtain the operation code and operation domain of the filling instruction, and obtain, according to the operation code and operation domain, the data to be calculated, the filling core and the target address required to execute the filling instruction. The operation code is used to indicate that the operation performed by the filling instruction on the data is a filling operation, and the operation domain includes the data address to be operated, the filling core address, and the target address.

The operation module 9-12 is used to perform pad operation (pad) on the data to be operated according to the filling core, obtain the operation result, and store the operation result in the target address.

In this embodiment, the filling instruction obtained by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the uncompiled filling instruction. After the compiled filling instruction is obtained, it can be parsed. The compiled filling instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain the data to be calculated and the filling core from the address of the data to be calculated and the address of the filling core, respectively. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), specified in a computer program to perform an operation; it is the instruction sequence number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction include the data to be operated, parameters such as the filling core, and the corresponding operation method. A filling instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be calculated, the filling core address, and the target address.

It should be understood that those skilled in the art can set the instruction format of the padding instruction, as well as the included operation codes and operation fields as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive the filling instruction and control one or more arithmetic modules to perform the filling operation. When the device includes multiple control modules, the multiple control modules may respectively receive the filling instruction and control the corresponding one or more arithmetic modules to perform the filling operation.

A filling instruction processing device provided by an embodiment of the present disclosure includes a control module and an arithmetic module. The control module is used to compile the obtained filling instruction to obtain a compiled filling instruction, parse the compiled filling instruction to obtain the operation code and operation domain of the filling instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the filling core and the target address required to execute the filling instruction. The arithmetic module is used to perform the filling operation on the data to be operated according to the filling core to obtain the operation result, and to store the operation result in the target address. The filling instruction processing device provided by the embodiments of the present disclosure has a wide application range, processes filling instructions efficiently and quickly, and performs filling operations efficiently and quickly.

FIG. 10-2a shows a block diagram of a filling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 10-2a, the arithmetic module 9-12 may include multiple comparators 9-120. The multiple comparators 9-120 are used to perform the filling operation on the data to be operated according to the filling core.

In this implementation, the arithmetic module may alternatively include a single comparator. The number of comparators can be set according to the amount of data on which the filling operation is to be performed, the required processing speed and efficiency of the filling operation, and so on, and the present disclosure does not limit this.

FIG. 10-2b shows a block diagram of a filling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 10-2b, the operation module 9-12 may include a master operation submodule 9-121 and a plurality of slave operation submodules 9-122, and the master operation submodule 9-121 includes multiple comparators 9-120 (not shown in the figure).

The master operation submodule 9-121 is used to perform the filling operation on the data to be calculated according to the filling core using the multiple comparators 9-120 to obtain an operation result, and to store the operation result in the target address.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be guaranteed, and the device can execute the filling instruction.

In a possible implementation manner, the operation domain may further include a filling core height and a filling core width.

Wherein, the control module 9-11 is also used to obtain, from the filling core address, the filling core corresponding to the filling core height and the filling core width.

In a possible implementation manner, when the padding core height and the padding core width are not included in the operation domain, the preset default padding core height and default padding core width may be acquired, so that the control module and the arithmetic module can execute the padding instruction.

In a possible implementation, the operation domain may further include the number of filling cores. Wherein, the arithmetic module 9-12 is also used to perform the filling operation on the data to be calculated by using a plurality of filling cores whose number is the number of filling cores.

In this implementation, the number of padding cores corresponds to the data to be calculated. For example, when the number of filling cores is 5, it can be determined that the data to be calculated can be divided into five parts, and five filling cores are required to perform filling operations on the five parts of the data to be calculated.

In this implementation manner, when the operation domain does not include the number of padding cores, it can be determined that only one padding core is needed for the data to be calculated to implement the padding operation.
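The fallback behaviour for optional operation-domain fields can be sketched as follows. This is only an illustration under stated assumptions, not the disclosed hardware logic: the class `OptionalPadFields`, the function `apply_defaults`, and the default height and width of 1 are hypothetical, because the disclosure does not state what the preset defaults are. Only the default count of one filling core follows from the text.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical preset defaults; the disclosure does not state the actual values.
DEFAULT_CORE_HEIGHT = 1
DEFAULT_CORE_WIDTH = 1
# Per the text, a single filling core is used when no count is given.
DEFAULT_CORE_COUNT = 1


@dataclass(frozen=True)
class OptionalPadFields:
    """Optional operation-domain fields; None means the field was absent."""
    core_height: Optional[int] = None
    core_width: Optional[int] = None
    core_count: Optional[int] = None


def apply_defaults(fields: OptionalPadFields) -> OptionalPadFields:
    """Return a copy in which every absent field is replaced by its default."""
    return replace(
        fields,
        core_height=DEFAULT_CORE_HEIGHT if fields.core_height is None else fields.core_height,
        core_width=DEFAULT_CORE_WIDTH if fields.core_width is None else fields.core_width,
        core_count=DEFAULT_CORE_COUNT if fields.core_count is None else fields.core_count,
    )


# Only the height is given; width and count fall back to the defaults.
resolved = apply_defaults(OptionalPadFields(core_height=2))
```

Keeping absent fields as `None` until a single resolution step makes it explicit which values came from the instruction and which were defaulted.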

In a possible implementation manner, as shown in FIGS. 10-2a and 10-2b, the device may further include a storage module 9-13. The storage module 9-13 is used for storing the data to be calculated and the filling core.

In a possible implementation manner, the control module 9-11 may also be used to generate an assembly file according to the filling instruction and translate the assembly file into a binary file, where the binary file is a compiled filling instruction.
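The two-stage compilation (software instruction to assembly file, assembly file to binary) can be illustrated with a toy assembler. Everything concrete here is an assumption made for illustration: the textual assembly syntax, the opcode number `PAD_OPCODE`, and the encoding of each operand as a 32-bit little-endian unsigned integer are hypothetical, since the disclosure does not specify a binary format.

```python
import struct
from typing import Sequence

PAD_OPCODE = 0x2A  # hypothetical numeric opcode; not taken from the disclosure


def to_assembly(mnemonic: str, operands: Sequence[int]) -> str:
    """Stage 1: render the software instruction as one line of assembly text."""
    return mnemonic + " " + ", ".join(str(value) for value in operands)


def assemble(assembly_line: str) -> bytes:
    """Stage 2: translate one assembly line into a binary instruction."""
    mnemonic, _, rest = assembly_line.partition(" ")
    if mnemonic != "pad":
        raise ValueError("unsupported mnemonic: " + mnemonic)
    operands = [int(token) for token in rest.split(",")]
    # One opcode byte followed by each operand as a 32-bit little-endian integer.
    return struct.pack("<B%dI" % len(operands), PAD_OPCODE, *operands)


assembly = to_assembly("pad", [500, 100, 200, 5, 64, 32, 2, 1])
binary = assemble(assembly)
```

A real toolchain would emit files rather than in-memory strings; the sketch only shows that the binary form is a mechanical translation of the assembly form.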

In a possible implementation manner, the instruction format of the filling instruction may be:

pad dst src0 src channel srcHeight srcWidth padHeight padWdith

Wherein, pad is the operation code of the filling instruction, and dst, src0, src, channel, srcHeight, srcWidth, padHeight, and padWdith are the operation fields of the filling instruction. Wherein, dst is the target address, src0 is the data address to be calculated, src is the filling core or the filling core address, channel is the number of filling cores, srcHeight is the input height of the data to be calculated, srcWidth is the input width of the data to be calculated, padHeight is the filling core height, and padWdith is the filling core width.

It should be understood that, those skilled in the art can set the operation code of the filling instruction, the position of the operation code and the operation field in the instruction format as needed, and the disclosure does not limit this.
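As a minimal sketch of how an instruction in this format could be split into its operation code and operation fields, the following parser assumes the field order dst, src0, src, channel, srcHeight, srcWidth, padHeight, padWdith given in the field description above, and assumes that operands are integers separated by spaces or commas. The function name `parse_pad_instruction` is hypothetical.

```python
from typing import Dict, Tuple

# Field order as listed in the field description; assumed to be positional.
FIELD_NAMES = (
    "dst", "src0", "src", "channel",
    "srcHeight", "srcWidth", "padHeight", "padWdith",
)


def parse_pad_instruction(text: str) -> Tuple[str, Dict[str, int]]:
    """Split an instruction string into (operation code, operation fields)."""
    tokens = text.replace(",", " ").split()
    if not tokens:
        raise ValueError("empty instruction")
    opcode, operands = tokens[0], tokens[1:]
    if opcode != "pad":
        raise ValueError("not a filling instruction: " + opcode)
    if len(operands) != len(FIELD_NAMES):
        raise ValueError("expected %d operands, got %d" % (len(FIELD_NAMES), len(operands)))
    return opcode, dict(zip(FIELD_NAMES, (int(tok) for tok in operands)))
```

Because the disclosure leaves the instruction format open, a different field order or a different opcode position would only require changing `FIELD_NAMES` and the token handling.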

It should be noted that although the above embodiment is taken as an example to introduce the filling instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "filling operation using a filling instruction processing device" as an exemplary application scenario, so as to facilitate understanding of the flow of the filling instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 10-3 shows a schematic diagram of an application scenario of a filling instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 10-3, the filling instruction processing device processes the filling instruction as follows:

The control module 9-11 compiles the obtained filling instruction 1 to obtain the compiled filling instruction 1 (for example, the filling instruction 1 is pad 500, 100, 200, 5, 64, 32, 2, 1), and parses the compiled filling instruction 1 to obtain the operation code and operation domain of the filling instruction 1. The operation code of the filling instruction 1 is pad, the target address is 500, the data address to be calculated is 100, the filling core address is 200, the number of filling cores is 5, the input height is 64, the input width is 32, the filling core height is 2, and the filling core width is 1. The control module 9-11 acquires the 64 × 32 data to be calculated from the data address to be calculated 100, and acquires a 2 × 1 filling core from the filling core address 200.

The arithmetic module 9-12 performs the filling operation on the data to be calculated using the 5 filling cores to obtain the operation result, and stores the operation result in the target address 500.
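The data flow of this example (read the data at address 100, fill it, write the result to address 500) can be sketched in software. The disclosure does not define the arithmetic of the filling operation or the contents of the filling core, so the sketch substitutes a common interpretation and should be read as an assumption throughout: the filling core height and width (2 and 1) are treated as the number of border rows and columns added on each side, the fill value is zero, the number of filling cores is ignored, and the device memory is modelled as a plain dictionary. The names `pad_2d` and `memory` are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def pad_2d(data: Matrix, pad_height: int, pad_width: int, fill: float = 0.0) -> Matrix:
    """Surround `data` with pad_height rows and pad_width columns of `fill`.

    Assumed semantics (constant border padding); not taken from the disclosure.
    """
    in_width = len(data[0]) if data else 0
    out_width = in_width + 2 * pad_width
    padded: Matrix = [[fill] * out_width for _ in range(pad_height)]
    for row in data:
        padded.append([fill] * pad_width + list(row) + [fill] * pad_width)
    padded.extend([fill] * out_width for _ in range(pad_height))
    return padded


# Toy memory keyed by address: 64 x 32 input at address 100, as in the example.
memory: Dict[int, Matrix] = {100: [[1.0] * 32 for _ in range(64)]}

# Execute the filling step and store the result at the target address 500.
memory[500] = pad_2d(memory[100], pad_height=2, pad_width=1)
```

Under these assumptions the result stored at address 500 has 64 + 2 × 2 = 68 rows and 32 + 2 × 1 = 34 columns.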

In this way, the filling instruction processing device can process filling instructions efficiently and quickly, and the filling operation is performed with high efficiency and high speed.

FIG. 10-4 shows a flowchart of a filling instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following step S51-9 and step S52-9. As shown in FIG. 10-4, the method is applied to the above-mentioned filling instruction processing device, and the method includes step S51-9 and step S52-9.

In step S51-9, the control module is used to compile the obtained filling instruction to obtain a compiled filling instruction, parse the compiled filling instruction to obtain the operation code and operation domain of the filling instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the filling core and the target address required to execute the filling instruction. The operation code is used to indicate that the operation performed by the filling instruction on the data is a filling operation, and the operation domain includes the data address to be operated, the filling core address, and the target address.

In step S52-9, the arithmetic module is used to perform the filling operation on the data to be calculated according to the filling core to obtain the operation result, and the operation result is stored in the target address.

In a possible implementation manner, performing the filling operation on the data to be calculated according to the filling core to obtain the operation result may include:

Using multiple comparators in the arithmetic module to perform the filling operation on the data to be calculated according to the filling core.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple comparators. Wherein, step S52-9 may include:

Using the multiple comparators in the master operation sub-module to perform the filling operation on the data to be calculated according to the filling core to obtain the operation result, and storing the operation result in the target address.

In a possible implementation manner, the operation domain may further include an input height and an input width. Wherein, obtaining the data to be calculated, the filling core and the target address required to execute the filling instruction according to the operation code and the operation domain may include:

In a possible implementation manner, the operation domain may further include a filling core height and a filling core width. Wherein, obtaining the data to be calculated, the filling core and the target address required to execute the filling instruction according to the operation code and the operation domain may include: obtaining, from the filling core address, the filling core corresponding to the filling core height and the filling core width.

In a possible implementation, the operation domain may further include the number of filling cores. Wherein, performing the filling operation on the data to be calculated according to the filling core to obtain the operation result may include:

Performing the filling operation on the data to be calculated by using a plurality of filling cores whose number is the number of filling cores.

In a possible implementation manner, the method may further include: using the storage module of the device to store the data to be calculated and the filling core,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be calculated and the filling core. The cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated.

In a possible implementation manner, parsing the compiled padding instruction to obtain the operation code and operation field of the padding instruction may include:

Storing the compiled filling instruction;

Parsing the compiled filling instruction to obtain the operation code and operation domain of the filling instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged sequentially in execution order, and the plurality of instructions to be executed may include the compiled filling instruction.
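The instruction queue described here behaves like a first-in, first-out buffer of compiled instructions. A minimal software sketch, assuming compiled instructions are represented as `bytes` objects and using the hypothetical class name `InstructionQueue`, is:

```python
from collections import deque
from typing import Deque, Optional


class InstructionQueue:
    """Holds compiled instructions in the order in which they are to be executed."""

    def __init__(self) -> None:
        self._pending: Deque[bytes] = deque()

    def enqueue(self, instruction: bytes) -> None:
        """Append an instruction behind all instructions already waiting."""
        self._pending.append(instruction)

    def next_instruction(self) -> Optional[bytes]:
        """Remove and return the oldest waiting instruction, or None if empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self) -> int:
        return len(self._pending)


queue = InstructionQueue()
queue.enqueue(b"compiled-filling-instruction")  # placeholder payloads
queue.enqueue(b"some-other-instruction")
first = queue.next_instruction()
```

A `deque` is used because it supports constant-time insertion at one end and removal at the other, which matches the sequential execution order described above.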

In a possible implementation manner, the method may further include:

It should be noted that although the above embodiment is taken as an example to introduce the filling instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The filling instruction processing method provided by the embodiments of the present disclosure has a wide range of application, and has high processing efficiency and fast processing speed for filling instructions, and high processing efficiency and fast speed for performing filling operations.

The present disclosure also provides a non-volatile computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above filling instruction processing method is realized.

The foregoing can be better understood based on the following clauses:

Clause J1, a filling instruction processing device, the device comprising:

The control module is used to compile the obtained filling instruction to obtain a compiled filling instruction, parse the compiled filling instruction to obtain an operation code and an operation domain of the filling instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the filling core and the target address required to execute the filling instruction, wherein the operation code is used to indicate that the operation performed by the filling instruction on the data is a filling operation, and the operation domain includes a data address to be operated, a filling core address, and the target address;

The operation module is configured to perform the filling operation on the data to be operated according to the filling core, obtain an operation result, and store the operation result in the target address.

Clause J2. The device according to Clause J1, the operation module includes:

A plurality of comparators are used to perform filling operation on the data to be calculated according to the filling core.

Clause J3. The device according to Clause J2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,

The master operation sub-module is configured to use the plurality of comparators to perform the filling operation on the data to be operated according to the filling core to obtain an operation result, and store the operation result in the target address.

Clause J4. The device according to Clause J1, the operation domain further includes an input height and an input width,

Clause J5. The device according to Clause J1, the operation domain further includes a filling core height and a filling core width,

Wherein, the control module is further configured to acquire the filling core corresponding to the height of the filling core and the width of the filling core from the address of the filling core.

Clause J6. The device according to Clause J1, the operation domain further includes a number of filling cores,

Wherein, the operation module is further used to perform filling operation on the data to be operated by using a plurality of filling cores whose number is the number of filling cores.

Clause J7. The device according to Clause J1, the device further comprising:

A storage module for storing the data to be calculated and the filling core,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be calculated and the filling core, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause J8. The device according to Clause J1, the control module comprising:

Instruction storage submodule, used to store the compiled filling instruction;

An instruction processing sub-module, which is used to parse the compiled filling instruction to obtain the operation code and operation domain of the filling instruction;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the plurality of instructions to be executed include the compiled filling instructions.

Clause J9. The device according to Clause J8, the control module, further comprising:

Clause J10, the device according to Clause J1,

The control module is also used to generate an assembly file according to the filling instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled filling instruction.

Clause J11. A machine learning computing device, the device comprising:

One or more filling instruction processing devices as described in any one of Clause J1 to Clause J10, which are used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the filling instruction processing devices, the plurality of filling instruction processing devices may be connected and transmit data through a specific structure;

Wherein, a plurality of the filling instruction processing devices are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; a plurality of the filling instruction processing devices share the same control system or have their own respective control systems; a plurality of the filling instruction processing devices share memory or have their own memories; the interconnection method of the plurality of filling instruction processing devices is an arbitrary interconnection topology.

Clause J12. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause J11;

Clause J13. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause J11 or the combined processing device according to clause J12.

Clause J14. An electronic device, the electronic device comprising:

Machine learning chip as described in clause J13.

Clause J15, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause J13;

The storage device is used for storing data;

Clause J16. A filling instruction processing method, the method is applied to a filling instruction processing device, the device includes a control module and an arithmetic module, and the method includes:

Using the control module to compile the obtained filling instruction to obtain a compiled filling instruction, parse the compiled filling instruction to obtain an operation code and an operation domain of the filling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the filling core and the target address required to execute the filling instruction;

Using an operation module to perform filling operation on the data to be operated according to the filling core to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the filling instruction on the data is a filling operation, and the operation domain includes a data address to be calculated, a filling core address, and the target address.

Clause J17. The method according to Clause J16, wherein performing the filling operation on the data to be calculated according to the filling core to obtain an operation result comprises:

A plurality of comparators in the calculation module are used to perform a filling operation on the data to be calculated according to the filling core to obtain an operation result.

Clause J18. The method according to Clause J17, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,

Wherein, performing the filling operation on the data to be operated according to the filling core to obtain an operation result, and storing the operation result in the target address includes:

Using the plurality of comparators in the master operation sub-module to perform the filling operation on the data to be operated according to the filling core to obtain an operation result, and storing the operation result in the target address.

Clause J19. The method according to Clause J16, the operation domain further includes an input height and an input width,

Wherein, obtaining the data to be operated, the padding core and the target address required to execute the padding instruction according to the operation code and the operation domain include:

Clause J20. The method according to Clause J16, the operation domain further includes a filling core height and a filling core width,

Obtaining, from the filling core address, the filling core corresponding to the filling core height and the filling core width.

Clause J21. The method according to Clause J16, the operation domain further includes a number of filling cores,

Wherein, performing filling operation on the data to be calculated according to the filling core to obtain an operation result includes:

Performing the filling operation on the data to be calculated by using a plurality of filling cores whose number is the number of filling cores.

Clause J22. The method according to Clause J16, the method further comprising:

Using the storage module of the device to store the data to be calculated and the filling core,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause J23. The method according to Clause J16, wherein parsing the compiled filling instruction to obtain the operation code and operation domain of the filling instruction comprises:

Storing the compiled filling instruction;

Parsing the compiled filling instruction to obtain the operation code and operation domain of the filling instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged sequentially in execution order, and the plurality of instructions to be executed include the compiled filling instruction.

Clause J24. The method according to Clause J23, the method further comprising:

Clause J25. The method according to Clause J16, wherein compiling the obtained filling instruction to obtain the compiled filling instruction comprises:

Generating an assembly file according to the filling instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled filling instruction.

Clause J26. A non-volatile computer-readable storage medium having computer program instructions stored thereon. When the computer program instructions are executed by a processor, the method according to any one of clauses J16 to J25 is implemented.

Due to the widespread use of neural network algorithms and the continuous improvement of computer hardware computing capabilities, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is at this stage no matrix transposition instruction that can be widely applied across programming languages. To implement a matrix transposition operation in a given language environment, technicians therefore need to define multiple instructions for that programming language environment, which results in low efficiency and slow speed of the matrix transposition operation. The present disclosure provides a matrix transposition instruction processing method, device, computer equipment, and storage medium. The matrix transposition operation can be implemented with only one instruction, which can significantly improve the efficiency and speed of the matrix transposition operation.

FIG. 11-1 shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 11-1, the device includes a control module 10-11 and an arithmetic module 10-12.

The control module 10-11 is used to compile the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the target address, and the input height and input width of the data to be operated required to execute the matrix transposition instruction. The operation code is used to indicate that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes the data address to be operated, the input height, the input width, and the target address.

The operation module 10-12 is used to perform matrix transposition operation on the data to be calculated according to the input height and input width to obtain the transposed data, and store the transposed data in the target address.
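Why the input height and input width are needed can be seen from a software sketch of transposition over a flat buffer. The sketch assumes the data to be calculated is stored contiguously in row-major order, which the disclosure does not state; the function name `transpose_flat` is hypothetical.

```python
from typing import List, Sequence


def transpose_flat(data: Sequence[float], height: int, width: int) -> List[float]:
    """Transpose a height x width matrix stored row-major in a flat sequence.

    The result is the width x height transposed matrix, also row-major.
    """
    if len(data) != height * width:
        raise ValueError("data length does not match height * width")
    transposed = [0.0] * (height * width)
    for row in range(height):
        for col in range(width):
            # Element (row, col) of the input becomes element (col, row) of the output.
            transposed[col * height + row] = data[row * width + col]
    return transposed


# A 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored flat.
result = transpose_flat([1, 2, 3, 4, 5, 6], height=2, width=3)
```

Without the height and width, the flat buffer alone does not determine where one row ends and the next begins, so the element mapping above could not be computed.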

In this embodiment, the matrix transposition instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the matrix transposition instruction (uncompiled). After the compiled matrix transposition instruction is obtained, the compiled matrix transposition instruction can be analyzed. The compiled matrix transposition instructions are hardware instructions that can be directly executed by hardware.

In this embodiment, the control module can obtain the data to be calculated from the data address to be calculated. The operation domain may include an input height and an input width, or the operation domain may include a storage address that stores the input height and input width of the data to be calculated. When the specific values of the input height and input width of the data to be calculated are directly included in the operation domain, those values can be determined as the input height and input width. When the storage address of the input height and input width is included in the operation domain, the input height and input width can be obtained from the corresponding storage address. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
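The two ways of supplying the input height and input width (a value held directly in the operation domain, or a storage address from which the value is read) can be sketched as below. The disclosure does not say how the two cases are distinguished, so the `is_address` flag, the `Operand` class, and the dictionary used as memory are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Operand:
    """An operation-domain field holding either a value or a storage address."""
    value: int
    is_address: bool = False  # hypothetical flag; not defined by the disclosure


def resolve(operand: Operand, memory: Dict[int, int]) -> int:
    """Return the operand's value, reading it from memory if it is an address."""
    if operand.is_address:
        return memory[operand.value]
    return operand.value


memory = {0x40: 64}  # toy memory: address 0x40 holds an input height of 64

input_height = resolve(Operand(0x40, is_address=True), memory)  # read from storage
input_width = resolve(Operand(32), memory)                      # used directly
```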

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), specified in a computer program to perform an operation; it is the instruction sequence number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the data to be operated, parameters such as the input height and input width of the data to be operated, and the corresponding operation method. A matrix transposition instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be operated, the input height, the input width, and the target address.

It should be understood that those skilled in the art can set the instruction format of the matrix transposition instruction, as well as the operation code and operation domain it includes, as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module can receive the matrix transposition instruction and control one or more operation modules to perform the matrix transposition operation. When the device includes multiple control modules, the multiple control modules may each receive matrix transposition instructions and control the corresponding one or more operation modules to perform matrix transposition operations.

The matrix transposition instruction processing device provided by an embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction, and obtain, according to the operation code and operation domain, the data to be operated, the target address, and the input height and input width of the data to be operated that are required to execute the matrix transposition instruction. The operation module is used to perform the matrix transposition operation on the data to be calculated according to the input height and input width to obtain the transposed data, and to store the transposed data in the target address. The matrix transposition instruction processing device provided by the embodiments of the present disclosure has a wide range of applications, processes matrix transposition instructions efficiently and quickly, and performs matrix transposition operations efficiently and quickly.

FIG. 11-2a shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 11-2a, the operation module 10-12 may include a plurality of matrix transposition operators 10-120. The plurality of matrix transposition operators 10-120 are used to perform the matrix transposition operation on the data to be calculated according to the input height and input width. The height of the transposed data is equal to the input width, and the width of the transposed data is equal to the input height.

In this implementation manner, the operation module may alternatively include a single matrix transposition operator. The number of matrix transposition operators can be set according to the amount of data required for the matrix transposition operation and the processing speed, processing efficiency, and other requirements of the matrix transposition operation, which is not limited in this disclosure.

FIG. 11-2b shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 11-2b, the operation module 10-12 may include a master operation sub-module 10-121 and a plurality of slave operation sub-modules 10-122, and the master operation sub-module 10-121 includes a plurality of matrix transposition operators 10-120 (not shown in the figure).

The master operation sub-module 10-121 is used to perform the matrix transposition operation on the data to be calculated according to the input height and input width using the plurality of matrix transposition operators 10-120 to obtain the transposed data, and to store the transposed data in the target address.

In a possible implementation manner, as shown in FIGS. 11-2a and 11-2b, the device may further include a storage module 10-13. The storage module 10-13 is used to store the data to be calculated.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store data to be calculated, and the register can be used to store scalar data in the data to be calculated.

In a possible implementation manner, the control module 10-11 may also be used to generate an assembly file according to the matrix transposition instruction and translate the assembly file into a binary file, where the binary file is a compiled matrix transposition instruction.

In a possible implementation manner, the instruction format of the matrix transpose instruction may be:

transpose dst src srcHeight srcWidth

Here, transpose is the operation code of the matrix transposition instruction, and dst, src, srcHeight, and srcWidth form its operation domain: dst is the target address, src is the data address to be calculated, srcHeight is the input height of the data to be calculated, and srcWidth is the input width of the data to be calculated.
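The split of this instruction format into an operation code and an operation domain can be illustrated with a short Python sketch. It assumes a textual form of the instruction with whitespace- or comma-separated decimal operands; the class `TransposeInstruction` and the function `parse_transpose` are hypothetical names, and the actual device parses a compiled (binary) instruction whose encoding the disclosure does not specify.

```python
from typing import NamedTuple


class TransposeInstruction(NamedTuple):
    """Fields of `transpose dst src srcHeight srcWidth` (illustrative)."""
    opcode: str
    dst: int         # target address
    src: int         # data address to be calculated
    src_height: int  # input height
    src_width: int   # input width


def parse_transpose(text: str) -> TransposeInstruction:
    """Split a textual transpose instruction into opcode and operation domain."""
    # Accept both space-separated and comma-separated operands.
    tokens = text.replace(",", " ").split()
    if len(tokens) != 5 or tokens[0] != "transpose":
        raise ValueError(f"not a transpose instruction: {text!r}")
    dst, src, height, width = (int(token) for token in tokens[1:])
    return TransposeInstruction(tokens[0], dst, src, height, width)


# Sample operand values chosen for illustration.
instruction = parse_transpose("transpose 500 100 64 32")
```

Because the operand order is fixed in this sketch, reordering the fields in the instruction format would only require changing the unpacking line.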

It should be understood that those skilled in the art may set the positions of the operation code and the operation domain in the instruction format of the matrix transposition instruction as needed, and this disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the matrix transposition instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following takes "a matrix transposition instruction processing device performing a matrix transposition operation" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, so as to facilitate understanding of the flow of the matrix transposition instruction processing device. Those skilled in the art should understand that the following application example is only intended to facilitate understanding of the embodiments of the present disclosure and should not be considered a limitation on the embodiments of the present disclosure.

FIG. 11-3 shows a schematic diagram of an application scenario of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 11-3, the matrix transposition instruction processing device processes the matrix transposition instruction as follows:

The control module 10-11 compiles the obtained matrix transposition instruction 1 to obtain the compiled matrix transposition instruction 1 (for example, the matrix transposition instruction 1 is transpose 500, 100, 64, 32). The compiled matrix transposition instruction 1 is analyzed to obtain the operation code and operation domain of the matrix transposition instruction 1. The operation code of the matrix transposition instruction 1 is transpose, the target address is 500, the data address to be calculated is 100, the input height is 64, and the input width is 32. The control module 10-11 acquires 64 × 32 data to be calculated from the data address 100 to be calculated.

The operation module 10-12 performs matrix transposition operation on the data to be operated to obtain 32 × 64 transposed data, and stores the transposed data in the target address 500.
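The application example above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: addresses 100 and 500 are modeled as keys of a dictionary of separate buffers (the disclosure does not define the memory layout or addressing granularity), the data is held as a list of rows, and `execute_transpose` is a hypothetical name.

```python
from typing import Dict, List

Matrix = List[List[int]]


def execute_transpose(memory: Dict[int, Matrix], dst: int, src: int,
                      height: int, width: int) -> None:
    """Transpose the height x width data at `src` and store it at `dst`."""
    data = memory[src]
    if len(data) != height or any(len(row) != width for row in data):
        raise ValueError("data shape does not match srcHeight x srcWidth")
    # Element (r, c) of the input becomes element (c, r) of the output,
    # so the output has `width` rows and `height` columns.
    memory[dst] = [[data[r][c] for r in range(height)] for c in range(width)]


# 64 x 32 data to be calculated, filled with distinct illustrative values.
memory = {100: [[r * 32 + c for c in range(32)] for r in range(64)]}
execute_transpose(memory, dst=500, src=100, height=64, width=32)
result = memory[500]  # 32 rows of 64 values each
```

The shape of `result` matches the statement that the height of the transposed data equals the input width and its width equals the input height.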

In this way, the matrix transposition instruction processing device can process the matrix transposition instruction efficiently and quickly, and the matrix transposition operation has high processing efficiency and fast processing speed.

FIG. 11-4 shows a flowchart of a matrix transposition instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as steps S51-10 and S52-10 below. As shown in FIG. 11-4, the method is applied to the above matrix transposition instruction processing device and includes steps S51-10 and S52-10.

In step S51-10, the control module is used to compile the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain the operation code and the operation domain of the matrix transposition instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the target address, and the input height and input width of the data to be operated that are required to execute the matrix transposition instruction. The operation code is used to indicate that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes the data address to be operated, the input height, the input width, and the target address.

In step S52-10, the operation module is used to perform matrix transposition operation on the data to be calculated according to the input height and input width to obtain transposed data, and the transposed data is stored in the target address.

In a possible implementation manner, according to the input height and the input width, performing matrix transposition operation on the data to be operated to obtain transposed data may include:

A plurality of matrix transpose operators in the arithmetic module are used to perform matrix transpose calculation on the data to be calculated according to the input height and input width. Among them, the height of the transposed data is equal to the input width, and the width of the transposed data is equal to the input height.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple matrix transpose operators. Wherein, step S52-10 may include:

A plurality of matrix transposition operators in the main operation sub-module are used to perform matrix transposition operation on the data to be calculated according to the input height and input width to obtain transposed data, and store the transposed data in the target address.

In a possible implementation manner, the method may further include: storing the data to be calculated by using a storage module of the device,

wherein the storage module includes at least one of a register and a cache;

the cache is used to store the data to be calculated, and the cache includes at least one neuron cache NRAM;

the register is used to store scalar data in the data to be calculated.

In a possible implementation manner, parsing the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction includes:

storing the compiled matrix transposition instruction;

parsing the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction;

storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed may include the compiled matrix transposition instruction.

In a possible implementation manner, compiling the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction may include:

An assembly file is generated according to the matrix transposition instruction, and the assembly file is translated into a binary file, where the binary file is the compiled matrix transposition instruction.

It should be noted that although the above embodiment is used as an example to introduce the matrix transposition instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The method for processing matrix transposition instructions provided by the embodiments of the present disclosure has a wide range of applications, and has high processing efficiency and fast processing speed for matrix transposition instructions, and high processing efficiency and fast processing speed for matrix transposition operations.

The foregoing can be better understood based on the following clauses:

Clause K1. A matrix transposition instruction processing device, the device comprising:

A control module, configured to compile the obtained matrix transposition instruction to obtain a compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain an operation code and an operation domain of the matrix transposition instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the target address, and the input height and the input width of the data to be operated that are required to execute the matrix transposition instruction, wherein the operation code is used to indicate that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes a data address to be operated, the input height, the input width, and the target address;

The operation module is configured to perform matrix transposition operation on the data to be calculated according to the input height and the input width to obtain transposed data, and store the transposed data in the target address.

Clause K2. The device according to Clause K1, the operation module includes:

A plurality of matrix transpose operators, used to perform matrix transpose operations on the data to be calculated according to the input height and the input width,

Wherein, the height of the transposed data is equal to the input width, and the width of the transposed data is equal to the input height.

Clause K3. The device according to Clause K2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of matrix transpose operators,

The master operation sub-module is configured to use the plurality of matrix transposition operators to perform matrix transposition operation on the data to be calculated according to the input height and the input width to obtain transposed data, and store the transposed data in the target address.

Clause K4. The device according to Clause K1, the device further comprising:

A storage module for storing the data to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store the data to be calculated, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause K5. The device according to Clause K1, the control module includes:

An instruction storage sub-module for storing the compiled matrix transposition instruction;

An instruction processing sub-module, which is used to parse the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled matrix transposition instruction.

Clause K6. The device according to Clause K5, the control module, further comprising:

Clause K7. The device according to Clause K1,

The control module is also used to generate an assembly file according to the matrix transposition instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled matrix transposition instruction.

Clause K8. A machine learning computing device, the device comprising:

One or more matrix transposition instruction processing devices as described in any one of Clause K1 to Clause K7, used to obtain the data to be calculated and control information from other processing devices, execute the specified machine learning operation, and pass the execution result to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of matrix transposition instruction processing devices, the plurality of matrix transposition instruction processing devices can be connected and transmit data through a specific structure;

Wherein a plurality of the matrix transposition instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale machine learning operations; a plurality of the matrix transposition instruction processing devices share the same control system or have their own control systems; a plurality of the matrix transposition instruction processing devices share memory or have their own memories; the interconnection mode of the plurality of matrix transposition instruction processing devices is an arbitrary interconnection topology.

Clause K9. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause K8, a universal interconnection interface, and other processing devices;

Clause K10. A machine learning chip, the machine learning chip including:

The machine learning computing device according to Clause K8 or the combined processing device according to Clause K9.

Clause K11. An electronic device, the electronic device comprising:

Machine learning chip as described in clause K10.

Clause K12. A board card, including: a storage device, an interface device, a control device, and the machine learning chip as described in Clause K10;

The storage device is used for storing data;

Clause K13. A method for processing matrix transposition instructions. The method is applied to a matrix transposition instruction processing apparatus. The apparatus includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction, and obtain, according to the operation code and the operation domain, the data to be operated, the target address, and the input height and the input width of the data to be operated that are required to execute the matrix transposition instruction;

Using an arithmetic module to perform matrix transposition on the data to be calculated according to the input height and the input width to obtain transposed data, and store the transposed data in the target address,

Wherein, the operation code is used to indicate that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation field includes the data address to be operated, the input height, the input width, and the target address.

Clause K14. The method according to Clause K13, wherein performing matrix transposition on the data to be calculated according to the input height and the input width to obtain transposed data includes:

Using a plurality of matrix transposition operators in the calculation module to perform matrix transposition calculation on the data to be calculated according to the input height and the input width,

Clause K15. The method according to Clause K14, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of matrix transpose operators,

Wherein, according to the input height and the input width, performing matrix transposition operation on the data to be operated to obtain transposed data, and storing the transposed data in the target address includes:

Using the plurality of matrix transposition operators in the master operation sub-module to perform matrix transposition operation on the data to be calculated according to the input height and the input width to obtain transposed data, and storing the transposed data in the target address.

Clause K16. The method according to Clause K13, the method further comprising:

Use the storage module of the device to store the data to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause K17. The method according to Clause K13, wherein parsing the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction includes:

Storing the compiled matrix transposition instruction;

Analyzing the compiled matrix transposition instruction to obtain the operation code and operation domain of the matrix transposition instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled matrix transposition instruction.

Clause K18. The method according to Clause K17, the method further comprising:

Clause K19. The method according to Clause K13, wherein compiling the obtained matrix transposition instruction to obtain the compiled matrix transposition instruction includes:

Generate an assembly file according to the matrix transposition instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled matrix transposition instruction.

Clause K20. A non-volatile computer-readable storage medium having computer program instructions stored thereon, said computer program instructions being executed by a processor to implement the method of any one of Clauses K13 to K19.

With the extensive use of neural network algorithms and the continuous improvement in the capability of computer hardware arithmetic units, the types and number of data operations involved in practical applications continue to increase. Average pooling computes the average value of all data in a local region. Programming languages are diverse, and in the related art there is currently no average pooling instruction that can be widely applied across programming languages. To implement the average pooling operation in a given language environment, technicians therefore need to define multiple instructions specific to that programming language environment, which results in low efficiency and slow speed when performing average pooling operations. The present disclosure provides an average pooling instruction processing method, device, computer equipment, and storage medium. The average pooling operation can be realized with only one instruction, which can significantly improve the efficiency and speed of performing the average pooling operation.

FIG. 12-1 shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 12-1, the device includes a control module 11-11 and an operation module 11-12.

The control module 11-11 is used to compile the obtained average pooling instruction to obtain the compiled average pooling instruction, parse the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the pooling core, and the target address required to execute the average pooling instruction. The operation code is used to indicate that the operation performed by the average pooling instruction on the data is the average pooling operation, and the operation domain includes the data address to be calculated, the pooling core address, and the target address.

The operation module 11-12 is configured to perform average pooling operation on the data to be calculated according to the pooling core, obtain the operation result, and store the operation result in the target address.

In this embodiment, the average pooling instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the (uncompiled) average pooling instruction. After the compiled average pooling instruction is obtained, it can be parsed. The compiled average pooling instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain the data to be calculated and the pooling core from the data address to be calculated and the pooling core address, respectively. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), specified in the computer program to perform the operation; it is the instruction sequence number that informs the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction includes the data to be operated, parameters such as the pooling core, and the corresponding operation method. An average pooling instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be calculated, the pooling core address, and the target address.

It should be understood that those skilled in the art can set the instruction format of the average pooling instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module can receive the average pooling instruction and control one or more operation modules to perform the average pooling operation. When the device includes multiple control modules, the multiple control modules may each receive the average pooling instruction and control the corresponding one or more operation modules to perform the average pooling operation.

The average pooling instruction processing device provided by an embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained average pooling instruction to obtain the compiled average pooling instruction, parse the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, and obtain, according to the operation code and operation domain, the data to be calculated, the pooling core, and the target address required to execute the average pooling instruction. The operation module is used to perform the average pooling operation on the data to be calculated according to the pooling core to obtain the operation result, and to store the operation result in the target address. The average pooling instruction processing device provided by the embodiments of the present disclosure has a wide application range, processes average pooling instructions efficiently and quickly, and performs average pooling operations efficiently and quickly.

FIG. 12-2a shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 12-2a, the operation module 11-12 may include a plurality of adders 11-120 and a plurality of dividers 11-120'. The plurality of adders 11-120 are used to perform the addition operations in the average pooling operation. The plurality of dividers 11-120' are used to perform the division operations in the average pooling operation.

In this implementation manner, the operation module may also include one adder and one divider, one adder and multiple dividers, or multiple adders and one divider. The number of adders and dividers can be set according to the amount of data required for the average pooling operation and the processing speed, efficiency, and other requirements of the average pooling operation, which is not limited in the present disclosure.
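To make the division of labor between adders and dividers concrete, the following Python sketch averages the data covered by one position of the pooling core. How the hardware actually distributes work among adders is not described in the disclosure; splitting the window into interleaved slices, one per adder, is purely an assumption, and `window_average` and `num_adders` are hypothetical names.

```python
from typing import Sequence


def window_average(values: Sequence[float], num_adders: int = 2) -> float:
    """Average one pooling window using partial sums and a single division."""
    if not values:
        raise ValueError("window must contain at least one value")
    if num_adders < 1:
        raise ValueError("at least one adder is required")

    # Addition step: each (hypothetical) adder accumulates one slice.
    partial_sums = []
    for adder_index in range(num_adders):
        partial = 0.0
        for value in values[adder_index::num_adders]:
            partial += value
        partial_sums.append(partial)

    # The partial sums are combined with further additions.
    total = 0.0
    for partial in partial_sums:
        total += partial

    # Division step: divide the sum by the number of values in the window.
    return total / len(values)


average = window_average([1, 2, 5, 6], num_adders=2)
```

With one adder the same function degenerates to a plain running sum followed by one division, which corresponds to the one-adder, one-divider configuration mentioned above.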

FIG. 12-2b shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 12-2b, the operation module 11-12 may include a master operation sub-module 11-121 and a plurality of slave operation sub-modules 11-122. The master operation sub-module 11-121 may include multiple adders and multiple dividers.

The master operation sub-module 11-121 is used to perform the addition and division operations in the average pooling operation using the multiple adders and the multiple dividers, respectively, to obtain an operation result, and to store the operation result in the target address.

In the above manner, the data amount and size of the operation data can be limited, the accuracy of the operation result can be ensured, and the device can execute the average pooling instruction.

The control module 11-11 is also used to obtain the pooling core from the pooling core address according to the pooling core height and the pooling core width.

In a possible implementation manner, the operation domain may further include a first stride. The operation module 11-12 can also be used to move the pooling core in the x direction according to the first stride.

In a possible implementation manner, the operation domain may further include a second stride. The operation module 11-12 can also be used to move the pooling core in the y direction according to the second stride.

In this implementation, the stride of the average pooling operation is the distance by which the pooling core moves each time during the average pooling operation. The first stride may be the distance by which the pooling core moves in the x direction, and the second stride may be the distance by which the pooling core moves in the y direction.

It should be noted that this disclosure takes a two-dimensional pooling core only as an example to describe the parameters required for the average pooling operation, such as the height, width, first stride, and second stride of the pooling core. If the pooling core is multi-dimensional, the parameters of the pooling core include the size and stride of each dimension.

In a possible implementation manner, when the first stride and the second stride are not given in the operation domain of the average pooling instruction, the operation module may use the height and width of the pooling core as the strides of their corresponding dimensions, respectively, so that the average pooling operation proceeds normally. For example, the operation module 11-12 can also be used to move the pooling core over the data to be calculated without overlap and to average the multiple data to be calculated in the region corresponding to the pooling core to obtain the operation result.
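The role of the two strides, and the fallback to the pooling core size when they are omitted, can be sketched in Python as follows. This is an illustrative software model rather than the device's implementation: `average_pool` and its parameter names are hypothetical, the x direction is assumed to run along the width and the y direction along the height, and only windows that lie entirely inside the data are evaluated.

```python
from typing import List, Optional


def average_pool(data: List[List[float]], kernel_h: int, kernel_w: int,
                 stride_x: Optional[int] = None,
                 stride_y: Optional[int] = None) -> List[List[float]]:
    """Average pooling over 2-D data given as a list of rows."""
    # When a stride is not given, fall back to the pooling core size in
    # that dimension, which moves the core without overlap.
    if stride_x is None:
        stride_x = kernel_w
    if stride_y is None:
        stride_y = kernel_h

    height, width = len(data), len(data[0])
    output = []
    for top in range(0, height - kernel_h + 1, stride_y):
        row = []
        for left in range(0, width - kernel_w + 1, stride_x):
            total = 0.0
            for y in range(top, top + kernel_h):
                for x in range(left, left + kernel_w):
                    total += data[y][x]
            row.append(total / (kernel_h * kernel_w))
        output.append(row)
    return output


data = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
non_overlapping = average_pool(data, kernel_h=2, kernel_w=2)
overlapping = average_pool(data, kernel_h=2, kernel_w=2, stride_x=1, stride_y=1)
```

With the default strides the 4 × 4 input yields a 2 × 2 result, while strides of 1 in both directions yield a 3 × 3 result because neighboring windows overlap.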

In a possible implementation, when the operation domain does not include the pooling kernel height and the pooling kernel width, a preset default pooling kernel height and default pooling kernel width can be obtained, so that the control module and the operation module can execute the average pooling instruction.

In a possible implementation, the operation domain may further include the number of pooling kernels. The operation module 11-12 is also used to perform the average pooling operation on the data to be calculated using as many pooling kernels as the number of pooling kernels specifies.

In this implementation, the number of pooled cores corresponds to the data to be calculated. For example, when the number of pooling cores is 5, it can be determined that the data to be calculated can be divided into five parts, and five pooling cores are required to perform average pooling operations on the five parts of the data to be calculated, respectively.

In this implementation manner, when the operation domain does not include the number of pooled cores, it can be determined that only one pooled core is needed for the data to be calculated to implement average pooling operations.
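A minimal sketch of pooling several parts of the data, one pooling kernel per part, is given below. How the data to be calculated is divided is not specified in the text, so representing it as a list of equally sized 2-D parts is an assumption, as are the helper names `avg_pool_part` and `avg_pool_all_parts` and the use of non-overlapping windows.

```python
def avg_pool_part(part, kernel_h, kernel_w):
    """Non-overlapping average pooling of one 2-D part (illustrative)."""
    return [
        [
            sum(part[top + dy][left + dx]
                for dy in range(kernel_h)
                for dx in range(kernel_w)) / (kernel_h * kernel_w)
            for left in range(0, len(part[0]) - kernel_w + 1, kernel_w)
        ]
        for top in range(0, len(part) - kernel_h + 1, kernel_h)
    ]


def avg_pool_all_parts(parts, kernel_h, kernel_w):
    """Apply one pooling kernel to each part; len(parts) plays the role
    of the number of pooling kernels in this sketch."""
    return [avg_pool_part(part, kernel_h, kernel_w) for part in parts]


# Five 2 x 2 parts, part c filled with the constant c (made-up data).
parts = [[[c, c], [c, c]] for c in range(5)]
results = avg_pool_all_parts(parts, 2, 2)
```

Each part is reduced independently, so five parts produce five separate results. Passing a single-element list corresponds to the case where only one pooling kernel is needed.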

In a possible implementation, the operation module 11-12 is further used to, when the size of the data to be calculated is a non-integer multiple of the size of the pooling kernel, perform the average pooling operation only on the portion of the data to be calculated whose size is an integer multiple of the size of the pooling kernel. The size of the data to be calculated being a non-integer multiple of the size of the pooling kernel may include at least one of the following: the input width of the data to be calculated is a non-integer multiple of the pooling kernel width, and the input height of the data to be calculated is a non-integer multiple of the pooling kernel height.

In this implementation, the average pooling operation may be skipped for the remaining portion of the data to be calculated that falls outside the integer multiple of the pooling kernel size.
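The handling of a non-integer multiple can be sketched as a cropping step applied before pooling. The text does not say which rows and columns form the remaining portion; dropping the trailing rows and columns is an assumption of this sketch, and `crop_to_kernel_multiple` is a hypothetical helper name.

```python
def crop_to_kernel_multiple(data, kernel_h, kernel_w):
    """Keep only the leading rows/columns whose count is an integer
    multiple of the kernel size (assumption: the remainder is trailing)."""
    usable_h = (len(data) // kernel_h) * kernel_h
    usable_w = (len(data[0]) // kernel_w) * kernel_w
    return [row[:usable_w] for row in data[:usable_h]]


# A 5 x 5 input with a 2 x 2 kernel: 5 is not a multiple of 2,
# so only the leading 4 x 4 region would be pooled.
data = [[r * 5 + c for c in range(5)] for r in range(5)]
cropped = crop_to_kernel_multiple(data, 2, 2)
```

Here the fifth row and fifth column are left out of the operation, and the remaining 4 × 4 region divides evenly into four 2 × 2 windows.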

In a possible implementation, as shown in FIGS. 12-2a and 12-2b, the device may further include a storage module 11-13. The storage module 11-13 is used to store the data to be calculated and the pooling kernel.

In a possible implementation manner, the control module 11-11 may also be used to generate an assembly file according to the average pooling instruction and translate the assembly file into a binary file, where the binary file is the compiled average pooling instruction.

In a possible implementation, the instruction format of the average pooling instruction may be:

avgpool dst src0 src1 srcChannel srcHeigh srcWidth kernelHeight kernelWidth sx sy

Here, avgpool is the operation code of the average pooling instruction, and dst, src0, src1, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, sx, and sy are the operation domain of the average pooling instruction. dst is the target address, src0 is the address of the data to be calculated, src1 is the pooling kernel address, srcChannel is the number of pooling kernels, srcHeigh is the input height, srcWidth is the input width, kernelHeight is the pooling kernel height, kernelWidth is the pooling kernel width, sx is the first step by which the pooling kernel moves in the x direction, and sy is the second step by which the pooling kernel moves in the y direction.
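To make the field order concrete, the following sketch splits a textual instruction into its operation code and operation domain. The disclosure describes compiled binary instructions; parsing a whitespace- or comma-separated text form, the class name `AvgPoolInstruction`, and the function name `parse_avgpool` are all assumptions for illustration. The field names follow the instruction format given above, and the operand values are illustrative.

```python
from typing import NamedTuple


class AvgPoolInstruction(NamedTuple):
    """Operation domain of one avgpool instruction (field order as in the format)."""
    dst: int
    src0: int
    src1: int
    srcChannel: int
    srcHeigh: int
    srcWidth: int
    kernelHeight: int
    kernelWidth: int
    sx: int
    sy: int


def parse_avgpool(text):
    """Split a textual instruction into opcode and operands (hypothetical helper)."""
    tokens = text.replace(",", " ").split()
    opcode, operands = tokens[0], tokens[1:]
    if opcode != "avgpool":
        raise ValueError(f"unexpected operation code: {opcode!r}")
    if len(operands) != len(AvgPoolInstruction._fields):
        raise ValueError("expected 10 operands in the operation domain")
    return AvgPoolInstruction(*(int(t) for t in operands))


instr = parse_avgpool("avgpool 500 100 200 5 64 32 2 2 2 1")
```

Because the disclosure allows the position of the operation code and of each field to be changed, a different field order would only require reordering the fields of the tuple.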

It should be understood that those skilled in the art can set the operation code of the average pooling instruction, the position of the operation code and the operation domain in the instruction format as needed, and the disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the average pooling instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses “the average pooling instruction processing device performing an average pooling operation” as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to make the processing flow of the average pooling instruction processing device easier to understand. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be considered a limitation on them.

FIG. 12-3 shows a schematic diagram of an application scenario of an average pooling instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 12-3, the average pooling instruction processing device processes the average pooling instruction as follows:

The control module 11-11 compiles the obtained average pooling instruction 1 to obtain the compiled average pooling instruction 1 (for example, the average pooling instruction 1 is avgpool 500, 100, 200, 5, 64, 32, 2, 2, 2, 1), and parses the compiled average pooling instruction 1 to obtain its operation code and operation domain. The operation code of the average pooling instruction 1 is avgpool, the target address is 500, the address of the data to be calculated is 100, the pooling kernel address is 200, the number of pooling kernels is 5, the input height is 64, the input width is 32, the pooling kernel height is 2, the pooling kernel width is 2, the first step is 2, and the second step is 1. The control module 11-11 obtains the 64 × 32 data to be calculated from the address 100 of the data to be calculated, and the 2 × 2 pooling kernel from the pooling kernel address 200.

The operation module 11-12 uses 5 pooling kernels to perform the average pooling operation on the data to be calculated, obtains the operation result, and stores the operation result in the target address 500.

In this way, the average pooling instruction can be processed efficiently and quickly, and the efficiency and speed of the average pooling operation are also significantly improved.
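As a worked check of this example, the size of the operation result can be estimated under assumptions the text does not state: x runs along the input width, y runs along the input height, no padding is applied, and each of the five parts has the full 64 × 32 size. The helper name `pooled_size` is hypothetical; the formula is the usual one for pooling without padding.

```python
def pooled_size(in_size, kernel_size, step):
    """Number of window positions along one dimension, without padding."""
    return (in_size - kernel_size) // step + 1


# Values from the example: input 64 x 32, kernel 2 x 2, first step 2, second step 1.
out_w = pooled_size(32, 2, 2)   # x direction (assumed to be the width)
out_h = pooled_size(64, 2, 1)   # y direction (assumed to be the height)
values_per_part = out_h * out_w
total_values = 5 * values_per_part  # five pooling kernels, one per part
```

Under these assumptions each part yields a 63 × 16 result, so 5040 averaged values in total would be written starting at the target address 500.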

FIG. 12-4 shows a flowchart of an average pooling instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as steps S51-11 and S52-11 below. As shown in FIG. 12-4, the method is applied to the above average pooling instruction processing device and includes steps S51-11 and S52-11.

In step S51-11, the control module is used to compile the obtained average pooling instruction to obtain the compiled average pooling instruction, parse the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the pooling kernel, and the target address required to execute the average pooling instruction. The operation code indicates that the operation performed by the average pooling instruction on the data is the average pooling operation, and the operation domain includes the address of the data to be calculated, the pooling kernel address, and the target address.

In step S52-11, the operation module is used to perform average pooling operation on the data to be calculated according to the pooling core to obtain the operation result, and the operation result is stored in the target address.

In a possible implementation, performing the average pooling operation on the data to be calculated according to the pooling kernel to obtain the operation result may include: performing the addition operations in the average pooling operation using multiple adders in the operation module, and performing the division operations in the average pooling operation using multiple dividers in the operation module.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple adders and multiple dividers. Wherein, step S52-11 may include:

Use multiple adders and multiple dividers in the main operation sub-module to perform addition and division operations in the average pooling operation, respectively, to obtain the operation result, and store the operation result in the target address.

In a possible implementation manner, the operation domain may further include an input height and an input width. Among them, obtaining the data to be calculated, the pooling core, and the target address required to execute the average pooling instruction according to the operation code and the operation domain may include:

In a possible implementation, the operation domain may further include pooled core height and pooled core width. Among them, obtaining the data to be calculated, the pooling core, and the target address required to execute the average pooling instruction according to the operation code and the operation domain may include:

In a possible implementation, the operation domain may also include the first step. Wherein, performing average pooling operation on the data to be calculated according to the pooling core may include: moving the pooling core in the x direction according to the first step.

In a possible implementation manner, the operation domain may further include a second step. Wherein, performing average pooling operation on the data to be calculated according to the pooling core may include: moving the pooling core in the y direction according to the second step.

In a possible implementation manner, performing average pooling operation on the data to be calculated according to the pooling core to obtain the operation result may include:

When the size of the data to be calculated is a non-integer multiple of the size of the pooling kernel, the average pooling operation is performed on the portion of the data to be calculated whose size is an integer multiple of the size of the pooling kernel.

In a possible implementation, the operation domain may further include the number of pooled cores. Wherein, performing average pooling operation on the data to be calculated according to the pooling core to obtain the operation result may include:

The average pooling operation is performed on the data to be calculated using as many pooling kernels as the number of pooling kernels specifies.

In a possible implementation manner, parsing the obtained average pooling instruction to obtain the operation code and operation domain of the average pooling instruction may include:

Storing the compiled average pooling instruction;

Parsing the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the compiled average pooling instruction.
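The three sub-steps above (store, parse, queue) can be sketched in software as follows. The class name `ControlModuleSketch` and its methods are hypothetical, and representing a compiled instruction as a line of text rather than a binary file is a simplifying assumption.

```python
from collections import deque


class ControlModuleSketch:
    """Hypothetical stand-in for the store / parse / queue sub-steps."""

    def __init__(self):
        self.instruction_store = []        # compiled instructions as received
        self.instruction_queue = deque()   # parsed instructions in execution order

    def accept(self, compiled):
        # Store the compiled instruction.
        self.instruction_store.append(compiled)
        # Parse it into an operation code and an operation domain.
        tokens = compiled.replace(",", " ").split()
        opcode, operation_domain = tokens[0], [int(t) for t in tokens[1:]]
        # Append to the queue so that execution order is preserved.
        self.instruction_queue.append((opcode, operation_domain))

    def next_to_execute(self):
        # First in, first out: the oldest queued instruction runs first.
        return self.instruction_queue.popleft()


ctrl = ControlModuleSketch()
ctrl.accept("avgpool 500 100 200 5 64 32 2 2 2 1")
ctrl.accept("avgpool 900 600 200 1 8 8 2 2 2 2")   # made-up second instruction
first = ctrl.next_to_execute()
```

A double-ended queue is used here only because it gives constant-time removal from the front; the text does not prescribe a particular queue structure.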

In a possible implementation manner, compiling the obtained average pooling instruction to obtain the compiled average pooling instruction may include:

The assembly file is generated according to the average pooling instruction, and the assembly file is translated into a binary file. Among them, the binary file is the average pooled instruction after compilation.

It should be noted that, although the above embodiment is taken as an example to introduce the average pooling instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The average pooling instruction processing method provided by the embodiments of the present disclosure has a wide application range, and processes both the average pooling instruction and the average pooling operation with high efficiency and speed.

The foregoing can be better understood based on the following clauses:

Clause L1. An average pooling instruction processing device, the device comprising:

A control module, configured to compile the obtained average pooling instruction to obtain the compiled average pooling instruction, parse the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the pooling kernel, and the target address required to execute the average pooling instruction;

An operation module, configured to perform an average pooling operation on the data to be calculated according to the pooling kernel, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the average pooling instruction on the data is an average pooling operation, and the operation domain includes the data address to be operated, the pooling core address, and the target address.

Clause L2. The device according to Clause L1, the arithmetic module includes:

Multiple adders for performing the addition operation in the average pooling operation;

A plurality of dividers are used to perform the division operation in the average pooling operation.

Clause L3. The device according to Clause L2, the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule includes the plurality of adders and the plurality of dividers,

The master operation submodule is configured to use the plurality of adders and the plurality of dividers to perform the addition and division operations in the average pooling operation, respectively, to obtain an operation result, and to store the operation result in the target address.

Clause L4. The device according to Clause L1, the operation domain further includes an input height and an input width,

Clause L5. The device according to Clause L1, the operation domain further includes a pooled core height and a pooled core width,

Clause L6. The device according to Clause L1, the operation domain further includes a first step,

Clause L7. The device according to Clause L1, the operation domain further includes a second step,

Clause L8, the device according to Clause L1,

Clause L9, the device according to Clause L1,

The operation module is also used to, when the size of the data to be calculated is a non-integer multiple of the size of the pooling kernel, perform the average pooling operation on the portion of the data to be calculated whose size is an integer multiple of the size of the pooling kernel,

Clause L10. The device according to Clause L1, the operation domain further includes the number of pooled cores,

Wherein, the operation module is also used to perform the average pooling operation on the data to be calculated using as many pooling kernels as the number of pooling kernels specifies.

Clause L11. The device according to Clause L1, the device further comprising:

A storage module for storing the data to be calculated and the pooled core,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause L12. The device according to Clause L1, the control module includes:

Instruction storage sub-module for storing the compiled average pooled instruction;

Instruction processing sub-module, which is used to analyze the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction;

A queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the plurality of instructions to be executed include the compiled average pooled instructions.

Clause L13. The device according to Clause L12, the control module, further comprising:

Clause L14, the device according to Clause L1,

The control module is also used to generate an assembly file according to the average pooling instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled average pooling instruction.

Clause L15. A machine learning computing device, the device comprising:

One or more average pooling instruction processing devices as described in any one of Clause L1 to Clause L14, used to obtain the data to be calculated and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the average pooled instruction processing devices, the plurality of average pooled instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of average pooling instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of average pooling instruction processing devices share the same control system or have their own control systems; the plurality of average pooling instruction processing devices share memory or have their own memories; and the plurality of average pooling instruction processing devices may be interconnected in any interconnection topology.

Clause L16. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause L15;

Clause L17. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause L15 or the combined processing device according to clause L16.

Clause L18. An electronic device, the electronic device comprising:

Machine learning chip as described in clause L17.

Clause L19. A board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause L17;

The storage device is used for storing data;

Clause L20. An average pooling instruction processing method. The method is applied to an average pooling instruction processing apparatus. The apparatus includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained average pooling instruction to obtain the compiled average pooling instruction, parse the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, and obtain, according to the operation code and the operation domain, the data to be calculated, the pooling kernel, and the target address required to execute the average pooling instruction;

Using an arithmetic module to perform an average pooling operation on the data to be calculated according to the pooling core to obtain an operation result, and store the operation result in the target address,

Clause L21. The method according to Clause L20, wherein performing the average pooling operation on the data to be calculated according to the pooling kernel to obtain the operation result includes:

The addition operation in the average pooling operation is performed using multiple adders in the operation module, and the division operation in the average pooling operation is performed using multiple dividers in the operation module.

Clause L22. The method according to Clause L21, the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule includes a plurality of adders and a plurality of dividers,

Wherein, performing an average pooling operation on the data to be calculated according to the pooling kernel to obtain an operation result, and storing the operation result in the target address includes:

Using the plurality of adders and the plurality of dividers in the master operation submodule to perform the addition and division operations in the average pooling operation, respectively, to obtain an operation result, and storing the operation result in the target address.

Clause L23. The method according to Clause L20, the operation domain further includes an input height and an input width,

Wherein, obtaining the data to be operated, the pooling core and the target address required to execute the average pooling instruction according to the operation code and the operation domain includes:

Clause L24. The method according to Clause L20, the operation domain further includes a pooled core height and a pooled core width,

Wherein, obtaining the data to be calculated, the pooling core and the target address required to execute the average pooling instruction according to the operation code and the operation domain include:

Clause L25. The method according to Clause L20, the operation domain further includes a first step,

Wherein, performing the average pooling operation on the data to be calculated according to the pooling kernel includes:

The pooling core is moved in the x direction according to the first step.

Clause L26. The method according to Clause L20, the operation domain further includes a second step,

The pooling core is moved in the y direction according to the second step.

Clause L27. The method according to Clause L20, wherein performing an average pooling operation on the data to be calculated according to the pooling kernel to obtain the operation result includes:

Clause L28. The method according to Clause L20, wherein performing an average pooling operation on the data to be calculated according to the pooling kernel to obtain the operation result includes:

When the size of the data to be calculated is a non-integer multiple of the size of the pooling core, perform an average pooling operation on the data to be calculated that is an integer multiple of the size of the pooling core,

Clause L29. The method according to Clause L20, the operation domain further includes the number of pooled cores,

Wherein, performing the average pooling operation on the data to be calculated according to the pooling kernel to obtain the operation result includes:

An average pooling operation is performed on the data to be calculated through a plurality of pooling cores whose number is the number of the pooling cores.

Clause L30. The method according to Clause L20, the method further comprising:

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause L31. According to the method described in Clause L20, analyze the obtained average pooling instruction to obtain the operation code and operation domain of the average pooling instruction, including:

Storing the compiled average pooled instruction;

Analyzing the compiled average pooling instruction to obtain the operation code and operation domain of the average pooling instruction;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged in order of execution, and the plurality of instructions to be executed include the compiled average pooled instructions.

Clause L32. The method according to Clause L31, the method further comprising:

Clause L33. According to the method described in Clause L20, compile the obtained average pooling instruction to obtain the compiled average pooling instruction, including:

Generate an assembly file according to the average pooling instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled average pooling instruction.

Clause L34. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause L20 to Clause L33.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. Programming languages are diverse, and in the related art no scalar instruction at this stage can be widely applied across programming languages. To implement scalar operations in a given language environment, technicians therefore need to customize multiple instructions for that environment to cover the different types of scalar operations, which makes scalar operations inefficient and slow. The present disclosure provides a scalar instruction processing method, device, computer equipment, and storage medium that can implement a scalar operation with only one instruction, which can significantly improve the efficiency and speed of scalar operations.

FIG. 13-1 shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 13-1, the device includes a control module 13-11 and an operation module 13-12.

The control module 13-11 is used to compile the obtained scalar instruction to obtain the compiled scalar instruction, parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction, obtain, according to the operation code and the operation domain, the scalar to be operated and the target address required to execute the scalar instruction, and determine the scalar operation type of the scalar instruction. The operation code indicates that the operation performed by the scalar instruction on the data is a scalar operation, the scalar operation type indicates the type of operation to perform and the data type of the scalar to be operated, and the operation domain includes the address of the scalar to be operated and the target address.

The operation module 13-12 is configured to perform a scalar operation on the scalar to be calculated according to the scalar operation type, obtain an operation result, and store the operation result in a target address.

In this embodiment, there may be one or more scalars to be operated. The operation type indicated by the scalar operation type is the kind of arithmetic or logical operation performed on the scalar to be operated, for example an addition operation or a logical left shift operation. The data type indicated by the scalar operation type may be the storage type of the scalar to be operated. Data types may include 16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, 48-bit signed, pointer, and other types applicable to scalar data. A person skilled in the art can set the operation type and data type according to actual needs, and the disclosure does not limit this.
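The role of the data type can be illustrated by reducing a result to a given bit width. The disclosure does not state how overflow is handled; the wrap-around (two's complement) behavior below is an assumption, and `wrap_to_type` is a hypothetical helper name.

```python
def wrap_to_type(value, bits, signed):
    """Reduce an integer to a fixed-width type, wrapping on overflow.

    Wrap-around is an assumption of this sketch; the text only lists
    the widths (16, 32, 48 bits) and signedness of the data types.
    """
    mask = (1 << bits) - 1
    value &= mask                          # keep the low `bits` bits
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits                 # reinterpret as two's complement
    return value


u16_sum = wrap_to_type(65535 + 1, 16, signed=False)
s16_sum = wrap_to_type(32767 + 1, 16, signed=True)
u48_neg = wrap_to_type(-1, 48, signed=False)
```

The same addition therefore gives different stored results depending on the data type carried by the scalar operation type, which is why the type is part of what the control module determines.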

In this embodiment, the scalar instruction acquired by the control module is an uncompiled software instruction that cannot be executed directly by the hardware, so the control module first needs to compile it. Once the compiled scalar instruction is obtained, it can be parsed. The compiled scalar instruction is a hardware instruction that can be executed directly by the hardware. The control module can obtain each scalar to be operated from its corresponding address. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) specified in a computer program to perform an operation; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including parameters such as the scalar to be operated, the scalar operation type, and the corresponding operation method. A scalar instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the scalar to be operated and the target address.

It should be understood that those skilled in the art can set the instruction format of the scalar instruction, as well as the included operation codes and operation fields as required, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive scalar instructions and control one or more arithmetic modules to perform scalar arithmetic. When the device includes multiple control modules, the multiple control modules may respectively receive scalar instructions and control the corresponding one or more arithmetic modules to perform scalar operations.

The scalar instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained scalar instruction to obtain the compiled scalar instruction, parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction, obtain, according to the operation code and the operation domain, the scalar to be operated and the target address required to execute the scalar instruction, and determine the scalar operation type of the scalar instruction. The operation module is used to perform a scalar operation on the scalar to be operated according to the scalar operation type, obtain the operation result, and store the operation result in the target address. The scalar instruction processing device provided by the embodiments of the present disclosure has a wide application range, and processes both scalar instructions and scalar operations with high efficiency and speed.

FIG. 13-2a shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 13-2a, the operation module 13-12 may include a plurality of scalar operators 13-120. The plurality of scalar operators 13-120 are used to perform the scalar operation corresponding to the scalar operation type.

In this implementation manner, the scalar operator may include an adder, a divider, a multiplier, and the like that can perform arithmetic operations, logical operations, and the like on the scalar. The type and number of scalar operators can be set according to the size of the scalar operation, the type of scalar operation, the processing speed and efficiency of the scalar operation, etc. The disclosure does not limit this.

FIG. 13-2b shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 13-2b, the operation module 13-12 may include a master operation submodule 13-121 and a plurality of slave operation submodules 13-122. The master operation submodule 13-121 may include a plurality of scalar operators (not shown in the figure).

The main operation sub-module 13-121 is used for performing scalar operations with multiple scalar operators, obtaining operation results, and storing the operation results in the target address.

In a possible implementation, the operation domain may also include a scalar operation type.

In this case, the control module 13-11 can also be used to determine the scalar operation type according to the operation domain.

In a possible implementation manner, the operation type may be at least one of the following: addition operation, sum operation, multiplication operation, bitwise AND operation, bitwise remainder operation, bitwise absolute value operation, bitwise division operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, bitwise minimum value operation, logical left shift operation, logical right shift operation, arithmetic right shift operation, logical AND operation, logical OR operation, logical XOR operation, and logical inversion operation.

In this implementation, different codes can be set for different types of scalar operations to distinguish them. For example, the code for the addition operation can be set to add; the sum operation, sub; the multiplication operation, mul; the bitwise AND operation, and; the bitwise remainder operation, rem; the bitwise absolute value operation, abs; the bitwise division operation, div; the bitwise OR operation, or; the bitwise XOR operation, xor; the bitwise inversion operation, not; the bitwise maximum value operation, max; the bitwise minimum value operation, min; the logical left shift operation, sll; the logical right shift operation, srl; the arithmetic right shift operation, sra; the logical AND operation, land; the logical OR operation, lor; the logical XOR operation, lxo; and the logical inversion operation, lnot. A person skilled in the art can set the code of each operation type according to actual needs, which is not limited in this disclosure.
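As a rough illustration of the coding scheme above, the codes can be modelled as a lookup table from code to function. This is only a sketch: the names `SCALAR_OPS` and `apply_op` are hypothetical, the Python semantics chosen for each code are assumptions, and only two-operand operations whose meaning is unambiguous in the text are included.

```python
import operator

# Hypothetical lookup table: operation code -> two-operand Python function.
# The codes come from the text; the Python semantics are assumptions.
SCALAR_OPS = {
    "add": operator.add,
    "mul": operator.mul,
    "and": operator.and_,   # bitwise AND
    "or": operator.or_,     # bitwise OR
    "xor": operator.xor,    # bitwise XOR
    # Logical variants treat any non-zero operand as true and return 0 or 1.
    "land": lambda a, b: int(bool(a) and bool(b)),
    "lor": lambda a, b: int(bool(a) or bool(b)),
    "lxo": lambda a, b: int(bool(a) != bool(b)),
}


def apply_op(code, a, b):
    """Look up the operation by its code and apply it to two scalars."""
    if code not in SCALAR_OPS:
        raise ValueError(f"unknown operation code: {code}")
    return SCALAR_OPS[code](a, b)
```

A table of this kind keeps the decoding step separate from the arithmetic, so new codes can be added without changing the dispatch logic.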

In a possible implementation manner, the operation domain may further include operation parameters. The control module 13-11 is also used to determine the operation parameters according to the operation domain. The arithmetic module 13-12 is further configured to perform the scalar operation on the scalar to be calculated according to the operation parameters and the scalar operation type, obtain an operation result, and store the operation result in the target address.

In a possible implementation manner, as shown in FIGS. 13-2a and 13-2b, the device may further include a storage module 13-13. The storage module 13-13 is used to store the scalar to be calculated.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store the data to be calculated, and the register can be used to store the scalar to be calculated.

In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the foregoing neuron random access memory, can be used to store neuron data in the data to be calculated, and the neuron data can include neuron vector data. Wherein, the data to be calculated includes data related to performing scalar operations and / or data related to operations of other calculation instructions.

In a possible implementation manner, the control module 13-11 may also be used to generate an assembly file according to the scalar instruction and translate the assembly file into a binary file, where the binary file is a compiled scalar instruction.

In a possible implementation, the instruction format of the scalar instruction may be:

scalar dst src opcode.type pa.

Here, scalar is the operation code of the scalar instruction, and dst, src, opcode.type, and pa are the operation domain of the scalar instruction. dst is the target address. src is the address of the scalar to be calculated. When there are multiple scalars to be calculated, src may include multiple scalar addresses src0, src1, ..., srcn, which is not limited in the present disclosure. opcode.type is the scalar operation type: opcode indicates the operation type of the scalar operation, and type indicates the data type of the scalar to be calculated. pa is the operation parameter, such as the number of bits to shift.
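The field layout described above can be sketched as a small text parser. This is a minimal sketch under stated assumptions: the instruction is treated as whitespace-separated text, addresses and pa are written as decimal integers, and the names `ScalarInstruction` and `parse_format1` are hypothetical. The disclosure does not prescribe any of these details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScalarInstruction:
    dst: int                  # target address
    srcs: List[int]           # addresses of the scalars to be calculated
    op: str                   # operation type, e.g. "add"
    dtype: str                # data type, e.g. "u16"
    pa: Optional[int] = None  # optional operation parameter


def parse_format1(text):
    """Parse 'scalar dst src0 [src1 ...] opcode.type [pa]' (assumed layout)."""
    tokens = text.split()
    if not tokens or tokens[0] != "scalar":
        raise ValueError("not a scalar instruction")
    # The opcode.type field is the only token containing a dot.
    idx = next((i for i, t in enumerate(tokens) if "." in t), None)
    if idx is None or idx < 3:
        raise ValueError("expected dst, at least one src, then opcode.type")
    op, dtype = tokens[idx].split(".", 1)
    rest = tokens[idx + 1:]
    return ScalarInstruction(
        dst=int(tokens[1]),
        srcs=[int(t) for t in tokens[2:idx]],
        op=op,
        dtype=dtype,
        pa=int(rest[0]) if rest else None,
    )
```

For instance, the made-up instruction `scalar 40 10 11 add.u16` parses to target address 40, source addresses 10 and 11, operation add, and data type u16.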

Alternatively, the instruction format of the scalar instruction may be:

opcode.scalar.type dst src pa.

Here, opcode.scalar.type is the operation code of the scalar instruction, and dst, src, and pa are the operation domain of the scalar instruction. dst is the target address. src is the address of the scalar to be calculated. When there are multiple scalars to be calculated, src may include multiple scalar addresses src0, src1, ..., srcn, which is not limited in the present disclosure; alternatively, multiple scalars to be calculated can be obtained from src. pa is the operation parameter, such as the number of bits to shift. In the operation code opcode.scalar.type, opcode indicates the operation type of the scalar operation, and type indicates the data type of the scalar to be calculated. type can be u16, u32, u48, s16, s32, s48, or ptr: u16 indicates an unsigned scalar with a length of 16 bits, u32 an unsigned scalar with a length of 32 bits, u48 an unsigned scalar with a length of 48 bits, s16 a signed scalar with a length of 16 bits, s32 a signed scalar with a length of 32 bits, s48 a signed scalar with a length of 48 bits, and ptr a pointer-type scalar.
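The type suffixes above can be illustrated with a short sketch that splits the operation code and reduces a value to the stated width. The names `TYPE_INFO`, `wrap`, and `parse_opcode` are hypothetical, and wrapping on overflow (two's complement for signed types) is an assumption: the text gives only the signedness and bit length. ptr is left out because its width is not stated.

```python
# Bit width and signedness for the type suffixes listed in the text.
# "ptr" is omitted because the text does not give a pointer width.
TYPE_INFO = {
    "u16": (16, False), "u32": (32, False), "u48": (48, False),
    "s16": (16, True), "s32": (32, True), "s48": (48, True),
}


def wrap(value, dtype):
    """Reduce an integer to the given type, wrapping on overflow (assumed)."""
    bits, signed = TYPE_INFO[dtype]
    value &= (1 << bits) - 1
    # Reinterpret the top bit as a sign bit for signed types.
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def parse_opcode(opcode):
    """Split 'op.scalar.type' into (op, type)."""
    parts = opcode.split(".")
    if len(parts) != 3 or parts[1] != "scalar":
        raise ValueError(f"not a scalar operation code: {opcode}")
    return parts[0], parts[2]
```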

In a possible implementation, the instruction format of the scalar instruction used for the scalar addition operation can be set to: add.scalar.type dst src0 src1. It means: add the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the scalar sum operation can be set to: sub.scalar.type dst src0. It means: sum the plurality of scalars to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst. Alternatively, the instruction format of the scalar instruction used for the scalar sum operation can be set to: sub.scalar.type dst src0 src1 ... srcn. It means: sum the plurality of scalars to be operated, of data type type, stored in src0, src1, ..., srcn to obtain the operation result, and store the operation result in the target address dst.
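The multi-address variant of the sum operation can be sketched as follows. The dictionary used as memory, the function name `scalar_sum`, and the wrap-around to an unsigned width are all assumptions made for illustration.

```python
def scalar_sum(mem, dst, srcs, bits=16):
    """Sum the scalars stored at the addresses in srcs and store at dst.

    mem is a plain dict standing in for storage (hypothetical model).
    The result wraps to an unsigned value of the given width (assumed).
    """
    total = sum(mem[addr] for addr in srcs) & ((1 << bits) - 1)
    mem[dst] = total
    return total
```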

In a possible implementation, the instruction format of the scalar instruction used for the scalar multiplication operation can be set to: mul.scalar.type dst src0 src1. It means: multiply the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise AND operation can be set to: and.scalar.type dst src0 src1. It means: perform a bitwise AND operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise remainder operation can be set to: rem.scalar.type dst src0. It means: perform a bitwise remainder operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise absolute value operation can be set to: abs.scalar.type dst src0. It means: perform a bitwise absolute value operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise division operation can be set to: div.scalar.type dst src0 src1. It means: perform a bitwise division operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise OR operation can be set to: or.scalar.type dst src0 src1. It means: perform a bitwise OR operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise XOR operation can be set to: xor.scalar.type dst src0 src1. It means: perform a bitwise XOR operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise inversion operation can be set to: not.scalar.type dst src0. It means: perform a bitwise inversion operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise maximum value operation can be set to: max.scalar.type dst src0. It means: perform a bitwise maximum value operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the bitwise minimum value operation can be set to: min.scalar.type dst src0. It means: perform a bitwise minimum value operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the logical left shift operation can be set to: sll.scalar.type dst src0 pa. It means: logically shift the scalar to be operated, of data type type, stored in src0 to the left by pa bits to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the logical right shift operation can be set to: srl.scalar.type dst src0 pa. It means: logically shift the scalar to be operated, of data type type, stored in src0 to the right by pa bits to obtain the operation result, and store the operation result in the target address dst.
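The two logical shifts can be sketched for a fixed width as below. The function names mirror the codes sll and srl, while the default 16-bit width and the masking are assumptions for illustration. In a logical shift the vacated bit positions are filled with zeros, and bits shifted past the width are discarded.

```python
def sll(value, pa, bits=16):
    """Logical left shift by pa bits; bits shifted past the width are dropped."""
    mask = (1 << bits) - 1
    return (value << pa) & mask


def srl(value, pa, bits=16):
    """Logical right shift by pa bits; zeros are shifted in from the left."""
    mask = (1 << bits) - 1
    # Masking first makes a negative Python int behave like its bit pattern.
    return (value & mask) >> pa
```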

In a possible implementation, the instruction format of the scalar instruction used for the logical AND operation can be set to: land.scalar.type dst src0 src1. It means: perform a logical AND operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the logical OR operation can be set to: lor.scalar.type dst src0 src1. It means: perform a logical OR operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the logical XOR operation can be set to: lxo.scalar.type dst src0 src1. It means: perform a logical XOR operation on the first scalar to be operated, of data type type, stored in src0 and the second scalar to be operated, of data type type, stored in src1 to obtain the operation result, and store the operation result in the target address dst.

In a possible implementation, the instruction format of the scalar instruction used for the logical inversion operation can be set to: lnot.scalar.type dst src0. It means: perform a logical inversion operation on the scalar to be operated, of data type type, stored in src0 to obtain the operation result, and store the operation result in the target address dst.
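The difference between the bitwise and logical variants can be seen in a short sketch. It assumes that logical operations treat any non-zero scalar as true and produce 0 or 1, which the text does not state explicitly; the function names are hypothetical.

```python
def bitwise_not(value, bits=16):
    """Flip every bit of the scalar within the given width (code: not)."""
    return ~value & ((1 << bits) - 1)


def logical_not(value):
    """Return 1 for a zero scalar and 0 otherwise (code: lnot, assumed)."""
    return 0 if value else 1


def bitwise_and(a, b):
    """AND the operands bit by bit (code: and)."""
    return a & b


def logical_and(a, b):
    """Return 1 only if both scalars are non-zero (code: land, assumed)."""
    return int(bool(a) and bool(b))
```

With operands 5 and 10, for example, the bitwise AND is 0 because no bit position is set in both, while the logical AND is 1 because both operands are non-zero.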

It should be understood that those skilled in the art can set the operation code of the scalar instruction, the position of the operation code and the operation field in the instruction format as needed, and the disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the scalar instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure, using "scalar operation with a scalar instruction processing device" as an exemplary application scenario, to make the processing flow of the scalar instruction processing device easier to follow. Those skilled in the art should understand that the following application examples are only intended to aid understanding of the embodiments of the present disclosure, and should not be considered as limitations on them.

FIGS. 13-3a and 13-3b show schematic diagrams of application scenarios of a scalar instruction processing device according to an embodiment of the present disclosure. As shown in FIGS. 13-3a and 13-3b, the scalar instruction processing device processes scalar instructions as follows:

Example 1

As shown in FIG. 13-3a, the control module 13-11 compiles the obtained scalar instruction 1 to obtain the compiled scalar instruction 1 (for example, the scalar instruction 1 is scalar 500 500 101 102 add.u16), and parses the compiled scalar instruction 1 to obtain the operation code and operation domain of scalar instruction 1. The operation code of scalar instruction 1 is scalar, the target address is 500, the first scalar address to be calculated is 101, and the second scalar address to be calculated is 102. The scalar operation type is add.u16, where the operation type is the addition operation add and the data type is a 16-bit unsigned scalar. The control module 13-11 obtains a 16-bit unsigned first scalar to be calculated from the scalar address 101, and a 16-bit unsigned second scalar to be calculated from the scalar address 102.

The arithmetic module 13-12 performs an addition operation on the first scalar to be calculated and the second scalar to be calculated to obtain an operation result 1, and stores the operation result 1 in the target address 500.
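Example 1 can be traced with a toy model in which a dictionary stands in for storage. The operand values 1200 and 34 are made up for illustration, as are the names `memory` and `add_u16`. Masking the sum to 16 bits is an assumption about overflow behaviour that the text does not address.

```python
# Hypothetical storage contents: the two operands at addresses 101 and 102.
memory = {101: 1200, 102: 34}


def add_u16(mem, dst, src0, src1):
    """Add two 16-bit unsigned scalars and store the result at dst."""
    result = (mem[src0] + mem[src1]) & 0xFFFF  # wrap to 16 bits (assumed)
    mem[dst] = result
    return result


# Operation result 1 is written to target address 500.
result_1 = add_u16(memory, 500, 101, 102)
```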

Example 2

As shown in FIG. 13-3b, the control module 13-11 compiles the obtained scalar instruction 2 to obtain the compiled scalar instruction 2 (for example, the scalar instruction 2 is mul.scalar.u16 501 501 103 104), and parses the compiled scalar instruction 2 to obtain the operation code and operation domain of scalar instruction 2. The operation code of scalar instruction 2 is mul.scalar.u16, the target address is 501, the third scalar address to be calculated is 103, and the fourth scalar address to be calculated is 104. The control module 13-11 obtains a 16-bit unsigned third scalar to be calculated from the scalar address 103, and a 16-bit unsigned fourth scalar to be calculated from the scalar address 104.

The arithmetic module 13-12 performs a multiplication operation on the third scalar to be calculated and the fourth scalar to be calculated to obtain an operation result 2, and stores the operation result 2 in the target address 501.

In this way, the scalar instruction processing device can process scalar instructions efficiently and quickly, and performs scalar operations with high processing efficiency and fast processing speed.

FIG. 13-4 shows a flowchart of a scalar instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as the following steps S51-13 and S52-13. As shown in FIG. 13-4, the method is applied to the above scalar instruction processing device, and the method includes steps S51-13 and S52-13.

In step S51-13, the control module is used to compile the obtained scalar instruction to obtain the compiled scalar instruction, parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction, obtain the scalar to be operated and the target address required to execute the scalar instruction according to the operation code and the operation domain, and determine the scalar operation type of the scalar instruction. The operation code is used to indicate that the operation performed by the scalar instruction on the data is a scalar operation, the scalar operation type is used to indicate the operation type of the scalar operation and the data type of the scalar to be operated, and the operation domain includes the address of the scalar to be operated and the target address.

In step S52-13, the arithmetic module is used to perform a scalar operation on the scalar to be calculated according to the scalar operation type to obtain an operation result, and the operation result is stored in the target address.
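The two steps can be sketched as a pair of functions, one for each module. This is a simplified model, not the disclosed implementation: the instruction is assumed to be text in the form opcode.scalar.type dst src0 src1, only unsigned types and the add and mul codes are handled, compilation is skipped, and the names `control_step` and `operation_step` are hypothetical.

```python
def control_step(instruction, memory):
    """Step S51-13 (sketch): parse the instruction and fetch its operands."""
    opcode, dst, *srcs = instruction.split()
    op, kind, dtype = opcode.split(".")
    if kind != "scalar":
        raise ValueError("not a scalar instruction")
    operands = [memory[int(addr)] for addr in srcs]
    return op, dtype, operands, int(dst)


def operation_step(op, dtype, operands, dst, memory):
    """Step S52-13 (sketch): compute the result and store it at dst."""
    if not dtype.startswith("u"):
        raise ValueError("this sketch handles unsigned types only")
    bits = int(dtype[1:])
    a, b = operands
    if op == "add":
        result = a + b
    elif op == "mul":
        result = a * b
    else:
        raise ValueError(f"unsupported operation in this sketch: {op}")
    result &= (1 << bits) - 1  # wrap to the type width (assumed)
    memory[dst] = result
    return result
```

Calling `operation_step(*control_step(text, memory), memory)` chains the two steps in the order the method describes.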

In a possible implementation manner, performing the scalar operation on the scalar to be calculated according to the scalar operation type to obtain the operation result may include: performing scalar operation corresponding to the scalar operation type by using multiple scalar operators in the operation module.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple scalar operators. Step S52-13 may include:

Use multiple scalar operators in the master operation sub-module to perform scalar operations corresponding to the scalar operation type, perform pre-processing on the scalar operations to obtain the operation results, and store the operation results in the target address.

In a possible implementation, the operation domain may also include a scalar operation type. Among them, determining the scalar operation type of the scalar instruction may include:

Determine the type of scalar operation based on the operation domain.

In a possible implementation manner, the operation domain may further include operation parameters. Wherein, obtaining the scalar to be calculated and the target address required to execute the scalar instruction according to the operation code and the operation domain may further include: determining the operation parameter according to the operation domain.

Among them, performing scalar operation on the scalar to be calculated according to the scalar operation type may include:

According to the operation parameters and the type of scalar operation, the scalar operation is performed on the scalar to be operated.

In a possible implementation manner, the operation type includes at least one of the following: addition operation, sum operation, multiplication operation, bitwise AND operation, bitwise remainder operation, bitwise absolute value operation, bitwise division operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, bitwise minimum value operation, logical left shift operation, logical right shift operation, arithmetic right shift operation, logical AND operation, logical OR operation, logical XOR operation, and logical inversion operation.

In a possible implementation manner, the method may further include: using a storage module of the device to store the scalar to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

Register, used to store the scalar to be calculated;

In a possible implementation, parsing the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction may include:

Store compiled scalar instructions;

Analyze the compiled scalar instructions to get the opcode and operation domain of the scalar instructions;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order. The instructions to be executed may include compiled scalar instructions.

In a possible implementation manner, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and, after it is determined that execution of the zeroth to-be-executed instruction is complete, controlling execution of the first to-be-executed instruction.

The first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it may include: a first storage address interval, which stores the data required by the first to-be-executed instruction, overlaps a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
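The association test above amounts to an interval-overlap check. The sketch below assumes each storage address interval is given as a (start, end) pair with an exclusive end. That convention and the names `intervals_overlap` and `ready_to_execute` are illustrative assumptions.

```python
def intervals_overlap(first, zeroth):
    """Return True if two (start, end) intervals share any address.

    The end of each interval is exclusive (assumed convention).
    """
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def ready_to_execute(first, zeroth, zeroth_done):
    """Decide whether the first to-be-executed instruction may run now.

    If its address interval overlaps that of the zeroth instruction, it
    stays cached until the zeroth instruction has finished executing.
    """
    if intervals_overlap(first, zeroth):
        return zeroth_done
    return True
```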

In a possible implementation manner, compiling the obtained scalar instruction to obtain the compiled scalar instruction may include:

Generate assembly files according to scalar instructions, and translate the assembly files into binary files. Among them, the binary file is a compiled scalar instruction.

It should be noted that although the above embodiment is taken as an example to introduce the scalar instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The scalar instruction processing method provided by the embodiments of the present disclosure has a wide application range, and processes scalar instructions and performs scalar operations with high efficiency and speed.

The foregoing can be better understood based on the following clauses:

Clause M1. A scalar instruction processing device, the device comprising:

A control module, used to compile the obtained scalar instruction to obtain the compiled scalar instruction, parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction, obtain the scalar to be operated and the target address required to execute the scalar instruction according to the operation code and the operation domain, and determine the scalar operation type of the scalar instruction;

An operation module, configured to perform a scalar operation on the scalar to be operated according to the scalar operation type, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the scalar instruction on the data is a scalar operation, the scalar operation type is used to indicate the operation type of the scalar operation and the data type of the scalar to be operated, and the operation domain includes the address of the scalar to be operated and the target address.

Clause M2. The device according to Clause M1, the operation module includes:

A plurality of scalar operators are used to perform scalar operations corresponding to the types of scalar operations.

Clause M3. The device according to Clause M2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of scalar operators,

The master operation sub-module is configured to perform the scalar operation using the plurality of scalar operators, perform pre-processing on the scalar to be operated, obtain an operation result, and store the operation result in the target address.

Clause M4. The device according to Clause M1, the operation domain further includes a scalar operation type,

Wherein, the control module is also used to determine the scalar operation type according to the operation domain.

Clause M5. The device according to Clause M1, the operation domain further includes operation parameters,

Wherein, the control module is also used to determine the operation parameter according to the operation domain;

The operation module is further configured to perform scalar operation on the scalar to be operated according to the operation parameter and the scalar operation type.

Clause M6. The device according to Clause M1, the operation code is further used to indicate the scalar operation type,

The control module is also used to determine the scalar operation type according to the operation code.

Clause M7. The device according to Clause M1, the operation type includes at least one of the following:

Addition operation, sum operation, multiplication operation, bitwise AND operation, bitwise remainder operation, bitwise absolute value operation, bitwise division operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, bitwise minimum value operation, logical left shift operation, logical right shift operation, arithmetic right shift operation, logical AND operation, logical OR operation, logical XOR operation, and logical inversion operation.

Clause M8. The device according to Clause M1, the device further comprising:

A storage module for storing the scalar to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar to be calculated;

Clause M9. The device according to Clause M1, the control module includes:

An instruction storage sub-module for storing the compiled scalar instruction;

An instruction processing sub-module, which is used to parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the instructions to be executed include the compiled scalar instructions.

Clause M10. The device according to Clause M9, the control module, further comprising:

Clause M11. The device according to Clause M1,

The control module is also used to generate an assembly file according to the scalar instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled scalar instruction.

Clause M12. A machine learning computing device, the device comprising:

One or more scalar instruction processing devices as described in any one of clauses M1 to M11, used to obtain the scalars to be calculated and control information from other processing devices, perform designated machine learning operations, and pass the execution results to other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the scalar instruction processing devices, the plurality of the scalar instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of scalar instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of scalar instruction processing devices share the same control system or have their own control systems; the plurality of scalar instruction processing devices share memory or have their own memories; and the plurality of scalar instruction processing devices may be interconnected in any interconnection topology.

Clause M13. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause M12;

Clause M14. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause M12 or the combined processing device according to clause M13.

Clause M15. An electronic device, the electronic device comprising:

Machine learning chip as described in clause M14.

Clause M16. A board card, the board card comprising: a storage device, an interface device, a control device, and a machine learning chip as described in Clause M14;

The storage device is used for storing data;

Clause M17. A scalar instruction processing method. The method is applied to a scalar instruction processing device. The device includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained scalar instruction to obtain the compiled scalar instruction, parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction, obtain the scalar to be operated and the target address required to execute the scalar instruction according to the operation code and the operation domain, and determine the scalar operation type of the scalar instruction;

Using an operation module to perform a scalar operation on the to-be-operated scalar according to the scalar operation type, obtain an operation result, and store the operation result in the target address,

Clause M18. The method according to Clause M17, wherein performing the scalar operation on the scalar to be operated according to the scalar operation type includes:

A plurality of scalar arithmetic units in the arithmetic module are used to perform scalar arithmetic corresponding to the scalar arithmetic type.

Clause M19. The method according to Clause M18, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of scalar operators,

Wherein, performing a scalar operation on the scalar to be operated according to the scalar operation type to obtain an operation result, and storing the operation result in the target address includes:

A plurality of scalar operators in the main operation sub-module are used to perform a scalar operation corresponding to the scalar operation type, obtain an operation result, and store the operation result in the target address.

Clause M20. The method according to Clause M17, the operation domain further includes a scalar operation type,

Among them, determining the scalar operation type of scalar instructions includes:

The type of scalar operation is determined according to the operation domain.

Clause M21. The method according to Clause M17, the operation domain further includes operation parameters,

Wherein, obtaining the scalar to be calculated and the target address required to execute the scalar instruction according to the operation code and the operation domain also includes:

Determining the operation parameter according to the operation domain;

Wherein, performing scalar operation on the scalar to be calculated according to the scalar operation type includes:

Perform a scalar operation on the scalar to be operated according to the operation parameter and the type of scalar operation.

Clause M22. The method according to Clause M17, the operation code is also used to indicate the scalar operation type, wherein determining the scalar operation type of the scalar instruction includes:

The type of scalar operation is determined according to the operation code.

Clause M23. The method according to Clause M17, the operation type includes at least one of the following:

Clause M24. The method according to Clause M17, the method further comprising:

Using the storage module of the device to store the scalar to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar to be calculated;

Clause M25. The method according to Clause M17, wherein parsing the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction includes:

Storing the compiled scalar instruction;

Parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the instructions to be executed include the compiled scalar instructions.

Clause M26. The method according to Clause M25, the method further comprising:

Clause M27. The method according to Clause M17, wherein compiling the obtained scalar instruction to obtain the compiled scalar instruction includes:

Generate an assembly file according to the scalar instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled scalar instruction.

Clause M28. A non-volatile computer-readable storage medium having computer program instructions stored thereon. When the computer program instructions are executed by a processor, the method described in any one of clauses M17 to M27 is implemented.

FIG. 14-1 shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 14-1, the device includes a control module 14-11 and an operation module 14-12.

The control module 14-11 is used to compile the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction, parse the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction, obtain, according to the operation code and the operation domain, the scalar to be calculated and the target address required to execute the scalar type conversion instruction, and determine the target data type and the initial data type of the scalar to be calculated. The operation code is used to indicate that the operation performed by the scalar type conversion instruction on the data is a scalar type conversion operation, and the operation domain includes the address of the scalar to be calculated and the target address.

The operation module 14-12 is configured to perform a scalar type conversion operation on the scalar to be operated of the initial data type according to the target data type, obtain an operation result, and store the operation result in the target address. Among them, the data type of the operation result is the target data type.

In this embodiment, the scalar type conversion instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module first needs to compile the (uncompiled) scalar type conversion instruction; only after the compiled scalar type conversion instruction is obtained can it be parsed. The compiled scalar type conversion instruction is a hardware instruction that can be directly executed by the hardware. The control module may obtain the scalar to be calculated from the address of the scalar to be calculated. The control module can obtain the scalar type conversion instruction and the scalar to be calculated through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) specified in a computer program to perform an operation; it is the instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction includes parameter data, the scalar to be operated, the corresponding operation method, and so on. A scalar type conversion instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the scalar to be operated and the target address.

It should be understood that those skilled in the art can set the instruction format of the scalar type conversion instruction as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module may receive a scalar type conversion instruction and control one or more operation modules to perform scalar type conversion operations. When the device includes multiple control modules, the multiple control modules may respectively receive scalar type conversion instructions and control the corresponding one or more operation modules to perform scalar type conversion operations.

A scalar type conversion instruction processing device provided by an embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction, parse the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction, obtain, according to the operation code and the operation domain, the scalar to be calculated and the target address required to execute the scalar type conversion instruction, and determine the target data type and the initial data type of the scalar to be calculated. The operation module is used to perform a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type, obtain the operation result, and store the operation result in the target address. The scalar type conversion instruction processing device provided by the embodiments of the present disclosure has a wide application range, and offers high processing efficiency and fast processing speed both for scalar type conversion instructions and for the scalar type conversion itself.

FIG. 14-2a shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 14-2a, the operation module 14-12 may include a plurality of scalar operators 14-120 for performing scalar type conversion operations.

In this implementation, the operation module may also include a single scalar operator. The number of scalar operators can be set according to the amount of data required to perform the scalar type conversion operation and the required processing speed and efficiency of the scalar type conversion operation, which is not limited in the present disclosure.

FIG. 14-2b shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 14-2b, the operation module 14-12 may include a master operation sub-module 14-121 and a plurality of slave operation sub-modules 14-122. The master operation sub-module 14-121 may include a plurality of scalar operators 14-120 (not shown in the figure).

The master operation sub-module 14-121 is used to perform a scalar type conversion operation using the plurality of scalar operators 14-120, obtain an operation result, and store the operation result in the target address.

In a possible implementation manner, the operation domain may further include an initial data type and a target data type. The control module 14-11 is also used to determine the target data type and the initial data type of the scalar to be calculated according to the operation domain.

In a possible implementation, the operation code can also be used to indicate the initial data type and the target data type. The control module 14-11 is also used to determine the target data type and the initial data type of the scalar to be calculated according to the operation code.

In a possible implementation manner, when the initial data type and/or the target data type cannot be determined according to the operation code or the operation domain, the initial data type and/or the target data type may be determined according to a preset default initial data type and/or a preset default target data type. That is, the preset default initial data type may be determined as the initial data type of the current scalar type conversion instruction, and the preset default target data type may be determined as the target data type of the current scalar type conversion instruction. A person skilled in the art may set the determination method of the target data type and the initial data type according to actual needs, which is not limited in the present disclosure.
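The fallback to preset defaults can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `resolve_types` and the particular default values are assumptions, since the disclosure leaves the defaults to the implementer.

```python
from typing import Optional, Tuple

# Hypothetical preset defaults; the disclosure does not fix these values.
DEFAULT_TARGET_TYPE = "cvtf16"
DEFAULT_INITIAL_TYPE = "u32"


def resolve_types(target: Optional[str], initial: Optional[str]) -> Tuple[str, str]:
    """Return (target_type, initial_type), using the preset default
    for whichever type could not be determined (passed as None)."""
    resolved_target = target if target is not None else DEFAULT_TARGET_TYPE
    resolved_initial = initial if initial is not None else DEFAULT_INITIAL_TYPE
    return resolved_target, resolved_initial


# Neither type could be determined from the operation code or operation domain.
both_missing = resolve_types(None, None)
# Only the initial data type is missing.
one_missing = resolve_types("cvti32", None)
```

Here `None` stands for "could not be determined from the operation code or the operation domain"; a type that was determined is passed through unchanged.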

In a possible implementation, the target data type may include any one of 16-bit floating-point numbers, 32-bit floating-point numbers, 48-bit floating-point numbers, 16-bit integers, 32-bit integers, and 48-bit integers. The initial data type may include any one of 16-bit signed numbers, 32-bit signed numbers, 48-bit signed numbers, 16-bit unsigned numbers, 32-bit unsigned numbers, 48-bit unsigned numbers, and pointer data types.

In this implementation, the target data type and the initial data type can also be data types such as 64-bit integers. Those skilled in the art can set the target data type and the initial data type according to actual needs, as long as the data types indicated by the target data type and the initial data type are different, and this disclosure does not limit this.

In this implementation, identifiers (or codes) such as numbers and names can be set for the above target data types and initial data types, so that the target data type and the initial data type indicated by a scalar type conversion instruction can be determined according to the identifiers (or codes) in the instruction. For example, the identifier of the 16-bit floating-point number can be set to cvtf16, that of the 32-bit floating-point number to cvtf32, that of the 48-bit floating-point number to cvtf48, that of the 16-bit integer to cvti16, that of the 32-bit integer to cvti32, and that of the 48-bit integer to cvti48. The identifier of the 16-bit signed number can be set to s16, that of the 32-bit signed number to s32, that of the 48-bit signed number to s48, that of the 16-bit unsigned number to u16, that of the 32-bit unsigned number to u32, that of the 48-bit unsigned number to u48, and that of the pointer data type to ptr. Those skilled in the art can set the identifiers of the target data type and the initial data type according to actual needs, and this disclosure does not limit this.
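The example identifiers above can be collected into two lookup tables. The sketch below is only illustrative: the table names and the helper `lookup_types` are hypothetical, and the identifiers are the example values from the text, which an implementer is free to change.

```python
from typing import Tuple

# Example identifiers from the text; a real design may choose others.
TARGET_TYPE_IDS = {
    "cvtf16": "16-bit floating-point number",
    "cvtf32": "32-bit floating-point number",
    "cvtf48": "48-bit floating-point number",
    "cvti16": "16-bit integer",
    "cvti32": "32-bit integer",
    "cvti48": "48-bit integer",
}
INITIAL_TYPE_IDS = {
    "s16": "16-bit signed number",
    "s32": "32-bit signed number",
    "s48": "48-bit signed number",
    "u16": "16-bit unsigned number",
    "u32": "32-bit unsigned number",
    "u48": "48-bit unsigned number",
    "ptr": "pointer data type",
}


def lookup_types(target_id: str, initial_id: str) -> Tuple[str, str]:
    """Map a pair of identifiers to the data types they indicate (hypothetical helper)."""
    if target_id not in TARGET_TYPE_IDS:
        raise ValueError(f"unknown target data type identifier: {target_id}")
    if initial_id not in INITIAL_TYPE_IDS:
        raise ValueError(f"unknown initial data type identifier: {initial_id}")
    return TARGET_TYPE_IDS[target_id], INITIAL_TYPE_IDS[initial_id]


example = lookup_types("cvtf16", "u32")
```

Rejecting unknown identifiers early is a design choice of this sketch; the default-type fallback described earlier would be an alternative.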

In a possible implementation manner, as shown in FIGS. 14-2a and 14-2b, the device may further include a storage module 14-13. The storage module 14-13 is used to store the scalar to be calculated.

In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the foregoing neuron random access memory, can be used to store neuron data in the data to be calculated, and the neuron data can include neuron vector data. Wherein, the data to be calculated includes data related to the conversion of the scalar type and / or data related to the calculation of other calculation instructions.

In a possible implementation manner, the control module 14-11 may also be used to generate an assembly file according to the scalar type conversion instruction and translate the assembly file into a binary file, where the binary file is a compiled scalar type conversion instruction.

In a possible implementation, the instruction format of the scalar type conversion instruction may be:

scalar dst src0 opcode.type

Among them, scalar is the operation code of the scalar type conversion instruction, dst, src0, opcode.type are the operation domain of the scalar type conversion instruction. Among them, dst is the target address. src0 is the scalar address to be calculated. opcode in opcode.type is the target data type, and type in opcode.type is the initial data type of the scalar to be calculated.
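A minimal parser for this first format might look like the sketch below. The names `ParsedInstruction` and `parse_format1` are hypothetical, and the sketch assumes whitespace-separated fields with decimal addresses, as in the application example of this disclosure.

```python
from typing import NamedTuple


class ParsedInstruction(NamedTuple):
    opcode: str        # operation code, "scalar" in this format
    dst: int           # target address
    src0: int          # address of the scalar to be calculated
    target_type: str   # the "opcode" part of opcode.type
    initial_type: str  # the "type" part of opcode.type


def parse_format1(text: str) -> ParsedInstruction:
    """Parse 'scalar dst src0 opcode.type' (assumed textual layout)."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    opcode, dst, src0, type_field = fields
    if opcode != "scalar":
        raise ValueError(f"unexpected operation code: {opcode}")
    target_type, initial_type = type_field.split(".")
    return ParsedInstruction(opcode, int(dst), int(src0), target_type, initial_type)


parsed = parse_format1("scalar 500 100 cvtf16.u32")
```

In this format the data types travel in the operation domain, so the operation code only has to identify the instruction as a scalar type conversion.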

In a possible implementation, the instruction format of the scalar type conversion instruction may also be:

opcode.scalar.type dst src0

Among them, opcode.scalar.type is the operation code of the scalar type conversion instruction, and dst and src0 are the operation domains of the scalar type conversion instruction. Among them, opcode in opcode.scalar.type is used to indicate the target data type, type in opcode.scalar.type is used to indicate the initial data type of the scalar to be calculated, and scalar in opcode.scalar.type is used to indicate that the instruction is Scalar type conversion instructions. dst is the target address, and src0 is the scalar address to be calculated.
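For this second format, where the data types are carried in the operation code itself, a parser could be sketched as follows. The function name `parse_format2` and the dictionary layout of the result are assumptions made for illustration; fields are assumed to be whitespace-separated with decimal addresses.

```python
from typing import Dict, Union


def parse_format2(text: str) -> Dict[str, Union[str, int]]:
    """Parse 'opcode.scalar.type dst src0' (assumed textual layout)."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    operation_code, dst, src0 = fields
    parts = operation_code.split(".")
    if len(parts) != 3 or parts[1] != "scalar":
        raise ValueError(f"not a scalar type conversion operation code: {operation_code}")
    target_type, _, initial_type = parts
    return {
        "target_type": target_type,    # the "opcode" part of the operation code
        "initial_type": initial_type,  # the "type" part of the operation code
        "dst": int(dst),               # target address
        "src0": int(src0),             # address of the scalar to be calculated
    }


parsed2 = parse_format2("cvtf16.scalar.u32 500 100")
```

Compared with the first format, the operation domain shrinks to the two addresses, at the cost of a larger set of distinct operation codes.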

It should be understood that those skilled in the art can set the operation code, as well as the positions of the operation code and the operation domain in the instruction format of the scalar type conversion instruction, as needed, and this disclosure does not limit this.

It should be noted that although the scalar type conversion instruction processing device is described above by taking the foregoing embodiments as examples, those skilled in the art can understand that the present disclosure is not limited thereto. In fact, the user can flexibly set each module according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following takes "a scalar type conversion instruction processing device performing a scalar type conversion operation" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to help explain the processing flow of the scalar type conversion instruction processing device. Those skilled in the art should understand that the following application example is only intended to facilitate understanding of the embodiments of the present disclosure and should not be considered a limitation on them.

FIG. 14-3 shows a schematic diagram of an application scenario of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 14-3, the scalar type conversion instruction processing device processes the scalar type conversion instruction as follows:

The control module 14-11 compiles the obtained scalar type conversion instruction 1 to obtain the compiled scalar type conversion instruction 1, and parses the compiled scalar type conversion instruction 1 (for example, scalar type conversion instruction 1 is scalar 500 100 cvtf16.u32) to obtain the operation code and operation domain of the scalar type conversion instruction 1. Here, the operation code of the scalar type conversion instruction 1 is scalar, the target address is 500, the address of the scalar to be calculated is 100, the target data type is cvtf16 (that is, a 16-bit floating-point number), and the initial data type of the scalar to be calculated is u32 (that is, a 32-bit unsigned number). The control module 14-11 acquires the scalar to be calculated from the address 100 of the scalar to be calculated.

The operation module 14-12 performs a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type (that is, converts the data type of the scalar to be calculated from a 32-bit unsigned number to a 16-bit floating-point number), obtains the operation result, and stores the operation result in the target address 500.

In this way, the scalar type conversion instruction processing device can efficiently and quickly process the scalar type conversion instruction, and the processing efficiency of the scalar type conversion is high and the processing speed is fast.
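The flow of this application example can be imitated in software as a rough sketch. Several points are assumptions rather than part of the disclosure: memory is modeled as a Python dictionary from address to value, the 16-bit floating-point number is taken to be IEEE 754 binary16, the result is stored as its raw 16-bit pattern, and the names `control_step` and `operation_step` are hypothetical. Only the u32 to 16-bit float case is covered.

```python
import struct
from typing import Dict, Tuple


def control_step(text: str, memory: Dict[int, int]) -> Tuple[int, int, str, str]:
    """Control module role: parse the instruction and fetch the scalar."""
    opcode, dst, src0, type_field = text.split()
    if opcode != "scalar":
        raise ValueError(f"unexpected operation code: {opcode}")
    target_type, initial_type = type_field.split(".")
    return memory[int(src0)], int(dst), target_type, initial_type


def operation_step(value: int, dst: int, target_type: str,
                   initial_type: str, memory: Dict[int, int]) -> None:
    """Operation module role: convert the scalar and store the result."""
    if (target_type, initial_type) != ("cvtf16", "u32"):
        raise NotImplementedError("only the u32 -> 16-bit float case is sketched")
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value is not a 32-bit unsigned number")
    # Assumption: binary16 encoding; struct rejects values too large for it.
    half_bytes = struct.pack("<e", float(value))
    memory[dst] = struct.unpack("<H", half_bytes)[0]


memory = {100: 7}  # address 100 holds the 32-bit unsigned scalar 7
operation_step(*control_step("scalar 500 100 cvtf16.u32", memory), memory)
# memory[500] now holds 0x4700, the binary16 bit pattern of 7.0
```

Splitting the work into two functions mirrors the division between the control module and the operation module; the scalar at address 100 is left unchanged.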

FIG. 14-4 shows a flowchart of a scalar type conversion instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-14 and S52-14. As shown in FIG. 14-4, the method is applied to the above scalar type conversion instruction processing device, and the method includes steps S51-14 and S52-14.

In step S51-14, the control module is used to compile the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction, parse the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction, obtain, according to the operation code and the operation domain, the scalar to be calculated and the target address required to execute the scalar type conversion instruction, and determine the target data type and the initial data type of the scalar to be calculated. The operation code is used to indicate that the operation performed by the scalar type conversion instruction on the data is a scalar type conversion operation, and the operation domain includes the address of the scalar to be calculated and the target address.

In step S52-14, the operation module is used to perform a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type to obtain the operation result, and the operation result is stored in the target address, where the data type of the operation result is the target data type.

In a possible implementation manner, performing a scalar type conversion operation on the scalar to be operated on the initial data type according to the target data type may include:

Use multiple scalar operators in the arithmetic module to perform scalar type conversion operations.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module includes multiple scalar operators. Wherein, step S52-14 may include:

Use multiple scalar operators in the main operation sub-module to perform scalar type conversion operations, obtain the operation results, and store the operation results in the target address.

In a possible implementation manner, the operation domain may further include an initial data type and a target data type, and step S51-14 may include: determining the target data type and the initial data type of the scalar to be calculated according to the operation domain.

In a possible implementation, the operation code is also used to indicate the initial data type and the target data type. Step S51-14 may include: determining the target data type and the initial data type of the scalar to be calculated according to the operation code.

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar to be calculated;

In a possible implementation manner, step S51-14 may include:

Store the compiled scalar type conversion instructions;

Analyze the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order. The plurality of instructions to be executed may include a compiled scalar type conversion instruction.
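The in-order behavior of the instruction queue can be sketched with a first-in, first-out structure. This is only an illustration: using `collections.deque`, representing instructions as text strings, and the second instruction in the queue are all assumptions of the sketch.

```python
from collections import deque

# Instructions to be executed, arranged in execution order (hypothetical contents).
instruction_queue = deque()
instruction_queue.append("scalar 500 100 cvtf16.u32")
instruction_queue.append("scalar 600 200 cvti32.s16")

executed = []
while instruction_queue:
    # popleft() removes the oldest entry, so the execution order is preserved.
    executed.append(instruction_queue.popleft())
```

A deque is chosen here because appending at one end and removing from the other are both constant-time operations, which suits a queue that is drained in order.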

In a possible implementation manner, the method may further include:

The first storage address section storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address section storing data required for the zeroth instruction to be executed.
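The overlap condition between two storage address sections can be checked with a simple interval test. The sketch below assumes each section is given as an inclusive (start, end) pair of addresses; this representation and the name `sections_overlap` are not specified by the disclosure.

```python
from typing import Tuple

# (start, end) addresses, both inclusive; this representation is an assumption.
Section = Tuple[int, int]


def sections_overlap(first: Section, zeroth: Section) -> bool:
    """Return True if the two storage address sections share at least one address."""
    first_start, first_end = first
    zeroth_start, zeroth_end = zeroth
    return first_start <= zeroth_end and zeroth_start <= first_end


overlapping = sections_overlap((100, 103), (102, 105))  # addresses 102 and 103 are shared
disjoint = sections_overlap((100, 103), (104, 107))     # no shared address
```

Two inclusive intervals overlap exactly when each one starts no later than the other ends, which is the condition the function tests.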

In a possible implementation manner, compiling the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction may include:

The assembly file is generated according to the scalar type conversion instruction, and the assembly file is translated into a binary file. Among them, the binary file is a compiled scalar type conversion instruction.

It should be noted that although the scalar type conversion instruction processing method is described above by taking the foregoing embodiments as examples, those skilled in the art can understand that the present disclosure is not limited thereto. In fact, the user can flexibly set each step according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The scalar type conversion instruction processing method provided by the embodiments of the present disclosure has a wide application range, and offers high processing efficiency and fast processing speed both for scalar type conversion instructions and for the scalar type conversion itself.

The foregoing can be better understood based on the following clauses:

Clause N1, a scalar type conversion instruction processing device, the device comprising:

The control module is used to compile the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction, parse the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction, obtain, according to the operation code and the operation domain, the scalar to be calculated and the target address required to execute the scalar type conversion instruction, and determine the target data type and the initial data type of the scalar to be calculated;

The operation module is configured to perform a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type, obtain an operation result, and store the operation result in the target address, the data type of the operation result being the target data type,

Wherein, the operation code is used to indicate that the operation performed by the scalar type conversion instruction on the data is a scalar type conversion operation, and the operation field includes a scalar address to be operated and the target address.

Clause N2. The device according to Clause N1, the operation module includes:

A plurality of scalar operators are used to perform the scalar type conversion operation.

Clause N3. The device according to Clause N2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of scalar operators,

The main operation submodule is configured to perform the scalar type conversion operation using the plurality of scalar operators, obtain an operation result, and store the operation result in the target address.

Clause N4. The device according to Clause N1, the operation domain further includes an initial data type and a target data type,

Wherein, the control module is also used to determine the target data type and the initial data type of the scalar to be calculated according to the operation domain.

Clause N5. The device according to Clause N1, the operation code is also used to indicate an initial data type and a target data type,

Wherein, the control module is also used to determine the target data type and the initial data type of the scalar to be calculated according to the operation code.

Clause N6. The device according to Clause N1, the target data type includes any one of a 16-bit floating point number, a 32-bit floating point number, a 48-bit floating point number, a 16-bit integer, a 32-bit integer, and a 48-bit integer. The initial data types include any of 16-bit signed numbers, 32-bit signed numbers, 48-bit signed numbers, 16-bit unsigned numbers, 32-bit unsigned numbers, 48-bit unsigned numbers, and pointer data types.

Clause N7. The device according to Clause N1, the device further comprising:

A storage module for storing the scalar to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar to be calculated;

Clause N8. The device according to Clause N1, the control module includes:

An instruction storage sub-module for storing the compiled scalar type conversion instruction;

An instruction processing submodule, used for parsing the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction;

The queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the plurality of instructions to be executed include the compiled scalar type conversion instruction.

Clause N9. The device according to Clause N8, the control module, further comprising:

Clause N10, the device according to Clause N1,

The control module is also used to generate an assembly file according to the scalar type conversion instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled scalar type conversion instruction.

Clause N11. A machine learning computing device, the device comprising:

One or more scalar type conversion instruction processing devices as described in any one of Clauses N1 to N10, used to obtain the scalar to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to other processing devices through the I/O interface;

When the machine learning computing device includes a plurality of the scalar type conversion instruction processing devices, the plurality of the scalar type conversion instruction processing devices can be connected and transmit data through a specific structure;

Wherein, a plurality of the scalar type conversion instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale machine learning operations; a plurality of the scalar type conversion instruction processing devices share the same control system or have their own respective control systems; a plurality of the scalar type conversion instruction processing devices share memory or have their own memories; the interconnection manner of the plurality of scalar type conversion instruction processing devices is an arbitrary interconnection topology.

Clause N12. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause N11, a general interconnection interface, and other processing devices;

Clause N13. A machine learning chip, the machine learning chip includes:

The machine learning computing device according to Clause N11 or the combined processing device according to Clause N12.

Clause N14. An electronic device, the electronic device comprising:

Machine learning chip as described in clause N13.

Clause N15. A board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause N13;

The storage device is used for storing data;

Clause N16. A scalar type conversion instruction processing method. The method is applied to a scalar type conversion instruction processing device. The device includes a control module and an arithmetic module. The method includes:

The control module is used to compile the obtained scalar type conversion instruction to obtain a compiled scalar type conversion instruction, parse the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction, obtain, according to the operation code and the operation domain, the scalar to be calculated and the target address required to execute the scalar type conversion instruction, and determine the target data type and the initial data type of the scalar to be calculated;

The operation module performs a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type to obtain an operation result, and stores the operation result in the target address, the data type of the operation result being the target data type,

Clause N17. The method according to Clause N16, wherein performing a scalar type conversion operation on the scalar to be calculated of the initial data type according to the target data type includes:

The scalar type conversion operation is performed by using multiple scalar operators in the arithmetic module.

Clause N18. The method according to Clause N17, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of scalar operators,

Wherein, performing a scalar type conversion operation on the scalar to be operated of the initial data type according to the target data type to obtain an operation result, and storing the operation result in the target address includes:

Using the plurality of scalar operators in the master operation sub-module to perform the scalar type conversion operation to obtain an operation result, and storing the operation result in the target address.

Clause N19. The method according to Clause N16, the operation domain further includes an initial data type and a target data type,

Wherein, determining the target data type and the initial data type of the scalar to be calculated includes:

The target data type and the initial data type of the scalar to be calculated are determined according to the operation domain.

Clause N20. The method according to Clause N16, the operation code is also used to indicate the initial data type and the target data type,

The target data type and the initial data type of the scalar to be calculated are determined according to the operation code.

Clause N21. The method according to Clause N16, wherein the target data type includes any one of a 16-bit floating point number, a 32-bit floating point number, a 48-bit floating point number, a 16-bit integer, a 32-bit integer, and a 48-bit integer, and the initial data type includes any one of a 16-bit signed number, a 32-bit signed number, a 48-bit signed number, a 16-bit unsigned number, a 32-bit unsigned number, a 48-bit unsigned number, and a pointer data type.

Clause N22. The method according to Clause N17, the method further comprising:

Using the storage module of the device to store the scalar to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar to be calculated;

Clause N23. The method according to Clause N16, wherein parsing the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction includes:

Storing the compiled scalar type conversion instruction;

Parsing the compiled scalar type conversion instruction to obtain the operation code and operation domain of the scalar type conversion instruction;

Storing an instruction queue, wherein the instruction queue includes a plurality of instructions to be executed that are arranged sequentially according to an execution order, and the plurality of instructions to be executed include the compiled scalar type conversion instruction.

Clause N24. The method according to Clause N23, the method further comprising:

Clause N25. The method according to Clause N16, wherein compiling the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction includes:

Generating an assembly file according to the scalar type conversion instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled scalar type conversion instruction.

Clause N26. A non-volatile computer-readable storage medium having computer program instructions stored thereon. When the computer program instructions are executed by a processor, the method of any one of Clause N16 to Clause N25 is implemented.

Due to the extensive use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. Programming languages are diverse, and in the related art there is currently no address fetch instruction that can be widely applied across programming languages. To implement address fetch processing in a given language environment, technicians therefore need to customize instructions specific to that programming language environment, which makes address fetch processing inefficient and slow. The present disclosure provides an address fetch instruction processing method, device, computer equipment, and storage medium, in which address fetch processing can be implemented with only one instruction, which can significantly improve the efficiency and speed of address fetch processing.

FIG. 15-1 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 15-1, the device includes a control module 15-11 and a processing module 15-12.

The control module 15-11 is used to compile the obtained address fetch instruction to obtain the compiled address fetch instruction, parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction, and obtain, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction. Among them, the operation code is used to indicate that the processing performed by the address fetch instruction on the data is address fetch processing, and the operation domain includes the target address and the initial address at which the address data to be stored is stored.

The processing module 15-12 is configured to process the address data to be stored, obtain the processed address data to be stored, and store the processed address data to be stored in the target address.

In this embodiment, the address data to be stored may be data representing one address to be stored or a plurality of addresses to be stored. The address fetch processing indicated by the address fetch instruction may be to obtain the address data to be stored and re-store it, so that the address data to be stored can be obtained at the new address, and the data located at the address recorded in the address data to be stored can then be obtained.
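As a non-limiting illustration, the re-storing behavior described above can be sketched in Python. The sketch assumes a single flat memory modeled as a dictionary and models the processing step as an unchanged copy; the function name `fetch_address` and the concrete addresses and values are hypothetical and are not taken from the present disclosure.

```python
def fetch_address(memory, initial_address, target_address):
    """Re-store the address data found at initial_address into target_address.

    Assumption: the "processing" of the address data is modeled as a plain copy.
    """
    address_data = memory[initial_address]   # address data to be stored
    memory[target_address] = address_data    # re-store it at the target address
    return address_data


# Hypothetical contents: cell 100 holds the address 7, and cell 7 holds a payload.
memory = {7: "payload", 100: 7}
fetch_address(memory, initial_address=100, target_address=500)

# The address data is now available at the new address, and the data located at
# the address it records can be reached through it.
pointed_to = memory[memory[500]]
```

In this sketch, reading cell 500 yields the same address data as cell 100, so either location can be used to reach the data at the recorded address.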

In this embodiment, the address fetch instruction obtained by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module needs to first compile the (uncompiled) address fetch instruction. After the compiled address fetch instruction is obtained, it can be parsed. The compiled address fetch instruction is a hardware instruction that can be directly executed by hardware. The control module may obtain the address data to be stored from the initial address at which the address data to be stored is stored. The control module can obtain the address fetch instruction and the address data to be stored through the data input / output unit. The data input / output unit may be one or more data I / O interfaces or I / O pins.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) specified in the computer program to perform the operation, and is the instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction includes the address data to be stored, the initial address at which the address data to be stored is stored, the target address, and so on. An address fetch instruction must include an operation code and an operation domain, where the operation domain includes at least the target address and the initial address at which the address data to be stored is stored.

It should be understood that those skilled in the art can set the instruction format of the address fetch instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module may receive an address fetch instruction and control one or more processing modules to perform address fetch processing. When the device includes multiple control modules, the multiple control modules may respectively receive address fetch instructions and control the corresponding one or more processing modules to perform address fetch processing.

The address fetch instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to compile the obtained address fetch instruction to obtain the compiled address fetch instruction, parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction, and obtain, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction. The processing module is used to process the address data to be stored, obtain the processed address data to be stored, and store the processed address data to be stored in the target address. The address fetch instruction processing device provided by the embodiments of the present disclosure has a wide range of applications, processes address fetch instructions efficiently and quickly, and performs address fetch processing efficiently and quickly.

FIG. 15-2 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 15-2, the processing module 15-12 may include a master processing sub-module 15-121 and multiple slave processing sub-modules 15-122.

The master processing sub-module 15-121 is used to process the address data to be stored, obtain the processed address data to be stored, and store the processed address data to be stored in the target address.

In a possible implementation manner, the operation domain may further include an initial storage space identifier and a target storage space identifier. The control module 15-11 is also used to determine the initial storage space identifier, the target storage space identifier, the initial address, and the target address according to the operation domain, and obtain the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier. Wherein, storing the processed address data to be stored in the target address may include: storing the processed address data to be stored in the target address of the target storage space identified by the target storage space identifier.

In this implementation manner, the initial storage space identifier may be an identifier indicating the initial storage space, such as the number or name of the initial storage space. The target storage space identifier may be an identifier indicating the target storage space, such as the number or name of the target storage space. The target storage space may be different from the initial storage space. For example, the target storage space may be a storage space such as a cache of the device, and the initial storage space may be a storage space other than the cache in the device, such as the NRAM, WRAM, or DDR of the device. Among them, NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (Carbon Nanotube, CNT for short). WRAM (Window RAM) is a type of VRAM (Video RAM, video random access memory). DDR (DDR SDRAM) is double data rate synchronous dynamic random access memory. The target storage space may also be the same as the initial storage space. In either case, the storage location of the address data to be stored can be changed or added based on the address fetch instruction.

In a possible implementation, the operation code may also be used to indicate the initial storage space identifier and the target storage space identifier. The control module 15-11 is also used to determine the initial storage space identifier and the target storage space identifier according to the operation code, determine the initial address and the target address according to the operation domain, and obtain the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier. Wherein, storing the processed address data to be stored in the target address may include: storing the processed address data to be stored in the target address of the target storage space identified by the target storage space identifier.

In a possible implementation manner, the initial storage space may also be marked in the initial address, so that the control module can obtain the address data to be stored from that initial storage space according to the initial address. The target storage space may also be marked in the target address, so that the control module can determine the target address and the target storage space from the operation domain, and the processing module can store the processed address data to be stored in the target address of the target storage space.

In a possible implementation manner, a default initial storage space and a default target storage space may be preset. When the initial storage space and / or the target storage space cannot be determined according to the operation domain or operation code of the address fetch instruction, the default initial storage space can be determined as the initial storage space where the initial address of the current address fetch instruction is located, and the default target storage space can be determined as the target storage space where the target address of the current address fetch instruction is located.
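As a non-limiting illustration, the fallback to preset default storage spaces can be sketched in Python as follows. The default identifiers `"nram"` and `"cache"` and the function name `resolve_spaces` are assumptions made for the sketch; the present disclosure does not prescribe which storage spaces serve as defaults.

```python
DEFAULT_INITIAL_SPACE = "nram"   # hypothetical preset default initial storage space
DEFAULT_TARGET_SPACE = "cache"   # hypothetical preset default target storage space


def resolve_spaces(initial_space=None, target_space=None):
    """Return (initial_space, target_space), substituting the preset default
    for any identifier that could not be determined from the instruction."""
    if initial_space is None:
        initial_space = DEFAULT_INITIAL_SPACE
    if target_space is None:
        target_space = DEFAULT_TARGET_SPACE
    return initial_space, target_space


# Neither identifier could be determined: both defaults apply.
both_defaults = resolve_spaces()
# Only the initial storage space was determined: the default target space applies.
one_default = resolve_spaces(initial_space="ddr")
```

Passing `None` stands for an identifier that could not be determined from the operation code or operation domain; an identifier that was determined is returned unchanged.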

In a possible implementation manner, as shown in FIG. 15-2, the device may further include a storage module 15-13. The storage module 15-13 is used to store the address data to be stored.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store data to be calculated and address data to be stored. The register is used to store the scalar data in the data to be calculated. The data to be calculated includes data related to the execution of the above calculation instruction and / or address fetch instruction.

In a possible implementation manner, the control module 15-11 may also be used to generate an assembly file according to the address fetch instruction and translate the assembly file into a binary file, where the binary file is a compiled address fetch instruction.
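As a non-limiting illustration, the two compilation stages (generating assembly text, then translating it into binary) can be sketched in Python. The binary encoding is entirely hypothetical: the opcode number `0x01`, the 9-byte little-endian layout, and the function names `to_assembly`, `assemble`, and `disassemble` are assumptions made for the sketch and are not defined by the present disclosure.

```python
import struct

LDA_OPCODE = 0x01   # hypothetical opcode number for the address fetch instruction
LAYOUT = "<BII"     # hypothetical layout: 1-byte opcode, two 32-bit addresses


def to_assembly(dst, src0):
    """Generate one line of assembly text for an address fetch instruction."""
    return f"lda {dst} {src0}"


def assemble(line):
    """Translate one line of assembly text into its binary form."""
    mnemonic, dst, src0 = line.split()
    if mnemonic != "lda":
        raise ValueError(f"unsupported mnemonic: {mnemonic}")
    return struct.pack(LAYOUT, LDA_OPCODE, int(dst), int(src0))


def disassemble(blob):
    """Decode the binary form back into (opcode, dst, src0)."""
    return struct.unpack(LAYOUT, blob)


binary = assemble(to_assembly(500, 100))
```

The round trip through `disassemble` recovers the opcode and both addresses, which is the property the later parsing step relies on.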

In a possible implementation, the instruction format of the address fetch instruction may be:

lda.space1.space2 dst src0

Among them, lda.space1.space2 is the operation code of the address fetch instruction, and dst and src0 are the operation domain of the address fetch instruction. dst is the target address, and src0 is the initial address at which the address data to be stored is stored. In lda.space1.space2, lda is used to indicate that the instruction is an address fetch instruction, space1 is the target storage space identifier, and space2 is the initial storage space identifier.

In a possible implementation, the instruction format of the address fetch instruction may also be:

lda dst src0 space1 space2

Among them, lda is the operation code of the address fetch instruction, and dst, src0, space1, and space2 are the operation domain of the address fetch instruction. lda is used to indicate that the instruction is an address fetch instruction. dst is the target address, and src0 is the initial address at which the address data to be stored is stored. space1 is the target storage space identifier, and space2 is the initial storage space identifier.

It should be understood that those skilled in the art can set the operation code of the address fetch instruction, the position of the operation code and the operation field in the instruction format according to need, and the disclosure does not limit this.
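As a non-limiting illustration, the two instruction formats above can be parsed into one common record, as sketched in Python below. The record type `AddressFetch`, the function name `parse`, and the identifiers `a` and `b` in the usage lines are hypothetical; the field names follow the dst, src0, space1, and space2 naming of the formats above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressFetch:
    dst: int                       # target address
    src0: int                      # initial address of the address data to be stored
    space1: Optional[str] = None   # target storage space identifier
    space2: Optional[str] = None   # initial storage space identifier


def parse(text):
    """Parse either textual format into an AddressFetch record."""
    opcode, *operands = text.split()
    name, *suffix = opcode.split(".")
    if name != "lda":
        raise ValueError(f"not an address fetch instruction: {text!r}")
    if len(suffix) == 2 and len(operands) == 2:
        # Format "lda.space1.space2 dst src0": identifiers carried by the operation code.
        space1, space2 = suffix
        dst, src0 = operands
    elif not suffix and len(operands) == 4:
        # Format "lda dst src0 space1 space2": identifiers carried by the operation domain.
        dst, src0, space1, space2 = operands
    else:
        raise ValueError(f"unrecognized instruction format: {text!r}")
    return AddressFetch(int(dst), int(src0), space1, space2)


from_opcode = parse("lda.a.b 500 100")
from_domain = parse("lda 500 100 a b")
```

Both calls yield the same record, which reflects that the two formats differ only in where the storage space identifiers are carried.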

It should be noted that, although the address fetch instruction processing apparatus is described above by taking the above embodiments as examples, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set the various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure will be given in conjunction with “using address fetch instruction processing apparatus for address fetch processing” as an exemplary application scenario, so as to facilitate understanding of the flow of the address fetch instruction processing apparatus. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIGS. 15-3a and 15-3b show schematic diagrams of application scenarios of an address fetch instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 15-3a and FIG. 15-3b, the address fetch instruction processing device processes the address fetch instruction as follows:

Example one

As shown in FIG. 15-3a, the control module 15-11 compiles the obtained address fetch instruction 1 to obtain the compiled address fetch instruction 1, and parses the compiled address fetch instruction 1 (for example, address fetch instruction 1 is lda.n1.g1 500 100) to obtain the operation code and operation domain of address fetch instruction 1. The operation code of address fetch instruction 1 is lda.n1.g1, and the initial storage space identifier n1 and the target storage space identifier g1 can be determined according to the operation code lda.n1.g1. The target address is 500, and the initial address at which the address data to be stored is stored is 100. The control module 15-11 acquires the address data 1 to be stored from the initial address 100 in the initial storage space identified by the initial storage space identifier n1.

The processing module 15-12 processes the address data 1 to be stored, obtains the processed address data 1 to be stored, and stores the processed address data 1 to be stored in the target address 500 of the target storage space identified by the target storage space identifier g1.

Example two

As shown in FIG. 15-3b, the control module 15-11 compiles the obtained address fetch instruction 2 to obtain the compiled address fetch instruction 2, and parses the compiled address fetch instruction 2 (for example, address fetch instruction 2 is lda 501 101 n2 g2) to obtain the operation code and operation domain of address fetch instruction 2. Among them, the operation code of address fetch instruction 2 is lda. The target address is 501, the initial address at which the address data to be stored is stored is 101, the initial storage space identifier is n2, and the target storage space identifier is g2. The control module 15-11 acquires the address data 2 to be stored from the initial address 101 in the initial storage space identified by the initial storage space identifier n2.

The processing module 15-12 processes the address data 2 to be stored, obtains the processed address data 2 to be stored, and stores the processed address data 2 to be stored in the target address 501 of the target storage space identified by the target storage space identifier g2.

In this way, the address fetch instruction processing device can process the address fetch instruction efficiently and quickly, and the address fetch processing is efficient and fast.
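As a non-limiting illustration, the two application examples can be replayed in Python with the addresses and storage space identifiers given above. The storage spaces are modeled as nested dictionaries, the processing step is modeled as an unchanged copy, and the stored values `0x2000` and `0x3000` as well as the function name `run_address_fetch` are hypothetical.

```python
def run_address_fetch(spaces, initial_space, initial_address, target_space, target_address):
    """Copy the address data from an initial storage space to a target storage space.

    Assumption: the processing of the address data is modeled as an unchanged copy.
    """
    address_data = spaces[initial_space][initial_address]
    spaces[target_space][target_address] = address_data
    return address_data


# Hypothetical contents of the initial storage spaces n1 and n2.
spaces = {
    "n1": {100: 0x2000},
    "n2": {101: 0x3000},
    "g1": {},
    "g2": {},
}

# Address fetch instruction 1: initial space n1, address 100; target space g1, address 500.
run_address_fetch(spaces, "n1", 100, "g1", 500)
# Address fetch instruction 2: initial space n2, address 101; target space g2, address 501.
run_address_fetch(spaces, "n2", 101, "g2", 501)
```

After both calls, the address data originally held in n1 and n2 is also available at addresses 500 and 501 of g1 and g2 respectively, while the initial storage spaces are left unchanged.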

FIG. 15-4 shows a flowchart of an address fetch instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used during the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-15 and S52-15. As shown in FIG. 15-4, this method is applied to the above address fetch instruction processing device and includes steps S51-15 and S52-15.

In step S51-15, the control module is used to compile the obtained address fetch instruction to obtain the compiled address fetch instruction, parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction, and obtain, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction. The operation code is used to indicate that the processing performed by the address fetch instruction on the data is address fetch processing, and the operation domain includes the target address and the initial address at which the address data to be stored is stored.

In step S52-15, the processing module is used to process the address data to be stored to obtain the processed address data to be stored, and the processed address data to be stored is stored in the target address.

In a possible implementation, the processing module includes a master processing sub-module and multiple slave processing sub-modules. Wherein, step S52-15 may include:

Using the master processing sub-module to process the address data to be stored to obtain the processed address data to be stored, and storing the processed address data to be stored in the target address.

In a possible implementation manner, the operation domain may further include an initial storage space identifier and a target storage space identifier. Wherein, obtaining the address data to be stored and the target address required to execute the address fetch instruction according to the operation code and the operation domain may include: determining the initial storage space identifier, the target storage space identifier, the initial address, and the target address according to the operation domain, and obtaining the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier.

Wherein, storing the processed address data to be stored in the target address may include: storing the processed address data to be stored in the target address of the target storage space identified by the target storage space identifier.

In a possible implementation, the operation code is also used to indicate the initial storage space identifier and the target storage space identifier. Wherein, obtaining the address data to be stored and the target address required to execute the address fetch instruction according to the operation code and the operation domain may include: determining the initial storage space identifier and the target storage space identifier according to the operation code, determining the initial address and the target address according to the operation domain, and obtaining the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier.

In a possible implementation manner, the method may further include: using the storage module of the device to store the address data to be stored,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated and the address data to be stored, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated.

In a possible implementation manner, steps S51-15 may include:

Storing the compiled address fetch instruction;

Parsing the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged sequentially according to the execution order, and the plurality of instructions to be executed may include the compiled address fetch instruction.
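As a non-limiting illustration, the storing, parsing, and queueing sub-steps above can be sketched in Python. The class name `ControlModule` and its methods are hypothetical, and compiled instructions are represented as text strings purely for readability, which is an assumption of the sketch rather than the binary form described in the present disclosure.

```python
from collections import deque


class ControlModule:
    """Hypothetical model of instruction storage, parsing, and queueing."""

    def __init__(self):
        self.instruction_store = []        # stores the compiled instructions
        self.instruction_queue = deque()   # instructions to be executed, in execution order

    def submit(self, compiled_instruction):
        """Store a compiled instruction, parse it, and append it to the queue."""
        self.instruction_store.append(compiled_instruction)
        opcode, *operation_domain = compiled_instruction.split()
        self.instruction_queue.append((opcode, operation_domain))

    def next_instruction(self):
        """Return the next instruction to be executed, or None if the queue is empty."""
        return self.instruction_queue.popleft() if self.instruction_queue else None


control = ControlModule()
control.submit("lda 500 100")
control.submit("lda 501 101")
first = control.next_instruction()
second = control.next_instruction()
```

A double-ended queue is used here because instructions are appended at one end and taken from the other, which preserves the execution order in which they were submitted.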

In a possible implementation manner, the method may further include:

In a possible implementation manner, compiling the obtained address fetch instruction to obtain the compiled address fetch instruction may include:

Generating an assembly file according to the address fetch instruction, and translating the assembly file into a binary file. Among them, the binary file is the compiled address fetch instruction.

It should be noted that although the address fetch instruction processing method is described above by taking the above embodiments as examples, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set the various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The address fetch instruction processing method provided by the embodiments of the present disclosure has a wide range of applications, processes address fetch instructions efficiently and quickly, and performs address fetch processing efficiently and quickly.

The foregoing can be better understood based on the following clauses:

Clause O1. An address fetch instruction processing device, the device comprising:

A control module, used to compile the obtained address fetch instruction to obtain the compiled address fetch instruction, parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction, and acquire, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction;

A processing module, configured to process the address data to be stored, obtain the processed address data to be stored, and store the processed address data to be stored in the target address,

Wherein, the operation code is used to indicate that the processing performed by the address fetch instruction on the data is address fetch processing, and the operation domain includes the target address and an initial address at which the address data to be stored is stored.

Clause O2. The device according to Clause O1, the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,

The master processing sub-module is configured to process the address data to be stored to obtain the processed address data to be stored, and store the processed address data to be stored in the target address.

Clause O3. The device according to Clause O1, the operation domain further includes an initial storage space identifier and a target storage space identifier,

Wherein, the control module is further configured to determine the initial storage space identifier, the target storage space identifier, the initial address and the target address according to the operation domain, and obtain the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier;

Wherein, storing the processed address data to be stored in the target address includes:

Storing the processed address data to be stored in the target address of the target storage space identified by the target storage space identifier.

Clause O4. The device according to Clause O1, the operation code is further used to indicate an initial storage space identifier and a target storage space identifier,

Wherein, the control module is further configured to determine the initial storage space identifier and the target storage space identifier according to the operation code, determine the initial address and the target address according to the operation domain, and acquire the address data to be stored from the initial address of the initial storage space identified by the initial storage space identifier;

Clause O5. The device according to Clause O1, the device further comprising:

A storage module, used to store the address data to be stored,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated and the address data to be stored, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause O6. The device according to Clause O1, the control module includes:

An instruction storage sub-module for storing the compiled address fetch instruction;

An instruction processing sub-module, which is used to parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction;

The queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled address fetch instruction.

Clause O7. The device according to Clause O6, the control module, further comprising:

A dependency processing sub-module, used to, when it is determined that there is an association relationship between a first instruction to be executed among the plurality of instructions to be executed and a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the processing module.

Clause O8. The device according to Clause O1,

The control module is also used to generate an assembly file according to the address fetch instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled address fetch instruction.

Clause O9. A machine learning computing device, the device comprising:

One or more address fetch instruction processing devices as described in any one of Clauses O1 to O8, used to obtain the address data to be stored and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through the I / O interface;

When the machine learning arithmetic device includes a plurality of address fetch instruction processing devices, the plurality of address fetch instruction processing devices may be connected and transmit data through a specific structure;

Among them, the plurality of address fetch instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of address fetch instruction processing devices share the same control system or have their own control systems; the plurality of address fetch instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of address fetch instruction processing devices is an arbitrary interconnection topology.

Clause O10. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause O9, a general interconnection interface, and other processing devices;

Clause O11. A machine learning chip, the machine learning chip including:

The machine learning arithmetic device according to clause O9 or the combined processing device according to clause O10.

Clause O12. An electronic device, the electronic device comprising:

Machine learning chip as described in clause O11.

Clause O13. A board card, including: a storage device, an interface device, a control device, and the machine learning chip as described in Clause O11;

The storage device is used for storing data;

Clause O14. An address fetch instruction processing method. The method is applied to an address fetch instruction processing device. The device includes a control module and a processing module. The method includes:

The control module is used to compile the obtained address fetch instruction to obtain a compiled address fetch instruction, parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction, and, according to the operation code and the operation domain, acquire the address data to be stored and the target address required to execute the address fetch instruction;

Use the processing module to process the address data to be stored to obtain the processed address data to be stored, and store the processed address data to be stored in the target address,

Clause O15. The method according to Clause O14, the processing module includes a master processing submodule and a plurality of slave processing submodules,

Wherein, processing the address data to be stored to obtain processed address data to be stored, and storing the processed address data to be stored in the target address includes:

The address data to be stored is processed to obtain processed address data to be stored, and the processed address data to be stored is stored in the target address.

Clause O16. The method according to Clause O14, the operation domain further includes an initial storage space identifier and a target storage space identifier,

Wherein, acquiring the address data to be stored and the target address required to execute the address fetch instruction according to the operation code and the operation domain includes:

Determining the initial storage space identifier, the target storage space identifier, the initial address, and the target address according to the operation domain, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier;

Clause O17. The method according to Clause O14, the operation code is further used to indicate the initial storage space identifier and the target storage space identifier,

Determining the initial storage space identifier and the target storage space identifier according to the operation code, determining the initial address and the target address according to the operation domain, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier;

Clause O18. The method according to Clause O14, the method further comprising:

Using the storage module of the device to store the address data to be stored,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause O19. The method according to Clause O14, wherein parsing the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction includes:

Store the compiled address fetch instruction;

Parse the compiled address fetch instruction to obtain the operation code and operation domain of the address fetch instruction;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled address fetch instruction.

Clause O20. The method according to Clause O19, the method further comprising:

Clause O21. The method according to Clause O14, wherein compiling the obtained address fetch instruction to obtain the compiled address fetch instruction includes:

Generate an assembly file according to the address fetch instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled address fetch instruction.

Clause O22. A non-volatile computer-readable storage medium having computer program instructions stored thereon. When the computer program instructions are executed by a processor, the method according to any one of clauses O14 to O21 is implemented.

With the extensive use of neural network algorithms and the continuous improvement of computer hardware computing capability, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is at this stage no scalar data migration instruction that can be widely applied across programming languages. To achieve scalar data migration in a given language environment, technicians need to customize one or more instructions for that environment, which makes scalar data migration inefficient and slow. The present disclosure provides a scalar data migration instruction processing method, device, computer equipment, and storage medium that can achieve scalar data migration with a single instruction, which can significantly improve the efficiency and speed of scalar data migration.

FIG. 16-1 shows a block diagram of a scalar data migration instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 16-1, the device includes a control module 16-11 and a processing module 16-12.

The control module 16-11 is used to compile the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction, acquire, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction, and determine the migration parameters required for the migration processing. The operation code is used to indicate that the processing performed on the scalar data by the scalar data migration instruction is migration processing. The operation domain includes the address of the scalar data to be migrated and the target address. The migration parameters may include the initial storage space where the address of the scalar data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

The processing module 16-12, according to the migration parameters, stores the scalar data to be migrated into the target address.

In this embodiment, there may be one or more scalar data to be migrated. The migration type may indicate the storage speed of the scalar data in the initial storage space, the storage speed of the scalar data in the target storage space, and the speed relationship between the storage speeds of the two. In the scalar data migration instruction, different codes can be set for the storage speed relationship between different target storage spaces and the initial storage space to distinguish the storage speed. For example, the code whose migration type is "the storage speed of the initial storage space is greater than the storage speed of the target storage space" can be set to "st". The code whose migration type is "the storage speed of the initial storage space is equal to the storage speed of the target storage space" can be set to "mv". The code whose migration type is "the storage speed of the initial storage space is less than the storage speed of the target storage space" can be set to "ld". A person skilled in the art may set the migration type and the code of the migration type according to actual needs, which is not limited in the present disclosure.
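The mapping from the speed relationship to a migration type code can be sketched as follows. This is a minimal illustration only: the function name `migration_type_code` and the use of plain numbers for storage speeds are assumptions made for the example, and the codes "st", "mv", and "ld" are the example codes given above.

```python
def migration_type_code(initial_speed: float, target_speed: float) -> str:
    """Return the example migration type code for a pair of storage speeds.

    The speeds are hypothetical relative values; only their ordering matters.
    """
    if initial_speed > target_speed:
        return "st"  # initial storage space is faster than the target storage space
    if initial_speed < target_speed:
        return "ld"  # initial storage space is slower than the target storage space
    return "mv"      # both storage spaces have the same storage speed
```

A different set of codes could be substituted without changing the structure, since the disclosure does not limit the codes used.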

In this embodiment, the migration parameters may include identifiers, such as the names or numbers of the initial storage space and the target storage space, to represent the initial storage space and the target storage space.

In this embodiment, the initial storage space may be the NRAM, DDR, registers, etc. of the device. The target storage space may be the NRAM or DDR of the device. Here, NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (Carbon Nanotube, CNT for short). DDR (also known as DDR SDRAM) is double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory).

In this embodiment, the scalar data migration instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module first needs to compile the (uncompiled) scalar data migration instruction. Once the compiled scalar data migration instruction is obtained, it can be parsed. The compiled scalar data migration instruction is a hardware instruction that can be directly executed by the hardware. The control module may obtain the scalar data to be migrated from the address of the scalar data to be migrated. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), that specifies the operation to be performed in a computer program; it is an instruction sequence number used to inform the device executing the instruction which instruction specifically needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction includes the target address, the address of the scalar data to be migrated, the initial storage space where the address of the scalar data to be migrated is located, the target storage space where the target address is located, the migration parameters for the migration processing, and so on. A scalar data migration instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the scalar data to be migrated and the target address.

It should be understood that those skilled in the art can set the format of the scalar data migration instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a scalar data migration instruction and control one or more processing modules to perform scalar data migration. When the device includes multiple control modules, the multiple control modules may respectively receive scalar data migration instructions and control the corresponding one or more processing modules to perform scalar data migration.

A scalar data migration instruction processing device provided by an embodiment of the present disclosure includes a control module and a processing module. The control module is used to compile the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction, acquire, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction, and determine the migration parameters required for the migration processing. The processing module is used to store the scalar data to be migrated into the target address according to the migration parameters. The scalar data migration instruction processing device provided by the embodiments of the present disclosure has a wide application range, processes scalar data migration instructions efficiently and quickly, and performs scalar data migration efficiently and quickly.

FIG. 16-2 shows a block diagram of a scalar data migration instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 16-2, the processing module 16-12 may include a master processing sub-module 16-121 and multiple slave processing sub-modules 16-122.

The master processing sub-module 16-121 is used to process the scalar data to be migrated, obtain the processed scalar data to be migrated, and store the processed scalar data to be migrated in the target address. The processing performed on the scalar data to be migrated includes processing such as data type conversion, which is not limited in the present disclosure.

In a possible implementation manner, the operation domain may further include a scalar data migration amount. The control module 16-11 is also used to determine the scalar data migration amount according to the operation domain, and obtain the scalar data to be migrated corresponding to the scalar data migration amount from the scalar data address to be migrated.

In this implementation manner, the scalar data migration amount may be the data amount of the acquired scalar data to be migrated.

In a possible implementation manner, a default scalar data migration amount may be preset. When the scalar data migration amount is not included in the operation domain, the default scalar data migration amount may be determined as the scalar data migration amount of the current scalar data migration instruction. Furthermore, the scalar data to be migrated corresponding to the scalar data migration amount is acquired from the scalar data address to be migrated.

In a possible implementation manner, when the scalar data migration amount is not included in the operation domain, all scalar data to be migrated stored therein may be directly obtained from the scalar data address to be migrated.
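The three ways of settling the scalar data migration amount described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the scalar data stored at the address to be migrated is modeled as a Python list, and `DEFAULT_MIGRATION_AMOUNT`, `select_scalars`, and the `use_default` switch are hypothetical names that do not appear in the disclosure.

```python
from typing import List, Optional

DEFAULT_MIGRATION_AMOUNT = 1  # hypothetical preset default amount


def select_scalars(stored: List[float],
                   amount: Optional[int],
                   use_default: bool = True) -> List[float]:
    """Pick the scalar data to be migrated from what is stored at the source address.

    amount is the migration amount from the operation domain, or None when the
    operation domain does not carry one.
    """
    if amount is None:
        if not use_default:
            # No amount and no default: take all scalar data stored at the address.
            return list(stored)
        # No amount in the operation domain: fall back to the preset default.
        amount = DEFAULT_MIGRATION_AMOUNT
    return list(stored[:amount])
```

Whether an implementation falls back to a default or migrates everything is a design choice; the sketch exposes it as a flag only to show both behaviors side by side.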

In a possible implementation manner, the operation domain may further include migration parameters. Wherein, determining the migration parameters required for the migration process may include: determining the migration parameters required for the migration process according to the operation domain.

In a possible implementation, the operation code may also be used to indicate the migration parameter. Wherein, determining the migration parameters required for the migration process may include: determining the migration parameters required for the migration process according to the operation code.

In a possible implementation, default migration parameters can also be set. When the migration parameter of the current scalar data migration instruction cannot be determined according to the operation domain and the operation code, the default migration parameter may be determined as the migration parameter of the current scalar data migration instruction.
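The order in which the migration parameters may be resolved (operation domain, then operation code, then a default) can be sketched as follows. The function name `resolve_migration_params`, the string form `type.space1.space2`, and the value of `DEFAULT_MIGRATION_PARAMS` are assumptions made for this illustration.

```python
from typing import Optional

DEFAULT_MIGRATION_PARAMS = "mv.0.0"  # hypothetical default migration parameters
KNOWN_TYPES = ("ld", "st", "mv")


def resolve_migration_params(opcode: str, domain_params: Optional[str] = None) -> str:
    """Return migration parameters as a 'type.space1.space2' string."""
    # 1. Migration parameters carried in the operation domain.
    if domain_params is not None:
        return domain_params
    # 2. Migration parameters indicated by the operation code itself.
    parts = opcode.split(".")
    if len(parts) == 3 and parts[0] in KNOWN_TYPES:
        return opcode
    # 3. Neither source provides them: use the default migration parameters.
    return DEFAULT_MIGRATION_PARAMS
```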

In a possible implementation, the initial storage space and the target storage space corresponding to the address of the scalar data to be migrated and the target address may be determined first, and the migration parameters may then be determined according to parameters such as the storage speed and storage space type of the initial storage space and the target storage space.
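Deriving the migration parameters from the two addresses can be sketched as follows. Everything concrete here is hypothetical: the address ranges, the relative speeds, and the names `StorageSpace`, `SPACES`, `space_of`, and `derive_migration_params` are assumptions chosen only to make the lookup runnable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageSpace:
    name: str
    start: int  # first address of the space (inclusive), hypothetical
    end: int    # end address of the space (exclusive), hypothetical
    speed: int  # hypothetical relative storage speed; larger is faster


SPACES = [
    StorageSpace("register", 0, 100, speed=3),
    StorageSpace("nram", 100, 1000, speed=2),
    StorageSpace("ddr", 1000, 10000, speed=1),
]


def space_of(address: int) -> StorageSpace:
    """Find the storage space whose address range contains the address."""
    for space in SPACES:
        if space.start <= address < space.end:
            return space
    raise ValueError(f"address {address} is not in any known storage space")


def derive_migration_params(src: int, dst: int) -> str:
    """Build 'type.space1.space2' from the source and target addresses."""
    initial, target = space_of(src), space_of(dst)
    if initial.speed > target.speed:
        code = "st"
    elif initial.speed < target.speed:
        code = "ld"
    else:
        code = "mv"
    return f"{code}.{initial.name}.{target.name}"
```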

In a possible implementation manner, as shown in FIG. 16-2, the device may further include a storage module 16-13. The storage module 16-13 is used to store the scalar data to be migrated.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed scratchpad cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store data to be calculated. The register is used to store the scalar data to be migrated and the scalar data in the data to be calculated. The data to be calculated may be data related to the execution of calculation instructions and scalar data migration instructions.

In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the foregoing neuron random access memory, can be used to store neuron data in the data to be calculated, and the neuron data includes neuron vector data.

In a possible implementation, the control module 16-11 may also be used to generate an assembly file according to the scalar data migration instruction and translate the assembly file into a binary file, where the binary file is a compiled scalar data migration instruction.

In a possible implementation, the instruction format of the scalar data migration instruction may be:

migrate dst src type.space1.space2 size

Among them, migrate is the operation code of the scalar data migration instruction, and dst, src, type.space1.space2, and size are the operation domains of the scalar data migration instruction. dst is the target address, and src is the address of the scalar data to be migrated. When there are multiple scalar data to be migrated, src may include multiple addresses src0, src1, ..., srcn of scalar data to be migrated, which is not limited in the present disclosure. type.space1.space2 is the migration parameter: type indicates the migration type, space1 indicates the initial storage space where the address src of the scalar data to be migrated is located, and space2 indicates the target storage space where the target address dst is located. size is the scalar data migration amount.
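Parsing this first instruction format can be sketched as follows. This is a minimal sketch, assuming that the fields are separated by whitespace, that addresses and the migration amount are decimal integers, and that the migration parameter is the only token containing two dots; the names `Migrate` and `parse_migrate` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Migrate(NamedTuple):
    dst: int             # target address
    srcs: List[int]      # one or more addresses of scalar data to be migrated
    mtype: str           # migration type, e.g. "ld", "st", "mv"
    space1: str          # initial storage space identifier
    space2: str          # target storage space identifier
    size: Optional[int]  # scalar data migration amount, if present


def parse_migrate(text: str) -> Migrate:
    tokens = text.split()
    if len(tokens) < 4 or tokens[0] != "migrate":
        raise ValueError("not a migrate instruction")
    # Locate the type.space1.space2 token; everything between dst and it is a source.
    param_index = next((i for i, t in enumerate(tokens) if t.count(".") == 2), None)
    if param_index is None or param_index < 3:
        raise ValueError("missing source address or migration parameter")
    mtype, space1, space2 = tokens[param_index].split(".")
    rest = tokens[param_index + 1:]
    size = int(rest[0]) if rest else None
    srcs = [int(t) for t in tokens[2:param_index]]
    return Migrate(int(tokens[1]), srcs, mtype, space1, space2, size)
```

For instance, `parse_migrate("migrate 500 400 ld.200.300 5")` yields a target address of 500, a single source address 400, and the migration parameter split into its three parts.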

In a possible implementation, the instruction format of the scalar data migration instruction may also be:

type.space1.space2 dst src size

Among them, type.space1.space2 is the operation code of the scalar data migration instruction, and dst, src, and size are the operation domains of the scalar data migration instruction. dst is the target address, and src is the address of the scalar data to be migrated. When there are multiple scalar data to be migrated, src may include multiple addresses src0, src1, ..., srcn of scalar data to be migrated, which is not limited in the present disclosure. size is the scalar data migration amount. In the operation code type.space1.space2, type indicates the migration type, space1 indicates the initial storage space where the address src of the scalar data to be migrated is located, and space2 indicates the target storage space where the target address dst is located.

Among them, type can be ld, st, mv. The migration type indicated by ld is "the storage speed of the initial storage space is less than the storage speed of the target storage space". The migration type indicated by st is "the storage speed of the initial storage space is greater than the storage speed of the target storage space". The type of migration indicated by mv is "the storage speed of the initial storage space is equal to the storage speed of the target storage space".
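Parsing the second format, in which the operation code itself carries the migration parameters, can be sketched as follows. The separator handling (whitespace or commas), the decimal integer addresses, and the name `parse_typed_instruction` are assumptions made for this illustration.

```python
from typing import Dict, Optional, Union

Field = Union[str, int, None]


def parse_typed_instruction(text: str) -> Dict[str, Field]:
    """Parse 'type.space1.space2 dst src [size]' into named fields."""
    tokens = text.replace(",", " ").split()
    if not tokens:
        raise ValueError("empty instruction")
    opcode, operands = tokens[0], tokens[1:]
    parts = opcode.split(".")
    if len(parts) != 3 or parts[0] not in ("ld", "st", "mv"):
        raise ValueError(f"unrecognized operation code: {opcode}")
    if len(operands) < 2:
        raise ValueError("dst and src are required")
    size: Optional[int] = int(operands[2]) if len(operands) > 2 else None
    return {
        "type": parts[0],     # migration type
        "space1": parts[1],   # initial storage space
        "space2": parts[2],   # target storage space
        "dst": int(operands[0]),
        "src": int(operands[1]),
        "size": size,
    }
```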

In a possible implementation, the instruction format of the scalar data migration instruction whose migration type is "the storage speed of the initial storage space is less than the storage speed of the target storage space" can be set to: ld.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type ld, the scalar data to be migrated, whose data amount is the scalar data migration amount size, is obtained from the address src0 of the scalar data to be migrated in the initial storage space space1 and stored into the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is less than the storage speed of the target storage space space2.

In a possible implementation, the instruction format of the scalar data migration instruction whose migration type is "the storage speed of the initial storage space is greater than the storage speed of the target storage space" may be set to: st.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type st, the scalar data to be migrated, whose data amount is the scalar data migration amount size, is obtained from the address src0 of the scalar data to be migrated in the initial storage space space1 and stored into the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is greater than the storage speed of the target storage space space2.

In a possible implementation, the instruction format of the scalar data migration instruction whose migration type is "the storage speed of the initial storage space is equal to the storage speed of the target storage space" can be set to: mv.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2, and the migration type mv, the scalar data to be migrated, whose data amount is the scalar data migration amount size, is obtained from the address src0 of the scalar data to be migrated in the initial storage space space1 and stored into the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is equal to the storage speed of the target storage space space2.
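The common data flow of the ld, st, and mv forms can be sketched as follows. This is a toy model, not the device's behavior: each storage space is a dictionary from address to scalar, the relative speeds in `SPEED` are hypothetical, the migration amount is counted in scalars, and `execute_migration` is a hypothetical name. The sketch also checks that the declared migration type agrees with the speed relationship of the two spaces, which is an added assumption.

```python
from typing import Dict

SPEED = {"ddr": 1, "nram": 2}  # hypothetical relative storage speeds

Memory = Dict[str, Dict[int, float]]


def execute_migration(mtype: str, space1: str, space2: str,
                      dst: int, src0: int, size: int, memory: Memory) -> None:
    """Copy `size` scalars from src0 in space1 to dst in space2."""
    relation_holds = {
        "ld": SPEED[space1] < SPEED[space2],
        "st": SPEED[space1] > SPEED[space2],
        "mv": SPEED[space1] == SPEED[space2],
    }
    if not relation_holds[mtype]:
        raise ValueError(f"migration type {mtype} does not match {space1} -> {space2}")
    # Read everything first so overlapping ranges in one space are copied correctly.
    data = [memory[space1][src0 + offset] for offset in range(size)]
    for offset, value in enumerate(data):
        memory[space2][dst + offset] = value
```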

It should be understood that those skilled in the art can set the operation code of the scalar data migration instruction, the position of the operation code and the operation field in the instruction format as needed, and this disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the scalar data migration instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses "scalar data migration instruction processing device for data migration" as an exemplary application scenario, and gives an application example according to an embodiment of the present disclosure to facilitate understanding of the flow of the scalar data migration instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 16-3 shows a schematic diagram of an application scenario of a scalar data migration instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 16-3, the scalar data migration instruction processing device processes the scalar data migration instruction as follows:

The control module 16-11 compiles the acquired scalar data migration instruction 1 to obtain the compiled scalar data migration instruction 1, and parses the compiled scalar data migration instruction 1 (for example, scalar data migration instruction 1 is ld.200.300, 500, 400, 5) to obtain the operation code and operation domain of the scalar data migration instruction 1. The operation code of the scalar data migration instruction 1 is ld, the initial storage space is 200, the target storage space is 300, the target address is 500, the address of the scalar data to be migrated is 400, and the scalar data migration amount is 5. According to the operation code ld, it can be determined that the storage speed of the initial storage space 200 is less than the storage speed of the target storage space 300. The control module 16-11 acquires the scalar data to be migrated, whose data amount is the scalar data migration amount 5, from the address 400 of the scalar data to be migrated in the initial storage space 200. The processing module 16-12 stores the scalar data to be migrated into the target address 500 in the target storage space 300 according to the migration parameters.

Among them, the scalar data migration instruction 1 may be not only the above ld.200.300, 500, 400, 5, but also migrate 500 400 ld.200.300 5, and so on. The processing procedures of the two are similar and will not be repeated here.

In this way, the scalar data migration instruction processing device can efficiently and quickly process the scalar data migration instruction, and the processing efficiency of the scalar data migration is high and the speed is fast.
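The decoding in this application example can be traced with a short sketch. It assumes that fields may be separated by commas or whitespace and that the migrate form is spelled `migrate 500 400 ld.200.300 5` (target address, source address, migration parameter, migration amount); that spelling and the function name `decode` are assumptions made for illustration.

```python
from typing import Dict, Union


def decode(instruction: str) -> Dict[str, Union[str, int]]:
    """Decode either instruction form into the fields named in the example."""
    tokens = instruction.replace(",", " ").split()
    if tokens[0] == "migrate":
        # migrate dst src type.space1.space2 size
        dst, src, params, size = tokens[1:5]
    else:
        # type.space1.space2 dst src size
        params, dst, src, size = tokens[0:4]
    mtype, initial_space, target_space = params.split(".")
    return {
        "type": mtype,
        "initial_space": initial_space,
        "target_space": target_space,
        "target_address": int(dst),
        "source_address": int(src),
        "amount": int(size),
    }


first_form = decode("ld.200.300, 500, 400, 5")
second_form = decode("migrate 500 400 ld.200.300 5")
```

Under these assumptions both forms decode to the same fields (type ld, initial storage space 200, target storage space 300, target address 500, source address 400, amount 5), which is why the two processing procedures are similar.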

FIG. 16-4 shows a flowchart of a scalar data migration instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the relevant processing and operation steps, such as the following steps S51-16 and S52-16. As shown in FIG. 16-4, the method is applied to the above scalar data migration instruction processing device, and the method includes steps S51-16 and S52-16.

In step S51-16, the control module is used to compile the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction, acquire, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction, and determine the migration parameters required for the migration processing. The operation code is used to indicate that the processing performed on the scalar data by the scalar data migration instruction is migration processing. The operation domain includes the address of the scalar data to be migrated and the target address. The migration parameters include the initial storage space where the address of the scalar data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

In step S52-16, the processing module stores the scalar data to be migrated into the target address according to the migration parameters.
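The division of work between steps S51-16 and S52-16 can be sketched as two functions. This is a simplified model under stated assumptions: `compile_instruction` is a placeholder that stands in for real compilation, storage spaces are dictionaries from address to scalar, the instruction uses the `type.space1.space2 dst src size` form, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

Memory = Dict[str, Dict[int, float]]


def compile_instruction(source: str) -> str:
    """Placeholder for compilation; a real device would emit a hardware instruction."""
    return source.strip()


def step_s51(source: str, memory: Memory) -> Tuple[List[float], int, Dict[str, str]]:
    """Control module: compile, parse, fetch the scalars, determine the parameters."""
    compiled = compile_instruction(source)
    opcode, dst, src, size = compiled.split()
    mtype, initial, target = opcode.split(".")
    data = [memory[initial][int(src) + i] for i in range(int(size))]
    params = {"type": mtype, "initial": initial, "target": target}
    return data, int(dst), params


def step_s52(data: List[float], dst: int, params: Dict[str, str], memory: Memory) -> None:
    """Processing module: store the scalars at the target address."""
    for i, value in enumerate(data):
        memory[params["target"]][dst + i] = value


memory: Memory = {"ddr": {400: 1.5, 401: 2.5}, "nram": {}}
data, dst, params = step_s51("ld.ddr.nram 500 400 2", memory)
step_s52(data, dst, params, memory)
```

After both steps run, the two scalars that were at addresses 400 and 401 of the hypothetical "ddr" space are also present at addresses 500 and 501 of the hypothetical "nram" space.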

In a possible implementation manner, the processing module may include a master processing sub-module and multiple slave processing sub-modules. Wherein, step S52-16 may include:

The master processing sub-module is used to process the scalar data to be migrated to obtain the processed scalar data to be migrated, and the processed scalar data to be migrated is stored in the target address.

In a possible implementation manner, the operation domain may further include a scalar data migration amount. Wherein, acquiring the scalar data to be migrated and the target address required to execute the scalar data migration instruction according to the operation code and the operation domain may include:

Determine the scalar data migration amount according to the operation domain, and obtain scalar data to be migrated corresponding to the scalar data migration amount from the scalar data address to be migrated.

In a possible implementation manner, the method further includes: using the storage module of the device to store the scalar data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar data to be migrated and the scalar data in the data to be calculated;

In a possible implementation, step S51-16 may include:

Store compiled scalar data migration instructions;

Analyze the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include compiled scalar data migration instructions.

In a possible implementation manner, the method may further include:

When it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, the first to-be-executed instruction is cached, and after it is determined that the zeroth to-be-executed instruction has finished executing, execution of the first to-be-executed instruction is controlled.
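The caching behavior for associated instructions can be sketched as follows. The criterion for deciding that two instructions are associated is not given in this passage, so the sketch takes it as a caller-supplied predicate; the class `DependencyProcessor` and its method names are hypothetical.

```python
from typing import Callable, List, Tuple


class DependencyProcessor:
    """Hold back an instruction while an associated earlier instruction is running."""

    def __init__(self, is_associated: Callable[[str, str], bool]) -> None:
        self.is_associated = is_associated       # caller-supplied association test
        self.in_flight: List[str] = []           # sent for execution, not yet complete
        self.cached: List[Tuple[str, str]] = []  # (earlier instruction, waiting instruction)
        self.sent: List[str] = []                # order in which instructions were sent

    def submit(self, instruction: str) -> bool:
        """Send the instruction, or cache it if an earlier one blocks it."""
        for earlier in self.in_flight:
            if self.is_associated(earlier, instruction):
                self.cached.append((earlier, instruction))
                return False
        self.in_flight.append(instruction)
        self.sent.append(instruction)
        return True

    def complete(self, instruction: str) -> None:
        """Mark an instruction finished and release the instructions waiting on it."""
        self.in_flight.remove(instruction)
        waiting = [later for earlier, later in self.cached if earlier == instruction]
        self.cached = [(e, l) for e, l in self.cached if e != instruction]
        for later in waiting:
            self.submit(later)
```

A usage example, with a purely illustrative predicate that treats two instructions as associated when the earlier one's target address equals the later one's source address: submitting `"ld.200.300 500 400 5"` and then `"st.300.200 600 500 5"` sends the first and caches the second, and calling `complete` on the first releases the second.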

In a possible implementation manner, compiling the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction may include:

Generate assembly files according to scalar data migration instructions, and translate the assembly files into binary files,

Among them, the binary file is a compiled scalar data migration instruction.

It should be noted that although the above embodiment is taken as an example to introduce the scalar data migration instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The scalar data migration instruction processing method provided by the embodiments of the present disclosure has a wide range of application, and has high processing efficiency and fast processing speed for scalar data migration instructions, and high processing efficiency and fast speed for scalar data migration.

The foregoing can be better understood based on the following clauses:

Clause P1. A scalar data migration instruction processing device, the device comprising:

The control module is used to compile the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction, acquire, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction, and determine the migration parameters required for the migration processing;

The processing module stores the scalar data to be migrated into the target address according to the migration parameter,

Wherein, the operation code is used to indicate that the processing performed on the scalar data by the scalar data migration instruction is migration processing, the operation domain includes the address of the scalar data to be migrated and the target address, and the migration parameters include the initial storage space where the address of the scalar data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

Clause P2. The device according to Clause P1, the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,

The master processing sub-module is configured to process the scalar data to be migrated to obtain processed scalar data to be migrated, and store the processed scalar data to be migrated in the target address.

Clause P3. The device according to Clause P1, the operation domain further includes a scalar data migration amount,

Wherein, the control module is further configured to determine the scalar data migration amount according to the operation domain, and obtain scalar data to be migrated corresponding to the scalar data migration amount from the scalar data address to be migrated.

Clause P4. The device according to Clause P1, the operation domain further includes migration parameters,

Wherein, determining the migration parameters required for the migration processing includes:

Determining, according to the operation domain, the migration parameters required for the migration processing.

Clause P5. The device according to Clause P1, the operation code is also used to indicate a migration parameter,

According to the operation code, the migration parameters required for the migration process are determined.

Clause P6. The device according to Clause P1, the device further comprising:

A storage module for storing the scalar data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated, and the cache includes at least one neuron cache NRAM;

The register is used to store the scalar data in the data to be migrated and the scalar data in the data to be calculated;

Clause P7. The device according to Clause P1, the control module includes:

An instruction storage sub-module for storing the compiled scalar data migration instruction;

An instruction processing sub-module, which is used to parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled scalar data migration instruction.

Clause P8. The device according to Clause P7, the control module, further comprising:

Clause P9, the device according to Clause P1,

The control module is also used to generate an assembly file according to the scalar data migration instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled scalar data migration instruction.

Clause P10. A machine learning computing device, the device comprising:

One or more scalar data migration instruction processing devices as described in any one of Clause P1 to Clause P9, used to obtain the scalar data to be migrated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning computing device includes a plurality of scalar data migration instruction processing devices, the plurality of scalar data migration instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of scalar data migration instruction processing devices are interconnected and transmit data through a PCIE bus (a fast external device interconnection bus) to support larger-scale machine learning operations; the plurality of scalar data migration instruction processing devices share the same control system or have their own control systems; the plurality of scalar data migration instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of scalar data migration instruction processing devices is any interconnection topology.

Clause P11. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in Clause P10;

Clause P12. A machine learning chip, the machine learning chip includes:

The machine learning computing device according to Clause P10 or the combined processing device according to Clause P11.

Clause P13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause P12.

Clause P14, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause P12;

The storage device is used for storing data;

Clause P15. A scalar data migration instruction processing method. The method is applied to a scalar data migration instruction processing device. The device includes a control module and a processing module. The method includes:

Using the control module to compile the acquired scalar data migration instruction to obtain the compiled scalar data migration instruction, to parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction, to obtain, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction, and to determine the migration parameters required for the migration processing;

Clause P16. The method according to Clause P15, the processing module includes a master processing submodule and a plurality of slave processing submodules,

Wherein, storing the scalar data to be migrated into the target address according to the migration parameter includes:

The master processing submodule is used to process the scalar data to be migrated to obtain processed scalar data to be migrated, and the processed scalar data to be migrated is stored in the target address.

Clause P17. The method according to Clause P15, the operation domain further includes a scalar data migration amount,

Wherein, acquiring the scalar data to be migrated and the target address required to execute the scalar data migration instruction according to the operation code and the operation domain includes:

Clause P18. The method according to Clause P15, the operation domain further includes migration parameters,

Clause P19, the method according to Clause P15, the operation code is also used to indicate the migration parameter,

Clause P20. The method according to Clause P15, the method further comprising:

Use the storage module of the device to store the scalar data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store the scalar data in the data to be migrated and the scalar data in the data to be calculated;

Clause P21. The method according to Clause P15, wherein parsing the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction includes:

Store the compiled scalar data migration instruction;

Parse the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the compiled scalar data migration instruction.

Clause P22. The method according to Clause P21, the method further comprising:

Clause P23. The method according to Clause P15, wherein compiling the acquired scalar data migration instruction to obtain the compiled scalar data migration instruction includes:

Generate an assembly file according to the scalar data migration instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled scalar data migration instruction.

Clause P24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause P15 to Clause P23.
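The migration described in Clauses P1 and P3 can be illustrated with a minimal Python sketch. The storage-space names ("register", "nram"), the `MigrationParams` fields, and the `migrate_scalars` function are hypothetical and introduced only for illustration; the disclosure does not prescribe any of them.

```python
from dataclasses import dataclass


@dataclass
class MigrationParams:
    # Hypothetical container for the migration parameters named in Clause P1.
    initial_space: str   # storage space holding the scalar data to be migrated
    target_space: str    # storage space holding the target address
    migration_type: str  # kept only as a descriptive label in this sketch


def migrate_scalars(storage, src_addr, dst_addr, amount, params):
    """Copy `amount` scalars from src_addr (initial space) to dst_addr (target space)."""
    src = storage[params.initial_space]
    dst = storage[params.target_space]
    # Acquire the scalar data to be migrated, as given by the migration amount.
    data = [src[src_addr + i] for i in range(amount)]
    # Store the acquired scalars at consecutive target addresses.
    for i, value in enumerate(data):
        dst[dst_addr + i] = value
    return data


# Storage spaces modelled as address -> value dictionaries (an assumption).
storage = {"register": {0: 7, 1: 9}, "nram": {}}
params = MigrationParams("register", "nram", "register_to_nram")
moved = migrate_scalars(storage, src_addr=0, dst_addr=16, amount=2, params=params)
```

In this sketch the migration type is carried along without changing behavior; a real device would presumably use it to select the data path between storage spaces.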

FIG. 17-1 shows a block diagram of a scalar control flow instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 17-1, the device includes a control module 17-11. The control module 17-11 includes an instruction compilation submodule 17-111, a data acquisition submodule 17-112, and a jump control submodule 17-113.

The instruction compilation submodule 17-111 compiles the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction.

The data acquisition submodule 17-112 obtains, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determines the jump condition corresponding to the scalar control flow instruction.

The jump control sub-module 17-113, when the scalar to be judged meets the jump condition, controls the instruction flow to jump to the target jump address.

The operation code is used to indicate that the processing performed on the data by the scalar control flow instruction is scalar jump processing, and the operation domain includes the scalar address to be judged and the target jump address.

In this embodiment, there may be one or more scalars to be judged. The operation domain may include the scalar address to be judged, or may directly include the scalar to be judged, so that the control module can obtain the scalar to be judged.

In this embodiment, the scalar control flow instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module must first compile the (uncompiled) scalar control flow instruction; once the compiled scalar control flow instruction is obtained, it can be parsed. The compiled scalar control flow instruction is a hardware instruction that can be directly executed by hardware. The control module may obtain the scalar control flow instruction and the scalar to be judged through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number that informs the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including the scalar to be judged, the scalar address to be judged, the target jump address, the jump condition, and so on. A scalar control flow instruction must include an operation code and an operation domain, where the operation domain includes at least the scalar address to be judged and the target jump address.

It should be understood that those skilled in the art can set the instruction format of the scalar control flow instruction, as well as the included operation codes and operation domains as needed, and the disclosure does not limit this.

In this embodiment, the device may include one or more control modules, and the number of control modules may be set according to actual needs, which is not limited in the present disclosure. The device can be used for calculation of machine learning algorithms, such as neural network algorithms.

In this embodiment, the device may further include a processing module. The control module can also be used to receive calculation instructions to obtain data to be processed. The processing module is used to perform operation processing on the data to be processed according to the calculation instruction to obtain the operation result.

A scalar control flow instruction processing device provided by an embodiment of the present disclosure includes a control module. The control module includes: an instruction compilation submodule, which compiles the obtained scalar control flow instruction to obtain a compiled scalar control flow instruction; a data acquisition submodule, which obtains, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determines the jump condition corresponding to the scalar control flow instruction; and a jump control submodule, which controls the instruction flow to jump to the target jump address when the scalar to be judged meets the jump condition. The scalar control flow instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for scalar control flow instructions.

In a possible implementation, the jump control submodule 17-113 may include:

At least one comparator is used to compare the scalar to be judged according to the jump condition to obtain a comparison result, and the comparison result is used to indicate whether the scalar to be judged meets the jump condition.

In a possible implementation manner, the operation domain may further include a jump condition. Wherein, the data acquisition sub-module 17-112 may be used to determine the jump condition corresponding to the scalar control flow instruction according to the operation domain when the operation domain includes the jump condition.

In a possible implementation manner, the operation code may also be used to indicate a jump condition. Among them, the data acquisition sub-module 17-112 can be used to determine the jump condition corresponding to the scalar control flow instruction according to the operation code when the operation code is used to indicate the jump condition.

In a possible implementation manner, the jump condition may include a judgment condition and a data type of a scalar to be judged. The judgment condition is used to indicate the type of judgment or comparison that the scalar control flow instruction needs to make to judge the scalar.

In a possible implementation manner, the judgment condition may include any one of the following:

The first scalar to be judged in the scalar to be judged is equal to the second scalar to be judged in the scalar to be judged;

The first scalar to be judged in the scalar to be judged is not equal to the second scalar to be judged in the scalar to be judged;

The first scalar to be judged in the scalar to be judged is smaller than the second scalar to be judged in the scalar to be judged;

The first scalar to be judged in the scalar to be judged is greater than or equal to the second scalar to be judged in the scalar to be judged;

The scalar to be judged is greater than the specified value.

In this implementation manner, the judgment condition may also be another judgment condition for the scalar to be judged. For example, the judgment condition may also be that the first scalar to be judged in the scalar to be judged is smaller than the second scalar to be judged in the scalar to be judged. The judgment condition may also be that the scalar to be judged is less than the specified value, that the scalar to be judged is equal to the specified value, and so on. The specified value may be a preset value. The judgment condition may also be that the sum of the first scalar to be judged and the second scalar to be judged is greater than, equal to, less than, less than or equal to, greater than or equal to, or not equal to a third scalar to be judged in the scalar to be judged, and so on. Those skilled in the art can set the judgment conditions according to actual needs, and this disclosure does not limit this.

In this implementation manner, different judgment condition flags can be set to distinguish different judgment conditions. For example, the judgment condition flag of "the first scalar to be judged in the scalar to be judged is equal to the second scalar to be judged in the scalar to be judged" may be set to "beq". The judgment condition flag of "the first scalar to be judged in the scalar to be judged is not equal to the second scalar to be judged in the scalar to be judged" may be set to "bne". The judgment condition flag of "the first scalar to be judged in the scalar to be judged is smaller than the second scalar to be judged in the scalar to be judged" may be set to "blt". The judgment condition flag of "the first scalar to be judged in the scalar to be judged is greater than or equal to the second scalar to be judged in the scalar to be judged" may be set to "bge". The judgment condition flag of "the scalar to be judged is greater than the specified value" may be set to "blt.a", where a is the specified value.
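The two-scalar judgment condition flags above can be read as a lookup from flag to comparison. The following is a minimal sketch under that reading; the flag names come from the text, while the dictionary layout and the function name `meets_jump_condition` are assumptions made for illustration.

```python
import operator

# Flag names are taken from the text; mapping them to Python operators is an assumption.
JUDGMENT_CONDITIONS = {
    "beq": operator.eq,  # first scalar equal to the second
    "bne": operator.ne,  # first scalar not equal to the second
    "blt": operator.lt,  # first scalar smaller than the second
    "bge": operator.ge,  # first scalar greater than or equal to the second
}


def meets_jump_condition(flag, first, second):
    """Return True when the two scalars to be judged satisfy the flagged condition."""
    return JUDGMENT_CONDITIONS[flag](first, second)
```

A table of this kind keeps the comparator logic in one place, so further judgment conditions could be added without touching the jump control itself.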

In a possible implementation, the data type may include any one of the following: a 16-bit unsigned type, a 32-bit unsigned type, a 48-bit unsigned type, a 16-bit signed type, a 32-bit signed type, and a 48-bit signed type.

In this implementation manner, the scalar to be determined may be a scalar of an integer type or the like and corresponding to the above data type. A person skilled in the art may set the data type and type of the scalar to be judged according to actual needs, which is not limited in the present disclosure.

In a possible implementation, the default data type can be preset. When the data type is not included in the jump condition, the default data type can be determined as the data type of the scalar to be judged.
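As an illustration of how a data type and a preset default might be applied to a stored value, the sketch below interprets a raw bit pattern according to a type name. Only "u16" appears in the text; the other type names, the choice of "u32" as the default, and the use of two's complement for signed types are assumptions.

```python
# "u16" appears in the text; the remaining names follow the same pattern by assumption.
DATA_TYPES = {
    "u16": (16, False), "u32": (32, False), "u48": (48, False),
    "s16": (16, True), "s32": (32, True), "s48": (48, True),
}
DEFAULT_TYPE = "u32"  # hypothetical preset default data type


def interpret_scalar(raw, type_name=None):
    """Interpret a raw bit pattern as a scalar of the given (or default) data type."""
    bits, signed = DATA_TYPES[type_name or DEFAULT_TYPE]
    value = raw & ((1 << bits) - 1)      # keep only the low `bits` bits
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits               # two's-complement adjustment (assumed encoding)
    return value
```

The same 16-bit pattern `0xFFFF` therefore reads as 65535 under "u16" and as -1 under "s16", which is why the data type has to be known before two scalars are compared.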

In a possible implementation, when the scalar control flow instruction does not include the jump condition and the scalar address to be judged, or the jump condition and the scalar address to be judged are empty, or the jump condition and the scalar address to be judged are specified content, the instruction flow can be directly controlled to jump to the target jump address.

FIG. 17-2 shows a block diagram of a scalar control flow instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 17-2, the device may further include a storage module 17-13. The storage module 17-13 is used to store the scalar to be judged.

In this implementation manner, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a high-speed temporary storage cache. The scalar to be judged can be stored in the memory, cache, and/or register in the storage module as needed, and the disclosure does not limit this.

In a possible implementation manner, the instruction compilation submodule may compile the scalar control flow instruction according to what the control module needs in order to execute the instruction, for example, into a binary file or a hexadecimal file that the control module can recognize. Compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction may include: generating an assembly file according to the scalar control flow instruction, and translating the assembly file into a binary file, where the binary file is the compiled scalar control flow instruction.
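The two compilation stages (instruction to assembly text, assembly text to binary) can be sketched as follows. The disclosure does not define a binary encoding, so the opcode numbers, the field widths, and the function names `to_assembly` and `assemble` are all hypothetical.

```python
import struct

# Hypothetical opcode numbers; the disclosure does not define a binary encoding.
OPCODES = {"beq": 1, "bne": 2, "blt": 3, "bge": 4}


def to_assembly(flag, data_type, src0, src1, label):
    """Render one scalar control flow instruction as a line of assembly text."""
    return f"{flag}.{data_type}, {src0}, {src1}, {label}"


def assemble(line):
    """Translate one assembly line into bytes: opcode, type width, src0, src1, label."""
    head, src0, src1, label = [part.strip() for part in line.split(",")]
    flag, data_type = head.split(".")
    width = int(data_type[1:])  # e.g. "u16" -> 16; signedness is dropped in this toy encoding
    # Little-endian layout: two 1-byte fields followed by three 2-byte fields (assumed).
    return struct.pack("<BBHHH", OPCODES[flag], width, int(src0), int(src1), int(label))


binary = assemble(to_assembly("beq", "u16", 101, 102, 500))
```

The point of the sketch is only the separation of stages: the assembly text is still human-readable, while the packed bytes are what hardware would consume.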

In a possible implementation, the instruction format of the scalar control flow instruction may be:

jump, src, label, type1.type2

Here, jump is the operation code of the scalar control flow instruction, and src, label, and type1.type2 are the operation domain of the scalar control flow instruction. label is the target jump address. src is the scalar address to be judged; when there are multiple scalars to be judged, the scalar control flow instruction may include multiple scalar addresses to be judged, such as src1, src2, ..., srcn. type1.type2 represents the jump condition, where type1 represents the judgment condition and type2 represents the data type of the scalar to be judged.

Where there are multiple scalars to be judged, the instruction format may include multiple scalar addresses to be judged. The following takes two scalars to be judged as examples. The instruction format of the scalar control flow instruction may be:

jump, src0, src1, label, type1.type2

In a possible implementation, the instruction format of the scalar control flow instruction may also be:

type1.type2, src, label

Among them, type1.type2 is the operation code of the scalar control flow instruction, and src and label are the operation domains of the scalar control flow instruction. Among them, type1.type2 is used to indicate that the instruction is a scalar control flow instruction, where type1 in type1.type2 represents the judgment condition, and type2 in type1.type2 represents the data type of the scalar to be judged. src is a scalar address to be judged, wherein, when there are multiple scalars to be judged, the scalar control flow instruction may include multiple scalar addresses to be judged, such as src1, src2, ..., srcn.

Where there are multiple scalars to be judged, for example two, the instruction format may be:

type1.type2, src0, src1, label
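Both instruction formats described above carry the same information in a different field order, which a small parser makes concrete. This is a sketch only: the `ControlFlowInstruction` type and the `parse` function are hypothetical, and addresses are assumed to be written as decimal integers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlFlowInstruction:
    judgment: str        # type1, the judgment condition, e.g. "beq"
    data_type: str       # type2, the data type of the scalar to be judged, e.g. "u16"
    sources: List[int]   # scalar addresses to be judged (src, or src0, src1, ...)
    label: int           # target jump address


def parse(text):
    """Parse 'jump, src..., label, type1.type2' or 'type1.type2, src..., label'."""
    fields = [field.strip() for field in text.split(",")]
    if fields[0] == "jump":
        # First format: the jump condition is the last field of the operation domain.
        condition, operands = fields[-1], fields[1:-1]
    else:
        # Second format: the jump condition is carried by the operation code.
        condition, operands = fields[0], fields[1:]
    judgment, data_type = condition.split(".")
    # The last operand is the target jump address; the rest are scalar addresses.
    *sources, label = (int(x) for x in operands)
    return ControlFlowInstruction(judgment, data_type, sources, label)
```

For example, `parse("jump, 101, 102, 500, beq.u16")` and `parse("beq.u16, 101, 102, 500")` yield the same parsed instruction.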

In a possible implementation manner, corresponding instruction formats may be set for different scalar control flow instructions.

In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged in the scalar to be judged is equal to the second scalar to be judged in the scalar to be judged" may be set to: beq.type2, src0, src1, label. The scalar control flow instruction indicates that the first scalar to be judged and the second scalar to be judged, which are stored at src0 and src1 respectively and whose data type is type2, are compared; when the first scalar to be judged is equal to the second scalar to be judged, the instruction flow is controlled to jump to the target jump address label.

In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged in the scalar to be judged is not equal to the second scalar to be judged in the scalar to be judged" may be set to: bne.type2, src0, src1, label. The scalar control flow instruction indicates that the first scalar to be judged and the second scalar to be judged, which are stored at src0 and src1 respectively and whose data type is type2, are compared; when the first scalar to be judged is not equal to the second scalar to be judged, the instruction flow is controlled to jump to the target jump address label.

In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged in the scalar to be judged is less than the second scalar to be judged in the scalar to be judged" may be set to: blt.type2, src0, src1, label. The scalar control flow instruction indicates that the first scalar to be judged and the second scalar to be judged, which are stored at src0 and src1 respectively and whose data type is type2, are compared; when the first scalar to be judged is less than the second scalar to be judged, the instruction flow is controlled to jump to the target jump address label.

In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged in the scalar to be judged is greater than or equal to the second scalar to be judged in the scalar to be judged" may be set to: bge.type2, src0, src1, label. The scalar control flow instruction indicates that the first scalar to be judged and the second scalar to be judged, which are stored at src0 and src1 respectively and whose data type is type2, are compared; when the first scalar to be judged is greater than or equal to the second scalar to be judged, the instruction flow is controlled to jump to the target jump address label.

In a possible implementation manner, the instruction format of the scalar control flow instruction that makes the instruction flow jump directly, without judgment, may be set to: jmp, label. The scalar control flow instruction indicates that when the instruction is received, the instruction flow is directly controlled to jump to the target jump address label.

It should be understood that those skilled in the art may set the operation code of the scalar control flow instruction, and the positions of the operation code and the operation domain in the instruction format, according to needs, and this disclosure does not limit this.

In a possible implementation manner, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).

It should be noted that although the above embodiment is taken as an example to introduce the scalar control flow instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses "scalar control flow instruction processing device for address fetch processing" as an exemplary application scenario to give an application example according to an embodiment of the present disclosure, in order to understand the flow of the scalar control flow instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 17-3 shows a schematic diagram of an application scenario of a scalar control flow instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 17-3, the scalar control flow instruction processing device processes the scalar control flow instruction as follows:

As shown in FIG. 17-3, the control module 17-11 compiles the obtained scalar control flow instruction 1 (for example, scalar control flow instruction 1 is beq.u16, 101, 102, 500) to obtain the compiled scalar control flow instruction 1. The control module parses the compiled scalar control flow instruction 1 to obtain its operation code and operation domain. It is determined that the judgment condition is "the first scalar to be judged in the scalar to be judged is equal to the second scalar to be judged in the scalar to be judged", the data type is a 16-bit unsigned type, and the target jump address is 500. A 16-bit unsigned first scalar to be judged s1 is acquired from the first scalar address to be judged 101, and a 16-bit unsigned second scalar to be judged s2 is acquired from the second scalar address to be judged 102. The comparator compares the first scalar to be judged s1 with the second scalar to be judged s2. When s1 is equal to s2, the instruction flow is controlled to jump to the target jump address 500.
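The flow of this application example can be traced in a few lines of Python. This is a sketch under stated assumptions: memory is modelled as an address-to-value dictionary, only unsigned data types are handled, and the `fallthrough` address used when the condition is not met is hypothetical, since the text only describes the case where the jump is taken.

```python
import operator

COMPARATORS = {"beq": operator.eq, "bne": operator.ne, "blt": operator.lt, "bge": operator.ge}


def next_address(instruction, memory, fallthrough):
    """Return the address at which the instruction flow continues after `instruction`."""
    head, src0, src1, label = [part.strip() for part in instruction.split(",")]
    flag, data_type = head.split(".")
    mask = (1 << int(data_type[1:])) - 1   # "u16" -> 16-bit unsigned mask
    s1 = memory[int(src0)] & mask          # first scalar to be judged
    s2 = memory[int(src1)] & mask          # second scalar to be judged
    # Jump to the target address only when the judgment condition is met.
    return int(label) if COMPARATORS[flag](s1, s2) else fallthrough


memory = {101: 42, 102: 42}
taken = next_address("beq.u16, 101, 102, 500", memory, fallthrough=1)
memory[102] = 43
not_taken = next_address("beq.u16, 101, 102, 500", memory, fallthrough=1)
```

With equal values at addresses 101 and 102 the flow continues at 500; after one value changes, it continues at the assumed fall-through address instead.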

For the working process of the above control module, please refer to the related description above.

In this way, the scalar control flow instruction processing device can efficiently and quickly process the scalar control flow instruction.

FIG. 17-4 shows a flowchart of a scalar control flow instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following step S51-17 and step S52-17. As shown in FIG. 17-4, the method is applied to the above scalar control flow instruction processing device, and the method includes steps S51-17 to S53-17.

In step S51-17, compile the obtained scalar control flow instruction to obtain a compiled scalar control flow instruction.

In step S52-17, obtain, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determine the jump condition corresponding to the scalar control flow instruction. The operation code is used to indicate that the processing performed on the data by the scalar control flow instruction is scalar jump processing, and the operation domain includes the scalar address to be judged and the target jump address.

In step S53-17, when the scalar to be judged meets the jump condition, the instruction flow is controlled to jump to the target jump address.

In a possible implementation manner, controlling the instruction flow to jump to the target jump address when the scalar to be judged meets the jump condition may include:

According to the jump condition, at least one comparator is used to compare the scalar to be judged to obtain a comparison result, and the comparison result is used to indicate whether the scalar to be judged meets the jump condition.

In a possible implementation manner, the operation domain may further include a jump condition. Wherein, determining the jump condition corresponding to the scalar control flow instruction may include: when the operation domain includes the jump condition, determining the jump condition corresponding to the scalar control flow instruction according to the operation domain.

In a possible implementation manner, the operation code may also be used to indicate a jump condition. Wherein, determining the jump condition corresponding to the scalar control flow instruction may include: when the operation code is used to indicate the jump condition, determining the jump condition corresponding to the scalar control flow instruction according to the operation code.

In a possible implementation manner, the jump condition may include a judgment condition and a data type of a scalar to be judged.

The judgment condition may include any of the following:

The scalar to be judged is greater than the specified value.

The data type may include any of the following: 16-bit unsigned type, 32-bit unsigned type, 48-bit unsigned type, 16-bit signed type, 32-bit signed type, and 48-bit signed type.

In a possible implementation manner, the method may further include: storing a scalar to be judged.

In a possible implementation manner, the method may further include:

Storing the compiled scalar control flow instruction;

Parsing the compiled scalar control flow instruction to obtain the operation code and operation domain of the compiled scalar control flow instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in order of execution, and the plurality of instructions to be executed may include the compiled scalar control flow instruction.

In a possible implementation manner, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after it is determined that execution of the zeroth to-be-executed instruction is completed, controlling execution of the first to-be-executed instruction.

The first to-be-executed instruction being associated with the zeroth to-be-executed instruction before it includes: a first storage address interval that stores the data required by the first to-be-executed instruction and a zeroth storage address interval that stores the data required by the zeroth to-be-executed instruction have an overlapping area.
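The association test reduces to an interval-overlap check, which the following sketch illustrates together with a simplified in-order walk of the instruction queue. The half-open `(start, end)` interval convention, the function names, and the textual log are assumptions made for illustration; each instruction is compared only with its immediate predecessor.

```python
def is_associated(first_interval, zeroth_interval):
    """Half-open (start, end) address intervals overlap when each starts before the other ends."""
    return first_interval[0] < zeroth_interval[1] and zeroth_interval[0] < first_interval[1]


def issue(queue):
    """Walk the queue in order; an instruction associated with its predecessor is cached first."""
    log = []
    previous = None
    for name, interval in queue:
        if previous is not None and is_associated(interval, previous[1]):
            # The first instruction waits until the zeroth instruction has completed.
            log.append(f"cache {name} until {previous[0]} completes")
        log.append(f"execute {name}")
        previous = (name, interval)
    return log


# i1 reads addresses that overlap those of i0, so it is cached; i2 does not overlap i1.
log = issue([("i0", (0, 8)), ("i1", (4, 12)), ("i2", (20, 24))])
```

Instructions whose address intervals do not overlap are free to proceed without waiting, which is what allows the queue to keep the hardware busy.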

In a possible implementation, compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction may include: generating an assembly file according to the scalar control flow instruction, and translating the assembly file into a binary file. Among them, the binary file is a compiled scalar control flow instruction.

It should be noted that although the above embodiment is taken as an example to introduce the scalar control flow instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The scalar control flow instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the scalar control flow instruction.

The foregoing can be better understood based on the following clauses:

Clause Q1. A scalar control flow instruction processing device, the device includes a control module, the control module includes:

The instruction compilation submodule compiles the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction;

The data acquisition submodule obtains, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determines the jump condition corresponding to the scalar control flow instruction;

A jump control submodule, when the scalar to be judged satisfies the jump condition, controlling the instruction flow to jump to the target jump address,

Wherein, the operation code is used to indicate that the processing performed by the scalar control flow instruction on the data is scalar jump processing, and the operation domain includes the address of the scalar to be judged and the target jump address.

Clause Q2. The device according to Clause Q1, the jump control sub-module includes:

At least one comparator, configured to compare the scalar to be judged according to the jump condition to obtain a comparison result, where the comparison result indicates whether the scalar to be judged satisfies the jump condition.

Clause Q3. The device according to Clause Q1, the operation domain further includes a jump condition,

Wherein, the data acquisition sub-module is used to determine the jump condition corresponding to the scalar control flow instruction according to the operation domain when the operation domain includes the jump condition.

Clause Q4. The device according to Clause Q1, the operation code is also used to indicate a jump condition,

Wherein, the data acquisition sub-module is used to determine the jump condition corresponding to the scalar control flow instruction according to the operation code when the operation code is used to indicate the jump condition.

Clause Q5. The device according to Clause Q1, the jump condition includes a judgment condition and a data type of a scalar to be judged,

Wherein, the judgment condition includes any one of the following:

The scalar to be judged is greater than the specified value;

The data type includes any of the following:

16-bit unsigned type, 32-bit unsigned type, 48-bit unsigned type, 16-bit signed type, 32-bit signed type, 48-bit signed type.

Clause Q6. The device according to Clause Q1, the device further comprising:

The storage module is used for storing the scalar to be judged.

Clause Q7. The device according to Clause Q1, the control module includes:

Instruction storage sub-module for storing the compiled scalar control flow instruction;

An instruction processing submodule, used for parsing the compiled scalar control flow instruction to obtain the operation code and operation domain of the compiled scalar control flow instruction;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to an execution order, and the plurality of instructions to be executed include the compiled scalar control flow instruction.

Clause Q8. The device according to Clause Q7, the control module, further comprising:

The dependency processing sub-module is used to, when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction that precedes it, cache the first to-be-executed instruction in the instruction storage sub-module, and, after execution of the zeroth to-be-executed instruction is complete, extract the first to-be-executed instruction from the instruction storage sub-module and control its execution.

Clause Q9. The device according to Clause Q1, wherein compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction includes:

Generating an assembly file according to the scalar control flow instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled scalar control flow instruction.

Clause Q10. A machine learning computing device, the device comprising:

One or more scalar control flow instruction processing devices as described in any one of Clause Q1 to Clause Q9, used to obtain the scalar to be judged and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the scalar control flow instruction processing devices, the plurality of the scalar control flow instruction processing devices can be connected and transmit data through a specific structure;

Wherein, the plurality of scalar control flow instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of scalar control flow instruction processing devices share the same control system or have their own control systems; the plurality of scalar control flow instruction processing devices share memory or have their own memories; and the plurality of scalar control flow instruction processing devices may be interconnected in any interconnection topology.

Clause Q11. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause Q10, a universal interconnection interface, and other processing devices;

Clause Q12. A machine learning chip, the machine learning chip includes:

The machine learning computing device according to Clause Q10 or the combined processing device according to Clause Q11.

Clause Q13. An electronic device, the electronic device comprising:

The machine learning chip as described in Clause Q12.

Clause Q14. A board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause Q12;

The storage device is used for storing data;

Clause Q15. A scalar control flow instruction processing method, the method comprising:

Compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction;

Acquiring, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determining the jump condition corresponding to the scalar control flow instruction;

When the scalar to be judged satisfies the jump condition, controlling the instruction flow to jump to the target jump address.

Clause Q16. The method according to Clause Q15, wherein controlling the instruction flow to jump to the target jump address when the scalar to be judged satisfies the jump condition includes:

Using at least one comparator to compare the scalar to be judged according to the jump condition to obtain a comparison result, where the comparison result indicates whether the scalar to be judged satisfies the jump condition.

Clause Q17. The method according to Clause Q15, the operation domain further includes a jump condition,

Wherein, determining the jump condition corresponding to the scalar control flow instruction includes:

When the operation domain includes a jump condition, the jump condition corresponding to the scalar control flow instruction is determined according to the operation domain.

Clause Q18. The method according to Clause Q15, the operation code is also used to indicate a jump condition,

When the operation code is used to indicate a jump condition, the jump condition corresponding to the scalar control flow instruction is determined according to the operation code.

Clause Q19. The method according to Clause Q15, the jump condition includes a judgment condition and a data type of a scalar to be judged,

Wherein, the judgment condition includes any one of the following:

The scalar to be judged is greater than the specified value;

The data type includes any of the following:

Clause Q20. The method according to Clause Q15, the method further comprising:

Store the scalar to be judged.

Clause Q21. The method according to Clause Q15, the method further comprising:

Storing the compiled scalar control flow instruction;

Analyzing the compiled scalar control flow instruction to obtain the operation code and operation domain of the compiled scalar control flow instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled scalar control flow instruction.

Clause Q22. The method according to Clause Q21, the method further comprising:

Clause Q23. The method according to Clause Q15, wherein compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction includes:

Wherein, the binary file is the compiled scalar control flow instruction.

With the widespread use of neural network algorithms and the continuous improvement in the capability of computer hardware operators, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is currently no vector operation instruction that can be applied broadly across programming languages. To implement vector operations in a given language environment, technicians therefore need to customize one or more instructions for that environment, which makes vector operations inefficient and slow. The present disclosure provides a vector instruction processing method, device, computer equipment, and storage medium, with which a vector operation can be implemented using only one instruction, which can significantly improve the efficiency and speed of vector operations.

FIG. 18-1 shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 18-1, the device includes a control module 18-11 and an operation module 18-12.

The control module 18-11 is used to compile the obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction, obtain, according to the operation code and operation domain, the vector to be operated and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction. The operation code is used to indicate that the operation performed by the vector instruction on the data is a vector operation, and the operation domain includes the address of the vector to be operated and the target address.

The operation module 18-12 is used to perform a vector operation on the vector to be operated according to the vector operation type, obtain the operation result, and store the operation result at the target address.

In this embodiment, there may be one or more vectors to be operated. The vector operation type may indicate the kind of arithmetic operation or logical operation performed on the vector to be operated, for example, a vector addition operation. A person skilled in the art can set the vector operation type according to actual needs, which is not limited in the present disclosure.

In this embodiment, the vector instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware, so the control module first needs to compile it. After the compiled vector instruction is obtained, it can be parsed. The compiled vector instruction is a hardware instruction that can be directly executed by hardware. The control module can obtain the vector to be operated from the address of the vector to be operated. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or a field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number that informs the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including parameters such as the vector to be operated and the vector operation type, as well as the corresponding operation method. A vector instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the vector to be operated and the target address.

It should be understood that, those skilled in the art can set the instruction format of the vector instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module can receive vector instructions and control one or more operation modules to perform vector operations. When the device includes multiple control modules, the multiple control modules can each receive vector instructions and control the corresponding one or more operation modules to perform vector operations.

The vector instruction processing device provided by the embodiment of the present disclosure includes a control module and an operation module. The control module is used to compile the obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction, obtain, according to the operation code and operation domain, the vector to be operated and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction. The operation module is used to perform a vector operation on the vector to be operated according to the vector operation type, obtain the operation result, and store the operation result at the target address. The vector instruction processing device provided by the embodiments of the present disclosure has a wide application range, and processes vector instructions and vector operations with high efficiency and speed.

FIG. 18-2a shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 18-2a, the operation module 18-12 may include a plurality of vector operators 18-120. The plurality of vector operators 18-120 are used to perform the vector operation corresponding to the vector operation type.

In this implementation manner, the vector operator may include an adder, a divider, a multiplier, a comparator, and the like that can perform arithmetic operations, logical operations, and the like on the vector. The type and number of vector operators can be set according to the size of the data amount of the vector operation, the type of vector operation, the processing speed and efficiency of the vector operation, etc., and the disclosure does not limit this.

FIG. 18-2b shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 18-2b, the operation module 18-12 may include a master operation sub-module 18-121 and a plurality of slave operation sub-modules 18-122. The master operation sub-module 18-121 may include a plurality of vector operators (not shown in the figure). The master operation sub-module 18-121 is used to perform the vector operation using the plurality of vector operators, obtain the operation result, and store the operation result at the target address.

In a possible implementation, as shown in FIG. 18-2b, the operation module 18-12 may include a master operation sub-module 18-121 and a plurality of slave operation sub-modules 18-122, and each slave operation sub-module 18-122 may include a plurality of vector operators (not shown in the figure). The slave operation sub-modules 18-122 are used to execute the corresponding vector operations in parallel using the vector operators they include, obtain the operation results, store the operation results in the corresponding sub-cache space, and send the operation results to the master operation sub-module 18-121. The master operation sub-module 18-121 is also used to receive the operation results and store them at the target address.

In this implementation, the control module may determine, according to the vector operation type and the amount of operation tasks, whether the currently received vector instruction is executed by the master operation sub-module or by the plurality of slave operation sub-modules. For example, when it is determined that the vectors to be operated need to be summed, the master operation sub-module can be controlled to perform the operation. When it is determined that the vectors to be operated need to be multiplied, the plurality of slave operation sub-modules can be controlled to perform the operation.
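The dispatch rule in the example above can be sketched in software as follows. This is a minimal sketch under stated assumptions: the operation type names `"vector_sum"` and `"vector_mult"`, the function names, the use of a thread pool to stand in for slave operation sub-modules, and the default of four slaves are all hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_master(vectors):
    """Sum several equal-length vectors element by element on one unit."""
    return [sum(column) for column in zip(*vectors)]


def run_on_slaves(first, second, num_slaves=4):
    """Multiply two vectors element by element, one chunk per slave."""
    n = len(first)
    chunk = max(1, -(-n // num_slaves))  # ceiling division, at least 1
    spans = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]

    def multiply_chunk(span):
        lo, hi = span
        return [x * y for x, y in zip(first[lo:hi], second[lo:hi])]

    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # map() returns results in the order of the submitted spans.
        partial_results = list(pool.map(multiply_chunk, spans))
    # The master gathers the partial results into one output vector.
    return [value for part in partial_results for value in part]


def dispatch(operation_type, operands):
    """Route an operation to the master or to the slaves by its type."""
    if operation_type == "vector_sum":
        return run_on_master(operands)
    if operation_type == "vector_mult":
        first, second = operands
        return run_on_slaves(first, second)
    raise ValueError(f"unsupported operation type: {operation_type}")


summed = dispatch("vector_sum", [[1, 2], [3, 4], [5, 6]])
product = dispatch("vector_mult", ([1, 2, 3, 4, 5], [2, 2, 2, 2, 2]))
```

A real device would also weigh the amount of operation tasks when choosing between the two paths; the sketch routes on the operation type alone.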

In a possible implementation, the operation domain may also include a vector operation type.

In this case, the control module 18-11 can also be used to determine the vector operation type according to the operation domain.

In a possible implementation, the vector operation type may include at least one of the following: vector multiplication operation, vector and scalar multiplication operation, vector addition operation, vector summation operation, operation of storing a specified value when an operation condition is satisfied, bitwise AND operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum operation, and bitwise minimum operation. The operation condition may include any of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal, bitwise greater than, and bitwise less than or equal. The specified value may be a numerical value such as 0 or 1, which is not limited in the present disclosure.

The operation of storing the specified value when bitwise equality is satisfied may be: judging whether the corresponding bits of the first vector to be operated and the second vector to be operated are equal; when the corresponding bits are equal, storing the specified value; when the corresponding bits are not equal, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise inequality is satisfied may be: judging whether the corresponding bits of the first vector to be operated and the second vector to be operated are equal; when the corresponding bits are not equal, storing the specified value; when the corresponding bits are equal, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise less than is satisfied may be: judging the size relationship between the corresponding bits of the first vector to be operated and the second vector to be operated; when the value of the first vector to be operated at the corresponding bit is less than the value of the second vector to be operated, storing the specified value; when the value of the first vector to be operated at the corresponding bit is greater than or equal to the value of the second vector to be operated, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise greater than or equal is satisfied may be: judging the size relationship between the corresponding bits of the first vector to be operated and the second vector to be operated; when the value of the first vector to be operated at the corresponding bit is greater than or equal to the value of the second vector to be operated, storing the specified value; when the value of the first vector to be operated at the corresponding bit is less than the value of the second vector to be operated, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise greater than is satisfied may be: judging the size relationship between the corresponding bits of the first vector to be operated and the second vector to be operated; when the value of the first vector to be operated at the corresponding bit is greater than the value of the second vector to be operated, storing the specified value; when the value of the first vector to be operated at the corresponding bit is less than or equal to the value of the second vector to be operated, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise less than or equal is satisfied may be: judging the size relationship between the corresponding bits of the first vector to be operated and the second vector to be operated; when the value of the first vector to be operated at the corresponding bit is less than or equal to the value of the second vector to be operated, storing the specified value; when the value of the first vector to be operated at the corresponding bit is greater than the value of the second vector to be operated, storing the value of the first or the second vector to be operated at the corresponding bit, or storing a value different from the specified value, such as 0.
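The six condition-based store operations above share one pattern, which can be sketched as a single function. This is an illustrative sketch only: it assumes that "corresponding bits" refers to corresponding element positions of the two vectors, and the names `compare_store`, `CONDITIONS`, and the `otherwise` parameter are hypothetical.

```python
import operator

# Operation conditions keyed by short names (eq, ne, lt, ge, gt, le).
CONDITIONS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
}


def compare_store(first, second, condition, specified=1, otherwise=0):
    """Store `specified` wherever the condition holds between the vectors.

    Where the condition does not hold, store `otherwise`; passing
    otherwise=None keeps the first vector's value at that position,
    which is one of the alternatives the description allows.
    """
    if len(first) != len(second):
        raise ValueError("vectors must have the same length")
    holds = CONDITIONS[condition]
    result = []
    for x, y in zip(first, second):
        if holds(x, y):
            result.append(specified)
        else:
            result.append(x if otherwise is None else otherwise)
    return result


equal_mask = compare_store([1, 5, 3], [1, 2, 4], "eq")
less_mask = compare_store([1, 5, 3], [1, 2, 4], "lt")
kept = compare_store([7, 5, 3], [7, 2, 4], "ne", otherwise=None)
```

Keeping the condition in a lookup table means a new condition only needs a new table entry, not a new function.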

In this implementation, a different operation domain code can be set for each vector operation type to distinguish the operations. For example, the code of the "vector multiplication operation" can be set to "mult". The code of the "vector and scalar multiplication operation" can be set to "mult.const". The code of the "vector addition operation" can be set to "add". The code of the "vector summation operation" can be set to "sub". The code of the "bitwise AND operation" can be set to "and". The code of the "bitwise OR operation" can be set to "or". The code of the "bitwise XOR operation" can be set to "xor". The code of the "bitwise inversion operation" can be set to "not". The code of the "bitwise maximum operation" can be set to "max". The code of the "bitwise minimum operation" can be set to "min". The code of the "operation of storing the specified value 1 when bitwise equality is satisfied" can be set to "eq". The code of the "operation of storing the specified value 1 when bitwise inequality is satisfied" can be set to "ne". The code of the "operation of storing the specified value 1 when bitwise less than is satisfied" can be set to "lt". The code of the "operation of storing the specified value 1 when bitwise greater than or equal is satisfied" can be set to "ge". The code of the "operation of storing the specified value 1 when bitwise greater than is satisfied" can be set to "gt". The code of the "operation of storing the specified value 1 when bitwise less than or equal is satisfied" can be set to "le".

A person skilled in the art can set the type of operation and its corresponding code according to actual needs, and this disclosure does not limit this.
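As an illustration of how such codes might select an operation, the following sketch maps several of the example codes to element-by-element functions. The 16-bit unsigned element width used for the inversion, the table name `BINARY_OPS`, and the helper functions are assumptions made for this example and are not specified in the disclosure.

```python
import operator

WIDTH = 16                # assumed element width in bits
MASK = (1 << WIDTH) - 1   # keeps an inverted value within WIDTH bits

# Two-operand codes from the example list, mapped to Python functions.
BINARY_OPS = {
    "mult": operator.mul,
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "max": max,
    "min": min,
}


def apply_binary(code, first, second):
    """Apply the operation selected by `code` position by position."""
    op = BINARY_OPS[code]
    return [op(x, y) for x, y in zip(first, second)]


def apply_not(vector):
    """Code "not": invert every bit of each unsigned WIDTH-bit element."""
    return [~x & MASK for x in vector]


def scale(vector, scalar):
    """Code "mult.const": multiply every element by one scalar."""
    return [x * scalar for x in vector]


xor_result = apply_binary("xor", [0b1100, 0b1010], [0b1010, 0b1010])
max_result = apply_binary("max", [1, 9], [4, 2])
not_result = apply_not([0, 0xFFFF, 1])
scaled = scale([1, 2, 3], 3)
```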

In a possible implementation, the operation domain may further include an input amount. In this case, the control module 18-11 is also used to determine the input amount according to the operation domain, and to obtain, from the address of the vector to be operated, a vector to be operated whose data amount is the input amount.

In this implementation manner, the input amount may be a parameter that characterizes the data amount of the vector to be calculated, for example, vector length, width, and the like.

In a possible implementation, the default input amount can be set. When the input quantity cannot be determined according to the operation domain, the default input quantity can be determined as the input quantity of the current vector instruction, and the to-be-calculated vector whose data quantity is the default input quantity can be obtained from the data address to be calculated.
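The fallback to a default input amount can be sketched as follows. The dictionary representation of the operation domain, the default of 4 elements, and the flat list standing in for storage are hypothetical choices made only for this example.

```python
DEFAULT_SIZE = 4  # hypothetical default input amount, counted in elements


def resolve_size(operation_domain, default=DEFAULT_SIZE):
    """Use the input amount from the operation domain, else the default."""
    size = operation_domain.get("size")
    return default if size is None else size


def fetch_vector(memory, address, size):
    """Read `size` elements starting at `address` from a flat memory list."""
    return memory[address:address + size]


memory = list(range(100, 116))

# The operation domain carries an input amount, so it is used directly.
explicit = fetch_vector(memory, 2, resolve_size({"src": 2, "size": 3}))

# The operation domain carries no input amount, so the default applies.
implicit = fetch_vector(memory, 2, resolve_size({"src": 2}))
```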

In a possible implementation, as shown in FIGS. 18-2a and 18-2b, the device may further include a storage module 18-13. The storage module 18-13 is used to store the vector to be operated.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store the data to be operated and the vector to be operated. The register is used to store the scalar data in the data to be operated.

In a possible implementation, the control module 18-11 can also be used to generate an assembly file according to the vector instructions and translate the assembly file into a binary file, where the binary file is a compiled vector instruction.
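The translation from an assembly line to binary can be illustrated with a toy encoder. The disclosure does not define a binary layout, so everything about the encoding here is an assumption: the numeric type identifiers in `TYPE_IDS`, the 15-byte little-endian layout, and the function names are hypothetical.

```python
import struct

# Hypothetical numbering of operation codes; the text names only the
# mnemonics, not their binary values.
TYPE_IDS = {"mult": 0, "add": 1, "and": 2, "or": 3, "xor": 4}

# Assumed layout: 1-byte type, three 4-byte addresses, 2-byte size,
# little-endian with no padding (15 bytes in total).
LAYOUT = "<BIIIH"


def assemble(line):
    """Translate one line 'code dst src0 src1 size' into 15 bytes."""
    code, dst, src0, src1, size = line.split()
    return struct.pack(LAYOUT, TYPE_IDS[code], int(dst, 0),
                       int(src0, 0), int(src1, 0), int(size, 0))


def disassemble(binary):
    """Recover the numeric fields from a 15-byte encoded instruction."""
    return struct.unpack(LAYOUT, binary)


binary = assemble("add 0x200 0x100 0x180 64")
fields = disassemble(binary)
```

The round trip through `disassemble` shows that the assumed layout preserves every field of the instruction.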

In a possible implementation manner, the instruction format of the vector instruction may be:

opcode dst src type size

Here, opcode is the operation code of the vector instruction, and dst, src, type, and size are the operation domain of the vector instruction. dst is the target address. src is the address of the vector to be operated; when there are multiple vectors to be operated, src may include multiple addresses src0, src1, ..., srcn, which is not limited in the present disclosure. type is the vector operation type. size is the input amount. type can be the code of a vector operation type, such as mult, mult.const, add, sub, eq, ne, lt, ge, gt, le, and, or, xor, not, max, or min.

When there are multiple vectors to be operated, the instruction format may include multiple data addresses to be operated. The following takes the two vectors to be operated as an example. The instruction format of the vector instruction may be:

opcode dst src0 src1 type size
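A parser for this textual format might look like the following sketch. The opcode name `vec`, the use of integer addresses, and the `VectorInstruction` record are assumptions made for illustration; the format itself only fixes the order of the fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VectorInstruction:
    opcode: str
    dst: int
    srcs: List[int]
    op_type: str
    size: int


def parse(text):
    """Split 'opcode dst src [src ...] type size' into its fields."""
    tokens = text.split()
    if len(tokens) < 5:
        raise ValueError("expected: opcode dst src [src ...] type size")
    # The first two and last two tokens are fixed; the rest are sources.
    opcode, dst, *srcs, op_type, size = tokens
    return VectorInstruction(opcode, int(dst, 0),
                             [int(src, 0) for src in srcs],
                             op_type, int(size, 0))


two_sources = parse("vec 0x200 0x100 0x180 add 64")
one_source = parse("vec 0x200 0x100 not 32")
```

Because the source addresses sit between fixed leading and trailing fields, the same parser handles one, two, or more vectors to be operated.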

type, dst, src, size

In a possible implementation, the instruction format of the vector instruction used for the "vector multiplication operation" can be set to: mult dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; perform a multiplication operation on the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "vector and scalar multiplication operation" can be set to: mult.const dst src0 src1 size. It means: obtain the vector to be operated, of size size, from the first to-be-operated data address src0; obtain the scalar to be operated from the second to-be-operated data address src1; multiply the vector by the scalar to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "vector addition operation" can be set to: add dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; perform an addition operation on the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "vector summation operation" can be set to: sub dst src size. It means: obtain a plurality of vectors to be operated, each of size size, from the to-be-operated address src; perform a summation operation on the plurality of vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "bitwise AND operation" can be set to: and dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; perform a bitwise AND operation on the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "bitwise OR operation" can be set to: or dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; perform a bitwise OR operation on the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "bitwise XOR operation" can be set to: xor dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; perform a bitwise XOR operation on the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "bitwise inversion operation" can be set to: not dst src size. It means: obtain the vector to be operated, of size size, from the to-be-operated address src; perform a bitwise inversion operation on the vector to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation, the instruction format of the vector instruction used for the "bitwise maximum operation" can be set to: max dst src0 src1 size. It means: obtain the first vector to be operated, of size size, from the first to-be-operated address src0; obtain the second vector to be operated, of size size, from the second to-be-operated address src1; take the bitwise maximum of the first and second vectors to be operated to obtain the operation result; and store the operation result at the target address dst.

In a possible implementation manner, the instruction format of the vector instruction used for "bitwise minimum value operation" can be set to: min dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, take the minimum of the first to-be-operated vector and the second to-be-operated vector bit by bit to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation, the instruction format of the vector instruction used for "store the specified value 1 when bitwise equal" can be set to: eq dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the first to-be-operated vector and the second to-be-operated vector bit by bit, store the specified value 1 where the corresponding bits of the two vectors are equal to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation manner, the instruction format of the vector instruction used for "store the specified value 1 when bitwise unequal" may be set to: ne dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the first to-be-operated vector and the second to-be-operated vector bit by bit, store the specified value 1 where the corresponding bits of the two vectors are not equal to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation manner, the instruction format of the vector instruction used for "store the specified value 1 when bitwise less than" can be set to: lt dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the two vectors bit by bit, store the specified value 1 where the value of the first to-be-operated vector on the corresponding bit is less than the value of the second to-be-operated vector to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation, the instruction format of the vector instruction used for "store the specified value 1 when bitwise greater than or equal" can be set to: ge dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the two vectors bit by bit, store the specified value 1 where the value of the first to-be-operated vector on the corresponding bit is greater than or equal to the value of the second to-be-operated vector to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation manner, the instruction format of the vector instruction used for "store the specified value 1 when bitwise greater than" can be set to: gt dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the two vectors bit by bit, store the specified value 1 where the value of the first to-be-operated vector on the corresponding bit is greater than the value of the second to-be-operated vector to obtain the operation result, and store the operation result to the target address dst.

In a possible implementation, the instruction format of the vector instruction used for "store the specified value 1 when bitwise less than or equal" can be set to: le dst src0 src1 size. It means: obtain the first to-be-operated vector of size size from the first to-be-operated address src0 and the second to-be-operated vector of size size from the second to-be-operated address src1, compare the two vectors bit by bit, store the specified value 1 where the value of the first to-be-operated vector on the corresponding bit is less than or equal to the value of the second to-be-operated vector to obtain the operation result, and store the operation result to the target address dst.
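The instruction formats above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the disclosed hardware: memory is modelled as a flat list of unsigned 8-bit integers, addresses are list indices, the comparison instructions store 0 where the condition does not hold, and the names `execute`, `BINARY_OPS` and `MASK` are hypothetical.

```python
import operator

# Hypothetical model: memory is a flat list of unsigned 8-bit integers and
# addresses are list indices. Neither detail comes from the disclosure.
MASK = 0xFF

BINARY_OPS = {
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "max": max,
    "min": min,
    # Comparison instructions store the specified value 1 where the
    # condition holds; storing 0 otherwise is an assumption.
    "eq": lambda a, b: int(a == b),
    "ne": lambda a, b: int(a != b),
    "lt": lambda a, b: int(a < b),
    "ge": lambda a, b: int(a >= b),
    "gt": lambda a, b: int(a > b),
    "le": lambda a, b: int(a <= b),
}


def execute(memory, name, dst, *operands):
    """Run one instruction, e.g. `and dst src0 src1 size` or `not dst src size`."""
    if name == "not":
        src, size = operands
        result = [~x & MASK for x in memory[src:src + size]]
    else:
        src0, src1, size = operands
        op = BINARY_OPS[name]
        first = memory[src0:src0 + size]
        second = memory[src1:src1 + size]
        result = [op(a, b) for a, b in zip(first, second)]
    # Store the operation result to the target address dst.
    memory[dst:dst + size] = result
    return result


memory = [0] * 16
memory[0:4] = [0b1100, 5, 7, 9]   # first to-be-operated vector at address 0
memory[4:8] = [0b1010, 5, 2, 11]  # second to-be-operated vector at address 4

print(execute(memory, "and", 8, 0, 4, 4))  # [8, 5, 2, 9]
print(execute(memory, "max", 8, 0, 4, 4))  # [12, 5, 7, 11]
print(execute(memory, "lt", 8, 0, 4, 4))   # [0, 0, 0, 1]
print(execute(memory, "not", 12, 0, 4))    # [243, 250, 248, 246]
```

A table keyed by mnemonic keeps the two-operand instructions uniform; only `not` needs a separate branch because it takes a single source address.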

It should be understood that those skilled in the art can set the operation code of the vector instruction, the position of the operation code and the operation field in the instruction format as needed, and the disclosure does not limit this.

It should be noted that although the vector instruction processing apparatus is described above by taking the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure, taking "using the vector instruction processing apparatus for vector operation" as an exemplary application scenario to facilitate understanding of the processing flow of the vector instruction processing apparatus. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as a limitation on the embodiments of the present disclosure.

FIG. 18-3 shows a schematic diagram of an application scenario of a vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 18-3, the vector instruction processing apparatus processes a vector instruction as follows: the control module 18-11 compiles the acquired vector instruction 1 (for example, vector instruction 1 is opcode 500, 101, 102, add, 1024) to obtain the compiled vector instruction 1, and parses the compiled vector instruction 1 to obtain the operation code and operation domain of vector instruction 1. The operation code of vector instruction 1 is opcode, the target address is 500, the first to-be-operated vector address is 101, the second to-be-operated vector address is 102, the vector operation type is add (vector addition operation), and the input amount is 1024. The control module 18-11 obtains a first to-be-operated vector whose data amount is 1024 from the to-be-operated vector address 101, and a second to-be-operated vector whose data amount is 1024 from the to-be-operated vector address 102. The operation module 18-12 performs an addition operation on the first to-be-operated vector and the second to-be-operated vector to obtain an operation result 1, and stores the operation result 1 in the target address 500.

Among them, the vector instruction 1 can be not only the above opcode 500, 101, 102, add, 1024, but also add 500, 101, 102, 1024. The processing procedure of vector instructions in different instruction formats is similar and will not be repeated here.
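As a rough sketch of how the two equivalent formats of vector instruction 1 could be parsed into the same operation code and operation domain, consider the following Python fragment. The `VectorInstruction` record, the `parse` function and the set of mnemonics in `VECTOR_OP_TYPES` are hypothetical names introduced only for illustration; the disclosure does not prescribe a textual syntax beyond the two examples given.

```python
from dataclasses import dataclass

# Hypothetical set of mnemonics; the disclosure leaves opcode naming open.
VECTOR_OP_TYPES = {"add", "mul", "and", "or", "xor", "max", "min"}


@dataclass
class VectorInstruction:
    op_type: str  # vector operation type
    dst: int      # target address
    src0: int     # first to-be-operated vector address
    src1: int     # second to-be-operated vector address
    size: int     # input amount


def parse(text):
    """Accept `opcode 500, 101, 102, add, 1024` or `add 500, 101, 102, 1024`."""
    head, rest = text.split(maxsplit=1)
    fields = [f.strip() for f in rest.split(",")]
    if head in VECTOR_OP_TYPES:
        # The operation code itself indicates the vector operation type.
        dst, src0, src1, size = fields
        op_type = head
    else:
        # Generic operation code; the type is carried in the operation domain.
        dst, src0, src1, op_type, size = fields
    return VectorInstruction(op_type, int(dst), int(src0), int(src1), int(size))


a = parse("opcode 500, 101, 102, add, 1024")
b = parse("add 500, 101, 102, 1024")
print(a == b)  # True
```

Both spellings decode to the same record, which is why the processing procedure that follows can be shared between instruction formats.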

In this way, the vector instruction processing device can efficiently and quickly process the vector instruction, and the vector operation has high processing efficiency and fast processing speed.

FIG. 18-4 shows a flowchart of a vector instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-18 and S52-18. As shown in FIG. 18-4, this method is applied to the above-mentioned vector instruction processing apparatus. The method includes steps S51-18 and S52-18.

In step S51-18, the control module is used to compile the acquired vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction, obtain, according to the operation code and the operation domain, the to-be-operated vector and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction. The operation code is used to indicate that the operation performed by the vector instruction on the data is a vector operation, and the operation domain includes the to-be-operated vector address and the target address.

In step S52-18, the operation module is used to perform a vector operation on the to-be-operated vector according to the vector operation type to obtain an operation result, and the operation result is stored in the target address.

In a possible implementation manner, performing the vector operation on the to-be-operated vector according to the vector operation type to obtain an operation result may include: using multiple vector operators in the operation module to perform the vector operation corresponding to the vector operation type.

In a possible implementation manner, the operation module may include a master operation submodule and a plurality of slave operation submodules, and the master operation submodule may include the plurality of vector operators. Wherein, step S52-18 may include: using a plurality of vector operators in the main operation sub-module to perform a vector operation corresponding to the type of vector operation, obtain an operation result, and store the operation result in a target address.

In a possible implementation, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and each slave operation sub-module includes multiple vector operators. In this case, performing the vector operation on the to-be-operated vector according to the vector operation type to obtain the operation result, and storing the operation result in the target address, includes: using the multiple vector operators included in each slave operation sub-module to perform the corresponding vector operations in parallel to obtain operation results, storing the operation results in the corresponding sub-cache space, and sending the operation results to the master operation sub-module; and using the master operation sub-module to receive the operation results and store them in the target address.
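The master/slave arrangement can be sketched in software as follows. This is a loose analogy only: worker threads stand in for slave operation sub-modules, the vector operation is fixed to addition, and the way the vectors are partitioned among the slaves (contiguous, equal-sized chunks) is an assumption that the text does not specify. The names `slave_add` and `master_execute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_add(chunk0, chunk1):
    """One slave sub-module: compute its share and return the partial result."""
    return [a + b for a, b in zip(chunk0, chunk1)]


def master_execute(memory, dst, vec0, vec1, num_slaves=4):
    """Master sub-module: hand out chunks, gather results, store them at dst."""
    n = len(vec0)
    step = max(1, -(-n // num_slaves))  # ceiling division, at least 1
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        futures = [pool.submit(slave_add, vec0[lo:hi], vec1[lo:hi])
                   for lo, hi in bounds]
        # Collect in submission order so the chunks are reassembled correctly.
        result = [x for f in futures for x in f.result()]
    memory[dst:dst + n] = result
    return result


memory = [0] * 16
out = master_execute(memory, 8, [1, 2, 3, 4, 5, 6],
                     [10, 20, 30, 40, 50, 60], num_slaves=3)
print(out)  # [11, 22, 33, 44, 55, 66]
```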

In a possible implementation, the operation domain may also include a vector operation type. Wherein, determining the vector operation type of the vector instruction may include: when the vector operation type is included in the operation domain, determining the vector operation type according to the operation domain.

In a possible implementation manner, the operation domain may further include an input amount. Wherein, obtaining the to-be-operated vector and the target address required to execute the vector instruction according to the operation code and the operation domain may also include: determining the input amount according to the operation domain, and obtaining, from the to-be-operated vector address, the to-be-operated vector whose data amount is the input amount.

In a possible implementation, the operation code is also used to indicate the type of vector operation. Wherein, determining the vector operation type of the vector instruction may include: when the operation code is used to indicate the vector operation type, determining the vector operation type according to the operation code.

In a possible implementation manner, the vector operation type may include at least one of the following: vector multiplication operation, vector and scalar multiplication operation, vector addition operation, vector sum operation, store-specified-value operation when an operation condition is met, bitwise AND operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, and bitwise minimum value operation. The operation condition may include any of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal, bitwise greater than, bitwise less than or equal.

In a possible implementation manner, the method may further include: using the storage module of the device to store the to-be-operated vector, wherein the storage module includes at least one of a register and a cache. The cache is used to store the to-be-operated data and the to-be-operated vector, and includes at least one neuron cache NRAM; the register is used to store scalar data in the to-be-operated data; the neuron cache is used to store neuron data in the to-be-operated data, and the neuron data includes neuron vector data.

In a possible implementation, parsing the compiled vector instruction to obtain the operation code and operation domain of the vector instruction may include: storing the compiled vector instruction; parsing the compiled vector instruction to obtain the operation code and operation domain of the vector instruction; and storing an instruction queue, where the instruction queue includes a plurality of to-be-executed instructions arranged in execution order, and the plurality of to-be-executed instructions may include the compiled vector instruction.

The association between the first to-be-executed instruction and the zeroth to-be-executed instruction preceding it may include: the first storage address interval, which stores the data required by the first to-be-executed instruction, and the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction, have an overlapping area.
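The overlap test between two storage address intervals can be written in a few lines. The sketch below assumes each interval is given as a start address and a length and is half-open, that is, it covers start up to but not including start plus length; this representation and the name `intervals_overlap` are illustrative choices, not taken from the disclosure.

```python
def intervals_overlap(first, zeroth):
    """Return True if two (start, length) address intervals share any address."""
    start1, length1 = first
    start0, length0 = zeroth
    # Two half-open ranges overlap when each starts before the other ends.
    return start1 < start0 + length0 and start0 < start1 + length1


# The first instruction reads 1024 elements from address 101; the zeroth
# instruction writes 1024 elements to address 500: the intervals overlap.
print(intervals_overlap((101, 1024), (500, 1024)))  # True
# Adjacent but disjoint intervals are not associated.
print(intervals_overlap((0, 100), (100, 50)))       # False
```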

In a possible implementation manner, compiling the obtained vector instruction to obtain the compiled vector instruction may include: generating an assembly file according to the vector instruction, and translating the assembly file into a binary file. Among them, the binary file is a compiled vector instruction.

It should be noted that although the vector instruction processing method is described above using the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The vector instruction processing method provided by the embodiments of the present disclosure has a wide application range, and achieves high processing efficiency and fast processing speed for vector operations.

The foregoing can be better understood based on the following clauses:

Clause R1, a vector instruction processing device, the device comprising:

The control module is used to compile the obtained vector instruction to obtain the compiled vector instruction, parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction, obtain, according to the operation code and the operation domain, the to-be-operated vector and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction;

An operation module, configured to perform vector operation on the to-be-operated vector according to the vector operation type, obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the vector instruction on the data is a vector operation, and the operation domain includes the vector address to be operated and the target address.

Clause R2. The device according to Clause R1, the arithmetic module includes:

A plurality of vector operators are used to perform vector operations corresponding to the vector operation type.

Clause R3. The device according to Clause R2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of vector operators,

The main operation sub-module is used to perform the vector operation by using the plurality of vector operators, obtain an operation result, and store the operation result in the target address.

Clause R4. The device according to Clause R2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of vector operators,

The slave operation sub-module is used to perform the corresponding vector operations in parallel by using the plurality of vector operators it includes to obtain the operation result, store the operation result in the corresponding sub-cache space, and send the operation result to the main operation sub-module;

The main operation sub-module is also used to receive the operation result and store the operation result in the target address.

Clause R5. The device according to Clause R1, the operation domain further includes a vector operation type,

Wherein, the control module is further used to determine the vector operation type according to the operation domain when the operation domain includes the vector operation type.

Clause R6. The device according to Clause R1, the operation domain further includes an input amount,

Wherein, the control module is further configured to determine the input amount according to the operation domain, and obtain a to-be-calculated vector whose data amount is the input amount from the data address to be calculated.

Clause R7. The apparatus according to Clause R1, the operation code is further used to indicate the type of vector operation,

The control module is further configured to determine the vector operation type according to the operation code when the operation code is used to indicate the vector operation type.

Clause R8. The apparatus according to Clause R1, the vector operation type includes at least one of the following:

Vector multiplication operation, vector and scalar multiplication operation, vector addition operation, vector sum operation, store-specified-value operation when an operation condition is met, bitwise AND operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, bitwise minimum value operation,

Wherein, the operation condition includes any one of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal, bitwise greater than, bitwise less than or equal.

Clause R9. The device according to Clause R1, the device further comprising:

A storage module, used to store the vector to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated and the vector to be calculated, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause R10. The device according to Clause R1, the control module includes:

Instruction storage sub-module for storing the compiled vector instruction;

Instruction processing sub-module, which is used to parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled vector instructions.

Clause R11. The device according to Clause R10, the control module, further comprising:

Clause R12, the device according to Clause R1,

The control module is also used to generate an assembly file according to the vector instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled vector instruction.

Clause R13. A machine learning computing device, the device comprising:

One or more vector instruction processing devices as described in any one of Clause R1 to Clause R12, used to obtain to-be-operated vectors and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the vector instruction processing devices, the plurality of vector instruction processing devices may be connected and transmit data through a specific structure;

Wherein, a plurality of the vector instruction processing devices are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; a plurality of the vector instruction processing devices share the same control system or have their own respective control systems; a plurality of the vector instruction processing devices share memory or have their own memories; the interconnection mode of the plurality of vector instruction processing devices is an arbitrary interconnection topology.

Clause R14. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause R13;

Clause R15. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause R13 or the combined processing device according to clause R14.

Clause R16. An electronic device, the electronic device comprising:

Machine learning chip as described in clause R15.

Clause R17, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause R15;

The storage device is used for storing data;

Clause R18. A vector instruction processing method. The method is applied to a vector instruction processing apparatus. The apparatus includes a control module and an arithmetic module. The method includes:

Using the control module to compile the obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain the operation code and operation domain of the vector instruction, obtain, according to the operation code and the operation domain, the to-be-operated vector and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction;

Using an operation module to perform a vector operation on the to-be-operated vector according to the vector operation type to obtain an operation result, and store the operation result in the target address,

Clause R19, according to the method described in Clause R18, performing vector operations on the vector to be operated according to the vector operation type, including:

A plurality of vector operators in the operation module are used to perform vector operations corresponding to the vector operation type.

Clause R20. The method according to Clause R19, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of vector operators,

Wherein, performing vector operation on the vector to be operated according to the vector operation type to obtain an operation result, and storing the operation result in the target address includes:

Use the plurality of vector operators in the main operation sub-module to perform a vector operation corresponding to the vector operation type, obtain an operation result, and store the operation result in the target address.

Clause R21. The method according to Clause R19, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of vector operators,

Use multiple vector operators included in each slave operation sub-module to execute corresponding vector operations in parallel to obtain operation results, store the operation results in the corresponding sub-cache space, and send the operation results to the main operation sub-module;

The main operation sub-module is used to receive the operation result and store the operation result in the target address.

Clause R22. The method according to Clause R18, the operation domain further includes a vector operation type,

Among them, determining the vector operation type of the vector instruction includes:

When a vector operation type is included in the operation domain, the vector operation type is determined according to the operation domain.

Clause R23. The method according to Clause R18, the operation domain further includes an input amount,

Wherein, obtaining the to-be-operated vector and the target address required to execute the vector instruction according to the operation code and the operation domain also includes:

The input amount is determined according to the operation domain, and a to-be-calculated vector whose data amount is the input amount is obtained from the data address to be calculated.

Clause R24. The method according to Clause R18, the opcode is also used to indicate the type of vector operation,

When the operation code is used to indicate the vector operation type, the vector operation type is determined according to the operation code.

Clause R25. The method according to Clause R18, the vector operation type includes at least one of the following:

Vector multiplication operation, vector and scalar multiplication operation, vector addition operation, vector sum operation, store specified value operation, bitwise AND operation, bitwise OR operation, bitwise XOR operation, bitwise inversion operation, bitwise maximum value operation, bitwise minimum value operation,

Clause R26. The method according to Clause R18, the method further comprising:

Use the storage module of the device to store the vector to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause R27. According to the method described in Clause R18, parse the compiled vector instruction to obtain the opcode and operation domain of the vector instruction, including:

Storing the compiled vector instruction;

Analyzing the compiled vector instruction to obtain the operation code and operation domain of the vector instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled vector instructions.

Clause R28. The method according to Clause R27, the method further comprising:

Clause R29. Compile the obtained vector instructions according to the method described in Clause R18 to obtain the compiled vector instructions, including:

Generate an assembly file according to the vector instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled vector instruction.

Clause R30. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of clause R18 to clause R29.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing power, the types and number of data operations involved in practical applications continue to increase. Programming languages are diverse, and in the related art there is at this stage no loop vector operation instruction that can be widely applied across programming languages. To implement a loop operation on vectors, technicians therefore need to customize one or more instructions for their particular programming language environment, which results in low efficiency and slow speed of loop vector operations. The present disclosure provides a loop vector instruction processing method, device, computer equipment, and storage medium, in which a loop vector operation can be implemented with only one instruction, which can significantly improve the efficiency and speed of loop vector operations.

FIG. 19-1 shows a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 19-1, the device includes a control module 19-11 and an arithmetic module 19-12.

The control module 19-11 is used to compile the obtained loop vector instruction to obtain the compiled loop vector instruction, parse the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction, obtain, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector, and the target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction. The operation code is used to indicate that the operation performed by the loop vector instruction on the data is a loop vector operation, and the operation domain includes the first to-be-operated vector address, the second to-be-operated vector address, and the target address.

The operation module 19-12 is configured to divide the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector, perform a vector operation on each split vector and the second to-be-operated vector according to the vector operation type to obtain the operation result, and store the operation result in the target address.

In this embodiment, the loop vector operation may be to divide the vector with the larger data amount into multiple split vectors, each having the same data amount as the other vector with the smaller data amount, and then perform the operation corresponding to the vector operation type on each split vector and the other vector to obtain the operation result. The vector operation type may indicate the kind of arithmetic operation or logical operation performed on the split vector and the second to-be-operated vector, for example, a vector addition operation. A person skilled in the art can set the vector operation type according to actual needs, which is not limited in the present disclosure.

In this embodiment, performing the vector operation on each split vector and the second to-be-operated vector according to the vector operation type yields a split operation result for each split vector. The multiple split operation results are stored in the target address as the operation result of the loop vector instruction; that is, the multiple split operation results together serve as the result of the vector operation on the first to-be-operated vector and the second to-be-operated vector. The data amount of the first to-be-operated vector may be an integer multiple of the data amount of the second to-be-operated vector, so as to ensure that each split vector can undergo the vector operation with the second to-be-operated vector.
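A minimal Python sketch of the loop vector operation described above follows. It assumes the vectors are plain lists and that the first vector's length is an exact multiple of the second's; raising an error otherwise is a choice made for this illustration, and the function name `loop_vector_op` is hypothetical.

```python
import operator


def loop_vector_op(first, second, op):
    """Split `first` into pieces the size of `second` and apply `op` to each."""
    k = len(second)
    if k == 0 or len(first) % k != 0:
        # Assumption for this sketch: reject sizes that do not divide evenly.
        raise ValueError("first vector length must be a multiple of the second")
    result = []
    for start in range(0, len(first), k):
        split = first[start:start + k]              # one split vector
        result.extend(op(a, b) for a, b in zip(split, second))
    return result                                   # concatenated split results


# Vector addition as the vector operation type.
print(loop_vector_op([1, 2, 3, 4, 5, 6], [10, 20], operator.add))
# [11, 22, 13, 24, 15, 26]
```

The second vector is reused against every split vector, which is what allows a single instruction to replace an explicit loop of ordinary vector instructions.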

In this embodiment, the loop vector instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module first needs to compile the (uncompiled) loop vector instruction; only after the compiled loop vector instruction is obtained can it be parsed. The compiled loop vector instruction is a hardware instruction that can be directly executed by the hardware. The control module may obtain the first to-be-operated vector and the second to-be-operated vector from the first to-be-operated vector address and the second to-be-operated vector address, respectively. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) that specifies the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction to execute. The operation domain may be the source of all the data required to execute the corresponding instruction, including the first to-be-operated vector, the second to-be-operated vector, parameters such as the vector operation type, and the corresponding operation method. A loop vector instruction must include an operation code and an operation domain, where the operation domain includes at least a first to-be-operated vector address, a second to-be-operated vector address, and a target address.

It should be understood that those skilled in the art may set the instruction format of the loop vector instruction, as well as the included opcodes and operation fields as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, the control module can receive a loop vector instruction and control one or more operation modules to perform the loop vector operation. When the device includes a plurality of control modules, the control modules can each receive a loop vector instruction and control the corresponding one or more operation modules to perform the loop vector operation.

A loop vector instruction processing device provided by an embodiment of the present disclosure includes a control module and an operation module. The control module is configured to compile the obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction, obtain, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector, and the target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction. The operation module is configured to divide the first to-be-operated vector into a plurality of divided vectors according to the second to-be-operated vector, perform a vector operation on each divided vector and the second to-be-operated vector according to the vector operation type to obtain the operation result, and store the operation result in the target address. The loop vector instruction processing device provided by the embodiments of the present disclosure has a wide application range and processes loop vector instructions with high efficiency and speed.

FIG. 19-2a shows a block diagram of a loop vector instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 19-2a, the operation module 19-12 may include multiple vector operators 19-120. The vector operators 19-120 are configured to perform the vector operation corresponding to the vector operation type.

FIG. 19-2b shows a block diagram of a loop vector instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 19-2b, the operation module 19-12 may include a master operation sub-module 19-121 and a plurality of slave operation sub-modules 19-122. The master operation sub-module 19-121 may include multiple vector operators (not shown in the figure).

The master operation sub-module 19-121 is configured to perform the vector operation using its multiple vector operators to obtain the operation result, and to store the operation result in the target address.

In a possible implementation, as shown in FIG. 19-2b, the operation module 19-12 may include a master operation sub-module 19-121 and a plurality of slave operation sub-modules 19-122, and each slave operation sub-module 19-122 may include multiple vector operators (not shown). The slave operation sub-modules 19-122 are configured to execute the corresponding vector operations in parallel using their vector operators to obtain the operation results, store the operation results in the corresponding sub-cache space, and send the operation results to the master operation sub-module 19-121. The master operation sub-module 19-121 is further configured to receive the operation results and store them in the target address.
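As a rough software analogy of the master/slave arrangement, the sketch below lets a pool of worker threads stand in for the slave operation sub-modules and lets the calling function stand in for the master operation sub-module. The names `slave_compute` and `master_collect`, the thread pool, and the parameter `num_slaves` are all assumptions for illustration; the disclosure describes hardware sub-modules, not threads.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def slave_compute(divided: Sequence[int], second: Sequence[int],
                  op: Callable[[int, int], int]) -> List[int]:
    """Work done by one (hypothetical) slave: operate on a single divided vector."""
    return [op(a, b) for a, b in zip(divided, second)]


def master_collect(first: Sequence[int], second: Sequence[int],
                   op: Callable[[int, int], int], num_slaves: int = 4) -> List[int]:
    """Hypothetical master: hand divided vectors to slaves, then gather the results."""
    n = len(second)
    if n == 0 or len(first) % n != 0:
        raise ValueError("length of first must be a multiple of length of second")
    divided_vectors = [first[i:i + n] for i in range(0, len(first), n)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # Executor.map returns results in submission order, so the partial
        # results line up with the divided vectors.
        partials = list(pool.map(lambda d: slave_compute(d, second, op),
                                 divided_vectors))
    # The master concatenates the partial results before storing them.
    return [x for part in partials for x in part]


print(master_collect([1, 2, 3, 4], [10, 20], operator.mul))  # [10, 40, 30, 80]
```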

In this implementation, the control module may determine, according to the vector operation type and the amount of operation tasks, whether the currently received loop vector instruction is executed by the master operation sub-module or by the multiple slave operation sub-modules. For example, when the vector operation type is determined to be a vector addition operation, the master operation sub-module can be controlled to perform the operation. When the vector operation type is determined to be a vector multiplication operation, the multiple slave operation sub-modules can be controlled to perform the operation.
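One way to picture this selection is a small dispatch rule. The sketch below is only an assumption about how such a rule might look: the function name `choose_executor`, the set `SLAVE_OP_TYPES`, and the task-amount threshold of 1024 are hypothetical, since the text gives only the addition and multiplication examples and no concrete threshold.

```python
# Hypothetical rule: operation types routed to the slave sub-modules by type alone.
SLAVE_OP_TYPES = {"mult.cycle"}


def choose_executor(op_type: str, task_amount: int, threshold: int = 1024) -> str:
    """Return "slaves" or "master" for a loop vector instruction.

    `threshold` is an assumed cut-off on the amount of operation tasks.
    """
    if op_type in SLAVE_OP_TYPES or task_amount > threshold:
        return "slaves"
    return "master"


print(choose_executor("add.cycle", 64))   # master
print(choose_executor("mult.cycle", 64))  # slaves
```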

The control module 19-11 may also be configured to determine the vector operation type according to the operation domain.

In a possible implementation, the vector operation type may include at least one of the following: vector multiplication operation, vector addition operation, vector summation operation, operation of storing a specified value when an operation condition is satisfied, bitwise AND operation, bitwise OR operation, and bitwise XOR operation. The operation condition may include any of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal, bitwise greater than, and bitwise less than or equal. The specified value may be a numerical value such as 0 or 1, which is not limited in the present disclosure.

The operation of storing the specified value when bitwise equality is satisfied may be: judging whether the corresponding bits of the divided vector and the second to-be-operated vector are equal, and storing the specified value when they are equal; when the corresponding bits are not equal, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise inequality is satisfied may be: judging whether the corresponding bits of the divided vector and the second to-be-operated vector are equal, and storing the specified value when they are not equal; when the corresponding bits are equal, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise less-than is satisfied may be: judging the magnitude relationship between the corresponding bits of the divided vector and the second to-be-operated vector, and storing the specified value when the value of the divided vector at the corresponding bit is less than the value of the second to-be-operated vector; when the value of the divided vector at the corresponding bit is greater than or equal to the value of the second to-be-operated vector, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise greater-than-or-equal is satisfied may be: judging the magnitude relationship between the corresponding bits of the divided vector and the second to-be-operated vector, and storing the specified value when the value of the divided vector at the corresponding bit is greater than or equal to the value of the second to-be-operated vector; when the value of the divided vector at the corresponding bit is less than the value of the second to-be-operated vector, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise greater-than is satisfied may be: judging the magnitude relationship between the corresponding bits of the divided vector and the second to-be-operated vector, and storing the specified value when the value of the divided vector at the corresponding bit is greater than the value of the second to-be-operated vector; when the value of the divided vector at the corresponding bit is less than or equal to the value of the second to-be-operated vector, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.

The operation of storing the specified value when bitwise less-than-or-equal is satisfied may be: judging the magnitude relationship between the corresponding bits of the divided vector and the second to-be-operated vector, and storing the specified value when the value of the divided vector at the corresponding bit is less than or equal to the value of the second to-be-operated vector; when the value of the divided vector at the corresponding bit is greater than the value of the second to-be-operated vector, storing the value of the divided vector or of the second to-be-operated vector at the corresponding bit, or storing a value different from the specified value, such as 0.
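The six conditional-store operations share one pattern, which the following sketch models on plain lists. It is an illustration under stated assumptions: the function name `compare_store` and the table `CONDITIONS` are hypothetical, and of the fallback options the text allows, this sketch picks "store 0 when the condition is not satisfied".

```python
import operator
from typing import Callable, Dict, List, Sequence

# Hypothetical mapping from condition name to comparison function.
CONDITIONS: Dict[str, Callable[[int, int], bool]] = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "ge": operator.ge,
    "gt": operator.gt, "le": operator.le,
}


def compare_store(divided: Sequence[int], second: Sequence[int], condition: str,
                  specified: int = 1, otherwise: int = 0) -> List[int]:
    """Store `specified` wherever the condition holds, else `otherwise`.

    Assumption: positions that fail the condition receive `otherwise` (0 here);
    the text also permits storing the original operand value instead.
    """
    test = CONDITIONS[condition]
    return [specified if test(a, b) else otherwise
            for a, b in zip(divided, second)]


print(compare_store([1, 5, 3], [1, 2, 4], "eq"))  # [1, 0, 0]
print(compare_store([1, 5, 3], [1, 2, 4], "lt"))  # [0, 0, 1]
print(compare_store([1, 5, 3], [1, 2, 4], "ge"))  # [1, 1, 0]
```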

In this implementation, different operation domain codes can be set for different vector operation types to distinguish them. For example, the code for the "vector multiplication operation" can be set to "mult.cycle". The code for the "vector addition operation" can be set to "add.cycle". The code for the "vector summation operation" can be set to "sub.cycle". The code for the "bitwise AND operation" can be set to "and.cycle". The code for the "bitwise OR operation" can be set to "or.cycle". The code for the "bitwise XOR operation" can be set to "xor.cycle". The code for the "operation of storing the specified value 1 when bitwise equality is satisfied" can be set to "eq.cycle". The code for the "operation of storing the specified value 1 when bitwise inequality is satisfied" can be set to "ne.cycle". The code for the "operation of storing the specified value 1 when bitwise less-than is satisfied" can be set to "lt.cycle". The code for the "operation of storing the specified value 1 when bitwise greater-than-or-equal is satisfied" can be set to "ge.cycle". The code for the "operation of storing the specified value 1 when bitwise greater-than is satisfied" can be set to "gt.cycle". The code for the "operation of storing the specified value 1 when bitwise less-than-or-equal is satisfied" can be set to "le.cycle".

In a possible implementation, the operation domain may further include a first input amount and a second input amount. The control module 19-11 may also be configured to determine the first input amount and the second input amount according to the operation domain, obtain from the first to-be-operated vector address a first to-be-operated vector whose data amount is the first input amount, and obtain from the second to-be-operated vector address a second to-be-operated vector whose data amount is the second input amount.

In a possible implementation, dividing the first to-be-operated vector into n divided vectors according to the second to-be-operated vector may include: determining the divided data amount of each divided vector according to the second input amount, and dividing the first to-be-operated vector into n divided vectors according to the divided data amount.

In this implementation, the second input amount can be taken as the divided data amount of each divided vector.

In this implementation manner, the first input amount and the second input amount may be parameters that characterize the data amount of the first to-be-computed vector and the second to-be-computed vector, for example, vector length, width, and the like.

In a possible implementation, a default first input amount and a default second input amount may be set. When the first input amount and the second input amount cannot be determined according to the operation domain, the default first input amount and the default second input amount can be used as the first input amount and the second input amount of the current loop vector instruction, so that a first to-be-operated vector whose data amount is the default first input amount is obtained from the first to-be-operated vector address and a second to-be-operated vector whose data amount is the default second input amount is obtained from the second to-be-operated vector address.
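The fallback to default input amounts can be sketched as follows. The default values 64 and 16 and the function name `resolve_input_amounts` are arbitrary assumptions for illustration, and the sketch resolves each amount independently, which is one possible reading of the text.

```python
from typing import Optional, Tuple

# Assumed defaults; the disclosure does not fix concrete values.
DEFAULT_FIRST_INPUT_AMOUNT = 64
DEFAULT_SECOND_INPUT_AMOUNT = 16


def resolve_input_amounts(first: Optional[int] = None,
                          second: Optional[int] = None) -> Tuple[int, int]:
    """Use the amounts from the operation domain when present, else the defaults."""
    if first is None:
        first = DEFAULT_FIRST_INPUT_AMOUNT
    if second is None:
        second = DEFAULT_SECOND_INPUT_AMOUNT
    return first, second


print(resolve_input_amounts())        # (64, 16)
print(resolve_input_amounts(32, 8))   # (32, 8)
```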

In a possible implementation, as shown in FIGS. 19-2a and 19-2b, the device may further include a storage module 19-13. The storage module 19-13 is configured to store the first to-be-operated vector and the second to-be-operated vector.

In this implementation, the storage module may include one or more of a memory, a cache, and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store the data to be operated on, including the first to-be-operated vector and the second to-be-operated vector. The register is used to store the scalar data in the data to be operated on.

In a possible implementation, the control module 19-11 can also be used to generate an assembly file according to the loop vector instruction and translate the assembly file into a binary file, where the binary file is a compiled loop vector instruction.

In a possible implementation, the instruction format of the loop vector instruction may be:

opcode dst src0 src1 src0_size src1_size type.cycle

Here, opcode is the operation code of the loop vector instruction, and dst, src0, src1, src0_size, src1_size, and type are the operation domain of the loop vector instruction: dst is the target address, src0 is the first to-be-operated vector address, src1 is the second to-be-operated vector address, src0_size is the first input amount, src1_size is the second input amount, and type is the vector operation type. type.cycle can be the code of the vector operation type, such as mult.cycle, add.cycle, sub.cycle, eq.cycle, ne.cycle, lt.cycle, ge.cycle, gt.cycle, le.cycle, and.cycle, or.cycle, or xor.cycle.

In a possible implementation, the instruction format of the loop vector instruction may also be:

type.cycle dst src0 src1 src0_size src1_size
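To make the second format concrete, the sketch below parses a textual instruction of that shape into named fields. It is purely illustrative: a real compiled instruction would be binary, and the class `LoopVectorInstruction`, the function `parse_loop_vector_instruction`, and the use of decimal integers for addresses are assumptions of this sketch.

```python
from dataclasses import dataclass

# Operation type codes listed in the text.
VECTOR_OP_CODES = {
    "mult.cycle", "add.cycle", "sub.cycle", "and.cycle", "or.cycle", "xor.cycle",
    "eq.cycle", "ne.cycle", "lt.cycle", "ge.cycle", "gt.cycle", "le.cycle",
}


@dataclass(frozen=True)
class LoopVectorInstruction:
    op_type: str     # type.cycle
    dst: int         # target address
    src0: int        # first to-be-operated vector address
    src1: int        # second to-be-operated vector address
    src0_size: int   # first input amount
    src1_size: int   # second input amount


def parse_loop_vector_instruction(text: str) -> LoopVectorInstruction:
    """Parse 'type.cycle dst src0 src1 src0_size src1_size' (commas tolerated)."""
    fields = text.replace(",", " ").split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    op_type = fields[0]
    if op_type not in VECTOR_OP_CODES:
        raise ValueError("unknown vector operation type: %s" % op_type)
    dst, src0, src1, src0_size, src1_size = (int(f) for f in fields[1:])
    return LoopVectorInstruction(op_type, dst, src0, src1, src0_size, src1_size)


print(parse_loop_vector_instruction("add.cycle 500 101 102 64 16"))
```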

In a possible implementation, the instruction format of the loop vector instruction used for the "vector multiplication operation" can be set as: mult.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, multiplies each divided vector by the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "vector addition operation" can be set as: add.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, adds each divided vector to the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "vector summation operation" can be set as: sub.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, performs the summation operation on each divided vector and the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "bitwise AND operation" can be set as: and.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, performs a bitwise AND operation on each divided vector and the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "bitwise OR operation" can be set as: or.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, performs a bitwise OR operation on each divided vector and the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "bitwise XOR operation" can be set as: xor.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size, performs a bitwise XOR operation on each divided vector and the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise equality is satisfied" can be set as: eq.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges whether the corresponding bits of each divided vector and the second to-be-operated vector are equal, stores the specified value 1 where they are equal to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise inequality is satisfied" can be set as: ne.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges whether the corresponding bits of each divided vector and the second to-be-operated vector are equal, stores the specified value 1 where they are not equal to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise less-than is satisfied" can be set as: lt.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges the magnitude relationship between the corresponding bits of each divided vector and the second to-be-operated vector, stores the specified value 1 where the value of the divided vector is less than the value of the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise greater-than-or-equal is satisfied" can be set as: ge.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges the magnitude relationship between the corresponding bits of each divided vector and the second to-be-operated vector, stores the specified value 1 where the value of the divided vector is greater than or equal to the value of the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise greater-than is satisfied" can be set as: gt.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges the magnitude relationship between the corresponding bits of each divided vector and the second to-be-operated vector, stores the specified value 1 where the value of the divided vector is greater than the value of the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

In a possible implementation, the instruction format of the loop vector instruction used for the "operation of storing the specified value 1 when bitwise less-than-or-equal is satisfied" can be set as: le.cycle dst src0 src1 src0_size src1_size. This instruction obtains a first to-be-operated vector of size src0_size from the first to-be-operated vector address src0 and a second to-be-operated vector of size src1_size from the second to-be-operated vector address src1, and divides the first to-be-operated vector into multiple divided vectors each with a data amount equal to src1_size. It then judges the magnitude relationship between the corresponding bits of each divided vector and the second to-be-operated vector, stores the specified value 1 where the value of the divided vector is less than or equal to the value of the second to-be-operated vector to obtain the operation result, and stores the operation result in the target address dst.

It should be understood that those skilled in the art can set the operation code of the loop vector instruction, the position of the operation code and the operation field in the instruction format as needed, and the disclosure does not limit this.

It should be noted that although the loop vector instruction processing apparatus is described above using the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, taking "using a loop vector instruction processing device for vector operations" as an exemplary application scenario, to facilitate understanding of the processing flow of the loop vector instruction processing device. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be considered a limitation on them.

FIG. 19-3 shows a schematic diagram of an application scenario of a loop vector instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 19-3, the loop vector instruction processing device processes a loop vector instruction as follows:

The control module 19-11 compiles the obtained loop vector instruction 1 to obtain the compiled loop vector instruction 1 (for example, the loop vector instruction 1 is opcode 500 101 102 add.cycle 64 16), and parses the compiled loop vector instruction 1 to obtain its operation code and operation domain. The operation code of the loop vector instruction 1 is opcode, the target address is 500, the first to-be-operated vector address is 101, and the second to-be-operated vector address is 102. The vector operation type is add.cycle (vector addition operation). The first input amount is 64, and the second input amount is 16. The control module 19-11 obtains, from the first to-be-operated vector address 101, the first to-be-operated vector whose data amount is the first input amount 64, and obtains, from the second to-be-operated vector address 102, the second to-be-operated vector whose data amount is the second input amount 16.

The operation module 19-12 divides the first to-be-operated vector into 4 divided vectors, shown in FIG. 19-3 as divided vector 1, divided vector 2, divided vector 3, and divided vector 4, each with a data amount of 16. It then adds each divided vector to the second to-be-operated vector to obtain the corresponding division operation results, shown in FIG. 19-3 as division operation result 1, division operation result 2, division operation result 3, and division operation result 4. These four division operation results together form the operation result 1 of the loop vector instruction 1, and the operation result 1 is stored in the target address 500.
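The application example can be traced with a small simulation. In the sketch below, a Python dictionary stands in for the storage module and the vector contents are made-up sample data; the function name `run_add_cycle` and the dictionary-based memory model are assumptions for illustration only.

```python
from typing import Dict, List


def run_add_cycle(memory: Dict[int, List[int]], dst: int, src0: int, src1: int,
                  src0_size: int, src1_size: int) -> List[List[int]]:
    """Simulate add.cycle: return the division operation results and store them at dst."""
    first = memory[src0][:src0_size]     # first to-be-operated vector
    second = memory[src1][:src1_size]    # second to-be-operated vector
    if src1_size <= 0 or src0_size % src1_size != 0:
        raise ValueError("src0_size must be a positive multiple of src1_size")
    results: List[List[int]] = []
    for start in range(0, src0_size, src1_size):
        divided = first[start:start + src1_size]
        results.append([a + b for a, b in zip(divided, second)])
    # The concatenated division results form the operation result of the instruction.
    memory[dst] = [x for part in results for x in part]
    return results


# Sample data (made up): 64 elements at address 101, 16 elements at address 102.
memory = {101: list(range(64)), 102: [100] * 16}
parts = run_add_cycle(memory, dst=500, src0=101, src1=102, src0_size=64, src1_size=16)
print(len(parts))          # 4 division operation results
print(len(memory[500]))    # 64 elements stored at target address 500
```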

The loop vector instruction 1 can be written not only as opcode 500, 101, 102, add.cycle, 64, 16 as above, but also as add.cycle, 500, 101, 102, 64, 16. Loop vector instructions in different instruction formats are processed in a similar way, which is not repeated here.

In this way, the loop vector instruction processing device can efficiently and quickly process the loop vector instruction, and the calculation processing efficiency is high and the processing speed is high.

FIG. 19-4 shows a flowchart of a loop vector instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used during execution of the method and the processor is used to perform the related processing and operation steps, such as the following steps S51-19 and S52-19. As shown in FIG. 19-4, the method is applied to the above loop vector instruction processing device and includes steps S51-19 and S52-19.

In step S51-19, the control module is used to compile the obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction, obtain, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector, and the target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction. The operation code indicates that the operation performed on the data by the loop vector instruction is a loop vector operation, and the operation domain includes the first to-be-operated vector address, the second to-be-operated vector address, and the target address.

In step S52-19, the operation module is used to divide the first to-be-operated vector into a plurality of divided vectors according to the second to-be-operated vector, perform a vector operation on each divided vector and the second to-be-operated vector according to the vector operation type to obtain the operation result, and store the operation result in the target address.

In a possible implementation, performing a vector operation on each divided vector and the second to-be-operated vector according to the vector operation type may include: using multiple vector operators in the operation module to perform the vector operation corresponding to the vector operation type.

In a possible implementation, the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple vector operators. Step S52-19 may include: using the multiple vector operators in the master operation sub-module to perform the vector operation corresponding to the vector operation type to obtain the operation result, and storing the operation result in the target address.

In a possible implementation manner, the operation module includes a master operation sub-module and multiple slave operation sub-modules, and each slave operation sub-module includes multiple vector operators. Step S52-19 may include: using the multiple vector operators included in each slave operation sub-module to execute the corresponding vector operations in parallel to obtain operation results, storing the operation results in the corresponding sub-cache space, and sending the operation results to the master operation sub-module; and using the master operation sub-module to receive the operation results and store them in the target address.
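The master/slave arrangement described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: worker threads stand in for the slave operation sub-modules, a Python list stands in for the memory holding the target address, and the names `slave_compute` and `master_collect` are hypothetical and do not appear in the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def slave_compute(chunk: Sequence[int], second: Sequence[int],
                  op: Callable[[int, int], int]) -> List[int]:
    """Model of one slave sub-module: apply op element-wise to one divided vector."""
    return [op(a, b) for a, b in zip(chunk, second)]


def master_collect(chunks: Sequence[Sequence[int]], second: Sequence[int],
                   op: Callable[[int, int], int], memory: List[int],
                   target: int, workers: int = 4) -> List[int]:
    """Model of the master sub-module: gather slave results, store them at target."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input chunks.
        partials = list(pool.map(lambda c: slave_compute(c, second, op), chunks))
    flat = [value for part in partials for value in part]
    if target < 0 or target + len(flat) > len(memory):
        raise ValueError("result does not fit at the target address")
    memory[target:target + len(flat)] = flat
    return flat


memory = [0] * 8
result = master_collect([[1, 2], [3, 4]], [10, 20], lambda a, b: a * b, memory, target=2)
```

In this sketch the slaves only compute, and the master alone writes to the target address, mirroring the division of roles in the text.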

In a possible implementation, the operation domain may also include a vector operation type. Wherein, determining the vector operation type of the loop vector instruction may include: when the vector operation type is included in the operation domain, determining the vector operation type according to the operation domain.

In a possible implementation manner, the operation domain may further include a first input amount and a second input amount. Obtaining the first to-be-computed vector, the second to-be-computed vector and the target address required to execute the loop vector instruction according to the operation code and the operation domain may further include: determining the first input amount and the second input amount according to the operation domain, obtaining a first to-be-calculated vector whose data amount is the first input amount from the first to-be-calculated vector address, and obtaining a second to-be-calculated vector whose data amount is the second input amount from the second to-be-calculated vector address. Dividing the first to-be-computed vector into multiple divided vectors according to the second to-be-computed vector may include: determining the divided data amount of each divided vector according to the second input amount, and dividing the first to-be-operated vector into multiple divided vectors according to the divided data amount.
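As a minimal sketch of the division step, the following assumes that the divided data amount equals the second input amount and that the first input amount is an integer multiple of it; the disclosure does not state how a remainder would be handled. The function names `split_vector` and `loop_vector_op` are hypothetical.

```python
from typing import Callable, List, Sequence


def split_vector(first: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Divide the first vector into divided vectors of chunk_size elements each."""
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    if len(first) % chunk_size != 0:
        # Assumption of this sketch: no partial divided vector is allowed.
        raise ValueError("first input amount must be a multiple of the second")
    return [list(first[i:i + chunk_size]) for i in range(0, len(first), chunk_size)]


def loop_vector_op(first: Sequence[int], second: Sequence[int],
                   op: Callable[[int, int], int]) -> List[int]:
    """Apply op between every divided vector and the second vector, in order."""
    result: List[int] = []
    for chunk in split_vector(first, len(second)):
        result.extend(op(a, b) for a, b in zip(chunk, second))
    return result


# A first vector of 6 elements and a second vector of 3 elements
# yield two divided vectors, each added to the second vector.
result = loop_vector_op([1, 2, 3, 4, 5, 6], [10, 20, 30], lambda a, b: a + b)
```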

In a possible implementation, the operation code is also used to indicate the vector operation type. Determining the vector operation type of the loop vector instruction may include: when the operation code indicates the vector operation type, determining the vector operation type according to the operation code.

In a possible implementation manner, the vector operation type may include at least one of the following: a vector multiplication operation, a vector addition operation, a vector sum operation, an operation of storing a specified value when an operation condition is met, a bitwise AND operation, a bitwise OR operation, and a bitwise XOR operation. The operation condition may include any of the following: bitwise equal, bitwise unequal, bitwise less than, bitwise greater than or equal, bitwise greater than, and bitwise less than or equal.
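One way to picture the operation types and conditions is as lookup tables, sketched below. The short names such as "mul" or "ge" are hypothetical, only a subset of the listed types is covered, the conditions are interpreted here as element-wise comparisons, and the value stored when a condition is not met (`otherwise`) is an assumption, since the text does not specify it.

```python
import operator
from typing import List, Sequence

# Hypothetical short names for a subset of the vector operation types.
VECTOR_OPS = {
    "mul": operator.mul,
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}

# Hypothetical short names for the six operation conditions.
CONDITIONS = {
    "eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
    "ge": operator.ge, "gt": operator.gt, "le": operator.le,
}


def apply_op(name: str, a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Apply the named operation element-wise to two equal-length vectors."""
    fn = VECTOR_OPS[name]
    return [fn(x, y) for x, y in zip(a, b)]


def store_if(cond: str, a: Sequence[int], b: Sequence[int],
             value: int, otherwise: int = 0) -> List[int]:
    """Store value where the condition holds; otherwise store the fallback."""
    test = CONDITIONS[cond]
    return [value if test(x, y) else otherwise for x, y in zip(a, b)]


xor_result = apply_op("xor", [5, 3], [1, 3])
stored = store_if("gt", [5, 1], [2, 2], value=9)
```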

In a possible implementation manner, the method may further include: using the storage module of the device to store the first to-be-computed vector and the second to-be-computed vector. The storage module includes at least one of a register and a cache. The cache is used to store the data to be calculated, the first to-be-computed vector and the second to-be-computed vector, and includes at least one neuron cache NRAM. The register is used to store the scalar data in the data to be calculated. The neuron cache is used to store the neuron data in the data to be calculated, and the neuron data includes neuron vector data.

In a possible implementation, parsing the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction may include: storing the compiled loop vector instruction; parsing the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction; and storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed may include the compiled loop vector instruction.

In a possible implementation manner, compiling the obtained loop vector instruction to obtain the compiled loop vector instruction may include: generating an assembly file according to the loop vector instruction, and translating the assembly file into a binary file. Here, the binary file is the compiled loop vector instruction.

It should be noted that, although the above embodiment is taken as an example to introduce the loop vector instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The loop vector instruction processing method provided by the embodiments of the present disclosure has a wide application range, processes vectors with high efficiency and speed, and performs operations with high efficiency and speed.

The foregoing can be better understood based on the following clauses:

Clause S1, a loop vector instruction processing device, the device comprising:

A control module, used to compile the obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction, obtain, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector and the target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction;

An operation module, configured to divide the first to-be-operated vector into a plurality of divided vectors according to the second to-be-operated vector, perform a vector operation on each divided vector and the second to-be-operated vector separately according to the vector operation type to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed on the data by the loop vector instruction is a loop vector operation, and the operation domain includes a first to-be-computed vector address, a second to-be-computed vector address, and the target address.

Clause S2. The device according to Clause S1, the operation module includes:

Clause S3. The device according to Clause S2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of vector operators,

Clause S4. The device according to Clause S2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of vector operators,

The slave operation sub-module is used to execute the corresponding vector operations in parallel by using the plurality of vector operators it includes to obtain operation results, store the operation results in the corresponding sub-cache space, and send the operation results to the master operation sub-module;

Clause S5. The device according to Clause S1, the operation domain further includes a vector operation type,

Clause S6. The device according to Clause S1, the operation domain further includes a first input amount and a second input amount,

The control module is further configured to determine the first input amount and the second input amount according to the operation domain, obtain the first to-be-calculated vector whose data amount is the first input amount from the first to-be-calculated vector address, and obtain the second to-be-calculated vector whose data amount is the second input amount from the second to-be-calculated vector address,

Wherein, dividing the first to-be-calculated vector into a plurality of divided vectors according to the second to-be-calculated vector includes:

The divided data amount of each divided vector is determined according to the second input amount, and the first to-be-operated vector is divided into multiple divided vectors according to the divided data amount.

Clause S7. The device according to Clause S1, the operation code is further used to indicate the vector operation type,

Clause S8. The apparatus according to Clause S1, the vector operation type includes at least one of the following:

Vector multiplication operation, vector addition operation, vector sum operation, operation of storing a specified value when an operation condition is met, bitwise AND operation, bitwise OR operation, bitwise XOR operation

Clause S9. The device according to Clause S1, the device further comprising:

A storage module, configured to store the first to-be-calculated vector and the second to-be-calculated vector,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated, the first vector to be calculated, and the second vector to be calculated, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause S10. The device according to Clause S1, the control module includes:

An instruction storage submodule, used to store the compiled loop vector instruction;

An instruction processing sub-module, which is used to analyze the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled loop vector instruction.

Clause S11. The device according to Clause S10, the control module, further comprising:

Clause S12, the device according to Clause S1,

The control module is also used to generate an assembly file according to the loop vector instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled loop vector instruction.

Clause S13. A machine learning computing device, the device comprising:

One or more loop vector instruction processing devices as described in any one of Clause S1 to Clause S12, used to obtain the vectors to be operated on and control information from other processing devices, perform the specified machine learning operations, and pass the execution results to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the loop vector instruction processing devices, the plurality of loop vector instruction processing devices may be connected and transmit data through a specific structure;

Wherein the plurality of loop vector instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of loop vector instruction processing devices share the same control system or have their own control systems; the plurality of loop vector instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of loop vector instruction processing devices is an arbitrary interconnection topology.

Clause S14. A combined processing device, the combined processing device comprising:

A machine learning computing device as described in Clause S13, a general interconnection interface, and other processing devices;

Clause S15. A machine learning chip, the machine learning chip includes:

The machine learning computing device according to Clause S13 or the combined processing device according to Clause S14.

Clause S16. An electronic device, the electronic device comprising:

Machine learning chip as described in clause S15.

Clause S17, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause S15;

The storage device is used for storing data;

Clause S18. A loop vector instruction processing method, the method being applied to a loop vector instruction processing device that includes a control module and an operation module, the method including:

The control module is used to compile the obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction, obtain, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector and the target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction;

The operation module is used to divide the first to-be-operated vector into a plurality of divided vectors according to the second to-be-operated vector, perform a vector operation on each divided vector and the second to-be-operated vector separately according to the vector operation type to obtain the operation result, and store the operation result in the target address,

Clause S19. The method according to Clause S18, wherein performing vector operations on each divided vector and the second to-be-calculated vector according to the vector operation type includes:

Clause S20. The method according to Clause S19, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of vector operators,

Wherein dividing the first to-be-operated vector into a plurality of divided vectors according to the second to-be-operated vector, performing a vector operation on each divided vector and the second to-be-operated vector separately according to the vector operation type to obtain the operation result, and storing the operation result in the target address includes:

Clause S21. The method according to Clause S19, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of vector operators,

Clause S22. The method according to Clause S18, the operation domain further includes a vector operation type,

Wherein, determining the vector operation type of the loop vector instruction includes:

Clause S23. The method according to Clause S18, the operation domain further includes a first input amount and a second input amount,

Wherein, obtaining the first to-be-computed vector, the second to-be-computed vector and the target address required to execute the loop vector instruction according to the operation code and the operation domain, further includes:

Determining the first input amount and the second input amount according to the operation domain, obtaining the first to-be-calculated vector whose data amount is the first input amount from the first to-be-calculated vector address, and obtaining the second to-be-calculated vector whose data amount is the second input amount from the second to-be-calculated vector address,

Clause S24. The method according to Clause S18, the operation code is also used to indicate the type of vector operation,

Clause S25. The method according to Clause S18, the vector operation type includes at least one of the following:

Clause S26. The method according to Clause S18, the method further comprising:

Using the storage module of the device to store the first to-be-computed vector and the second to-be-computed vector,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause S27. The method according to Clause S18, wherein parsing the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction includes:

Storing the compiled loop vector instruction;

Parsing the compiled loop vector instruction to obtain the operation code and operation domain of the loop vector instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed include the compiled loop vector instruction.

Clause S28. The method according to Clause S27, the method further comprising:

Clause S29. The method according to Clause S18, wherein compiling the obtained loop vector instruction to obtain the compiled loop vector instruction includes:

Generating an assembly file according to the loop vector instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled loop vector instruction.

Clause S30. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause S18 to Clause S29.

With the widespread use of neural network algorithms and the continuous improvement of the computing power of computer hardware, the types and number of data operations involved in practical applications keep increasing. Programming languages are diverse, and in related technologies there is currently no vector data migration instruction that can be widely applied across programming languages. To migrate vector data in a given language environment, technicians therefore need to customize one or more instructions for that environment, which makes vector data migration inefficient and slow. The present disclosure provides a vector data migration instruction processing method, device, computer equipment, and storage medium. Vector data migration can be achieved with only one instruction, which can significantly improve the efficiency and speed of vector data migration.

FIG. 20-1 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 20-1, the device includes a control module 20-11 and a processing module 20-12.

The control module 20-11 is used to compile the acquired vector data migration instruction to obtain the compiled vector data migration instruction, parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction, obtain, according to the operation code and the operation domain, the vector data to be migrated and the target address required to execute the vector data migration instruction, and determine the migration parameters required for the migration processing. The operation code indicates that the processing performed on the vector data by the vector data migration instruction is migration processing. The operation domain includes the address of the vector data to be migrated and the target address. The migration parameters may include the initial storage space where the address of the vector data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

The processing module 20-12 stores the vector data to be migrated into the target address according to the migration parameters.

In this embodiment, there may be one or more vector data to be migrated. The migration type may indicate the vector data storage speed of the initial storage space, the vector data storage speed of the target storage space, and the speed relationship between the storage speeds of the two. In the vector data migration instruction, different codes can be set for the storage speed relationship between different target storage spaces and the initial storage space to distinguish the storage speed. For example, the code whose migration type is "the storage speed of the initial storage space is greater than the storage speed of the target storage space" can be set to "st". The code whose migration type is "the storage speed of the initial storage space is equal to the storage speed of the target storage space" can be set to "mv". The code whose migration type is "the storage speed of the initial storage space is less than the storage speed of the target storage space" can be set to "ld". A person skilled in the art may set the migration type and the code of the migration type according to actual needs, which is not limited in the present disclosure.
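The mapping from storage speed relationship to migration type code can be sketched as below. The codes "st", "mv" and "ld" come from the example above; the function name `migration_type_code` and the numeric speed values are hypothetical placeholders, not figures from the disclosure.

```python
def migration_type_code(initial_speed: float, target_speed: float) -> str:
    """Return the example migration type code for a pair of storage speeds."""
    if initial_speed > target_speed:
        return "st"  # initial storage space is faster than the target
    if initial_speed < target_speed:
        return "ld"  # initial storage space is slower than the target
    return "mv"      # both storage spaces have the same storage speed


# Hypothetical relative speeds, used only to exercise the three cases.
slow_space, fast_space = 1.0, 8.0
codes = (
    migration_type_code(fast_space, slow_space),
    migration_type_code(fast_space, fast_space),
    migration_type_code(slow_space, fast_space),
)
```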

In this embodiment, the initial storage space and the target storage space may be NRAM, WRAM, DRAM, registers, etc. of the device for storing data, and DRAM may include LDRAM, GDRAM, etc. NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (CNT). WRAM (Window RAM) is a type of VRAM (Video RAM, video random access memory). DRAM (Dynamic Random Access Memory) is dynamic random access memory. LDRAM is Local DRAM, which can be a DRAM dedicated to one computing core in the device. GDRAM is Global DRAM, which can be a DRAM shared by multiple computing cores in the device. A computing core is a unit, module, etc. that performs data operations in the device, such as the processing module.

In this embodiment, the vector data migration instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module needs to first compile the vector data migration instruction (uncompiled). After the compiled vector data migration instruction is obtained, the compiled vector data migration instruction can be parsed. The compiled vector data migration instructions are hardware instructions that can be directly executed by the hardware. The control module may obtain vector data to be migrated from the vector data address to be migrated. The control module can obtain instructions and data through the data input and output unit, which can be one or more data I / O interfaces or I / O pins.

In this embodiment, the operation code may be the part of an instruction or the field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number that tells the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the target address, the address of the vector data to be migrated, the initial storage space where the address of the vector data to be migrated is located, the target storage space where the target address is located, the migration parameters for the migration processing, and so on. A vector data migration instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the vector data to be migrated and the target address.

It should be understood that a person skilled in the art may set the format of the vector data migration instruction and the operation code and operation domain it includes as required, which is not limited in the present disclosure.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive a vector data migration instruction and control one or more processing modules to perform vector data migration. When the device includes multiple control modules, the multiple control modules may respectively receive vector data migration instructions and control the corresponding one or more processing modules to perform vector data migration.

The vector data migration instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to compile the acquired vector data migration instruction to obtain the compiled vector data migration instruction, parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction, obtain, according to the operation code and the operation domain, the vector data to be migrated and the target address required to execute the vector data migration instruction, and determine the migration parameters required for the migration processing. The processing module is used to store the vector data to be migrated into the target address according to the migration parameters. The device has a wide application range, processes vector data migration instructions efficiently and quickly, and migrates vector data efficiently and quickly.

FIG. 20-2 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 20-2, the processing module 20-12 may include a master processing sub-module 20-121 and a plurality of slave processing sub-modules 20-122.

The master processing sub-module 20-121 is used to process the vector data to be migrated, obtain the processed vector data to be migrated, and store the processed vector data to be migrated in the target address. The processing performed on the vector data to be migrated may include data type conversion and other processing; alternatively, the vector data to be migrated may be stored directly without processing, which is not limited in the present disclosure.

In a possible implementation, the operation domain may also include a vector data migration amount. The control module 20-11 is also used to determine the vector data migration amount according to the operation domain, and obtain the vector data to be migrated corresponding to the vector data migration amount from the address of the vector data to be migrated.

In this implementation manner, the vector data migration amount may be the acquired data amount of the vector data to be migrated.

In a possible implementation manner, the default vector data migration amount may be preset. When the vector data migration amount is not included in the operation domain, the default vector data migration amount may be determined as the vector data migration amount of the current vector data migration instruction. Then, the vector data to be migrated corresponding to the migration amount of the vector data is acquired from the vector data address to be migrated.

In a possible implementation manner, when the vector data migration amount is not included in the operation domain, all vector data to be migrated stored therein may be directly obtained from the vector data address to be migrated.
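The three ways of settling the migration amount described above (explicit amount in the operation domain, preset default, or all stored data) can be combined in one small sketch. The function name `fetch_to_migrate` and its parameters are hypothetical, and a Python list stands in for the data stored at the address of the vector data to be migrated.

```python
from typing import List, Optional, Sequence


def fetch_to_migrate(stored: Sequence[int], size: Optional[int] = None,
                     default_size: Optional[int] = None) -> List[int]:
    """Select the vector data to migrate from the data stored at the source address."""
    if size is None:
        # The operation domain carries no migration amount: fall back to the default.
        size = default_size
    if size is None:
        # No default is preset either: take all vector data stored at the address.
        return list(stored)
    if size < 0 or size > len(stored):
        raise ValueError("migration amount exceeds the stored data")
    return list(stored[:size])


stored = [7, 8, 9, 10]
explicit = fetch_to_migrate(stored, size=2)
defaulted = fetch_to_migrate(stored, default_size=3)
everything = fetch_to_migrate(stored)
```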

In a possible implementation, default migration parameters can also be set. When the migration parameter of the current vector data migration instruction cannot be determined according to the operation domain and the operation code, the default migration parameter may be determined as the migration parameter of the current vector data migration instruction.

In a possible implementation, the initial storage space corresponding to the address of the vector data to be migrated and the target storage space corresponding to the target address may be determined respectively, and the migration parameters may then be determined according to parameters such as the storage speed and storage space type of the initial storage space and the target storage space.

In a possible implementation manner, as shown in FIG. 20-2, the device may further include a storage module 20-13. The storage module 20-13 is used to store the vector data to be migrated.

In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed temporary storage cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store data to be calculated and vector data to be migrated. The register is used to store the scalar data in the data to be calculated.

In a possible implementation manner, the control module 20-11 may also be used to generate an assembly file according to the vector data migration instruction and translate the assembly file into a binary file, where the binary file is a compiled vector data migration instruction.

In a possible implementation, the instruction format of the vector data migration instruction may be:

vector dst src type.space1.space2 size

Here, vector is the operation code of the vector data migration instruction, and dst, src, type.space1.space2 and size are the operation domain of the vector data migration instruction. dst is the target address and src is the address of the vector data to be migrated. When there are multiple vector data to be migrated, src may include multiple addresses of vector data to be migrated, src0, src1, ..., srcn, which is not limited in the present disclosure. type.space1.space2 is the migration parameter: type indicates the migration type, space1 indicates the initial storage space where the address src of the vector data to be migrated is located, and space2 indicates the target storage space where the target address dst is located. size is the vector data migration amount.
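A minimal parsing sketch for the textual format "vector dst src type.space1.space2 size" follows. It assumes whitespace-separated tokens, a single source address, and numeric addresses written in decimal or with a 0x prefix; the class name `MigrationInstruction`, the function name `parse_operand_form` and the space names in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigrationInstruction:
    opcode: str
    dst: int
    src: int
    migration_type: str
    initial_space: str
    target_space: str
    size: int


def parse_operand_form(text: str) -> MigrationInstruction:
    """Parse 'vector dst src type.space1.space2 size' into its components."""
    opcode, dst, src, params, size = text.split()
    migration_type, space1, space2 = params.split(".")
    # int(x, 0) accepts both decimal and 0x-prefixed literals.
    return MigrationInstruction(opcode, int(dst, 0), int(src, 0),
                                migration_type, space1, space2, int(size, 0))


ins = parse_operand_form("vector 0x200 0x100 ld.gdram.nram 128")
```

Here the migration parameter travels in the operation domain, so the operation code stays the fixed word "vector".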

In a possible implementation manner, the instruction format of the vector data migration instruction may also be:

type.space1.space2 dst src size

Here, type.space1.space2 is the operation code of the vector data migration instruction, and dst, src and size are the operation domain of the vector data migration instruction. dst is the target address and src is the address of the vector data to be migrated. When there are multiple vector data to be migrated, src may include multiple addresses of vector data to be migrated, src0, src1, ..., srcn, which is not limited in the present disclosure. size is the vector data migration amount. In the operation code type.space1.space2, type indicates the migration type, space1 indicates the initial storage space where the address src of the vector data to be migrated is located, and space2 indicates the target storage space where the target address dst is located.
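For the second format, where the migration parameter serves as the operation code, a sketch that also accepts several source addresses might look as follows. The token layout (dst first, size last, every token in between a source address, commas optional) is an assumption of this sketch, and the names `OpcodeFormInstruction` and `parse_opcode_form` are hypothetical.

```python
from typing import List, NamedTuple


class OpcodeFormInstruction(NamedTuple):
    migration_type: str
    initial_space: str
    target_space: str
    dst: int
    srcs: List[int]
    size: int


def parse_opcode_form(text: str) -> OpcodeFormInstruction:
    """Parse 'type.space1.space2 dst src0 [src1 ...] size'."""
    tokens = text.replace(",", " ").split()
    if len(tokens) < 4:
        raise ValueError("expected an opcode, dst, at least one src, and size")
    migration_type, space1, space2 = tokens[0].split(".")
    dst = int(tokens[1], 0)
    srcs = [int(token, 0) for token in tokens[2:-1]]  # src0 ... srcn
    size = int(tokens[-1], 0)
    return OpcodeFormInstruction(migration_type, space1, space2, dst, srcs, size)


ins = parse_opcode_form("ld.gdram.nram, 0x40, 0x10, 0x20, 16")
```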

In a possible implementation, the instruction format of the vector data migration instruction whose migration type is "the storage speed of the initial storage space is less than the storage speed of the target storage space" can be set as: ld.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2 and the migration type ld, vector data to be migrated whose data amount is the vector data migration amount size is obtained from the address src0 of the vector data to be migrated in the initial storage space space1, and the vector data to be migrated is stored into the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is less than the storage speed of the target storage space space2.

In a possible implementation, the instruction format of the vector data migration instruction whose migration type is "the storage speed of the initial storage space is greater than the storage speed of the target storage space" can be set as: st.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2 and the migration type st, vector data to be migrated whose data amount is the vector data migration amount size is obtained from the address src0 of the vector data to be migrated in the initial storage space space1, and the vector data to be migrated is stored into the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is greater than the storage speed of the target storage space space2.

In a possible implementation, the instruction format of the vector data migration instruction whose migration type is "the storage speed of the initial storage space is equal to the storage speed of the target storage space" can be set as: mv.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2, and the migration type mv, vector data to be migrated whose data amount is size is obtained from the vector data address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2. The storage speed of the initial storage space space1 is equal to that of the target storage space space2.
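The instruction format described above can be illustrated with a minimal parsing sketch. It is not the disclosed implementation: the textual form with comma-separated operands, the class name MigrationInstruction, and the function name parse_migration are assumptions made only for illustration, and addresses and the migration amount are assumed to be written as integers.

```python
from dataclasses import dataclass

# Migration types named in the text; each describes the storage speed of the
# initial storage space relative to the target storage space.
MIGRATION_TYPES = {
    "ld": "initial slower than target",
    "st": "initial faster than target",
    "mv": "initial as fast as target",
}


@dataclass(frozen=True)
class MigrationInstruction:  # hypothetical container, not from the disclosure
    mtype: str    # ld, st or mv
    space1: str   # initial storage space
    space2: str   # target storage space
    dst: int      # target address
    srcs: tuple   # one or more source addresses (src0, src1, ...)
    size: int     # vector data migration amount


def parse_migration(text: str) -> MigrationInstruction:
    """Split 'type.space1.space2, dst, src0[, src1, ...], size' into fields."""
    opcode, *operands = [part.strip() for part in text.split(",")]
    mtype, space1, space2 = opcode.split(".")
    if mtype not in MIGRATION_TYPES:
        raise ValueError(f"unknown migration type: {mtype}")
    if len(operands) < 3:
        raise ValueError("expected dst, at least one src, and size")
    # First operand is dst, last is size, everything between is a source address.
    dst, *srcs, size = (int(op) for op in operands)
    return MigrationInstruction(mtype, space1, space2, dst, tuple(srcs), size)
```

Because the position of the operation code and operation domains may be set as needed, a real decoder would follow whatever layout the instruction set actually fixes; the ordering used here is only the one written out in the text.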

It should be understood that those skilled in the art can set the operation code of the vector data migration instruction, the position of the operation code and the operation field in the instruction format according to need, and the disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the vector data migration instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure in conjunction with "data migration using a vector data migration instruction processing device" as an exemplary application scenario, so as to facilitate understanding of the flow of the vector data migration instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating the understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIG. 20-3 shows a schematic diagram of an application scenario of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 20-3, the vector data migration instruction processing device processes the vector data migration instruction as follows:

The control module 20-11 compiles the acquired vector data migration instruction 1 to obtain the compiled vector data migration instruction 1, and parses the compiled vector data migration instruction 1 (for example, vector data migration instruction 1 is ld.200.300, 500, 400, 5) to obtain the operation code and operation domain of the vector data migration instruction 1. The operation code of the vector data migration instruction 1 is ld, the initial storage space is 200, the target storage space is 300, the target address is 500, the address of the vector data to be migrated is 400, and the vector data migration amount is 5. According to the operation code ld, it can be determined that the storage speed of the initial storage space 200 is less than the storage speed of the target storage space 300. The control module 20-11 obtains vector data to be migrated with a data volume of 5 from the address 400 of the vector data to be migrated in the initial storage space 200. The processing module 20-12 stores the vector data to be migrated into the target address 500 in the target storage space 300 according to the migration parameters.

The vector data migration instruction 1 is not limited to the above form ld.200.300, 500, 400, 5; it may also take other forms in which the operation code and the operation domains are arranged differently (for example, with 500, 400 preceding ld.200.300). The processing procedures are similar and will not be repeated here.
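The application example can be traced with a small simulation. This is a sketch under stated assumptions, not the device's behavior: each storage space is modeled as a Python dict from address to element, addresses are assumed to index individual elements, the migration amount 5 is assumed to count elements, and the function name migrate and the sample contents of space 200 are invented for illustration.

```python
# Hypothetical model: storage space id -> {address: element}.
spaces = {
    200: {400 + i: float(i) for i in range(8)},  # initial storage space (slower)
    300: {},                                     # target storage space (faster)
}


def migrate(spaces, space1, space2, dst, src, size):
    """Copy `size` elements starting at src in space1 to dst in space2."""
    # Control-module step: fetch the vector data to be migrated.
    data = [spaces[space1][src + i] for i in range(size)]
    # Processing-module step: store the data at the target address.
    for i, value in enumerate(data):
        spaces[space2][dst + i] = value
    return data


# Instruction 1 read as: ld, spaces 200 -> 300, dst 500, src 400, amount 5.
moved = migrate(spaces, space1=200, space2=300, dst=500, src=400, size=5)
```

After the call, addresses 500 to 504 of space 300 hold the five elements read from addresses 400 to 404 of space 200, and the source data is left in place.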

In this way, the vector data migration instruction processing device can efficiently and quickly process the vector data migration instruction, and the processing efficiency of vector data migration is high and the processing speed is fast.

FIG. 20-4 shows a flowchart of a vector data migration instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-20 and S52-20. As shown in FIG. 20-4, the method is applied to the above vector data migration instruction processing device. The method includes steps S51-20 and S52-20.

In step S51-20, the control module is used to compile the acquired vector data migration instruction to obtain a compiled vector data migration instruction, parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction, obtain the vector data to be migrated and the target address required to execute the vector data migration instruction according to the operation code and the operation domain, and determine the migration parameters required for the migration process. The operation code is used to indicate that the processing performed by the vector data migration instruction on the vector data is migration processing. The operation domain includes the address of the vector data to be migrated and the target address. The migration parameters include the initial storage space where the address of the vector data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

In step S52-20, the vector data to be migrated is stored in the target address according to the migration parameters.

In a possible implementation manner, the processing module may include a master processing sub-module and multiple slave processing sub-modules. Wherein, step S52-20 may include:

The vector data to be migrated is processed to obtain the processed vector data to be migrated, and the processed vector data to be migrated is stored in the target address.

In a possible implementation, the operation domain may also include a vector data migration amount. Wherein, acquiring the vector data to be migrated and the target address required to execute the vector data migration instruction according to the operation code and the operation domain may include:

The vector data migration amount is determined according to the operation domain, and vector data to be migrated corresponding to the vector data migration amount is obtained from the address of the vector data to be migrated.

In a possible implementation manner, the method further includes: using the storage module of the device to store the vector data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated and the vector data to be migrated, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated.

In a possible implementation, step S51-20 may include:

Store compiled vector data migration instructions;

Analyze the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include compiled vector data migration instructions.
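The three sub-steps of step S51-20 listed above (storing the compiled instruction, parsing it, and keeping an instruction queue in execution order) can be sketched as follows. This is only an illustration: the class name ControlModule and its method names are hypothetical, and the compiled instruction is represented as text for readability, whereas the disclosure describes it as a binary file.

```python
from collections import deque


class ControlModule:  # hypothetical name, for illustration only
    def __init__(self):
        self.instruction_store = []  # stores compiled instructions
        self.queue = deque()         # instructions to execute, in execution order

    def accept(self, compiled: str):
        """Store a compiled instruction, parse it, and enqueue it."""
        self.instruction_store.append(compiled)
        # Parse into operation code and operation domain.
        opcode, *domain = [part.strip() for part in compiled.split(",")]
        self.queue.append((opcode, domain))

    def next_instruction(self):
        """Return the next (opcode, domain) pair, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

A deque is used here because instructions are appended at one end and consumed from the other in execution order; any first-in, first-out structure would serve the same purpose.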

In a possible implementation manner, the method may further include:

In a possible implementation manner, compiling the acquired vector data migration instruction to obtain a compiled vector data migration instruction may include:

Generate assembly files according to vector data migration instructions, and translate the assembly files into binary files,

Among them, the binary file is a compiled vector data migration instruction.
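The two-stage compilation described above (instruction to assembly file, assembly file to binary file) can be sketched as below. The disclosure does not specify any encoding, so everything concrete here is an assumption: the numeric type codes, the 17-byte little-endian layout, and the function names to_assembly and assemble are invented purely to show the shape of the process, and storage spaces are assumed to be identified by small integers.

```python
import struct

TYPE_CODES = {"ld": 0, "st": 1, "mv": 2}  # hypothetical numeric codes
LAYOUT = "<BHHIII"  # hypothetical: 1-byte type, two uint16 spaces, three uint32 operands


def to_assembly(mtype, space1, space2, dst, src, size) -> str:
    """Stage 1: emit one line of assembly text for the instruction."""
    return f"{mtype}.{space1}.{space2}, {dst}, {src}, {size}"


def assemble(line: str) -> bytes:
    """Stage 2: translate one assembly line into its binary form."""
    opcode, dst, src, size = [part.strip() for part in line.split(",")]
    mtype, space1, space2 = opcode.split(".")
    return struct.pack(LAYOUT, TYPE_CODES[mtype], int(space1), int(space2),
                       int(dst), int(src), int(size))


binary = assemble(to_assembly("ld", 200, 300, 500, 400, 5))
```

Unpacking the result with the same layout string recovers the original fields, which is a convenient round-trip check for any encoding of this kind.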

It should be noted that although the above embodiment is taken as an example to introduce the processing method of the vector data migration instruction as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The method for processing vector data migration instructions provided by the embodiments of the present disclosure has a wide application range, processes vector data migration instructions efficiently and quickly, and thereby performs vector data migration efficiently and quickly.

The foregoing can be better understood based on the following clauses:

Clause T1, a vector data migration instruction processing device, the device comprising:

The control module is used to compile the acquired vector data migration instruction to obtain the compiled vector data migration instruction, parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction, acquire, according to the operation code and the operation domain, the vector data to be migrated and the target address required to execute the vector data migration instruction, and determine the migration parameters required for the migration process;

The processing module stores the vector data to be migrated into the target address according to the migration parameter,

Wherein, the operation code is used to indicate that the processing performed by the vector data migration instruction on the vector data is migration processing, the operation domain includes the address of the vector data to be migrated and the target address, and the migration parameters include the initial storage space where the address of the vector data to be migrated is located, the target storage space where the target address is located, and the migration type of the migration processing.

Clause T2. The apparatus according to Clause T1, the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,

The master processing sub-module is configured to process the vector data to be migrated to obtain processed vector data to be migrated, and store the processed vector data to be migrated in the target address.

Clause T3. The device according to Clause T1, the operation domain further includes a vector data migration amount,

Wherein, the control module is further configured to determine the vector data migration amount according to the operation domain, and obtain the vector data to be migrated corresponding to the vector data migration amount from the vector data address to be migrated.

Clause T4. The device according to Clause T1, the operation domain further includes migration parameters,

Clause T5. The device according to Clause T1, the operation code is also used to indicate a migration parameter,

Clause T6. The device according to Clause T1, the device further comprising:

A storage module, used to store the vector data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The cache is used to store data to be calculated and the vector data to be migrated, and the cache includes at least one neuron cache NRAM;

The register is used to store scalar data in the data to be calculated;

Clause T7. The device according to Clause T1, the control module includes:

Instruction storage submodule, used to store the compiled vector data migration instruction;

An instruction processing sub-module, which is used to analyze the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction;

The queue storage sub-module is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled vector data migration instructions.

Clause T8. The device according to Clause T7, the control module, further comprising:

Clause T9, the device according to Clause T1,

The control module is also used to generate an assembly file according to the vector data migration instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled vector data migration instruction.

Clause T10. A machine learning computing device, the device comprising:

One or more vector data migration instruction processing devices as described in any one of Clauses T1 to T9, used to obtain the vector data to be migrated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of vector data migration instruction processing devices, the plurality of vector data migration instruction processing devices can be connected and transmit data through a specific structure;

Wherein, the plurality of vector data migration instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of vector data migration instruction processing devices share the same control system or have their own control systems; the plurality of vector data migration instruction processing devices share memory or have their own memories; and the interconnection mode of the plurality of vector data migration instruction processing devices is an arbitrary interconnection topology.

Clause T11. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause T10;

Clause T12. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause T10 or the combined processing device according to clause T11.

Clause T13. An electronic device, the electronic device comprising:

Machine learning chip as described in clause T12.

Clause T14, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause T12;

The storage device is used for storing data;

Clause T15. A method for processing vector data migration instructions. The method is applied to a vector data migration instruction processing device. The device includes a control module and a processing module. The method includes:

Using the control module to compile the acquired vector data migration instruction to obtain the compiled vector data migration instruction, parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction, acquire, according to the operation code and the operation domain, the vector data to be migrated and the target address required to execute the vector data migration instruction, and determine the migration parameters required for the migration process;

Clause T16. The method according to Clause T15, the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,

Wherein, storing the vector data to be migrated into the target address according to the migration parameter includes:

Processing the vector data to be migrated to obtain processed vector data to be migrated, and storing the processed vector data to be migrated in the target address.

Clause T17. The method according to Clause T15, the operation domain further includes a vector data migration amount,

Wherein, obtaining the vector data to be migrated and the target address required to execute the vector data migration instruction according to the operation code and the operation domain includes:

Clause T18, the method according to Clause T15, the operation domain further includes migration parameters,

Clause T19, the method according to Clause T15, the operation code is also used to indicate the migration parameter,

Clause T20. The method according to Clause T15, the method further comprising:

Using the storage module of the device to store the vector data to be migrated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause T21. The method according to Clause T15, wherein parsing the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction includes:

Storing the compiled vector data migration instruction;

Parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled vector data migration instructions.

Clause T22. The method according to Clause T21, the method further comprising:

Clause T23. The method according to Clause T15, wherein compiling the acquired vector data migration instruction to obtain the compiled vector data migration instruction includes:

Generate an assembly file according to the vector data migration instruction, and translate the assembly file into a binary file,

Wherein, the binary file is the compiled vector data migration instruction.

Clause T24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of Clause T15 to Clause T23.

With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing capability, the types and number of data operations involved in practical applications continue to increase. Programming languages are varied, and in the related art there is no synchronous control instruction that can be widely applied across them, so technicians must customize multiple instructions for each programming language environment to implement synchronous control, which makes synchronous control inefficient and slow. The present disclosure provides a synchronous control instruction processing method, device, computer equipment, and storage medium that can realize synchronous control with only one instruction and can significantly improve the efficiency and speed of synchronous control.

FIG. 21-1a shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-1a, the device includes a control module 21-11, a plurality of arithmetic modules 21-12, and a compilation module 21-13.

The compiling module 21-13 is used to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction.

The control module 21-11 is used to analyze the compiled synchronous control instruction, obtain the operation code of the synchronous control instruction, and determine the target operation module that needs to execute the synchronous control instruction. The operation code is used to indicate that the synchronization control instruction is used to perform synchronization control on multiple operation modules of the device.

The target operation module in the plurality of operation modules 21-12 is used to enter the suspended state when the compiled synchronous control instruction is executed. Among them, in the suspended state, the target computing module suspends work, no longer performs data calculation, and cannot continue to execute the calculation instructions it needs to execute.

The control module 21-11 is also used to monitor the operating states of the multiple computing modules 21-12, and when it is determined that the target computing modules are all in the suspended state, control the target computing modules in the suspended state to enter the working state. Among them, the target computing module in the working state can perform data computing and execute the computing instructions it needs to execute.
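The suspend-and-release behavior described above is that of a barrier: each target module suspends when it executes the synchronous control instruction, and the control module returns all of them to the working state only once every target module is suspended. The following is a minimal software analogy using threads, not the hardware mechanism of the disclosure; the class name SyncController, the method name sync_all, and the use of four threads as arithmetic modules are assumptions for illustration.

```python
import threading


class SyncController:  # hypothetical stand-in for the control module
    """Releases the target modules only once all of them are suspended."""

    def __init__(self, targets):
        self.targets = set(targets)
        self.suspended = set()
        self.generation = 0
        self.cond = threading.Condition()

    def sync_all(self, module_id):
        with self.cond:
            gen = self.generation
            self.suspended.add(module_id)        # module enters the suspended state
            if self.suspended == self.targets:   # every target module is suspended
                self.suspended.clear()
                self.generation += 1
                self.cond.notify_all()           # all return to the working state
            else:
                self.cond.wait_for(lambda: self.generation != gen)


log = []
log_lock = threading.Lock()


def module(ctrl, module_id):
    with log_lock:
        log.append(("before", module_id))  # work done before the sync point
    ctrl.sync_all(module_id)
    with log_lock:
        log.append(("after", module_id))   # work resumed after release


ctrl = SyncController(targets=range(4))
threads = [threading.Thread(target=module, args=(ctrl, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

No module can record an "after" entry until all four have recorded "before" and reached the sync point. Python's standard threading.Barrier offers the same behavior directly; the explicit version is shown only to mirror the monitoring role the text assigns to the control module.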

In this embodiment, the synchronous control instruction can synchronously control the process in which the arithmetic modules execute calculation instructions: an arithmetic module suspends work after executing the synchronous control instruction and waits for the control module to issue an instruction to continue working, thereby achieving synchronous control.

In this embodiment, the synchronization control instruction acquired by the device is an uncompiled software instruction that cannot be directly executed by the hardware, so the compilation module first needs to compile the (uncompiled) synchronization control instruction. The compiled synchronization control instruction is a hardware instruction that can be directly executed by hardware. The control module can obtain instructions and data through the data input and output unit, which can be one or more data I/O interfaces or I/O pins.

Optionally, the instruction processing apparatus may include a general-purpose processor and an artificial intelligence processor. The compilation module may be set on a general-purpose processor such as a CPU; that is, a general-purpose processor such as a CPU may be used to implement the compilation operation. The artificial intelligence processor may include the above-mentioned control module, operation modules, and so on; its specific structure is described below. The artificial intelligence processor can parse the received compiled synchronous control instruction and run the corresponding instruction.

In this embodiment, the operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is the instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, where that data includes the parameters to be operated on, parameters such as quantity thresholds, and the corresponding operation methods. A synchronous control instruction may include an operation code and an operation domain.

It should be understood that a person skilled in the art may set the instruction format of the synchronization control instruction and the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a synchronous control instruction and implement synchronous control of the corresponding at least one arithmetic module. When the device includes a plurality of control modules, the plurality of control modules may receive synchronization control instructions respectively, and respectively realize synchronization control of the corresponding plurality of arithmetic modules.

In this embodiment, the arithmetic module may be a device or module capable of executing calculation instructions, such as a core of the device, a processor in the device, etc., which is not limited in the present disclosure.

The synchronization control instruction processing device provided by the embodiment of the present disclosure includes a compilation module, a control module, and a plurality of arithmetic modules. The compilation module is used to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction. The control module is used to parse the compiled synchronous control instruction, obtain the operation code of the synchronous control instruction, and determine the target operation module that needs to execute the synchronous control instruction. The target operation module is used to enter the suspended state when executing the compiled synchronous control instruction. The control module is also used to monitor the running states of the multiple arithmetic modules and, when it is determined that the target operation modules are all in the suspended state, control the suspended target operation modules to enter the working state synchronously. The synchronous control instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, process synchronous control instructions efficiently and quickly, and improve the efficiency and speed of synchronous control of the corresponding arithmetic modules, which in turn improves the efficiency and speed of data computation.

In a possible implementation, the operation code may be used to indicate the target signal that the target operation module needs to synchronize, or the operation domain may include the target signal that the target operation module needs to synchronize, so that the control module determines the target signal according to the operation code or the operation domain. The target operation module is also used to, when executing the compiled synchronous control instruction, make the processing corresponding to the target signal determined by the control module enter a suspended state.

The target signal may include at least one of the following: calculation queue signal, IO signal, and arrival signal. Among them, the arrival signal is a type of signal that arrives in parallel between the arithmetic modules, including all signals that the arithmetic modules need to execute synchronously. The calculation queue signal may be a signal of a queue of calculation tasks waiting for execution in the operation module, and the IO signal may be an input and / or output signal of the operation module.

In this implementation manner, when the synchronization control instruction does not indicate the target signal, the preset default target signal may be determined as the target signal, which is not limited in the present disclosure.

In this implementation manner, specific operation codes or operation domains can be set for different target signals, so that the processor can determine a signal to be suspended in the target operation module according to the specific operation codes or operation domains. The target signal may be other signals related to the operation, control, and calculation of the arithmetic module, which is not limited in this disclosure.

In a possible implementation manner, determining the target operation module that needs to execute the synchronous control instruction may include: determining, according to the identifier of the target task, the operation modules that execute the target task among the plurality of operation modules as the target operation modules. The identifier of the target task includes at least one of the following: task name, task type, and task number. The identifier of the target task may also include other information that can characterize the target task, which is not limited in this disclosure.

Among them, in the device, the control module will assign one or more arithmetic modules to the task according to the type of the task (including the above-mentioned target task) and the working state of the arithmetic module to make it execute the task.

In the above manner, synchronous control of all arithmetic modules performing the target task can be achieved.
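Selecting target operation modules by task identifier can be sketched as a simple lookup. The sketch assumes the control module keeps a table from arithmetic module number to the identifier of the task that module is executing; the table, the function name target_modules, and the sample task names are hypothetical.

```python
def target_modules(assignments, task_id):
    """Return the modules currently executing the task identified by task_id.

    assignments: dict mapping module number -> task identifier (or None if idle).
    """
    return sorted(m for m, task in assignments.items() if task == task_id)


# Hypothetical state: modules 1 and 3 execute task "conv", module 2 executes
# task "pool", and module 4 is idle.
assignments = {1: "conv", 2: "pool", 3: "conv", 4: None}
targets = target_modules(assignments, "conv")
```

With this state, only modules 1 and 3 would be synchronized by an instruction whose target task is "conv"; modules 2 and 4 are unaffected.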

In a possible implementation, the control module may determine the target computing module that needs to execute the synchronous control instruction according to the preset target task identifier. At this time, the synchronization control instruction may include only the operation code. The instruction format of the synchronous control instruction may be "sync_all ()", wherein, based on the synchronous control instruction device, synchronous control of all arithmetic modules of the target task identified by the preset target task may be achieved.

Optionally, the synchronization control instruction may be included in a kernel function. The general-purpose processor of the device may compile the instruction and the other programs in the kernel function and, by calling the corresponding interface, send the compiled kernel function to the corresponding operation modules on the artificial intelligence processor for execution. The device can also determine the identifier of the target task according to the characteristics of the kernel function. In this way, the synchronization control instruction in the kernel function can determine the target operation module according to the determined target task identifier.

FIG. 21-1b shows a schematic structural diagram of a module cluster in a synchronous control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-1b, the device includes a compilation module 100, a control module 200, an interconnect module 500, a global memory 400, a plurality of module clusters, and a shared storage 600 corresponding to each module cluster. The shared storage 600 includes at least one DMA 610. Only two module clusters, module cluster 1 and module cluster 2, are shown in FIG. 21-1b; the remaining module clusters are similar in structure and are not shown. Each module cluster includes four arithmetic modules: arithmetic module 1, arithmetic module 2, arithmetic module 3, and arithmetic module 4. The interconnect module 500 is used to implement communication interconnection between the global memory 400, the control module 200, and the module clusters. The global memory 400 is used to store data of the control module 200 and the module clusters in the device.

For example, assume that the target task identifier UNION specifies the corresponding arithmetic modules by specifying the number of module clusters. UNION1 means that calling a kernel function to perform a task occupies 1 module cluster, a total of 4 arithmetic modules. UNION2 means that calling a kernel function to perform a task occupies 2 module clusters, a total of 8 arithmetic modules. UNION4 means that calling a kernel function to perform a task occupies 4 module clusters, a total of 16 arithmetic modules (cores). UNION8 means that calling a kernel function to perform a task occupies 8 module clusters, a total of 32 arithmetic modules (cores). Taking "UNION1" as an example, the control module 200 may designate, according to "UNION1", a module cluster 1 that is idle or capable of executing tasks, so that module cluster 1 executes the task "UNION1".
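The arithmetic in the UNION example follows from each module cluster containing four arithmetic modules. A minimal sketch, in which the function name union_footprint is hypothetical and the identifier is assumed to be the literal text UNION followed by the cluster count:

```python
MODULES_PER_CLUSTER = 4  # each module cluster holds four arithmetic modules


def union_footprint(task_id: str):
    """Return (module clusters, arithmetic modules) occupied by a UNIONn task."""
    prefix = "UNION"
    if not task_id.startswith(prefix):
        raise ValueError(f"not a UNION identifier: {task_id}")
    clusters = int(task_id[len(prefix):])
    return clusters, clusters * MODULES_PER_CLUSTER


footprints = {name: union_footprint(name)
              for name in ("UNION1", "UNION2", "UNION4", "UNION8")}
```

The resulting pairs are (1, 4), (2, 8), (4, 16), and (8, 32), matching the counts stated in the example.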

In a possible implementation manner, the plurality of operation modules are divided into a plurality of module clusters, and each module cluster includes one or more operation modules (as shown in FIG. 21-1b). Wherein, determining the target operation module that needs to execute the synchronous control instruction may include: determining, according to the identifier of the target task, all the operation modules in the target module clusters related to the execution of the target task among the plurality of module clusters as the target operation modules, where all or part of the operation modules in a target module cluster are used to execute the target task.

In the above manner, synchronous control of all computing modules in the target module cluster related to the execution of the target task can be achieved.

For example, the control module may determine, according to the preset target task identifier, the number of computing module clusters that the target task requires, and thereby determine the target computing modules that need to execute the synchronous control instruction. In this case, the synchronization control instruction may include only the operation code. The instruction format of the synchronization control instruction may be "sync_all0 ()". Through this synchronization control instruction, the device can achieve synchronous control of the operation modules in all operation module clusters used by the target task identified by the preset target task identifier.

Optionally, the synchronization control instruction may be included in a kernel function. The general processor of the device may compile the instruction together with the other programs in the kernel function and, by calling the corresponding interface, send the compiled kernel function to the corresponding operation module on the artificial intelligence machine for execution. The device can also determine the identification of the target task according to the characteristics of the kernel function. In this way, the target computing module cluster can be determined for the synchronization control instruction in the kernel function according to the determined target task identifier, and the control module can use all the computing modules in the target computing module cluster as the target computing modules.

In a possible implementation manner, the operation code or the operation field may be used to indicate the identification of the target task.

In a possible implementation, the instruction format of the synchronization control instruction may be "sync_sign2_all1 ()", where sign2 is the identification of the target task. Through this synchronization control instruction, the device can perform synchronous control of all the arithmetic modules in the target module cluster related to the execution of the target task identified as sign2.

In a possible implementation manner, the plurality of operation modules are divided into a plurality of module clusters, each module cluster includes one or more operation modules, and the operation code or the operation domain is used to indicate the identifier of the target module cluster. Determining the target operation module that needs to execute the synchronous control instruction may include: determining the operation modules belonging to the target module cluster among the plurality of operation modules as the target operation modules according to the identifier of the target module cluster.

In this implementation manner, the identification of the target module cluster may be the identification information of the target module cluster in multiple module clusters, such as the serial number, name, and the like, which can characterize the target module cluster, which is not limited in the present disclosure.

In the above manner, synchronous control of all arithmetic modules in one or more target module clusters can be achieved.

In a possible implementation, the instruction format of the synchronization control instruction may be "sync_cluster", where cluster is the identifier of the target module cluster. Through this synchronous control instruction, the device can realize synchronous control of all arithmetic modules in the target module cluster identified as cluster. When there are multiple target module clusters, the instruction format of the synchronization control instruction may be "sync_cluster0 cluster1 ... clustern", where cluster0, cluster1 ... clustern are respectively the identifier of the first target module cluster, the identifier of the second target module cluster ... the identifier of the nth target module cluster, so as to realize synchronous control of all computing modules in multiple module clusters.
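Selecting target operation modules by cluster identifier can be illustrated with a short sketch. The cluster layout, the module names, and the function `target_modules` are all hypothetical; the disclosure does not prescribe how the mapping from cluster identifier to modules is stored.

```python
from typing import Dict, Iterable, List

# Hypothetical layout: cluster identifier -> identifiers of its operation modules.
CLUSTERS: Dict[str, List[str]] = {
    "cluster0": ["m0", "m1", "m2", "m3"],
    "cluster1": ["m4", "m5", "m6", "m7"],
    "cluster2": ["m8", "m9", "m10", "m11"],
}


def target_modules(cluster_ids: Iterable[str]) -> List[str]:
    """Collect every operation module that belongs to one of the target clusters."""
    targets: List[str] = []
    for cluster_id in cluster_ids:
        if cluster_id not in CLUSTERS:
            raise KeyError(f"unknown module cluster: {cluster_id}")
        targets.extend(CLUSTERS[cluster_id])
    return targets


# Two target clusters, as in the multi-cluster instruction form.
print(target_modules(["cluster0", "cluster2"]))
```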

In a possible implementation manner, the operation code or the operation field may be used to indicate the identification of the target operation module. Wherein, determining the target operation module that needs to execute the synchronization control instruction according to the operation code or the operation domain may include: determining the target operation module from the plurality of operation modules according to the identifier of the target operation module.

In this implementation manner, the identification of the target operation module may be the identification information of the target operation module among the multiple operation modules, such as the number, serial number, and name, which can characterize the target operation module, which is not limited in this disclosure.

In the above manner, synchronous control of one or more target computing modules can be achieved.

In a possible implementation, the instruction format of the synchronization control instruction may be "syn_sign3_0 sign3_1 ... sign3_n", where sign3_0, sign3_1 ... sign3_n are respectively the identification of the first target operation module, the identification of the second target operation module ... the identification of the nth target operation module. Through this synchronous control instruction, the device can realize synchronous control of the target arithmetic modules identified as sign3_0, sign3_1 ... sign3_n.

In a possible implementation, when neither the operation code nor the operation domain of the synchronous control instruction indicates the target operation module, the control module is also used to determine the kernel function (kernel) where the synchronous control instruction is located. The operation module that calls the kernel function among the plurality of operation modules is determined as the target operation module.

In this implementation, the device can call one or more kernel functions, and the operation module can call the kernel functions to perform tasks that require the kernel functions. The synchronization control instruction can be written in the kernel function in advance, and the control module can determine the target operation module according to the record of the operation module calling the kernel function.

When the control module controls multiple computing modules to perform tasks, each computing module itself (or under the control of the control module) can determine the kernel function that needs to be called according to task information such as the type of task performed and the degree of task parallelism, and then complete the task.

In the above manner, synchronous control of the target operation modules that call the kernel function in which the synchronous control instruction is located can be achieved.
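A record of kernel-function calls, from which the control module could look up the target operation modules, might be sketched as below. The class `KernelCallRecord`, its methods, and the kernel names are hypothetical; the disclosure only states that the target can be determined from the record of operation modules calling the kernel function.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class KernelCallRecord:
    """Hypothetical record of which operation modules have called which kernel function."""

    def __init__(self) -> None:
        self._callers: DefaultDict[str, Set[int]] = defaultdict(set)

    def record_call(self, kernel_name: str, module_id: int) -> None:
        """Note that the given operation module called the given kernel function."""
        self._callers[kernel_name].add(module_id)

    def targets_for(self, kernel_name: str) -> Set[int]:
        """Return the modules that called the kernel containing the sync instruction."""
        # Return a copy so that callers cannot modify the record.
        return set(self._callers.get(kernel_name, set()))


record = KernelCallRecord()
record.record_call("conv_kernel", 0)
record.record_call("conv_kernel", 3)
record.record_call("pool_kernel", 1)
print(sorted(record.targets_for("conv_kernel")))
```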

In a possible implementation manner, the operation domain includes a quantity threshold. The control module is also used to control the target computing module in the suspended state to enter a working state when it is determined that the number of target computing modules in the suspended state reaches the number threshold.

In this implementation manner, on the basis of the determined target operation modules, the number of synchronously controlled target operation modules may also be limited, and the number threshold may be less than or equal to the number of determined target operation modules.

In a possible implementation, the instruction format of the synchronization control instruction may be "barrier N", where N is the number threshold. Here barrier only indicates that the instruction is a synchronous control instruction and that its target signal is an arrival signal. Through this synchronous control instruction, the device can realize synchronous control of the target operation modules that call the kernel function in which the synchronous control instruction is located: when the number of target operation modules in the suspended state reaches the number threshold, the target operation modules in the suspended state are controlled to enter the working state.
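The threshold behaviour of "barrier N" can be modelled with a small single-threaded sketch. The class `BarrierController` is a hypothetical toy model of the control module's bookkeeping, not the hardware mechanism described in the disclosure.

```python
from typing import List, Set


class BarrierController:
    """Toy model of the control module handling a 'barrier N' instruction."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be positive")
        self.threshold = threshold
        self.suspended: Set[int] = set()

    def arrive(self, module_id: int) -> List[int]:
        """Suspend a module; return the modules released by this arrival (possibly none)."""
        self.suspended.add(module_id)
        if len(self.suspended) < self.threshold:
            return []  # threshold not reached: everyone stays suspended
        released = sorted(self.suspended)
        self.suspended.clear()  # all suspended modules re-enter the working state
        return released


controller = BarrierController(threshold=3)
print(controller.arrive(0))  # []
print(controller.arrive(1))  # []
print(controller.arrive(2))  # [0, 1, 2]
```

In a multithreaded Python program, the standard library's `threading.Barrier` offers comparable release-at-N semantics.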

For the above synchronization control instructions "sync_sign1_all0 ()", "sync_sign2_all1 ()", "sync_cluster", "sync_cluster0 cluster1 ... clustern", "syn_sign3_0 sign3_1 ... sign3_n", and "barrier N", the keywords sync, syn, and barrier can also be used to indicate a target signal: the target signal indicated by sync may be a calculation queue signal and an IO signal, and the target signals indicated by syn and barrier may be arrival signals. The components of the synchronous control instruction (sync, syn, barrier, all0 (), all1 (), cluster, cluster0, cluster1 ... clustern, sign3_0, sign3_1 ... sign3_n, N) can be set as operation codes or operation domains as needed, and their positions within the synchronous control instruction can likewise be set as needed, which is not limited in this disclosure. Moreover, the above synchronization control instructions are only a few examples of the technical solutions of the present disclosure; those skilled in the art can set the instruction formats as needed according to the technical solutions of the present disclosure, and the present disclosure does not limit this.
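The keyword-to-signal correspondence described above can be written as a lookup table. This is a sketch only: the signal labels and the function `target_signals` are illustrative names, and the way the keyword is split off the instruction text is an assumption.

```python
from typing import Dict, FrozenSet

# Mapping taken from the paragraph above; the signal names are illustrative labels.
TARGET_SIGNALS: Dict[str, FrozenSet[str]] = {
    "sync": frozenset({"calculation_queue", "io"}),
    "syn": frozenset({"arrival"}),
    "barrier": frozenset({"arrival"}),
}


def target_signals(instruction: str) -> FrozenSet[str]:
    """Look up the target signals indicated by the leading keyword of an instruction."""
    # Assumption: the keyword ends at the first underscore or space,
    # e.g. 'sync_cluster' -> 'sync' and 'barrier 16' -> 'barrier'.
    parts = instruction.replace("_", " ").split()
    if not parts or parts[0] not in TARGET_SIGNALS:
        raise ValueError(f"unrecognised synchronization instruction: {instruction!r}")
    return TARGET_SIGNALS[parts[0]]


print(sorted(target_signals("sync_sign2_all1 ()")))
print(sorted(target_signals("barrier 16")))
```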

FIG. 21-2 shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 21-2, the computing module 21-12 may include multiple operators 21-120. The multiple operators 21-120 are used to perform operations corresponding to the operation type of the calculation instruction.

In this implementation, the operators may include units capable of performing arithmetic operations, logical operations, and the like on data, such as adders, dividers, multipliers, and comparators. The type and number of operators can be set according to the amount of data to be calculated, the type of calculation, the required processing speed and efficiency of the calculation, and so on, and the disclosure does not limit this.

In a possible implementation manner, as shown in FIG. 21-2, the device may further include a storage module 21-13. The storage modules 21-13 are used to store data to be calculated.

In a possible implementation, the control module 21-11 can also be used to generate a first assembly file according to the synchronization control instruction and translate the first assembly file into a first binary file, where the first binary file is the compiled synchronous control instruction.

In a possible implementation, the control module 21-11 may also be used to generate a second assembly file according to the calculation instruction and translate the second assembly file into a second binary file, where the second binary file is the compiled calculation instruction.

In this implementation, the assembly file can also be translated into files in other number bases, which is not limited in this disclosure.
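The two-stage translation (instruction to assembly file, assembly file to binary file) can be illustrated with a toy encoder. Everything specific here is an assumption made for illustration: the assembly syntax, the opcode value 0x01, and the three-byte layout are hypothetical, since the disclosure does not define a binary encoding.

```python
import struct

# Hypothetical numeric opcodes; the disclosure does not define a binary encoding.
OPCODES = {"barrier": 0x01}


def to_assembly(instruction: str) -> str:
    """Stage 1: normalise the source instruction into one assembly line."""
    keyword, operand = instruction.split()
    return f"{keyword.upper()} {int(operand)}"


def to_binary(assembly: str) -> bytes:
    """Stage 2: encode the assembly line as 1 opcode byte + 2-byte big-endian operand."""
    keyword, operand = assembly.split()
    return struct.pack(">BH", OPCODES[keyword.lower()], int(operand))


assembly = to_assembly("barrier 16")
binary = to_binary(assembly)
print(assembly, binary.hex())  # BARRIER 16 010010
```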

It should be noted that although the foregoing embodiment is used as an example to introduce the synchronization control instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure, taking "synchronization control using a synchronization control instruction processing device" as an exemplary application scenario, so as to facilitate understanding of the flow of the synchronization control instruction processing device. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure, and should not be considered a limitation on the embodiments of the present disclosure.

FIG. 21-3 illustrates a schematic diagram of an application scenario of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-3, the synchronization control instruction processing device processes the synchronization control instruction as follows:

The compiling module 21-13 compiles the acquired synchronization control instruction 1 to obtain the compiled synchronization control instruction 1 (for example, the synchronization control instruction 1 is barrier 16). The control module 21-11 parses the compiled synchronous control instruction to obtain the operation code of the synchronous control instruction 1. Here, the operation code of the synchronous control instruction 1 is barrier, the number threshold is 16, and the target signal is determined to be an arrival signal according to barrier. The control module 21-11 sends the compiled synchronous control instruction to all arithmetic modules of the device.

When a target arithmetic module among the plurality of arithmetic modules 21-12 executes the compiled synchronous control instruction, it causes the processing related to the arrival signal to enter the suspended state.

The control module 21-11 also monitors the operating states of the multiple arithmetic modules 21-12. When it determines that the number of arithmetic modules in the suspended state reaches the number threshold 16, it controls the arrival-signal-related processing of the 16 suspended arithmetic modules 21-12 to enter the working state synchronously.

In this way, the synchronous control instruction processing device can efficiently and quickly process the synchronous control instruction. For the working process of the above modules, please refer to the relevant description above.

FIG. 21-4 illustrates a flowchart of a method for processing synchronization control instructions according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform the related processing and operation steps, such as the following steps S51-21 to S54-21. As shown in FIG. 21-4, the method is applied to the above synchronization control instruction processing device and includes steps S51-21 to S54-21.

In step S51-21, the control module is controlled to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction.

In step S52-21, the control module is controlled to parse the compiled synchronous control instruction to obtain the operation code of the synchronous control instruction, and to determine the target operation module that needs to execute the synchronous control instruction. The operation code is used to indicate that the processing performed by the synchronization control instruction on the multiple arithmetic modules of the device is synchronization control.

In step S53-21, the target arithmetic module is controlled to enter a suspended state when the compiled synchronous control instruction is executed.

In step S54-21, the control module is controlled to monitor the running state of the plurality of operation modules, and when it is determined that the target operation modules are all in the suspended state, the target operation modules in the suspended state are controlled to enter the working state synchronously.
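Steps S51-21 to S54-21 can be traced with a minimal single-threaded sketch. It makes several simplifying assumptions that are not in the disclosure: compilation is replaced by whitespace normalisation, every module passed in is treated as a target, and the function name `run_sync` is hypothetical.

```python
from typing import List


def run_sync(instruction: str, modules: List[str]) -> List[str]:
    """Return a trace of the four steps for one synchronization control instruction."""
    trace: List[str] = []
    # S51-21: "compile" the instruction (whitespace normalisation stands in for compilation).
    compiled = " ".join(instruction.split())
    if not compiled:
        raise ValueError("empty instruction")
    trace.append(f"compiled: {compiled}")
    # S52-21: parse the operation code and determine the target modules.
    opcode = compiled.split()[0]
    targets = list(modules)  # assumption: every module is a target
    trace.append(f"opcode: {opcode}, targets: {targets}")
    # S53-21: each target module suspends when it executes the instruction.
    suspended = set()
    for module in targets:
        suspended.add(module)
        trace.append(f"suspended: {module}")
    # S54-21: once all targets are suspended, resume them together.
    if suspended == set(targets):
        trace.append(f"resumed together: {targets}")
    return trace


for line in run_sync("  sync  ", ["m0", "m1"]):
    print(line)
```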

In a possible implementation manner, the synchronization control instruction further includes an operation domain, and the operation code is used to indicate a target signal that the target operation module needs to synchronize, or the operation domain includes the target signal that the target operation module needs to synchronize, so that the control module determines the target signal according to the operation code or the operation domain.

The target computing module is further configured to cause the processing corresponding to the target signal determined by the control module to enter a suspended state when the compiled synchronous control instruction is executed.

The target signal includes at least one of the following: calculation queue signal, IO signal, and arrival signal.

In a possible implementation, determining the target operation module that needs to execute the synchronous control instruction includes:

According to the identification of the target task, the operation module that executes the target task among the plurality of operation modules is determined as the target operation module, wherein the identification of the target task includes at least one of the following: task name, task type, task number.

In a possible implementation manner, the multiple computing modules are divided into multiple module clusters, and each module cluster includes one or more computing modules,

Among them, determining the target computing module that needs to execute the synchronous control instruction includes:

According to the identification of the target task, all operation modules in the target module cluster, among the plurality of module clusters, that is related to executing the target task are determined as target operation modules, and all or part of the operation modules in the target module cluster are used to execute the target task, wherein the target task identifier includes at least one of the following: task name, task type, and task number.

In a possible implementation manner, the operation code or the operation field is used to indicate the identification of the target task.

In a possible implementation manner, the plurality of operation modules are divided into a plurality of module clusters, each module cluster includes one or more operation modules, and the operation code or the operation domain is used to indicate the identifier of the target module cluster,

According to the identifier of the target module cluster, the operation module belonging to the target module cluster among the plurality of operation modules is determined as the target operation module.

In a possible implementation manner, the operation code or the operation field is used to indicate the identification of the target operation module,

The target operation module is determined from the plurality of operation modules according to the identification of the target operation module.

In a possible implementation manner, the control module is further configured to determine the kernel function where the synchronization control instruction is located, and determine an operation module that calls the kernel function among the plurality of operation modules as a target operation module.

In a possible implementation manner, the operation domain includes a quantity threshold,

The control module is also used to control the target computing module in the suspended state to enter a working state when it is determined that the number of target computing modules in the suspended state reaches the number threshold.

In a possible implementation manner, the arithmetic module includes a master arithmetic sub-module and multiple slave arithmetic sub-modules, and the method further includes:

Controlling the compilation module to compile the obtained calculation instruction to obtain the compiled calculation instruction;

Controlling the control module to obtain data to be operated required for executing the compiled calculation instruction, and parsing the compiled calculation instruction to obtain multiple operation instructions;

Controlling the main operation sub-module to perform pre-processing on the data to be operated and to transmit data and operation instructions;

Controlling the slave operation sub-module to perform intermediate operations in parallel according to the transmitted data and operation instructions to obtain multiple intermediate results;

Controlling the main operation sub-module to perform subsequent processing on the plurality of intermediate results to obtain operation results.
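The master/slave division of labour in the steps above can be sketched with a thread pool. The concrete operation (a sum of squares), the chunking scheme, and all function names are assumptions chosen for illustration; the disclosure does not specify what the pre-processing, intermediate operations, or subsequent processing compute.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def master_preprocess(data: Sequence[float], num_slaves: int) -> List[List[float]]:
    """Master: split the data to be operated on into one chunk per slave sub-module."""
    return [list(data[i::num_slaves]) for i in range(num_slaves)]


def slave_intermediate(chunk: List[float]) -> float:
    """Slave: compute an intermediate result (here, a partial sum of squares)."""
    return sum(x * x for x in chunk)


def master_postprocess(partials: List[float]) -> float:
    """Master: combine the intermediate results into the operation result."""
    return sum(partials)


def run(data: Sequence[float], num_slaves: int = 4) -> float:
    chunks = master_preprocess(data, num_slaves)
    # The slaves perform their intermediate operations in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_intermediate, chunks))
    return master_postprocess(partials)


print(run([1, 2, 3, 4, 5]))  # 55
```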

In a possible implementation manner, the method may further include: storing data to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

In a possible implementation manner, the method may further include:

Controlling the control module to store the compiled synchronous control instruction and the compiled calculation instruction;

Controlling the control module to separately parse the compiled synchronous control instruction and the compiled calculation instruction to obtain the corresponding operation code and operation domain;

Controlling the control module to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed, arranged in execution order, and the plurality of instructions to be executed include the compiled synchronous control instruction and the compiled calculation instruction.

In a possible implementation manner, compiling the acquired synchronization control instruction to obtain the compiled synchronization control instruction may include: generating a first assembly file according to the synchronization control instruction and translating the first assembly file into a first binary file, where the first binary file is the compiled synchronous control instruction.
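An instruction queue that holds compiled instructions in execution order can be modelled with a first-in, first-out structure. The class `InstructionQueue`, its method names, and the sample instruction strings are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class InstructionQueue:
    """Hypothetical FIFO queue of compiled instructions, kept in execution order."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def push(self, instruction: str) -> None:
        """Append an instruction to be executed after all instructions already queued."""
        self._queue.append(instruction)

    def pop_next(self) -> Optional[str]:
        """Remove and return the next instruction to execute, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


queue = InstructionQueue()
queue.push("compute_a")   # a compiled calculation instruction (placeholder name)
queue.push("barrier 16")  # a compiled synchronous control instruction
queue.push("compute_b")
print(queue.pop_next(), len(queue))
```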

In a possible implementation manner, compiling the obtained calculation instruction to obtain the compiled calculation instruction may include: generating a second assembly file according to the calculation instruction, and translating the second assembly file into the second binary file. Among them, the second binary file is a compiled calculation instruction.

It should be noted that although the above embodiment is taken as an example to introduce the synchronization control instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The synchronous control instruction processing method provided by the embodiments of the present disclosure has a wide application range, high processing efficiency and fast processing speed for synchronous control instructions, and improves the efficiency and speed of computing data.

The foregoing can be better understood based on the following clauses:

Clause U1. A synchronous control instruction processing device, the device includes a compilation module, a control module, and a plurality of arithmetic modules,

The compiling module is used to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction;

The control module is used to analyze the compiled synchronous control instruction, obtain the operation code of the synchronous control instruction, and determine the target operation module that needs to execute the synchronous control instruction;

The target operation module is configured to enter a suspended state when the compiled synchronous control instruction is executed;

The control module is also used to monitor the running state of the multiple computing modules, and when it is determined that the target computing modules are all in the suspended state, control the target computing modules in the suspended state to synchronously enter the working state,

Wherein, the operation code is used to indicate that the processing performed by the synchronization control instruction on the multiple arithmetic modules of the device is synchronization control.

Clause U2. The apparatus according to Clause U1, the synchronization control instruction further includes an operation domain, the operation code is used to indicate a target signal that the target operation module needs to synchronize, or the operation domain includes the target signal that the target operation module needs to synchronize, so that the control module determines the target signal according to the operation code or the operation domain,

Clause U3. The device according to Clause U1 or Clause U2, wherein determining the target operation module that needs to execute the synchronous control instruction includes:

According to the identification of the target task, the operation module that executes the target task among the plurality of operation modules is determined as the target operation module, wherein the identification of the target task includes at least one of the following: task name, task type, task number.

Clause U4. The device according to Clause U1 or Clause U2, characterized in that the plurality of arithmetic modules are divided into a plurality of module clusters, and each module cluster includes one or more arithmetic modules,

Clause U5. The apparatus according to Clause U3 or Clause U4, the operation code or the operation field is used to indicate the identification of the target task.

Clause U6. The apparatus according to Clause U1 or Clause U2, the plurality of operation modules are divided into a plurality of module clusters, each module cluster includes one or more operation modules, and the operation code or the operation domain is used to indicate the identifier of the target module cluster,

Clause U7. The device according to Clause U1 or Clause U2, the operation code or the operation field is used to indicate the identification of the target operation module,

Clause U8. The device according to Clause U1 or Clause U2, characterized in that the control module is further configured to determine the kernel function where the synchronous control instruction is located, and determine the operation module that calls the kernel function among the plurality of arithmetic modules as the target operation module.

Clause U9. The device according to Clause U3 or Clause U8, wherein the operation domain includes a number threshold,

Clause U10. The device according to Clause U1, the arithmetic module includes a master arithmetic sub-module and a plurality of slave arithmetic sub-modules,

The compiling module is also used to compile the obtained calculation instruction to obtain the compiled calculation instruction;

The control module is configured to obtain the data to be operated required to execute the compiled calculation instruction, parse the compiled calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation submodule;

The master operation sub-module is used for performing pre-processing on the data to be operated, and transmitting data and operation instructions with the plurality of slave operation sub-modules;

The slave operation sub-module is configured to execute intermediate operations in parallel according to data and operation instructions transmitted from the master operation sub-module to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master operation sub-module;

The main operation sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain operation results.

Clause U11. The device according to Clause U10, the device further comprising:

A storage module for storing the data to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause U12. The device according to Clause U10, the control module comprising:

An instruction storage sub-module for storing the compiled synchronization control instruction and the compiled calculation instruction;

The instruction processing sub-module is used to parse the compiled synchronous control instruction and the compiled calculation instruction respectively to obtain the corresponding operation code and operation domain;

The queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled synchronous control instruction and the compiled calculation instruction.

Clause U13. The device according to Clause U12, the control module, further comprising:

Clause U14. The device according to Clause U1,

The control module is further configured to generate a first assembly file according to the synchronization control instruction and translate the first assembly file into a first binary file, where the first binary file is the compiled synchronization control instruction.

Clause U15. A machine learning computing device, the device comprising:

One or more synchronous control instruction processing devices as described in any one of clauses U1 to U14, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the synchronization control instruction processing devices, the plurality of the synchronization control instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the synchronous control instruction processing devices interconnect and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; a plurality of the synchronous control instruction processing devices share the same control system or have their own control systems; a plurality of the synchronous control instruction processing devices share memory or have their own memories; the interconnection method of the plurality of synchronous control instruction processing devices is an arbitrary interconnection topology.

Clause U16. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnect interfaces and other processing devices as described in clause U15;

Clause U17. A machine learning chip, the machine learning chip comprising:

The machine learning arithmetic device according to clause U15 or the combined processing device according to clause U16.

Clause U18. An electronic device, the electronic device comprising:

Machine learning chip as described in clause U17.

Clause U19. A board card comprising: a storage device, an interface device and a control device, and a machine learning chip as described in Clause U17;

The storage device is used for storing data;

Clause U20. A synchronous control instruction processing method, the method is applied to a synchronous control instruction processing device, the device includes a plurality of arithmetic modules, a compilation module, and a control module, the method includes:

Controlling the compilation module to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction;

Controlling the control module to parse the compiled synchronous control instruction, obtain the operation code of the synchronous control instruction, and determine the target operation module that needs to execute the synchronous control instruction;

Controlling the target operation module to enter a suspended state when the compiled synchronous control instruction is executed;

Controlling the control module to monitor the operating states of the plurality of computing modules, and when it is determined that the target computing modules are all in the suspended state, controlling the target computing modules in the suspended state to enter the working state synchronously,

Clause U21. The method according to Clause U20, the synchronization control instruction further includes an operation domain, the operation code is used to indicate a target signal required for synchronization by the target operation module, or the operation domain includes the target signal required for synchronization by the target operation module, so that the control module determines the target signal according to the operation code or the operation domain,

Wherein, controlling the target operation module to enter the suspended state when the compiled synchronous control instruction is executed includes:

Controlling the target arithmetic module to control the processing corresponding to the target signal determined by the control module to enter a suspended state when the compiled synchronous control instruction is executed,

Clause U22. According to the method described in Clause U21 or Clause U20, determine the target computing module that needs to execute the synchronous control instruction, including:

According to the identification of the target task, the operation module that executes the target task among the plurality of operation modules is determined as the target operation module, and the identification of the target task includes at least one of the following: task name, task type, and task number.

Clause U23. According to the method described in Clause U19 or Clause U20, the plurality of computing modules are divided into a plurality of module clusters, and each module cluster includes one or more computing modules,

According to the identification of the target task, all operation modules in the target module cluster related to executing the target task in the plurality of module clusters are determined as target operation modules, and all or part of the operation modules in the target cluster are used for execution For the target task, the target task identifier includes at least one of the following: task name, task type, and task number.

Clause U24. The method according to Clause U20 or 21, the plurality of operation modules are divided into a plurality of module clusters, each module cluster includes one or more operation modules, the operation code or the operation domain is used To indicate the ID of the target module cluster,

Among them, the target computing module that needs to execute the synchronous control instruction includes:

Clause U25. The method according to Clause U20 or 21, the operation code or the operation field is used to indicate the identification of the target operation module,

Clause U26. The method according to Clause U20 or 21, further comprising:

Controlling the control module to determine the nuclear function where the synchronous control instruction is located, and determining the arithmetic module calling the nuclear function among the plurality of arithmetic modules as the target arithmetic module.

Clause U27. The method according to Clause U22 or 26, wherein the operation domain includes a quantity threshold, and the method further includes:

When the control module determines that the number of target operation modules in the suspended state reaches the number threshold, controls the target operation modules in the suspended state to enter the working state.

Clause U28. The method according to Clause U20, the arithmetic module includes a master arithmetic sub-module and a plurality of slave arithmetic sub-modules, the method further includes:

Clause U29. The method according to Clause U28, the method further comprising:

Controlling the storage module of the device to store the data to be calculated,

Wherein, the storage module includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated;

Clause U30. The method according to Clause U28, the method further comprising:

Controlling the control module to store the compiled synchronous control instruction and the compiled calculation instruction;

Controlling the control module to separately parse the compiled synchronous control instruction and the compiled calculation instruction to obtain the corresponding operation code and operation domain;

Controlling the control module to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged according to an execution order, the plurality of instructions to be executed including the compiled synchronous control instruction and the compiled calculation instruction.

Clause U31. The method according to Clause U30, the method further comprising:

Controlling the control module to cache the first instruction to be executed when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed that precedes it, and, after determining that the execution of the zeroth instruction to be executed is completed, to control the execution of the first instruction to be executed,

Clause U32. According to the method described in Clause U20,

Controlling the control module to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction, including:

Generating a first assembly file according to the synchronization control instruction, and translating the first assembly file into a first binary file, wherein the first binary file is the compiled synchronization control instruction.

Clause U33. The method according to Clause U22 or Clause U23, the operation code or the operation field is used to indicate the identification of the target task.

Clause U34. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implements the method of any one of clause U20 to clause U33.
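The synchronization behavior recited in Clauses U20 and U27 can be illustrated with a minimal sketch. It is not the claimed implementation: the class name `ControlModule`, the method `on_sync_instruction`, and the use of integer module identifiers are all assumptions made for illustration. The sketch assumes each target operation module reports to the control module when it reaches the synchronous control instruction and suspends; the control module releases the suspended modules together once all targets (or, if a quantity threshold is given, that many targets) are suspended.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ControlModule:
    """Hypothetical monitor that tracks which target modules are suspended."""
    targets: Set[int]
    threshold: Optional[int] = None          # optional quantity threshold
    suspended: Set[int] = field(default_factory=set)

    def on_sync_instruction(self, module_id):
        """Record that a target module suspended at the sync instruction.

        Returns the sorted list of modules released into the working state,
        or an empty list if the control module is still waiting.
        """
        if module_id not in self.targets:
            return []                        # not a target module: ignored
        self.suspended.add(module_id)
        needed = len(self.targets) if self.threshold is None else self.threshold
        if len(self.suspended) >= needed:
            released = sorted(self.suspended)
            self.suspended.clear()           # all released synchronously
            return released
        return []
```

With three target modules and no threshold, the first two calls return an empty list and the third returns all three module identifiers.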

FIG. 22-1 shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 22-1, the device includes a control module 22-11. The control module 22-11 includes an instruction compilation submodule 22-111, a parameter acquisition submodule 22-112, and an interrupt storage submodule 22-113.

The instruction compilation submodule 22-111 compiles the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction.

The parameter acquisition submodule 22-112 determines, based on the operation domain and operation code of the compiled interrupt storage instruction, the storage parameters required for the processing performed in response to the interrupt exit. Here, the operation code indicates that the processing the interrupt storage instruction performs when the device interrupts and exits is interrupt storage processing. The storage parameters indicate the data that needs to be stored when the device interrupts and exits. The interrupt storage processing includes the interrupt exit of the device and data storage according to the storage parameters.

The interrupt storage submodule 22-113, when the compiled interrupt storage instruction is executed, controls the device to interrupt and exit, and performs data storage according to the storage parameters.

In this embodiment, during debugging and testing of the device, the interrupt storage instruction makes it possible to interrupt and exit the device in real time while storing data that indicates the operating state of the device, that is, to perform data storage according to the storage parameters. Relevant personnel can then determine the results of debugging and testing from the data stored according to the storage parameters, and analyze the operating status of the device based on the stored data.

In this embodiment, the data that needs to be stored can be determined according to the storage parameters, and the determined data that needs to be stored can be stored in the memory of the device or stored in another location, which is not limited in the present disclosure.

In this embodiment, the interrupt storage instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware. The control module first needs to compile the uncompiled interrupt storage instruction. The compiled interrupt storage instruction is a hardware instruction that can be directly executed by the hardware. The control module can obtain instructions and data through a data input and output unit, which can be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of an instruction or field (usually represented by a code) specified in the computer program to perform an operation, and is an instruction sequence number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction; all data required to execute the corresponding instruction include the data to be operated on, the storage parameters, the corresponding operation methods, and so on. An interrupt storage instruction must include an operation code and an operation domain.

It should be understood that, those skilled in the art can set the instruction format of the interrupt storage instruction, as well as the included operation codes and operation fields as required, and the disclosure does not limit this.

In this embodiment, the device may include one or more control modules, and the number of control modules may be set according to actual needs, which is not limited in the present disclosure.

An interrupt storage instruction processing device provided by an embodiment of the present disclosure includes a control module. The control module includes: an instruction compilation submodule that compiles the obtained interrupt storage instruction to obtain a compiled interrupt storage instruction; a parameter acquisition submodule that determines, based on the operation domain and operation code of the compiled interrupt storage instruction, the storage parameters required for the processing performed in response to the interrupt exit; and an interrupt storage submodule that, when the compiled interrupt storage instruction is executed, controls the device to interrupt and exit and performs data storage according to the storage parameters. The interrupt storage instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of applications, process the interrupt storage instruction with high efficiency and speed, can respond efficiently and quickly to the interrupt exit of the device, and improve the efficiency and speed of data operations.

FIG. 22-2a shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 22-2a, the device may further include an operation module 22-12.

The control module 22-11 is also used to compile the obtained calculation instruction, obtain the compiled calculation instruction, and obtain the data to be calculated required to execute the compiled calculation instruction.

The operation module 22-12 is configured to perform operation on the data to be calculated according to the compiled calculation instruction to obtain an operation result.

Wherein, controlling the device to interrupt and exit when the compiled interrupt storage instruction is executed may include: when the compiled interrupt storage instruction is executed, controlling the operation module to interrupt the execution of the compiled calculation instruction currently being executed.

The operation module 22-12 may include multiple operators 22-120. A plurality of operators 22-120 are used to perform operations corresponding to the operation type of the calculation instruction.

In this implementation manner, in the process of executing the calculation instruction, if the control module executes the interrupt storage instruction, the calculation process of the calculation instruction currently being executed is interrupted, and data storage is performed according to the storage parameter.

In this implementation, the calculation instruction may be an instruction other than the interrupt storage instruction that performs arithmetic operations, logical operations, and the like on data such as scalars, vectors, matrices, and tensors, for example, a scalar calculation instruction or a convolution calculation instruction. A person skilled in the art may set the calculation instruction according to actual needs, and this disclosure does not limit this. The calculation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware. The control module first needs to compile the uncompiled calculation instruction. The compiled calculation instruction is a hardware instruction that can be directly executed by the hardware.

In this implementation, the control module is also used to parse the compiled calculation instruction to obtain the operation code and operation domain of the calculation instruction, and obtain the data to be calculated according to the operation code and operation domain.

FIG. 22-2b shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 22-2b, the operation module 22-12 may include a master operation sub-module 22-121 and a plurality of slave operation sub-modules 22-122. The master operation sub-module 22-121 may include multiple operators, and/or the slave operation sub-module 22-122 may include multiple operators (not shown in the figure).

The control module 22-11 is also used to parse the compiled calculation instructions to obtain multiple operation instructions, and send the data to be operated and the multiple operation instructions to the main operation sub-module 22-121.

The master operation sub-module 22-121 is used to perform pre-processing on the data to be operated, and to transmit data and operation instructions with a plurality of slave operation sub-modules 22-122.

The slave operation submodule 22-122 is used to execute intermediate operations in parallel according to the data and operation instructions transmitted from the master operation submodule 22-121 to obtain multiple intermediate results, and to transmit the multiple intermediate results to the master operation submodule 22-121.

The main operation submodule 22-121 is also used to perform subsequent processing on multiple intermediate results to obtain operation results.

In this implementation manner, when the calculation instruction operates on scalar or vector data, the device may control the master operation sub-module to perform the operation corresponding to the calculation instruction using the operators therein. When the calculation instruction operates on data with two or more dimensions, such as a matrix or a tensor, the device may control the slave operation sub-modules to perform the operation corresponding to the calculation instruction.
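The division of work between the master and slave operation sub-modules described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `select_submodule` and `row_sums_total` are hypothetical, a thread pool stands in for the slave operation sub-modules, and summing the elements of a matrix stands in for an arbitrary calculation instruction.

```python
from concurrent.futures import ThreadPoolExecutor


def select_submodule(shape):
    """Pick the sub-module for data of the given shape.

    shape is a tuple of dimension sizes: () for a scalar, (n,) for a vector.
    Scalars and vectors go to the master; matrices and tensors to the slaves.
    """
    return "master" if len(shape) < 2 else "slave"


def row_sums_total(matrix, num_slaves=2):
    """Sum all elements of a matrix using a master/slave split."""
    # Master pre-processing: split the data into per-row work items.
    rows = [list(row) for row in matrix]
    # Slaves compute intermediate results (row sums) in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(sum, rows))
    # Master subsequent processing: combine the intermediate results.
    return sum(intermediates)
```

For instance, `row_sums_total([[1, 2], [3, 4]])` produces the intermediate results 3 and 7 on the slaves, which the master combines into 10.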

In a possible implementation manner, the storage parameter may include a storage space type and a storage space identifier. The operation code can also be used to indicate the type of storage space, and the operation domain can include a storage space identifier. Among them, data storage according to storage parameters may include:

Determine at least one target storage space that matches the storage space type and storage space identifier, and store the data in the target storage space.

In this implementation manner, the target storage space may be a memory of the device, such as a cache or a register located in the device, or another internal storage space related to the operation of the device. The storage space type can indicate the location, storage speed, and other information of the storage space. A code can be set in the interrupt storage instruction for each type of storage space. For example, the code of the register can be set to "gpr", and the code of the NRAM can be set to "nram". The cache may include NRAM (Neuron Random Access Memory), which is a memory specially used for storing neurons. The memory of the device may be used to store the data to be calculated and other data required to execute the calculation instruction, which is not limited in this disclosure.

In this implementation, the storage space identifier may be the number or name of the storage space, or other information in the device that can characterize the storage space.
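As a minimal sketch of how target storage spaces might be selected from a storage space type and a set of storage space identifiers, the example below models the device's storage as a dictionary. The table `DEVICE_SPACES`, its contents, and the function `find_target_spaces` are assumptions for illustration only; the disclosure does not prescribe a lookup structure.

```python
# Hypothetical device state: (storage space type, identifier) -> contents.
DEVICE_SPACES = {
    ("gpr", "r0"): 7,
    ("gpr", "r1"): 42,
    ("nram", "nram0"): [1.0, 2.0, 3.0],
}


def find_target_spaces(spaces, space_type, identifiers):
    """Return the contents of every space matching the type and identifiers."""
    targets = {}
    for ident in identifiers:
        key = (space_type, ident)
        if key not in spaces:
            raise KeyError(f"no {space_type} storage space identified as {ident}")
        targets[ident] = spaces[key]
    return targets
```

Calling `find_target_spaces(DEVICE_SPACES, "gpr", ["r0", "r1"])` selects the two registers, while an identifier that does not exist for the given type raises an error instead of being silently skipped.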

In a possible implementation manner, when there are multiple target storage spaces, the data in each target storage space is a set of data to be stored, and the multiple sets of data to be stored correspond to at least one data format.

In this implementation manner, the data format of the data stored in the target storage space is not limited. The data within one set of data to be stored has the same data format, and the data formats of different sets of data to be stored may be the same or different. The data format may include at least one of data type and data length. For example, the data stored in the target storage space may be 16-bit integer data, 32-bit unsigned integer data, and so on. For example, assume that after the device interrupts and exits, it is determined according to the storage parameters that the data in register 1, register 2 and register 3 need to be stored, where the data in register 1 is stored in a 16-bit integer data format, the data in register 2 is stored in a 32-bit unsigned integer data format, and the data in register 3 is stored in an 8-bit integer data format. The interrupt storage module can then store the 16-bit integer data in register 1, the 32-bit unsigned integer data in register 2, and the 8-bit integer data in register 3.
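The three-register example above can be sketched with Python's standard `struct` module, which serializes each value with its own data type and length. The format names, the choice of little-endian byte order, and the function `dump_registers` are assumptions for illustration; the disclosure does not specify a byte layout.

```python
import struct

# Assumed little-endian struct codes for the formats named in the example.
FORMAT_CODES = {"int16": "<h", "uint32": "<I", "int8": "<b"}


def dump_registers(registers):
    """Serialize each register value using its own data format.

    registers is an iterable of (name, format name, value) tuples.
    """
    return {
        name: struct.pack(FORMAT_CODES[fmt], value)
        for name, fmt, value in registers
    }
```

Each stored entry then occupies 2, 4, or 1 bytes respectively, so sets of data with different formats can be stored side by side without conversion.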

In this implementation, during the data storage process, the interrupt storage module can store the data directly in the data format it has in the target storage space, or convert the data in the target storage space into a specified data format before storing it.

In this implementation manner, the interrupt storage module may store all data in the target storage space or part of the data in the target storage space, which is not limited in the present disclosure.

In a possible implementation manner, the storage parameter may include a storage space identifier and a data address to be stored. The operation code can also be used to indicate the storage space identifier, and the operation domain can include the data address to be stored. Among them, data storage according to storage parameters may include:

Determine the target storage space corresponding to the storage space identifier, obtain the data to be stored from the data address to be stored in the target storage space, and store the data to be stored.

In a possible implementation manner, when only one storage space among the storage spaces related to the device corresponds to the storage space identifier, and there is only one storage space of that type, the storage parameter may instead include the storage space type and the data address to be stored. The operation code can also be used to indicate the storage space type, and the operation domain can include the data address to be stored. Wherein, data storage according to the storage parameters may include: determining the target storage space corresponding to the storage space type, obtaining the data to be stored from the data address to be stored in the target storage space, and storing the data to be stored.

In this implementation, when the data to be stored can be uniquely determined from the data address to be stored and the storage space type, the storage parameter may include the storage space type and the data address to be stored.

In a possible implementation manner, the storage parameter may further include a target storage amount, and the operation domain may also include the target storage amount. Wherein, obtaining the data to be stored from the data address to be stored in the target storage space, and storing the data to be stored may include:

Obtain, from the data address to be stored in the target storage space, the data to be stored whose data amount is the target storage amount, and store the data to be stored.

In a possible implementation, a default target storage amount can be set. When the target storage amount cannot be determined from the operation domain of the interrupt storage instruction, the default target storage amount can be used as the target storage amount of the current interrupt storage instruction, and the data to be stored whose data amount is the default target storage amount is then obtained from the data address to be stored in the target storage space.
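The fallback to a default target storage amount can be sketched as follows. The constant `DEFAULT_TARGET_AMOUNT`, its value of 1024, and the function `read_to_be_stored` are hypothetical; the target storage space is modeled as a flat byte sequence purely for illustration.

```python
DEFAULT_TARGET_AMOUNT = 1024  # hypothetical default, in bytes


def read_to_be_stored(space, src, size=None):
    """Read `size` bytes starting at address `src` of a storage space.

    When the instruction carries no target storage amount (size is None),
    the default target storage amount is used instead.
    """
    amount = DEFAULT_TARGET_AMOUNT if size is None else size
    if src < 0 or amount < 0 or src + amount > len(space):
        raise ValueError("requested range lies outside the storage space")
    return bytes(space[src:src + amount])
```

A call such as `read_to_be_stored(space, 500)` therefore reads 1024 bytes, whereas `read_to_be_stored(space, 4, 3)` reads exactly the three bytes requested.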

It should be understood that, those skilled in the art can set the arithmetic module that receives the interrupt storage instruction according to actual needs, which is not limited in the present disclosure.

In a possible implementation, the operation domain may include an identifier of the interrupt storage space. Among them, the interrupt storage sub-module 22-113 may store the data to be stored, acquired according to the storage parameters, into the interrupt storage space corresponding to the identifier of the interrupt storage space.

In a possible implementation, the interrupt storage space may include off-chip storage and/or on-chip storage of the device. The off-chip storage may include at least one DDR (DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), and the DDR may include at least one LDRAM (Local DRAM, local dynamic random access memory). The on-chip storage may include at least one of registers and NRAM, and there may be at least one of each type of storage space (register and/or NRAM) in the on-chip storage. The available storage space of off-chip storage is less than or equal to the specified storage capacity.

In this implementation manner, the available storage space of off-chip storage may be storage space in off-chip storage that can be used for data storage after the device performs an interrupt exit during execution of an interrupt storage instruction. Among them, vector data can be stored in NRAM, LDRAM. The specified storage capacity may be set in advance according to the data calculation process performed by the device, for example, 1024KB, etc., which is not limited in the present disclosure.

In this implementation manner, the identifier of the interrupt storage space in the operation domain may include the number, name, and first address of the interrupt storage space, and other parameters that can characterize the interrupt storage space, which is not limited in the present disclosure.

In a possible implementation, if the operation domain does not include the identifier of the interrupt storage space, different types of data can be stored according to a preset default storage method. For example, scalar data can be stored in a register by default, and vector data can be stored in NRAM by default. A person skilled in the art may set the default storage mode according to actual needs, which is not limited in the present disclosure.
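A minimal sketch of such a default storage method is given below. The function name `choose_interrupt_space` and the string labels "register" and "nram" are assumptions; the sketch treats any Python number as scalar data and anything else as vector data.

```python
from numbers import Number


def choose_interrupt_space(data, addr_space=None):
    """Pick the interrupt storage space for one piece of data.

    An identifier given in the operation domain always wins; otherwise
    scalars default to a register and vector data defaults to NRAM.
    """
    if addr_space is not None:
        return addr_space
    if isinstance(data, Number):
        return "register"
    return "nram"
```

For example, `choose_interrupt_space(3)` falls back to the register, while `choose_interrupt_space(3, "ldram0")` honors the identifier carried by the instruction.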

In a possible implementation manner, as shown in FIG. 22-2a, the device may further include a storage module 22-13. The storage module 22-13 may include off-chip storage and/or on-chip storage, and the on-chip storage may be used to store data to be calculated. Among them, the on-chip storage may include at least one of a register and a cache. The cache is used to store data to be calculated, and the cache includes at least one NRAM. The register is used to store the scalar data in the data to be calculated. Among them, the data to be calculated includes scalar data, vector data, tensor data and other types of data. The data to be calculated can be the data used for operations in machine learning. Machine learning operations may include neural network operations.

In a possible implementation, the cache may include a neuron cache. The neuron cache can be used to store neuron data in the data to be calculated. Neuron data can be data used for neural network operations, such as vector data.

In a possible implementation, the control module 22-11 may also be used to generate a first assembly file according to the interrupt storage instruction and translate the first assembly file into a first binary file, where the first binary file is the compiled interrupt storage instruction.

In a possible implementation, the control module 22-11 may also be used to generate a second assembly file according to the calculation instruction and translate the second assembly file into a second binary file, where the second binary file is the compiled calculation instruction.

In a possible implementation, the instruction format of the interrupt storage instruction may be:

breakdump.type addrSpace sign0

Among them, breakdump.type is the operation code of the interrupt storage instruction, and addrSpace and sign0 are the operation domains of the interrupt storage instruction. The type in breakdump.type represents the storage space type. sign0 represents a storage space identifier; when there are multiple storage spaces, there may be multiple storage space identifiers. addrSpace is the identifier of the interrupt storage space. The instruction means that when the device interrupts and exits, the storage space that corresponds to the storage space type type and is identified by the storage space identifier sign0 is determined as the target storage space, and all data in the target storage space is stored to the interrupt storage space corresponding to the identifier addrSpace of the interrupt storage space. For example, type can be gpr.

For example, when data in registers needs to be stored, assume that there are 6 registers whose data needs to be stored, and that the storage space identifiers of the 6 registers are sign0, sign1, sign2, sign3, sign4, and sign5, respectively. The corresponding interrupt storage instruction can then be: breakdump.gpr nram0 sign0 sign1 sign2 sign3 sign4 sign5. It means that when the device interrupts and exits, all the data in the six registers whose storage space identifiers are sign0, sign1, sign2, sign3, sign4, and sign5 is stored in the interrupt storage space corresponding to the identifier nram0 of the interrupt storage space.
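To make the register form of the instruction concrete, the sketch below splits its textual form into the storage space type, the interrupt storage space identifier, and the list of storage space identifiers. The class `RegisterDump` and the function `parse_register_breakdump` are hypothetical names, and parsing whitespace-separated text is an assumption; the compiled instruction handled by the hardware is a binary encoding that the disclosure does not detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegisterDump:
    space_type: str          # e.g. "gpr", taken from the operation code
    addr_space: str          # identifier of the interrupt storage space
    identifiers: List[str]   # storage space identifiers to dump


def parse_register_breakdump(text):
    """Parse 'breakdump.<type> <addrSpace> <sign0> [<sign1> ...]'."""
    opcode, *operands = text.split()
    name, _, space_type = opcode.partition(".")
    if name != "breakdump" or not space_type:
        raise ValueError(f"not an interrupt storage instruction: {opcode}")
    if len(operands) < 2:
        raise ValueError("expected addrSpace and at least one identifier")
    return RegisterDump(space_type, operands[0], operands[1:])
```

Applied to the six-register instruction above, the parser yields the type gpr, the interrupt storage space nram0, and the identifiers sign0 through sign5 in order.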

In a possible implementation, the instruction format of the interrupt storage instruction may also be:

breakdump.sign addrSpace src size

Among them, breakdump.sign is the operation code of the interrupt storage instruction, and addrSpace, src, and size are the operation domains of the interrupt storage instruction. sign represents the storage space identifier. src represents the data address to be stored. size represents the target storage amount. addrSpace is the identifier of the interrupt storage space. The instruction means that when the device interrupts and exits, the storage space corresponding to the storage space identifier sign is determined as the target storage space, the data to be stored whose data amount is the target storage amount size is obtained from the data address to be stored src in the target storage space, and the data to be stored is stored to the interrupt storage space corresponding to the identifier addrSpace of the interrupt storage space.

In a possible implementation manner, when data in a memory such as NRAM needs to be stored, the storage parameter may include a storage space type, a data address to be stored, and a target storage amount. The instruction format of the interrupt storage instruction may be: breakdump.nram ldram0 src size. It means that when the device interrupts and exits, the data to be stored whose data amount is the target storage amount size is obtained from the data address to be stored src in the NRAM, and the data to be stored is stored in the interrupt storage space corresponding to the identifier ldram0 of the interrupt storage space.
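The address form of the instruction can be decoded in a similar way. In the sketch below, `AddressDump` and `parse_address_breakdump` are hypothetical names, and the textual operands src and size are assumed to be integer literals (decimal or prefixed hexadecimal).

```python
from typing import NamedTuple


class AddressDump(NamedTuple):
    space: str       # storage space type or identifier from the opcode
    addr_space: str  # identifier of the interrupt storage space
    src: int         # data address to be stored
    size: int        # target storage amount


def parse_address_breakdump(text):
    """Parse 'breakdump.<space> <addrSpace> <src> <size>'."""
    fields = text.split()
    prefix = "breakdump."
    if len(fields) != 4 or not fields[0].startswith(prefix):
        raise ValueError(f"unexpected instruction format: {text!r}")
    space = fields[0][len(prefix):]
    if not space:
        raise ValueError("missing storage space in operation code")
    # Base 0 accepts decimal literals as well as 0x-prefixed hexadecimal.
    return AddressDump(space, fields[1], int(fields[2], 0), int(fields[3], 0))
```

For the illustrative text `breakdump.nram ldram0 500 1024`, this yields the space nram, the interrupt storage space ldram0, the address 500, and the amount 1024.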

It should be noted that although the foregoing embodiment is taken as an example to introduce the interrupt storage instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given, taking "using an interrupt storage instruction processing device to execute an interrupt storage instruction" as an exemplary application scenario, to facilitate understanding of the flow of the interrupt storage instruction processing device. Those skilled in the art should understand that the following application examples are only intended to facilitate understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIGS. 22-3a and 22-3b are schematic diagrams illustrating application scenarios of an apparatus for processing interrupt storage instructions according to an embodiment of the present disclosure. As shown in FIGS. 22-3a and 22-3b, the interrupt storage instruction processing device processes the interrupt storage instruction as follows:

Example one

As shown in Figure 22-3a, the control module 22-11 compiles the obtained interrupt storage instruction 1 to obtain the compiled interrupt storage instruction 1 (for example, the interrupt storage instruction 1 is breakdump.gpr nram0 r0 r1 r2 r3 r4 r5), and parses the compiled interrupt storage instruction 1 to obtain the operation code and operation domain of the interrupt storage instruction 1. Among them, the operation code of the interrupt storage instruction 1 is breakdump.gpr, and gpr is the storage space type, indicating a register. r0, r1, r2, r3, r4, and r5 represent storage space identifiers. nram0 indicates the identifier of the interrupt storage space. When the compiled interrupt storage instruction 1 is executed, the device is controlled to interrupt and exit, and the data in the six registers with the storage space identifiers r0, r1, r2, r3, r4, and r5 is stored to the interrupt storage space corresponding to the identifier nram0 of the interrupt storage space.

Wherein, controlling the device to interrupt and exit may include: interrupting the execution of the current calculation instruction when the compiled interrupt storage instruction 1 is executed.

Example 2

As shown in Figure 22-3b, the control module 22-11 compiles the obtained interrupt storage instruction 2 to obtain the compiled interrupt storage instruction 2 (for example, the interrupt storage instruction 2 is breakdump. The interrupt storage instruction 2 is analyzed to obtain the operation code and operation domain of the interrupt storage instruction 2. Among them, the operation code of the interrupt storage instruction 2 is breakdump.nram, and nram is a storage space type, indicating NRAM. 500 indicates the data address to be stored. 1024 is the target storage capacity. ldram0 indicates the ID of the interrupt storage space. After the execution of the compiled interrupt storage instruction 2, the control device interrupts and exits, and obtains the data to be stored of the target storage amount 1024 from the data address 500 to be stored in the NRAM, and stores the data to be stored in the ID ldram0 of the interrupt storage space Corresponding interrupt storage space.

Controlling the device to interrupt and exit may include interrupting the execution of the current calculation instruction when the compiled interrupt storage instruction 2 is executed.
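The address-and-amount dump of Example 2 can be sketched as follows. This is an illustrative model only: NRAM is represented as a `bytearray`, the target storage amount is assumed to be counted in bytes (the example does not state the unit), and the helper name `read_dump_range` is hypothetical.

```python
from typing import Dict


def read_dump_range(nram: bytearray, address: int, amount: int) -> bytes:
    """Return `amount` bytes of the NRAM model starting at `address`."""
    if address < 0 or amount < 0 or address + amount > len(nram):
        raise ValueError("requested range lies outside the NRAM model")
    return bytes(nram[address:address + amount])


# Hypothetical 2048-byte NRAM whose byte at index i holds i % 256.
nram = bytearray(range(256)) * 8

# Values from Example 2: address 500, target storage amount 1024, space ldram0.
dump_spaces: Dict[str, bytes] = {"ldram0": read_dump_range(nram, 500, 1024)}
```

The range check is a design choice of this sketch; the example does not say how the device handles a request that extends past the end of the NRAM.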

In this way, the interrupt storage instruction processing device can efficiently and quickly process the interrupt storage instruction, can efficiently and quickly respond to the interrupt exit of the device, and improve the efficiency and speed of computing data. For the working process of the above modules, please refer to the relevant description above.

FIG. 22-4 shows a flowchart of an interrupt storage instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-22 to S53-22. As shown in FIG. 22-4, the method is applied to the above-mentioned interrupt storage instruction processing device, and the method includes steps S51-22 to S53-22.

In step S51-22, compile the obtained interrupt storage instruction to obtain a compiled interrupt storage instruction.

In step S52-22, the storage parameters required for responding to the interrupt exit are determined according to the operation domain and operation code of the compiled interrupt storage instruction. The operation code indicates that the processing performed by the interrupt storage instruction when the device interrupts and exits is interrupt storage processing, and the storage parameters indicate the data that needs to be stored when the device interrupts and exits.

In step S53-22, when the compiled interrupt storage instruction is executed, the device is controlled to interrupt and exit, and data storage is performed according to the storage parameters.

In a possible implementation manner, the operation domain may include an identifier for indicating an interrupt storage space. Among them, data storage according to storage parameters may include:

Store the required storage data acquired according to the storage parameters into the interrupt storage space corresponding to the identifier of the interrupt storage space.

In a possible implementation, the interrupt storage space may include off-chip storage and / or on-chip storage of the device. The off-chip storage may include at least one DDR, the DDR may include at least one LDRAM, the on-chip storage may include at least one of registers and NRAM, and the available storage space of the off-chip storage is less than or equal to a specified storage capacity.

In a possible implementation manner, the storage parameter includes a storage space type and a storage space identifier. Among them, the operation code is also used to indicate the storage space type, and the operation domain includes the storage space identifier. Among them, data storage according to storage parameters may include:

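In a possible implementation manner, the storage parameter may include a storage space identifier and an address of data to be stored. The operation code can also be used to indicate the storage space identifier, and the operation domain can include the address of the data to be stored. Data storage according to the storage parameters may include:

Determine the target storage space corresponding to the storage space identifier, obtain the data to be stored from the address of the data to be stored in the target storage space, and store the data to be stored.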

In a possible implementation manner, the method may further include:

Obtain the calculation instruction, compile the calculation instruction to obtain the compiled calculation instruction, and obtain the data to be calculated required to execute the compiled calculation instruction;

According to the data to be calculated, execute the compiled calculation instruction to obtain the calculation result,

Wherein, controlling the device to interrupt and exit when the compiled interrupt storage instruction is executed may include: when the compiled interrupt storage instruction is executed, controlling the arithmetic module to stop the execution of the current compiled calculation instruction.
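How an interrupt storage instruction halts calculation can be illustrated with a small instruction loop. The sketch below is a simplification under several assumptions: instructions are modeled as tuples, the only calculation instruction is a hypothetical `add dst src1 src2`, each instruction is treated as atomic (so "stopping" means that no later calculation instruction runs), and the function name `run_program` is invented for illustration.

```python
from typing import Dict, List, Tuple


def run_program(
    program: List[Tuple[str, ...]],
    registers: Dict[str, int],
) -> Tuple[Dict[str, Dict[str, int]], int]:
    """Execute instructions in order until an interrupt storage instruction.

    Returns the interrupt storage spaces written and the number of
    calculation instructions that were executed before the exit.
    """
    dump_spaces: Dict[str, Dict[str, int]] = {}
    executed = 0
    for instruction in program:
        opcode = instruction[0]
        if opcode == "add":  # hypothetical calculation instruction
            _, dst, src1, src2 = instruction
            registers[dst] = registers[src1] + registers[src2]
            executed += 1
        elif opcode == "breakdump.gpr":
            _, dump_space_id, *source_ids = instruction
            dump_spaces[dump_space_id] = {r: registers[r] for r in source_ids}
            break  # interrupt and exit: later instructions are not executed
        else:
            raise ValueError(f"unknown instruction: {instruction!r}")
    return dump_spaces, executed


registers = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
program = [
    ("add", "r2", "r0", "r1"),
    ("breakdump.gpr", "nram0", "r0", "r1", "r2"),
    ("add", "r3", "r2", "r2"),  # never reached
]
dump_spaces, executed = run_program(program, registers)
```

In this run only the first `add` executes, the dump captures the register state at the moment of the exit, and `r3` keeps its initial value.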

In a possible implementation manner, the method may further include:

Parse the compiled calculation instruction to obtain multiple operation instructions;

According to the data to be calculated, executing the compiled calculation instruction to obtain the calculation result may include:

Perform pre-processing on the data to be operated, and transfer data and operation instructions;

Perform intermediate operations in parallel based on the transmitted data and operation instructions to obtain multiple intermediate results;

Perform subsequent processing on multiple intermediate results to obtain the operation result.

In a possible implementation manner, the method may further include:

Use the on-chip storage in the storage module to store the data to be calculated,

The storage module may include off-chip storage and / or on-chip storage, and on-chip storage may include at least one of registers and caches.

The cache is used to store data to be calculated, and the cache may include at least one NRAM;

The register is used to store the scalar data in the data to be calculated.

In a possible implementation manner, the cache may include a neuron cache, and the neuron cache is used to store neuron data in the data to be calculated.

In a possible implementation manner, the method may further include:

Store the compiled interrupt storage instruction and the compiled calculation instruction;

Analyze the compiled interrupt storage instruction and the compiled calculation instruction respectively to obtain the corresponding operation code and operation domain;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include a compiled interrupt storage instruction and a compiled calculation instruction.

In a possible implementation manner, compiling the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction may include: generating a first assembly file according to the interrupt storage instruction and translating the first assembly file into a first binary file. The first binary file is the compiled interrupt storage instruction.

It should be noted that although the above embodiment is taken as an example to describe the interrupt storage instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and / or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The method for processing interrupt storage instructions provided by the embodiments of the present disclosure has a wide range of application, processes interrupt storage instructions with high efficiency and speed, can respond to interrupt exits of the device efficiently and quickly, and improves the efficiency and speed of data operations.

The foregoing can be better understood based on the following clauses:

Clause V1, an interrupt storage instruction processing device, the device includes a control module, the control module includes:

The instruction compilation submodule compiles the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction;

The parameter acquisition sub-module, according to the operation domain and the operation code of the compiled interrupt storage instruction, determine the storage parameters required for the process of responding to the interrupt exit;

An interruption storage submodule, when the compiled interruption storage instruction is executed, the device is controlled to interrupt and exit, and perform data storage according to the storage parameter,

Wherein, the operation code is used to instruct the interrupt storage instruction to perform the interrupt storage process on the device when the device is interrupted and exited, and the storage parameter is used to indicate the data that needs to be stored when the device is interrupted and exited.

Clause V2. The device according to Clause V1, the operation domain includes an identifier for indicating an interrupt storage space,

Wherein, the interrupt storage sub-module is also used to store the data needed to be stored obtained according to the storage parameters in the interrupt storage space corresponding to the identifier of the interrupt storage space.

Clause V3. The device according to Clause V2, the interrupt storage space includes off-chip storage and / or on-chip storage of the device,

Wherein, the off-chip storage includes at least one DDR, the DDR includes at least one LDRAM, the on-chip storage includes at least one of registers and NRAM, and the available storage space of the off-chip storage is less than or equal to a specified storage capacity.

Clause V4. The device according to Clause V1, the storage parameter includes a storage space type and a storage space identifier,

Wherein, the operation code is also used to indicate the storage space type, and the operation domain includes a storage space identifier,

The data storage according to the storage parameters includes:

Clause V5. The device according to Clause V4, when there are multiple target storage spaces, the data in each target storage space is a set of data to be stored, and the multiple sets of data to be stored correspond to at least one data format.

Clause V6. The device according to Clause V1, the storage parameters include a storage space identifier and a data address to be stored,

Wherein, the operation code is also used to indicate the storage space identifier, the operation domain includes the data address to be stored,

The data storage according to the storage parameters includes:

A target storage space corresponding to the storage space identifier is determined, and data to be stored is obtained from a data address to be stored of the target storage space, and the data to be stored is stored.

Clause V7. The device according to Clause V6, the storage parameter further includes a target storage amount, wherein the operation domain further includes a target storage amount,

Wherein, obtaining the data to be stored from the data address to be stored in the target storage space, and storing the data to be stored includes:

Obtaining data to be stored from the data address to be stored in the target storage space as the target storage amount, and storing the data to be stored.

Clause V8. The device according to Clause V1, the device further includes an arithmetic module,

The control module is also used to obtain a calculation instruction, compile the calculation instruction to obtain the compiled calculation instruction, obtain the data to be calculated required to execute the compiled calculation instruction, and send the data to be calculated and the compiled calculation instruction to the calculation module;

The calculation module is configured to execute the compiled calculation instruction according to the data to be calculated to obtain an operation result,

Wherein, when the compiled interrupt storage instruction is executed, controlling the device to interrupt and exit includes:

When the compiled interruption storage instruction is executed, the operation module is controlled to interrupt the execution of the currently compiled calculation instruction.

Clause V9. The device according to Clause V8, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules,

The control module is further configured to parse the compiled calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation submodule;

Clause V10. The apparatus according to Clause V8, the apparatus further comprises a storage module, the storage module includes off-chip storage and / or on-chip storage,

The on-chip storage is used to store the data to be calculated,

Wherein, the on-chip storage includes at least one of a register and a cache,

The cache is used to store the data to be calculated, and the cache includes at least one NRAM;

The register is used to store scalar data in the data to be calculated.

Clause V11. The device according to Clause V10, the cache includes a neuron cache,

The neuron cache is used to store neuron data in the data to be calculated.

Clause V12. The device according to Clause V8, the control module includes:

An instruction storage sub-module for storing the compiled interrupt storage instruction and the compiled calculation instruction;

An instruction processing sub-module, which is used to separately parse the compiled interrupt storage instruction and the compiled calculation instruction to obtain the corresponding operation code and operation domain;

A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed that are arranged in order according to an execution order, and the plurality of instructions to be executed include the compiled interrupt storage instruction and the compiled calculation instruction.

Clause V13. The device according to Clause V12, the control module, further comprising:

Clause V14. The device according to Clause V1, the control module is further configured to generate a first assembly file according to the interrupt storage instruction and translate the first assembly file into a first binary file, wherein the first binary file is the compiled interrupt storage instruction.

Clause V15. A machine learning computing device, the device comprising:

One or more interrupt storage instruction processing devices as described in any one of Clause V1 to Clause V14, used to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution result to the other processing devices through an I/O interface;

When the machine learning operation device includes a plurality of the interrupt storage instruction processing devices, the plurality of the interrupt storage instruction processing devices may be connected and transmit data through a specific structure;

Among them, the plurality of interrupt storage instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of interrupt storage instruction processing devices share the same control system or have their own control systems; the plurality of interrupt storage instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of interrupt storage instruction processing devices is any interconnection topology.

Clause V16. A combined processing device, the combined processing device comprising:

The machine learning computing device as described in Clause V15, a universal interconnection interface, and other processing devices;

Wherein, the combined processing device further includes a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.

Clause V17. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause V15 or the combined processing device according to clause V16.

Clause V18. An electronic device, the electronic device comprising:

Machine learning chip as described in clause V17.

Clause V19, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause V17;

The storage device is used for storing data;

Clause V20. A method for processing an interrupt storage instruction. The method is applied to an apparatus for processing an interrupt storage instruction. The method includes:

Compile the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction;

According to the operation domain and operation code of the compiled interrupt storage instruction, determine the storage parameters required for the process of responding to the interrupt exit;

When the compiled interrupt storage instruction is executed, the device is controlled to interrupt and exit, and data storage is performed according to the storage parameter,

Clause V21. The method according to Clause V20, the operation domain includes an identifier for indicating an interrupt storage space,

The data storage according to the storage parameters includes:

Storing the required storage data acquired according to the storage parameter into the interrupt storage space corresponding to the identifier of the interrupt storage space.

Clause V22, the method according to Clause V21, the interrupt storage space includes off-chip storage and / or on-chip storage of the device,

Clause V23, the method according to Clause V20, the storage parameter includes a storage space type and a storage space identifier,

The data storage according to the storage parameters includes:

Clause V24. According to the method described in Clause V23, when there are multiple target storage spaces, the data in each target storage space is a set of data to be stored, and the multiple sets of data to be stored correspond to at least one data format.

Clause V25, the method according to Clause V20, the storage parameters include a storage space identifier and an address of the data to be stored,

The data storage according to the storage parameters includes:

Clause V26. The method according to Clause V25, the storage parameter further includes a target storage amount, wherein the operation domain further includes a target storage amount,

Clause V27. The method according to Clause V20, the method further comprising:

Obtaining calculation instructions, compiling the calculation instructions to obtain the compiled calculation instructions, and obtaining the data to be calculated required to execute the compiled calculation instructions;

Execute the compiled calculation instruction according to the data to be calculated to obtain the calculation result,

Clause V28. The method according to Clause V27, the method further comprising:

Parse the compiled calculation instructions to obtain multiple calculation instructions;

Wherein, according to the data to be calculated, executing the compiled calculation instruction to obtain the calculation result includes:

Perform pre-order processing on the data to be operated, and transmit data and operation instructions;

Perform subsequent processing on the plurality of intermediate results to obtain an operation result.

Clause V29. The method according to Clause V27, the method further comprising:

Using on-chip storage in the storage module to store the data to be calculated,

Wherein, the storage module includes off-chip storage and / or on-chip storage, and the on-chip storage includes at least one of a register and a cache,

The register is used to store scalar data in the data to be calculated.

Clause V30. The method according to Clause V29, the cache includes a neuron cache, and the neuron cache is used to store neuron data in the data to be operated.

Clause V31. The method according to Clause V29, the method further comprising:

Storing the compiled interrupt storage instruction and the compiled calculation instruction;

Separately parse the compiled interrupt storage instruction and the compiled calculation instruction to obtain the corresponding operation code and operation domain;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled interrupt storage instruction and the compiled calculation instruction.

Clause V32. The method according to Clause V31, the method further comprising:

Clause V33. According to the method described in Clause V20, compile the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction, including:

A first assembly file is generated according to the interrupt storage instruction, and the first assembly file is translated into a first binary file, where the first binary file is the compiled interrupt storage instruction.

The arithmetic module in the above device (or the processing module in the above device) may include a master arithmetic sub-module 121 and a plurality of slave arithmetic sub-modules 122, so as to realize the processing of the above-mentioned instructions and the processing of the calculation instructions.

In a possible implementation, the control module is also used to parse the obtained calculation instruction to obtain the operation domain and operation code of the calculation instruction, and to obtain, according to the operation domain and operation code, the data to be calculated required to execute the calculation instruction. The calculation module is also used to perform calculation on the data to be calculated according to the calculation instruction to obtain the calculation result of the calculation instruction. The operation module may include a plurality of operators, which are used to perform operations corresponding to the operation type of the calculation instruction.

In this implementation, the calculation instruction may be another instruction that performs arithmetic operations and logical operations on data such as scalars, vectors, matrices, and tensors. Those skilled in the art can set the calculation instructions according to actual needs, and the present disclosure does not limit this.

In this implementation, the operator may include an adder, a divider, a multiplier, a comparator, and the like that can perform arithmetic operations, logical operations, and the like on the data. The type and number of arithmetic units can be set according to the size of the amount of data to be calculated, the type of calculation, the processing speed and efficiency of the calculation of the data, etc., and this disclosure does not limit this.

In a possible implementation manner, the control module is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation sub-module 121.

The master operation sub-module 121 is used to perform pre-processing on the data to be operated, and to transmit data and operation instructions with a plurality of slave operation sub-modules 122.

The slave operation sub-modules 122 are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main operation sub-module 121 to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main operation sub-module 121.

The main operation sub-module 121 is also used to perform subsequent processing on multiple intermediate results to obtain the calculation result of the calculation instruction, and store the calculation result in the corresponding address.
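The division of work between the main operation sub-module and the slave operation sub-modules can be sketched as a split, parallel compute, and combine pattern. The sketch below is an assumption-laden illustration: a sum of squares stands in for the calculation instruction, Python threads stand in for the slave operation sub-modules, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def master_preprocess(data: List[int], num_slaves: int) -> List[List[int]]:
    """Pre-processing stand-in: deal the data out, one chunk per slave."""
    return [data[i::num_slaves] for i in range(num_slaves)]


def slave_intermediate(chunk: List[int]) -> int:
    """Intermediate operation stand-in: partial sum of squares."""
    return sum(x * x for x in chunk)


def master_postprocess(intermediates: List[int]) -> int:
    """Subsequent processing stand-in: combine the partial results."""
    return sum(intermediates)


def run_calculation(data: List[int], num_slaves: int = 4) -> int:
    chunks = master_preprocess(data, num_slaves)
    # Each worker thread models one slave operation sub-module.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(slave_intermediate, chunks))
    return master_postprocess(intermediates)


result = run_calculation([1, 2, 3, 4, 5], num_slaves=2)
```

With two slaves the chunks are `[1, 3, 5]` and `[2, 4]`, so the intermediate results are 35 and 20 and the combined result is 55.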

It should be noted that those skilled in the art can set the connection mode between the main operation sub-module and the multiple slave operation sub-modules according to actual needs, so as to implement the architecture of the operation module. For example, the architecture of the operation module may be an "H"-type architecture, an array-type architecture, a tree-type architecture, and so on, and the present disclosure does not limit this.

FIG. 23a shows a block diagram of an arithmetic module according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 23a, the operation module may further include one or more branch operation sub-modules 123, and the branch operation sub-module 123 is used to forward data and / or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. The main operation sub-module 121 is connected to the one or more branch operation sub-modules 123. In this way, the main operation sub-module, the branch operation sub-modules, and the slave operation sub-modules in the operation module are connected in an "H"-type architecture, and data and / or operation instructions are forwarded through the branch operation sub-modules, which saves the resources of the main operation sub-module and in turn increases the processing speed of instructions.

FIG. 23b shows a block diagram of an arithmetic module according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 23b, multiple slave operation sub-modules 122 are distributed in an array.

Each slave operation sub-module 122 is connected to the adjacent slave operation sub-modules 122, and the master operation sub-module 121 is connected to k slave operation sub-modules 122 among the plurality of slave operation sub-modules 122. The k slave operation sub-modules 122 are: the n slave operation sub-modules 122 in the first row, the n slave operation sub-modules 122 in the m-th row, and the m slave operation sub-modules 122 in the first column.

Among them, as shown in FIG. 23b, the k slave operation sub-modules include only the n slave operation sub-modules in the first row, the n slave operation sub-modules in the m-th row, and the m slave operation sub-modules in the first column. That is, the k slave operation sub-modules are the slave operation sub-modules that are directly connected to the master operation sub-module among the plurality of slave operation sub-modules. The k slave operation sub-modules are used to forward data and instructions between the master operation sub-module and the multiple slave operation sub-modules. In this way, distributing the multiple slave operation sub-modules in an array can increase the speed at which the master operation sub-module sends data and / or operation instructions to the slave operation sub-modules, thereby increasing the processing speed of instructions.
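The set of slave operation sub-modules that are wired directly to the master in the array layout can be enumerated as follows. This sketch assumes an array of m rows and n columns addressed by 1-based (row, column) pairs; the function name `directly_connected` is hypothetical.

```python
from typing import Set, Tuple


def directly_connected(m: int, n: int) -> Set[Tuple[int, int]]:
    """Positions of the slaves connected directly to the master.

    These are the first row, the m-th row, and the first column of an
    m-row by n-column array, using 1-based (row, column) coordinates.
    """
    if m < 1 or n < 1:
        raise ValueError("the array needs at least one row and one column")
    first_row = {(1, col) for col in range(1, n + 1)}
    last_row = {(m, col) for col in range(1, n + 1)}
    first_column = {(row, 1) for row in range(1, m + 1)}
    return first_row | last_row | first_column


edge_slaves = directly_connected(m=3, n=4)
```

Because the result is a set, a corner position that belongs to both a row and the first column appears only once. All other slaves, such as position (2, 2) in a 3 by 4 array, receive data through these directly connected neighbors.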

FIG. 23c shows a block diagram of an arithmetic module according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 23c, the calculation module may further include a tree-shaped sub-module 124. The tree-shaped sub-module 124 includes a root port 401 and multiple branch ports 402. The root port 401 is connected to the main operation sub-module 121, and the multiple branch ports 402 are respectively connected to the multiple slave operation sub-modules 122. The tree-shaped sub-module 124 has a transceiver function and is used to forward data and / or operation instructions between the main operation sub-module 121 and the slave operation sub-modules 122. In this way, the operation module is connected in a tree structure through the tree-shaped sub-module, and the forwarding function of the tree-shaped sub-module can increase the speed at which the main operation sub-module sends data and / or operation instructions to the slave operation sub-modules, thereby increasing the processing speed of instructions.

In a possible implementation, the tree-shaped sub-module 124 may be an optional component of the device, and it may include at least one layer of nodes. Each node is a line structure with a forwarding function, and the node itself has no computing function. The nodes at the lowermost layer are connected to the slave operation sub-modules to forward data and / or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. In particular, if the tree-shaped sub-module has zero layers of nodes, the device does not require the tree-shaped sub-module.

In a possible implementation, the tree-shaped submodule 124 may include multiple nodes of an n-ary tree structure, and multiple nodes of the n-ary tree structure may have multiple layers.

For example, FIG. 23d shows a block diagram of an arithmetic module according to an embodiment of the present disclosure. As shown in FIG. 23d, the n-ary tree structure may be a binary tree structure, and the tree-shaped sub-module includes two layers of nodes 01. The nodes 01 at the lowermost layer are connected to the slave operation sub-modules 122 to forward data and / or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122.

In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. A person skilled in the art may set n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure as needed, and the disclosure does not limit this.
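The forwarding role of an n-ary tree can be illustrated by enumerating the routes from the root port to the slave operation sub-modules. The sketch below rests on assumptions that the text does not fix: a single node sits behind the root port, every node has exactly n branches, and the branches of the lowermost nodes lead to slaves. The function name `forward_through_tree` is hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def forward_through_tree(payload: str, n: int, layers: int) -> Dict[Tuple[int, ...], str]:
    """Deliver `payload` unchanged to every slave below the tree.

    A slave is identified by its route: the branch index taken at each
    layer of nodes. Nodes only forward and never modify the payload.
    """
    if n < 2 or layers < 1:
        raise ValueError("expected n >= 2 and at least one layer of nodes")
    return {route: payload for route in product(range(n), repeat=layers)}


# Binary tree with two layers of nodes, as in the FIG. 23d example.
deliveries = forward_through_tree("operation instruction", n=2, layers=2)
```

Under these assumptions a binary tree with two layers of nodes reaches four slaves, and every slave receives the same payload that left the main operation sub-module.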

FIG. 23e shows a block diagram of a control module according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 23e, the control module may include an instruction storage sub-module 114, an instruction processing sub-module 115, and a queue storage sub-module 116.

The instruction storage submodule 114 is used to store the above instructions and calculation instructions.

The instruction processing sub-module 115 is used to parse the above instructions and calculation instructions respectively to obtain corresponding operation codes and operation domains. That is, parsing the above instruction to obtain the operation code and operation domain of the instruction, and parsing the calculation instruction to obtain the operation code and operation domain of the calculation instruction.

The queue storage sub-module 116 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order. The plurality of instructions to be executed may include the above-mentioned instructions and calculation instructions.

In this implementation manner, the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
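Building the instruction queue from reception time and priority level can be sketched with a sort. The text does not say how the two criteria are combined, so this sketch assumes that priority is compared first (a lower number runs earlier) and that reception time breaks ties. The tuple layout and the function name `build_instruction_queue` are hypothetical.

```python
from typing import List, Tuple

# (name, reception_time, priority): a hypothetical record layout.
PendingInstruction = Tuple[str, int, int]


def build_instruction_queue(pending: List[PendingInstruction]) -> List[str]:
    """Order instructions by priority, then by reception time."""
    ordered = sorted(pending, key=lambda item: (item[2], item[1]))
    return [name for name, _, _ in ordered]


queue = build_instruction_queue(
    [
        ("calculation_a", 0, 1),
        ("interrupt_storage", 2, 0),
        ("calculation_b", 1, 1),
    ]
)
```

Here the interrupt storage instruction arrives last but is placed first because of its priority, and the two calculation instructions keep their order of arrival.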

In a possible implementation, as shown in FIG. 23e, the control module may further include a dependency processing sub-module 117. The dependency processing sub-module 117 is configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module 114 and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module 114 and send it to the arithmetic module.

The first instruction to be executed being associated with the zeroth instruction to be executed that precedes it means that a first storage address interval, which stores the data required by the first instruction to be executed, overlaps a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed. Conversely, the absence of an association between the first instruction to be executed and the zeroth instruction to be executed may mean that the first storage address interval and the zeroth storage address interval do not overlap.

In this way, according to the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it, the later first instruction to be executed is executed only after execution of the earlier zeroth instruction to be executed is completed, which ensures the accuracy of the operation results.
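The association test and the resulting deferral can be sketched as follows. This is only an illustration under stated assumptions: address intervals are modelled as half-open `(start, end)` pairs, and the function names `intervals_overlap` and `dispatch_trace` are hypothetical rather than taken from the disclosure.

```python
def intervals_overlap(first, zeroth):
    """Return True if two half-open address intervals [start, end) share an address."""
    first_start, first_end = first
    zeroth_start, zeroth_end = zeroth
    return first_start < zeroth_end and zeroth_start < first_end


def dispatch_trace(zeroth_interval, first_interval):
    """List the control steps taken for a zeroth instruction followed by a first one."""
    trace = ["send zeroth to arithmetic module"]
    if intervals_overlap(first_interval, zeroth_interval):
        # Associated: hold the first instruction until the zeroth one has finished.
        trace.append("cache first in instruction storage sub-module")
        trace.append("wait for zeroth to complete")
    trace.append("send first to arithmetic module")
    return trace


# Overlapping intervals: the first instruction must wait.
print(dispatch_trace((0x1000, 0x1400), (0x1200, 0x1600)))
# Disjoint intervals: no association, so no waiting is needed.
print(dispatch_trace((0x1000, 0x1400), (0x1400, 0x1800)))
```

With half-open intervals, two ranges that merely touch at a boundary are treated as non-overlapping; a design using inclusive end addresses would need `<=` in the comparison.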

It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the sequence of actions described, because according to the present disclosure certain steps can be performed in other orders or simultaneously. In addition, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.

It should be further noted that although the steps in the flowcharts of FIGS. 1-4, 2-4, 3-4, 4-4, 5-4, 6-4, 7-4, 8-4, 9-4, 10-4, 11-4, 12-4, 13-4, 14-4, 15-4, 16-4, 17-4, 18-4, 19-4, 20-4, 21-4, and 22-4 are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Furthermore, at least some of the steps in FIGS. 1-4, 2-4, 3-4, 4-4, 5-4, 6-4, 7-4, 8-4, 9-4, 10-4, 11-4, 12-4, 13-4, 14-4, 15-4, 16-4, 17-4, 18-4, 19-4, 20-4, 21-4, and 22-4 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

It should be understood that the above device embodiments are only schematic, and the device of the present disclosure may also be implemented in other ways. For example, the division of the units/modules in the above embodiments is only a division of logical functions, and there may be other divisions in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be ignored or not implemented.

In addition, unless otherwise specified, the functional units/modules in each embodiment of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together. The above integrated units/modules may be implemented in the form of hardware or software program modules.

If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, or the like. The physical implementation of the hardware structure includes but is not limited to transistors, memristors, and so on. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or hybrid memory cube (HMC).

If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure in essence, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in one embodiment, refer to the related descriptions of other embodiments. The technical features of the above embodiments can be arbitrarily combined. To simplify the description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, the combination should be considered to fall within the scope described in this specification.

The present disclosure provides a machine learning computing device. The machine learning computing device may include one or more of the above instruction processing devices, and is used to acquire data to be operated and control information from other processing devices and to perform specified machine learning operations. The machine learning computing device can obtain instructions from other machine learning computing devices or non-machine learning computing devices, and transfer the execution results to peripheral devices (also called other processing devices) through the I/O interface. Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one instruction processing device is included, the instruction processing devices can be linked and transmit data through a specific structure, for example, interconnecting and transmitting data through a PCIE bus, to support larger-scale neural network operations. In this case, the instruction processing devices may share the same control system or have separate control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection method can be any interconnection topology.

The machine learning computing device has high compatibility, and can be connected with various types of servers through the PCIE interface.

FIG. 24a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in FIG. 24a, the combined processing device includes the above-mentioned machine learning computing device, a universal interconnection interface, and other processing devices. The machine learning computing device interacts with the other processing devices to jointly complete the operation specified by the user.

Other processing devices include one or more types of general-purpose/special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in other processing devices is not limited. Other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.

The universal interconnection interface is used to transfer data and control instructions between the machine learning computing device and other processing devices. The machine learning computing device obtains the required input data from other processing devices and writes it into the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them into the control cache of the machine learning computing device; it can also read the data in the storage module of the machine learning computing device and transmit it to other processing devices.

FIG. 24b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 24b, the combined processing device may further include a storage device, which is connected to the machine learning computing device and to the other processing devices respectively. The storage device is used to store data of the machine learning computing device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of the machine learning computing device or the other processing devices.

The combined processing device can be used as a system on chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.

The present disclosure provides a machine learning chip including the above machine learning computing device or combined processing device.

The present disclosure provides a machine learning chip packaging structure including the above machine learning chip.

The present disclosure provides a board card. FIG. 25 shows a schematic diagram of a board card according to an embodiment of the present disclosure. As shown in FIG. 25, the board includes the above machine learning chip packaging structure or the above machine learning chip. In addition to the machine learning chip 389, the board may also include other supporting components. The supporting components include but are not limited to: a storage device 390, an interface device 391, and a control device 392.

The storage device 390 is connected to the machine learning chip 389 (or the machine learning chip in the machine learning chip packaging structure) via a bus and is used to store data. The storage device 390 may include multiple groups of storage units 393. Each group of storage units 393 is connected to the machine learning chip 389 by a bus. It can be understood that each group of storage units 393 may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).

DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.

In one embodiment, the storage device 390 may include four groups of storage units 393. Each group of storage units 393 may include multiple DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers. In each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units 393, the theoretical data transmission bandwidth can reach 25600 MB/s.
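The 25600 MB/s figure follows from the DDR4-3200 transfer rate and the 64 data bits of one controller. The short calculation below reproduces it; the function and constant names are illustrative only.

```python
def ddr_peak_bandwidth_mb_per_s(mega_transfers_per_s, data_bits):
    """Peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * data_bits // 8


DDR4_3200_MT_PER_S = 3200  # DDR4-3200 performs 3200 million transfers per second
DATA_BITS = 64             # 72-bit controller minus the 8 bits used for ECC

print(ddr_peak_bandwidth_mb_per_s(DDR4_3200_MT_PER_S, DATA_BITS))  # 25600
```

The 8 ECC bits are excluded because they carry check information rather than payload data.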

In one embodiment, each group of storage units 393 includes multiple double-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the machine learning chip 389 for controlling the data transmission and data storage of each storage unit 393.

The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip in the machine learning chip packaging structure). The interface device 391 is used to realize data transmission between the machine learning chip 389 and an external device (such as a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transferred from the server to the machine learning chip 389 through the standard PCIE interface. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface. The present disclosure does not limit the specific form of such other interfaces, provided that the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (such as a server) by the interface device.
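The 16000 MB/s figure corresponds to the raw line rate of a PCIE 3.0 x16 link in one direction. The sketch below reproduces it and, as additional context not stated in the text, also shows the slightly lower rate left after the 128b/130b line encoding that PCIE 3.0 uses. The function name is illustrative.

```python
def pcie_bandwidth_mb_per_s(giga_transfers_per_s, lanes, payload_bits=128, line_bits=130):
    """Return (raw, effective) one-direction bandwidth in MB/s.

    raw:       line rate with every transferred bit counted as data
    effective: raw rate scaled by the payload fraction of the line encoding
    """
    raw = giga_transfers_per_s * 1000 * lanes / 8
    effective = raw * payload_bits / line_bits
    return raw, effective


# PCIE 3.0 signals at 8 GT/s per lane; an x16 link has 16 lanes.
raw, effective = pcie_bandwidth_mb_per_s(8, 16)
print(raw)               # 16000.0
print(round(effective))  # 15754
```

Protocol overhead above the encoding layer (packet headers, flow control) would reduce achievable throughput further and is not modelled here.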

The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a microcontroller unit (MCU). For example, the machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the machine learning chip 389 can be in different working states such as multi-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the machine learning chip.

The present disclosure provides an electronic device including the aforementioned machine learning chip or board.

Electronic devices can include data processing devices, computer equipment, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigation devices, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.

The vehicles may include airplanes, ships, and/or land vehicles. The household appliances may include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods. The medical devices may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.

The present disclosure also provides a non-volatile computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above instruction processing method is implemented.

An embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to perform the above instruction processing method.

The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure. The descriptions of the above embodiments are only intended to help understand the methods and core ideas of the present disclosure. Meanwhile, any changes or modifications that those skilled in the art make to the specific embodiments and the scope of application based on the ideas of the present disclosure fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims

An activation instruction processing device, characterized in that the device includes:

The control module is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated and the target address required to execute the activation instruction;

The operation module is used for performing activation operation on the data to be operated to obtain an operation result, and storing the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be operated and the target address.
The device according to claim 1, characterized in that

The control module is also used to obtain an activation parameter table according to the operation code and / or the operation domain;

The operation module is also used to perform the activation operation on the data to be operated according to the activation parameter table to obtain the operation result,

Wherein, the activation parameter table includes an activation table and a constant table.
The device according to claim 1, wherein the operation module comprises:

A plurality of activation operators, used to perform the activation operation on the data to be operated.
The device according to claim 3, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

The main operation sub-module is configured to perform activation operation on the data to be operated by using the plurality of activation operators to obtain an operation result, and store the operation result in the target address.
The device according to claim 1, wherein the operation domain includes a read-in amount or a storage address of the read-in amount, and wherein the control module is further configured to obtain the read-in amount and acquire the data to be operated according to the read-in amount.
The device according to claim 1, wherein the device further comprises:

The storage module is used for storing the data to be operated.
The device according to claim 1, wherein the control module comprises:

An instruction storage sub-module for storing the compiled activation instruction;

An instruction processing sub-module, which is used to parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction;

A queue storage sub-module is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled activation instructions.
The device according to claim 7, wherein the control module further comprises:

The dependency processing sub-module is used to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the operation module,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
The device according to claim 1, characterized in that

The control module is also used to generate an assembly file according to the activation instruction and translate the assembly file into a binary file,

Wherein, the binary file is the compiled activation instruction.
The device according to any one of claims 1 to 9, wherein the activation function used by the activation operation includes at least one of the following:

A rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky rectified linear unit (Leaky ReLU) function, a maximum function, and a power function.
A machine learning computing device, characterized in that the device includes:

One or more activation instruction processing devices according to any one of claims 1-10, used to obtain data to be operated and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;

When the machine learning computing device includes a plurality of the activation instruction processing devices, the plurality of activation instruction processing devices may be connected and transmit data through a specific structure;

Wherein, the plurality of activation instruction processing devices interconnect and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of activation instruction processing devices share the same control system or have their own respective control systems; the plurality of activation instruction processing devices share memory or have their own memories; the interconnection mode of the plurality of activation instruction processing devices is an arbitrary interconnection topology.
A combined processing device, characterized in that the combined processing device includes:

The machine learning computing device according to claim 11, a universal interconnection interface, and other processing devices;

The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,

Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
A machine learning chip, characterized in that the machine learning chip includes:

The machine learning computing device according to claim 11 or the combined processing device according to claim 12.
An electronic device, characterized in that the electronic device includes:

The machine learning chip according to claim 13.
A board card, characterized in that the board card comprises: a storage device, an interface device and a control device, and the machine learning chip according to claim 13;

Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;

The storage device is used for storing data;

The interface device is used to realize data transmission between the machine learning chip and an external device;

The control device is used for monitoring the state of the machine learning chip.
An activation instruction processing method, characterized in that the method is applied to an activation instruction processing device, and the method includes:

Using a control module to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated and the target address required to execute the activation instruction;

Using an arithmetic module to perform an activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,

Wherein, the operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be operated and the target address.
The method according to claim 16, wherein the method further comprises:

Obtaining an activation parameter table according to the operation code and / or the operation domain;

Wherein, using the operation module to perform the activation operation on the data to be operated to obtain the operation result includes:

Performing the activation operation on the data to be operated according to the activation parameter table to obtain the operation result,

Wherein, the activation parameter table includes an activation table and a constant table.
The method according to claim 16, wherein using the operation module to perform the activation operation on the data to be operated to obtain the operation result includes:

Using a plurality of activation operators to perform the activation operation on the data to be operated.
The method according to claim 18, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,

Wherein, using the operation module to perform the activation operation on the data to be operated to obtain the operation result includes:

Using the plurality of activation operators in the master operation sub-module to perform the activation operation on the data to be operated to obtain the operation result, and storing the operation result in the target address.
The method according to claim 16, wherein the operation domain further includes a read-in amount or a storage address of the read-in amount,

Wherein, obtaining the data to be operated, the activation table, the constant table and the target address required to execute the activation instruction according to the operation code and the operation domain includes:

Acquiring the read-in amount, and acquiring the data to be operated according to the read-in amount.
The method according to claim 16, wherein the method further comprises:

Storing the data to be operated.
The method according to claim 16, characterized in that parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction includes:

Storing the compiled activation instruction;

Parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction;

Storing an instruction queue, wherein the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled activation instruction.
The method according to claim 22, wherein the method further comprises:

When it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed before the first instruction to be executed, caching the first instruction to be executed, and after it is determined that the execution of the zeroth instruction to be executed is completed, controlling the execution of the first instruction to be executed,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
The method according to claim 16, wherein using the control module to compile the acquired activation instruction to obtain the compiled activation instruction includes:

Generating an assembly file according to the activation instruction, and translating the assembly file into a binary file,

Wherein, the binary file is the compiled activation instruction.
The method according to any one of claims 16 to 24, wherein the activation function utilized by the activation operation includes at least one of the following:

A rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky rectified linear unit (Leaky ReLU) function, a maximum function, and a power function.