WO2020073923A1 - Operation method and device, computer equipment, and storage medium - Google Patents


Info

Publication number
WO2020073923A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
data
activation
module
executed
Application number
PCT/CN2019/110146
Inventor
Su Zhenyu (苏振宇)
Zhou Xiaoyong (周晓勇)
Zhang Dingfei (张定飞)
Meng Xiaofu (孟小甫)
Original Assignee
Shanghai Cambricon Information Technology Co., Ltd. (上海寒武纪信息科技有限公司)
Priority claimed from CN201910548315.2A (published as CN110096283A)
Priority claimed from CN201910548268.1A (published as CN110096309B)
Application filed by Shanghai Cambricon Information Technology Co., Ltd. (上海寒武纪信息科技有限公司)
Priority to CN201980039761.9A (published as CN113330460A)
Publication of WO2020073923A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present disclosure relates to the field of computer technology and, in particular, to an operation method and device, computer equipment, and a storage medium.
  • An activation instruction processing apparatus is provided, including:
  • a control module, configured to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;
  • an operation module, configured to perform the activation operation on the data to be operated on to obtain an operation result and store the operation result at the target address,
  • where the operation code indicates that the operation the activation instruction performs on the data is an activation operation,
  • and the operation domain includes the address of the data to be operated on and the target address.
  • A machine learning computing device is provided, including:
  • one or more of the above activation instruction processing devices, configured to obtain the data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
  • when there are multiple activation instruction processing devices, they may be connected to one another and transmit data through a specific structure;
  • multiple activation instruction processing devices interconnect and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; they may share one control system or have their own control systems, may share memory or have their own memories, and may be interconnected in any topology.
  • A combined processing device is provided, including:
  • the machine learning computing device, which interacts with the other processing device to jointly complete the computing operation specified by the user.
  • A machine learning chip is provided, including the above machine learning computing device or the above combined processing device.
  • A machine learning chip package structure is provided, including the above machine learning chip.
  • A board card is provided, including the above machine learning chip package structure.
  • An electronic device is provided, including the above machine learning chip or the above board card.
  • An activation instruction processing method is provided, applied to an activation instruction processing device, the method including:
  • compiling, by the control module, the obtained activation instruction to obtain a compiled activation instruction, parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtaining, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;
  • where the operation code indicates that the operation the activation instruction performs on the data is an activation operation,
  • and the operation domain includes the address of the data to be operated on and the target address.
  • A non-volatile computer-readable storage medium is provided, storing computer program instructions that, when executed by a processor, implement the above activation instruction processing method.
  • Embodiments of the present disclosure provide an activation instruction processing method, device, and related products.
  • The device includes a control module and an operation module.
  • The control module is used to compile the obtained activation instruction to obtain a compiled activation instruction,
  • parse the compiled activation instruction to obtain its operation code and operation domain, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction; the operation module is used to perform the activation operation on the data to be operated on to obtain an operation result and store the operation result at the target address.
  • The activation instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable, and they process activation instructions, and perform activation operations, with high efficiency and speed.
  • The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a means of transportation, a household appliance, and/or a medical device.
  • The means of transportation includes an airplane, a ship, and/or a car;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood;
  • the medical device includes an MRI scanner, a B-mode ultrasound scanner, and/or an electrocardiograph.
  • FIG. 1 shows a schematic diagram of a processor for an instruction processing method according to an embodiment of the present disclosure.
  • FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 1-2a and 1-2b show block diagrams of an activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 1-3 shows a schematic diagram of an application scenario of an activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure.
  • FIG. 2-1 shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 2-2a and 2-2b show block diagrams of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 2-3 shows a schematic diagram of an application scenario of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 2-4 shows a flowchart of a linear rectification function activation instruction processing method according to an embodiment of the present disclosure.
  • FIG. 3-1 shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 3-2a and 3-2b show block diagrams of an S-shaped growth curve function activation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 3-3 shows a schematic diagram of an application scenario of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 3-4 shows a flowchart of an S-shaped growth curve function activation instruction processing method according to an embodiment of the present disclosure.
  • FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 4-2a and 4-2b show block diagrams of an exponential function activation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure.
  • FIG. 5-1 shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 5-2a and 5-2b show block diagrams of a selection instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5-3 shows a schematic diagram of an application scenario of a selection instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure.
  • FIG. 6-1 shows a block diagram of a counting instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 6-2a and 6-2b show block diagrams of a counting instruction processing device according to an embodiment of the present disclosure.
  • FIG. 6-3 shows a schematic diagram of an application scenario of a counting instruction processing device according to an embodiment of the present disclosure.
  • FIG. 6-4 shows a flowchart of a counting instruction processing method according to an embodiment of the present disclosure.
  • FIG. 7-1 shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 7-2a and 7-2b show block diagrams of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 7-3 shows a schematic diagram of an application scenario of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 7-4 shows a flowchart of a fully connected instruction processing method according to an embodiment of the present disclosure.
  • FIG. 8-1 shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 8-2a and 8-2b show block diagrams of a convolution instruction processing device according to an embodiment of the present disclosure.
  • FIG. 8-3 shows a schematic diagram of an application scenario of a convolution instruction processing device according to an embodiment of the present disclosure.
  • FIG. 8-4 shows a flowchart of a convolution instruction processing method according to an embodiment of the present disclosure.
  • FIG. 9-1 shows a block diagram of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 9-2a and 9-2b show block diagrams of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 9-3 shows a schematic diagram of an application scenario of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 9-4 shows a flowchart of a max pooling instruction processing method according to an embodiment of the present disclosure.
  • FIG. 10-1 shows a block diagram of a filling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 10-2a and 10-2b show block diagrams of a filling instruction processing device according to an embodiment of the present disclosure.
  • FIG. 10-3 shows a schematic diagram of an application scenario of a filling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 10-4 shows a flowchart of a filling instruction processing method according to an embodiment of the present disclosure.
  • FIG. 11-1 shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 11-2a and 11-2b show block diagrams of a matrix transposition instruction processing device according to an embodiment of the present disclosure.
  • FIG. 11-3 shows a schematic diagram of an application scenario of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 11-4 shows a flowchart of a matrix transposition instruction processing method according to an embodiment of the present disclosure.
  • FIG. 12-1 shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 12-2a and 12-2b show block diagrams of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 12-3 shows a schematic diagram of an application scenario of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 12-4 shows a flowchart of an average pooling instruction processing method according to an embodiment of the present disclosure.
  • FIG. 13-1 shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 13-2a and 13-2b show block diagrams of a scalar instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 13-3a and 13-3b show schematic diagrams of application scenarios of a scalar instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 13-4 shows a flowchart of a scalar instruction processing method according to an embodiment of the present disclosure.
  • FIG. 14-1 shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 14-2a and 14-2b show block diagrams of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.
  • FIG. 14-3 shows a schematic diagram of an application scenario of a scalar type conversion instruction processing device according to an embodiment of the present disclosure.
  • FIG. 14-4 shows a flowchart of a scalar type conversion instruction processing method according to an embodiment of the present disclosure.
  • FIG. 15-1 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 15-2 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 15-3a and 15-3b show schematic diagrams of application scenarios of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 15-4 shows a flowchart of an address fetch instruction processing method according to an embodiment of the present disclosure.
  • FIG. 16-1 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 16-2 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 16-3 shows a schematic diagram of an application scenario of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 16-4 shows a flowchart of a scalar data migration instruction processing method according to an embodiment of the present disclosure.
  • FIG. 17-1 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 17-2 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 17-3 shows a schematic diagram of an application scenario of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 17-4 shows a flowchart of a scalar control flow instruction processing method according to an embodiment of the present disclosure.
  • FIG. 18-1 shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 18-2a and 18-2b show block diagrams of a vector instruction processing device according to an embodiment of the present disclosure.
  • FIG. 18-3 shows a schematic diagram of an application scenario of a vector instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 18-4 shows a flowchart of a vector instruction processing method according to an embodiment of the present disclosure.
  • FIG. 19-1 shows a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 19-2a and 19-2b show block diagrams of a loop vector instruction processing device according to an embodiment of the present disclosure.
  • FIG. 19-3 shows a schematic diagram of an application scenario of a loop vector instruction processing device according to an embodiment of the present disclosure.
  • FIG. 19-4 shows a flowchart of a loop vector instruction processing method according to an embodiment of the present disclosure.
  • FIG. 20-1 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 20-2 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 20-3 shows a schematic diagram of an application scenario of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 20-4 shows a flowchart of a vector data migration instruction processing method according to an embodiment of the present disclosure.
  • FIG. 21-1a shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 21-1b shows a schematic structural diagram of a module cluster in a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 21-2 shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 21-3 shows a schematic diagram of an application scenario of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 21-4 shows a flowchart of a synchronization control instruction processing method according to an embodiment of the present disclosure.
  • FIG. 22-1 shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 22-2a and 22-2b show block diagrams of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 22-3a and 22-3b show schematic diagrams of application scenarios of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 22-4 shows a flowchart of an interrupt storage instruction processing method according to an embodiment of the present disclosure.
  • FIGS. 23a-23d show block diagrams of an operation module according to an embodiment of the present disclosure.
  • FIG. 23e shows a block diagram of a control module according to an embodiment of the present disclosure.
  • FIGS. 24a and 24b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
  • FIG. 25 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
  • The term “if” may be interpreted, depending on the context, as “when”, “once”, “in response to determining”, or “in response to detecting”.
  • Similarly, the phrase “if it is determined” or “if [the described condition or event] is detected” may be interpreted, depending on the context, as “once it is determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
  • The present disclosure provides instruction processing methods and devices for different operations or processes, together with the computer equipment and storage media corresponding to each instruction processing method and device. These include processing methods and devices for the following instructions: selection, counting, fully connected, convolution, max pooling, linear rectification function activation, S-shaped growth curve function activation, activation, filling, matrix transposition, average pooling, exponential function activation, scalar, scalar type conversion, address fetch, scalar data migration, scalar control flow, vector, loop vector, vector data migration, synchronization control, and interrupt storage instructions.
  • The instruction processing method may be applied to a processor, which may be a general-purpose processor such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations.
  • Artificial intelligence operations may include machine learning operations, brain-like operations, and so on. Machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on.
  • The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array) chip. This disclosure does not limit the specific type of processor.
  • The processor mentioned in the present disclosure may include multiple processing units, each of which may independently run its assigned tasks, such as convolution tasks, pooling tasks, or fully connected tasks.
  • The present disclosure does not limit the processing units or the tasks they execute.
  • FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure.
  • The processor 100 includes a plurality of processing units 101 and a storage unit 102.
  • The processing units 101 are used to execute instruction sequences.
  • The storage unit 102 is used to store data and may include a random access memory (RAM) and a register file.
  • The processing units 101 in the processor 100 can share part of the storage space, for example part of the RAM and the register file, and can also have their own storage spaces at the same time.
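The arrangement in FIG. 1 can be pictured with a small Python model. This is a sketch under stated assumptions, not the disclosed hardware: the class names `StorageUnit` and `ProcessingUnit`, the dictionary-backed RAM and register file, and the example tasks are all hypothetical. They only illustrate units that run tasks independently while sharing part of the storage and keeping private storage of their own.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    """Hypothetical stand-in for storage unit 102: shared RAM plus a shared register file."""
    ram: dict = field(default_factory=dict)
    registers: dict = field(default_factory=dict)


@dataclass
class ProcessingUnit:
    """Hypothetical stand-in for one processing unit 101 with its own private storage."""
    name: str
    shared: StorageUnit
    private: dict = field(default_factory=dict)

    def run(self, task):
        # Each unit runs its assigned task independently, with access to
        # the shared storage and to its own private storage.
        return task(self.shared, self.private)


shared = StorageUnit()
units = [ProcessingUnit(f"pu{i}", shared) for i in range(2)]


def producer(shared, private):
    shared.ram[0x100] = 42     # written to the shared RAM region
    private["scratch"] = 1     # visible only to the unit running this task
    return "stored"


def consumer(shared, private):
    return shared.ram.get(0x100)  # reads what another unit wrote


print(units[0].run(producer))         # stored
print(units[1].run(consumer))         # 42
print("scratch" in units[1].private)  # False
```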
  • FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-1, the device includes a control module 8-11 and an operation module 8-12.
  • The control module 8-11 is used to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction.
  • The operation code indicates that the operation the activation instruction performs on the data is an activation operation, and the operation domain includes the address of the data to be operated on and the target address.
  • The operation module 8-12 is used to perform the activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address.
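As a rough illustration of the division of labour between control module 8-11 and operation module 8-12, the following Python sketch models memory as a flat list. The field names, the mnemonic `ACTIVE`, the `length` and `func` fields, and the choice of activation functions are assumptions made for the example; the disclosure does not specify an encoding.

```python
import math
from dataclasses import dataclass


@dataclass
class CompiledActivation:
    """Hypothetical parsed form of a compiled activation instruction."""
    opcode: str          # identifies the operation as an activation operation
    src_addr: int        # operation domain: address of the data to be operated on
    dst_addr: int        # operation domain: target address for the result
    length: int          # assumed field: number of elements to read
    func: str = "relu"   # assumed field: which activation function to apply


ACTIVATIONS = {
    "relu": lambda x: max(0.0, x),
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "tanh": math.tanh,
}


def control_module(instr, memory):
    """Check the opcode and fetch the operands named by the operation domain."""
    if instr.opcode != "ACTIVE":
        raise ValueError(f"not an activation instruction: {instr.opcode}")
    data = memory[instr.src_addr:instr.src_addr + instr.length]
    return data, instr.dst_addr, ACTIVATIONS[instr.func]


def operation_module(data, dst_addr, func, memory):
    """Apply the activation element-wise and store the result at the target address."""
    result = [func(x) for x in data]
    memory[dst_addr:dst_addr + len(result)] = result
    return result


memory = [0.0] * 16
memory[0:4] = [-2.0, -0.5, 0.5, 3.0]
instr = CompiledActivation("ACTIVE", src_addr=0, dst_addr=8, length=4)
operation_module(*control_module(instr, memory), memory)
print(memory[8:12])  # [0.0, 0.0, 0.5, 3.0]
```

Keeping operand fetch in the control step and arithmetic in the operation step mirrors the two-module split described above; a real device would do both in hardware.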
  • The activation instruction obtained by the control module is an uncompiled software instruction that cannot be executed directly by hardware.
  • The control module therefore first compiles the (uncompiled) activation instruction; once the compiled activation instruction is obtained, it can be parsed.
  • The compiled activation instruction is a hardware instruction that can be executed directly by the hardware.
  • The control module can obtain the data to be operated on from the address of the data to be operated on.
  • The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
  • The operation code may be the part or field of the instruction (usually represented by a code) that specifies the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction to execute.
  • The operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; the data required to execute the instruction include the data to be operated on, the corresponding operation method, and so on.
  • An activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be operated on and the target address.
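To make the compile-then-parse step and the operation code / operation domain split concrete, here is a hedged sketch. The text syntax, the 14-byte binary layout (`<BBIII`), the opcode value `0x21`, and the function identifiers are all hypothetical; they stand in for whatever hardware format the actual device uses.

```python
import struct

# Hypothetical layout of a compiled (hardware) activation instruction:
# 1-byte opcode, 1-byte activation-function id, then three little-endian
# 32-bit fields forming the operation domain: source address,
# target address, and read-in amount.
FORMAT = "<BBIII"
OPCODES = {"ACTIVE": 0x21}  # assumed opcode value
FUNC_IDS = {"relu": 0, "sigmoid": 1, "tanh": 2}


def compile_instruction(text):
    """'Compile' a textual software instruction into a fixed-width binary word."""
    mnemonic, func, src, dst, count = text.split()
    return struct.pack(FORMAT, OPCODES[mnemonic], FUNC_IDS[func],
                       int(src, 0), int(dst, 0), int(count, 0))


def parse_instruction(word):
    """Split a compiled instruction into its operation code and operation domain."""
    opcode, func_id, src, dst, count = struct.unpack(FORMAT, word)
    func = {v: k for k, v in FUNC_IDS.items()}[func_id]
    domain = {"src_addr": src, "dst_addr": dst, "read_in": count, "func": func}
    return opcode, domain


word = compile_instruction("ACTIVE sigmoid 0x40 0x80 16")
print(len(word))                # 14
print(parse_instruction(word))  # (33, {'src_addr': 64, 'dst_addr': 128, 'read_in': 16, 'func': 'sigmoid'})
```

A fixed-width layout is chosen here only because it makes decoding a single `struct.unpack` call; nothing in the disclosure fixes the width or field order.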
  • The device may include one or more control modules and one or more operation modules; their numbers may be set according to actual needs, which is not limited in the present disclosure.
  • The control module may receive a linear rectification function activation instruction and control one or more operation modules to perform the linear rectification function activation operation.
  • When there are multiple control modules, each may receive linear rectification function activation instructions and control its corresponding one or more operation modules to perform the linear rectification function activation operations.
  • An embodiment of the present disclosure provides an activation instruction processing apparatus.
  • The apparatus includes a control module and an operation module.
  • The control module is used to compile the acquired activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction; the operation module is used to perform the activation operation on the data to be operated on to obtain an operation result and store the operation result at the target address.
  • The activation instruction processing apparatus provided by the embodiments of the present disclosure is widely applicable, and it processes activation instructions, and performs activation operations, with high efficiency and speed.
  • The activation function used by the activation operation may include at least one of the following: the linear rectification function (Rectified Linear Unit, ReLU), the S-shaped growth curve function (Sigmoid function), the hyperbolic tangent function (tanh), the leaky linear rectification function (Leaky ReLU, a variant of the ReLU function), the maximum-taking function (maxout function, which outputs the maximum value), and the power function.
  • ReLU: linear rectification function (Rectified Linear Unit)
  • Sigmoid: S-shaped growth curve function
  • tanh: hyperbolic tangent function
  • Leaky ReLU: a variant of the ReLU function
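The formulas that accompanied these function names did not survive extraction. For reference, the standard textbook definitions are given below; they are not copied from the patent, and the Leaky ReLU slope α is an assumed small positive constant (values such as 0.01 are common).

```latex
\begin{aligned}
\mathrm{ReLU}(x) &= \max(0, x) \\
\mathrm{Sigmoid}(x) &= \frac{1}{1 + e^{-x}} \\
\tanh(x) &= \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \\
\mathrm{LeakyReLU}(x) &=
  \begin{cases}
    x, & x > 0 \\
    \alpha x, & x \le 0
  \end{cases}
\end{aligned}
```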
  • The activation function used for the activation operation may also be another function with properties such as being non-linear, continuously differentiable, as non-saturating as possible over its range, monotonic, and approximately linear near the origin; this disclosure does not limit the function used for the activation operation.
  • The control module 8-11 may also be used to obtain an activation parameter table according to the operation code and/or the operation domain.
  • The operation module 8-12 may also be used to perform the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result.
  • The activation parameter table may include an activation table and a constant table.
  • The address of the activation parameter table may be included in the operation domain, so that the control module obtains the activation parameter table from that address.
  • Alternatively, the control module may determine from the operation code that an activation parameter table is needed and obtain the activation parameter table corresponding to the activation instruction directly from a predetermined storage address.
  • A person skilled in the art may set the way the activation parameter table is obtained according to actual needs, which is not limited in the present disclosure.
  • the control module may also obtain the activation function corresponding to the activation instruction, so that the operation module can perform the activation operation on the data to be calculated using that activation function and the corresponding operators.
  • an activation table and a constant table required for activation operations using different activation functions can be predetermined.
  • the activation table and constant table corresponding to different activation functions are different.
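  • The disclosure does not specify the contents or layout of the activation table and the constant table. Hardware often implements table-driven activation as a piecewise-linear approximation, and the sketch below assumes that interpretation. The input range is divided into equal segments. The "activation table" is taken to hold one slope per segment and the "constant table" one intercept per segment, so each output is slope[i] * x + intercept[i]. The names build_tables and table_activate are hypothetical. Because each activation function yields its own pair of tables, this interpretation is consistent with the statement that different activation functions use different tables.

```python
import bisect
import math
from typing import Callable, List, Tuple


def build_tables(fn: Callable[[float], float], lo: float, hi: float,
                 segments: int) -> Tuple[List[float], List[float], List[float]]:
    """Hypothetical table builder: chord-based piecewise-linear fit of fn on [lo, hi]."""
    step = (hi - lo) / segments
    breakpoints = [lo + i * step for i in range(segments + 1)]
    slopes: List[float] = []      # assumed content of the "activation table"
    intercepts: List[float] = []  # assumed content of the "constant table"
    for i in range(segments):
        x0, x1 = breakpoints[i], breakpoints[i + 1]
        k = (fn(x1) - fn(x0)) / (x1 - x0)
        slopes.append(k)
        intercepts.append(fn(x0) - k * x0)
    return breakpoints, slopes, intercepts


def table_activate(x: float, breakpoints: List[float], slopes: List[float],
                   intercepts: List[float]) -> float:
    """Evaluate the activation using only the two tables plus the segment boundaries."""
    x = min(max(x, breakpoints[0]), breakpoints[-1])  # clamp to the tabulated range
    i = bisect.bisect_right(breakpoints, x) - 1       # segment containing x
    i = min(i, len(slopes) - 1)                       # x == hi belongs to the last segment
    return slopes[i] * x + intercepts[i]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


# Example: 64-segment tables approximating the Sigmoid function on [-8, 8].
bp, act_table, const_table = build_tables(sigmoid, -8.0, 8.0, 64)
```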
  • the arithmetic module 8-12 may include a plurality of activation operators 8-120.
  • the plurality of activation operators 8-120 are used to perform the activation operation on the data to be calculated.
  • alternatively, the operation module may include a single activation operator.
  • the number of activation operators can be set according to the amount of data on which the activation operation is to be performed and the required processing speed and efficiency of the activation operation; the present disclosure does not limit this.
  • the operation module 8-12 may include a master operation sub-module 8-121 and a plurality of slave operation sub-modules 8-122, where the master operation sub-module 8-121 includes the plurality of activation operators 8-120 (not shown in the figure).
  • the master operation sub-module 8-121 is used to perform the activation operation on the data to be calculated using the plurality of activation operators, obtain the operation result, and store the operation result at the target address.
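  • As an illustration of how a master operation sub-module might spread one activation operation over several activation operators, the sketch below splits the data into contiguous chunks, gives one chunk to each operator, and reassembles the results in their original order. Threads stand in for the hardware operators. The function names and the even-split policy are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_evenly(data: List[float], n_operators: int) -> List[List[float]]:
    """Split data into at most n_operators contiguous, near-equal chunks (empty chunks dropped)."""
    base, extra = divmod(len(data), n_operators)
    chunks: List[List[float]] = []
    start = 0
    for i in range(n_operators):
        end = start + base + (1 if i < extra else 0)
        if end > start:
            chunks.append(data[start:end])
        start = end
    return chunks


def master_activate(data: List[float], fn: Callable[[float], float],
                    n_operators: int = 4) -> List[float]:
    """Hypothetical master sub-module: one chunk per activation operator, results kept in order."""
    chunks = split_evenly(data, n_operators)
    with ThreadPoolExecutor(max_workers=n_operators) as pool:
        # pool.map yields results in input order, so the output lines up with the input
        parts = list(pool.map(lambda chunk: [fn(v) for v in chunk], chunks))
    return [y for part in parts for y in part]
```

  • Because the chunks are contiguous and pool.map preserves order, the master sub-module can write the concatenated result directly to the target address.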
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 8-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.
  • the read-in amount may be the amount or the size of the data to be calculated that is to be acquired.
  • when the operation domain directly contains the specific value of the read-in amount, that value can be used as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • alternatively, a plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the acquired data amount of the plurality of data to be calculated may be less than or equal to the default read-in amount.
  • in this way, the amount of data to be calculated can be limited, which helps ensure the accuracy of the operation result and ensures that the device is able to execute the activation instruction.
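  • The three ways of determining the read-in amount described above can be sketched as follows: an immediate value in the operation domain, a storage address that holds the value, or a preset default. The default of 64 and the function names are hypothetical. The slice in fetch_operands shows why the amount of data actually acquired may be less than or equal to the read-in amount.

```python
from typing import Dict, List, Optional

DEFAULT_READ_IN = 64  # hypothetical preset default; the disclosure gives no value


def resolve_read_in(immediate: Optional[int],
                    size_addr: Optional[int],
                    scalar_mem: Dict[int, int]) -> int:
    """Pick the read-in amount: immediate value, then stored value, then default."""
    if immediate is not None:
        return immediate              # operation domain holds the value itself
    if size_addr is not None:
        return scalar_mem[size_addr]  # operation domain holds where the value is stored
    return DEFAULT_READ_IN            # fall back to the preset default


def fetch_operands(data_mem: List[float], src0: int, read_in: int) -> List[float]:
    """Read at most read_in items starting at src0 (fewer if memory ends first)."""
    return data_mem[src0:src0 + read_in]
```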
  • the device may further include a storage module 8-13.
  • the storage module 8-13 is used to store the data to be calculated.
  • the storage module 8-13 may also be used to store the activation table and the constant table.
  • the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a scratchpad cache.
  • the data to be calculated, the activation table, and the constant table can be stored in the cache and/or the register of the storage module as needed; the present disclosure does not limit this.
  • the device may further include a direct memory access (DMA) module, which is used to read data from and store data into the storage module.
  • the control module 8-11 may also be used to generate an assembly file according to the activation instruction and translate the assembly file into a binary file, where the binary file is the compiled activation instruction.
  • the instruction format of the activation instruction may be: active dst, src0, active_table, const_table, size
  • active is the opcode of the activation instruction
  • dst, src0, active_table, const_table, and size are the operation domains of the activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • active_table is the active table address
  • const_table is the constant table address
  • size is the read-in amount.
  • the instruction format of the activation instruction may also be: active dst, src0, size
  • active is the operation code of the activation instruction
  • dst, src0, size are the operation domain of the activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • size is the read-in amount.
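  • The two instruction formats above differ only in whether the table addresses are present. A minimal parser can distinguish them by operand count, as sketched below. The textual syntax (a mnemonic followed by comma-separated decimal operands) follows the example instruction "active 500, 100, 200, 300, 64" used in the application scenario. The ActiveInstr type and the parse_active function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveInstr:
    opcode: str
    dst: int                            # target address
    src0: int                           # address of the data to be calculated
    size: int                           # read-in amount
    active_table: Optional[int] = None  # activation table address (5-operand form only)
    const_table: Optional[int] = None   # constant table address (5-operand form only)


def parse_active(text: str) -> ActiveInstr:
    """Split an activation instruction into its operation code and operation domain."""
    opcode, _, rest = text.strip().partition(" ")
    if opcode != "active":
        raise ValueError(f"not an activation instruction: {opcode!r}")
    fields = [int(f) for f in rest.split(",")]
    if len(fields) == 5:
        dst, src0, act, const, size = fields
        return ActiveInstr(opcode, dst, src0, size, act, const)
    if len(fields) == 3:
        dst, src0, size = fields
        return ActiveInstr(opcode, dst, src0, size)
    raise ValueError(f"expected 3 or 5 operands, got {len(fields)}")
```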
  • the device may be provided in a graphics processing unit (GPU), a central processing unit (CPU), and/or an embedded neural-network processing unit (NPU).
  • FIG. 1-3 is a schematic diagram illustrating an application scenario of the activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-3, the activation instruction processing apparatus processes the activation instruction as follows:
  • the control module 8-11 compiles the acquired activation instruction 1 (for example, active 500, 100, 200, 300, 64) to obtain the compiled activation instruction 1, and parses the compiled activation instruction 1 to obtain its operation code and operation domain.
  • the operation code of the activation instruction 1 is active, the target address is 500, the data address to be calculated is 100, the activation table address is 200, the constant table address is 300, and the read-in amount is 64.
  • the control module 8-11 obtains 64 items of data to be calculated (the read-in amount) starting at data address 100, the activation table from activation table address 200, and the constant table from constant table address 300.
  • the operation module 8-12 performs activation calculation on the operation data according to the activation table and the constant table, obtains the operation result, and stores the operation result in the target address 500.
  • Example 2: the difference from Example 1 is that activation instruction 1 is "active 500, 100". If the activation operation needs to be performed according to the activation parameter table, the control module 8-11 must obtain the activation parameter table by itself (see the description above for the specific implementation).
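  • The first application scenario above can be traced with a toy model. In the sketch below, memory is a dictionary that maps each address to a buffer. The disclosure does not define the table contents, so, purely for illustration, the activation table and the constant table are assumed to hold per-segment slopes and intercepts over hypothetical segment boundaries. Here they are chosen so that the tables implement ReLU. The function execute_active and its boundaries parameter are hypothetical.

```python
import bisect
from typing import Dict, List

Memory = Dict[int, List[float]]  # toy model: each address names one buffer


def execute_active(mem: Memory, dst: int, src0: int, active_table: int,
                   const_table: int, size: int, boundaries: List[float]) -> None:
    """Hypothetical semantics of "active dst, src0, active_table, const_table, size"."""
    data = mem[src0][:size]                     # at most `size` items (read-in amount)
    slopes, consts = mem[active_table], mem[const_table]
    result = []
    for x in data:
        i = bisect.bisect_right(boundaries, x)  # segment index for x
        result.append(slopes[i] * x + consts[i])
    mem[dst] = result                           # store the result at the target address


# Example 1 from the text: active 500, 100, 200, 300, 64
mem: Memory = {
    100: [float(v) for v in range(-40, 40)],  # 80 values; only the first 64 are read
    200: [0.0, 1.0],                          # slopes for x < 0 and x >= 0 (ReLU)
    300: [0.0, 0.0],                          # intercepts for both segments
}
execute_active(mem, 500, 100, 200, 300, 64, boundaries=[0.0])
```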
  • the activation instruction processing device can process the activation instruction efficiently and quickly, and realize the efficient and rapid processing of the activation operation.
  • FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1-4, the method is applied to the above activation instruction processing apparatus.
  • the method includes step S51-8 and step S52-8.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in executing the method, and the processor is used to perform the related processing and operation steps, such as steps S51-8 to S52-8 below.
  • in step S51-8, the control module compiles the obtained activation instruction to obtain a compiled activation instruction, parses the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtains, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the activation instruction.
  • the operation code indicates that the operation performed by the activation instruction on the data is an activation operation, and the operation domain includes the address of the data to be calculated and the target address.
  • in step S52-8, the operation module performs the activation operation on the data to be calculated to obtain the operation result, and stores the operation result at the target address.
  • the method may further include:
  • the operation module is used to activate the operation data to obtain the operation result, including:
  • the activation parameter table may include an activation table and a constant table.
  • the operation module is used to activate the operation data to obtain the operation result, which may include:
  • the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators,
  • the operation module is used to activate the operation data to obtain the operation result, which may include:
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • obtaining the data to be calculated, the activation table, the constant table, and the target address required to execute the activation instruction according to the operation code and the operation domain may include:
  • the method may further include: storing data to be calculated.
  • parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction may include:
  • the instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the compiled activation instructions.
  • the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and, after determining that execution of the zeroth to-be-executed instruction is completed, controlling execution of the first to-be-executed instruction,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
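  • The association test above reduces to an interval-overlap check. The sketch below makes two assumptions. First, each instruction's required data is described by half-open address intervals [start, end). Second, an instruction may be issued only if it overlaps none of the earlier instructions that have not yet completed. The Pending type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address interval [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class Pending:
    name: str
    ranges: List[Interval]  # storage address intervals of the data the instruction needs


def depends_on(first: Pending, zeroth: Pending) -> bool:
    """Association relationship: some interval of `first` overlaps some interval of `zeroth`."""
    return any(overlaps(a, b) for a in first.ranges for b in zeroth.ranges)


def can_issue(first: Pending, unfinished: List[Pending]) -> bool:
    """`first` may be sent for execution only if no unfinished earlier instruction conflicts."""
    return not any(depends_on(first, z) for z in unfinished)
```

  • For example, an instruction that reads the results written by "active 500, 100, 200, 300, 64" (interval [500, 564)) would be held in the instruction storage sub-module until that activation instruction completes.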
  • using the control module to compile the obtained activation instruction to obtain the compiled activation instruction includes:
  • the assembly file is generated according to the activation instruction, and the assembly file is translated into a binary file.
  • the binary file is the activation instruction after compilation.
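  • The disclosure does not define the binary encoding. Purely as an illustration of the assembly-to-binary step, the sketch below packs a mnemonic and its operands into bytes with Python's standard struct module. It uses a made-up opcode number (0x21) and a made-up layout: one opcode byte, one operand-count byte, then each operand as an unsigned 32-bit little-endian field. The functions assemble and disassemble are hypothetical.

```python
import struct
from typing import List, Tuple

OPCODES = {"active": 0x21}  # hypothetical opcode number


def assemble(line: str) -> bytes:
    """Translate one assembly line such as "active 500, 100, 64" into binary."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [int(x) for x in rest.split(",")] if rest.strip() else []
    header = struct.pack("<BB", OPCODES[mnemonic], len(operands))
    return header + struct.pack(f"<{len(operands)}I", *operands)


def disassemble(blob: bytes) -> Tuple[str, List[int]]:
    """Recover the mnemonic and operands from the binary form."""
    opcode, count = struct.unpack_from("<BB", blob, 0)
    operands = list(struct.unpack_from(f"<{count}I", blob, 2))
    names = {v: k for k, v in OPCODES.items()}
    return names[opcode], operands
```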
  • the activation function utilized by the activation operation may include at least one of the following:
  • linear rectification function, S-shaped growth curve function, hyperbolic tangent function, linear rectification function with leakage, maximum function, and power function.
  • the activation instruction processing method provided by the embodiments of the present disclosure has a wide range of applications, and processes activation instructions, and thus performs activation operations, with high efficiency and speed.
  • an activation instruction processing device comprising:
  • the control module is used to compile the obtained activation instruction to obtain the compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the activation instruction;
  • the operation module is used for performing activation operation on the data to be operated to obtain an operation result, and storing the operation result in the target address,
  • the operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation
  • the operation domain includes the data address to be operated and the target address.
  • the control module is further used to obtain an activation parameter table according to the operation code and/or the operation domain;
  • the operation module is further used to perform the activation operation on the data to be calculated according to the activation parameter table to obtain an operation result
  • the activation parameter table includes an activation table and a constant table.
  • the operation module includes:
  • a plurality of activation operators, used to perform the activation operation on the data to be calculated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • the master operation sub-module is configured to perform the activation operation on the data to be calculated using the plurality of activation operators to obtain an operation result, and store the operation result at the target address.
  • Clause A5 The device according to Clause A1, wherein the operation domain includes a read-in amount or a storage address of the read-in amount,
  • the control module is further used to obtain the read-in amount, and obtain the data to be calculated according to the read-in amount.
  • Clause A6 The device according to Clause A1, the device further comprising:
  • the storage module is used for storing the data to be calculated.
  • the control module includes:
  • An instruction storage sub-module for storing the compiled activation instruction
  • An instruction processing sub-module which is used to parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction;
  • a queue storage sub-module is used to store an instruction queue.
  • the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the compiled activation instructions.
  • the control module further comprises:
  • a dependency processing sub-module, used to: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, cache the first to-be-executed instruction in the instruction storage sub-module, and, after execution of the zeroth to-be-executed instruction is completed, extract the first to-be-executed instruction from the instruction storage sub-module and send it to the operation module,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • the control module is also used to generate an assembly file according to the activation instruction and translate the assembly file into a binary file
  • the binary file is the compiled activation instruction.
  • the activation function utilized by the activation operation includes at least one of the following:
  • linear rectification function, S-shaped growth curve function, hyperbolic tangent function, linear rectification function with leakage, maximum function, and power function.
  • a machine learning computing device comprising:
  • the plurality of activation instruction processing devices may be connected and transmit data through a specific structure
  • a plurality of the activation instruction processing devices interconnect and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of activation instruction processing devices share the same control system or have their own control systems; the plurality of activation instruction processing devices share memory or have their own memories; and the interconnection mode of the plurality of activation instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device according to Clause A11, a universal interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip includes:
  • Clause A14 An electronic device, the electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause A13;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • An activation instruction processing method, applied to an activation instruction processing device, the method comprising:
  • compiling, by the control module, the obtained activation instruction to obtain a compiled activation instruction, parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtaining, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the activation instruction;
  • the operation code is used to indicate that the operation performed by the activation instruction on the data is an activation operation
  • the operation domain includes the data address to be operated and the target address.
  • Clause A17 The method according to Clause A16, the method further comprising:
  • the operation module is used to activate the operation data to obtain the operation result, including:
  • the activation parameter table includes an activation table and a constant table.
  • an operation module is used to activate the data to be operated to obtain an operation result, including:
  • a plurality of activation operators are used to perform the activation operation on the data to be calculated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • the operation module is used to activate the operation data to obtain the operation result, including:
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • obtaining the data to be operated, the activation table, the constant table and the target address required to execute the activation instruction according to the operation code and the operation domain include:
  • Clause A21 The method according to Clause A16, the method further comprising:
  • Clause A22 parsing the compiled activation instruction to obtain the operation code and operation domain of the activation instruction includes:
  • An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled activation instructions.
  • Clause A23 The method according to Clause A22, the method further comprising:
  • when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and, after determining that execution of the zeroth to-be-executed instruction is completed, controlling execution of the first to-be-executed instruction,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • Clause A24 using the control module to compile the obtained activation instruction to obtain the compiled activation instruction includes:
  • the binary file is the compiled activation instruction.
  • the activation function utilized by the activation operation includes at least one of the following:
  • linear rectification function, S-shaped growth curve function, hyperbolic tangent function, linear rectification function with leakage, maximum function, and power function.
  • FIG. 2-1 shows a block diagram of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 2-1, the device includes a control module 6-11 and an arithmetic module 6-12.
  • the control module 6-11 is used to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and operation domain, the data to be calculated and the target address required to execute the linear rectification function activation instruction.
  • the operation code is used to indicate that the activation operation performed by the linear rectification function activation instruction on the data is a linear rectification function activation operation.
  • the operation domain includes the address of the data to be calculated and the target address.
  • the operation module 6-12 is configured to perform a linear rectification function activation operation on the data to be operated, obtain an operation result, and store the operation result in the target address.
  • the linear rectification function activation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware.
  • the control module needs to first compile the linear rectification function activation instruction (uncompiled). After the compiled linear rectification function activation instruction is obtained, the compiled linear rectification function activation instruction can be analyzed.
  • the compiled linear rectification function activation instruction is a hardware instruction that can be directly executed by the hardware.
  • the compiled linear rectification function activation instruction may correspond uniquely to the linear rectification function activation instruction.
  • alternatively, the compiled linear rectification function activation instruction may be a general instruction shared by all activation operations, so that the device can perform the linear rectification function activation operation without distinguishing the activation function corresponding to the instruction, which simplifies processing.
  • the control module can obtain the data to be calculated from the address of the data to be calculated.
  • the control module may determine the data required to perform the linear rectification function activation operation according to the operation code of the linear rectification function activation instruction.
  • the control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
  • the operation code may be the part or field of an instruction (usually represented by a code) that specifies the operation to be performed in a computer program; it is the instruction number that tells the device executing the instruction which instruction to execute.
  • the operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; the data required to execute the instruction include the data to be calculated, the corresponding operation method, and the like.
  • a linear rectification function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • when there is one control module, it may receive the linear rectification function activation instruction and control one or more operation modules to perform the linear rectification function activation operation.
  • when there are multiple control modules, they may respectively receive linear rectification function activation instructions and control the corresponding one or more operation modules to perform linear rectification function activation operations.
  • An embodiment of the present disclosure provides a linear rectification function activation instruction processing device.
  • the device includes a control module and an arithmetic module.
  • the control module is used to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction. Analyze the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain the data to be operated and the target address required to execute the linear rectification function activation instruction according to the operation code and operation domain;
  • the operation module is used to perform the linear rectification function activation operation on the data to be calculated to obtain the operation result, and store the operation result at the target address.
  • the linear rectification function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of applications.
  • the linear rectification function activation instruction has high processing efficiency and fast processing speed, and the linear rectification function activation operation has high processing efficiency and fast processing speed.
  • the control module 6-11 may also be used to obtain a linear rectification activation function parameter table according to the operation code and/or the operation domain.
  • the operation module 6-12 may also be used to perform the linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table, to obtain an operation result.
  • the linear rectification activation function parameter table may include a linear rectification activation function activation table and a linear rectification activation function constant table.
  • the operation domain may include the address of the linear rectification activation function parameter table, so that the control module obtains the parameter table from that address.
  • alternatively, the control module may determine from the operation code that the linear rectification activation function parameter table is required to execute the linear rectification function activation instruction, and directly obtain the parameter table corresponding to the linear rectification function activation instruction from a predetermined storage address of the parameter table.
  • a person skilled in the art can set the acquisition method of the linear rectification activation function parameter table according to actual needs, which is not limited in the present disclosure.
  • the control module may also obtain the activation function corresponding to the linear rectification function activation instruction, so that the operation module can perform the linear rectification function activation operation on the data to be calculated using that activation function and the corresponding operators.
  • the arithmetic module 6-12 may include multiple activation operators 6-120.
  • a plurality of activation operators 6-120 are used to perform linear rectification function activation operations on the data to be operated.
  • alternatively, the operation module may include a single activation operator.
  • the number of activation operators can be set according to the amount of data on which the linear rectification function activation operation is to be performed and the required processing speed and efficiency of that operation; the present disclosure does not limit this.
  • the operation module 6-12 may include a master operation sub-module 6-121 and a plurality of slave operation sub-modules 6-122, where the master operation sub-module 6-121 includes the plurality of activation operators 6-120 (not shown in the figure).
  • the master operation sub-module 6-121 is used to perform the linear rectification function activation operation on the data to be calculated using the plurality of activation operators, obtain the operation result, and store the operation result at the target address.
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 6-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.
  • the read-in amount may be the amount or the size of the data to be calculated that is to be acquired.
  • when the operation domain directly contains the specific value of the read-in amount, that value can be used as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • alternatively, a plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the acquired data amount of the plurality of data to be calculated may be less than or equal to the default read-in amount.
  • in this way, the amount of data to be calculated can be limited, which helps ensure the accuracy of the operation result and ensures that the device is able to execute the linear rectification function activation instruction.
  • the device may further include a storage module 6-13.
  • the storage modules 6-13 are used to store data to be calculated.
  • the storage modules 6-13 can also be used to store the linear rectification activation function parameter table.
  • the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed temporary storage cache.
  • the data to be calculated and the linear rectification activation function parameter table can be stored in the cache and/or register of the storage module as needed, and the disclosure does not limit this.
  • the device may further include a direct memory access module, which is used to read data from or store data into the storage module.
  • the control module 6-11 can also be used to generate an assembly file according to the linear rectification function activation instruction and translate the assembly file into a binary file.
  • the binary file is the compiled linear rectification function activation instruction.
  • the instruction format of the linear rectification function activation instruction may be:
  • active.relu is the operation code of the linear rectification function activation instruction
  • dst, src0, and size are the operation domains of the linear rectification function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • size is the read-in amount.
  • alternatively, the instruction format of the linear rectification function activation instruction may be:
  • active.relu is the operation code of the linear rectification function activation instruction
  • dst, src0, src1, and size are the operation domains of the linear rectification function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • src1 is the address of the linear rectification activation function parameter table
  • size is the read-in amount.
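  • The two instruction formats above might be decoded as in the sketch below. The source does not specify the assembly syntax, so this assumes integer operands separated by commas and/or spaces, for example `active.relu 500 100 64`; `ReluInstruction` and `parse_relu` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReluInstruction:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter table address (4-operand form only)


def parse_relu(text: str) -> ReluInstruction:
    """Decode 'active.relu dst, src0, size' or 'active.relu dst, src0, src1, size'."""
    opcode, _, rest = text.strip().partition(" ")
    if opcode != "active.relu":
        raise ValueError(f"unexpected operation code: {opcode!r}")
    # assumed syntax: integer operands separated by commas and/or spaces
    fields = [int(tok) for tok in rest.replace(",", " ").split()]
    if len(fields) == 3:
        dst, src0, size = fields
        return ReluInstruction(opcode, dst, src0, size)
    if len(fields) == 4:
        dst, src0, src1, size = fields
        return ReluInstruction(opcode, dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(fields)}")
```

`src1` is placed last in the dataclass only so that it can default to `None` for the three-operand form; the operand order in the text itself is unchanged.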
  • the device may be provided in a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
  • FIG. 2-3 shows a schematic diagram of an application scenario of a linear rectification function activation instruction processing device according to an embodiment of the present disclosure.
  • the linear rectification function activation instruction processing device processes the linear rectification function activation instruction as follows:
  • the control module 6-11 compiles the obtained linear rectification function activation instruction 1 (for example, active.relu 500 100 64) to obtain the compiled linear rectification function activation instruction 1, and parses the compiled instruction 1 to obtain the operation code and operation domain of linear rectification function activation instruction 1, where:
  • the operation code of the linear rectification function activation instruction 1 is active.relu
  • the target address is 500
  • the data address to be calculated is 100
  • the read-in amount is 64.
  • the control module 6-11 acquires the data to be calculated, with a data amount of 64 (the read-in amount), from the data address 100. Assuming that the activation operation needs to be performed according to the linear rectification activation function parameter table, the control module 6-11 also needs to obtain the linear rectification activation function parameter table (see the description above for the specific implementation).
  • the operation module 6-12 performs linear rectification function activation calculation on the operation data according to the linear rectification activation function parameter table, obtains the operation result, and stores the operation result in the target address 500.
  • the linear rectification function activation instruction processing device can efficiently and quickly process the linear rectification function activation instruction, and realize the efficient and rapid processing of the linear rectification function activation operation.
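  • The scenario above can be simulated with a flat Python list standing in for memory. This is a hypothetical model rather than the device's behaviour: it computes max(0, x) directly instead of using the parameter table, and the sample input values are made up for the example.

```python
from typing import List


def execute_relu(memory: List[float], dst: int, src0: int, size: int) -> None:
    """Hypothetical execution of 'active.relu dst src0 size' on a flat memory.

    Reads `size` elements starting at src0, applies max(0, x) to each, and
    writes the results starting at dst. The parameter-table path described
    in the text is omitted.
    """
    operands = memory[src0:src0 + size]
    results = [max(0.0, x) for x in operands]
    memory[dst:dst + len(results)] = results


# Instruction 1 from the scenario: active.relu 500 100 64
memory = [0.0] * 1024
for i in range(64):
    memory[100 + i] = float(i - 32)  # sample inputs -32.0 .. 31.0
execute_relu(memory, dst=500, src0=100, size=64)
```

After the call, addresses 500 to 532 hold 0.0 (the non-positive inputs), address 533 holds 1.0, and address 563 holds 31.0, while the source data at address 100 onward is left unchanged.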
  • FIG. 2-4 illustrates a flowchart of a method for processing a linear rectification function activation instruction according to an embodiment of the present disclosure.
  • the method is applied to the above linear rectification function activation instruction processing device, and the method includes step S51-6 and step S52-6.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in executing the method, and the processor is used to perform the related processing and operation steps, such as steps S51-6 to S52-6 below.
  • step S51-6: the control module compiles the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parses the compiled instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtains, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the linear rectification function activation instruction.
  • the operation code indicates that the activation operation performed on the data by the linear rectification function activation instruction is the linear rectification function activation operation, and the operation domain includes the address of the data to be calculated and the target address.
  • step S52-6: the operation module performs the linear rectification function activation operation on the data to be calculated to obtain the operation result, and stores the operation result in the target address.
  • the method may further include: obtaining a linear rectification activation function parameter table according to the operation code and/or the operation domain.
  • in this case, using the operation module to perform the linear rectification function activation operation on the data to be calculated to obtain the operation result includes: performing the linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table to obtain the operation result.
  • the linear rectification activation function parameter table may include a linear rectification activation function activation table and a linear rectification activation function constant table.
  • using the operation module to perform a linear rectification function activation operation on the operation data to obtain an operation result may include: using multiple activation operators to perform a linear rectification function activation operation on the operation data.
  • the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators.
  • using the operation module to perform the activation operation on the data to be calculated to obtain the operation result may include:
  • a plurality of activation operators in the main operation sub-module are used to perform linear rectification function activation operation on the operation data to obtain an operation result.
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • obtaining the data to be operated and the target address required to execute the linear rectification function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and obtaining the data to be operated according to the read-in amount.
  • the method may further include: storing data to be calculated.
  • parsing the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction may include:
  • storing an instruction queue, the instruction queue including a plurality of instructions to be executed that are arranged in sequence according to the execution order, the plurality of instructions to be executed including the compiled linear rectification function activation instruction.
  • the method may further include: when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after determining that execution of the zeroth instruction to be executed is complete,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
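  • The association test above amounts to an interval-overlap check. The sketch below assumes half-open address intervals [start, end) and checks a candidate against every earlier instruction still in the queue, whereas the text describes the check against a single zeroth instruction; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) storage address interval


@dataclass
class PendingInstruction:
    name: str
    intervals: List[Interval]  # address intervals holding the data it needs


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def is_associated(first: PendingInstruction, zeroth: PendingInstruction) -> bool:
    """True if any interval of `first` overlaps any interval of `zeroth`."""
    return any(overlaps(a, b) for a in first.intervals for b in zeroth.intervals)


def must_wait(queue: List[PendingInstruction], index: int) -> bool:
    """Should queue[index] be cached until earlier instructions finish?

    Every instruction before `index` is assumed to be not yet completed.
    """
    first = queue[index]
    return any(is_associated(first, earlier) for earlier in queue[:index])
```

For instance, an instruction reading [540, 600) must wait behind a ReLU instruction that writes [500, 564), while one touching only [600, 700) can proceed.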
  • compiling the obtained linear rectification function activation instruction by the control module to obtain the compiled linear rectification function activation instruction may include: generating an assembly file according to the linear rectification function activation instruction, and translating the assembly file into a binary file.
  • the binary file is the compiled linear rectification function activation instruction.
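  • The two-step compilation (assembly file, then binary file) might look like the sketch below. The opcode values and the 14-byte record layout are invented for illustration, since the source gives no encoding, and only the three-operand format is handled. The assembly and binary "files" are kept as an in-memory string and bytes object so the example is self-contained.

```python
import struct

# Hypothetical opcode table; the real encoding is not given in the source.
OPCODES = {"active.relu": 0x01, "active.sigmoid": 0x02}


def to_assembly(opcode: str, dst: int, src0: int, size: int) -> str:
    """Emit one line of assembly text for an activation instruction."""
    return f"{opcode} {dst}, {src0}, {size}\n"


def assemble(asm_text: str) -> bytes:
    """Translate assembly text into a binary image, one record per line.

    Each record is little-endian: a 16-bit opcode followed by three 32-bit
    unsigned operands (dst, src0, size), i.e. 14 bytes per instruction.
    """
    out = bytearray()
    for line in asm_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        mnemonic, _, rest = line.strip().partition(" ")
        dst, src0, size = (int(tok) for tok in rest.split(","))
        out += struct.pack("<HIII", OPCODES[mnemonic], dst, src0, size)
    return bytes(out)
```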
  • the linear rectification function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of application and offers high processing efficiency and speed both for the linear rectification function activation instruction and for performing the linear rectification function activation operation.
  • a linear rectification function activation instruction processing device, comprising:
  • a control module, configured to compile the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the linear rectification function activation instruction; and
  • an operation module, configured to perform the linear rectification function activation operation on the data to be calculated to obtain an operation result, and store the operation result in the target address,
  • wherein the operation code indicates that the activation operation performed on the data by the linear rectification function activation instruction is the linear rectification function activation operation, and the operation domain includes the address of the data to be calculated and the target address.
  • the control module is further configured to obtain a linear rectification activation function parameter table according to the operation code and/or the operation domain;
  • the operation module is further configured to perform the linear rectification function activation operation on the data to be calculated according to the linear rectification activation function parameter table to obtain the operation result;
  • the linear rectification activation function parameter table includes a linear rectification activation function activation table and a linear rectification activation function constant table.
  • the operation module includes:
  • a plurality of activation operators are used to perform a linear rectification function activation operation on the data to be operated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • the main operation sub-module is configured to use the plurality of activation operators to perform a linear rectification function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.
  • Clause B5. The device according to Clause B1, wherein the operation domain further includes a read-in amount or a storage address of the read-in amount, and
  • the control module is further configured to obtain the read-in amount and obtain the data to be calculated according to the read-in amount.
  • Clause B6. The device according to Clause B1, further comprising:
  • a storage module, configured to store the data to be calculated.
  • the control module includes:
  • an instruction storage sub-module, configured to store the compiled linear rectification function activation instruction;
  • an instruction processing sub-module, configured to parse the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction; and
  • a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed that are arranged in sequence according to the execution order, the plurality of instructions to be executed including the compiled linear rectification function activation instruction.
  • the control module further includes:
  • a dependency processing sub-module, configured to, when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed is complete, extract the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
  • wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • the control module is also used to generate an assembly file according to the linear rectification function activation instruction and translate the assembly file into a binary file,
  • the binary file is the compiled linear rectification function activation instruction.
  • Clause B10. A machine learning computing device, comprising:
  • one or more linear rectification function activation instruction processing devices according to any one of Clauses B1 to B9, configured to obtain data to be calculated and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
  • when the machine learning computing device includes a plurality of the linear rectification function activation instruction processing devices, the plurality of devices may be connected and transmit data through a specific structure;
  • the plurality of linear rectification function activation instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; they share the same control system or have their own control systems, share memory or have their own memory, and may be interconnected in any interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device according to Clause B10, a universal interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • Clause B12. A machine learning chip, comprising:
  • Clause B13. An electronic device, comprising:
  • a board card, comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause B12;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • Clause B15. A method for processing a linear rectification function activation instruction.
  • the method is applied to a linear rectification function activation instruction processing apparatus.
  • the method includes:
  • using the control module to compile the obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction, parsing the compiled instruction to obtain the operation code and operation domain of the linear rectification function activation instruction, and obtaining, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the linear rectification function activation instruction;
  • the operation code is used to indicate that the activation operation performed by the linear rectification function activation instruction on the data is a linear rectification function activation operation, and the operation domain includes the data address to be operated and the target address.
  • Clause B16. The method according to Clause B15, further comprising:
  • using the operation module to perform the linear rectification function activation operation on the data to be calculated to obtain the operation result includes:
  • the linear rectification activation function parameter table includes a linear rectification activation function activation table and a linear rectification activation function constant table.
  • using the operation module to perform the linear rectification function activation operation on the data to be calculated to obtain the operation result includes:
  • a plurality of activation operators are used to perform a linear rectification function activation operation on the data to be operated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • using the operation module to perform the linear rectification function activation operation on the data to be calculated to obtain the operation result includes:
  • a plurality of activation operators in the main operation sub-module are used to perform a linear rectification function activation operation on the data to be operated to obtain an operation result.
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • obtaining the data to be operated and the target address required to execute the linear rectification function activation instruction according to the operation code and the operation domain includes:
  • Clause B20. The method according to Clause B15, further comprising:
  • Clause B21. The method according to Clause B15, wherein parsing the compiled linear rectification function activation instruction to obtain the operation code and operation domain of the linear rectification function activation instruction includes:
  • storing an instruction queue, the instruction queue including a plurality of instructions to be executed that are arranged in sequence according to the execution order, the plurality of instructions to be executed including the compiled linear rectification function activation instruction.
  • Clause B22. The method according to Clause B21, further comprising:
  • when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after determining that execution of the zeroth instruction to be executed is complete, controlling execution of the first instruction to be executed,
  • wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • Clause B23. The method according to Clause B15, wherein using the control module to compile the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction includes:
  • the binary file is the compiled linear rectification function activation instruction.
  • FIG. 3-1 shows a block diagram of an S-shaped growth curve (sigmoid) function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3-1, the device includes a control module 7-11 and an operation module 7-12.
  • the control module 7-11 is configured to compile the acquired S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction.
  • the operation code indicates that the activation operation performed on the data by the S-shaped growth curve function activation instruction is the S-shaped growth curve function activation operation.
  • the operation domain includes the address of the data to be calculated and the target address.
  • the operation module 7-12 is used to perform S-shaped growth curve function activation operation on the data to be calculated, obtain the operation result, and store the operation result in the target address.
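  • Assuming the S-shaped growth curve function is the standard logistic sigmoid σ(x) = 1 / (1 + e^(-x)), as the `active.sigmoid` operation code suggests, the element-wise operation could be sketched as below. Branching on the sign of x is a common way to keep `math.exp` from overflowing for large negative inputs; `execute_sigmoid` and its flat-memory model are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic (S-shaped growth curve) function 1 / (1 + exp(-x)).

    Branching on the sign keeps exp() from overflowing for large |x|.
    """
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def execute_sigmoid(memory: List[float], dst: int, src0: int, size: int) -> None:
    """Hypothetical execution: read `size` values at src0, write sigmoid at dst."""
    results = [sigmoid(x) for x in memory[src0:src0 + size]]
    memory[dst:dst + len(results)] = results
```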
  • the S-shaped growth curve function activation instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware.
  • the control module therefore first compiles the (uncompiled) S-shaped growth curve function activation instruction; once the compiled S-shaped growth curve function activation instruction is obtained, it can be parsed.
  • the compiled S-shaped growth curve function activation instruction is a hardware instruction that can be directly executed by the hardware.
  • the compiled S-shaped growth curve function activation instruction may be an instruction that corresponds uniquely to the S-shaped growth curve function activation instruction.
  • alternatively, the compiled S-shaped growth curve function activation instruction may be a general instruction covering all activation operation instructions, so that when processing the S-shaped growth curve function activation instruction, the device can perform the S-shaped growth curve function activation operation without distinguishing the activation function corresponding to the instruction, which simplifies instruction processing.
  • the control module can obtain the data to be calculated from the address of the data to be calculated.
  • the control module may determine the data required for the S-shaped growth curve function activation operation according to the operation code of the S-shaped growth curve function activation instruction.
  • the control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
  • the operation code may be the part of an instruction or field (usually represented by a code) that specifies, in a computer program, the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction needs to be executed.
  • the operation domain may be the source of all the data required to execute the corresponding instruction, such as the corresponding addresses. The data required to execute the corresponding instruction include the data to be calculated, the corresponding operation method, and so on.
  • an S-shaped growth curve function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • when there is one control module, it may receive the S-shaped growth curve function activation instruction and control one or more operation modules to perform the S-shaped growth curve function activation operation.
  • when there are multiple control modules, they may each receive an S-shaped growth curve function activation instruction and control the corresponding one or more operation modules to perform the S-shaped growth curve function activation operation.
  • An embodiment of the present disclosure provides an S-shaped growth curve function activation instruction processing device.
  • the device includes a control module and an arithmetic module.
  • the control module is configured to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction; the operation module is configured to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result, and store the operation result in the target address.
  • the S-shaped growth curve function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of application and offers high processing efficiency and speed both for the S-shaped growth curve function activation instruction and for performing the S-shaped growth curve function activation operation.
  • the control module 7-11 can also obtain the S-shaped growth curve activation function parameter table according to the operation code and/or the operation domain.
  • the operation module can also perform the S-shaped growth curve function activation operation on the data to be calculated according to the S-shaped growth curve activation function parameter table to obtain the operation result.
  • the S-shaped growth curve activation function parameter table may include an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.
  • the operation domain may include the address of the S-shaped growth curve activation function parameter table, so that the control module obtains the parameter table from that address.
  • the control module may determine, according to the operation code, that the S-shaped growth curve activation function parameter table is required to execute the S-shaped growth curve function activation instruction, and may directly obtain the table from the predetermined storage address of the S-shaped growth curve activation function parameter table.
  • alternatively, the control module may determine, according to the operation code, that the S-shaped growth curve activation function parameter table is required to execute the S-shaped growth curve function activation instruction, and may directly obtain the S-shaped growth curve activation function parameter table corresponding to the S-shaped growth curve function activation instruction from the storage address of a predetermined parameter table.
  • the control module can also obtain the activation function corresponding to the S-shaped growth curve function activation instruction, so that the operation module can perform the S-shaped growth curve function activation operation on the data to be calculated according to the activation function and the corresponding operator.
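  • The source does not say what the activation table and constant table contain. One common hardware-friendly option, shown here purely as a hypothetical interpretation, is a piecewise-linear approximation: a table of segment breakpoints (standing in for the activation table) plus a table of per-segment (slope, intercept) constants (standing in for the constant table), with inputs clamped to the table range. All names and the range and segment count are assumptions.

```python
import bisect
import math
from typing import List, Tuple


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def build_tables(lo: float, hi: float,
                 segments: int) -> Tuple[List[float], List[Tuple[float, float]]]:
    """Build a hypothetical piecewise-linear sigmoid approximation on [lo, hi].

    breakpoints plays the role of an "activation table" (segment boundaries);
    coeffs plays the role of a "constant table" ((slope, intercept) per segment).
    """
    step = (hi - lo) / segments
    breakpoints = [lo + i * step for i in range(segments + 1)]
    coeffs = []
    for x0, x1 in zip(breakpoints, breakpoints[1:]):
        slope = (_sigmoid(x1) - _sigmoid(x0)) / (x1 - x0)
        coeffs.append((slope, _sigmoid(x0) - slope * x0))
    return breakpoints, coeffs


def table_sigmoid(x: float, breakpoints: List[float],
                  coeffs: List[Tuple[float, float]]) -> float:
    """Evaluate the approximation; inputs outside the table range are clamped."""
    x = min(max(x, breakpoints[0]), breakpoints[-1])
    # index of the segment containing x, kept within [0, len(coeffs) - 1]
    i = min(bisect.bisect_right(breakpoints, x) - 1, len(coeffs) - 1)
    slope, intercept = coeffs[i]
    return slope * x + intercept
```

With 64 segments on [-8, 8] the approximation error is on the order of 1e-3 or less; more segments trade table size for accuracy.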
  • the arithmetic module 7-12 may include multiple activation operators 7-120.
  • a plurality of activation operators 7-120 are used to perform S-shaped growth curve function activation calculation on the data to be calculated.
  • alternatively, the operation module may include a single activation operator.
  • the number of activation operators can be set according to the amount of data required for the S-shaped growth curve function activation operation, the processing speed and efficiency of the S-shaped growth curve function activation operation, and the disclosure does not limit this.
  • the operation module 7-12 may include a master operation sub-module 7-121 and a plurality of slave operation sub-modules 7-122, and the master operation sub-module 7-121 includes multiple activation operators 7-120 (not shown in the figure).
  • the main operation sub-module 7-121 is used to perform S-shaped growth curve function activation calculation on the data to be calculated by using a plurality of activation operators to obtain operation results, and store the operation results in the target address.
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 7-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.
  • the read-in amount may be the amount of data to be calculated that is acquired, that is, the size of the acquired data to be calculated.
  • when the operation domain directly contains the specific value of the read-in amount, that value can be determined as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • when the operation domain contains neither, a plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the data amount of the acquired plurality of data to be calculated may be less than or equal to the default read-in amount.
  • in this way, the amount and size of the data to be calculated can be bounded, the accuracy of the operation result can be ensured, and the device can execute the S-shaped growth curve function activation instruction.
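The three ways of determining the read-in amount described above (an immediate value, a value stored at an address, or a preset default) can be sketched as follows. This is a minimal illustration, assuming memory is modeled as a Python dict from address to value; `resolve_read_in`, `fetch_operands`, and `DEFAULT_READ_IN` (including its value of 32) are hypothetical.

```python
from typing import Optional

DEFAULT_READ_IN = 32  # hypothetical preset default read-in amount

def resolve_read_in(immediate: Optional[int], addr: Optional[int], memory: dict) -> int:
    if immediate is not None:   # operation domain carries the value itself
        return immediate
    if addr is not None:        # operation domain carries the value's storage address
        return memory[addr]
    return DEFAULT_READ_IN      # neither present: fall back to the preset default

def fetch_operands(memory: dict, src: int, read_in: int, available: int) -> list:
    # The amount fetched never exceeds the read-in amount; it may be smaller
    # when fewer elements are available at the source address.
    count = min(read_in, available)
    return [memory[src + i] for i in range(count)]
```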
  • the device may further include a storage module 7-13.
  • the storage module 7-13 is used to store data to be calculated.
  • the storage module 7-13 can also be used to store the S-shaped growth curve activation function parameter table.
  • the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed scratchpad cache.
  • the data to be calculated and the S-shaped growth curve activation function parameter table can be stored in the cache and/or register of the storage module as needed, and the disclosure does not limit this.
  • the device may further include a direct memory access module, which is used to read data from or store data to the storage module.
  • control module 7-11 is also used to generate an assembly file according to the S-shaped growth curve function activation instruction and translate the assembly file into a binary file, where the binary file is the compiled S-shaped growth curve Function activation instruction.
  • the instruction format of the S-shaped growth curve function activation instruction may be: active.sigmoid dst, src0, size, where:
  • active.sigmoid is the operation code of the S-shaped growth curve function activation instruction
  • dst, src0, and size are the operation domains of the S-shaped growth curve function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • size is the read-in amount.
  • the instruction format of the S-shaped growth curve function activation instruction may also be: active.sigmoid dst, src0, src1, size, where:
  • active.sigmoid is the operation code of the S-shaped growth curve function activation instruction
  • dst, src0, src1, and size are the operation domains of the S-shaped growth curve function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • src1 is the address of the S-shaped growth curve activation function parameter table
  • size is the read-in amount.
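The two formats differ only in whether the parameter-table address src1 is present. A minimal parser sketch follows, assuming a textual syntax of the mnemonic followed by comma-separated decimal operands (the disclosure lists the fields but not their concrete syntax); `SigmoidInstr` and `parse` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SigmoidInstr:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter-table address (second format only)

def parse(text: str) -> SigmoidInstr:
    opcode, _, rest = text.strip().partition(" ")
    if opcode != "active.sigmoid":
        raise ValueError(f"unexpected opcode {opcode!r}")
    fields = [int(tok) for tok in rest.replace(",", " ").split()]
    if len(fields) == 3:        # active.sigmoid dst, src0, size
        dst, src0, size = fields
        return SigmoidInstr(opcode, dst, src0, size)
    if len(fields) == 4:        # active.sigmoid dst, src0, src1, size
        dst, src0, src1, size = fields
        return SigmoidInstr(opcode, dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(fields)}")
```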
  • the device may be arranged in a graphics processing unit (GPU), a central processing unit (CPU), and/or an embedded neural-network processing unit (NPU).
  • FIG. 3-3 shows a schematic diagram of an application scenario of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure.
  • the S-shaped growth curve function activation instruction processing device processes the S-shaped growth curve function activation instruction as follows:
  • the control module 7-11 compiles the acquired S-shaped growth curve function activation instruction 1 to obtain the compiled S-shaped growth curve function activation instruction 1 (for example, S-shaped growth curve function activation instruction 1 is active.sigmoid 500, 100, 64), and parses the compiled instruction 1 to obtain the operation code and operation domain of S-shaped growth curve function activation instruction 1.
  • the operation code of the S-shaped growth curve function activation instruction 1 is active.sigmoid
  • the target address is 500
  • the data address to be calculated is 100
  • the read-in amount is 64.
  • the control module 7-11 acquires the data to be calculated, with a data amount of 64 (the read-in amount), starting from data address 100. Assuming that the activation operation is to be performed according to the S-shaped growth curve activation function parameter table, the control module 7-11 also needs to obtain that parameter table (see the description above for the specific implementation).
  • the operation module 7-12 performs the S-shaped growth curve function activation operation on the data to be calculated according to the S-shaped growth curve activation function parameter table, obtains the operation result, and stores the operation result at target address 500.
  • in this way, the S-shaped growth curve function activation instruction processing device can process the S-shaped growth curve function activation instruction efficiently and quickly, and performs the S-shaped growth curve function activation operation with high efficiency and speed.
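Taking instruction 1 (active.sigmoid 500, 100, 64) as input, the scenario above can be traced with the short simulation below. It is a sketch under stated assumptions: memory is a flat Python list indexed by address in element units, the activation is evaluated directly as 1/(1+e^(-x)) rather than through a parameter table, and `run_sigmoid` and the sample input values are hypothetical.

```python
import math

def run_sigmoid(memory: list, dst: int, src0: int, size: int) -> None:
    # Read `size` elements starting at src0, apply the logistic function,
    # and write the results starting at dst (toy element-addressed memory).
    operands = memory[src0:src0 + size]
    memory[dst:dst + size] = [1.0 / (1.0 + math.exp(-x)) for x in operands]

memory = [0.0] * 1024
memory[100:164] = [(i - 32) / 8.0 for i in range(64)]  # sample inputs from -4.0 to 3.875
run_sigmoid(memory, dst=500, src0=100, size=64)        # active.sigmoid 500, 100, 64
```

After the call, addresses 500 to 563 hold the 64 results; for example, the input 0.0 at address 132 maps to 0.5 at address 532.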
  • FIGS. 3-4 illustrate a flowchart of an S-shaped growth curve function activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, this method is applied to the above S-shaped growth curve function activation instruction processing device.
  • the method includes steps S51-7 and S52-7.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in executing the method, and the processor is used to perform the related processing and operation steps, such as steps S51-7 to S52-7.
  • step S51-7: the control module compiles the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, parses the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtains, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the instruction.
  • the operation code indicates that the activation operation performed on the data by the S-shaped growth curve function activation instruction is the S-shaped growth curve function activation operation, and the operation domain includes the address of the data to be calculated and the target address.
  • step S52-7: the operation module performs the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result, and stores the operation result at the target address.
  • the method may further include: obtaining an S-shaped growth curve activation function parameter table according to the operation code and/or the operation domain.
  • using the operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result may then include: performing the S-shaped growth curve function activation operation on the data to be calculated according to the parameter table to obtain the operation result.
  • the S-shaped growth curve activation function parameter table includes an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.
  • using the operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result may include: performing the activation operation on the data to be calculated using a plurality of activation operators.
  • the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators.
  • using the operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain the operation result may include: performing the S-shaped growth curve function activation operation using the plurality of activation operators in the master operation sub-module to obtain the operation result.
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • obtaining the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and obtaining the data to be calculated according to the read-in amount.
  • the method may further include: storing data to be calculated.
  • using the control module to parse the compiled S-shaped growth curve function activation instruction to obtain its operation code and operation domain may include:
  • storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.
  • the method may further include: when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after determining that execution of the zeroth instruction to be executed is complete,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed preceding it includes:
  • the first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
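The association test reduces to an interval-overlap check. The sketch below, assuming half-open address ranges and a hypothetical fixed per-instruction latency in cycles, shows how an in-order queue would hold (cache) an instruction until an overlapping earlier instruction finishes. `Instr`, `overlaps`, `schedule`, and the cycle model are illustrative, not the disclosed hardware.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    reads: tuple   # half-open address range (start, end) of the source data
    writes: tuple  # half-open address range (start, end) of the destination
    latency: int   # hypothetical execution time in cycles

def overlaps(a: tuple, b: tuple) -> bool:
    # [a0, a1) and [b0, b1) share an address iff each starts before the other ends
    return a[0] < b[1] and b[0] < a[1]

def schedule(queue: list) -> dict:
    """Issue instructions in queue order, at most one per cycle. An instruction
    whose address intervals overlap those of an earlier, still-running
    instruction is held (cached) until that instruction completes."""
    issue, started, cycle = {}, [], 0   # started: list of (finish_cycle, instr)
    for ins in queue:
        mine = (ins.reads, ins.writes)
        for finish, prev in started:
            theirs = (prev.reads, prev.writes)
            if any(overlaps(x, y) for x in mine for y in theirs):
                cycle = max(cycle, finish)  # wait for the zeroth instruction
        issue[ins.name] = cycle
        started.append((cycle + ins.latency, ins))
        cycle += 1
    return issue
```

For example, if i1 reads the range that i0 writes, i1 is held until i0 finishes; a later independent instruction still issues behind i1 because the queue is in order.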
  • compiling the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction may include:
  • generating an assembly file according to the S-shaped growth curve function activation instruction, and translating the assembly file into a binary file.
  • the binary file is the compiled S-shaped growth curve function activation instruction.
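The assembly-to-binary translation can be illustrated with a toy encoder. The disclosure does not define the binary layout, so the layout here (a 1-byte opcode, a 1-byte operand count, then 32-bit little-endian operands) and the opcode values in `OPCODES` are purely hypothetical; only the mnemonics come from the text.

```python
import struct

# Hypothetical opcode numbers; the mnemonics are from the disclosure
OPCODES = {"active.sigmoid": 0x21, "active.exps": 0x22}

def assemble(line: str) -> bytes:
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [int(tok) for tok in rest.replace(",", " ").split()]
    # 1-byte opcode, 1-byte operand count, then unsigned 32-bit operands
    return struct.pack(f"<BB{len(operands)}I", OPCODES[mnemonic], len(operands), *operands)

def disassemble(blob: bytes) -> str:
    opcode, count = struct.unpack_from("<BB", blob)
    operands = struct.unpack_from(f"<{count}I", blob, 2)
    name = {v: k for k, v in OPCODES.items()}[opcode]
    return f"{name} " + ", ".join(str(x) for x in operands)
```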
  • the S-shaped growth curve function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of applications, processes the S-shaped growth curve function activation instruction with high efficiency and speed, and performs the S-shaped growth curve function activation operation with high efficiency and speed.
  • an S-shaped growth curve function activation instruction processing device comprising:
  • a control module, configured to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction, parse the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the instruction; and
  • an operation module, configured to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain an operation result, and store the operation result at the target address,
  • the operation code is used to indicate that the activation operation performed by the S-shaped growth curve function activation instruction on the data is an S-shaped growth curve function activation operation
  • the operation domain includes the data address to be operated and the target address.
  • the control module is further configured to obtain an S-shaped growth curve activation function parameter table according to the operation code and/or the operation domain;
  • the operation module is further configured to perform the S-shaped growth curve function activation operation on the data to be calculated according to the S-shaped growth curve activation function parameter table to obtain an operation result,
  • where the S-shaped growth curve activation function parameter table includes an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.
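The disclosure does not specify what the activation table and the constant table contain. One common way such a pair of tables is used is piecewise-linear approximation, and the sketch below assumes exactly that: the "activation table" holds a slope per segment, the "constant table" holds an intercept per segment, and each element then needs one lookup and one multiply-add. The range [-8, 8], the 64 segments, and all names are hypothetical.

```python
import math

LO, HI, SEGMENTS = -8.0, 8.0, 64   # hypothetical table range and size
STEP = (HI - LO) / SEGMENTS

def build_tables():
    # For each segment [x0, x0 + STEP], store the chord of the sigmoid:
    # slope -> "activation table", intercept -> "constant table" (assumed roles)
    slopes, intercepts = [], []
    for i in range(SEGMENTS):
        x0 = LO + i * STEP
        y0 = 1.0 / (1.0 + math.exp(-x0))
        y1 = 1.0 / (1.0 + math.exp(-(x0 + STEP)))
        k = (y1 - y0) / STEP
        slopes.append(k)
        intercepts.append(y0 - k * x0)
    return slopes, intercepts

def sigmoid_lut(x: float, slopes: list, intercepts: list) -> float:
    x = min(max(x, LO), HI)                         # clamp to the table range
    i = min(int((x - LO) / STEP), SEGMENTS - 1)     # segment index
    return slopes[i] * x + intercepts[i]            # one multiply-add per element
```

With chord interpolation the error shrinks roughly quadratically with segment width, so doubling the table size cuts the error by about a factor of four.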
  • the operation module includes:
  • a plurality of activation operators, configured to perform the S-shaped growth curve function activation operation on the data to be calculated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • the main operation sub-module is configured to use the plurality of activation operators to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address.
  • the operation domain further includes a read-in amount or a storage address of the read-in amount;
  • the control module is further configured to obtain the read-in amount, and obtain the data to be calculated according to the read-in amount.
  • Clause C6 The device according to Clause C1, the device further comprising:
  • the storage module is used for storing the data to be calculated.
  • the control module includes:
  • an instruction storage sub-module, configured to store the compiled S-shaped growth curve function activation instruction;
  • an instruction processing sub-module, configured to parse the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction; and
  • a queue storage sub-module, configured to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged sequentially in execution order, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.
  • the control module further comprises:
  • a dependency processing sub-module, configured to: when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed preceding it, cache the first instruction to be executed in the instruction storage sub-module, and after execution of the zeroth instruction to be executed is complete, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed preceding it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • the control module is also used to generate an assembly file according to the S-shaped growth curve function activation instruction, and translate the assembly file into a binary file,
  • the binary file is the compiled S-shaped growth curve function activation instruction.
  • a machine learning computing device comprising:
  • one or more S-shaped growth curve function activation instruction processing devices as described in any one of Clauses C1-C9, configured to obtain data to be calculated and control information from other processing devices, perform specified machine learning operations, and transfer the execution results to the other processing devices through an I/O interface;
  • the machine learning operation device includes a plurality of the S-shaped growth curve function activation instruction processing devices
  • the plurality of the S-shaped growth curve function activation instruction processing devices can be connected and transmit data through a specific structure
  • the plurality of S-shaped growth curve function activation instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of S-shaped growth curve function activation instruction processing devices share the same control system or each have their own control system; they share memory or each have their own memory; and they may be interconnected in any interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device according to Clause C10, a universal interconnection interface, and another processing device;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip includes:
  • Clause C13 An electronic device, the electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause C12;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • An S-shaped growth curve function activation instruction processing method, applied to an S-shaped growth curve function activation instruction processing device, the method comprising:
  • using the control module to compile the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, parsing the compiled instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction, and obtaining, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the instruction;
  • the operation code is used to indicate that the activation operation performed by the S-shaped growth curve function activation instruction on the data is an S-shaped growth curve function activation operation
  • the operation domain includes the data address to be operated and the target address.
  • Clause C16 The method according to Clause C15, the method further comprising: obtaining an S-shaped growth curve activation function parameter table according to the operation code and/or the operation domain;
  • wherein using an operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain an operation result includes: performing the activation operation on the data to be calculated according to the S-shaped growth curve activation function parameter table to obtain the operation result,
  • where the S-shaped growth curve activation function parameter table includes an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.
  • using an operation module to perform the S-shaped growth curve function activation operation on the data to be calculated to obtain an operation result includes:
  • performing the S-shaped growth curve function activation operation on the data to be calculated using a plurality of activation operators.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • using an operation module to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result includes:
  • performing the S-shaped growth curve function activation operation using the plurality of activation operators in the master operation sub-module to obtain an operation result.
  • the operation domain further includes a read-in amount or a storage address of the read-in amount;
  • obtaining the data to be calculated and the target address required to execute the S-shaped growth curve function activation instruction according to the operation code and the operation domain includes: obtaining the read-in amount, and obtaining the data to be calculated according to the read-in amount.
  • Clause C20 The method according to Clause C15, the method further comprising: storing the data to be calculated.
  • Clause C21 The method according to Clause C15, wherein parsing the compiled S-shaped growth curve function activation instruction to obtain the operation code and operation domain of the S-shaped growth curve function activation instruction includes:
  • storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled S-shaped growth curve function activation instruction.
  • Clause C22 The method according to Clause C21, the method further comprising:
  • when it is determined that the first instruction to be executed among the plurality of instructions to be executed is associated with the zeroth instruction to be executed preceding it, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after determining that execution of the zeroth instruction to be executed is complete,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed preceding it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
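  • Clause C23 The method according to Clause C15, wherein using the control module to compile the obtained S-shaped growth curve function activation instruction to obtain the compiled S-shaped growth curve function activation instruction includes: generating an assembly file according to the S-shaped growth curve function activation instruction, and translating the assembly file into a binary file,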
  • the binary file is the compiled S-shaped growth curve function activation instruction.
  • FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.
  • the device includes a control module 12-11 and an arithmetic module 12-12.
  • the control module 12-11 is used to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, and obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and operation domain.
  • the operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is the exponential function activation operation.
  • the operation domain includes the address of the data to be calculated and the target address.
  • the operation module 12-12 is used to perform an exponential function activation operation on the data to be operated, obtain the operation result, and store the operation result in the target address.
  • the exponential function activation instruction obtained by the control module is an uncompiled software instruction that cannot be directly executed by hardware.
  • the control module needs to first compile the (uncompiled) exponential function activation instruction; after the compiled exponential function activation instruction is obtained, it can be parsed.
  • the compiled exponential function activation instruction is a hardware instruction that can be directly executed by the hardware.
  • the compiled exponential function activation instruction may be an instruction corresponding only to the exponential function activation instruction.
  • alternatively, the compiled exponential function activation instruction may be a generic instruction covering all activation operations, so that when processing the exponential function activation instruction the device does not need to distinguish which activation function the instruction corresponds to, which simplifies the processing of the exponential function activation operation.
  • the control module can obtain the data to be calculated from the address of the data to be calculated.
  • the control module may determine the data required to perform the exponential function activation operation according to the operation code of the exponential function activation instruction.
  • the control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
  • the operation code may be the part of an instruction or field (usually represented by a code) in a computer program that specifies the operation to be performed; it is the instruction number that tells the device executing the instruction which instruction to execute.
  • the operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; all data required to execute the corresponding instruction include the data to be calculated, the corresponding operation method, and so on.
  • an exponential function activation instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be calculated and the target address.
  • the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure.
  • the control module may receive an exponential function activation instruction and control one or more operation modules to perform the exponential function activation operation.
  • when there are multiple control modules, they may respectively receive exponential function activation instructions and control the corresponding one or more operation modules to perform exponential function activation operations.
  • An embodiment of the present disclosure provides an exponential function activation instruction processing device.
  • the device includes a control module and an arithmetic module.
  • the control module is used to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction.
  • the compiled exponential function activation instruction is parsed to obtain the operation code and operation domain of the exponential function activation instruction, and the data to be calculated and the target address required to execute the exponential function activation instruction are obtained according to the operation code and operation domain; the operation module is used to perform the exponential function activation operation on the data to be calculated to obtain the operation result, and to store the operation result at the target address.
  • the exponential function activation instruction processing device provided by the embodiments of the present disclosure has a wide range of applications, processes the exponential function activation instruction with high efficiency and speed, and performs the exponential function activation operation with high efficiency and speed.
  • the control module 12-11 may also be used to obtain an exponential activation function parameter table according to the operation code and/or operation domain.
  • the operation module 12-12 can also be used to perform the exponential function activation operation on the data to be calculated according to the exponential activation function parameter table to obtain the operation result,
  • the exponential activation function parameter table may include an exponential activation function activation table and an exponential activation function constant table.
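The disclosure does not say what the exponential activation function activation table and constant table contain. One well-known table-driven way to evaluate e^x, swapped in here purely as an illustration, is range reduction in the style of Tang's method: write x = m·ln2/N + r with a small r, so that e^x = 2^(m/N)·e^r. In this sketch the "activation table" is assumed to hold 2^(j/N), and the "constant table" is assumed to hold ln2/N plus the polynomial coefficients for e^r. N = 32 and all names are hypothetical.

```python
import math

N = 32  # hypothetical table size
# Assumed "activation table": 2**(j/N) for j = 0..N-1
POW2_TABLE = [2.0 ** (j / N) for j in range(N)]
# Assumed "constant table": reduction constant and Taylor coefficients of e^r
LN2_OVER_N = math.log(2.0) / N
POLY = (1.0, 1.0, 1.0 / 2, 1.0 / 6, 1.0 / 24)

def exp_table(x: float) -> float:
    # Range reduction: x = m * ln2/N + r with |r| <= ln2/(2N)
    m = round(x / LN2_OVER_N)
    r = x - m * LN2_OVER_N
    k, j = divmod(m, N)              # m = k*N + j with 0 <= j < N
    # e^r via a short polynomial evaluated with Horner's rule
    p = 0.0
    for c in reversed(POLY):
        p = p * r + c
    # e^x = 2**k * 2**(j/N) * e^r
    return math.ldexp(POW2_TABLE[j] * p, k)
```

Because |r| stays below about 0.011, a degree-4 polynomial already gives a relative error on the order of 1e-12.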
  • the address of the exponential activation function parameter table may be included in the operation domain, so that the control module obtains the exponential activation function parameter table from that address.
  • alternatively, when the control module determines from the operation code that an exponential activation function parameter table is required to execute the exponential function activation instruction, it may directly obtain the exponential activation function parameter table corresponding to the instruction from a predetermined storage address of the parameter table.
  • a person skilled in the art can set the way of acquiring the exponential activation function parameter table according to actual needs, and this disclosure does not limit this.
  • the control module can also obtain an activation function corresponding to the exponential function activation instruction, so that the operation module can perform the exponential function activation operation on the data to be calculated according to the activation function and the corresponding operator.
  • the computing module 12-12 may include multiple activation operators 12-120.
  • a plurality of activation operators 12-120 are used to perform the exponential function activation operation on the data to be calculated.
  • the operation module may alternatively include a single activation operator.
  • the number of activation operators can be set according to the size of the data required for the exponential function activation operation and the processing speed and efficiency of the exponential function activation operation, and the disclosure does not limit this.
  • the operation module 12-12 may include a master operation sub-module 12-121 and a plurality of slave operation sub-modules 12-122, and the master operation sub-module 12-121 includes multiple activation operators 12-120 (not shown in the figure).
  • the main operation sub-module 12-121 is used for performing an exponential function activation operation on the data to be calculated by using a plurality of activation operators to obtain an operation result and storing the operation result in a target address.
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 12-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the plurality of data to be calculated may be less than or equal to the read-in amount.
  • the read-in amount may be the data amount, that is, the size, of the plurality of data to be calculated that are acquired.
  • when the operation domain directly contains the specific value of the read-in amount, that value can be determined as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • when the operation domain contains neither, the plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the data amount of the plurality of data to be calculated that are acquired may be less than or equal to the default read-in amount.
  • in this way, the amount of data to be calculated can be limited, the accuracy of the operation result can be ensured, and the device can reliably execute the exponential function activation instruction.
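The read-in amount can come from an immediate value, from a value stored at an address, or from a preset default. The sketch below models those three cases. Memory is assumed to be a flat Python list indexed by address. The helpers `resolve_read_in` and `fetch_data` are hypothetical, and so is the default value of 8.

```python
from typing import List, Optional

DEFAULT_READ_IN = 8  # hypothetical preset default read-in amount

def resolve_read_in(memory: List[float], immediate: Optional[int] = None,
                    size_addr: Optional[int] = None,
                    default: int = DEFAULT_READ_IN) -> int:
    """Pick the read-in amount: immediate value, else value stored at size_addr, else default."""
    if immediate is not None:
        return immediate
    if size_addr is not None:
        return int(memory[size_addr])
    return default

def fetch_data(memory: List[float], src: int, read_in: int) -> List[float]:
    """Fetch the data to be calculated; the amount fetched never exceeds read_in
    (it is smaller if memory ends first)."""
    return memory[src:src + read_in]
```

When the slice runs past the end of memory it simply stops there. This mirrors the statement that the amount of data acquired may be less than or equal to the read-in amount.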
  • the device may further include a storage module 12-13.
  • the storage module 12-13 is used to store the data to be calculated.
  • the storage module 12-13 can also be used to store the exponential activation function parameter table.
  • the storage module may include a memory, such as one or more of a cache and a register, and the cache may include a high-speed scratchpad cache.
  • the data to be calculated and the exponential activation function parameter table can be stored in the cache and/or register of the storage module as needed, and the disclosure does not limit this.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • the control module 12-11 can also be used to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.
  • the instruction format of the exponential function activation instruction may be: active.exps dst, src0, size
  • active.exps is the opcode of the exponential function activation instruction
  • dst, src0, and size are the operation domains of the exponential function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • size is the read-in amount.
  • the instruction format of the exponential function activation instruction may also be:
  • active.exps is the opcode of the exponential function activation instruction
  • dst, src0, src1, and size are the operation domains of the exponential function activation instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • src1 is the address of the index activation function parameter table
  • size is the read-in amount.
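The following Python sketch parses a textual assembly form of the instruction to make the two operand layouts concrete. Several things here are assumptions: the comma-separated text syntax, the position of src1 between src0 and size, the `ExpsInstruction` class, and `parse_exps`. The disclosure only says that the compiled instruction is a binary file, and it does not give that file's encoding.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExpsInstruction:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be calculated
    size: int                   # read-in amount
    src1: Optional[int] = None  # address of the exponential activation function parameter table, if given

def parse_exps(text: str) -> ExpsInstruction:
    """Parse the assumed textual forms 'active.exps dst, src0, size'
    and 'active.exps dst, src0, src1, size'."""
    opcode, _, rest = text.strip().partition(" ")
    if opcode != "active.exps":
        raise ValueError(f"unexpected opcode: {opcode!r}")
    fields = [int(f) for f in rest.split(",")]  # int() tolerates surrounding spaces
    if len(fields) == 3:
        dst, src0, size = fields
        return ExpsInstruction(opcode, dst, src0, size)
    if len(fields) == 4:
        dst, src0, src1, size = fields
        return ExpsInstruction(opcode, dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(fields)}")
```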
  • the device may be set in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
  • FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 4-3, the exponential function activation instruction processing device processes the exponential function activation instruction as follows:
  • the control module 12-11 compiles the obtained exponential function activation instruction 1 to obtain the compiled exponential function activation instruction 1 (for example, exponential function activation instruction 1 is active.exps 500, 100, 64), and parses the compiled exponential function activation instruction 1 to obtain its operation code and operation domain.
  • the operation code of the exponential function activation instruction 1 is active.exps, the target address is 500, the data address to be calculated is 100, and the read-in amount is 64.
  • the control module 12-11 acquires the data to be operated with a data amount of 64 (read-in amount) from the data address to be operated 100. Assuming that the activation calculation needs to be performed according to the exponential activation function parameter table, the control module 12-11 also needs to obtain the exponential activation function parameter table (see the above description for the specific implementation process).
  • the operation module 12-12 performs the exponential function activation operation on the operation data according to the exponential activation function parameter table, obtains the operation result, and stores the operation result in the target address 500.
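The data flow of this scenario can be modelled in a few lines of Python: fetch 64 values from address 100, apply the exponential activation, and write the results starting at address 500. The model rests on several assumptions. Memory is a flat list of floats indexed by address, `math.exp` stands in for the hardware activation operators, and the parameter table is not modelled. The input values and the name `run_active_exps` are hypothetical.

```python
import math

def run_active_exps(memory, dst, src0, size):
    """Hypothetical model of 'active.exps dst, src0, size' on a flat memory list."""
    data = memory[src0:src0 + size]          # control module: fetch the data to be calculated
    result = [math.exp(x) for x in data]     # operation module: exponential activation
    memory[dst:dst + len(result)] = result   # store the result starting at the target address
    return memory

mem = [0.0] * 1024
for i in range(64):
    mem[100 + i] = i / 64  # illustrative input values only
run_active_exps(mem, dst=500, src0=100, size=64)
```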
  • the exponential function activation instruction processing device can process the exponential function activation instruction efficiently and quickly, and realize the efficient and rapid processing of the exponential function activation operation.
  • FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, this method is applied to the above exponential function activation instruction processing device.
  • the method includes steps S51-12 and S52-12.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-12 to S52-12.
  • step S51-12: the control module is used to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and the operation domain, and obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain.
  • the operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is the exponential function activation operation, and the operation domain includes the data address and the target address to be operated.
  • step S52-12: the operation module is used to perform the exponential function activation operation on the data to be calculated to obtain an operation result, and store the operation result in the target address.
  • the method may further include:
  • the operation module performs exponential function activation operation on the operation data to obtain the operation result, which may include:
  • the exponential activation function parameter table may include an exponential activation function activation table and an exponential activation function constant table.
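The disclosure does not say how the activation table and the constant table are used. The sketch below shows one common table-based way to evaluate exp, purely as an assumed illustration. It rewrites exp(x) as 2^(x·log2 e), applies the integer part of the exponent exactly, and looks up the fractional part in a small table with linear interpolation. The table size and the contents of `CONST_TABLE` and `ACT_TABLE` are hypothetical, and so is `table_exp`.

```python
import math

TABLE_BITS = 6
N = 1 << TABLE_BITS  # 64 table segments on [0, 1)

# Hypothetical "constant table": constants the operator needs.
CONST_TABLE = {"log2e": 1.0 / math.log(2.0)}

# Hypothetical "activation table": samples of 2**u at u = i/N for i = 0..N.
ACT_TABLE = [2.0 ** (i / N) for i in range(N + 1)]

def table_exp(x: float) -> float:
    """Approximate exp(x) as 2**k * 2**f, using the table for 2**f with f in [0, 1)."""
    t = x * CONST_TABLE["log2e"]   # exp(x) == 2**t
    k = math.floor(t)              # integer part: applied exactly via ldexp
    f = t - k                      # fractional part, 0 <= f <= 1 after rounding
    pos = f * N
    i = min(int(pos), N - 1)       # segment index; the clamp guards the f == 1.0 rounding case
    w = pos - i                    # position inside the segment, 0 <= w <= 1
    mant = ACT_TABLE[i] + (ACT_TABLE[i + 1] - ACT_TABLE[i]) * w  # linear interpolation
    return math.ldexp(mant, k)     # mant * 2**k
```

With 64 segments, the interpolation error of this sketch stays well below a relative error of 1e-4. A larger table trades memory for accuracy.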
  • the operation module performs an exponential function activation operation on the operation data to obtain an operation result, which may include: performing an exponential function activation operation on the operation data using a plurality of activation operators.
  • the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include multiple activation operators,
  • the operation module performs exponential function activation operation on the operation data to obtain the operation result, which may include:
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • obtaining the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain may include: acquiring the read-in amount, and acquiring the data to be calculated according to the read-in amount.
  • the method may further include: storing data to be calculated.
  • parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction may include:
  • storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.
  • the method may further include: when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed before it, caching the first instruction to be executed, and controlling the execution of the first instruction to be executed after it is determined that execution of the zeroth instruction to be executed is completed,
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
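The overlap condition that creates an association relationship is a plain interval-intersection test. The sketch below assumes half-open address intervals [start, end), and the names `PendingInstr`, `overlaps` and `has_dependency` are hypothetical. The disclosure does not say whether interval ends are inclusive.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class PendingInstr:
    name: str
    region: Tuple[int, int]  # half-open [start, end) interval of the data the instruction needs

def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True when two half-open address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(first: PendingInstr, earlier: Iterable[PendingInstr]) -> bool:
    """The first instruction must wait if its interval overlaps that of any
    earlier (zeroth) instruction that has not finished executing."""
    return any(overlaps(first.region, z.region) for z in earlier)

exps = PendingInstr("active.exps", (100, 164))  # uses addresses 100..163
sel = PendingInstr("select", (150, 200))        # shares addresses 150..163
other = PendingInstr("select", (164, 200))      # starts right after, no shared address
```

An instruction for which `has_dependency` returns True would be held in the instruction storage sub-module until the earlier instruction completes. Instructions without overlap can be issued in queue order.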
  • using the control module to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction may include: generating an assembly file according to the exponential function activation instruction and translating the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.
  • the exponential function activation instruction processing method provided by the embodiment of the present disclosure has a wide range of application and processes both the exponential function activation instruction and the exponential function activation operation efficiently and quickly.
  • an exponential function activation instruction processing device comprising:
  • the control module is used to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, and obtain the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain;
  • the operation module is used to perform an exponential function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,
  • the operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is an exponential function activation operation
  • the operation domain includes the data address to be operated and the target address.
  • the control module is further configured to obtain an exponential activation function parameter table according to the operation code and/or the operation domain;
  • the operation module is further configured to perform the exponential function activation operation on the data to be calculated according to the exponential activation function parameter table to obtain an operation result,
  • the exponential activation function parameter table includes an exponential activation function activation table and an exponential activation function constant table.
  • the operation module includes:
  • a plurality of activation operators, used to perform the exponential function activation operation on the data to be calculated.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of activation operators,
  • the master operation sub-module is used to perform the exponential function activation operation on the data to be calculated by using the plurality of activation operators to obtain an operation result, and store the operation result in the target address.
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • the control module is also used to obtain the read-in amount, and obtain the data to be calculated according to the read-in amount.
  • Clause D6 The device according to Clause D1, the device further comprising:
  • the storage module is used for storing the data to be calculated.
  • the control module includes:
  • an instruction storage sub-module, used to store the compiled exponential function activation instruction;
  • an instruction processing sub-module, used to parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction;
  • a queue storage sub-module, used to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.
  • the control module further comprises:
  • a dependency processing sub-module, used to, when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed before it, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • the control module is also used to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file,
  • the binary file is the compiled exponential function activation instruction.
  • a machine learning computing device comprising:
  • one or more exponential function activation instruction processing devices as described in any one of Clauses D1 to D9, used to obtain the data to be calculated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;
  • the machine learning operation device includes a plurality of the exponential function activation instruction processing devices
  • the plurality of the exponential function activation instruction processing devices may be connected and transmit data through a specific structure
  • a plurality of the exponential function activation instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale machine learning operations; the plurality of exponential function activation instruction processing devices share the same control system or have their own control systems; the plurality of exponential function activation instruction processing devices share memory or have their own memories; the interconnection mode of the plurality of exponential function activation instruction processing devices is any interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device as described in Clause D10, a general interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip includes:
  • Clause D13 An electronic device, comprising:
  • a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause D12;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • An exponential function activation instruction processing method is applied to an exponential function activation instruction processing device.
  • the method includes:
  • the control module is used to compile the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction, and obtain, according to the operation code and the operation domain, the data to be calculated and the target address required to execute the exponential function activation instruction;
  • the operation code is used to indicate that the activation operation performed by the exponential function activation instruction on the data is an exponential function activation operation
  • the operation domain includes the data address to be operated and the target address.
  • Clause D16 The method according to Clause D15, the method further comprising:
  • the operation module performs an exponential function activation operation on the data to be operated to obtain an operation result, including:
  • the exponential activation function parameter table includes an exponential activation function activation table and an exponential activation function constant table.
  • an arithmetic module is used to perform an exponential function activation operation on the data to be calculated to obtain an operation result, including:
  • the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule includes the plurality of activation operators,
  • the operation module performs an exponential function activation operation on the data to be operated to obtain an operation result, including:
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • obtaining the data to be calculated and the target address required to execute the exponential function activation instruction according to the operation code and the operation domain includes:
  • Clause D20 The method according to Clause D15, the method further comprising:
  • Clause D21 parsing the compiled exponential function activation instruction to obtain the operation code and operation domain of the exponential function activation instruction includes:
  • storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are arranged in execution order, and the plurality of instructions to be executed include the compiled exponential function activation instruction.
  • Clause D22 The method according to Clause D21, the method further comprising:
  • when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed before it, caching the first instruction to be executed, and, after it is determined that execution of the zeroth instruction to be executed is completed, controlling the execution of the first instruction to be executed,
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • using the control module to compile the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction includes:
  • the binary file is the compiled exponential function activation instruction.
  • A selection operation is a processing operation that selects data according to a selection condition. Many programming languages are in use, and the related art has no selection instruction that applies across all of them. Technicians therefore have to implement the selection operation with multiple custom instructions in each language environment, which makes the selection operation inefficient and slow.
  • the present disclosure provides a selection instruction processing method, device, computer equipment, and storage medium, with which the selection operation can be implemented with a single instruction, significantly improving the efficiency and speed of the selection operation.
  • FIG. 5-1 shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-1, the device includes a control module 1-11 and an arithmetic module 1-12.
  • the control module 1-11 is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and according to the operation code and operation domain Obtain multiple index data, multiple data to be calculated and target addresses required to execute the selection instruction.
  • the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation field includes the data address to be operated, the index data address, and the target address.
  • the operation module 1-12 is used to sequentially determine whether a plurality of index data meets the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions in the target address.
  • the selection instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by the hardware.
  • the control module first needs to compile the (uncompiled) selection instruction; only after the compiled selection instruction is obtained can it be parsed.
  • the compiled selection instructions are hardware instructions that can be directly executed by the hardware.
  • the control module may obtain a plurality of data to be calculated and a plurality of index data from the data address to be calculated and the index data address, respectively.
  • the control module may obtain a selection instruction, a plurality of data to be calculated, and a plurality of index data through a data input / output unit.
  • the data input / output unit may be one or more data I / O interfaces or I / O pins.
  • the operation code may be the part of an instruction or field (usually represented by a code) that specifies the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction to execute.
  • the operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. All data required to execute the corresponding instruction include parameter data, data to be operated, corresponding operation methods, and so on. A selection instruction must include an operation code and an operation domain, where the operation domain includes at least the data address to be operated, the index data address, and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • when there is one control module, it can receive a selection instruction and control one or more arithmetic modules to perform the selection operation.
  • when there are multiple control modules, they may respectively receive selection instructions and control the corresponding one or more arithmetic modules to perform selection operations.
  • the selection instruction processing device includes a control module and an arithmetic module.
  • the control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain and execute the selection instruction according to the operation code and operation domain The required multiple index data, multiple data to be calculated and the target address.
  • the operation module is used to sequentially determine whether a plurality of index data satisfy the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data satisfying the storage conditions into the target address.
  • the selection instruction processing device provided by the embodiment of the present disclosure has a wide range of application and processes both the selection instruction and the selection operation efficiently and quickly.
  • the storage condition may be that the index data is not zero.
  • the storage condition may also be that the index data is not a specified value, and the specified value may be a value such as 1. Those skilled in the art can set the storage conditions according to actual needs, and this disclosure does not limit this.
  • by setting the storage condition or the index data as required, the required items among the data to be calculated can be stored to the target address.
  • different storage conditions may be set, or different index data may be set to realize different selections of the data to be calculated.
  • the operation module 1-12 may include a plurality of comparators 1-120, which are used to sequentially determine whether a plurality of index data meets storage conditions.
  • the comparator may sequentially compare the index data with 0 to determine whether the index data meets the storage condition.
  • the operation module can store the data to be operated corresponding to the index data other than 0 into the target address in sequence.
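At the functional level, the selection operation is an ordered filter. Each index datum is tested against the storage condition, and the data to be calculated whose index datum passes are packed in order at the target address. The minimal Python sketch below uses a hypothetical `select` function. Its default condition is "index data is not zero", and any other condition can be passed in as a callable. Comparator hardware, addresses and memory are not modelled.

```python
from typing import Callable, List, Sequence

def not_zero(v: float) -> bool:
    """Default storage condition: the index datum is not 0."""
    return v != 0

def select(data: Sequence[float], index: Sequence[float], size: int,
           condition: Callable[[float], bool] = not_zero) -> List[float]:
    """Scan the first `size` index data in order and keep the data items
    whose index datum satisfies the storage condition."""
    out = []
    for d, idx in zip(data[:size], index[:size]):
        if condition(idx):   # the comparator check
            out.append(d)    # stored consecutively, so no gaps at the target
    return out
```

With the data 1, 5, 6, 7, 3 and the index data 1, 8, 0, 6, 9, the default condition keeps 1, 5, 7, 3. If the condition is changed to "index data is not 1", the result becomes 5, 6, 7, 3.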
  • the number of comparators can be set according to the amount of data to be compared, the processing speed, efficiency, and other requirements of the comparison, which is not limited in the present disclosure.
  • the operation module 1-12 may include a master operation sub-module 1-121 and a plurality of slave operation sub-modules 1-122.
  • the main operation sub-module 1-121 may include a plurality of comparators 1-120 (not shown in the figure).
  • the master operation sub-module 1-121 is used to sequentially determine, with the multiple comparators, whether the multiple index data satisfy the storage condition, determine the data to be calculated corresponding to the index data that satisfy the storage condition, and sequentially store those data to be calculated in the target address.
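When several comparators work at once, the sequential "store in order" step still needs an output position for each kept datum. A standard way to obtain it is stream compaction, shown here as an assumption rather than the disclosed circuit. The steps are: compare in parallel, take an exclusive prefix sum of the match flags, then scatter each kept datum to its position. `select_parallel` is a hypothetical name, and the Python loops only simulate the steps that would run in parallel.

```python
from itertools import accumulate
from typing import List, Sequence

def select_parallel(data: Sequence[float], index: Sequence[float], size: int) -> List[float]:
    n = min(size, len(data), len(index))
    # Step 1: each comparator tests one index datum independently (condition: not 0).
    flags = [1 if index[i] != 0 else 0 for i in range(n)]
    # Step 2: the exclusive prefix sum of the flags gives each kept datum its output slot.
    inclusive = list(accumulate(flags))
    positions = [c - f for c, f in zip(inclusive, flags)]
    count = inclusive[-1] if n else 0
    # Step 3: scatter; the writes go to distinct slots, so they could also run in parallel.
    out = [0.0] * count
    for i in range(n):
        if flags[i]:
            out[positions[i]] = data[i]
    return out
```

For the index data 1, 8, 0, 6, 9 the flags are 1, 1, 0, 1, 1 and the output slots are 0, 1, 2, 2, 3. The third datum is skipped, and 7 and 3 land in slots 2 and 3.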
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 1-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the multiple data to be calculated is less than or equal to the read-in amount
  • the read-in amount is less than or equal to the data amount of the multiple index data.
  • the read-in amount may be the data amount, that is, the size, of the plurality of data to be calculated that are acquired.
  • when the operation domain directly contains the specific value of the read-in amount, that value can be determined as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • when the operation domain contains neither, the plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the data amount of the plurality of data to be calculated that are acquired is less than or equal to the default read-in amount, and the default read-in amount is less than or equal to the data amount of the multiple index data.
  • the amount of data to be calculated, the amount of index data, and the amount of data that can be stored at the target address can be the same, and can all be equal to the read-in amount or the default read-in amount.
  • the operation module sequentially stores the data to be calculated corresponding to the index data that satisfy the storage condition in the target address, which avoids problems such as insufficient target address space or wasted target addresses.
  • the device may further include a storage module 1-13.
  • the storage module 1-13 is used to store the multiple index data, the multiple data to be calculated, and the storage condition.
  • the storage module may include memory, such as one or more of a cache and a register
  • the cache may include a high-speed scratchpad cache, and may also include at least one NRAM (Neuron Random Access Memory).
  • the cache can be used to store the data to be calculated and pooling kernels, and the register can be used to store scalar data in the data to be calculated.
  • the cache may include a neuron cache.
  • the neuron cache that is, the foregoing neuron random access memory, can be used to store neuron data in the data to be calculated, and the neuron data can include neuron vector data.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • the control module 1-11 may also be used to generate an assembly file according to the selection instruction and translate the assembly file into a binary file, where the binary file is the compiled selection instruction.
  • the instruction format of the selection instruction may be: select dst, src0, src1, size
  • select is the opcode of the selection instruction
  • dst, src0, src1, and size are the operation fields of the selection instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • src1 is the index data address
  • size is the read-in amount.
  • the device may be set in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
  • FIG. 5-3 shows a schematic diagram of an application scenario of a selection instruction processing apparatus according to an embodiment of the present disclosure.
  • the selection instruction processing device processes the selection instruction as follows:
  • the control module 1-11 compiles the obtained selection instruction 1 to obtain the compiled selection instruction 1, and parses the compiled selection instruction 1 (for example, selection instruction 1 is select 500, 100, 200, 5) to obtain its operation code and operation field. Here, the operation code of selection instruction 1 is select, the target address is 500, the data address to be calculated is 100, the index data address is 200, and the read-in amount is 5. The control module 1-11 then reads 5 data to be calculated starting from the data address to be calculated 100, and 5 index data starting from the index data address 200.
  • the obtained plurality of data to be calculated include 1, 5, 6, 7, and 3.
  • the obtained index data are 1, 8, 0, 6, and 9.
  • the storage condition is that the index data is not 0.
  • the operation module 1-12 checks the index data in order and, whenever an index datum is not 0, stores the corresponding datum to be calculated into the next position at the target address 500. Specifically, the operation module 1-12 examines the index data "1, 8, 0, 6, 9" in turn; since only the third index datum is 0, the third datum to be calculated (6) is skipped, and "1, 5, 7, 3" are stored sequentially at the target address 500.
  • for other details of this scenario, refer to the relevant description above.
  • the selection instruction processing device can thus process the selection instruction efficiently and quickly, giving the selection operation high processing efficiency and speed.
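  • The scenario can be replayed with a small Python model of the selection operation. This is a sketch under stated assumptions: memory is modeled as a Python dict from address to value, each address holds one datum, and the storage condition defaults to "index datum is not 0"; `execute_select` is a hypothetical name, not an API from the disclosure.

```python
from collections.abc import Callable


def execute_select(
    memory: dict[int, int],
    dst: int,
    src0: int,
    src1: int,
    size: int,
    condition: Callable[[int], bool] = lambda idx: idx != 0,
) -> int:
    """Copy memory[src0 + i] to consecutive addresses starting at dst whenever
    memory[src1 + i] satisfies the storage condition; return how many were stored."""
    write = dst
    for i in range(size):
        if condition(memory[src1 + i]):
            memory[write] = memory[src0 + i]
            write += 1  # stored data are packed with no gaps
    return write - dst


memory: dict[int, int] = {}
memory.update({100 + i: v for i, v in enumerate([1, 5, 6, 7, 3])})  # data to be calculated
memory.update({200 + i: v for i, v in enumerate([1, 8, 0, 6, 9])})  # index data
stored = execute_select(memory, dst=500, src0=100, src1=200, size=5)
print(stored, [memory[500 + i] for i in range(stored)])  # 4 [1, 5, 7, 3]
```

Because matching data are packed contiguously, only as many target slots are written as there are index data meeting the condition, which is how the target address space is neither overrun nor wasted.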
  • FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure.
  • the method is applied to the above selection instruction processing apparatus, and the method includes step S51-1 and step S52-1.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-1 to S52-1.
  • step S51-1: using the control module to compile the obtained selection instruction to obtain the compiled selection instruction, parsing the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and acquiring, according to the operation code and the operation domain, the plurality of index data, the plurality of data to be calculated, and the target address required to execute the selection instruction.
  • the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation field includes the data address to be operated, the index data address, and the target address.
  • step S52-1: using the operation module to sequentially determine whether each of the plurality of index data meets the storage condition, and sequentially storing the data to be operated corresponding to the index data that meet the storage condition into the target address.
  • the operation module may include a plurality of comparators,
  • using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when an index datum meets the storage condition, sequentially store the corresponding data to be operated into the target address, may include:
  • the multiple comparators in the arithmetic module are used to sequentially determine whether the multiple index data meet the storage conditions.
  • the operation module includes a master operation sub-module and multiple slave operation sub-modules
  • the master operation sub-module includes the multiple comparators
  • using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when an index datum meets the storage condition, sequentially store the corresponding data to be operated into the target address, may include:
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • step S51-1 may include: acquiring the read-in amount, and acquiring the plurality of data to be calculated according to the read-in amount, where the data amount of the plurality of data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the plurality of index data.
  • the method may further include: using the storage module of the device to store multiple index data, multiple data to be calculated, and storage conditions,
  • the storage module includes at least one of a register and a cache
  • the cache is used to store the plurality of index data, the plurality of data to be calculated, and the storage conditions, and the cache includes at least one neuron cache NRAM;
  • the register is used to store scalar data in the plurality of index data, the plurality of data to be calculated, and the storage condition;
  • the neuron cache is used to store neuron data in the plurality of index data, the plurality of data to be calculated, and the storage condition, and the neuron data includes neuron vector data.
  • step S51-1 may include:
  • storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled selection instruction.
  • the method may further include:
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
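  • The dependency rule above reduces to an interval-overlap test. A minimal Python sketch, assuming each instruction's required data occupy one half-open address interval [start, end); `PendingInstruction`, `overlaps`, and `blockers` are illustrative names, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingInstruction:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last such address (half-open interval)


def overlaps(a: PendingInstruction, b: PendingInstruction) -> bool:
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def blockers(queue: list[PendingInstruction], position: int) -> list[PendingInstruction]:
    """Earlier queued instructions that queue[position] must wait for."""
    current = queue[position]
    return [earlier for earlier in queue[:position] if overlaps(current, earlier)]


queue = [
    PendingInstruction("zeroth", 100, 105),
    PendingInstruction("first", 103, 108),   # shares addresses 103 and 104 with "zeroth"
    PendingInstruction("second", 300, 305),  # independent of both
]
print([b.name for b in blockers(queue, 1)])  # ['zeroth']
print([b.name for b in blockers(queue, 2)])  # []
```

An instruction with a non-empty blocker list would be held back until those earlier instructions finish; one with an empty list can proceed.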
  • using the control module to compile the obtained selection instruction to obtain the compiled selection instruction includes:
  • the assembly file is generated according to the selection instruction, and the assembly file is translated into a binary file.
  • the binary file is the selection instruction after compilation.
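  • The assemble-then-translate step can be illustrated with Python's standard `struct` module. The disclosure does not specify a binary encoding, so the opcode numbers in `OPCODES` and the layout (a 1-byte opcode followed by four unsigned 32-bit little-endian fields) are purely hypothetical.

```python
import struct

# Hypothetical opcode numbering and field layout; not taken from the disclosure.
OPCODES = {"select": 0x01, "count": 0x02}
LAYOUT = struct.Struct("<B4I")  # opcode byte + four uint32 fields, no padding


def assemble_line(line: str) -> bytes:
    """Encode one assembly line such as 'select 500, 100, 200, 5'."""
    mnemonic, _, operands = line.strip().partition(" ")
    fields = [int(tok) for tok in operands.split(",")]
    if len(fields) > 4:
        raise ValueError("too many operation fields")
    fields += [0] * (4 - len(fields))  # unused trailing fields are zero-filled
    return LAYOUT.pack(OPCODES[mnemonic], *fields)


def assemble(assembly_text: str) -> bytes:
    """Translate an assembly file (one instruction per line) into a binary blob."""
    return b"".join(assemble_line(ln) for ln in assembly_text.splitlines() if ln.strip())


binary = assemble("select 500, 100, 200, 5\n")
print(len(binary), LAYOUT.unpack(binary))  # 17 (1, 500, 100, 200, 5)
```

Fixed-width records keep decoding trivial: a consumer can slice the binary file every `LAYOUT.size` bytes and unpack each record independently.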
  • the storage condition may include that the index data is not zero.
  • the selection instruction processing method provided by the embodiments of the present disclosure has a wide range of application and processes selection instructions, and therefore performs selection operations, with high efficiency and speed.
  • a selection instruction processing device comprising:
  • the control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and acquire, according to the operation code and the operation domain, the plurality of index data, the plurality of data to be calculated, and the target address required to execute the selection instruction;
  • the operation module is used to sequentially determine whether the plurality of index data meet the storage condition and, when an index datum meets the storage condition, sequentially store the corresponding data to be operated into the target address,
  • the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation
  • the operation field includes a data address to be operated, an index data address, and the target address.
  • the operation module includes:
  • a plurality of comparators are used to sequentially determine whether the plurality of index data satisfy the storage condition.
  • the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module includes the plurality of comparators,
  • the master operation sub-module is used to sequentially determine, using the plurality of comparators, whether the plurality of index data satisfy the storage condition, determine the data to be operated corresponding to the index data satisfying the storage condition, and sequentially store those data to be operated into the target address.
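  • One way to picture the master operation sub-module's comparators is as parallel lanes: each lane tests one index datum per step, and the surviving data of each group are written out in order. The lane count and the group-at-a-time schedule below are assumptions made for this Python sketch; the disclosure does not state how the comparators are scheduled.

```python
from collections.abc import Callable, Sequence


def select_with_comparators(
    data: Sequence[int],
    index: Sequence[int],
    lanes: int = 4,
    condition: Callable[[int], bool] = lambda idx: idx != 0,
) -> list[int]:
    """Model `lanes` comparators that each check one index datum per step."""
    out: list[int] = []
    for base in range(0, len(index), lanes):
        idx_chunk = index[base:base + lanes]
        data_chunk = data[base:base + lanes]
        mask = [condition(v) for v in idx_chunk]  # one comparator result per lane
        # Preserve the original order so storage at the target address stays sequential.
        out.extend(d for d, keep in zip(data_chunk, mask) if keep)
    return out


print(select_with_comparators([1, 5, 6, 7, 3], [1, 8, 0, 6, 9], lanes=2))  # [1, 5, 7, 3]
```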
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • the control module is also used to obtain the read-in amount and to obtain the plurality of data to be calculated according to the read-in amount,
  • the data amount of the plurality of data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the plurality of index data.
  • Clause E5 The device according to Clause E1, the device further comprising:
  • a storage module configured to store the plurality of index data, the plurality of data to be calculated, and the storage conditions
  • the storage module includes at least one of a register and a cache
  • the cache is used to store the plurality of index data, the plurality of data to be calculated, and the storage conditions, and the cache includes at least one neuron cache NRAM;
  • the register is used to store scalar data in the plurality of index data, the plurality of data to be calculated, and the storage condition;
  • the neuron cache is used to store neuron data in the plurality of index data, the plurality of data to be calculated, and the storage condition, and the neuron data includes neuron vector data.
  • the control module includes:
  • an instruction processing sub-module, which is used to parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction;
  • the queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include a plurality of compiled selection instructions.
  • the control module further comprises:
  • a dependency processing sub-module, which is used to, when there is an association relationship between a first instruction to be executed among the plurality of instructions to be executed and a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • the control module is also used to generate an assembly file according to the selection instruction and translate the assembly file into a binary file
  • the binary file is the compiled selection instruction.
  • Clause E9 The device according to any one of Clause E1 to Clause E8, the storage condition includes that the index data is not zero.
  • a machine learning computing device characterized in that the device includes:
  • the machine learning computing device includes a plurality of the selection instruction processing devices
  • the plurality of the selection instruction processing devices can be connected and transmit data through a specific structure
  • a plurality of the selection instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of selection instruction processing devices either share the same control system or have their own respective control systems, and either share memory or have their own memories; and they may be interconnected in any interconnection topology.
  • a combined processing device characterized in that the combined processing device includes:
  • the machine learning computing device according to Clause E10, a universal interconnection interface, and another processing device;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip characterized in that the machine learning chip includes:
  • An electronic device characterized in that the electronic device includes:
  • a board card characterized in that the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause E12;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • a method for processing a selection instruction characterized in that the method is applied to a device for processing a selection instruction, the device includes a control module and an arithmetic module, and the method includes:
  • using the control module to compile the obtained selection instruction to obtain the compiled selection instruction, parse the compiled selection instruction to obtain the operation code and operation domain of the selection instruction, and obtain, according to the operation code and the operation domain, the plurality of index data, the plurality of data to be calculated, and the target address required to execute the selection instruction;
  • the operation module is used to sequentially determine whether the plurality of index data meet the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions into the target address,
  • the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation
  • the operation field includes a data address to be operated, an index data address, and the target address.
  • Clause E16 The method according to Clause E15, characterized in that the operation module includes a plurality of comparators,
  • using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when an index datum meets the storage condition, sequentially store the corresponding data to be operated into the target address includes:
  • a plurality of comparators in the arithmetic module are used to sequentially determine whether the plurality of index data satisfy the storage condition.
  • Clause E17 The method according to Clause E16, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of comparators,
  • using the operation module to sequentially determine whether the plurality of index data meet the storage condition and, when an index datum meets the storage condition, sequentially store the corresponding data to be operated into the target address includes:
  • Clause E18 The method according to Clause E15, wherein the operation domain further includes a read-in amount or a storage address of the read-in amount,
  • acquiring a plurality of index data, a plurality of data to be calculated and a target address required to execute the selection instruction according to the operation code and the operation domain includes:
  • the data amount of the plurality of data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the plurality of index data.
  • Clause E19 The method according to Clause E15, characterized in that the method further comprises:
  • the storage module includes at least one of a register and a cache
  • the cache is used to store the plurality of index data, the plurality of data to be calculated, and the storage conditions, and the cache includes at least one neuron cache NRAM;
  • the register is used to store scalar data in the plurality of index data, the plurality of data to be calculated, and the storage condition;
  • the neuron cache is used to store neuron data in the plurality of index data, the plurality of data to be calculated, and the storage condition, and the neuron data includes neuron vector data.
  • Clause E20 The method according to Clause E15, characterized in that parsing the compiled selection instruction to obtain the operation code and operation domain of the selection instruction includes:
  • the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include a plurality of compiled selection instructions.
  • Clause E21 The method according to Clause E20, characterized in that the method further comprises:
  • when it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before it, caching the first to-be-executed instruction, and, after it is determined that execution of the zeroth to-be-executed instruction is completed, controlling execution of the first to-be-executed instruction,
  • the association relationship between the first instruction to be executed and the zeroth instruction to be executed before it includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • Clause E22 The method according to Clause E15, wherein using the control module to compile the obtained selection instruction to obtain the compiled selection instruction includes:
  • the binary file is the compiled selection instruction.
  • Clause E23 The method according to any one of Clause E15 to Clause E22, wherein the storage condition includes that the index data is not zero.
  • the present disclosure provides a counting instruction processing method, device, computer equipment, and storage medium, and counting statistics can be realized with only one instruction, which can significantly improve the efficiency and speed of counting statistics.
  • FIG. 6-1 shows a block diagram of a count instruction processing device according to an embodiment of the present disclosure. As shown in Figure 6-1, the device includes a control module 2-11 and an arithmetic module 2-12.
  • the control module 2-11 is used to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, the plurality of data to be calculated and the target address required to execute the counting instruction.
  • the operation code is used to indicate that the operation performed by the counting instruction on the data is a counting statistical operation
  • the operation domain includes the data address to be calculated and the target address.
  • the operation module 2-12 is used to determine the number of data to be operated satisfying the counting condition among the plurality of data to be operated, and store the number of data in the target address.
  • the counting instruction acquired by the control module is an uncompiled software instruction that cannot be directly executed by hardware.
  • the control module needs to first compile the counting instruction (uncompiled). After the compiled count instruction is obtained, the compiled count instruction can be analyzed.
  • the compiled count instruction is a hardware instruction that can be directly executed by the hardware.
  • the control module can obtain the plurality of data to be calculated from the data address to be calculated.
  • the operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is the instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed.
  • the operation domain may be a source of all data required to execute the corresponding instruction, and all data required to execute the corresponding instruction include parameter data, data to be operated, corresponding operation methods, and so on. For a counting instruction, it must include an operation code and an operation field, where the operation field includes at least the data address to be operated and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • the control module can receive counting instructions and control one or more arithmetic modules to perform counting statistics.
  • the multiple control modules can respectively receive counting instructions and control the corresponding one or more arithmetic modules to perform counting statistics.
  • the counting instruction processing device includes a control module and an arithmetic module.
  • the control module is used to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, the plurality of data to be calculated and the target address required to execute the counting instruction; the operation module is used to determine the number of data to be operated that satisfy the counting condition among the plurality of data to be operated, and store that number in the target address.
  • the counting instruction processing device provided by the embodiments of the present disclosure has a wide range of application and processes counting instructions, and therefore performs counting statistics, with high efficiency and speed.
  • the counting condition may be that the data to be calculated is not zero.
  • in this case, the number of data to be operated that are not 0 among the plurality of data to be operated is counted, and this number is stored at the target address.
  • the counting condition may also be that the data to be calculated is not a specified value, and the specified value may be a value such as 1. Those skilled in the art can set the counting conditions according to actual needs, and this disclosure does not limit this.
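  • Both counting conditions mentioned above are simple predicates applied to each datum. A Python sketch; the input values are made up for illustration, and `count_satisfying` is a hypothetical helper name.

```python
from collections.abc import Callable, Iterable


def count_satisfying(data: Iterable[int], condition: Callable[[int], bool]) -> int:
    """Number of data to be calculated that satisfy the counting condition."""
    return sum(1 for x in data if condition(x))


sample = [0, 3, 0, 1, 2]  # hypothetical data to be calculated
print(count_satisfying(sample, lambda x: x != 0))  # 3  (condition: not zero)
print(count_satisfying(sample, lambda x: x != 1))  # 4  (condition: not the specified value 1)
```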
  • the arithmetic module 2-12 may include multiple counters 2-120.
  • the plurality of counters 2-120 are used to count the data to be calculated that satisfy the counting condition, so as to obtain the number of such data among the plurality of data to be calculated.
  • the operation module may also include only a single counter.
  • the number of counters can be set according to the size of the data to be calculated, the processing speed, efficiency, and other requirements of the counting statistical operation, which is not limited in the present disclosure.
  • the operation module 2-12 may include a master operation submodule 2-121 and a plurality of slave operation submodules 2-122, and the master operation submodule 2-121 includes multiple counters 2-120 (not shown in the figure).
  • the master operation submodule 2-121 is used to count, using the plurality of counters 2-120, the data to be calculated that satisfy the counting condition, so as to determine the number of such data among the plurality of data to be calculated, and to store that number in the target address.
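  • Splitting the work across several counters can be modeled by routing each datum to one counter and summing the partial counts at the end. The round-robin routing (`i % num_counters`) is an assumption for this Python sketch; the disclosure leaves the distribution of data among counters open.

```python
from collections.abc import Callable, Sequence


def count_with_counters(
    data: Sequence[int],
    num_counters: int = 4,
    condition: Callable[[int], bool] = lambda x: x != 0,
) -> int:
    """Model several counters that each tally a share of the data."""
    partial = [0] * num_counters
    for i, x in enumerate(data):
        if condition(x):
            partial[i % num_counters] += 1  # datum i is handled by counter i mod N
    return sum(partial)  # the master sub-module combines the partial counts


print(count_with_counters([0, 3, 0, 1, 2], num_counters=2))  # 3
```

The total does not depend on how the data are routed, so the number of counters affects only throughput, consistent with choosing it according to data size and speed requirements.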
  • the operation domain may further include a read-in amount or a storage address of the read-in amount.
  • the control module 2-11 is also used to obtain the read-in amount, and obtain a plurality of data to be calculated according to the read-in amount.
  • the data amount of the multiple data to be calculated is less than or equal to the read-in amount.
  • the read-in amount may be the data amount of the acquired plurality of data to be calculated, and may be the size of the acquired data to be calculated.
  • when the operation field directly contains the specific value of the read-in amount, that value can be determined as the read-in amount.
  • when the operation domain contains the storage address of the read-in amount, the read-in amount can be obtained from that storage address.
  • when the operation domain contains neither, the plurality of data to be calculated may be obtained according to a preset default read-in amount.
  • the data amount of the acquired multiple data to be calculated is less than or equal to the default read-in amount.
  • in this way, the data amount of the plurality of data to be calculated is bounded, which helps ensure the accuracy of the counted data number and ensures that the device can execute the counting instruction.
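  • The three ways of determining the read-in amount can be expressed as a small resolution function. In this Python sketch the dict-based memory, the parameter names, and the default of 16 are all hypothetical; the disclosure only says that a preset default exists.

```python
from typing import Optional

DEFAULT_READ_IN = 16  # hypothetical preset default read-in amount


def resolve_read_in(
    memory: dict[int, int],
    immediate: Optional[int] = None,
    address: Optional[int] = None,
    default: int = DEFAULT_READ_IN,
) -> int:
    """Return the read-in amount from the operation domain, or the default."""
    if immediate is not None:   # operation domain carries the value itself
        return immediate
    if address is not None:     # operation domain carries where the value is stored
        return memory[address]
    return default              # operation domain carries neither


def fetch_data(memory: dict[int, int], src0: int, read_in: int, available: int) -> list[int]:
    """Read at most `read_in` data starting at src0, never more than are available."""
    return [memory[src0 + i] for i in range(min(read_in, available))]


mem = {40: 3, 100: 7, 101: 0, 102: 9}
print(resolve_read_in(mem, immediate=5), resolve_read_in(mem, address=40), resolve_read_in(mem))
# 5 3 16
print(fetch_data(mem, 100, read_in=resolve_read_in(mem, address=40), available=3))  # [7, 0, 9]
```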
  • the device may further include a storage module 2-13.
  • the storage module 2-13 is used to store the plurality of data to be calculated and the counting condition.
  • the storage module may include one or more of a cache and a register.
  • the cache may include a high-speed temporary (scratchpad) cache, and may also include at least one NRAM (Neuron Random Access Memory). The register can be used to store scalar data in the plurality of data to be calculated and in the counting condition;
  • the neuron cache may be used to store neuron data in the plurality of data to be calculated and in the counting condition, and the neuron data includes neuron vector data.
  • the control module 2-11 may also be used to generate an assembly file according to the counting instruction and translate the assembly file into a binary file, where the binary file is the compiled counting instruction.
  • the instruction format of the counting instruction may be: count dst, src0, size
  • count is the operation code of the counting instruction, and dst, src0, and size are the operation fields of the counting instruction.
  • dst is the target address
  • src0 is the data address to be calculated
  • size is the read-in amount.
  • the device may be provided in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and a Neural-network Processing Unit (NPU).
  • FIG. 6-3 shows a schematic diagram of an application scenario of a counting instruction processing device according to an embodiment of the present disclosure.
  • the counting instruction processing device processes the counting instruction as follows:
  • the control module 2-11 compiles the obtained counting instruction 1 (for example, counting instruction 1 is count 500, 100) to obtain the compiled counting instruction 1, and parses the compiled counting instruction 1 to obtain its operation code and operation domain. Here, the operation code of counting instruction 1 is count, the target address is 500, the data address to be calculated is 100, and the read-in amount is 5. The control module 2-11 reads 5 data to be calculated starting from the data address to be calculated 100.
  • the counting condition is that the data to be calculated is not 0.
  • the operation module 2-12 counts the number of data to be operated which is not 0 among the plurality of data to be operated, and stores the number of data in the target address 500.
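  • At the memory level, the counting scenario above reduces to reading `size` data from `src0`, counting those that satisfy the condition, and writing the count to `dst`. A Python sketch follows; the scenario does not list the values stored at address 100, so the five values used here are hypothetical, and `execute_count` is an illustrative name.

```python
from collections.abc import Callable


def execute_count(
    memory: dict[int, int],
    dst: int,
    src0: int,
    size: int,
    condition: Callable[[int], bool] = lambda x: x != 0,
) -> int:
    """Store at dst how many of the `size` data starting at src0 satisfy the condition."""
    n = sum(1 for i in range(size) if condition(memory[src0 + i]))
    memory[dst] = n
    return n


memory = {100 + i: v for i, v in enumerate([4, 0, 2, 0, 9])}  # hypothetical data
execute_count(memory, dst=500, src0=100, size=5)  # models "count 500, 100" with read-in amount 5
print(memory[500])  # 3
```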
  • the counting instruction processing device can thus process counting instructions efficiently and quickly, giving counting statistics high processing efficiency and speed.
  • FIG. 6-4 shows a flowchart of a counting instruction processing method according to an embodiment of the present disclosure.
  • the method can be applied to computer equipment including a memory and a processor, where the memory is used to store data used in the execution of the method, and the processor is used to perform related processing and operation steps, such as the following steps S51-2 to S52-2.
  • the method is applied to the above-mentioned counting instruction processing device, and the method includes step S51-2 and step S52-2.
  • step S51-2: using the control module to compile the obtained counting instruction to obtain the compiled counting instruction, parsing the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtaining, according to the operation code and the operation domain, the plurality of data to be calculated and the target address required to execute the counting instruction.
  • the operation code is used to indicate that the operation performed by the counting instruction on the data is a counting statistical operation, and the operation domain includes the data address to be calculated and the target address.
  • step S52-2: using the operation module to determine the number of data to be operated satisfying the counting condition among the plurality of data to be operated, and storing that number in the target address.
  • determining the number of data to be calculated among the plurality of data to be calculated that meets the counting condition may include:
  • counting, by using a plurality of counters in the operation module, the data to be operated that satisfy the counting condition, so as to obtain the number of such data among the plurality of data to be operated.
  • the operation module includes a master operation submodule and multiple slave operation submodules, and the master operation submodule includes multiple counters,
  • determining the number of data to be calculated among the plurality of data to be calculated satisfying the counting condition, and storing the number of data in the target address includes:
  • the operation domain further includes a read-in amount or a storage address of the read-in amount
  • obtaining a plurality of data to be calculated and a target address required to execute the counting instruction according to the operation code and the operation domain may include:
  • the data amount of the multiple data to be calculated is less than or equal to the read-in amount.
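The read-in amount bounds how much data is fetched. A minimal sketch, assuming a flat list as memory and a boolean flag (hypothetical) that tells whether the operation-domain field holds the amount itself or the address where it is stored:

```python
def resolve_read_in_amount(field, memory, field_is_address):
    """The operation domain carries either the read-in amount or the address holding it."""
    return memory[field] if field_is_address else field


def fetch_operands(memory, src_addr, read_in_amount):
    """Fetch at most `read_in_amount` elements starting at src_addr.

    Python slicing stops at the end of `memory`, so the result may be shorter
    than the read-in amount but never longer."""
    if read_in_amount < 0:
        raise ValueError("read-in amount must be non-negative")
    return memory[src_addr:src_addr + read_in_amount]


mem = [9, 8, 7, 6, 5, 4]
print(fetch_operands(mem, 4, 10))  # [5, 4]
```

This mirrors the stated constraint: the amount of data actually obtained is less than or equal to the read-in amount.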
  • The method may further include: storing the plurality of data to be operated on and the counting condition in a storage module of the device,
  • where the storage module includes at least one of a register and a cache;
  • the cache stores the plurality of data to be operated on and the counting condition, and includes at least one neuron cache (NRAM);
  • the register stores the scalar data among the plurality of data to be operated on and the counting condition;
  • the neuron cache stores the neuron data among the plurality of data to be operated on and the counting condition.
  • The neuron data includes neuron vector data.
  • Parsing the compiled counting instruction to obtain its operation code and operation domain may include:
  • storing an instruction queue, which includes a plurality of instructions to be executed arranged in execution order; these may include the compiled counting instruction.
  • The method may further include: when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed that precedes it, caching the first instruction to be executed and executing it after the zeroth instruction to be executed has finished executing,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval, which stores the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
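The association test reduces to an interval-overlap check. A minimal sketch, assuming each storage address interval is half-open `[start, end)` and that an instruction may touch several intervals (both assumptions for illustration):

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap iff each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def has_dependency(first_intervals, zeroth_intervals):
    """True if any interval used by the first instruction overlaps any used by the zeroth."""
    return any(
        intervals_overlap(*a, *b) for a in first_intervals for b in zeroth_intervals
    )


print(has_dependency([(0, 16)], [(8, 24)]))   # True: addresses 8..15 are shared
print(has_dependency([(0, 16)], [(16, 32)]))  # False: the intervals only touch
```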
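  • Compiling the obtained counting instruction to obtain the compiled counting instruction may include: generating an assembly file according to the counting instruction and translating the assembly file into a binary file,
  • where the binary file is the compiled counting instruction.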
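The assembly-to-binary translation can be pictured with a toy assembler. Everything concrete here is hypothetical: the `COUNT` mnemonic, the opcode value, the operand order (source address, length, target address), and the 13-byte little-endian layout are illustrative assumptions, since the disclosure does not define the real instruction encoding.

```python
import struct

# Hypothetical mnemonic table and binary layout (not the real encoding).
OPCODES = {"COUNT": 0x21}
LAYOUT = "<BIII"  # little-endian: 1-byte opcode + three unsigned 32-bit fields


def emit_assembly(src_addr, length, dst_addr):
    """Generate one line of (hypothetical) assembly for a counting instruction."""
    return f"COUNT {src_addr:#x}, {length}, {dst_addr:#x}"


def assemble(line):
    """Translate one assembly line into its binary form."""
    mnemonic, operands = line.split(maxsplit=1)
    fields = [int(tok.strip(), 0) for tok in operands.split(",")]
    return struct.pack(LAYOUT, OPCODES[mnemonic], *fields)


def disassemble(blob):
    """Inverse of assemble(): recover the opcode and operand fields."""
    opcode, *fields = struct.unpack(LAYOUT, blob)
    return opcode, fields


line = emit_assembly(0x100, 64, 0x200)  # "COUNT 0x100, 64, 0x200"
binary = assemble(line)                 # 13 bytes
```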
  • The counting condition may include that the data to be operated on is nonzero.
  • The counting instruction processing method provided by the embodiments of the present disclosure has a wide application range, and processes counting instructions, and performs counting statistics, efficiently and quickly.
  • A counting instruction processing device, comprising:
  • a control module, configured to compile the obtained counting instruction to obtain the compiled counting instruction, parse the compiled counting instruction to obtain its operation code and operation domain, and obtain, according to the operation code and the operation domain, the plurality of data to be operated on and the target address required to execute the counting instruction;
  • an operation module, configured to determine how many of the plurality of data to be operated on satisfy the counting condition and to store that number in the target address,
  • where the operation code indicates that the operation the counting instruction performs on the data is a counting statistical operation,
  • and the operation domain includes the address of the data to be operated on and the target address.
  • The operation module includes:
  • a plurality of counters, configured to count the data to be operated on that satisfy the counting condition, so as to obtain the number of data among the plurality of data to be operated on that satisfy the counting condition.
  • The operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of counters;
  • the master operation sub-module is configured to use the plurality of counters to count the data to be operated on that satisfy the counting condition, determine the number of such data among the plurality of data to be operated on, and store that number in the target address.
  • Clause F4. The device according to Clause F1, wherein the operation domain further includes a read-in amount or the storage address of the read-in amount,
  • the control module is further configured to obtain the read-in amount and to obtain the plurality of data to be operated on according to the read-in amount,
  • where the data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
  • Clause F5. The device according to Clause F1, further comprising:
  • a storage module, configured to store the plurality of data to be operated on and the counting condition,
  • where the storage module includes at least one of a register and a cache;
  • the cache is configured to store the plurality of data to be operated on and the counting condition, and includes at least one neuron cache (NRAM);
  • the register is configured to store the scalar data among the plurality of data to be operated on and the counting condition;
  • the neuron cache is configured to store the neuron data among the plurality of data to be operated on and the counting condition, the neuron data including neuron vector data.
  • The control module includes:
  • a first instruction processing sub-module, configured to parse the compiled counting instruction to obtain its operation code and operation domain;
  • a queue storage sub-module, configured to store an instruction queue that includes a plurality of instructions to be executed arranged in execution order, including the compiled counting instruction.
  • The control module further comprises:
  • a dependency processing sub-module, configured to, when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval, which stores the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
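The effect of the dependency processing sub-module can be shown with a toy timing model. The model is an assumption for illustration: one instruction may start per cycle in queue order, durations are made-up cycle counts, and each instruction is checked only against the instruction immediately before it (its "zeroth" instruction); a dependent instruction waits in the instruction store until that instruction finishes.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    interval: tuple  # half-open (start, end) storage address interval of its data
    duration: int    # execution time in cycles (toy value)


def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]


def schedule(queue):
    """Return {name: (start, finish)} for instructions issued in queue order."""
    times = {}
    prev = None
    for ins in queue:
        if prev is None:
            start = 0
        else:
            prev_start, prev_finish = times[prev.name]
            start = prev_start + 1  # at most one new instruction per cycle
            if overlaps(ins.interval, prev.interval):
                start = max(start, prev_finish)  # held until the zeroth instruction ends
        times[ins.name] = (start, start + ins.duration)
        prev = ins
    return times


queue = [Instr("a", (0, 64), 4), Instr("b", (32, 96), 2), Instr("c", (128, 160), 3)]
print(schedule(queue))  # {'a': (0, 4), 'b': (4, 6), 'c': (5, 8)}
```

Instruction `b` shares addresses 32..63 with `a`, so it starts only at cycle 4; `c` is independent of `b` and starts one cycle later without waiting for `b` to finish.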
  • The control module is further configured to generate an assembly file according to the counting instruction and translate the assembly file into a binary file,
  • where the binary file is the compiled counting instruction.
  • Clause F9. The device according to any one of Clauses F1 to F8, wherein the counting condition includes that the data to be operated on is nonzero.
  • A machine learning computing device, comprising:
  • one or more counting instruction processing devices according to any one of Clauses F1 to F9, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
  • when the machine learning computing device includes a plurality of the counting instruction processing devices,
  • the plurality of counting instruction processing devices may be connected, and transmit data, through a specific structure;
  • wherein the plurality of counting instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; they share the same control system or have their own control systems; they share memory or have their own memories; and they may be interconnected in any topology.
  • A combined processing device, comprising:
  • the machine learning computing device according to Clause F10, a universal interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by the user;
  • the combined processing device further includes a storage device, which is connected to both the machine learning computing device and the other processing devices and stores their data.
  • A machine learning chip, comprising:
  • Clause F13. An electronic device, comprising:
  • a board card, including a storage device, an interface device, a control device, and the machine learning chip according to Clause F12;
  • the machine learning chip is connected to the storage device, the control device, and the interface device, respectively;
  • the storage device is configured to store data;
  • the interface device is configured to implement data transmission between the machine learning chip and an external device;
  • the control device is configured to monitor the state of the machine learning chip.
  • A counting instruction processing method, applied to a counting instruction processing apparatus.
  • The apparatus includes a control module and an operation module.
  • The method includes:
  • using the control module to compile the obtained counting instruction to obtain a compiled counting instruction, parse the compiled counting instruction to obtain the operation code and operation domain of the counting instruction, and obtain, according to the operation code and the operation domain, the plurality of data to be operated on and the target address required to execute the counting instruction;
  • where the operation code indicates that the operation the counting instruction performs on the data is a counting statistical operation,
  • and the operation domain includes the address of the data to be operated on and the target address.
  • Determining how many of the plurality of data to be operated on satisfy the counting condition includes:
  • using a plurality of counters in the operation module to count the data to be operated on that satisfy the counting condition, so as to obtain the number of such data among the plurality of data to be operated on.
  • The operation module includes a master operation sub-module and a plurality of slave operation sub-modules,
  • and the master operation sub-module includes a plurality of adders and a plurality of dividers;
  • determining how many of the plurality of data to be operated on satisfy the counting condition, and storing that number in the target address, includes:
  • The operation domain further includes a read-in amount or the storage address of the read-in amount;
  • obtaining, according to the operation code and the operation domain, the plurality of data to be operated on and the target address required to execute the counting instruction includes:
  • the data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
  • Clause F19. The method according to Clause F15, further comprising:
  • the storage module includes at least one of a register and a cache;
  • the cache is configured to store the plurality of data to be operated on and the counting condition, and includes at least one neuron cache (NRAM);
  • the register is configured to store the scalar data among the plurality of data to be operated on and the counting condition;
  • the neuron cache is configured to store the neuron data among the plurality of data to be operated on and the counting condition, the neuron data including neuron vector data.
  • Clause F20. Parsing the compiled counting instruction to obtain the operation code and operation domain of the counting instruction includes:
  • storing an instruction queue, which includes a plurality of instructions to be executed arranged in execution order, including the compiled counting instruction.
  • Clause F21. The method according to Clause F20, further comprising:
  • when it is determined that the first instruction to be executed among the plurality of instructions to be executed has an association relationship with the zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after determining that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
  • where the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
  • the first storage address interval, which stores the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
  • Clause F22. The method according to Clause F15, wherein compiling the obtained counting instruction to obtain the compiled counting instruction includes:
  • generating an assembly file according to the counting instruction and translating the assembly file into a binary file, where the binary file is the compiled counting instruction.
  • Clause F23. The method according to any one of Clauses F15 to F22, wherein the counting condition includes that the data to be operated on is nonzero.
  • Clause F24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the method according to any one of Clauses F15 to F23.
  • The present disclosure provides a fully connected instruction processing method, device, computer equipment, and storage medium. A fully connected operation can be performed with a single instruction, which can significantly improve the efficiency and speed of fully connected operations.
  • FIG. 7-1 shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 7-1, the device includes a control module 3-11 and an operation module 3-12.
  • The control module 3-11 is configured to compile the obtained fully connected instruction to obtain a compiled fully connected instruction, parse the compiled fully connected instruction to obtain the operation code and operation domain of the fully connected instruction, and obtain, according to the operation code and the operation domain, the first data, the second data, the weight data, and the target address required to execute the fully connected instruction.
  • The operation code indicates that the operation the fully connected instruction performs on the data is a fully connected operation, and the operation domain includes a first data address, a second data address, a weight data address, and a target address.
  • The operation module 3-12 is configured to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and to store the operation result in the target address.
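The excerpt does not say what the first and second data represent. Purely as an illustrative assumption, the sketch below treats the first data as the input vector and the second data as a per-output bias, which gives the familiar form y = W·x + b; the flat-list memory used by `store` is also an assumption.

```python
def fully_connected(first, second, weights):
    """y[i] = sum_j weights[i][j] * first[j] + second[i].

    Assumes `first` is the input vector and `second` a per-output bias."""
    if len(second) != len(weights) or any(len(row) != len(first) for row in weights):
        raise ValueError("shape mismatch")
    return [sum(w * v for w, v in zip(row, first)) + b for row, b in zip(weights, second)]


def store(memory, target_addr, values):
    """Write the operation result starting at the target address of a flat list memory."""
    if target_addr + len(values) > len(memory):
        raise IndexError("result does not fit at the target address")
    memory[target_addr:target_addr + len(values)] = values


y = fully_connected([1, 2], [0.5, -1], [[1, 0], [2, 3]])  # [1.5, 7]
```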
  • The fully connected instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly.
  • The control module therefore first compiles the (uncompiled) fully connected instruction; once the compiled fully connected instruction is obtained, it can be parsed.
  • The compiled fully connected instruction is a hardware instruction that the hardware can execute directly.
  • The control module may obtain the first data, the second data, and the weight data from the first data address, the second data address, and the weight data address, respectively.
  • The operation code may be the part of an instruction or field (usually represented by a code) that specifies, in a computer program, the operation to be performed; it is the instruction sequence number that tells the device executing the instruction which instruction to execute.
  • The operation domain may be the source of all data required to execute the corresponding instruction, including the first data, the second data, the weight data, the corresponding calculation method, and so on. A fully connected instruction must include an operation code and an operation domain, where the operation domain includes at least a first data address, a second data address, a weight data address, and a target address.
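The split into an operation code and an operation domain can be modelled as a small data structure. In this sketch the opcode value `OP_FC`, the class names, and the dict-based memory (address mapped to a stored tensor) are hypothetical; only the four address fields come from the text above.

```python
from dataclasses import dataclass

OP_FC = 0x30  # hypothetical opcode value for a fully connected instruction


@dataclass
class OperationDomain:
    first_addr: int   # first data address
    second_addr: int  # second data address
    weight_addr: int  # weight data address
    target_addr: int  # target address for the result


@dataclass
class Instruction:
    opcode: int
    domain: OperationDomain


def resolve_operation_domain(inst, memory):
    """Turn the addresses in the operation domain into operands plus the target address."""
    if inst.opcode != OP_FC:
        raise ValueError("unsupported opcode")
    d = inst.domain
    return memory[d.first_addr], memory[d.second_addr], memory[d.weight_addr], d.target_addr


memory = {0x10: [1, 2], 0x20: [0, 0], 0x30: [[1, 1]]}
inst = Instruction(OP_FC, OperationDomain(0x10, 0x20, 0x30, 0x40))
```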
  • The device may include one or more control modules and one or more operation modules, and the numbers of control modules and operation modules may be set according to actual needs, which is not limited in the present disclosure.
  • When there is one control module, it can receive a fully connected instruction and control one or more operation modules to perform the fully connected operation.
  • When there are a plurality of control modules, each can receive a fully connected instruction and control the corresponding one or more operation modules to perform the fully connected operation.
  • In the fully connected instruction processing device provided by the embodiments of the present disclosure, which includes a control module and an operation module, the control module compiles the obtained fully connected instruction to obtain a compiled fully connected instruction, parses the compiled fully connected instruction to obtain its operation code and operation domain, and obtains, according to the operation code and the operation domain, the first data, second data, weight data, and target address required to execute the fully connected instruction;
  • the operation module performs a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and stores the operation result in the target address.
  • The fully connected instruction processing device provided by the embodiments of the present disclosure has a wide application range, and processes fully connected instructions, and performs fully connected operations, efficiently and quickly.
  • The operation module 3-12 may include a plurality of multipliers 3-120 and a plurality of adders 3-120'. The multipliers 3-120 perform the multiplication operations in the fully connected operation, and the adders 3-120' perform the addition operations.
  • Alternatively, the operation module may include one adder and one multiplier, one adder and a plurality of multipliers, or a plurality of adders and one multiplier.
  • The numbers of multipliers and adders can be set according to the data amount of the fully connected operation and the required processing speed and efficiency, which is not limited in the present disclosure.
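How the number of multipliers trades against speed can be illustrated by splitting one output of the fully connected operation into a multiplier stage and an adder stage. This is a software sketch under stated assumptions: the multipliers process products in groups of `num_multipliers` per step (a hypothetical parameter), and the adders reduce the products as a pairwise adder tree.

```python
def adder_tree_sum(values):
    """Reduce values pairwise, level by level, as a tree of adders would."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd leftover passes straight to the next level
        level = nxt
    return level[0]


def dot_with_units(weights_row, x, num_multipliers):
    """Multiply in groups of `num_multipliers` (one group per step), then sum with an adder tree.

    Returns (result, multiplier_steps) so the effect of the multiplier count is visible."""
    if len(weights_row) != len(x) or num_multipliers < 1:
        raise ValueError("bad arguments")
    products, steps = [], 0
    for start in range(0, len(x), num_multipliers):
        end = start + num_multipliers
        products.extend(w * xi for w, xi in zip(weights_row[start:end], x[start:end]))
        steps += 1
    return adder_tree_sum(products), steps


print(dot_with_units([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2))  # (15, 3)
print(dot_with_units([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 5))  # (15, 1)
```

More multipliers reduce the number of multiplication steps for the same result, which is the kind of speed/resource trade-off the text leaves open.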
  • The operation module 3-12 may include a master operation sub-module 3-121 and a plurality of slave operation sub-modules 3-122, and the slave operation sub-modules 3-122 include a plurality of multipliers 3-120 and a plurality of adders 3-120' (not shown).
  • The control module 3-11 is further configured to parse the compiled fully connected instruction to obtain a plurality of operation instructions, and to send the first data, the second data, and the plurality of operation instructions to the master operation sub-module 3-121.
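This excerpt does not describe how the master operation sub-module divides work among the slave sub-modules. The sketch below assumes one plausible scheme for y = W·x: the master hands out weight rows round-robin, each slave computes the dot products for its rows, and the master places the partial results back in order. `num_slaves` and the row-wise split are illustrative assumptions.

```python
def master_fully_connected(x, weights, num_slaves):
    """Toy master/slave split of y = W x using a round-robin row assignment."""
    if num_slaves < 1:
        raise ValueError("need at least one slave")
    assignments = [[] for _ in range(num_slaves)]
    for i, row in enumerate(weights):
        assignments[i % num_slaves].append((i, row))

    def slave(rows):
        # Each slave sub-module uses its own multipliers and adders on its rows.
        return [(i, sum(w * xi for w, xi in zip(row, x))) for i, row in rows]

    y = [0] * len(weights)
    for part in assignments:
        for i, value in slave(part):
            y[i] = value  # the master gathers results back into output order
    return y


print(master_fully_connected([1, 2], [[1, 0], [0, 1], [1, 1]], 2))  # [1, 2, 3]
```

A row-wise split keeps the slaves independent, since no slave needs another slave's partial sums.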

Abstract

An operation method and device, computer equipment, and a storage medium. A combined processing device comprises a machine learning operation device, a universal interconnection interface, and other processing devices. The machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by a user. The combined processing device further comprises a storage device, which is connected to both the machine learning operation device and the other processing devices and stores their data. The operation method and device, the computer equipment, and the storage medium have a wide range of applications, high operation processing efficiency, and fast processing speed.

Description

Operation method, device, computer equipment and storage medium
Technical field
The present disclosure relates to the field of computer technology, and in particular to an operation method, device, computer equipment, and storage medium.
Background
With the continuous development of science and technology, machine learning, and neural network algorithms in particular, are used more and more widely, with good results in image recognition, speech recognition, natural language processing, and other fields. However, as neural network algorithms grow more complex, the types and number of data operations involved keep increasing. In the related art, the following operations and processing are inefficient and slow: selection, counting, fully connected, convolution, max pooling, activation, padding, matrix transposition, and average pooling operations; scalar calculation; scalar type conversion; address fetching; scalar data migration; jump control of the instruction flow; vector calculation; loop vector calculation; vector data migration; synchronization control; and interrupt storage.
Summary
In view of the above technical problems, it is necessary to provide an operation method, device, computer equipment, and storage medium that can solve them.
According to an aspect of the present disclosure, an activation instruction processing apparatus is provided, the apparatus including:
a control module, configured to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;
an operation module, configured to perform an activation operation on the data to be operated on to obtain an operation result, and store the operation result in the target address,
where the operation code indicates that the operation the activation instruction performs on the data is an activation operation, and the operation domain includes the address of the data to be operated on and the target address.
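An activation instruction reads the data to be operated on, applies an activation function element-wise, and writes the results to the target address. The drawings of this disclosure mention ReLU and sigmoid activation variants; the selector `kind`, the flat-list memory, and the explicit `length` argument below are hypothetical choices for a minimal sketch, not the disclosed encoding.

```python
import math


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    # Numerically stable logistic function: avoids overflow of exp() for large |v|.
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid}  # hypothetical activation selector


def execute_activation(memory, src_addr, length, dst_addr, kind="relu"):
    """Apply the chosen activation to `length` values at src_addr; store them at dst_addr."""
    if src_addr + length > len(memory) or dst_addr + length > len(memory):
        raise IndexError("operand range outside memory")
    result = [ACTIVATIONS[kind](v) for v in memory[src_addr:src_addr + length]]
    memory[dst_addr:dst_addr + length] = result
    return result


mem = [-1.0, 2.0, 0.0, 0.0, 0.0]
execute_activation(mem, 0, 2, 2)  # mem becomes [-1.0, 2.0, 0.0, 2.0, 0.0]
```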
According to another aspect of the present disclosure, a machine learning computing device is provided, the device including:
one or more of the above activation instruction processing devices, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
when the machine learning computing device includes a plurality of the activation instruction processing devices, the plurality of activation instruction processing devices may be connected, and transmit data, through a specific structure;
wherein the plurality of activation instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; they share the same control system or have their own control systems; they share memory or have their own memories; and they may be interconnected in any topology.
According to another aspect of the present disclosure, a combined processing device is provided, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by the user.
According to another aspect of the present disclosure, a machine learning chip is provided, including the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, a machine learning chip package structure is provided, including the above machine learning chip.
According to another aspect of the present disclosure, a board card is provided, including the above machine learning chip package structure.
According to another aspect of the present disclosure, an electronic device is provided, including the above machine learning chip or the above board card.
According to another aspect of the present disclosure, an activation instruction processing method is provided, applied to an activation instruction processing device, the method including:
using a control module to compile the obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain the operation code and operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction;
using an operation module to perform an activation operation on the data to be operated on to obtain an operation result, and store the operation result in the target address,
where the operation code indicates that the operation the activation instruction performs on the data is an activation operation, and the operation domain includes the address of the data to be operated on and the target address.
According to another aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, having computer program instructions stored thereon that, when executed by a processor, implement the above activation instruction processing method.
Embodiments of the present disclosure provide an activation instruction processing method, device, and related products. The device includes a control module and an operation module. The control module compiles the obtained activation instruction to obtain a compiled activation instruction, parses the compiled activation instruction to obtain its operation code and operation domain, and obtains, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction; the operation module performs an activation operation on the data to be operated on to obtain an operation result, and stores the operation result in the target address. The activation instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and process activation instructions, and perform activation operations, efficiently and quickly.
在一些实施例中,所述电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。In some embodiments, the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, Cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and / or medical devices.
在一些实施例中,所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。In some embodiments, the vehicle includes an airplane, ship, and / or vehicle; the household appliance includes a TV, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; and the medical Equipment includes MRI, B-mode ultrasound and / or electrocardiograph.
根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
附图说明BRIEF DESCRIPTION
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本公开的示例性实施例、特征和方面,并且用于解释本公开的原理。The drawings included in the specification and forming a part of the specification together with the specification show exemplary embodiments, features, and aspects of the present disclosure, and are used to explain the principles of the present disclosure.
FIG. 1 shows a schematic diagram of a processor for an instruction processing method according to an embodiment of the present disclosure.
FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 1-2a and 1-2b show block diagrams of an activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 1-3 shows a schematic diagram of an application scenario of an activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure.
FIG. 2-1 shows a block diagram of a ReLU activation instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 2-2a and 2-2b show block diagrams of a ReLU activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 2-3 shows a schematic diagram of an application scenario of a ReLU activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 2-4 shows a flowchart of a ReLU activation instruction processing method according to an embodiment of the present disclosure.
FIG. 3-1 shows a block diagram of a sigmoid function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 3-2a and 3-2b show block diagrams of a sigmoid function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 3-3 shows a schematic diagram of an application scenario of a sigmoid function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 3-4 shows a flowchart of a sigmoid function activation instruction processing method according to an embodiment of the present disclosure.
FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 4-2a and 4-2b show block diagrams of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure.
FIG. 5-1 shows a block diagram of a selection instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 5-2a and 5-2b show block diagrams of a selection instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 5-3 shows a schematic diagram of an application scenario of a selection instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure.
FIG. 6-1 shows a block diagram of a count instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 6-2a and 6-2b show block diagrams of a count instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 6-3 shows a schematic diagram of an application scenario of a count instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 6-4 shows a flowchart of a count instruction processing method according to an embodiment of the present disclosure.
FIG. 7-1 shows a block diagram of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 7-2a and 7-2b show block diagrams of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 7-3 shows a schematic diagram of an application scenario of a fully connected instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 7-4 shows a flowchart of a fully connected instruction processing method according to an embodiment of the present disclosure.
FIG. 8-1 shows a block diagram of a convolution instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 8-2a and 8-2b show block diagrams of a convolution instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 8-3 shows a schematic diagram of an application scenario of a convolution instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 8-4 shows a flowchart of a convolution instruction processing method according to an embodiment of the present disclosure.
FIG. 9-1 shows a block diagram of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 9-2a and 9-2b show block diagrams of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 9-3 shows a schematic diagram of an application scenario of a max pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 9-4 shows a flowchart of a max pooling instruction processing method according to an embodiment of the present disclosure.
FIG. 10-1 shows a block diagram of a fill instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 10-2a and 10-2b show block diagrams of a fill instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 10-3 shows a schematic diagram of an application scenario of a fill instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 10-4 shows a flowchart of a fill instruction processing method according to an embodiment of the present disclosure.
FIG. 11-1 shows a block diagram of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 11-2a and 11-2b show block diagrams of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 11-3 shows a schematic diagram of an application scenario of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 11-4 shows a flowchart of a matrix transposition instruction processing method according to an embodiment of the present disclosure.
FIG. 12-1 shows a block diagram of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 12-2a and 12-2b show block diagrams of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 12-3 shows a schematic diagram of an application scenario of an average pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 12-4 shows a flowchart of an average pooling instruction processing method according to an embodiment of the present disclosure.
FIG. 13-1 shows a block diagram of a scalar instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 13-2a and 13-2b show block diagrams of a scalar instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 13-3a and 13-3b show schematic diagrams of application scenarios of a scalar instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 13-4 shows a flowchart of a scalar instruction processing method according to an embodiment of the present disclosure.
FIG. 14-1 shows a block diagram of a scalar type conversion instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 14-2a and 14-2b show block diagrams of a scalar type conversion instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 14-3 shows a schematic diagram of an application scenario of a scalar type conversion instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 14-4 shows a flowchart of a scalar type conversion instruction processing method according to an embodiment of the present disclosure.
FIG. 15-1 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 15-2 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 15-3a and 15-3b show schematic diagrams of application scenarios of an address fetch instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 15-4 shows a flowchart of an address fetch instruction processing method according to an embodiment of the present disclosure.
FIG. 16-1 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 16-2 shows a block diagram of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 16-3 shows a schematic diagram of an application scenario of a scalar data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 16-4 shows a flowchart of a scalar data migration instruction processing method according to an embodiment of the present disclosure.
FIG. 17-1 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 17-2 shows a block diagram of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 17-3 shows a schematic diagram of an application scenario of a scalar control flow instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 17-4 shows a flowchart of a scalar control flow instruction processing method according to an embodiment of the present disclosure.
FIG. 18-1 shows a block diagram of a vector instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 18-2a and 18-2b show block diagrams of a vector instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 18-3 shows a schematic diagram of an application scenario of a vector instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 18-4 shows a flowchart of a vector instruction processing method according to an embodiment of the present disclosure.
FIG. 19-1 shows a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 19-2a and 19-2b show block diagrams of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 19-3 shows a schematic diagram of an application scenario of a loop vector instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 19-4 shows a flowchart of a loop vector instruction processing method according to an embodiment of the present disclosure.
FIG. 20-1 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 20-2 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 20-3 shows a schematic diagram of an application scenario of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 20-4 shows a flowchart of a vector data migration instruction processing method according to an embodiment of the present disclosure.
FIG. 21-1a shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 21-1b shows a schematic structural diagram of a module cluster in a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 21-2 shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 21-3 shows a schematic diagram of an application scenario of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 21-4 shows a flowchart of a synchronization control instruction processing method according to an embodiment of the present disclosure.
FIG. 22-1 shows a block diagram of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 22-2a and 22-2b show block diagrams of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 22-3a and 22-3b show schematic diagrams of application scenarios of an interrupt storage instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 22-4 shows a flowchart of an interrupt storage instruction processing method according to an embodiment of the present disclosure.
FIGS. 23a to 23d show block diagrams of an arithmetic module according to an embodiment of the present disclosure.
FIG. 23e shows a block diagram of a control module according to an embodiment of the present disclosure.
FIGS. 24a and 24b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
FIG. 25 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present disclosure are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present disclosure. Any other embodiment that a person skilled in the art obtains from the embodiments of the present disclosure without creative effort falls within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "zeroth", and the like in the claims, specification, and drawings of the present disclosure are used to distinguish between different objects, not to describe a specific order. The terms "comprising" and "including", as used in the specification and claims of the present disclosure, indicate the presence of the described features, integers, steps, operations, elements, and/or components. They do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or sets thereof.
It should also be understood that the terminology used in the specification of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or", as used in the specification and claims of the present disclosure, refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as any of the following:
- "once it is determined";
- "in response to determining";
- "once [the described condition or event] is detected";
- "in response to detecting [the described condition or event]".
The present disclosure provides instruction processing methods and apparatuses for different operations or processes. It also provides a computer device and a storage medium corresponding to each instruction processing method and apparatus. These instruction processing methods and apparatuses cover the following instructions:
- selection instruction;
- count instruction;
- fully connected instruction;
- convolution instruction;
- max pooling instruction;
- linear rectification function (ReLU) activation instruction;
- S-shaped growth curve (sigmoid) function activation instruction;
- activation instruction;
- fill instruction;
- matrix transposition instruction;
- average pooling instruction;
- exponential function activation instruction;
- scalar instruction;
- scalar type conversion instruction;
- address fetch instruction;
- scalar data migration instruction;
- scalar control flow instruction;
- vector instruction;
- loop vector instruction;
- vector data migration instruction;
- synchronization control instruction;
- interrupt storage instruction.

The instruction processing method and instruction processing apparatus described below may be any of the methods and apparatuses listed above.
The instruction processing method according to the embodiments of the present disclosure may be applied to a processor. The processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) that performs artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-inspired operations, and the like. Machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on. The artificial intelligence processor may include, for example, one or a combination of the following: a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not limit the specific type of processor.
In a possible implementation, the processor mentioned in the present disclosure may include multiple processing units. Each processing unit can independently run the tasks assigned to it, such as convolution tasks, pooling tasks, or fully connected tasks. The present disclosure does not limit the processing units or the tasks they run.
FIG. 1 shows a schematic diagram of a processor for an instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the processor 100 includes multiple processing units 101 and a storage unit 102. The processing units 101 execute instruction sequences. The storage unit 102 stores data and may include a random access memory (RAM) and a register file. The processing units 101 in the processor 100 may share part of the storage space, for example part of the RAM and the register file. At the same time, each processing unit may also have its own private storage space.
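The following is a minimal Python sketch of the storage arrangement in FIG. 1, in which processing units share part of the storage and also keep private storage. It assumes that a unit looks up its private space before the shared space; the disclosure does not specify any lookup rule. The class `Processor` and its methods are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy model of processor 100: num_units processing units 101 plus storage unit 102."""
    num_units: int
    # Part of the RAM / register file that every processing unit can access.
    shared: dict = field(default_factory=dict)
    # One private storage space per processing unit (filled in __post_init__).
    private: list = field(default_factory=list, init=False)

    def __post_init__(self):
        self.private = [dict() for _ in range(self.num_units)]

    def write(self, unit: int, key: str, value, shared: bool = False) -> None:
        # Write either to the shared space or to this unit's own space.
        target = self.shared if shared else self.private[unit]
        target[key] = value

    def read(self, unit: int, key: str):
        # Assumed lookup rule: the unit's private space shadows the shared space.
        if key in self.private[unit]:
            return self.private[unit][key]
        return self.shared[key]
```

For example, when unit 0 writes a value with `shared=True`, unit 1 can read it. A value written without that flag stays in unit 0's private space.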
FIG. 1-1 shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-1, the apparatus includes a control module 8-11 and an arithmetic module 8-12.
The control module 8-11 is configured to compile an obtained activation instruction to obtain a compiled activation instruction. It then parses the compiled activation instruction to obtain the operation code and operation domain of the activation instruction. According to the operation code and the operation domain, it obtains the data to be operated on and the target address required to execute the activation instruction. The operation code indicates that the operation the activation instruction performs on the data is an activation operation. The operation domain includes the address of the data to be operated on and the target address.
The arithmetic module 8-12 is configured to perform an activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address.
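The division of work between the control module 8-11 and the arithmetic module 8-12 can be sketched as a small software model. The following assumptions apply, and none of them is the format defined by the disclosure:
- The instruction fields (`opcode`, `src_addr`, `dst_addr`, `length`, `func`) and the mnemonic "ACT" are illustrative.
- A flat Python list stands in for memory.
- The compile step is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class ActivationInstruction:
    opcode: str          # hypothetical mnemonic; "ACT" marks an activation operation
    src_addr: int        # operation domain: address of the data to be operated on
    dst_addr: int        # operation domain: target address
    length: int          # hypothetical field: number of elements to process
    func: str = "relu"   # hypothetical field selecting the activation function


ACTIVATIONS = {
    "relu": lambda x: max(0.0, x),
    # sigmoid(x) == 0.5 * (1 + tanh(x / 2)); this form cannot overflow.
    "sigmoid": lambda x: 0.5 * (1.0 + math.tanh(0.5 * x)),
    "tanh": math.tanh,
}


def control_module(inst, memory):
    """Check the opcode and fetch the operands (models control module 8-11)."""
    if inst.opcode != "ACT":
        raise ValueError(f"not an activation instruction: {inst.opcode!r}")
    data = memory[inst.src_addr:inst.src_addr + inst.length]
    return data, inst.dst_addr, ACTIVATIONS[inst.func]


def arithmetic_module(data, dst_addr, fn, memory):
    """Apply fn element-wise and store the result at dst_addr (models arithmetic module 8-12)."""
    result = [fn(x) for x in data]
    memory[dst_addr:dst_addr + len(result)] = result
    return result
```

For example, suppose `memory[0:4]` holds `[-2.0, -0.5, 0.5, 3.0]` and the instruction is `ActivationInstruction("ACT", 0, 8, 4, "relu")`. Passing the output of `control_module` to `arithmetic_module` then writes the ReLU results to `memory[8:12]`.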
In this embodiment, the activation instruction obtained by the control module is an uncompiled software instruction that hardware cannot execute directly. The control module therefore first compiles the activation instruction. Only after the compiled activation instruction is obtained can it be parsed. The compiled activation instruction is a hardware instruction that hardware can execute directly. The control module may obtain the data to be operated on from the address of that data. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), that specifies the operation to be performed in a computer program. It serves as an instruction sequence number that tells the apparatus executing the instruction which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. This data includes the data to be operated on, the corresponding operation method, and so on. An activation instruction must include an operation code and an operation domain, and the operation domain includes at least the address of the data to be operated on and the target address.
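The compile-then-parse sequence can be illustrated with Python's `struct` module. A textual activation instruction is packed into a fixed-width binary word, which stands in for the compiled hardware instruction. The word is then unpacked into an operation code and an operation domain. The following are all hypothetical, since the disclosure leaves the instruction format open:
- the text syntax;
- the opcode value 0x21;
- the function selector codes;
- the 12-byte field layout.

```python
import struct

OPCODES = {"ACT": 0x21}                       # hypothetical opcode value
FUNCS = {"relu": 0, "sigmoid": 1, "tanh": 2}  # hypothetical activation selectors

# Little-endian, no padding: opcode (1 B), func (1 B), length (2 B),
# source address (4 B), target address (4 B) -> 12 bytes in total.
FORMAT = "<BBHII"


def compile_instruction(text):
    """Turn 'ACT relu 0x100 0x200 64' (hypothetical syntax) into a binary word."""
    mnemonic, func, src, dst, length = text.split()
    return struct.pack(FORMAT, OPCODES[mnemonic], FUNCS[func],
                       int(length, 0), int(src, 0), int(dst, 0))


def parse_instruction(word):
    """Split a binary word into its operation code and operation domain."""
    opcode, func, length, src, dst = struct.unpack(FORMAT, word)
    return {
        "opcode": opcode,
        "operation_domain": {"func": func, "src_addr": src,
                             "dst_addr": dst, "length": length},
    }
```

The binary word has a fixed width so that hardware can decode each field at a known offset. The text form exists only for the programmer.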
It should be understood that a person skilled in the art may set the instruction format of the activation instruction, as well as the operation code and operation domain it contains, as needed. The present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more arithmetic modules, and their numbers may be set according to actual needs. The present disclosure does not limit this. When the apparatus includes one control module, that control module may receive activation instructions and control one or more arithmetic modules to perform activation operations. When the apparatus includes multiple control modules, each control module may receive activation instructions and control its corresponding one or more arithmetic modules to perform activation operations.
An embodiment of the present disclosure provides an activation instruction processing apparatus that includes a control module and an arithmetic module. The control module compiles an obtained activation instruction to obtain a compiled activation instruction. It then parses the compiled activation instruction to obtain the operation code and operation domain of the activation instruction. According to the operation code and the operation domain, it obtains the data to be operated on and the target address required to execute the activation instruction. The arithmetic module performs an activation operation on the data to be operated on to obtain an operation result, and stores the operation result at the target address. The activation instruction processing apparatus provided by the embodiments of the present disclosure has a wide range of applications. It processes activation instructions, and performs activation operations, with high efficiency and high speed.
In a possible implementation, the activation function used by the activation operation may include at least one of the following:
- the rectified linear unit (ReLU, also called the linear rectification function);
- the sigmoid function (also called the S-shaped growth curve function);
- the hyperbolic tangent function (tanh);
- the leaky rectified linear unit (Leaky ReLU, a variant of ReLU);
- the maxout function, which outputs the maximum value in the layer;
- the power function.
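The following are scalar Python definitions of the activation functions listed above, as they are commonly defined in the neural network literature. Three parameters are assumptions, because the disclosure does not fix them: the default Leaky ReLU slope of 0.01, the power-function exponent of 2, and the maxout group size `k`.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    # Numerically stable: never calls exp() on a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x):
    return math.tanh(x)


def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by a small slope instead of being zeroed.
    return x if x >= 0 else alpha * x


def maxout(xs, k):
    """Maximum over each group of k consecutive inputs; k == len(xs) gives the layer maximum."""
    if k <= 0 or len(xs) % k:
        raise ValueError("len(xs) must be a positive multiple of k")
    return [max(xs[i:i + k]) for i in range(0, len(xs), k)]


def power(x, p=2.0):
    # For negative x and non-integer p, Python returns a complex number.
    return x ** p
```

Setting `k` to the full input length gives the "maximum value in the layer" behavior described above. Smaller groups give the usual grouped maxout.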
In this implementation, the activation function may also be any other function usable for activation operations that has at least one of the following properties:
- nonlinearity;
- continuous differentiability;
- a range that is as unsaturated as possible;
- monotonicity;
- approximately linear behavior near the origin.

The present disclosure does not limit this.
In a possible implementation, the control module 8-11 may further be configured to obtain an activation parameter table according to the operation code and/or the operation domain.
The arithmetic module 8-12 may further be configured to perform the activation operation on the data to be operated on according to the activation parameter table, to obtain the operation result.
The activation parameter table may include an activation table and a constant table.
In this implementation, the control module may obtain the activation parameter table in any of the following ways:
- The operation domain may include the address of the activation parameter table, from which the control module obtains the table.
- When the control module determines from the operation code that executing the activation instruction requires an activation parameter table, it may obtain the table directly from a predetermined storage address of the activation parameter table.
- Under the same condition, it may instead obtain the activation parameter table corresponding to that activation instruction directly from a predetermined storage address of parameter tables.

A person skilled in the art may choose how the activation parameter table is obtained according to actual needs. The present disclosure does not limit this.
In a possible implementation, the control module may also obtain the activation function corresponding to the activation instruction, so that the operation module can perform the activation operation on the data to be operated on using that activation function and the corresponding operators.
It should be noted that those skilled in the art may configure how the operation module implements the activation operation according to actual needs; the present disclosure does not limit this.
In this implementation, the activation table and constant table required for performing the activation operation with each activation function can be determined in advance. Different activation functions correspond to different activation tables and constant tables.
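The disclosure does not specify what the activation table and constant table contain. One plausible reading, shown here purely as an assumption, is a piecewise-linear approximation: the activation table holds segment breakpoints and slopes, and the constant table holds per-segment intercepts. The names `table_activation`, `RELU_ACT`, `HSIG_ACT` and the table layout are hypothetical; the sketch only illustrates how different functions map to different precomputed table pairs.

```python
import bisect
from typing import List, Sequence, Tuple

# Assumed layout (not specified by the disclosure):
#   activation table = (sorted breakpoints, per-segment slopes), len(slopes) == len(breakpoints) + 1
#   constant table   = per-segment intercepts,                   len(intercepts) == len(slopes)
ActivationTable = Tuple[List[float], List[float]]
ConstantTable = List[float]


def table_activation(data: Sequence[float], act: ActivationTable, const: ConstantTable) -> List[float]:
    """Evaluate a piecewise-linear activation y = slope[k] * x + intercept[k] for each x."""
    breakpoints, slopes = act
    out = []
    for x in data:
        k = bisect.bisect_right(breakpoints, x)  # segment index = number of breakpoints <= x
        out.append(slopes[k] * x + const[k])
    return out


# Different activation functions get different, precomputed table pairs.
RELU_ACT: ActivationTable = ([0.0], [0.0, 1.0])
RELU_CONST: ConstantTable = [0.0, 0.0]
# Hard sigmoid: clip(0.2 * x + 0.5, 0, 1)
HSIG_ACT: ActivationTable = ([-2.5, 2.5], [0.0, 0.2, 0.0])
HSIG_CONST: ConstantTable = [0.0, 0.5, 1.0]
```

Under this layout, switching activation functions only means pointing the instruction at a different table pair; the evaluation loop stays the same.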
FIG. 1-2a shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 1-2a, the operation module 8-12 may include a plurality of activation operators 8-120, which are configured to perform the activation operation on the data to be operated on.
In this implementation, the operation module may also include a single activation operator. The number of activation operators can be set according to the amount of data on which the activation operation is to be performed and the required processing speed and efficiency; the present disclosure does not limit this.
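Distributing the data across a configurable number of activation operators can be modelled in software as below. This is a hedged sketch, not the hardware design: each hypothetical "operator" is a thread-pool worker, the function name `run_on_operators` is illustrative, and Python threads do not give true parallel speed-up; the point is only the split, process, and reassemble-in-order pattern.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def run_on_operators(data: Sequence[float], fn: Callable[[float], float], num_operators: int) -> List[float]:
    """Split `data` across `num_operators` workers and reassemble the results in original order."""
    if num_operators < 1:
        raise ValueError("need at least one activation operator")
    if not data:
        return []
    chunk = -(-len(data) // num_operators)  # ceiling division so every element is covered
    slices = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=num_operators) as pool:
        # Executor.map yields results in submission order, so the output order is preserved.
        parts = list(pool.map(lambda s: [fn(x) for x in s], slices))
    return [y for part in parts for y in part]
```

Setting `num_operators=1` corresponds to the single-operator case; larger values trade more operators for higher throughput.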
FIG. 1-2b shows a block diagram of an activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 1-2b, the operation module 8-12 may include a master operation sub-module 8-121 and a plurality of slave operation sub-modules 8-122, where the master operation sub-module 8-121 includes a plurality of activation operators 8-120 (not shown in the figure).
The master operation sub-module 8-121 is configured to perform the activation operation on the data to be operated on using the plurality of activation operators, obtain the operation result, and store the operation result at the target address.
In a possible implementation, the operation field may further include a read-in amount or the storage address of the read-in amount. The control module 8-11 is further configured to obtain the read-in amount and to obtain a plurality of pieces of data to be operated on according to the read-in amount. The amount of data obtained may be less than or equal to the read-in amount.
In this implementation, the read-in amount may be the data amount, that is, the size, of the data to be operated on that is obtained. When the operation field directly contains the value of the read-in amount, that value is used as the read-in amount. When the operation field contains the storage address of the read-in amount, the read-in amount is read from that storage address.
In a possible implementation, when the operation field does not include a read-in amount, the data to be operated on may be obtained according to a preset default read-in amount. The amount of data obtained may be less than or equal to the default read-in amount.
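The three cases for the read-in amount (immediate value, stored at an address, or absent) can be sketched as follows. The default value of 32, the `AddrRef` wrapper, and the function names are hypothetical; `src` stands in for the memory starting at the data address src0, and slicing naturally returns fewer elements when less data is available, matching "less than or equal to the read-in amount".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

DEFAULT_READ_IN = 32  # hypothetical preset default read-in amount


@dataclass(frozen=True)
class AddrRef:
    """Operand holding the storage address of the read-in amount, not the amount itself."""
    addr: int


SizeField = Optional[Union[int, AddrRef]]


def resolve_read_in(size_field: SizeField, scalar_mem: Dict[int, int]) -> int:
    if size_field is None:
        return DEFAULT_READ_IN               # operation field has no read-in amount
    if isinstance(size_field, AddrRef):
        return scalar_mem[size_field.addr]   # read the amount from its storage address
    return size_field                        # amount given directly in the operation field


def fetch_operands(src: List[float], size_field: SizeField, scalar_mem: Dict[int, int]) -> List[float]:
    """Return at most read-in-amount elements starting at src0 (modelled as `src`)."""
    return src[:resolve_read_in(size_field, scalar_mem)]
```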
In this way, the amount and size of the data to be operated on can be limited, which helps ensure the accuracy of the operation result and ensures that the apparatus is able to execute the activation instruction.
In a possible implementation, as shown in FIGS. 1-2a and 1-2b, the apparatus may further include a storage module 8-13 configured to store the data to be operated on. The storage module 8-13 may also be configured to store the activation table and the constant table.
In this implementation, the storage module may include memory such as one or more of a cache and a register, and the cache may include a scratchpad cache. The data to be operated on, the activation table and the constant table may be stored in the cache and/or registers of the storage module as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may further include a direct memory access (DMA) module configured to read data from, or store data into, the storage module.
In a possible implementation, the control module 8-11 may further be configured to generate an assembly file from the activation instruction and translate the assembly file into a binary file, where the binary file is the compiled activation instruction.
In a possible implementation, the instruction format of the activation instruction may be:
active dst src0 active_table const_table size
Here, active is the operation code of the activation instruction, and dst, src0, active_table, const_table and size are its operation fields: dst is the target address, src0 is the address of the data to be operated on, active_table is the activation table address, const_table is the constant table address, and size is the read-in amount.
In a possible implementation, the instruction format of the activation instruction may alternatively be:
active dst src0 size
Here, active is the operation code of the activation instruction, and dst, src0 and size are its operation fields: dst is the target address, src0 is the address of the data to be operated on, and size is the read-in amount.
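As an illustration only, both textual formats above can be decoded by one small parser. The class `ActiveInstruction` and function `parse_active` are hypothetical names; the sketch assumes whitespace-separated integer operands and leaves both table addresses unset for the short form, in which case the activation parameter table has to be obtained by other means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActiveInstruction:
    dst: int                            # target address
    src0: int                           # address of the data to be operated on
    size: int                           # read-in amount
    active_table: Optional[int] = None  # absent in the short format
    const_table: Optional[int] = None   # absent in the short format


def parse_active(text: str) -> ActiveInstruction:
    """Parse 'active dst src0 active_table const_table size' or 'active dst src0 size'."""
    opcode, *fields = text.split()
    if opcode != "active":
        raise ValueError(f"not an activation instruction: {opcode!r}")
    values = [int(f, 0) for f in fields]  # base 0 accepts decimal and 0x-prefixed operands
    if len(values) == 5:
        dst, src0, act, const, size = values
        return ActiveInstruction(dst, src0, size, act, const)
    if len(values) == 3:
        dst, src0, size = values
        return ActiveInstruction(dst, src0, size)
    raise ValueError(f"expected 3 or 5 operands, got {len(values)}")
```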
It should be understood that those skilled in the art may set the operation code of the activation instruction and the positions of the operation code and operation fields within the instruction format as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that although the activation instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application examples
An application example according to an embodiment of the present disclosure is given below, taking "performing an activation operation with the activation instruction processing apparatus" as an exemplary application scenario, to help understand the workflow of the activation instruction processing apparatus. Those skilled in the art should understand that the following application examples are intended only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 1-3 is a schematic diagram of an application scenario of the activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 1-3, the activation instruction processing apparatus processes an activation instruction as follows:
Example 1
The control module 8-11 compiles the obtained activation instruction 1 to obtain compiled activation instruction 1 (for example, activation instruction 1 is active 500 100 200 300 64). It then parses compiled activation instruction 1 to obtain its operation code and operation fields: the operation code is active, the target address is 500, the address of the data to be operated on is 100, the activation table address is 200, the constant table address is 300, and the read-in amount is 64. The control module 8-11 obtains data to be operated on with a data amount of 64 (the read-in amount) from address 100, the activation table from address 200, and the constant table from address 300.
The operation module 8-12 performs the activation operation on the data to be operated on according to the activation table and the constant table, obtains the operation result, and stores it at target address 500.
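Example 1 can be traced end to end with a toy memory model. This sketch rests on several stated assumptions: each address in the dictionary holds a whole array rather than a single word, the activation table is (breakpoints, slopes) and the constant table is per-segment intercepts (the disclosure does not define their contents), and `execute_active` is a hypothetical name. With the tables chosen below, the result equals ReLU of the first 64 input values.

```python
import bisect
from typing import Any, Dict, List


def execute_active(inst: str, memory: Dict[int, Any]) -> None:
    """Toy model of Example 1: decode the instruction, fetch operands, activate, write back."""
    opcode, *fields = inst.split()
    if opcode != "active" or len(fields) != 5:
        raise ValueError(f"expected 'active dst src0 active_table const_table size', got {inst!r}")
    dst, src0, act_addr, const_addr, size = (int(f) for f in fields)

    data: List[float] = memory[src0][:size]  # fetch at most `size` elements (read-in amount)
    breakpoints, slopes = memory[act_addr]    # activation table (assumed layout)
    intercepts = memory[const_addr]           # constant table (assumed layout)

    result = []
    for x in data:
        seg = bisect.bisect_right(breakpoints, x)  # which linear segment x falls in
        result.append(slopes[seg] * x + intercepts[seg])
    memory[dst] = result                      # store the operation result at the target address


# Usage: the tables at addresses 200/300 encode ReLU under the assumed layout.
mem: Dict[int, Any] = {
    100: [float(i - 32) for i in range(80)],  # more data than the read-in amount of 64
    200: ([0.0], [0.0, 1.0]),
    300: [0.0, 0.0],
}
execute_active("active 500 100 200 300 64", mem)
```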
Example 2
This example differs from Example 1 in that activation instruction 1 is active 500 100 64. Assuming the activation operation needs to be performed according to an activation parameter table, the control module 8-11 must obtain the activation parameter table (see the description above for the specific implementation).
For the working process of each module, refer to the related description above.
In this way, the activation instruction processing apparatus can process activation instructions efficiently and quickly, enabling efficient and fast activation operations.
FIG. 1-4 shows a flowchart of an activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1-4, the method is applied to the activation instruction processing apparatus described above and includes step S51-8 and step S52-8. The method can be applied to computer equipment or the like that includes a memory and a processor, where the memory stores the data used during execution of the method, and the processor performs the related processing and operation steps, such as steps S51-8 and S52-8 below.
In step S51-8, the control module compiles the obtained activation instruction to obtain a compiled activation instruction, parses the compiled activation instruction to obtain its operation code and operation fields, and obtains, according to the operation code and operation fields, the data to be operated on and the target address required to execute the activation instruction. The operation code indicates that the operation performed on the data by the activation instruction is an activation operation, and the operation fields include the address of the data to be operated on and the target address.
In step S52-8, the operation module performs the activation operation on the data to be operated on to obtain the operation result, and stores the operation result at the target address.
In a possible implementation, the method may further include:
obtaining an activation parameter table according to the operation code and/or the operation field;
where performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result includes:
performing the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result,
where the activation parameter table may include an activation table and a constant table.
In a possible implementation, performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result may include:
performing the activation operation on the data to be operated on using a plurality of activation operators.
In a possible implementation, the operation module may include a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module may include a plurality of activation operators,
where performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result may include:
performing the activation operation on the data to be operated on using the plurality of activation operators in the master operation sub-module to obtain the operation result, and storing the operation result at the target address.
In a possible implementation, the operation field may further include a read-in amount or the storage address of the read-in amount, and obtaining, according to the operation code and the operation field, the data to be operated on, the activation table, the constant table and the target address required to execute the activation instruction may include:
obtaining the read-in amount, and obtaining a plurality of pieces of data to be operated on according to the read-in amount.
In a possible implementation, the method may further include: storing the data to be operated on.
In a possible implementation, parsing the compiled activation instruction to obtain its operation code and operation fields may include:
storing the compiled activation instruction;
parsing the compiled activation instruction to obtain its operation code and operation fields; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled activation instruction.
In a possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency on a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and after it is determined that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
where the first instruction to be executed having a dependency on the zeroth instruction to be executed that precedes it includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
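The dependency condition above reduces to an interval-overlap test on storage addresses. The sketch below assumes half-open address intervals [start, end) and represents each instruction by the list of intervals of all data it needs; `PendingInstruction`, `depends_on` and `must_wait` are hypothetical names, and the address values in the usage check are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address range [start, end) (assumed convention)


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class PendingInstruction:
    name: str
    ranges: List[Interval]  # address intervals of all data the instruction reads or writes


def depends_on(first: PendingInstruction, zeroth: PendingInstruction) -> bool:
    """True if any storage interval of `first` overlaps any storage interval of `zeroth`."""
    return any(overlaps(a, b) for a in first.ranges for b in zeroth.ranges)


def must_wait(first: PendingInstruction, earlier: List[PendingInstruction]) -> bool:
    """True if `first` must stay cached until some earlier, unfinished instruction completes."""
    return any(depends_on(first, z) for z in earlier)
```

An instruction for which `must_wait` is true would be held in the instruction cache and issued only once the conflicting earlier instruction has completed; non-conflicting instructions need not wait.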
In a possible implementation, compiling, with the control module, the obtained activation instruction to obtain the compiled activation instruction includes:
generating an assembly file according to the activation instruction and translating the assembly file into a binary file, where the binary file is the compiled activation instruction.
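The disclosure does not define the binary encoding of the compiled activation instruction, so the following is only a hypothetical example of translating assembly lines into a binary file. It assumes a one-byte opcode (the value 0x21 is invented) followed by five little-endian 32-bit operands, and pads the short three-operand form with 0xFFFFFFFF to mark "no table address".

```python
import struct
from typing import List

# Hypothetical opcode numbering; the disclosure does not define a binary encoding.
OPCODES = {"active": 0x21}
NO_ADDR = 0xFFFFFFFF  # sentinel for "table address not given" (assumption)
RECORD = struct.Struct("<B5I")  # 1-byte opcode + 5 x uint32, little-endian, no padding (21 bytes)


def assemble(lines: List[str]) -> bytes:
    """Translate assembly lines into a flat binary image, one fixed-size record per instruction."""
    out = bytearray()
    for line in lines:
        mnemonic, *operands = line.split()
        values = [int(v, 0) for v in operands]
        if len(values) == 3:  # short form: active dst src0 size
            values = [values[0], values[1], NO_ADDR, NO_ADDR, values[2]]
        if len(values) != 5:
            raise ValueError(f"bad operand count in {line!r}")
        out += RECORD.pack(OPCODES[mnemonic], *values)
    return bytes(out)
```

Fixed-size records keep decoding trivial (record i starts at byte 21 * i), which is one common design choice; real instruction sets often pack fields more tightly.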
In a possible implementation, the activation function used by the activation operation may include at least one of the following:
a rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky ReLU function, a maximum (max) function and a power function.
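Scalar reference versions of the listed activation functions are sketched below for clarity. The leaky-ReLU slope (0.01), the power-function exponent (2.0) and the reading of the "maximum function" as a maxout-style maximum over a group of inputs are assumptions, since the disclosure names the functions without parameters.

```python
import math
from typing import Sequence


def relu(x: float) -> float:
    return max(x, 0.0)


def sigmoid(x: float) -> float:
    # Branch on the sign so math.exp never receives a large positive argument (avoids overflow).
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh(x: float) -> float:
    return math.tanh(x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:  # alpha is an assumed default
    return x if x >= 0.0 else alpha * x


def max_fn(group: Sequence[float]) -> float:
    # Interpreted here as a maxout-style maximum over a group of inputs (assumption).
    return max(group)


def power(x: float, p: float = 2.0) -> float:  # exponent p is an assumed default
    return x ** p
```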
It should be noted that although the activation instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The activation instruction processing method provided by the embodiments of the present disclosure has a wide range of application, processes activation instructions efficiently and quickly, and performs activation operations efficiently and quickly.
The foregoing may be better understood in light of the following clauses:
Clause A1. An activation instruction processing apparatus, the apparatus comprising:
a control module configured to compile an obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain an operation code and an operation field of the activation instruction, and obtain, according to the operation code and the operation field, data to be operated on and a target address required to execute the activation instruction; and
an operation module configured to perform an activation operation on the data to be operated on to obtain an operation result, and store the operation result at the target address,
wherein the operation code indicates that the operation performed on data by the activation instruction is an activation operation, and the operation field includes an address of the data to be operated on and the target address.
Clause A2. The apparatus according to Clause A1, wherein
the control module is further configured to obtain an activation parameter table according to the operation code and/or the operation field;
the operation module is further configured to perform the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result,
wherein the activation parameter table includes an activation table and a constant table.
Clause A3. The apparatus according to Clause A1, wherein the operation module comprises:
a plurality of activation operators configured to perform the activation operation on the data to be operated on.
Clause A4. The apparatus according to Clause A3, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module comprising the plurality of activation operators,
the master operation sub-module being configured to perform the activation operation on the data to be operated on using the plurality of activation operators to obtain the operation result, and store the operation result at the target address.
Clause A5. The apparatus according to Clause A1, wherein the operation field includes a read-in amount or a storage address of the read-in amount,
and the control module is further configured to obtain the read-in amount and obtain the data to be operated on according to the read-in amount.
Clause A6. The apparatus according to Clause A1, further comprising:
a storage module configured to store the data to be operated on.
Clause A7. The apparatus according to Clause A1, wherein the control module comprises:
an instruction storage sub-module configured to store the compiled activation instruction;
an instruction processing sub-module configured to parse the compiled activation instruction to obtain the operation code and the operation field of the activation instruction; and
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled activation instruction.
Clause A8. The apparatus according to Clause A7, wherein the control module further comprises:
a dependency processing sub-module configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency on a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module, and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
wherein the first instruction to be executed having a dependency on the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing data required by the first instruction to be executed overlapping a zeroth storage address interval storing data required by the zeroth instruction to be executed.
Clause A9. The apparatus according to Clause A1, wherein
the control module is further configured to generate an assembly file according to the activation instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled activation instruction.
Clause A10. The apparatus according to any one of Clauses A1 to A9, wherein the activation function used by the activation operation includes at least one of the following:
a rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky ReLU function, a maximum (max) function and a power function.
Clause A11. A machine learning computing apparatus, the apparatus comprising:
one or more activation instruction processing apparatuses according to any one of Clauses A1 to A10, configured to obtain data to be operated on and control information from other processing apparatuses, perform a specified machine learning operation, and pass the execution result to the other processing apparatuses through an I/O interface;
when the machine learning computing apparatus includes a plurality of the activation instruction processing apparatuses, the plurality of activation instruction processing apparatuses may be connected to one another and transmit data through a specific structure;
wherein the plurality of activation instruction processing apparatuses are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of activation instruction processing apparatuses share the same control system or have their own control systems; the plurality of activation instruction processing apparatuses share memory or have their own memory; and the plurality of activation instruction processing apparatuses are interconnected in an arbitrary interconnection topology.
Clause A12. A combined processing apparatus, the combined processing apparatus comprising:
the machine learning computing apparatus according to Clause A11, a universal interconnection interface, and other processing apparatuses;
wherein the machine learning computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
and wherein the combined processing apparatus further comprises a storage apparatus, connected to the machine learning computing apparatus and to the other processing apparatuses respectively, configured to store data of the machine learning computing apparatus and the other processing apparatuses.
Clause A13. A machine learning chip, the machine learning chip comprising:
the machine learning computing apparatus according to Clause A11 or the combined processing apparatus according to Clause A12.
Clause A14. An electronic device, the electronic device comprising:
the machine learning chip according to Clause A13.
Clause A15. A board card, the board card comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause A13;
wherein the machine learning chip is connected to the storage device, the control device and the interface apparatus respectively;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause A16. An activation instruction processing method, applied to an activation instruction processing apparatus, the method comprising:
compiling, with a control module, an obtained activation instruction to obtain a compiled activation instruction, parsing the compiled activation instruction to obtain an operation code and an operation field of the activation instruction, and obtaining, according to the operation code and the operation field, data to be operated on and a target address required to execute the activation instruction; and
performing, with an operation module, an activation operation on the data to be operated on to obtain an operation result, and storing the operation result at the target address,
wherein the operation code indicates that the operation performed on data by the activation instruction is an activation operation, and the operation field includes an address of the data to be operated on and the target address.
Clause A17. The method according to Clause A16, further comprising:
obtaining an activation parameter table according to the operation code and/or the operation field;
wherein performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result comprises:
performing the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result,
wherein the activation parameter table includes an activation table and a constant table.
Clause A18. The method according to Clause A16, wherein performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result comprises:
performing the activation operation on the data to be operated on using a plurality of activation operators.
Clause A19. The method according to Clause A18, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module comprising the plurality of activation operators,
and wherein performing, with the operation module, the activation operation on the data to be operated on to obtain the operation result comprises:
performing the activation operation on the data to be operated on using the plurality of activation operators in the master operation sub-module to obtain the operation result, and storing the operation result at the target address.
Clause A20. The method according to Clause A16, wherein the operation field further comprises a read-in amount or a storage address of the read-in amount,
wherein obtaining, according to the opcode and the operation field, the data to be operated on, the activation table, the constant table and the target address required to execute the activation instruction comprises:
obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
Clause A21. The method according to Clause A16, further comprising:
storing the data to be operated on.
Clause A22. The method according to Clause A16, wherein parsing the compiled activation instruction to obtain the opcode and the operation field of the activation instruction comprises:
storing the compiled activation instruction;
parsing the compiled activation instruction to obtain the opcode and the operation field of the activation instruction; and
storing an instruction queue, the instruction queue comprising a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled activation instruction.
Clause A23. The method according to Clause A22, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction and, after determining that execution of the zeroth instruction has completed, controlling execution of the first instruction,
wherein the first instruction having an association relationship with the preceding zeroth instruction comprises:
a first storage address interval that stores the data required by the first instruction overlapping a zeroth storage address interval that stores the data required by the zeroth instruction.
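The association test in Clause A23 reduces to an interval-overlap check. The sketch below models each instruction's data as half-open address intervals [start, end). This is an assumption, because the clause does not say whether interval ends are inclusive. It also treats "caching" the first instruction as holding it while any associated earlier instruction is still unfinished. The `Instr` class and the helper names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    # Half-open address intervals [start, end) holding the data the instruction reads or writes.
    ranges: list = field(default_factory=list)

def overlaps(a, b):
    """True if half-open intervals a and b share at least one address."""
    return a[0] < b[1] and b[0] < a[1]

def associated(first, zeroth):
    """`first` is associated with `zeroth` if any of their address intervals overlap."""
    return any(overlaps(r1, r0) for r1 in first.ranges for r0 in zeroth.ranges)

def must_wait(first, unfinished_earlier):
    """Hold (cache) `first` while any earlier, still-running instruction is associated with it."""
    return any(associated(first, z) for z in unfinished_earlier)

relu1 = Instr("relu1", [(100, 164), (500, 564)])  # reads 100..163, writes 500..563
relu2 = Instr("relu2", [(500, 564), (700, 764)])  # reads relu1's output
relu3 = Instr("relu3", [(200, 264), (800, 864)])  # touches unrelated addresses
print(must_wait(relu2, [relu1]), must_wait(relu3, [relu1]))  # True False
```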
Clause A24. The method according to Clause A16, wherein compiling, by the control module, the obtained activation instruction to obtain the compiled activation instruction comprises:
generating an assembly file according to the activation instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled activation instruction.
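Clause A24 names the two compilation stages but not the binary format. The following sketch assumes a made-up encoding purely for illustration: a one-byte opcode, a one-byte operand count, and each operand as a little-endian 32-bit unsigned integer, packed with Python's standard `struct` module. The opcode number in `OPCODES` is hypothetical.

```python
import struct

# Hypothetical opcode numbering; the disclosure does not define a binary encoding.
OPCODES = {"active.relu": 0x01}

def assemble_line(line):
    """Encode one assembly line such as 'active.relu 500 100 64'."""
    mnemonic, *operands = line.split()
    values = [int(v, 0) for v in operands]  # base 0 accepts decimal or 0x-prefixed hex
    # '<' = little-endian, no padding; B = 1-byte unsigned, I = 4-byte unsigned
    return struct.pack(f"<BB{len(values)}I", OPCODES[mnemonic], len(values), *values)

def assemble(source):
    """Translate a whole assembly file (one instruction per line) into one binary blob."""
    return b"".join(assemble_line(line) for line in source.splitlines() if line.strip())

binary = assemble("active.relu 500 100 64\n")
print(binary.hex())  # 0103f40100006400000040000000
```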
Clause A25. The method according to any one of Clauses A16 to A24, wherein the activation function used by the activation operation comprises at least one of the following:
a rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky ReLU function, a maximum function and a power function.
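For reference, most of the functions listed in Clause A25 can be written as scalar Python functions. The defaults for the leak coefficient `alpha` and the exponent `p` are assumptions. The maximum function is omitted because the clause does not say which operands it compares.

```python
import math

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x), arranged to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def tanh(x):
    return math.tanh(x)

def leaky_relu(x, alpha=0.01):
    """x for x >= 0, alpha * x otherwise (alpha default is an assumption)."""
    return x if x >= 0 else alpha * x

def power(x, p=2.0):
    """x raised to the power p (p default is an assumption)."""
    return x ** p

ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid, "tanh": tanh,
               "leaky_relu": leaky_relu, "power": power}

print([ACTIVATIONS["relu"](v) for v in (-1.5, 0.0, 2.0)])  # [0.0, 0.0, 2.0]
```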
FIG. 2-1 shows a block diagram of a rectified linear unit (ReLU) activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2-1, the apparatus includes a control module 6-11 and an arithmetic module 6-12.
The control module 6-11 is configured to compile an obtained ReLU activation instruction to obtain a compiled ReLU activation instruction, parse the compiled instruction to obtain the opcode and operation field of the ReLU activation instruction, and obtain, according to the opcode and operation field, the data to be operated on and the target address required to execute the ReLU activation instruction.
The opcode indicates that the activation operation performed on the data by the ReLU activation instruction is a ReLU activation operation. The operation field includes the address of the data to be operated on and the target address.
The arithmetic module 6-12 is configured to perform the ReLU activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address.
In this embodiment, the ReLU activation instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. The compiled ReLU activation instruction can be parsed only after compilation is complete. The compiled ReLU activation instruction is a hardware instruction that the hardware can execute directly.
In this embodiment, the compiled ReLU activation instruction may be an instruction that corresponds uniquely to the ReLU activation instruction. Alternatively, the compiled instruction may be a general instruction covering all activation operations. In that case the apparatus can perform the ReLU activation operation without first determining which activation function the instruction corresponds to, which simplifies processing.
In this embodiment, the control module can obtain the data to be operated on from the address of that data. The control module may determine, according to the opcode of the ReLU activation instruction, the data required to perform the ReLU activation operation. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented by a code), that a computer program uses to specify the operation to be performed. It serves as an instruction serial number that tells the executing device which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. That data includes the data to be operated on as well as the corresponding operation method. A ReLU activation instruction must include an opcode and an operation field, and the operation field includes at least the address of the data to be operated on and the target address.
It should be understood that those skilled in the art may set the instruction format of the ReLU activation instruction, together with the opcode and operation field it contains, as needed. The present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more arithmetic modules. Their numbers may be set according to actual needs, and the present disclosure does not limit this. When the apparatus includes one control module, that control module may receive the ReLU activation instruction and control one or more arithmetic modules to perform the ReLU activation operation. When the apparatus includes multiple control modules, each of them may receive ReLU activation instructions and control the corresponding one or more arithmetic modules to perform the ReLU activation operation.
An embodiment of the present disclosure provides a ReLU activation instruction processing apparatus that includes a control module and an arithmetic module. The control module compiles an obtained ReLU activation instruction and parses the compiled instruction to obtain its opcode and operation field. According to these, it obtains the data to be operated on and the target address required to execute the instruction. The arithmetic module performs the ReLU activation operation on the data to be operated on to obtain an operation result, and stores the result at the target address. The apparatus has a wide range of applications and processes both ReLU activation instructions and the ReLU activation operations they specify efficiently and quickly.
In a possible implementation, the control module 6-11 may also be configured to obtain a ReLU parameter table according to the opcode and/or the operation field.
The arithmetic module 6-12 may also be configured to perform the ReLU activation operation on the data to be operated on according to the ReLU parameter table to obtain the operation result.
The ReLU parameter table may include a ReLU activation table and a ReLU constant table.
In this implementation, the operation field may include the address of the ReLU parameter table, so that the control module obtains the ReLU parameter table from that address. Alternatively, when the control module determines from the opcode that executing the ReLU activation instruction requires a ReLU parameter table, it may obtain the table directly from a predetermined storage address of the ReLU parameter table. As a further alternative, in the same situation, the control module may obtain the ReLU parameter table that corresponds to this ReLU activation instruction directly from a predetermined storage address of the parameter tables. Those skilled in the art may choose how the ReLU parameter table is obtained according to actual needs. The present disclosure does not limit this.
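The three acquisition options above can be summarized as a lookup order. In this sketch the operation field is modeled as a Python dict, and the `src1` key stands for a table address carried in the instruction. The set of opcodes that need a table and both predetermined addresses are hypothetical placeholders, because the disclosure gives no concrete values.

```python
# Hypothetical predetermined storage addresses (not given in the disclosure).
SHARED_TABLE_ADDR = 0x100                            # one table shared by all instructions
PER_INSTRUCTION_TABLE_ADDR = {"active.relu": 0x200}  # table selected per instruction type
NEEDS_TABLE = {"active.relu"}                        # opcodes assumed to require a table

def locate_param_table(opcode, field, per_instruction=True):
    """Return the address of the parameter table, or None if no table is needed."""
    if "src1" in field:                # option 1: address carried in the operation field
        return field["src1"]
    if opcode not in NEEDS_TABLE:      # the opcode indicates that no table is required
        return None
    if per_instruction:                # option 3: table tied to this instruction
        return PER_INSTRUCTION_TABLE_ADDR[opcode]
    return SHARED_TABLE_ADDR           # option 2: single predetermined table

print(locate_param_table("active.relu", {"src1": 300}))  # 300
```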
In a possible implementation, the control module may also obtain the activation function corresponding to the ReLU activation instruction. The arithmetic module can then perform the ReLU activation operation on the data to be operated on according to that activation function and the corresponding operator.
It should be noted that those skilled in the art may choose how the arithmetic module implements the ReLU activation operation according to actual needs. The present disclosure does not limit this.
FIG. 2-2a shows a block diagram of a ReLU activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 2-2a, the arithmetic module 6-12 may include a plurality of activation operators 6-120, which are configured to perform the ReLU activation operation on the data to be operated on.
In this implementation, the arithmetic module may instead include a single activation operator. The number of activation operators may be set according to the amount of data on which the ReLU activation operation is performed and to the required processing speed and efficiency. The present disclosure does not limit this.
FIG. 2-2b shows a block diagram of a ReLU activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 2-2b, the arithmetic module 6-12 may include a master operation sub-module 6-121 and a plurality of slave operation sub-modules 6-122. The master operation sub-module 6-121 includes a plurality of activation operators 6-120 (not shown in the figure).
The master operation sub-module 6-121 is configured to perform the ReLU activation operation on the data to be operated on using the plurality of activation operators, obtain the operation result, and store the operation result at the target address.
In a possible implementation, the operation field may further include a read-in amount or a storage address of the read-in amount. In this case, the control module 6-11 is further configured to obtain the read-in amount and to obtain the data to be operated on according to it. The amount of data obtained may be less than or equal to the read-in amount.
In this implementation, the read-in amount may be the amount of data to be operated on that is obtained, that is, the size of the obtained data. When the operation field directly contains the specific value of the read-in amount, that value is used as the read-in amount. When the operation field contains the storage address of the read-in amount, the read-in amount is read from that address.
In a possible implementation, when the operation field does not include the read-in amount, the data to be operated on may be obtained according to a preset default read-in amount. The amount of data obtained may be less than or equal to the default read-in amount.
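The three ways of determining the read-in amount described above can be sketched as follows. The operation field is modeled as a dict with hypothetical keys `size` (immediate value), `size_addr` (address of the value) and `src0` (data address). Memory is a flat Python list in which each address holds one item, and the default amount of 32 is an assumed placeholder.

```python
DEFAULT_READ_IN = 32  # hypothetical preset default read-in amount

def resolve_read_in(field, memory):
    """Read-in amount: immediate value, value stored at an address, or the default."""
    if "size" in field:           # the operation field carries the value directly
        return field["size"]
    if "size_addr" in field:      # the operation field carries the value's storage address
        return memory[field["size_addr"]]
    return DEFAULT_READ_IN        # the operation field omits the read-in amount

def fetch_operands(field, memory):
    """Fetch at most read-in-amount items starting at src0.

    Slicing stops at the end of memory, so the amount fetched is <= the read-in amount.
    """
    n = resolve_read_in(field, memory)
    src = field["src0"]
    return memory[src:src + n]

memory = list(range(200))
print(len(fetch_operands({"src0": 100, "size": 64}, memory)))      # 64
print(len(fetch_operands({"src0": 100, "size_addr": 7}, memory)))  # 7 (memory[7] == 7)
print(len(fetch_operands({"src0": 190}, memory)))                  # 10 (default 32, 10 remain)
```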
In this way, the amount and size of the data to be operated on can be limited. This ensures the accuracy of the operation result and ensures that the apparatus can execute the ReLU activation instruction.
In a possible implementation, as shown in FIGS. 2-2a and 2-2b, the apparatus may further include a storage module 6-13. The storage module 6-13 is configured to store the data to be operated on and may also store the ReLU parameter table.
In this implementation, the storage module may include memory such as one or more of a cache and registers, and the cache may include a scratchpad cache. The data to be operated on and the ReLU parameter table may be stored in the cache and/or registers of the storage module as needed. The present disclosure does not limit this.
In a possible implementation, the apparatus may further include a direct memory access (DMA) module configured to read data from, or store data to, the storage module.
In a possible implementation, as shown in FIGS. 2-2a and 2-2b, the control module 6-11 may further be configured to generate an assembly file according to the ReLU activation instruction and translate the assembly file into a binary file. The binary file is the compiled ReLU activation instruction.
In a possible implementation, the instruction format of the ReLU activation instruction may be:
active.relu dst src0 size
Here, active.relu is the opcode of the ReLU activation instruction, and dst, src0 and size form its operation field: dst is the target address, src0 is the address of the data to be operated on, and size is the read-in amount.
In another possible implementation, the instruction format of the ReLU activation instruction may be:
active.relu dst src0 src1 size
Here, active.relu is the opcode of the ReLU activation instruction, and dst, src0, src1 and size form its operation field: dst is the target address, src0 is the address of the data to be operated on, src1 is the address of the ReLU parameter table, and size is the read-in amount.
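A parser that accepts both formats above can tell them apart by operand count. The sketch assumes whitespace-separated decimal operands. The `ReluInstr` type and `parse_relu` function are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReluInstr:
    opcode: str
    dst: int                    # target address
    src0: int                   # address of the data to be operated on
    size: int                   # read-in amount
    src1: Optional[int] = None  # address of the ReLU parameter table, if present

def parse_relu(text):
    """Parse 'active.relu dst src0 size' or 'active.relu dst src0 src1 size'."""
    opcode, *fields = text.split()
    if opcode != "active.relu":
        raise ValueError(f"not a ReLU activation instruction: {opcode}")
    values = [int(f) for f in fields]
    if len(values) == 3:
        dst, src0, size = values
        return ReluInstr(opcode, dst, src0, size)
    if len(values) == 4:
        dst, src0, src1, size = values
        return ReluInstr(opcode, dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(values)}")

print(parse_relu("active.relu 500 100 300 64").src1)  # 300
```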
It should be understood that those skilled in the art may set the opcode of the ReLU activation instruction, and the positions of the opcode and operation field within the instruction format, as needed. The present disclosure does not limit this.
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that although the ReLU activation instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may configure each module flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following application example, based on an embodiment of the present disclosure, uses "performing an activation operation with the ReLU activation instruction processing apparatus" as an exemplary scenario to illustrate the apparatus's workflow. Those skilled in the art should understand that this example only serves to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 2-3 shows a schematic diagram of an application scenario of the ReLU activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2-3, the apparatus processes a ReLU activation instruction as follows:
As shown in FIG. 2-3, the control module 6-11 compiles the obtained ReLU activation instruction 1 into compiled ReLU activation instruction 1 (for example, active.relu 500 100 64). It then parses the compiled instruction to obtain its opcode and operation field: the opcode is active.relu, the target address is 500, the address of the data to be operated on is 100, and the read-in amount is 64. The control module 6-11 fetches data to be operated on with a data amount of 64 (the read-in amount) from address 100. If the activation operation is to be performed according to the ReLU parameter table, the control module 6-11 also obtains the ReLU parameter table (see the description above for the specific process).
The arithmetic module 6-12 performs the ReLU activation operation on the data to be operated on according to the ReLU parameter table, obtains the operation result, and stores it at target address 500.
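The following sketch replays `active.relu 500 100 64` on a simulated memory to make the walk-through concrete. It assumes a flat, word-addressed memory in which each address holds one data item. It applies plain ReLU directly rather than going through the parameter table, because the table's contents are not given. The function name `run_active_relu` is hypothetical.

```python
def run_active_relu(instruction, memory):
    """Execute 'active.relu dst src0 size' on a flat, word-addressed memory (assumption)."""
    opcode, *fields = instruction.split()
    if opcode != "active.relu" or len(fields) != 3:
        raise ValueError(f"unsupported instruction: {instruction!r}")
    dst, src0, size = (int(f) for f in fields)
    data = memory[src0:src0 + size]             # control module: fetch `size` items from src0
    result = [x if x > 0 else 0 for x in data]  # arithmetic module: ReLU, no parameter table
    memory[dst:dst + len(result)] = result      # store the result at the target address
    return result

memory = [0] * 1024
memory[100:164] = [i - 32 for i in range(64)]   # inputs -32 .. 31 starting at address 100
out = run_active_relu("active.relu 500 100 64", memory)
print(out[30:34])  # [0, 0, 0, 1]
```

The source data at address 100 is left unchanged; only the 64 words starting at address 500 are overwritten.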
The working process of each module above is as described earlier.
In this way, the ReLU activation instruction processing apparatus can process ReLU activation instructions, and thus the corresponding ReLU activation operations, efficiently and quickly.
FIG. 2-4 shows a flowchart of a ReLU activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 2-4, the method is applied to the ReLU activation instruction processing apparatus described above and includes step S51-6 and step S52-6. The method may be applied to computer equipment that includes a memory and a processor. The memory stores data used while the method executes, and the processor performs the related processing and operation steps, such as steps S51-6 and S52-6 below.
In step S51-6, the control module compiles the obtained ReLU activation instruction to obtain a compiled ReLU activation instruction and parses the compiled instruction to obtain its opcode and operation field. According to these, it obtains the data to be operated on and the target address required to execute the ReLU activation instruction. The opcode indicates that the activation operation performed on the data is a ReLU activation operation, and the operation field includes the address of the data to be operated on and the target address.
In step S52-6, the arithmetic module performs the ReLU activation operation on the data to be operated on to obtain the operation result, and stores the operation result at the target address.
In a possible implementation, the method may further include:
obtaining a ReLU parameter table according to the opcode and/or the operation field;
wherein performing, by the arithmetic module, the ReLU activation operation on the data to be operated on to obtain the operation result includes: performing the ReLU activation operation on the data to be operated on according to the ReLU parameter table to obtain the operation result. The ReLU parameter table may include a ReLU activation table and a ReLU constant table.
In a possible implementation, performing, by the arithmetic module, the ReLU activation operation on the data to be operated on to obtain the operation result may include: performing the ReLU activation operation on the data to be operated on using a plurality of activation operators.
In a possible implementation, the arithmetic module may include a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module may include the plurality of activation operators.
In this case, performing, by the arithmetic module, the activation operation on the data to be operated on to obtain the operation result may include:
performing, by the plurality of activation operators in the master operation sub-module, the ReLU activation operation on the data to be operated on to obtain the operation result.
In a possible implementation, the operation field may further include a read-in amount or a storage address of the read-in amount. In this case, obtaining, according to the opcode and operation field, the data to be operated on and the target address required to execute the ReLU activation instruction may include: obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
In a possible implementation, the method may further include: storing the data to be operated on.
In a possible implementation, parsing the compiled ReLU activation instruction to obtain its opcode and operation field may include:
storing the compiled ReLU activation instruction;
parsing the compiled ReLU activation instruction to obtain its opcode and operation field; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions including the compiled ReLU activation instruction.
在一种可能的实现方式中,该方法还可以包括:在确定多个待执行指令中的第一待执行指令与第一待执行指令之前的第零待执行指令存在关联关系时,缓存第一待执行指令,并在确定第零待执行指令执行完毕后,控制进行第一待执行指令的执行,In a possible implementation manner, the method may further include: when determining that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first The instruction to be executed, and after determining that the execution of the zeroth instruction to be executed is completed, the execution of the first instruction to be executed is controlled,
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在关联关系包括:Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing data required for the first instruction to be executed has an overlapping area with the zeroth storage address interval storing data required for the zeroth instruction to be executed.
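The association test above amounts to an interval-overlap check. The sketch below assumes that each instruction's data occupies half-open address intervals `[start, end)`, and its helper names are hypothetical. It illustrates only the dependency condition, not the apparatus's actual scheduling logic.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) address range (assumption)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def has_dependency(first: List[Interval], zeroth: List[Interval]) -> bool:
    """The first instruction depends on the zeroth one if any interval it
    touches overlaps any interval touched by the zeroth instruction."""
    return any(overlaps(x, y) for x in first for y in zeroth)


def can_issue(first: List[Interval], in_flight: List[List[Interval]]) -> bool:
    """Issue only if there is no dependency on any earlier, still-executing
    instruction; otherwise the instruction stays cached until they finish."""
    return not any(has_dependency(first, earlier) for earlier in in_flight)
```

Half-open intervals make adjacent regions such as `[0, 64)` and `[64, 128)` independent. Writing the intervals this way avoids an off-by-one false dependency.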
In one possible implementation, compiling the obtained linear rectification function activation instruction with the control module, to obtain the compiled instruction, may include two steps. First, an assembly file is generated according to the linear rectification function activation instruction. Second, the assembly file is translated into a binary file, and this binary file is the compiled linear rectification function activation instruction.
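The assembly-to-binary step could look roughly like the toy assembler below. The mnemonics, the opcode values (`0x21`, `0x22`) and the 16-byte little-endian word layout are all assumptions made for illustration. The disclosure does not specify an encoding.

```python
import struct

# Hypothetical opcode table; the real encodings are not disclosed.
OPCODES = {"RELU": 0x21, "SIGMOID": 0x22}


def assemble_line(line: str) -> bytes:
    """Translate one assembly line such as 'RELU 0x200, 0x100, 64' into a
    16-byte little-endian word: opcode, dst, src, count (each uint32)."""
    mnemonic, operands = line.strip().split(None, 1)
    dst, src, count = (int(tok.strip(), 0) for tok in operands.split(","))
    return struct.pack("<IIII", OPCODES[mnemonic.upper()], dst, src, count)


def assemble(source: str) -> bytes:
    """Assemble a whole 'assembly file' (one instruction per line) into binary."""
    return b"".join(assemble_line(ln) for ln in source.splitlines() if ln.strip())
```

`int(tok, 0)` accepts both hexadecimal (`0x200`) and decimal operands. This lets the toy assembly text mix the two notations.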
It should be noted that the linear rectification function activation instruction processing method has been described above using the foregoing embodiments as examples. Those skilled in the art will nevertheless understand that the present disclosure is not limited thereto. In practice, users may configure each step flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The linear rectification function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of applications. It processes both the linear rectification function activation instruction and the linear rectification function activation operation efficiently and quickly.
The foregoing may be better understood in view of the following clauses:
Clause B1. A linear rectification function activation instruction processing apparatus, the apparatus comprising:
a control module, configured to:
- compile an obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction;
- parse the compiled instruction to obtain an operation code and an operation field of the linear rectification function activation instruction; and
- obtain, according to the operation code and the operation field, the data to be operated on and the target address required to execute the linear rectification function activation instruction; and
an operation module, configured to perform a linear rectification function activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address,
wherein the operation code indicates that the activation operation performed on data by the linear rectification function activation instruction is a linear rectification function activation operation, and the operation field includes a to-be-operated data address and the target address.
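The linear rectification function is the rectified linear unit, ReLU(x) = max(0, x). The sketch below models the division of labour in Clause B1, using a flat Python list as a stand-in for memory. A decoded instruction supplies the operation code and operation field. The operation module then reads the operands, applies ReLU and writes the result to the target address. The `ReluInstruction` structure, the `"RELU"` mnemonic and the element-count field are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReluInstruction:
    """Hypothetical decoded form: operation code plus operation field."""
    opcode: str    # assumed mnemonic, e.g. "RELU"
    src_addr: int  # to-be-operated data address
    dst_addr: int  # target address
    count: int     # number of elements (assumed to be in the operation field)


def execute(inst: ReluInstruction, memory: list) -> None:
    """Operation-module sketch: read operands, apply ReLU, store at the target."""
    if inst.opcode != "RELU":
        raise ValueError(f"unsupported opcode {inst.opcode!r}")
    data = memory[inst.src_addr:inst.src_addr + inst.count]
    result = [x if x > 0 else 0.0 for x in data]  # ReLU(x) = max(0, x)
    memory[inst.dst_addr:inst.dst_addr + inst.count] = result
```

In this model the source and destination regions may differ, as they do in the test. The slice assignment keeps the length of the "memory" unchanged.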
Clause B2. The apparatus according to Clause B1, wherein
the control module is further configured to obtain a linear rectification activation function parameter table according to the operation code and/or the operation field; and
the operation module is further configured to perform the linear rectification function activation operation on the data to be operated on according to the linear rectification activation function parameter table to obtain the operation result,
wherein the linear rectification activation function parameter table includes a linear rectification activation function activation table and a linear rectification activation function constant table.
Clause B3. The apparatus according to Clause B1, wherein the operation module comprises:
a plurality of activation operators, configured to perform the linear rectification function activation operation on the data to be operated on.
Clause B4. The apparatus according to Clause B3, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module comprises the plurality of activation operators,
the master operation sub-module being configured to use the plurality of activation operators to perform the linear rectification function activation operation on the data to be operated on, to obtain the operation result, and to store the operation result at the target address.
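One way to picture Clause B4 is to split the operands into contiguous slices, one per activation operator in the master operation sub-module. In the sketch below, a thread pool stands in for the hardware operators. Python threads show only the partitioning, not real parallel speed-up. The function names and the default operator count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def relu_operator(chunk):
    """One activation operator: applies ReLU element-wise to its slice."""
    return [max(0.0, x) for x in chunk]


def master_relu(data, num_operators=4):
    """Split operands into contiguous slices (one per operator), run them
    concurrently, and concatenate the partial results in their original order."""
    if num_operators < 1:
        raise ValueError("need at least one activation operator")
    step = -(-len(data) // num_operators)  # ceiling division
    chunks = [data[i:i + step] for i in range(0, len(data), step)] if data else []
    with ThreadPoolExecutor(max_workers=num_operators) as pool:
        parts = list(pool.map(relu_operator, chunks))
    return [y for part in parts for y in part]
```

`pool.map` returns results in input order. This lets the master sub-module reassemble the output without tracking which operator finished first.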
Clause B5. The apparatus according to Clause B1, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
and the control module is further configured to obtain the read-in amount and to obtain the data to be operated on according to the read-in amount.
Clause B6. The apparatus according to Clause B1, further comprising:
a storage module, configured to store the data to be operated on.
Clause B7. The apparatus according to Clause B1, wherein the control module comprises:
an instruction storage sub-module, configured to store the compiled linear rectification function activation instruction;
an instruction processing sub-module, configured to parse the compiled linear rectification function activation instruction to obtain the operation code and the operation field of the linear rectification function activation instruction; and
a queue storage sub-module, configured to store an instruction queue, wherein the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled linear rectification function activation instruction.
Clause B8. The apparatus according to Clause B7, wherein the control module further comprises:
a dependency processing sub-module, configured to act when it is determined that a first instruction to be executed, among the plurality of instructions to be executed, is associated with a zeroth instruction to be executed that precedes the first instruction to be executed. In that case, the dependency processing sub-module caches the first instruction to be executed in the instruction storage sub-module. After the zeroth instruction to be executed has finished executing, it fetches the first instruction to be executed from the instruction storage sub-module and sends it to the operation module,
wherein the first instruction to be executed is associated with the preceding zeroth instruction to be executed when:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause B9. The apparatus according to Clause B1, wherein
the control module is further configured to generate an assembly file according to the linear rectification function activation instruction and to translate the assembly file into a binary file,
wherein the binary file is the compiled linear rectification function activation instruction.
Clause B10. A machine learning operation apparatus, the apparatus comprising:
one or more linear rectification function activation instruction processing apparatuses according to any one of Clauses B1 to B9. The one or more processing apparatuses are configured to obtain data to be operated on and control information from other processing apparatuses, to perform a specified machine learning operation, and to pass the execution result to the other processing apparatuses through an I/O interface;
wherein, when the machine learning operation apparatus includes a plurality of the linear rectification function activation instruction processing apparatuses, the plurality of linear rectification function activation instruction processing apparatuses may be connected through a specific structure and transmit data through that structure;
wherein:
- the plurality of linear rectification function activation instruction processing apparatuses are interconnected, and transmit data, through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations;
- they share one control system or each have their own control system;
- they share memory or each have their own memory; and
- they may be interconnected in any interconnection topology.
Clause B11. A combined processing apparatus, the combined processing apparatus comprising:
the machine learning operation apparatus according to Clause B10, a universal interconnection interface, and other processing apparatuses;
wherein the machine learning operation apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
and wherein the combined processing apparatus further comprises a storage apparatus. The storage apparatus is connected to the machine learning operation apparatus and to the other processing apparatuses, and is configured to store the data of the machine learning operation apparatus and of the other processing apparatuses.
Clause B12. A machine learning chip, the machine learning chip comprising:
the machine learning operation apparatus according to Clause B10 or the combined processing apparatus according to Clause B11.
Clause B13. An electronic device, the electronic device comprising:
the machine learning chip according to Clause B12.
Clause B14. A board card, the board card comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause B12;
wherein the machine learning chip is connected to the storage device, to the control device and to the interface apparatus;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause B15. A linear rectification function activation instruction processing method, applied to a linear rectification function activation instruction processing apparatus, the method comprising:
using a control module to:
- compile an obtained linear rectification function activation instruction to obtain a compiled linear rectification function activation instruction;
- parse the compiled instruction to obtain an operation code and an operation field of the linear rectification function activation instruction; and
- obtain, according to the operation code and the operation field, the data to be operated on and the target address required to execute the linear rectification function activation instruction; and
performing, by an operation module, a linear rectification function activation operation on the data to be operated on to obtain an operation result, and storing the operation result at the target address,
wherein the operation code indicates that the activation operation performed on data by the linear rectification function activation instruction is a linear rectification function activation operation, and the operation field includes a to-be-operated data address and the target address.
Clause B16. The method according to Clause B15, further comprising:
obtaining a linear rectification activation function parameter table according to the operation code and/or the operation field;
wherein performing, by the operation module, the linear rectification function activation operation on the data to be operated on to obtain the operation result comprises:
performing the linear rectification function activation operation on the data to be operated on according to the linear rectification activation function parameter table to obtain the operation result,
wherein the linear rectification activation function parameter table includes a linear rectification activation function activation table and a linear rectification activation function constant table.
Clause B17. The method according to Clause B15, wherein performing, by the operation module, the linear rectification function activation operation on the data to be operated on to obtain the operation result comprises:
performing the linear rectification function activation operation on the data to be operated on by using a plurality of activation operators.
Clause B18. The method according to Clause B17, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module comprises the plurality of activation operators,
and wherein performing, by the operation module, the linear rectification function activation operation on the data to be operated on to obtain the operation result comprises:
performing the linear rectification function activation operation on the data to be operated on by using the plurality of activation operators in the master operation sub-module to obtain the operation result.
Clause B19. The method according to Clause B15, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
and wherein obtaining, according to the operation code and the operation field, the data to be operated on and the target address required to execute the linear rectification function activation instruction comprises:
obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
Clause B20. The method according to Clause B15, further comprising:
storing the data to be operated on.
Clause B21. The method according to Clause B15, wherein parsing the compiled linear rectification function activation instruction to obtain the operation code and the operation field of the linear rectification function activation instruction comprises:
storing the compiled linear rectification function activation instruction;
parsing the compiled linear rectification function activation instruction to obtain the operation code and the operation field of the linear rectification function activation instruction; and
storing an instruction queue, wherein the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled linear rectification function activation instruction.
Clause B22. The method according to Clause B21, further comprising:
when it is determined that a first instruction to be executed, among the plurality of instructions to be executed, is associated with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and starting execution of the first instruction to be executed once it is determined that the zeroth instruction to be executed has finished executing,
wherein the first instruction to be executed is associated with the preceding zeroth instruction to be executed when:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause B23. The method according to Clause B15, wherein compiling, by the control module, the obtained linear rectification function activation instruction to obtain the compiled linear rectification function activation instruction comprises:
generating an assembly file according to the linear rectification function activation instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled linear rectification function activation instruction.
FIG. 3-1 shows a block diagram of an S-shaped growth curve (sigmoid) function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3-1, the apparatus includes a control module 7-11 and an operation module 7-12.
The control module 7-11 is configured to:
- compile an obtained S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction;
- parse the compiled instruction to obtain the operation code and the operation field of the S-shaped growth curve function activation instruction; and
- obtain, according to the operation code and the operation field, the data to be operated on and the target address required to execute the S-shaped growth curve function activation instruction.
The operation code indicates that the activation operation performed on data by the S-shaped growth curve function activation instruction is an S-shaped growth curve function activation operation. The operation field includes the to-be-operated data address and the target address.
The operation module 7-12 is configured to perform the S-shaped growth curve function activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address.
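The S-shaped growth curve function is the logistic sigmoid, σ(x) = 1 / (1 + e^(-x)). The following is a minimal, numerically stable software model of the activation operation performed by the operation module. As in the earlier sketches, memory is a flat list and the addresses are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow of exp(-x) for large negative x
    return z / (1.0 + z)


def sigmoid_activate(memory: list, src_addr: int, dst_addr: int, count: int) -> None:
    """Apply the S-shaped growth curve activation to `count` elements at
    src_addr and store the results at dst_addr."""
    memory[dst_addr:dst_addr + count] = [
        sigmoid(v) for v in memory[src_addr:src_addr + count]
    ]
```

The sketch branches on the sign of x so that `math.exp` is only ever called with a non-positive argument. Without this branch, an input such as x = -1000 would raise `OverflowError`.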
In this embodiment, the S-shaped growth curve function activation instruction obtained by the control module is an uncompiled software instruction that hardware cannot execute directly. The control module therefore first compiles the (uncompiled) S-shaped growth curve function activation instruction, and only after obtaining the compiled instruction can it parse that instruction. The compiled S-shaped growth curve function activation instruction is a hardware instruction that hardware can execute directly.
In this embodiment, the compiled S-shaped growth curve function activation instruction may correspond uniquely to the S-shaped growth curve function activation instruction. Alternatively, the compiled instruction may encompass all activation operations. In that case, when processing an S-shaped growth curve function activation instruction, the apparatus can perform the S-shaped growth curve function activation operation without first distinguishing which activation function the instruction corresponds to, which simplifies instruction processing.
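The second option can be sketched as a single generic activation opcode whose operation field carries a function selector. The `ActFn` enumeration, its values and the inclusion of tanh are assumptions for illustration. The sigmoid kernel uses the identity σ(x) = (1 + tanh(x/2)) / 2, which avoids overflow for large |x|.

```python
import math
from enum import IntEnum


class ActFn(IntEnum):
    """Hypothetical function selector carried in the operation field."""
    RELU = 0
    SIGMOID = 1
    TANH = 2


_KERNELS = {
    ActFn.RELU: lambda x: max(0.0, x),
    ActFn.SIGMOID: lambda x: 0.5 * (1.0 + math.tanh(0.5 * x)),  # stable sigmoid
    ActFn.TANH: math.tanh,
}


def activate(selector: int, data):
    """One generic activation opcode: decoding does not branch per function;
    the selector simply chooses the kernel the operators apply."""
    kernel = _KERNELS[ActFn(selector)]
    return [kernel(x) for x in data]
```

Under this design, supporting a new activation function means adding a selector value and a kernel, not a new opcode.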
In this embodiment, the control module may obtain the data to be operated on from the to-be-operated data address. The control module may determine, according to the operation code of the S-shaped growth curve function activation instruction, which data are required to perform the S-shaped growth curve function activation operation. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part or field of an instruction (usually represented as a code) that specifies, in a computer program, the operation to be performed. It serves as the instruction's identifier, telling the apparatus executing the instruction which instruction is to be executed. The operation field may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses. These data include the data to be operated on, the corresponding operation method, and so on. An S-shaped growth curve function activation instruction must include an operation code and an operation field, and the operation field includes at least the to-be-operated data address and the target address.
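The sketch below packs and unpacks a 64-bit instruction word to show how an operation code and an operation field can share one instruction. The bit layout (an 8-bit operation code and two 28-bit addresses) is purely hypothetical. The disclosure leaves the instruction format open.

```python
from typing import NamedTuple

# Hypothetical 64-bit layout (not specified in the disclosure):
#   [63:56] operation code | [55:28] to-be-operated data address | [27:0] target address
OPCODE_SHIFT = 56
ADDR_BITS = 28
ADDR_MASK = (1 << ADDR_BITS) - 1


class Decoded(NamedTuple):
    opcode: int
    src_addr: int
    dst_addr: int


def encode(opcode: int, src_addr: int, dst_addr: int) -> int:
    """Pack an operation code and an operation field into one 64-bit word."""
    if not (0 <= opcode < 256 and 0 <= src_addr <= ADDR_MASK and 0 <= dst_addr <= ADDR_MASK):
        raise ValueError("field out of range for the assumed layout")
    return (opcode << OPCODE_SHIFT) | (src_addr << ADDR_BITS) | dst_addr


def decode(word: int) -> Decoded:
    """Split a 64-bit word back into operation code and operation field."""
    return Decoded(
        opcode=(word >> OPCODE_SHIFT) & 0xFF,
        src_addr=(word >> ADDR_BITS) & ADDR_MASK,
        dst_addr=word & ADDR_MASK,
    )
```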
It should be understood that those skilled in the art can set the instruction format of the S-shaped growth curve function activation instruction, and the operation code and operation field it contains, as needed. The present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more operation modules. The number of each may be set according to actual needs, which is not limited in the present disclosure. When the apparatus includes one control module, that control module may receive the S-shaped growth curve function activation instruction and control one or more processing modules to perform the S-shaped growth curve function activation operation. When the apparatus includes multiple control modules, each control module may receive an S-shaped growth curve function activation instruction and control the corresponding one or more processing modules to perform the S-shaped growth curve function activation operation.
An embodiment of the present disclosure provides an S-shaped growth curve function activation instruction processing apparatus that includes a control module and an operation module. The control module is configured to:
- compile an obtained S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction;
- parse the compiled instruction to obtain its operation code and operation field; and
- obtain, according to the operation code and the operation field, the data to be operated on and the target address required to execute the instruction.
The operation module is configured to perform the S-shaped growth curve function activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address. The apparatus has a wide range of applications. It processes both the S-shaped growth curve function activation instruction and the S-shaped growth curve function activation operation efficiently and quickly.
In one possible implementation, the control module 7-11 may further be configured to obtain an S-shaped growth curve activation function parameter table according to the operation code and/or the operation field;
and the operation module may further be configured to perform the S-shaped growth curve function activation operation on the data to be operated on according to the S-shaped growth curve activation function parameter table, to obtain the operation result.
The S-shaped growth curve activation function parameter table may include an S-shaped growth curve activation function activation table and an S-shaped growth curve activation function constant table.
In this implementation, the parameter table may be obtained in any of the following ways:
- The operation field may include the address of the S-shaped growth curve activation function parameter table, so that the control module can obtain the parameter table from that address.
- The control module may determine, from the operation code, that executing the S-shaped growth curve function activation instruction requires the parameter table. It may then obtain the table directly from a predetermined storage address of the S-shaped growth curve activation function parameter table.
- The control module may determine, from the operation code, that the parameter table is required. It may then obtain the S-shaped growth curve activation function parameter table corresponding to the instruction directly from a predetermined storage address of parameter tables.
Those skilled in the art can set the manner of obtaining the S-shaped growth curve activation function parameter table according to actual needs, which is not limited in the present disclosure.
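The disclosure does not say what the activation table and constant table contain. One plausible reading, offered only as an assumption, is a piecewise-linear approximation. In the sketch below, the activation table holds a start point, slope and intercept for each segment. The constant table holds the input range and the saturation outputs used outside that range.

```python
import bisect
import math


def build_tables(lo=-8.0, hi=8.0, segments=32):
    """Hypothetical parameter tables for a piecewise-linear sigmoid:
    activation table = per-segment (start, slope, intercept);
    constant table   = input range bounds and saturation outputs."""
    sig = lambda x: 0.5 * (1.0 + math.tanh(0.5 * x))
    width = (hi - lo) / segments
    act_table = []
    for i in range(segments):
        x0 = lo + i * width
        slope = (sig(x0 + width) - sig(x0)) / width  # chord between breakpoints
        act_table.append((x0, slope, sig(x0) - slope * x0))
    const_table = {"lo": lo, "hi": hi, "below": 0.0, "above": 1.0}
    return act_table, const_table


def table_sigmoid(x, act_table, const_table):
    """Saturate outside the range; otherwise look up the segment containing x
    and evaluate slope * x + intercept."""
    if x < const_table["lo"]:
        return const_table["below"]
    if x >= const_table["hi"]:
        return const_table["above"]
    starts = [seg[0] for seg in act_table]
    _, slope, intercept = act_table[bisect.bisect_right(starts, x) - 1]
    return slope * x + intercept
```

With 32 segments over [-8, 8), the approximation error is on the order of 10^-3. Using more segments trades table size for accuracy.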
在一种可能的实现方式中,控制模块还可以获取对应于S型生长曲线函数激活指令的激活函数,使得运算模块可以根据激活函数以及对应的运算器,对待运算数据进行S型生长曲线函数激活运算。In a possible implementation, the control module can also obtain an activation function corresponding to the S-shaped growth curve function activation instruction, so that the operation module can perform S-shaped growth curve function activation on the operation data according to the activation function and the corresponding operator Operation.
需要说明的是,本领域技术人员可以根据实际需要对实现运算模块实现S型生长曲线函数激活运算的方式进行设置,本公开对此不作限制。It should be noted that those skilled in the art can set the manner in which the calculation module implements the S-shaped growth curve function activation calculation according to actual needs, and the disclosure does not limit this.
图3-2a示出根据本公开一实施例的S型生长曲线函数激活指令处理装置的框图。在一种可能的实现方式中,如图3-2a所示,运算模块7-12可以包括多个激活运算器7-120。多个激活运算器7-120用于对待运算数据进行S型生长曲线函数激活运算。3-2a shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 3-2a, the arithmetic module 7-12 may include multiple activation operators 7-120. A plurality of activation operators 7-120 are used to perform S-shaped growth curve function activation calculation on the data to be calculated.
在该实现方式中,运算模块也可以包括一个激活运算器。可以根据所需进行的S型生长曲线函数激活运算的数据量的大小、对S型生长曲线函数激活运算的处理速度、效率等要求对激活运算器的数量进行设置,本公开对此不作限制。In this implementation, the calculation module may also include an activation calculator. The number of activation operators can be set according to the amount of data required for the S-shaped growth curve function activation operation, the processing speed and efficiency of the S-shaped growth curve function activation operation, and the disclosure does not limit this.
图3-2b示出根据本公开一实施例的S型生长曲线函数激活指令处理装置的框图。在一种可能的实现方式中,如图3-2b所示,运算模块7-12可以包括主运算子模块7-121和多个从运算子模块7-122,主运算子模块7-121包括多个激活运算器7-120(图中未示出)。3-2b shows a block diagram of an S-shaped growth curve function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 3-2b, the operation module 7-12 may include a master operation sub-module 7-121 and a plurality of slave operation sub-modules 7-122, and the master operation sub-module 7-121 includes Multiple activation operators 7-120 (not shown in the figure).
主运算子模块7-121,用于利用多个激活运算器对待运算数据进行S型生长曲线函数激活运算,得到运算结果,并将运算结果存入目标地址中。The main operation sub-module 7-121 is used to perform S-shaped growth curve function activation calculation on the data to be calculated by using a plurality of activation operators to obtain operation results, and store the operation results in the target address.
In a possible implementation, the operation field may further include a read-in amount or the storage address of the read-in amount. The control module 7-11 is further configured to obtain the read-in amount and to obtain the data to be operated on according to the read-in amount. The amount of data obtained may be less than or equal to the read-in amount.
In this implementation, the read-in amount is the amount (size) of the data to be operated on that is fetched. When the operation field directly contains the value of the read-in amount, that value is used as the read-in amount. When the operation field contains the storage address of the read-in amount, the read-in amount is read from that address.
In a possible implementation, when the operation field does not include the read-in amount, the data to be operated on may be obtained according to a preset default read-in amount. In that case, the amount of data obtained may be less than or equal to the default read-in amount.
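The three cases above can be summarized in a short sketch:
- an immediate value;
- a value stored at an address;
- a preset default.

The memory model is a Python dict keyed by address. The default of 64 and the `ReadInField` encoding are hypothetical. Fetching stops early at an unmapped address so that the amount returned never exceeds the read-in amount.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_READ_IN = 64  # hypothetical preset default read-in amount


@dataclass
class ReadInField:
    """Hypothetical encoding of the optional read-in operand."""
    value: Optional[int] = None    # read-in amount given directly in the operation field
    address: Optional[int] = None  # or the address where the read-in amount is stored


def resolve_read_in(field: Optional[ReadInField], memory: dict[int, float]) -> int:
    """Immediate value first, then indirect address, else the preset default."""
    if field is None or (field.value is None and field.address is None):
        return DEFAULT_READ_IN
    if field.value is not None:
        return field.value
    return int(memory[field.address])


def fetch_operands(src: int, read_in: int, memory: dict[int, float]) -> list[float]:
    """Fetch at most read_in elements from src; stop early at the first unmapped address."""
    out = []
    for addr in range(src, src + read_in):
        if addr not in memory:
            break
        out.append(memory[addr])
    return out
```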
In this way, the amount and size of the data to be operated on can be bounded. This helps ensure the accuracy of the operation result and ensures that the apparatus can execute the sigmoid activation instruction.
In a possible implementation, as shown in FIGS. 3-2a and 3-2b, the apparatus may further include a storage module 7-13. The storage module 7-13 is used to store the data to be operated on and may also store the sigmoid activation-function parameter table.
In this implementation, the storage module may include memory such as one or more of a cache and a register, and the cache may include a scratchpad cache. The data to be operated on and the sigmoid activation-function parameter table may be stored in the cache and/or registers of the storage module as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may further include a direct memory access (DMA) module for reading data from and storing data to the storage module.
In a possible implementation, the control module 7-11 is further configured to generate an assembly file from the sigmoid activation instruction and translate the assembly file into a binary file, where the binary file is the compiled sigmoid activation instruction.
In a possible implementation, the instruction format of the sigmoid activation instruction may be:
active.sigmoid dst src0 size
Here, active.sigmoid is the opcode of the sigmoid activation instruction, and dst, src0 and size form its operation field:
- dst is the target address;
- src0 is the address of the data to be operated on;
- size is the read-in amount.

In another possible implementation, the instruction format of the sigmoid activation instruction may be:
active.sigmoid dst src0 src1 size
Here, active.sigmoid is the opcode, and dst, src0, src1 and size form the operation field:
- dst is the target address;
- src0 is the address of the data to be operated on;
- src1 is the address of the sigmoid activation-function parameter table;
- size is the read-in amount.

It should be understood that those skilled in the art may set the opcode of the sigmoid activation instruction, and the positions of the opcode and operation field in the instruction format, as needed; the present disclosure does not limit this.
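A minimal parser for the two textual formats above is sketched below. It assumes operands are written as whitespace-separated decimal integers. The four-operand form is recognized purely by operand count. `SigmoidInstr` and `parse` are illustrative names, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

OPCODE = "active.sigmoid"


@dataclass
class SigmoidInstr:
    dst: int                    # target address
    src0: int                   # address of the data to be operated on
    size: int                   # read-in amount
    src1: Optional[int] = None  # parameter-table address (4-operand form only)


def parse(text: str) -> SigmoidInstr:
    """Split 'active.sigmoid dst src0 [src1] size' into opcode and operation field."""
    opcode, *fields = text.split()
    if opcode != OPCODE:
        raise ValueError(f"unexpected opcode {opcode!r}")
    nums = [int(f) for f in fields]
    if len(nums) == 3:
        dst, src0, size = nums
        return SigmoidInstr(dst, src0, size)
    if len(nums) == 4:
        dst, src0, src1, size = nums
        return SigmoidInstr(dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(nums)}")
```

For example, `parse("active.sigmoid 500 100 64")` yields `dst=500`, `src0=100`, `size=64` and no parameter-table address.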
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that the sigmoid activation instruction processing apparatus has been described above using the foregoing embodiments as examples. However, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application example
The following application example illustrates how the sigmoid activation instruction processing apparatus works. It is based on the exemplary scenario "performing an activation operation with the sigmoid activation instruction processing apparatus". Those skilled in the art should understand that this example is only intended to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 3-3 shows a schematic diagram of an application scenario of the sigmoid activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 3-3, the apparatus processes a sigmoid activation instruction as follows:
As shown in FIG. 3-3, the control module 7-11 compiles the acquired sigmoid activation instruction 1 to obtain the compiled sigmoid activation instruction 1 (for example, active.sigmoid 500 100 64). It then parses the compiled instruction to obtain its opcode and operation field:
- the opcode is active.sigmoid;
- the target address is 500;
- the address of the data to be operated on is 100;
- the read-in amount is 64.

The control module 7-11 fetches data to be operated on, in an amount of 64 (the read-in amount), starting from address 100. If the activation operation is to be performed according to the sigmoid activation-function parameter table, the control module 7-11 also obtains that table (see the description above for details).
The operation module 7-12 performs the sigmoid activation operation on the data to be operated on according to the sigmoid activation-function parameter table, obtains the operation result, and stores the operation result at target address 500.
For details of how each module works, see the relevant description above.
In this way, the sigmoid activation instruction processing apparatus can process sigmoid activation instructions, and perform the sigmoid activation operation, efficiently and quickly.
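The walkthrough above can be replayed end to end on a toy machine, as in the sketch below. It makes several simplifying assumptions:
- Memory is a flat Python list indexed by element rather than by byte.
- The sample inputs are made up.
- Sigmoid is computed directly instead of through the parameter table.

The function `execute` is hypothetical. It only mirrors the fetch, compute and write-back sequence described for the control module and the operation module.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def execute(instr: str, memory: list[float]) -> None:
    """Decode a 3-operand 'active.sigmoid dst src0 size' and run it on a flat memory model."""
    opcode, dst, src0, size = instr.split()
    if opcode != "active.sigmoid":
        raise ValueError(f"unsupported opcode {opcode!r}")
    dst, src0, size = int(dst), int(src0), int(size)
    operands = memory[src0:src0 + size]        # control module: fetch `size` elements from src0
    results = [sigmoid(v) for v in operands]   # operation module: sigmoid activation
    memory[dst:dst + len(results)] = results   # write results back starting at dst


memory = [0.0] * 1024
memory[100:164] = [(i - 32) / 4 for i in range(64)]  # hypothetical inputs in [-8.0, 7.75]
execute("active.sigmoid 500 100 64", memory)
```

After the call, addresses 500 to 563 hold the activations of the 64 inputs, and the rest of memory is unchanged.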
FIG. 3-4 shows a flowchart of a sigmoid activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, the method is applied to the sigmoid activation instruction processing apparatus described above and includes steps S51-7 and S52-7. The method may be applied to a computer device or the like that includes a memory and a processor. The memory stores data used while the method is executed. The processor performs the relevant processing and operation steps, such as steps S51-7 and S52-7 below.
In step S51-7, the control module:
- compiles the acquired sigmoid activation instruction to obtain a compiled sigmoid activation instruction;
- parses the compiled instruction to obtain the opcode and operation field of the sigmoid activation instruction;
- obtains, according to the opcode and operation field, the data to be operated on and the target address required to execute the sigmoid activation instruction.

The opcode indicates that the activation operation the sigmoid activation instruction performs on the data is the sigmoid activation operation. The operation field includes the address of the data to be operated on and the target address.
In step S52-7, the operation module performs the sigmoid activation operation on the data to be operated on to obtain an operation result, and stores the operation result at the target address.
In a possible implementation, the method may further include:
obtaining the sigmoid activation-function parameter table according to the opcode and/or the operation field;
where performing, by the operation module, the sigmoid activation operation on the data to be operated on to obtain the operation result includes:
performing the sigmoid activation operation on the data to be operated on according to the sigmoid activation-function parameter table to obtain the operation result,
where the sigmoid activation-function parameter table includes a sigmoid activation table and a sigmoid constant table.
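The disclosure does not specify what the activation table and the constant table contain. The sketch below uses one common table-driven scheme, piecewise-linear approximation, purely as an illustrative assumption:
- The "activation table" holds segment breakpoints.
- The "constant table" holds a (slope, intercept) pair for each segment.

The range [-8, 8] and the step of 1.0 are hypothetical choices. Inputs outside the range saturate to the end segments.

```python
import bisect
import math


def build_tables(lo: float = -8.0, hi: float = 8.0, step: float = 1.0):
    """Hypothetical layout: activation table = breakpoints,
    constant table = (slope, intercept) per segment, by linear interpolation of sigmoid."""
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    ys = [1.0 / (1.0 + math.exp(-x)) for x in xs]
    consts = []
    for i in range(n):
        k = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        consts.append((k, ys[i] - k * xs[i]))
    return xs, consts


def table_sigmoid(x: float, act_table: list[float], const_table: list[tuple[float, float]]) -> float:
    """Evaluate sigmoid(x) as k*x + b using the segment that contains x."""
    x = min(max(x, act_table[0]), act_table[-1])  # saturate outside the table range
    i = bisect.bisect_right(act_table, x) - 1      # locate the segment containing x
    i = min(i, len(const_table) - 1)               # x at the upper bound uses the last segment
    k, b = const_table[i]
    return k * x + b
```

With 16 unit-width segments, the interpolation error stays around 0.01. Finer breakpoints trade table size for accuracy.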
In a possible implementation, performing, by the operation module, the sigmoid activation operation on the data to be operated on to obtain the operation result may include:
performing the sigmoid activation operation on the data to be operated on by using multiple activation operators.
In a possible implementation, the operation module may include a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module may include the multiple activation operators.
In this case, performing, by the operation module, the sigmoid activation operation on the data to be operated on to obtain the operation result may include: performing the sigmoid activation operation by using the multiple activation operators in the master operation sub-module to obtain the operation result.
In a possible implementation, the operation field may further include a read-in amount or the storage address of the read-in amount. In this case, obtaining the data to be operated on and the target address required to execute the sigmoid activation instruction, according to the opcode and the operation field, may include: obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
In a possible implementation, the method may further include: storing the data to be operated on.
In a possible implementation, parsing, by the control module, the acquired sigmoid activation instruction to obtain the opcode and operation field of the sigmoid activation instruction may include:
storing the compiled sigmoid activation instruction;
parsing the compiled sigmoid activation instruction to obtain the opcode and operation field of the sigmoid activation instruction;
storing an instruction queue, where the instruction queue includes multiple instructions to be executed arranged in execution order, and the multiple instructions to be executed include the compiled sigmoid activation instruction.
In a possible implementation, the method may further include the following. When a first instruction to be executed, among the multiple instructions to be executed, is determined to be associated with a zeroth instruction to be executed that precedes it, the first instruction is cached. After execution of the zeroth instruction is determined to be complete, the first instruction is executed.
Here, the first instruction to be executed being associated with the preceding zeroth instruction to be executed includes:
the first storage address interval, which stores the data required by the first instruction, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction.
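The association check above reduces to an interval-overlap test, sketched below. Storage intervals are modeled as half-open `[start, end)` address ranges, which is an assumption; the disclosure only requires the intervals to share a region. Like the stated criterion, the sketch does not distinguish reads from writes, so it is conservative. `Instr`, `depends_on` and `must_wait` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    ranges: list[tuple[int, int]]  # half-open [start, end) intervals holding the data this instruction needs


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """Two half-open intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def depends_on(first: Instr, zeroth: Instr) -> bool:
    """True if any storage interval of `first` overlaps any interval of `zeroth`."""
    return any(overlaps(r1, r0) for r1 in first.ranges for r0 in zeroth.ranges)


def must_wait(instr: Instr, in_flight: list[Instr]) -> bool:
    """Keep `instr` cached while any earlier, still-executing instruction is associated with it."""
    return any(depends_on(instr, prev) for prev in in_flight)
```

For instance, an instruction that reads addresses 500 to 563 must wait for an earlier `active.sigmoid 500 100 64` to finish. An instruction touching only addresses 800 to 863 can proceed.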
In a possible implementation, compiling the acquired sigmoid activation instruction to obtain the compiled sigmoid activation instruction may include:
generating an assembly file from the sigmoid activation instruction and translating the assembly file into a binary file, where the binary file is the compiled sigmoid activation instruction.
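The disclosure does not define a binary encoding. The sketch below therefore assumes an illustrative layout for the "assembly file to binary file" step: one byte of opcode, one byte of operand count, then each operand as a little-endian unsigned 32-bit integer. The opcode number `0x2A` and the function names are hypothetical.

```python
import struct

OPCODES = {"active.sigmoid": 0x2A}  # hypothetical opcode number


def assemble_line(line: str) -> bytes:
    """Encode one instruction as: opcode (u8), operand count (u8), operands (u32 LE each)."""
    mnemonic, *fields = line.split()
    operands = [int(f) for f in fields]
    header = struct.pack("<BB", OPCODES[mnemonic], len(operands))
    return header + struct.pack(f"<{len(operands)}I", *operands)


def assemble(source: str) -> bytes:
    """Translate an assembly file (one instruction per non-blank line) into a binary image."""
    return b"".join(assemble_line(ln) for ln in source.splitlines() if ln.strip())
```

Under this layout, `active.sigmoid 500 100 64` becomes a 14-byte record. The four-operand form becomes an 18-byte record.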
It should be noted that the sigmoid activation instruction processing method has been described above using the foregoing embodiments as examples. However, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
The sigmoid activation instruction processing method provided by the embodiments of the present disclosure has a wide range of applications. It processes sigmoid activation instructions, and performs the sigmoid activation operation, with high efficiency and speed.
The foregoing may be better understood in light of the following clauses:
Clause C1. A sigmoid activation instruction processing apparatus, the apparatus comprising:
a control module, configured to compile an acquired sigmoid activation instruction to obtain a compiled sigmoid activation instruction, parse the compiled sigmoid activation instruction to obtain an opcode and an operation field of the sigmoid activation instruction, and obtain, according to the opcode and the operation field, data to be operated on and a target address required to execute the sigmoid activation instruction; and
an operation module, configured to perform a sigmoid activation operation on the data to be operated on to obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the activation operation performed on the data by the sigmoid activation instruction is the sigmoid activation operation, and the operation field includes an address of the data to be operated on and the target address.
Clause C2. The apparatus according to Clause C1, wherein
the control module is further configured to obtain a sigmoid activation-function parameter table according to the opcode and/or the operation field; and
the operation module is further configured to perform the sigmoid activation operation on the data to be operated on according to the sigmoid activation-function parameter table to obtain the operation result,
wherein the sigmoid activation-function parameter table includes a sigmoid activation table and a sigmoid constant table.
Clause C3. The apparatus according to Clause C1, wherein the operation module comprises:
multiple activation operators, configured to perform the sigmoid activation operation on the data to be operated on.
Clause C4. The apparatus according to Clause C3, wherein the operation module comprises a master operation sub-module and multiple slave operation sub-modules, the master operation sub-module comprising the multiple activation operators, and
the master operation sub-module is configured to perform the sigmoid activation operation on the data to be operated on by using the multiple activation operators to obtain the operation result, and store the operation result at the target address.
Clause C5. The apparatus according to Clause C1, wherein the operation field further includes a read-in amount or a storage address of the read-in amount, and
the control module is further configured to obtain the read-in amount and obtain the data to be operated on according to the read-in amount.
Clause C6. The apparatus according to Clause C1, the apparatus further comprising:
a storage module, configured to store the data to be operated on.
Clause C7. The apparatus according to Clause C1, wherein the control module comprises:
an instruction storage sub-module, configured to store the compiled sigmoid activation instruction;
an instruction processing sub-module, configured to parse the compiled sigmoid activation instruction to obtain the opcode and the operation field of the sigmoid activation instruction; and
a queue storage sub-module, configured to store an instruction queue, the instruction queue including multiple instructions to be executed arranged in execution order, the multiple instructions to be executed including the compiled sigmoid activation instruction.
Clause C8. The apparatus according to Clause C7, wherein the control module further comprises:
a dependency processing sub-module, configured to perform the following when it determines that a first instruction to be executed, among the multiple instructions to be executed, is associated with a zeroth instruction to be executed that precedes the first instruction: cache the first instruction in the instruction storage sub-module; and, after execution of the zeroth instruction is complete, fetch the first instruction from the instruction storage sub-module and send it to the operation module,
wherein the first instruction to be executed being associated with the preceding zeroth instruction to be executed comprises:
a first storage address interval storing data required by the first instruction overlapping a zeroth storage address interval storing data required by the zeroth instruction.
Clause C9. The apparatus according to Clause C1, wherein
the control module is further configured to generate an assembly file according to the sigmoid activation instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled sigmoid activation instruction.
Clause C10. A machine learning operation device, the device comprising:
one or more sigmoid activation instruction processing apparatuses according to any one of Clauses C1 to C9, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and transfer execution results to the other processing devices through an I/O interface;
when the machine learning operation device includes multiple sigmoid activation instruction processing apparatuses, the multiple apparatuses may be connected and transmit data through a specific structure;
wherein:
- the multiple sigmoid activation instruction processing apparatuses are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus, to support larger-scale machine learning operations;
- the multiple apparatuses share the same control system or have their own control systems;
- the multiple apparatuses share memory or have their own memories;
- the multiple apparatuses are interconnected in any interconnection topology.

Clause C11. A combined processing device, the combined processing device comprising:
the machine learning operation device according to Clause C10, a universal interconnection interface, and other processing devices;
wherein the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by a user,
and wherein the combined processing device further comprises a storage device, which is connected to the machine learning operation device and to the other processing devices respectively and is configured to store data of the machine learning operation device and the other processing devices.
Clause C12. A machine learning chip, the machine learning chip comprising:
the machine learning operation device according to Clause C10 or the combined processing device according to Clause C11.
Clause C13. An electronic device, the electronic device comprising:
the machine learning chip according to Clause C12.
Clause C14. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause C12;
wherein the machine learning chip is connected to the storage device, the control device and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
条款C15、一种S型生长曲线函数激活指令处理方法,所述方法应用于S型生长曲线函数激活指令处理装置,所述方法包括:Clause C15. An S-shaped growth curve function activation instruction processing method. The method is applied to an S-shaped growth curve function activation instruction processing device. The method includes:
利用控制模块对获取到的S型生长曲线函数激活指令进行编译,得到编译后的S型生长曲线函数激活指令,对所述编译后的S型生长曲线函数激活指令进行解析,得到S型生长曲线函数激活指令的操作码和操作域,并根据所述操作码和所述操作域获取执行S型生长曲线函数激活指令所需的待运算数据和目标地址;The control module is used to compile the acquired S-shaped growth curve function activation instruction to obtain a compiled S-shaped growth curve function activation instruction, and the compiled S-shaped growth curve function activation instruction is analyzed to obtain an S-shaped growth curve. An operation code and an operation domain of a function activation instruction, and according to the operation code and the operation domain, obtain the data to be operated and the target address required to execute the S-shaped growth curve function activation instruction;
利用运算模块对所述待运算数据进行S型生长曲线函数激活运算,得到运算结果,并将所述运算结果存入所述目标地址中,Using an operation module to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result, and store the operation result in the target address,
其中,所述操作码用于指示所述S型生长曲线函数激活指令对数据所进行的激活运算为S型生长曲线函数激活运算,所述操作域包括待运算数据地址和所述目标地址。Wherein, the operation code is used to indicate that the activation operation performed by the S-shaped growth curve function activation instruction on the data is an S-shaped growth curve function activation operation, and the operation domain includes the data address to be operated and the target address.
条款C16、根据条款C15所述的方法,所述方法还包括:Clause C16. The method according to Clause C15, the method further comprising:
根据所述操作码和/或所述操作域获取S型生长曲线激活函数参数表;Obtaining an S-shaped growth curve activation function parameter table according to the operation code and / or the operation domain;
其中,利用运算模块对所述待运算数据进行S型生长曲线函数激活运算,得到运算结果,包括:Wherein, using an operation module to perform an S-shaped growth curve function activation operation on the data to be operated to obtain an operation result includes:
根据所述S型生长曲线激活函数参数表对所述待运算数据进行S型生长曲线函数激活运算,得到运算结果,Performing an S-type growth curve function activation operation on the data to be calculated according to the S-type growth curve activation function parameter table to obtain an operation result,
其中,所述S型生长曲线激活函数参数表包括S型生长曲线激活函数激活表和S型生长曲线激活函数常数表。Wherein, the S-type growth curve activation function parameter table includes an S-type growth curve activation function activation table and an S-type growth curve activation function constant table.
Clause C17. The method according to Clause C15, wherein performing, by the operation module, the sigmoid function activation operation on the data to be operated on to obtain the operation result comprises:
performing the sigmoid function activation operation on the data to be operated on by using a plurality of activation operators.
Clause C18. The method according to Clause C17, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of activation operators,
wherein performing, by the operation module, the sigmoid function activation operation on the data to be operated on to obtain the operation result comprises:
performing the sigmoid function activation operation by using the plurality of activation operators in the master operation sub-module to obtain the operation result.
Clause C19. The method according to Clause C15, wherein the operand field further includes a read-in amount or a storage address of the read-in amount,
wherein obtaining, according to the operation code and the operand field, the data to be operated on and the target address required to execute the sigmoid function activation instruction comprises:
obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
Clause C20. The method according to Clause C15, further comprising:
storing the data to be operated on.
Clause C21. The method according to Clause C15, wherein parsing the compiled sigmoid function activation instruction to obtain its operation code and operand field comprises:
storing the compiled sigmoid function activation instruction;
parsing the compiled sigmoid function activation instruction to obtain the operation code and the operand field of the sigmoid function activation instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled sigmoid function activation instruction.
Clause C22. The method according to Clause C21, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that execution of the zeroth instruction to be executed is complete,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it comprises:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
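The association test in this clause is an address-interval overlap check. The sketch below is one way to express it, assuming each instruction's data occupies half-open intervals `[start, end)`; the class `PendingInstr`, the helper names, and the example instructions `consumer` and `independent` are hypothetical. An instruction whose blocker list is non-empty would be cached until those earlier instructions finish.

```python
from dataclasses import dataclass, field


@dataclass
class PendingInstr:
    name: str
    # Half-open address intervals [start, end) holding the data this instruction needs.
    ranges: list = field(default_factory=list)


def intervals_overlap(a, b):
    """Half-open intervals [a0, a1) and [b0, b1) overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def is_associated(first, zeroth):
    """True if any interval of `first` overlaps any interval of the earlier `zeroth`."""
    return any(intervals_overlap(r1, r0) for r1 in first.ranges for r0 in zeroth.ranges)


def blockers(first, in_flight):
    """Names of earlier, still-executing instructions that `first` must wait for."""
    return [z.name for z in in_flight if is_associated(first, z)]


exps = PendingInstr("active.exps", [(100, 164), (500, 564)])
consumer = PendingInstr("consumer", [(500, 532)])        # reads part of the exps output
independent = PendingInstr("independent", [(800, 864)])  # touches unrelated addresses
```

Here `consumer` must wait for `active.exps`, while `independent` can issue immediately.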
Clause C23. The method according to Clause C15, wherein compiling, by the control module, the obtained sigmoid function activation instruction to obtain the compiled sigmoid function activation instruction comprises:
generating an assembly file according to the sigmoid function activation instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled sigmoid function activation instruction.
FIG. 4-1 shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 4-1, the apparatus includes a control module 12-11 and an operation module 12-12.
The control module 12-11 is configured to compile an obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled instruction to obtain the operation code and operand field of the exponential function activation instruction, and obtain, according to the operation code and operand field, the data to be operated on and the target address required to execute the instruction.
The operation code indicates that the activation operation the exponential function activation instruction performs on the data is an exponential function activation operation. The operand field includes the address of the data to be operated on and the target address.
The operation module 12-12 is configured to perform the exponential function activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address.
In this embodiment, the exponential function activation instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module first compiles it. Only after the compiled exponential function activation instruction is obtained can it be parsed. The compiled exponential function activation instruction is a hardware instruction that the hardware can execute directly.
In this embodiment, the compiled exponential function activation instruction may be an instruction that corresponds uniquely to the exponential function activation instruction. Alternatively, the compiled instruction may be a general instruction covering all activation operations, so that when processing the exponential function activation instruction the apparatus can perform the exponential function activation operation without first distinguishing which activation function the instruction corresponds to, which simplifies processing.
In this embodiment, the control module may obtain the data to be operated on from the address of the data to be operated on. The control module may determine, according to the operation code of the exponential function activation instruction, the data required for the exponential function activation operation. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part or field of an instruction (usually represented by a code) that specifies the operation to be performed in a computer program; it identifies the instruction and tells the executing apparatus which instruction to execute. The operand field may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; the data required to execute the instruction includes the data to be operated on, the corresponding operation method, and so on. An exponential function activation instruction must include an operation code and an operand field, where the operand field includes at least the address of the data to be operated on and the target address.
It should be understood that those skilled in the art may set the instruction format of the exponential function activation instruction, and the operation code and operand field it contains, as required; the present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and the numbers of control modules and operation modules may be set according to actual needs; the present disclosure does not limit this. When the apparatus includes one control module, that control module may receive exponential function activation instructions and control one or more operation modules to perform the exponential function activation operation. When the apparatus includes multiple control modules, each control module may receive exponential function activation instructions and control its corresponding one or more operation modules to perform the exponential function activation operation.
An embodiment of the present disclosure provides an exponential function activation instruction processing apparatus that includes a control module and an operation module. The control module is configured to compile an obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled instruction to obtain its operation code and operand field, and obtain, according to the operation code and operand field, the data to be operated on and the target address required to execute the instruction. The operation module is configured to perform the exponential function activation operation on the data to be operated on to obtain an operation result, and to store the operation result at the target address. The apparatus provided by the embodiments of the present disclosure has a wide range of application and processes exponential function activation instructions, and performs exponential function activation operations, efficiently and quickly.
In a possible implementation, the control module 12-11 may be further configured to obtain an exponential activation function parameter table according to the operation code and/or the operand field.
The operation module 12-12 may be further configured to perform the exponential function activation operation on the data to be operated on according to the exponential activation function parameter table to obtain the operation result,
where the exponential activation function parameter table may include an exponential activation function activation table and an exponential activation function constant table.
In this implementation, the operand field may include the address of the exponential activation function parameter table, so that the control module obtains the table from that address. Alternatively, when the control module determines from the operation code that executing the exponential function activation instruction requires the exponential activation function parameter table, it may obtain the table directly from a predetermined storage address of the exponential activation function parameter table. As a further alternative, in that case the control module may obtain the exponential activation function parameter table corresponding to the instruction directly from a predetermined storage address of parameter tables. Those skilled in the art may set how the exponential activation function parameter table is obtained according to actual needs; the present disclosure does not limit this.
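The disclosure leaves the contents of the activation table and constant table open. One common table-driven way to evaluate the exponential, shown here only as an assumed illustration, writes e^x = 2^k · 2^(j/N) · e^r: the activation table stores 2^(j/N) for N = 64 entries, the constant table stores log2(e), ln(2) and N, and a short polynomial covers the small remainder r. The names `build_exp_tables` and `exp_lut` are hypothetical.

```python
import math

ENTRIES = 64  # hypothetical table size; a power of two keeps f * ENTRIES exact


def build_exp_tables(entries=ENTRIES):
    activation = [2.0 ** (j / entries) for j in range(entries)]   # 2^(j/N)
    constants = (1.0 / math.log(2.0), math.log(2.0), entries)     # log2(e), ln(2), N
    return activation, constants


def exp_lut(data, activation, constants):
    """Approximate e^x as 2^k * table[j] * e^r with a second-order polynomial for e^r."""
    log2e, ln2, entries = constants
    out = []
    for x in data:
        y = x * log2e                              # e^x = 2^y
        k = math.floor(y)                          # integer part: exponent shift
        f = y - k                                  # fractional part in [0, 1)
        j = min(int(f * entries), entries - 1)     # table index
        r = (f - j / entries) * ln2                # remainder, 0 <= r < ln(2) / N
        poly = 1.0 + r + 0.5 * r * r               # e^r to second order
        out.append(math.ldexp(activation[j] * poly, k))
    return out


activation, constants = build_exp_tables()
values = exp_lut([-10.0, 0.0, 1.0, 5.0], activation, constants)
```

Because r stays below ln(2)/64, the dropped cubic term keeps the relative error around 1e-7 in this sketch; a hardware design would choose table size and polynomial order from its own accuracy target.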
In a possible implementation, the control module may also obtain an activation function corresponding to the exponential function activation instruction, so that the operation module can perform the exponential function activation operation on the data to be operated on according to that activation function and the corresponding operator.
It should be noted that those skilled in the art may set how the operation module implements the exponential function activation operation according to actual needs; the present disclosure does not limit this.
FIG. 4-2a shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 4-2a, the operation module 12-12 may include a plurality of activation operators 12-120, which are configured to perform the exponential function activation operation on the data to be operated on.
In this implementation, the operation module may instead include a single activation operator. The number of activation operators may be set according to the amount of data involved in the exponential function activation operation and the required processing speed and efficiency; the present disclosure does not limit this.
FIG. 4-2b shows a block diagram of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 4-2b, the operation module 12-12 may include a master operation sub-module 12-121 and a plurality of slave operation sub-modules 12-122, where the master operation sub-module 12-121 includes a plurality of activation operators 12-120 (not shown in the figure).
The master operation sub-module 12-121 is configured to perform the exponential function activation operation on the data to be operated on using the plurality of activation operators to obtain the operation result, and to store the operation result at the target address.
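How the master operation sub-module distributes work among its activation operators is not specified. A minimal sketch, assuming the input is split into contiguous chunks with one chunk per operator, can model each operator as a worker in a thread pool. The names `split_lanes` and `master_exp` are hypothetical, and in CPython the threads only illustrate the partitioning; they do not give real parallel speedup for pure-Python arithmetic.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_lanes(data, lanes):
    """Split data into at most `lanes` contiguous chunks, one per activation operator."""
    if not data:
        return []
    size = -(-len(data) // lanes)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def master_exp(data, lanes=4):
    """Hypothetical master sub-module: each worker plays one activation operator."""
    chunks = split_lanes(data, lanes)
    if not chunks:
        return []
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        parts = list(pool.map(lambda chunk: [math.exp(x) for x in chunk], chunks))
    return [y for part in parts for y in part]  # pool.map preserves chunk order
```

Contiguous chunks keep each operator's reads and writes in one address range, which is a simple choice; interleaved assignment would work equally well for an element-wise operation.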
In a possible implementation, the operand field may further include a read-in amount or the storage address of the read-in amount. The control module 12-11 is further configured to obtain the read-in amount and to obtain a plurality of data to be operated on according to it, where the amount of the obtained data may be less than or equal to the read-in amount.
In this implementation, the read-in amount may be the amount, that is, the size, of the data to be operated on that is obtained. When the operand field directly contains a numeric value for the read-in amount, that value is taken as the read-in amount. When the operand field contains the storage address of the read-in amount, the read-in amount is read from that address.
In a possible implementation, when the operand field does not include the read-in amount, the data to be operated on may be obtained according to a preset default read-in amount; the amount of data obtained may be less than or equal to the default read-in amount.
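The three cases for the read-in amount (an immediate value, a storage address, or absent and defaulted) can be sketched as below. The tagged-tuple operand encoding, the default of 64, and the function names are assumptions made for illustration; memory is modelled as a flat Python list.

```python
DEFAULT_READ_IN = 64  # hypothetical preset default read-in amount


def resolve_read_in(size_field, memory, default=DEFAULT_READ_IN):
    """Resolve the read-in amount from a hypothetical operand encoding.

    size_field is None when the operand field omits the read-in amount,
    ("imm", n) when it carries the value directly, or ("addr", a) when it
    carries the storage address of the value.
    """
    if size_field is None:
        return default
    kind, value = size_field
    if kind == "imm":
        return value
    if kind == "addr":
        return int(memory[value])
    raise ValueError(f"unknown read-in encoding: {kind!r}")


def fetch_operands(memory, src, read_in):
    """Fetch at most `read_in` values starting at `src`; fewer if memory ends first."""
    return memory[src:src + read_in]
```

Slicing naturally returns fewer elements near the end of memory, matching the text's allowance that the obtained data may be less than or equal to the read-in amount.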
In this way, the amount and size of the data to be operated on can be bounded, which helps ensure the accuracy of the operation result and ensures that the apparatus can execute the exponential function activation instruction.
In a possible implementation, as shown in FIGS. 4-2a and 4-2b, the apparatus may further include a storage module 12-13, which is configured to store the data to be operated on and may also store the exponential activation function parameter table.
In this implementation, the storage module may include memory such as one or more of a cache and registers, and the cache may include a scratchpad cache. The data to be operated on and the exponential activation function parameter table may be stored in the cache and/or registers of the storage module as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may further include a direct memory access (DMA) module configured to read data from, or store data to, the storage module.
In a possible implementation, as shown in FIGS. 4-2a and 4-2b, the control module 12-11 may be further configured to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.
In a possible implementation, the instruction format of the exponential function activation instruction may be:
active.exps dst src0 size
Here, active.exps is the operation code of the exponential function activation instruction, and dst, src0, and size are its operand fields: dst is the target address, src0 is the address of the data to be operated on, and size is the read-in amount.
In a possible implementation, the instruction format of the exponential function activation instruction may also be:
active.exps dst src0 src1 size
Here, active.exps is the operation code of the exponential function activation instruction, and dst, src0, src1, and size are its operand fields: dst is the target address, src0 is the address of the data to be operated on, src1 is the address of the exponential activation function parameter table, and size is the read-in amount.
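The two textual formats above can be told apart by operand count. A minimal parsing sketch, assuming whitespace-separated decimal operands, follows; `ExpsInstruction` and `parse_exps` are hypothetical names, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpsInstruction:
    dst: int                     # target address
    src0: int                    # address of the data to be operated on
    size: int                    # read-in amount
    src1: Optional[int] = None   # parameter-table address (four-operand form only)


def parse_exps(text):
    """Split an assembly line into the operation code and operand fields."""
    opcode, *operands = text.split()
    if opcode != "active.exps":
        raise ValueError(f"not an exponential activation instruction: {opcode}")
    fields = [int(tok) for tok in operands]
    if len(fields) == 3:                 # active.exps dst src0 size
        dst, src0, size = fields
        return ExpsInstruction(dst, src0, size)
    if len(fields) == 4:                 # active.exps dst src0 src1 size
        dst, src0, src1, size = fields
        return ExpsInstruction(dst, src0, size, src1)
    raise ValueError(f"expected 3 or 4 operands, got {len(fields)}")
```

For example, `parse_exps("active.exps 500 100 64")` yields target address 500, source address 100 and read-in amount 64, with no parameter-table address.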
It should be understood that those skilled in the art may set the operation code of the exponential function activation instruction, and the positions of the operation code and operand fields in the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the exponential function activation instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application Example
The following gives an application example according to an embodiment of the present disclosure, using "performing an activation operation with the exponential function activation instruction processing apparatus" as an exemplary application scenario, to help explain the flow of the apparatus. Those skilled in the art should understand that the following application example is only intended to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 4-3 shows a schematic diagram of an application scenario of an exponential function activation instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 4-3, the apparatus processes an exponential function activation instruction as follows:
As shown in FIG. 4-3, the control module 12-11 compiles the obtained exponential function activation instruction 1 to obtain the compiled exponential function activation instruction 1 (for example, active.exps 500 100 64), and parses it to obtain its operation code and operand field: the operation code is active.exps, the target address is 500, the address of the data to be operated on is 100, and the read-in amount is 64. The control module 12-11 obtains data to be operated on with a data amount of 64 (the read-in amount) from address 100. If the activation operation is to be performed according to the exponential activation function parameter table, the control module 12-11 also obtains that table (see the description above for the specific process).
The operation module 12-12 performs the exponential function activation operation on the data to be operated on according to the exponential activation function parameter table, obtains the operation result, and stores it at target address 500.
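The walkthrough for `active.exps 500 100 64` can be mirrored on a toy model: memory is a flat Python list of 1024 floats, the 64 inputs at address 100 are made-up sample values, and `math.exp` stands in for the parameter-table evaluation described above. The function name `run_active_exps` is hypothetical.

```python
import math


def run_active_exps(memory, text):
    """Execute one three-operand active.exps instruction on a flat list-based memory."""
    opcode, dst, src0, size = text.split()
    if opcode != "active.exps":
        raise ValueError(f"unexpected opcode: {opcode}")
    dst, src0, size = int(dst), int(src0), int(size)
    data = memory[src0:src0 + size]             # fetch by read-in amount
    result = [math.exp(x) for x in data]        # exponential activation
    memory[dst:dst + len(result)] = result      # store at the target address
    return result


memory = [0.0] * 1024
memory[100:164] = [i / 32 for i in range(64)]   # 64 sample inputs at address 100
run_active_exps(memory, "active.exps 500 100 64")
```

After the call, addresses 500 through 563 hold the exponentials of the 64 inputs and the rest of memory is unchanged.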
For the working process of each module above, refer to the related description above.
In this way, the exponential function activation instruction processing apparatus can process exponential function activation instructions, and perform exponential function activation operations, efficiently and quickly.
FIG. 4-4 shows a flowchart of an exponential function activation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, the method is applied to the above exponential function activation instruction processing apparatus and includes steps S51-12 and S52-12. The method may be applied to a computer device or similar equipment that includes a memory and a processor, where the memory stores data used while the method is executed and the processor performs the related processing and operation steps, such as steps S51-12 and S52-12 below.
In step S51-12, the control module compiles the obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parses the compiled instruction to obtain its operation code and operand field, and obtains, according to the operation code and operand field, the data to be operated on and the target address required to execute the instruction. The operation code indicates that the activation operation performed on the data is an exponential function activation operation, and the operand field includes the address of the data to be operated on and the target address.
In step S52-12, the operation module performs the exponential function activation operation on the data to be operated on to obtain an operation result, and stores the operation result at the target address.
In a possible implementation, the method may further include:
obtaining an exponential activation function parameter table according to the operation code and/or the operand field;
wherein performing, by the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result may include:
performing the exponential function activation operation on the data to be operated on according to the exponential activation function parameter table to obtain the operation result,
wherein the exponential activation function parameter table may include an exponential activation function activation table and an exponential activation function constant table.
In a possible implementation, performing, by the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result may include: performing the exponential function activation operation on the data to be operated on using a plurality of activation operators.
In a possible implementation, the operation module may include a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module may include a plurality of activation operators,
wherein performing, by the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result may include:
performing the exponential function activation operation on the data to be operated on using the plurality of activation operators in the master operation sub-module to obtain the operation result, and storing the operation result at the target address.
In a possible implementation, the operand field may further include a read-in amount or the storage address of the read-in amount, and obtaining, according to the operation code and operand field, the data to be operated on and the target address required to execute the exponential function activation instruction may include: obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
In a possible implementation, the method may further include: storing the data to be operated on.
In a possible implementation, parsing the compiled exponential function activation instruction to obtain its operation code and operand field may include:
storing the compiled exponential function activation instruction;
parsing the compiled exponential function activation instruction to obtain the operation code and operand field of the exponential function activation instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled exponential function activation instruction.
In a possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that execution of the zeroth instruction to be executed is complete,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
In a possible implementation, compiling, by the control module, the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction may include: generating an assembly file according to the exponential function activation instruction and translating the assembly file into a binary file, where the binary file is the compiled exponential function activation instruction.
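The binary encoding of the compiled instruction is not disclosed. Purely as an illustration of the "assembly to binary" step, the sketch below packs the three-operand form `active.exps dst src0 size` into an assumed 64-bit big-endian word (8-bit opcode, three 16-bit operand fields, 8 reserved bits) using Python's `struct` module. The opcode value 0x2E, the field widths, and the names `assemble`/`disassemble` are all hypothetical.

```python
import struct

# Hypothetical 64-bit layout (not given in the disclosure):
#   bits 63..56 opcode | 55..40 dst | 39..24 src0 | 23..8 size | 7..0 reserved
OPCODES = {"active.exps": 0x2E}               # hypothetical opcode number
MNEMONICS = {v: k for k, v in OPCODES.items()}


def assemble(line):
    """Translate one three-operand assembly line into an 8-byte big-endian word."""
    mnemonic, *fields = line.split()
    dst, src0, size = (int(f) for f in fields)
    for value in (dst, src0, size):
        if not 0 <= value < 1 << 16:
            raise ValueError(f"operand {value} does not fit in 16 bits")
    word = (OPCODES[mnemonic] << 56) | (dst << 40) | (src0 << 24) | (size << 8)
    return struct.pack(">Q", word)


def disassemble(blob):
    """Inverse of assemble, useful for checking the encoding."""
    (word,) = struct.unpack(">Q", blob)
    return (MNEMONICS[word >> 56], (word >> 40) & 0xFFFF,
            (word >> 24) & 0xFFFF, (word >> 8) & 0xFFFF)
```

The round trip `disassemble(assemble(line))` recovers the mnemonic and operands, which is a convenient self-check for any chosen layout; the four-operand form with `src1` would need a wider word or narrower fields.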
It should be noted that although the exponential function activation instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, the user may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the technical solutions of the present disclosure are satisfied.
The exponential function activation instruction processing method provided by the embodiments of the present disclosure has a wide range of applications, processes exponential function activation instructions with high efficiency and speed, and performs exponential function activation operations with high efficiency and speed.
The foregoing may be better understood in light of the following clauses:
Clause D1. An exponential function activation instruction processing device, the device comprising:
a control module configured to compile an obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parse the compiled exponential function activation instruction to obtain an opcode and an operation field of the exponential function activation instruction, and obtain, according to the opcode and the operation field, the data to be operated on and the target address required to execute the exponential function activation instruction; and
an operation module configured to perform an exponential function activation operation on the data to be operated on to obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the activation operation performed on data by the exponential function activation instruction is an exponential function activation operation, and the operation field includes the address of the data to be operated on and the target address.
Clause D2. The device according to Clause D1,
wherein the control module is further configured to obtain an exponential activation function parameter table according to the opcode and/or the operation field;
the operation module is further configured to perform the exponential function activation operation on the data to be operated on according to the exponential activation function parameter table to obtain the operation result,
wherein the exponential activation function parameter table includes an exponential activation function activation table and an exponential activation function constant table.
Clause D3. The device according to Clause D1, wherein the operation module includes:
a plurality of activation operators configured to perform the exponential function activation operation on the data to be operated on.
Clause D4. The device according to Clause D3, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of activation operators,
the master operation sub-module being configured to perform, by using the plurality of activation operators, the exponential function activation operation on the data to be operated on to obtain the operation result, and store the operation result at the target address.
Clause D5. The device according to Clause D1, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
and the control module is further configured to obtain the read-in amount and obtain the data to be operated on according to the read-in amount.
Clause D6. The device according to Clause D1, further comprising:
a storage module configured to store the data to be operated on.
Clause D7. The device according to Clause D1, wherein the control module includes:
an instruction storage sub-module configured to store the compiled exponential function activation instruction;
an instruction processing sub-module configured to parse the compiled exponential function activation instruction to obtain the opcode and the operation field of the exponential function activation instruction; and
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled exponential function activation instruction.
Clause D8. The device according to Clause D7, wherein the control module further includes:
a dependency processing sub-module configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed has completed, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
a first storage address interval, which stores the data required by the first instruction to be executed, overlaps a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause D9. The device according to Clause D1,
wherein the control module is further configured to generate an assembly file according to the exponential function activation instruction and translate the assembly file into a binary file,
the binary file being the compiled exponential function activation instruction.
Clause D10. A machine learning operation device, the device comprising:
one or more exponential function activation instruction processing devices according to any one of Clauses D1 to D9, configured to obtain the data to be operated on and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of the exponential function activation instruction processing devices, the plurality of exponential function activation instruction processing devices may be connected through a specific structure and transmit data to one another;
wherein the plurality of exponential function activation instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of exponential function activation instruction processing devices share one control system or have their own control systems, and share memory or have their own memories; and they may be interconnected in any interconnection topology.
Clause D11. A combined processing device, the combined processing device comprising:
the machine learning operation device according to Clause D10, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
wherein the combined processing device further includes a storage device, which is connected to the machine learning operation device and to the other processing devices and is configured to store data of the machine learning operation device and the other processing devices.
Clause D12. A machine learning chip, the machine learning chip comprising:
the machine learning operation device according to Clause D10 or the combined processing device according to Clause D11.
Clause D13. An electronic device, the electronic device comprising:
the machine learning chip according to Clause D12.
Clause D14. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause D12;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause D15. An exponential function activation instruction processing method, applied to an exponential function activation instruction processing device, the method comprising:
compiling, by using a control module, an obtained exponential function activation instruction to obtain a compiled exponential function activation instruction, parsing the compiled exponential function activation instruction to obtain an opcode and an operation field of the exponential function activation instruction, and obtaining, according to the opcode and the operation field, the data to be operated on and the target address required to execute the exponential function activation instruction; and
performing, by using an operation module, an exponential function activation operation on the data to be operated on to obtain an operation result, and storing the operation result at the target address,
wherein the opcode indicates that the activation operation performed on data by the exponential function activation instruction is an exponential function activation operation, and the operation field includes the address of the data to be operated on and the target address.
Clause D16. The method according to Clause D15, further comprising:
obtaining an exponential activation function parameter table according to the opcode and/or the operation field;
wherein performing, by using the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result includes:
performing the exponential function activation operation on the data to be operated on according to the exponential activation function parameter table to obtain the operation result,
wherein the exponential activation function parameter table includes an exponential activation function activation table and an exponential activation function constant table.
Clause D17. The method according to Clause D15, wherein performing, by using the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result includes:
performing the exponential function activation operation on the data to be operated on by using a plurality of activation operators.
Clause D18. The method according to Clause D17, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of activation operators,
and performing, by using the operation module, the exponential function activation operation on the data to be operated on to obtain the operation result includes:
performing, by using the plurality of activation operators in the master operation sub-module, the exponential function activation operation on the data to be operated on to obtain the operation result, and storing the operation result at the target address.
Clause D19. The method according to Clause D15, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
and obtaining, according to the opcode and the operation field, the data to be operated on and the target address required to execute the exponential function activation instruction includes:
obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
Clause D20. The method according to Clause D15, further comprising:
storing the data to be operated on.
Clause D21. The method according to Clause D15, wherein parsing the compiled exponential function activation instruction to obtain the opcode and the operation field of the exponential function activation instruction includes:
storing the compiled exponential function activation instruction;
parsing the compiled exponential function activation instruction to obtain the opcode and the operation field of the exponential function activation instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled exponential function activation instruction.
Clause D22. The method according to Clause D21, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after determining that execution of the zeroth instruction to be executed has completed, controlling execution of the first instruction to be executed,
wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
a first storage address interval, which stores the data required by the first instruction to be executed, overlaps a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause D23. The method according to Clause D15, wherein compiling, by using the control module, the obtained exponential function activation instruction to obtain the compiled exponential function activation instruction includes:
generating an assembly file according to the exponential function activation instruction, and translating the assembly file into a binary file,
the binary file being the compiled exponential function activation instruction.
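The clauses do not specify what the activation table and constant table contain. As a minimal sketch, assuming the activation table holds samples of exp(x) at evenly spaced points and the constant table holds the bounds and spacing of those points, an exponential activation over `size` elements read from `src` and written to `dst` could be computed by clamped piecewise-linear interpolation. The flat list `mem`, this table layout, and the function names are all hypothetical; the disclosure does not state that table lookup with interpolation is the method used.

```python
import math
from typing import Dict, List, Tuple


def build_exp_tables(lo: float = -8.0, hi: float = 8.0,
                     n: int = 257) -> Tuple[List[float], Dict[str, float]]:
    """Hypothetical tables: 'activation' samples exp(x) at n evenly spaced
    points on [lo, hi]; 'constants' holds what is needed to index it."""
    step = (hi - lo) / (n - 1)
    activation = [math.exp(lo + i * step) for i in range(n)]
    constants = {"lo": lo, "hi": hi, "step": step}
    return activation, constants


def exp_activate(mem: List[float], src: int, dst: int, size: int,
                 activation: List[float], constants: Dict[str, float]) -> None:
    """Write an approximation of exp(mem[src + k]) to mem[dst + k] for k < size."""
    lo, hi, step = constants["lo"], constants["hi"], constants["step"]
    for k in range(size):
        x = min(max(mem[src + k], lo), hi)            # clamp to the table range
        i = min(int((x - lo) / step), len(activation) - 2)
        t = (x - (lo + i * step)) / step              # position within the segment
        mem[dst + k] = activation[i] + t * (activation[i + 1] - activation[i])
```

With a spacing of 1/16, linear interpolation of exp keeps the relative error inside the table range below roughly 0.05%; inputs outside [lo, hi] are clamped, which is a design choice of this sketch only.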
With the widespread use of neural network algorithms and the continuous improvement in the computing power of computer hardware, the types and number of data operations involved in practical applications keep increasing. A selection operation is a processing operation that selects data according to a selection condition. Programming languages are diverse, and in the related art there is currently no selection instruction that applies broadly across programming languages, so technicians must define several custom instructions for their particular language environment to implement a selection operation, which makes selection operations inefficient and slow. The present disclosure provides a selection instruction processing method and device, computer equipment, and a storage medium with which a selection operation can be implemented with a single instruction, significantly improving the efficiency and speed of selection operations.
FIG. 5-1 shows a block diagram of a selection instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 5-1, the device includes a control module 1-11 and an operation module 1-12.
The control module 1-11 is configured to compile an obtained selection instruction to obtain a compiled selection instruction, parse the compiled selection instruction to obtain the opcode and operation field of the selection instruction, and obtain, according to the opcode and the operation field, the plurality of index data, the plurality of data to be operated on, and the target address required to execute the selection instruction. The opcode indicates that the operation performed on data by the selection instruction is a selection operation, and the operation field includes the address of the data to be operated on, the index data address, and the target address.
The operation module 1-12 is configured to determine, in order, whether each of the plurality of index data satisfies a storage condition, and to store the data to be operated on that correspond to the index data satisfying the storage condition sequentially at the target address.
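The behavior of the operation module can be sketched in a few lines. This is an illustration only, not the disclosed hardware: memory is modeled as a flat list `mem`, the operand names follow the `select dst src0 src1 size` format given later in this section, and the storage condition is assumed, for illustration, to be "the index value is non-zero". The function name `select` is hypothetical.

```python
from typing import List


def select(mem: List[float], dst: int, src0: int, src1: int, size: int) -> int:
    """Sketch of the selection operation on a flat memory model.

    For k = 0..size-1, if the index value mem[src1 + k] is non-zero
    (assumed storage condition), copy mem[src0 + k] to the next free slot
    starting at dst. Returns the number of elements stored.
    """
    stored = 0
    for k in range(size):
        if mem[src1 + k] != 0:                  # storage condition met
            mem[dst + stored] = mem[src0 + k]   # stored in order, without gaps
            stored += 1
    return stored
```

Elements whose index value is zero are skipped, so the selected results are packed contiguously from `dst` in their original order.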
In this embodiment, the selection instruction obtained by the control module is an uncompiled software instruction that cannot be executed directly by hardware, so the control module must first compile the (uncompiled) selection instruction. Only after the compiled selection instruction is obtained can it be parsed. The compiled selection instruction is a hardware instruction that can be executed directly by hardware. The control module may obtain the plurality of data to be operated on and the plurality of index data from the address of the data to be operated on and the index data address, respectively. The control module may obtain the selection instruction, the plurality of data to be operated on, and the plurality of index data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented as a code), that a computer program specifies for the operation to be performed; it serves as the instruction serial number that tells the device executing the instruction which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, such as the corresponding addresses; that data includes parameter data, data to be operated on, the corresponding operation method, and so on. A selection instruction must include an opcode and an operation field, where the operation field includes at least the address of the data to be operated on, the index data address, and the target address.
It should be understood that those skilled in the art may set the instruction format of the selection instruction, as well as the opcode and operation field it contains, as required; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, that control module may receive the selection instruction and control one or more operation modules to perform the selection operation. When the device includes a plurality of control modules, each control module may receive a selection instruction and control its corresponding one or more operation modules to perform the selection operation.
The selection instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained selection instruction to obtain a compiled selection instruction, parse the compiled selection instruction to obtain the opcode and operation field of the selection instruction, and obtain, according to the opcode and the operation field, the plurality of index data, the plurality of data to be operated on, and the target address required to execute the selection instruction. The operation module is configured to determine, in order, whether each of the plurality of index data satisfies the storage condition, and to store the data to be operated on that correspond to the index data satisfying the storage condition sequentially at the target address. The device has a wide range of applications, processes selection instructions with high efficiency and speed, and performs selection operations with high efficiency and speed.
In a possible implementation, the storage condition may be that the index data is non-zero.
In this implementation, whenever an index data item is non-zero, the data to be operated on corresponding to it are stored sequentially at the target address. The storage condition may also be that the index data is not equal to a specified value, such as 1. Those skilled in the art may set the storage condition according to actual needs; the present disclosure does not limit this.
In this implementation, the storage condition or the index data may be set as needed so that the required portion of the data to be operated on is stored at the target address. For example, to select the data to be operated on in different ways for different purposes, different storage conditions may be set, or different index data may be provided.
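Because both the storage condition and the index data are inputs, the same data can be selected in different ways. A minimal sketch, with the storage condition passed in as a predicate; the names `select_where`, `not_zero`, and `not_one` are hypothetical and only illustrate the two conditions the text mentions.

```python
from typing import Callable, List, Sequence


def select_where(data: Sequence[float], index: Sequence[float],
                 keep: Callable[[float], bool]) -> List[float]:
    """Return the data items whose index value satisfies `keep`, in order."""
    return [d for d, i in zip(data, index) if keep(i)]


def not_zero(i: float) -> bool:
    return i != 0          # storage condition: index data is non-zero


def not_one(i: float) -> bool:
    return i != 1          # storage condition: index data is not the specified value 1
```

For `data = [5, 6, 7, 8]` and `index = [0, 1, 2, 0]`, `not_zero` keeps `[6, 7]` and `not_one` keeps `[5, 7, 8]`; keeping `not_zero` but changing the index to `[1, 1, 0, 0]` keeps `[5, 6]`.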
FIG. 5-2a shows a block diagram of a selection instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 5-2a, the operation module 1-12 may include a plurality of comparators 1-120 configured to determine, in order, whether each of the plurality of index data satisfies the storage condition.
For example, if the storage condition is "the index data is not 0", the comparators may compare each index data item with 0 in turn to determine whether it satisfies the storage condition, so that the operation module can store the data to be operated on that correspond to the non-zero index data sequentially at the target address. The number of comparators may be set according to the amount of data to be compared and the required comparison speed and efficiency; the present disclosure does not limit this.
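The trade-off in choosing the number of comparators can be illustrated by simulating a bank in which each comparator checks one index value per step. This is an assumption-laden sketch: the one-comparison-per-comparator-per-step timing model and the helper names `compare_in_batches` and `compact` are hypothetical, but they show how more comparators reduce the number of steps needed for the same data.

```python
from typing import List, Sequence, Tuple


def compare_in_batches(index: Sequence[float], num_comparators: int,
                       reference: float = 0) -> Tuple[List[bool], int]:
    """Compare up to num_comparators index values against `reference` per step.

    Returns a mask (True = storage condition "index != reference" met) and
    the number of steps the hypothetical comparator bank needed.
    """
    if num_comparators < 1:
        raise ValueError("need at least one comparator")
    mask: List[bool] = []
    steps = 0
    for base in range(0, len(index), num_comparators):
        batch = index[base:base + num_comparators]  # one value per comparator
        mask.extend(v != reference for v in batch)
        steps += 1
    return mask, steps


def compact(data: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Keep the data items whose mask bit is set, preserving their order."""
    return [d for d, keep in zip(data, mask) if keep]
```

With five index values, two comparators need three steps while five comparators need one; the resulting mask, and therefore the stored data, is the same either way.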
FIG. 5-2b shows a block diagram of a selection instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 5-2b, the operation module 1-12 may include a master operation sub-module 1-121 and a plurality of slave operation sub-modules 1-122. The master operation sub-module 1-121 may include the plurality of comparators 1-120 (not shown in the figure).
The master operation sub-module 1-121 is configured to use the plurality of comparators to determine, in order, whether each of the plurality of index data satisfies the storage condition, identify the data to be operated on that correspond to the index data satisfying the storage condition, and store those data sequentially at the target address.
In a possible implementation, the operation field may further include a read-in amount or a storage address of the read-in amount. The control module 1-11 is further configured to obtain the read-in amount and obtain the plurality of data to be operated on according to the read-in amount. The amount of the plurality of data to be operated on is less than or equal to the read-in amount, and the read-in amount is less than or equal to the amount of the plurality of index data.
In this implementation, the read-in amount may be the amount of data to be operated on that is fetched, i.e. the size of the fetched data. When the operation field directly contains the value of the read-in amount, that value is used as the read-in amount. When the operation field contains the storage address of the read-in amount, the read-in amount can be read from that address.
In a possible implementation, when the operation field does not include a read-in amount, the plurality of data to be operated on may be obtained according to a preset default read-in amount. The amount of data to be operated on that is fetched is less than or equal to the default read-in amount, and the default read-in amount is less than or equal to the amount of the plurality of index data.
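The three ways of determining the read-in amount described above (an immediate value, a storage address, or a preset default) can be sketched as follows. The operand encoding as `None`, `("imm", n)`, or `("addr", a)`, the default value of 4, and the helper names are hypothetical choices for illustration; the check in `fetch_data` enforces the stated constraint that the read-in amount does not exceed the amount of index data.

```python
from typing import List, Optional, Tuple

DEFAULT_READ_IN = 4  # hypothetical preset default read-in amount


def resolve_read_in(operand: Optional[Tuple[str, int]], mem: List[int]) -> int:
    """Return the read-in amount described by a (hypothetical) size operand.

    operand is None when the operation field has no read-in amount,
    ("imm", n) when it holds the value directly, or ("addr", a) when it
    holds the address at which the value is stored.
    """
    if operand is None:
        return DEFAULT_READ_IN
    kind, value = operand
    if kind == "imm":
        return value
    if kind == "addr":
        return mem[value]
    raise ValueError(f"unknown operand kind: {kind!r}")


def fetch_data(mem: List[int], src0: int, read_in: int, index_count: int) -> List[int]:
    """Fetch at most read_in items starting at src0 (read_in <= index_count)."""
    if read_in > index_count:
        raise ValueError("read-in amount exceeds the amount of index data")
    return mem[src0:src0 + read_in]
```

Because the slice stops at the end of memory, the amount fetched can be smaller than the read-in amount but never larger, matching the "less than or equal to" relationship in the text.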
In this implementation, the amount of data to be operated on, the amount of index data, and the amount of data the target address can hold may all be the same, for example all equal to the read-in amount or the default read-in amount. The operation module can then store the data to be operated on that correspond to the index data satisfying the storage condition sequentially at the target address without the target address running out of space or being wasted.
In a possible implementation, as shown in FIGS. 5-2a and 5-2b, the device may further include a storage module 1-13, which is configured to store the plurality of index data, the plurality of data to be operated on, and the storage condition.
In this implementation, the storage module may include memory, such as one or more of a cache and registers. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache may be used to store the data to be operated on and pooling kernels, and the registers may be used to store scalar data among the data to be operated on.
In a possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the NRAM mentioned above, can be used to store neuron data among the data to be operated on. The neuron data may include neuron vector data.
In a possible implementation, the device may further include a direct memory access (DMA) module for reading data from, and storing data into, the storage module.
In a possible implementation, the control module 1-11 may also be used to generate an assembly file from the selection instruction and translate the assembly file into a binary file, where the binary file is the compiled selection instruction.
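The assembly-to-binary step can be sketched in Python as follows. The disclosure does not define a binary encoding, so the layout here is purely hypothetical: a 1-byte opcode followed by four little-endian unsigned 32-bit operation fields. The opcode number `0x2A` and the names `OPCODES`, `LAYOUT`, `assemble`, and `assemble_file` are illustrative assumptions.

```python
import struct

# Hypothetical encoding, not taken from the disclosure:
# 1-byte opcode + four little-endian uint32 fields (dst, src0, src1, size).
OPCODES = {"select": 0x2A}          # hypothetical opcode number
LAYOUT = struct.Struct("<BIIII")    # 17 bytes per instruction, no padding


def assemble(line: str) -> bytes:
    """Translate one assembly line such as 'select 500 100 200 5' into binary."""
    mnemonic, *fields = line.split()
    if mnemonic not in OPCODES or len(fields) != 4:
        raise ValueError(f"unsupported instruction: {line!r}")
    return LAYOUT.pack(OPCODES[mnemonic], *(int(f) for f in fields))


def assemble_file(source: str) -> bytes:
    """Assemble every non-empty line of an assembly file into one binary blob."""
    return b"".join(assemble(line) for line in source.splitlines() if line.strip())


blob = assemble("select 500 100 200 5")
print(LAYOUT.unpack(blob))  # (42, 500, 100, 200, 5)
```

A fixed-width layout keeps decoding trivial. A real instruction set would likely pack the fields more tightly.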
In a possible implementation, the instruction format of the selection instruction may be:
select dst src0 src1 size
Here, select is the opcode of the selection instruction, and dst, src0, src1, and size are its operation fields: dst is the target address, src0 is the address of the data to be operated on, src1 is the index data address, and size is the read-in amount.
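The following Python sketch splits the textual format above into an opcode and named operation fields. `SelectInstr` and `parse` are hypothetical names. The sketch treats `size` as optional, assuming an instruction without it falls back to the preset default read-in amount described earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectInstr:
    opcode: str
    dst: int             # target address
    src0: int            # address of the data to be operated on
    src1: int            # index data address
    size: Optional[int]  # read-in amount; None means "use the preset default"


def parse(text: str) -> SelectInstr:
    """Split 'select dst src0 src1 [size]' into its opcode and operation fields."""
    opcode, *fields = text.split()
    if opcode != "select":
        raise ValueError(f"not a select instruction: {opcode!r}")
    if len(fields) not in (3, 4):
        raise ValueError("expected operation fields: dst src0 src1 [size]")
    values = [int(f) for f in fields]
    size = values[3] if len(values) == 4 else None
    return SelectInstr(opcode, values[0], values[1], values[2], size)


print(parse("select 500 100 200 5"))
```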
It should be understood that those skilled in the art can set, as needed, the opcode of the selection instruction and the positions of the opcode and the operation fields within the instruction format. The present disclosure does not limit this.
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that the selection instruction processing device has been described above using the foregoing embodiments as examples, but those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can configure each module flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application example
The following application example, according to an embodiment of the present disclosure, takes "data selection using the selection instruction processing device" as an exemplary scenario to help explain how the selection instruction processing device works. Those skilled in the art should understand that this example is only meant to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 5-3 shows a schematic diagram of an application scenario of the selection instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 5-3, the selection instruction processing device processes a selection instruction as follows:
The control module 1-11 compiles the obtained selection instruction 1 to obtain compiled selection instruction 1. It then parses the compiled selection instruction 1 (for example, select 500 100 200 5) to obtain its opcode and operation fields. The opcode is select, the target address is 500, the address of the data to be operated on is 100, the index data address is 200, and the read-in amount is 5. The control module 1-11 then fetches 5 data to be operated on from address 100 and 5 index data from address 200.
Assume the fetched data to be operated on are 1, 5, 6, 7, and 3, and the index data are 1, 8, 0, 6, and 9. The storage condition is that the index datum is not 0.
The operation module 1-12 checks each index datum in turn to see whether it is 0. Whenever an index datum is not 0, the module stores the corresponding datum to be operated on into target address 500, in order. Specifically, the operation module 1-12 checks the index data "1, 8, 0, 6, 9" one by one. Only the third index datum is 0, so the data "1, 5, 7, 3" are stored in turn at target address 500, and the third datum, 6, is skipped. For the working process of each module above, refer to the related description above.
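The following Python sketch reproduces the selection operation from the application example. It assumes memory is word-addressed, with one datum per address, and models that memory as a dict. The function name `select` and its `keep` predicate parameter are hypothetical. The predicate defaults to the storage condition used in the example, "index datum is not 0".

```python
from typing import Callable, Dict


def select(memory: Dict[int, int], dst: int, src0: int, src1: int, size: int,
           keep: Callable[[int], bool] = lambda idx: idx != 0) -> int:
    """Pack memory[src0 + i] into dst, dst + 1, ... whenever memory[src1 + i] passes `keep`.

    Returns how many data were written to the target address.
    """
    written = 0
    for i in range(size):
        if keep(memory[src1 + i]):                    # storage condition on index datum i
            memory[dst + written] = memory[src0 + i]  # store in order at the target
            written += 1
    return written


# Reproduce the application example: select 500 100 200 5
memory = {100 + i: v for i, v in enumerate([1, 5, 6, 7, 3])}
memory.update({200 + i: v for i, v in enumerate([1, 8, 0, 6, 9])})
count = select(memory, dst=500, src0=100, src1=200, size=5)
print([memory[500 + i] for i in range(count)])  # [1, 5, 7, 3]
```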
In this way, the selection instruction processing device can process selection instructions efficiently and quickly, and it performs the selection operation with high efficiency and speed.
FIG. 5-4 shows a flowchart of a selection instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 5-4, the method is applied to the above selection instruction processing device and includes step S51-1 and step S52-1. The method can be applied to computer equipment that includes a memory and a processor. The memory stores data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-1 and S52-1 below.
In step S51-1, the control module compiles the obtained selection instruction to obtain the compiled selection instruction. It parses the compiled selection instruction to obtain the opcode and operation field of the selection instruction. According to the opcode and operation field, it then obtains the index data, the data to be operated on, and the target address required to execute the selection instruction. The opcode indicates that the selection instruction performs a selection operation on the data. The operation field includes the address of the data to be operated on, the index data address, and the target address.
In step S52-1, the operation module checks each index datum in turn to determine whether it meets the storage condition. Whenever an index datum meets the storage condition, the corresponding datum to be operated on is stored into the target address, in order.
In a possible implementation, the operation module may include a plurality of comparators,
and using the operation module to check each index datum in turn against the storage condition, and to store in order into the target address the data to be operated on corresponding to index data that meet the storage condition, may include:
using the plurality of comparators in the operation module to determine in turn whether each index datum meets the storage condition.
In a possible implementation, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of comparators,
and using the operation module to check each index datum in turn against the storage condition, and to store in order into the target address the data to be operated on corresponding to index data that meet the storage condition, may include:
using the plurality of comparators to determine in turn whether each index datum meets the storage condition, identifying the data to be operated on that correspond to index data meeting the storage condition, and storing those data into the target address in order.
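The disclosure does not say how the comparators are organized. As an assumption only, this sketch has each comparator handle one lane of index data, with all lanes evaluating the storage condition in the same step. The sketch then computes write positions with an exclusive prefix sum over the resulting mask. This is the standard stream-compaction technique, swapped in here for illustration and not necessarily the authors' design. The names `compact` and `lanes` are hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def compact(data: Sequence[int], index: Sequence[int], lanes: int = 4) -> List[int]:
    """Keep data[i] where index[i] != 0, processing `lanes` index data per step."""
    out: List[int] = []
    for base in range(0, len(index), lanes):
        chunk_idx = index[base:base + lanes]
        chunk_dat = data[base:base + lanes]
        # One comparator per lane: every lane tests "index != 0" in the same step.
        mask = [1 if x != 0 else 0 for x in chunk_idx]
        # Exclusive prefix sum of the mask gives each kept datum its output slot.
        slots = [s - m for s, m in zip(accumulate(mask), mask)]
        chunk_out = [0] * sum(mask)
        for d, m, s in zip(chunk_dat, mask, slots):
            if m:
                chunk_out[s] = d
        out.extend(chunk_out)  # chunks are appended in order, preserving sequence
    return out


print(compact([1, 5, 6, 7, 3], [1, 8, 0, 6, 9]))  # [1, 5, 7, 3]
```

Because each kept datum's slot is computed independently, the writes within a chunk could proceed in parallel while still producing the same ordered result as a sequential scan.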
In a possible implementation, the operation field may further include the read-in amount or a storage address of the read-in amount. In that case, step S51-1 may include obtaining the read-in amount and obtaining the data to be operated on according to it. The amount of data to be operated on is less than or equal to the read-in amount, and the read-in amount is less than or equal to the amount of index data.
In a possible implementation, the method may further include: using the storage module of the device to store the index data, the data to be operated on, and the storage condition,
where the storage module includes at least one of a register and a cache;
the cache is used to store the index data, the data to be operated on, and the storage condition, and includes at least one neuron cache (NRAM);
the register is used to store scalar data in the data to be operated on and in the storage condition;
the neuron cache is used to store neuron data in the data to be operated on and in the storage condition, the neuron data including neuron vector data.
In a possible implementation, step S51-1 may include:
storing the compiled selection instruction;
parsing the compiled selection instruction to obtain the opcode and operation field of the selection instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the instructions to be executed including the compiled selection instruction.
In a possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after determining that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
where the first instruction to be executed having an association relationship with the preceding zeroth instruction to be executed includes:
a first storage address interval, which stores the data required by the first instruction to be executed, overlapping a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
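The following Python sketch illustrates the association check described above. It assumes each instruction's data occupies a list of half-open address intervals `[start, end)`. `Instr`, `overlaps`, `depends_on`, and `must_wait` are hypothetical names, and the address ranges in the usage lines are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address interval [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class Instr:
    name: str
    ranges: List[Interval]  # address intervals holding the data this instruction needs


def depends_on(first: Instr, zeroth: Instr) -> bool:
    """The first instruction is associated with the zeroth if any of their intervals overlap."""
    return any(overlaps(x, y) for x in first.ranges for y in zeroth.ranges)


def must_wait(first: Instr, earlier_unfinished: List[Instr]) -> bool:
    """Keep `first` cached while any earlier, still-executing instruction is associated with it."""
    return any(depends_on(first, zeroth) for zeroth in earlier_unfinished)


sel = Instr("select", [(100, 105), (200, 205), (500, 505)])  # select 500 100 200 5
nxt = Instr("next", [(500, 504), (600, 604)])                # reads the selected output
print(must_wait(nxt, [sel]))  # True: it must wait until the select finishes
```

The overlap test is conservative. It also stalls read-after-read pairs, which matches the text's definition (any overlap) rather than a finer read/write hazard analysis.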
In a possible implementation, using the control module to compile the obtained selection instruction to obtain the compiled selection instruction includes:
generating an assembly file according to the selection instruction and translating the assembly file into a binary file, where the binary file is the compiled selection instruction.
In a possible implementation, the storage condition may include that the index datum is not zero.
It should be noted that the selection instruction processing method has been described above using the foregoing embodiments as examples, but those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can configure each step flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
The selection instruction processing method provided by the embodiments of the present disclosure has a wide range of application. It processes selection instructions, and performs selection operations, with high efficiency and speed.
The foregoing can be better understood in light of the following clauses:
Clause E1. A selection instruction processing device, the device comprising:
a control module, configured to compile an obtained selection instruction to obtain a compiled selection instruction, parse the compiled selection instruction to obtain an opcode and an operation field of the selection instruction, and obtain, according to the opcode and the operation field, a plurality of index data, a plurality of data to be operated on, and a target address required to execute the selection instruction; and
an operation module, configured to determine in turn whether each of the plurality of index data meets a storage condition and, when an index datum meets the storage condition, store the data to be operated on corresponding to the index data meeting the storage condition into the target address in order,
wherein the opcode indicates that the operation performed by the selection instruction on the data is a selection operation, and the operation field includes an address of the data to be operated on, an index data address, and the target address.
Clause E2. The device according to Clause E1, wherein the operation module includes:
a plurality of comparators, configured to determine in turn whether each of the plurality of index data meets the storage condition.
Clause E3. The device according to Clause E2, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of comparators,
the master operation sub-module being configured to use the plurality of comparators to determine in turn whether each of the plurality of index data meets the storage condition, identify the data to be operated on corresponding to index data meeting the storage condition, and store those data into the target address in order.
Clause E4. The device according to Clause E1, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
the control module is further configured to obtain the read-in amount and obtain the plurality of data to be operated on according to the read-in amount,
wherein the amount of the plurality of data to be operated on is less than or equal to the read-in amount, and the read-in amount is less than or equal to the amount of the plurality of index data.
Clause E5. The device according to Clause E1, the device further comprising:
a storage module, configured to store the plurality of index data, the plurality of data to be operated on, and the storage condition,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the plurality of index data, the plurality of data to be operated on, and the storage condition, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the plurality of data to be operated on and in the storage condition;
the neuron cache is configured to store neuron data in the plurality of data to be operated on and in the storage condition, the neuron data including neuron vector data.
Clause E6. The device according to Clause E1, wherein the control module includes:
an instruction storage sub-module, configured to store the compiled selection instruction;
an instruction processing sub-module, configured to parse the compiled selection instruction to obtain the opcode and the operation field of the selection instruction; and
a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including a plurality of compiled selection instructions.
Clause E7. The device according to Clause E6, wherein the control module further includes:
a dependency processing sub-module, configured to do the following when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed preceding it: cache the first instruction to be executed in the instruction storage sub-module and, after the zeroth instruction to be executed finishes executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed preceding it includes:
a first storage address interval storing data required by the first instruction to be executed overlapping a zeroth storage address interval storing data required by the zeroth instruction to be executed.
Clause E8. The device according to Clause E1,
wherein the control module is further configured to generate an assembly file according to the selection instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled selection instruction.
Clause E9. The device according to any one of Clauses E1 to E8, wherein the storage condition includes that the index datum is not zero.
Clause E10. A machine learning computing device, wherein the device comprises:
one or more selection instruction processing devices according to any one of Clauses E1 to E9, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning computing device includes a plurality of the selection instruction processing devices, the plurality of selection instruction processing devices may be connected to one another and transmit data through a specific structure;
wherein the plurality of selection instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of selection instruction processing devices share the same control system or have their own control systems, and share memory or have their own memories; and the plurality of selection instruction processing devices may be interconnected in any interconnection topology.
Clause E11. A combined processing device, wherein the combined processing device comprises:
the machine learning computing device according to Clause E10, a universal interconnection interface, and other processing devices;
the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user,
wherein the combined processing device further comprises a storage device, which is connected to the machine learning computing device and to the other processing devices and is configured to store data of the machine learning computing device and the other processing devices.
Clause E12. A machine learning chip, wherein the machine learning chip comprises:
the machine learning computing device according to Clause E10 or the combined processing device according to Clause E11.
Clause E13. An electronic device, wherein the electronic device comprises:
the machine learning chip according to Clause E12.
条款E14、一种板卡,其特征在于,所述板卡包括:存储器件、接口装置和控制器件以及如条款E12所述的机器学习芯片;Clause E14, a board card, characterized in that the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause E12;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used for storing data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款E15、一种选择指令处理方法,其特征在于,所述方法应用于选择指令处理装置,所述装置包括控制模块和运算模块,所述方法包括:Item E15. A method for processing a selection instruction, characterized in that the method is applied to a device for processing a selection instruction, the device includes a control module and an arithmetic module, and the method includes:
利用控制模块对获取到选择指令进行编译,得到编译后的选择指令,对所述编译后的选择指令进行解析,得到选择指令的操作码和操作域,并根据所述操作码和所述操作域获取执行选择指令所需的多个索引数据、多个待运算数据和目标地址;The control module is used to compile the obtained selection instruction to obtain the compiled selection instruction, and the compiled selection instruction is analyzed to obtain the operation code and operation domain of the selection instruction, and the operation code and the operation domain are determined according to the operation code and the operation domain. Obtain multiple index data, multiple data to be calculated and target addresses required to execute the selection instruction;
利用运算模块依次判断所述多个索引数据是否满足存储条件,并在索引数据满足所述存储条件时,将与满足存储条件的索引数据相对应的待运算数据顺序存入所述目标地址中,The operation module is used to sequentially determine whether the plurality of index data meet the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions into the target address,
其中,所述操作码用于指示所述选择指令对数据所进行的运算为选择运算,所述操作域包括待运算数据地址、索引数据地址和所述目标地址。Wherein, the operation code is used to indicate that the operation performed by the selection instruction on the data is a selection operation, and the operation field includes a data address to be operated, an index data address, and the target address.
条款E16、根据条款E15所述的方法,其特征在于,所述运算模块包括所述多个比较器,Clause E16. The method according to Clause E15, characterized in that the arithmetic module includes the plurality of comparators,
其中,利用运算模块依次判断所述多个索引数据是否满足存储条件,并在索引数据满足所述存储条件时,将与满足存储条件的索引数据相对应的待运算数据顺序存入所述目标地址中,包括:Wherein, the operation module is used to sequentially determine whether the plurality of index data meet the storage conditions, and when the index data meets the storage conditions, the data to be operated corresponding to the index data that meets the storage conditions are sequentially stored in the target address Including:
利用所述运算模块中的多个比较器依次判断所述多个索引数据是否满足所述存储条件。A plurality of comparators in the arithmetic module are used to sequentially determine whether the plurality of index data satisfy the storage condition.
条款E17、根据条款E16所述的方法,其特征在于,所述运算模块包括主运算子模块和多个从运算子模块,所述主运算子模块包括所述多个比较器,Clause E17. The method according to Clause E16, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of comparators,
其中,利用运算模块依次判断所述多个索引数据是否满足存储条件,并在索引数据满足所述存储条件时,将与满足存储条件的索引数据相对应的待运算数据顺序存入所述目标地址中,包括:Wherein, the operation module is used to sequentially determine whether the plurality of index data meet the storage conditions, and when the index data meets the storage conditions, sequentially store the data to be operated corresponding to the index data that meets the storage conditions into the target address Including:
利用所述多个比较器依次判断所述多个索引数据是否满足所述存储条件,确定出与满足存储条件的索引数据相对应的待运算数据,并将与满足存储条件的索引数据相对应的待运算数据顺序存入所述目标地址中。Use the plurality of comparators to sequentially determine whether the plurality of index data satisfy the storage condition, determine the data to be calculated corresponding to the index data that meets the storage condition, and compare the index data that meets the storage condition The data to be operated are sequentially stored in the target address.
条款E18、根据条款E15所述的方法,其特征在于,所述操作域还包括读入量或所述读入量的存储地址,Clause E18. The method according to Clause E15, wherein the operation domain further includes a read-in amount or a storage address of the read-in amount,
其中,根据所述操作码和所述操作域获取执行选择指令所需的多个索引数据、多个待运算数据和目标地址,包括:Wherein, acquiring a plurality of index data, a plurality of data to be calculated and a target address required to execute the selection instruction according to the operation code and the operation domain includes:
获取所述读入量,并按照所述读入量获取所述多个待运算数据,Acquiring the read-in amount, and acquiring the plurality of data to be calculated according to the read-in amount,
其中,所述多个待运算数据的数据量小于或等于所述读入量,所述读入量小于或等于所述多个索引数据的数据量。Wherein, the data amount of the plurality of data to be calculated is less than or equal to the read-in amount, and the read-in amount is less than or equal to the data amount of the plurality of index data.
条款E19、根据条款E15所述的方法,其特征在于,所述方法还包括:Clause E19. The method according to Clause E15, characterized in that the method further comprises:
利用所述装置的存储模块存储所述多个索引数据、所述多个待运算数据以及所述存储条件,Using the storage module of the device to store the plurality of index data, the plurality of data to be calculated, and the storage conditions,
其中,所述存储模块包括寄存器和缓存中的至少一种,Wherein, the storage module includes at least one of a register and a cache,
所述缓存,用于存储所述多个索引数据、所述多个待运算数据以及所述存储条件,所述缓存包括至少一个神经元缓存NRAM;The cache is used to store the plurality of index data, the plurality of data to be calculated, and the storage conditions, and the cache includes at least one neuron cache NRAM;
所述寄存器,用于存储所述待运算数据、所述多个待运算数据以及所述存储条件中的标量数据;The register is used to store the data to be calculated, the plurality of data to be calculated, and the scalar data in the storage condition;
the neuron cache is used to store the neuron data among the data to be operated on, the plurality of data to be operated on, and the storage condition, where the neuron data includes neuron vector data.
Clause E20. The method according to Clause E15, wherein parsing the compiled selection instruction to obtain the opcode and operation field of the selection instruction comprises:
storing the compiled selection instruction;
parsing the compiled selection instruction to obtain the opcode and operation field of the selection instruction; and
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include a plurality of compiled selection instructions.
Clause E21. The method according to Clause E20, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after it is determined that execution of the zeroth instruction to be executed has completed, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it comprises:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause E22. The method according to Clause E15, wherein compiling the obtained selection instruction by using the control module to obtain the compiled selection instruction comprises:
generating an assembly file according to the selection instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled selection instruction.
Clause E23. The method according to any one of Clauses E15 to E22, wherein the storage condition includes that the index data is not zero.
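The storage condition in Clause E23 can be pictured as a mask. The sketch below is only a hedged illustration. It assumes that the selection instruction pairs each element of the data to be operated on with the index element at the same position, and that it keeps the element when that index element is non-zero. Both the pairing rule and the function name `select_nonzero_index` are assumptions made for illustration and do not come from this disclosure.

```python
from typing import List, Sequence


def select_nonzero_index(data: Sequence[float], index: Sequence[float]) -> List[float]:
    """Keep data[i] wherever index[i] != 0.

    Assumption (not from the disclosure): data and index are paired
    element-wise and have the same length.
    """
    if len(data) != len(index):
        raise ValueError("data and index must have the same length")
    # Storage condition from Clause E23: the index element is not zero.
    return [d for d, i in zip(data, index) if i != 0]
```

For example, selecting from the data 1.0, 2.0, 3.0, 4.0 with the index 0, 1, 0, 2 keeps 2.0 and 4.0.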
Neural network algorithms are now widely used, and the computing capability of computer hardware keeps improving. As a result, the types and volume of data operations involved in practical applications keep growing. The related art has no counting instruction that applies broadly across the many programming languages in use. Technicians who need to perform counting statistics in different language environments must therefore either write multiple custom instructions for each language environment or create a separate counting instruction for each one, which makes counting statistics inefficient and slow. The present disclosure provides a counting instruction processing method and device, a computer device, and a storage medium that perform counting statistics with a single instruction. This can significantly improve the efficiency and speed of counting statistics.
FIG. 6-1 is a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 6-1, the device includes a control module 2-11 and an operation module 2-12.
The control module 2-11 is configured to compile an obtained counting instruction to obtain a compiled counting instruction. It parses the compiled counting instruction to obtain the opcode and operation field of the counting instruction, and obtains, according to the opcode and the operation field, the plurality of data to be operated on and the target address required to execute the counting instruction. The opcode indicates that the operation the counting instruction performs on the data is a counting statistics operation, and the operation field includes the address of the data to be operated on and the target address.
The operation module 2-12 is configured to determine how many of the plurality of data to be operated on satisfy a counting condition, and to store this count at the target address.
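The following is a minimal sketch of what the counting instruction computes. It assumes a flat, word-addressed memory modeled as a Python list. The function name `execute_count` and the memory model are hypothetical, and the real device performs this operation in hardware rather than in software.

```python
from typing import Callable, List


def execute_count(memory: List[int], dst: int, src0: int, size: int,
                  condition: Callable[[int], bool]) -> int:
    """Count the elements of memory[src0:src0 + size] that satisfy
    `condition`, store the count at memory[dst], and return it.

    Hypothetical software model of the counting instruction.
    """
    operands = memory[src0:src0 + size]               # data to be operated on
    count = sum(1 for x in operands if condition(x))  # counting statistics
    memory[dst] = count                               # write to the target address
    return count
```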
In this embodiment, the counting instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module first compiles the (uncompiled) counting instruction. Only after the compiled counting instruction is obtained can it be parsed. The compiled counting instruction is a hardware instruction that the hardware can execute directly. The control module can obtain the plurality of data to be operated on from the address of the data to be operated on.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented as a code), that specifies the operation to be performed in a computer program. It serves as the instruction's identifying number and tells the executing device which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, including parameter data, the data to be operated on, the corresponding operation method, and so on. A counting instruction must include an opcode and an operation field, and the operation field includes at least the address of the data to be operated on and the target address.
It should be understood that those skilled in the art can set the instruction format of the counting instruction, and the opcode and operation field it contains, as required; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules. Their numbers may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, that control module may receive counting instructions and control one or more operation modules to perform counting statistics. When the device includes multiple control modules, each control module may receive counting instructions and control its corresponding one or more operation modules to perform counting statistics.
The counting instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module compiles the obtained counting instruction and parses the compiled counting instruction to obtain its opcode and operation field. According to the opcode and operation field, it then obtains the plurality of data to be operated on and the target address required to execute the counting instruction. The operation module determines how many of the data to be operated on satisfy the counting condition and stores that number at the target address. The device is widely applicable and processes counting instructions efficiently and quickly, and therefore performs counting statistics efficiently and quickly.
In a possible implementation, the counting condition may be that the data to be operated on is not zero.
In this implementation, the operation module counts how many of the plurality of data to be operated on are not 0 and stores that number at the target address. The counting condition may also be that the data to be operated on is not equal to a specified value, such as 1. Those skilled in the art can set the counting condition according to actual needs, and the present disclosure does not limit this.
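The counting condition can be modeled as a predicate passed to the count. The hypothetical helpers below (`not_equal_to` and `count_matching`) cover both conditions mentioned above: not equal to zero, and not equal to a specified value such as 1. This is an illustrative sketch only.

```python
from typing import Callable, Iterable


def not_equal_to(value: float) -> Callable[[float], bool]:
    """Build a counting condition of the form 'element != value'."""
    return lambda x: x != value


def count_matching(data: Iterable[float], condition: Callable[[float], bool]) -> int:
    """Return how many elements of `data` satisfy `condition`."""
    return sum(1 for x in data if condition(x))


not_zero = not_equal_to(0)  # the default condition: data is not zero
not_one = not_equal_to(1)   # the alternative: data is not a specified value
```

For example, with the data 1, 1, 0, 2, the `not_zero` condition counts 3 elements and the `not_one` condition counts 2.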
FIG. 6-2a is a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-2a, the operation module 2-12 may include a plurality of counters 2-120. The counters 2-120 count the data to be operated on that satisfy the counting condition, which yields the number of such data among the plurality of data to be operated on.
In this implementation, the operation module may also include a single counter. The number of counters may be set according to the volume of data to be counted and the required processing speed and efficiency of the counting statistics operation, which is not limited in the present disclosure.
FIG. 6-2b is a block diagram of a counting instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-2b, the operation module 2-12 may include a master operation submodule 2-121 and a plurality of slave operation submodules 2-122. The master operation submodule 2-121 includes the plurality of counters 2-120 (not shown in the figure).
The master operation submodule 2-121 uses the counters 2-120 to count the data to be operated on that satisfy the counting condition. It determines how many of the plurality of data to be operated on satisfy the counting condition and stores that number at the target address.
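The disclosure does not say how work is divided among multiple counters. One plausible scheme, shown here purely as an assumption, gives each counter a contiguous chunk of the data to be operated on and then sums the partial counts. The function name `count_with_counters` is hypothetical. Under this scheme the number of counters affects only the degree of parallelism, not the result.

```python
from typing import Callable, List, Sequence


def count_with_counters(data: Sequence[float], num_counters: int,
                        condition: Callable[[float], bool]) -> int:
    """Split `data` into contiguous chunks, one per counter, count each
    chunk separately, and sum the partial counts.

    The partitioning scheme is an illustrative assumption.
    """
    if num_counters < 1:
        raise ValueError("need at least one counter")
    chunk = -(-len(data) // num_counters)  # ceiling division
    partial: List[int] = []
    for k in range(num_counters):
        part = data[k * chunk:(k + 1) * chunk]
        partial.append(sum(1 for x in part if condition(x)))  # one counter's tally
    return sum(partial)
```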
In a possible implementation, the operation field may further include a read-in amount or the storage address of the read-in amount. The control module 2-11 is further configured to obtain the read-in amount and to obtain the plurality of data to be operated on according to it. The data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
In this implementation, the read-in amount may be the data amount, or size, of the plurality of data to be operated on that are obtained. When the operation field directly contains a specific value for the read-in amount, that value may be used as the read-in amount. When the operation field contains the storage address of the read-in amount, the read-in amount may be fetched from that address.
In a possible implementation, when the operation field does not include the read-in amount, the plurality of data to be operated on may be obtained according to a preset default read-in amount. In that case, the data amount obtained is less than or equal to the default read-in amount.
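Resolving the read-in amount involves three cases:

- an immediate value in the operation field;
- a storage address from which the amount is read;
- no field at all, in which case a preset default applies.

The sketch below encodes these cases with a hypothetical tagged tuple and a hypothetical `DEFAULT_READ_IN` of 64. Neither the encoding nor the default value comes from the disclosure.

```python
from typing import List, Optional, Tuple

DEFAULT_READ_IN = 64  # hypothetical preset default read-in amount


def resolve_read_in(memory: List[int], size_field: Optional[Tuple[str, int]]) -> int:
    """Return the read-in amount.

    size_field is ("imm", n) when the operation field holds the value n,
    ("addr", a) when the read-in amount is stored at address a, or None
    when the operation field has no read-in amount.
    """
    if size_field is None:
        return DEFAULT_READ_IN
    kind, value = size_field
    if kind == "imm":
        return value
    if kind == "addr":
        return memory[value]
    raise ValueError(f"unknown size field kind: {kind}")


def fetch_operands(memory: List[int], src0: int, read_in: int) -> List[int]:
    # Slicing never returns more than read_in elements, so the amount
    # fetched is <= the read-in amount (fewer if memory ends first).
    return memory[src0:src0 + read_in]
```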
In this way, the data amount of the plurality of data to be operated on can be bounded. This helps ensure the accuracy of the resulting count and ensures that the device can execute the counting instruction.
In a possible implementation, as shown in FIG. 6-2a and FIG. 6-2b, the device may further include a storage module 2-13, which stores the plurality of data to be operated on and the counting condition.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may include at least one NRAM (Neuron Random Access Memory). The register may be used to store scalar data among the plurality of data to be operated on and the counting condition.
In a possible implementation, the neuron cache may be used to store neuron data among the plurality of data to be operated on and the counting condition, where the neuron data includes neuron vector data.
In a possible implementation, the control module 2-11 may be further configured to generate an assembly file according to the counting instruction and to translate the assembly file into a binary file, where the binary file is the compiled counting instruction.
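The assembly-to-binary step can be illustrated with a toy assembler. The opcode value `0x2A` and the 13-byte layout (a 1-byte opcode followed by three little-endian 32-bit fields) are invented for this sketch; the disclosure does not specify a binary encoding. The function names `assemble` and `disassemble` are likewise hypothetical. The code uses only Python's standard `struct` module.

```python
import struct

# Hypothetical opcode table and binary layout (not from the disclosure):
# 1-byte opcode followed by three little-endian unsigned 32-bit fields.
OPCODES = {"count": 0x2A}
LAYOUT = "<BIII"


def assemble(line: str) -> bytes:
    """Translate one assembly line such as 'count 500 100 5' into bytes."""
    mnemonic, *fields = line.split()
    if mnemonic not in OPCODES or len(fields) != 3:
        raise ValueError(f"cannot assemble: {line!r}")
    dst, src0, size = (int(f) for f in fields)
    return struct.pack(LAYOUT, OPCODES[mnemonic], dst, src0, size)


def disassemble(blob: bytes) -> str:
    """Inverse of assemble(), useful for checking the round trip."""
    op, dst, src0, size = struct.unpack(LAYOUT, blob)
    names = {v: k for k, v in OPCODES.items()}
    return f"{names[op]} {dst} {src0} {size}"
```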
In a possible implementation, the instruction format of the counting instruction may be:
count dst src0 size
Here, count is the opcode of the counting instruction, and dst, src0, and size form its operation field: dst is the target address, src0 is the address of the data to be operated on, and size is the read-in amount.
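Parsing the textual form `count dst src0 size` into an opcode and an operation field might look like the following sketch. The `CountInstruction` dataclass and the `parse_count` function are hypothetical names. A real control module would decode the compiled binary instruction rather than text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountInstruction:
    opcode: str  # "count"
    dst: int     # target address
    src0: int    # address of the data to be operated on
    size: int    # read-in amount


def parse_count(text: str) -> CountInstruction:
    """Split 'count dst src0 size' into its opcode and operation field."""
    tokens = text.split()
    if len(tokens) != 4 or tokens[0] != "count":
        raise ValueError(f"not a count instruction: {text!r}")
    dst, src0, size = map(int, tokens[1:])
    return CountInstruction(tokens[0], dst, src0, size)
```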
It should be understood that those skilled in the art can set the opcode of the counting instruction, and the positions of the opcode and operation field in the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the counting instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application examples
The following takes "counting statistics using the counting instruction processing device" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to help explain the flow of the counting instruction processing device. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 6-3 is a schematic diagram of an application scenario of a counting instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 6-3, the counting instruction processing device processes a counting instruction as follows:
The control module 2-11 compiles the obtained counting instruction 1 to obtain compiled counting instruction 1 (for example, count 500 100 5). It then parses compiled counting instruction 1 to obtain the opcode and operation field of counting instruction 1: the opcode is count, the target address is 500, the address of the data to be operated on is 100, and the read-in amount is 5. The control module 2-11 then obtains the data to be operated on, with a read-in amount of 5, starting from address 100.
Assume that the obtained data to be operated on are 1, 5, 0, 7, and 3, and that the counting condition is that the data to be operated on is not 0.
The operation module 2-12 counts how many of the data to be operated on are not 0 and stores that number at target address 500. Four of the five values (1, 5, 7, and 3) are non-zero, so 4 is stored at address 500. For the working process of each module, refer to the related description above.
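The application example can be replayed in a few lines. This sketch models memory as a Python dict keyed by address, which is an assumption made for illustration; `run_count` is a hypothetical name. It executes count 500 100 5 on the data 1, 5, 0, 7, 3 with the not-zero counting condition.

```python
# Simulated memory: address -> value (a dict stands in for device memory).
memory = {100 + i: v for i, v in enumerate([1, 5, 0, 7, 3])}


def run_count(mem: dict, dst: int, src0: int, size: int) -> None:
    data = [mem[src0 + i] for i in range(size)]  # read-in amount = size
    mem[dst] = sum(1 for x in data if x != 0)    # counting condition: not zero


run_count(memory, dst=500, src0=100, size=5)
print(memory[500])  # 4
```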
In this way, the counting instruction processing device can process counting instructions efficiently and quickly, and therefore performs counting statistics efficiently and quickly.
FIG. 6-4 is a flowchart of a counting instruction processing method according to an embodiment of the present disclosure. The method may be applied to a device such as a computer device that includes a memory and a processor. The memory stores data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-2 and S52-2 below. As shown in FIG. 6-4, the method is applied to the counting instruction processing device described above and includes step S51-2 and step S52-2.
In step S51-2, the control module compiles the obtained counting instruction to obtain a compiled counting instruction. It parses the compiled counting instruction to obtain the opcode and operation field of the counting instruction, and obtains, according to the opcode and the operation field, the plurality of data to be operated on and the target address required to execute the counting instruction. The opcode indicates that the operation the counting instruction performs on the data is a counting statistics operation, and the operation field includes the address of the data to be operated on and the target address.
In step S52-2, the operation module determines how many of the plurality of data to be operated on satisfy the counting condition and stores that number at the target address.
In a possible implementation, determining how many of the plurality of data to be operated on satisfy the counting condition may include:
counting, by using a plurality of counters in the operation module, the data to be operated on that satisfy the counting condition, to obtain the number of data among the plurality of data to be operated on that satisfy the counting condition.
In a possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and the master operation submodule includes a plurality of counters,
wherein determining how many of the plurality of data to be operated on satisfy the counting condition and storing that number at the target address includes:
counting, by using the plurality of counters in the master operation submodule, the data to be operated on that satisfy the counting condition, determining how many of the plurality of data to be operated on satisfy the counting condition, and storing that number at the target address.
In a possible implementation, the operation field further includes a read-in amount or the storage address of the read-in amount,
wherein obtaining, according to the opcode and the operation field, the plurality of data to be operated on and the target address required to execute the counting instruction may include:
obtaining the read-in amount, and obtaining the plurality of data to be operated on according to the read-in amount,
wherein the data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
In a possible implementation, the method may further include storing the plurality of data to be operated on and the counting condition by using a storage module of the device,
wherein the storage module includes at least one of a register and a cache,
the cache is used to store the plurality of data to be operated on and the counting condition, and includes at least one neuron cache NRAM;
the register is used to store scalar data among the plurality of data to be operated on and the counting condition; and
the neuron cache is used to store neuron data among the plurality of data to be operated on and the counting condition, where the neuron data includes neuron vector data.
In a possible implementation, parsing the compiled counting instruction to obtain the opcode and operation field of the counting instruction may include:
storing the compiled counting instruction;
parsing the compiled counting instruction to obtain the opcode and operation field of the counting instruction; and
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the compiled counting instruction.
In a possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and executing it after execution of the zeroth instruction to be executed has completed,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
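The association check reduces to an interval-overlap test. The sketch below rests on three assumptions:

- address intervals are half-open, [lo, hi);
- no instruction completes while the queue is scanned (a simplified queue model);
- the names `PendingInstr` and `issue_order` are hypothetical.

An instruction whose interval overlaps that of any earlier unfinished instruction is marked to wait. This generalizes the zeroth/first instruction pair described above.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class PendingInstr:
    name: str
    lo: int  # start of the address interval holding the data it needs
    hi: int  # end of that interval (exclusive)


def overlaps(a: PendingInstr, b: PendingInstr) -> bool:
    """Half-open intervals [lo, hi) overlap iff each starts before the other ends."""
    return a.lo < b.hi and b.lo < a.hi


def issue_order(queue: Deque[PendingInstr]) -> List[Tuple[str, str]]:
    """Scan the queue in execution order. Mark an instruction 'wait' (it
    would be cached until the earlier one completes) if its interval
    overlaps any earlier unfinished instruction; otherwise mark it 'issue'.
    Simplified model: nothing completes during the scan.
    """
    unfinished: List[PendingInstr] = []
    decisions: List[Tuple[str, str]] = []
    for instr in queue:
        blocked = any(overlaps(instr, prev) for prev in unfinished)
        decisions.append((instr.name, "wait" if blocked else "issue"))
        unfinished.append(instr)
    return decisions
```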
In a possible implementation, compiling the obtained counting instruction to obtain the compiled counting instruction may include:
generating an assembly file according to the counting instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled counting instruction.
In a possible implementation, the counting condition may include that the data to be operated on is not zero.
It should be noted that although the counting instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The counting instruction processing method provided by the embodiments of the present disclosure is widely applicable and processes counting instructions efficiently and quickly, and therefore performs counting statistics efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause F1. A counting instruction processing device, comprising:
a control module, configured to compile an obtained counting instruction to obtain a compiled counting instruction, parse the compiled counting instruction to obtain an opcode and an operation field of the counting instruction, and obtain, according to the opcode and the operation field, a plurality of data to be operated on and a target address required to execute the counting instruction; and
an operation module, configured to determine how many of the plurality of data to be operated on satisfy a counting condition, and store that number at the target address,
wherein the opcode indicates that the operation the counting instruction performs on the data is a counting statistics operation, and the operation field includes an address of the data to be operated on and the target address.
Clause F2. The device according to Clause F1, wherein the operation module comprises:
a plurality of counters, configured to count the data to be operated on that satisfy the counting condition, to obtain the number of data among the plurality of data to be operated on that satisfy the counting condition.
Clause F3. The device according to Clause F2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodule comprising the plurality of counters,
the master operation submodule being configured to use the plurality of counters to count the data to be operated on that satisfy the counting condition, determine how many of the plurality of data to be operated on satisfy the counting condition, and store that number at the target address.
Clause F4. The device according to Clause F1, wherein the operation field further comprises a read-in amount or a storage address of the read-in amount,
the control module is further configured to obtain the read-in amount and obtain the plurality of data to be operated on according to the read-in amount,
and the data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
Clause F5. The device according to Clause F1, further comprising:
a storage module, configured to store the plurality of data to be operated on and the counting condition,
wherein the storage module comprises at least one of a register and a cache,
the cache is configured to store the plurality of data to be operated on and the counting condition, and comprises at least one neuron cache NRAM;
the register is configured to store scalar data among the plurality of data to be operated on and the counting condition; and
the neuron cache is configured to store neuron data among the plurality of data to be operated on and the counting condition, the neuron data comprising neuron vector data.
Clause F6. The device according to Clause F1, wherein the control module comprises:
an instruction storage submodule, configured to store the compiled counting instruction;
a first instruction processing submodule, configured to parse the compiled counting instruction to obtain the opcode and operation field of the counting instruction; and
a queue storage submodule, configured to store an instruction queue, the instruction queue comprising a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed comprising the compiled counting instruction.
Clause F7. The device according to Clause F6, wherein the control module further comprises:
a dependency processing submodule, configured to: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage submodule; and, after execution of the zeroth instruction to be executed has completed, fetch the first instruction to be executed from the instruction storage submodule and send it to the operation module,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it comprises:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause F8. The device according to Clause F1, wherein
the control module is further configured to generate an assembly file according to the counting instruction and translate the assembly file into a binary file,
the binary file being the compiled counting instruction.
Clause F9. The device according to any one of Clauses F1 to F8, wherein the counting condition includes that the data to be operated on is not zero.
Clause F10. A machine learning operation device, comprising:
one or more counting instruction processing devices according to any one of Clauses F1 to F9, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of the counting instruction processing devices, the plurality of counting instruction processing devices may be connected and transmit data through a specific structure;
wherein the plurality of counting instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of counting instruction processing devices share one control system or have their own control systems; the plurality of counting instruction processing devices share memory or have their own memories; and the plurality of counting instruction processing devices may be interconnected in any topology.
Clause F11. A combined processing device, comprising:
the machine learning operation device according to Clause F10, a general interconnection interface, and another processing device;
the machine learning operation device interacts with the other processing device to jointly complete a computing operation specified by the user,
wherein the combined processing device further comprises a storage device, which is connected to the machine learning operation device and to the other processing device respectively, and is configured to store data of the machine learning operation device and the other processing device.
Clause F12. A machine learning chip, comprising:
the machine learning operation device according to Clause F10 or the combined processing device according to Clause F11.
Clause F13. An electronic device, comprising:
the machine learning chip according to Clause F12.
Clause F14. A board card, comprising: a storage component, an interface device, a control component, and the machine learning chip according to Clause F12;
wherein the machine learning chip is connected to the storage component, the control component, and the interface device respectively;
the storage component is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device;
the control component is configured to monitor the state of the machine learning chip.
Clause F15. A counting instruction processing method, applied to a counting instruction processing device that includes a control module and an operation module, the method comprising:
compiling, by the control module, an obtained counting instruction to obtain a compiled counting instruction; parsing the compiled counting instruction to obtain an opcode and an operation field of the counting instruction; and obtaining, according to the opcode and the operation field, a plurality of data to be operated on and a target address required to execute the counting instruction;
determining, by the operation module, the number of data items among the plurality of data to be operated on that satisfy a counting condition, and storing that number at the target address,
wherein the opcode indicates that the operation performed by the counting instruction on the data is a counting statistics operation, and the operation field includes a to-be-operated data address and the target address.
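The effect of the counting instruction in Clause F15 can be sketched as follows, assuming a flat word-addressed memory modeled as a dictionary and the non-zero counting condition of Clause F23 as the default. The function name `execute_count` and the explicit element count `n` are illustrative assumptions, not the disclosure's instruction encoding.

```python
from typing import Callable, Dict


def execute_count(memory: Dict[int, float], src: int, n: int, dst: int,
                  condition: Callable[[float], bool] = lambda v: v != 0) -> int:
    """Count how many of the n values starting at address src satisfy the
    counting condition, and store that count at the target address dst."""
    values = [memory[src + i] for i in range(n)]
    count = sum(1 for v in values if condition(v))
    memory[dst] = count  # write the result back to the target address
    return count
```

With the five values 0, 3, 0, -1, 2 stored at addresses 100 to 104, `execute_count(memory, 100, 5, 500)` counts the three non-zero entries and writes 3 to address 500.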
Clause F16. The method according to Clause F15, wherein determining the number of data items among the plurality of data to be operated on that satisfy the counting condition comprises:
counting, with a plurality of counters in the operation module, the data to be operated on that satisfy the counting condition, to obtain the number of data items among the plurality of data to be operated on that satisfy the counting condition.
Clause F17. The method according to Clause F16, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module comprises a plurality of adders and a plurality of dividers,
wherein determining the number of data items among the plurality of data to be operated on that satisfy the counting condition, and storing that number at the target address, comprises:
counting, with the plurality of counters in the master operation sub-module, the data to be operated on that satisfy the counting condition, determining the number of data items among the plurality of data to be operated on that satisfy the counting condition, and storing that number at the target address.
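Counting with several counters, as in Clauses F16 and F17, can be modeled by splitting the data into slices, letting each counter count its slice, and summing the partial counts. This is a minimal sketch assuming contiguous, roughly equal slices; the slicing policy and the name `count_with_counters` are assumptions, since the disclosure does not specify how data is assigned to counters.

```python
from typing import List, Sequence, Tuple


def count_with_counters(data: Sequence[float], num_counters: int) -> Tuple[int, List[int]]:
    """Split data into contiguous slices, one per counter, count the non-zero
    values in each slice, and return the total plus each counter's partial count."""
    if num_counters < 1:
        raise ValueError("at least one counter is required")
    chunk = -(-len(data) // num_counters)  # ceiling division
    partial = []
    for c in range(num_counters):
        part = data[c * chunk:(c + 1) * chunk]
        partial.append(sum(1 for v in part if v != 0))
    return sum(partial), partial
```

Because the counts of disjoint slices simply add up, the total is the same for any number of counters; more counters only shorten each counter's share of the work.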
Clause F18. The method according to Clause F15, wherein the operation field further includes a read-in amount or a storage address of the read-in amount,
wherein obtaining, according to the opcode and the operation field, the plurality of data to be operated on and the target address required to execute the counting instruction comprises:
obtaining the read-in amount, and obtaining the plurality of data to be operated on according to the read-in amount,
wherein the data amount of the plurality of data to be operated on is less than or equal to the read-in amount.
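Clause F18 allows the read-in amount to be given either directly or through a storage address, and caps the fetched data at that amount. A minimal sketch, assuming a dictionary-based memory and a hypothetical `stored_len` giving how many values actually sit at the source address; neither name comes from the disclosure.

```python
from typing import Dict, List, Optional


def resolve_read_in(memory: Dict[int, int], read_in: Optional[int] = None,
                    read_in_addr: Optional[int] = None) -> int:
    """Return the read-in amount, taken directly from the operation field
    or loaded from the storage address given in the operation field."""
    if read_in is not None:
        return read_in
    if read_in_addr is not None:
        return memory[read_in_addr]
    raise ValueError("operation field has neither a read-in amount nor its address")


def fetch_by_read_in(memory: Dict[int, int], src: int, stored_len: int, read_in: int) -> List[int]:
    """Fetch at most read_in values starting at src, so the fetched data
    amount never exceeds the read-in amount."""
    n = min(read_in, stored_len)
    return [memory[src + i] for i in range(n)]
```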
Clause F19. The method according to Clause F15, further comprising:
storing the plurality of data to be operated on and the counting condition using a storage module of the device,
wherein the storage module includes at least one of a register and a cache,
the cache is configured to store the plurality of data to be operated on and the counting condition, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the plurality of data to be operated on and the counting condition;
the neuron cache is configured to store neuron data among the plurality of data to be operated on and the counting condition, the neuron data including neuron vector data.
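The split between registers and the neuron cache in Clause F19 can be illustrated with a toy storage module that places scalars in registers and vectors in NRAM. This is a simplified sketch under that assumption: in the clause the cache may also hold the full data set, and the class and method names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

Value = Union[float, List[float]]


@dataclass
class StorageModule:
    registers: Dict[str, float] = field(default_factory=dict)  # scalar data
    nram: Dict[str, List[float]] = field(default_factory=dict)  # neuron (vector) data

    def store(self, name: str, value: Value) -> str:
        """Place vector data in the neuron cache and scalar data in a register;
        return where the value was placed."""
        if isinstance(value, list):
            self.nram[name] = list(value)
            return "nram"
        self.registers[name] = value
        return "register"
```

Under this model, a scalar counting threshold such as 0.0 lands in a register, while the vector of data to be counted lands in NRAM.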
Clause F20. The method according to Clause F15, wherein parsing the compiled counting instruction to obtain the opcode and the operation field of the counting instruction comprises:
storing the compiled counting instruction;
parsing the compiled counting instruction to obtain the opcode and the operation field of the counting instruction;
storing an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled counting instruction.
Clause F21. The method according to Clause F20, further comprising:
when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,
wherein the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause F22. The method according to Clause F15, wherein compiling the obtained counting instruction to obtain the compiled counting instruction comprises:
generating an assembly file according to the counting instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled counting instruction.
Clause F23. The method according to any one of Clauses F15 to F22, wherein the counting condition includes that the data to be operated on is non-zero.
Clause F24. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method according to any one of Clauses F15 to F23.
With the widespread use of neural network algorithms and the continuing growth in the computing power of computer hardware, the variety and volume of data operations in practical applications keep increasing. Because programming languages are diverse, and the related art currently offers no fully connected instruction that applies broadly across them, developers must define multiple custom instructions for their own programming language environment to implement a fully connected operation, which makes fully connected operations inefficient and slow. The present disclosure provides a fully connected instruction processing method and device, computer equipment, and a storage medium that implement a fully connected operation with a single instruction, which can significantly improve the efficiency and speed of fully connected operations.
FIG. 7-1 shows a block diagram of a fully connected instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 7-1, the device includes a control module 3-11 and an operation module 3-12.
The control module 3-11 is configured to compile an obtained fully connected instruction to obtain a compiled fully connected instruction, parse the compiled fully connected instruction to obtain its opcode and operation field, and obtain, according to the opcode and the operation field, the first data, second data, weight data, and target address required to execute the fully connected instruction. The opcode indicates that the operation performed by the fully connected instruction on the data is a fully connected operation, and the operation field includes a first data address, a second data address, a weight data address, and the target address.
The operation module 3-12 is configured to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result at the target address.
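The disclosure does not define exactly how the first data, second data, and weight data combine. As one common reading, the sketch below assumes the first data is an input vector x of length width, the second data is a bias vector b of length height, and the weight data is a height-by-width matrix stored row-major, giving y = W·x + b. These roles are assumptions for illustration only.

```python
from typing import List, Sequence


def fully_connected(x: Sequence[float], b: Sequence[float], weight: Sequence[float],
                    width: int, height: int) -> List[float]:
    """Assumed semantics: y[r] = sum_c weight[r][c] * x[c] + b[r], with the
    weight matrix stored row-major as a flat list of height * width values."""
    if len(x) != width or len(b) != height or len(weight) != width * height:
        raise ValueError("operand sizes do not match the weight width/height")
    out = []
    for r in range(height):
        row = weight[r * width:(r + 1) * width]
        acc = sum(w * xi for w, xi in zip(row, x))  # multiplications, then additions
        out.append(acc + b[r])
    return out
```

For x = [1, 2], b = [10, 20, 30], and the 3x2 weight matrix with rows [1, 0], [0, 1], [1, 1], the result is [11, 22, 33].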
In this embodiment, the fully connected instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. Only after the compiled fully connected instruction is obtained can it be parsed. The compiled fully connected instruction is a hardware instruction that the hardware can execute directly. The control module can obtain the first data, the second data, and the weight data from the first data address, the second data address, and the weight data address, respectively.
In this embodiment, the opcode may be the part of an instruction, or the field (usually represented by a code), specified in a computer program that designates the operation to be performed; it is an instruction serial number that tells the device executing the instruction which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, including the first data, the second data, the weight data, the corresponding operation method, and so on. A fully connected instruction must include an opcode and an operation field, and the operation field includes at least the first data address, the second data address, the weight data address, and the target address.
It should be understood that those skilled in the art can set the instruction format of the fully connected instruction, and the opcode and operation field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that control module can receive the fully connected instruction and control one or more operation modules to perform the fully connected operation. When the device includes multiple control modules, each control module can receive a fully connected instruction and control the corresponding one or more operation modules to perform the fully connected operation.
The fully connected instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained fully connected instruction to obtain a compiled fully connected instruction, parse the compiled fully connected instruction to obtain its opcode and operation field, and obtain, according to the opcode and the operation field, the first data, second data, weight data, and target address required to execute the fully connected instruction; the operation module is configured to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result at the target address. The device has a wide range of application, processes fully connected instructions efficiently and quickly, and performs fully connected operations with high efficiency and speed.
FIG. 7-2a shows a block diagram of a fully connected instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 7-2a, the operation module 3-12 may include a plurality of multipliers 3-120 and a plurality of adders 3-120'. The multipliers 3-120 are configured to perform the multiplications in the fully connected operation, and the adders 3-120' are configured to perform the additions in the fully connected operation.
In this implementation, the operation module may instead include one adder and one multiplier, one adder and multiple multipliers, or multiple adders and one multiplier. The numbers of multipliers and adders can be set according to the amount of data in the fully connected operation and the required processing speed and efficiency; the present disclosure does not limit this.
FIG. 7-2b shows a block diagram of a fully connected instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 7-2b, the operation module 3-12 may include a master operation sub-module 3-121 and a plurality of slave operation sub-modules 3-122, and each slave operation sub-module 3-122 includes a plurality of multipliers 3-120 and a plurality of adders 3-120' (not shown in the figure).
The control module 3-11 is further configured to parse the compiled fully connected instruction to obtain a plurality of operation instructions, and send the first data, the second data, and the plurality of operation instructions to the master operation sub-module 3-121.
The master operation sub-module 3-121 is configured to perform preprocessing on the first data and the second data, and to exchange data and operation instructions with the plurality of slave operation sub-modules 3-122.
The slave operation sub-modules 3-122 are configured to use the multipliers 3-120 and adders 3-120' to execute intermediate operations in parallel on the data and operation instructions transmitted from the master operation sub-module 3-121, obtaining a plurality of intermediate results, and to transmit the intermediate results to the master operation sub-module 3-121.
The master operation sub-module 3-121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result, and store the operation result at the target address.
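The master/slave flow can be sketched with a thread pool standing in for the slave operation sub-modules. This sketch assumes the same roles as a plain W·x + b reading (first data as input vector, second data as bias, row-major weights), that the master's preprocessing consists of splitting the weight rows into contiguous chunks, and that its subsequent processing is concatenating the partial results and adding the bias. All of these are assumptions, and Python threads only model the data flow, not true hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def slave_compute(rows: List[Sequence[float]], x: Sequence[float]) -> List[float]:
    """One slave sub-module: multiply-accumulate each assigned weight row with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def master_fc(x: Sequence[float], b: Sequence[float], weight: Sequence[float],
              width: int, height: int, num_slaves: int,
              memory: Dict[int, float], dst: int) -> List[float]:
    # Preprocessing (assumed): split the weight matrix into contiguous row chunks.
    rows = [weight[r * width:(r + 1) * width] for r in range(height)]
    chunk = -(-height // num_slaves)  # ceiling division
    tasks = [rows[i * chunk:(i + 1) * chunk] for i in range(num_slaves)]
    # Slaves compute intermediate results concurrently; map preserves task order.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_compute, tasks, [x] * num_slaves))
    # Subsequent processing (assumed): concatenate partial results and add the bias.
    dots = [v for part in partials for v in part]
    result = [d + bi for d, bi in zip(dots, b)]
    for i, v in enumerate(result):
        memory[dst + i] = v  # store the operation result at the target address
    return result
```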
In a possible implementation, the operation field may further include a weight width and a weight height, and the control module 3-11 is further configured to obtain the weight data from the weight data address according to the weight height and the weight width.
In this implementation, the weight width and the weight height limit the amount of weight data obtained. The operation field may contain them as specific values, or as the storage addresses that hold them. When the operation field contains the values directly, those values are used as the weight width and weight height. When it contains storage addresses, the weight width and weight height are read from those addresses respectively.
In a possible implementation, when the operation field does not include the weight height and/or the weight width, the weight data may be obtained according to a preset default weight height and default weight width.
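The three ways of supplying the weight width and height (an immediate value, a storage address, or nothing at all) can be sketched as follows. The operand encoding as `("imm", value)` / `("addr", address)` tuples and the 2x2 defaults are hypothetical choices for illustration; the disclosure does not state the default values.

```python
from typing import Dict, List, Optional, Tuple

Operand = Optional[Tuple[str, int]]  # hypothetical: ("imm", value), ("addr", address), or None

DEFAULT_WIDTH = 2   # hypothetical preset default weight width
DEFAULT_HEIGHT = 2  # hypothetical preset default weight height


def resolve(op: Operand, memory: Dict[int, int], default: int) -> int:
    """Use the immediate value, load the value from its storage address,
    or fall back to the preset default when the operand is absent."""
    if op is None:
        return default
    kind, v = op
    if kind == "imm":
        return v
    if kind == "addr":
        return memory[v]
    raise ValueError(f"unknown operand kind {kind!r}")


def load_weight(memory: Dict[int, int], weight_addr: int,
                width_op: Operand, height_op: Operand) -> Tuple[int, int, List[int]]:
    """Read exactly width * height weight values starting at weight_addr."""
    w = resolve(width_op, memory, DEFAULT_WIDTH)
    h = resolve(height_op, memory, DEFAULT_HEIGHT)
    return w, h, [memory[weight_addr + i] for i in range(w * h)]
```

Reading exactly width × height values is what bounds the amount of weight data fetched.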
In this way, the amount of weight data can be limited, which helps ensure the accuracy of the operation result.
In a possible implementation, as shown in FIGS. 7-2a and 7-2b, the device may further include a storage module 3-13, configured to store the first data, the second data, and the weight data.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache, and may also include at least one NRAM (Neuron Random Access Memory). The cache stores the first data, the second data, and the weight data, and the register stores scalar data among the first data, the second data, and the weight data.
In a possible implementation, the neuron cache is configured to store neuron data among the first data, the second data, and the weight data, the neuron data including neuron vector data.
In a possible implementation, the control module 3-11 may further be configured to generate an assembly file according to the fully connected instruction and translate the assembly file into a binary file, the binary file being the compiled fully connected instruction.
In a possible implementation, the instruction format of the fully connected instruction may be:
mlp dst A B Weight Weight.width Weight.height
where mlp is the opcode of the fully connected instruction, and dst, A, B, Weight, Weight.width, and Weight.height form its operation field: dst is the target address, A is the first data address, B is the second data address, Weight is the weight data address, Weight.width is the weight width, and Weight.height is the weight height.
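Parsing this textual format and translating it into a binary word can be sketched as below. The opcode value 0x2A and the layout (one opcode byte followed by six little-endian 32-bit fields, 25 bytes in total) are hypothetical; the disclosure does not give the binary encoding.

```python
import struct
from typing import Dict

MLP_OPCODE = 0x2A  # hypothetical opcode value; not specified in the disclosure
FIELDS = ("dst", "A", "B", "Weight", "Weight.width", "Weight.height")
LAYOUT = "<B6I"    # hypothetical layout: 1-byte opcode + six uint32 fields, no padding


def parse_mlp(text: str) -> Dict[str, int]:
    """Split 'mlp dst A B Weight Weight.width Weight.height' into named fields."""
    tokens = text.split()
    if len(tokens) != 1 + len(FIELDS) or tokens[0] != "mlp":
        raise ValueError(f"not a fully connected instruction: {text!r}")
    return dict(zip(FIELDS, (int(t) for t in tokens[1:])))


def assemble_mlp(text: str) -> bytes:
    """Translate one assembly line into its (assumed) binary encoding."""
    fields = parse_mlp(text)
    return struct.pack(LAYOUT, MLP_OPCODE, *(fields[f] for f in FIELDS))


def disassemble_mlp(word: bytes) -> Dict[str, int]:
    """Recover the operation field from a binary word, checking the opcode."""
    opcode, *values = struct.unpack(LAYOUT, word)
    if opcode != MLP_OPCODE:
        raise ValueError(f"unexpected opcode {opcode:#x}")
    return dict(zip(FIELDS, values))
```

Applied to the instruction used in the application example, `mlp 500 100 200 300 5 9`, the parser yields dst = 500, A = 100, B = 200, Weight = 300, Weight.width = 5, and Weight.height = 9, and disassembling the assembled word returns the same fields.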
It should be understood that those skilled in the art can set the opcode of the fully connected instruction, and the positions of the opcode and the operation field within the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the fully connected instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following gives an application example according to an embodiment of the present disclosure, using "performing a fully connected operation with the fully connected instruction processing device" as an exemplary application scenario, to help explain the flow of the device. Those skilled in the art should understand that the following application example is only intended to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 7-3 shows a schematic diagram of an application scenario of the fully connected instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 7-3, the device processes a fully connected instruction as follows:
The control module 3-11 compiles the obtained fully connected instruction 1 to obtain the compiled fully connected instruction 1 (for example, fully connected instruction 1 is mlp 500 100 200 300 5 9). It parses the compiled fully connected instruction 1 to obtain its opcode and operation field: the opcode is mlp, the target address is 500, the first data address is 100, the second data address is 200, the weight data address is 300, the weight width is 5, and the weight height is 9. The control module 3-11 obtains the first data from first data address 100, the second data from second data address 200, and weight data with a weight width of 5 and a weight height of 9 from weight data address 300.
The operation module 3-12 performs a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and stores the operation result at target address 500. For the operation of each module above, refer to the related description above.
In this way, the fully connected instruction processing device can process fully connected instructions efficiently and quickly, and performs fully connected operations with high efficiency and speed.
FIG. 7-4 shows a flowchart of a fully connected instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment or the like that includes a memory and a processor, where the memory stores data used while executing the method, and the processor performs the related processing and operation steps, such as steps S51-3 and S52-3 below. As shown in FIG. 7-4, the method is applied to the fully connected instruction processing device described above and includes steps S51-3 and S52-3.
In step S51-3, the control module compiles the obtained fully connected instruction to obtain a compiled fully connected instruction, parses the compiled fully connected instruction to obtain its opcode and operation field, and obtains, according to the opcode and the operation field, the first data, second data, weight data, and target address required to execute the fully connected instruction. The opcode indicates that the operation performed by the fully connected instruction on the data is a fully connected operation, and the operation field includes a first data address, a second data address, a weight data address, and the target address.
In step S52-3, the operation module performs a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and stores the operation result at the target address.
在一种可能的实现方式中,根据权值数据对第一数据和第二数据进行全连接运算,可以包括:利用运算模块中的多个乘法器执行全连接运算中的乘法运算,以及利用多个运算模块中的加法器执行全连接运算中的加法运算。In a possible implementation manner, performing a fully connected operation on the first data and the second data according to the weight data may include: using multiple multipliers in the operation module to perform the multiplication operation in the fully connected operation, and using multiple The adders in each arithmetic module perform the addition operation in the fully connected operation.
In one possible implementation, the operation module includes a master operation submodule and multiple slave operation submodules, and each slave operation submodule includes multiple multipliers and multiple adders.
In this case, the method may further include:
using the control module to parse the compiled fully connected instruction into multiple operation instructions.
Performing the fully connected operation on the first data and the second data according to the weight data to obtain the operation result, and storing the operation result at the target address, then includes:
using the master operation submodule to preprocess the first data and the second data and to transmit data and operation instructions;
using the multiple multipliers and multiple adders in the slave operation submodules to perform intermediate operations in parallel, according to the transmitted data and operation instructions, to obtain multiple intermediate results;
using the master operation submodule to post-process the multiple intermediate results to obtain the operation result, and storing the operation result at the target address.
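The master/slave flow above can be sketched with threads standing in for slave submodules. The work split is an assumption, since the text does not specify one. Here the master partitions the weight rows into contiguous chunks (preprocessing), each slave computes the dot products for its chunk (intermediate results), and the master concatenates the chunks and adds an assumed bias (post-processing). `slave_compute` and `master_fc` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_compute(w_rows, x):
    """One slave submodule: dot products for its share of the weight rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_rows]


def master_fc(W, x, b, num_slaves=2):
    # Preprocessing (assumed): split weight rows into contiguous chunks, one per slave.
    chunk = -(-len(W) // num_slaves)          # ceiling division
    parts = [W[i:i + chunk] for i in range(0, len(W), chunk)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # pool.map returns results in submission order, so the chunks stay aligned.
        intermediates = list(pool.map(slave_compute, parts, [x] * len(parts)))
    # Post-processing (assumed): concatenate partial outputs and add the bias.
    y = [v for part in intermediates for v in part]
    return [yi + bi for yi, bi in zip(y, b)]


y = master_fc([[1, 0], [0, 1], [1, 1]], [2, 3], [0, 0, 1], num_slaves=2)
```

Splitting by output rows keeps each slave's result independent, so the master only has to concatenate the results and never needs to add partial sums across slaves.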
In one possible implementation, the operand field may further include a weight width and a weight height. Obtaining the first data, the second data, the weight data, and the target address according to the opcode and the operand field may then include reading the weight data from the weight data address according to the weight height and the weight width.
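One way the weight height and width can bound the read is sketched below. It assumes that memory is flat and that the weight block is stored row-major starting at the weight data address. The text does not fix a layout, and `load_weights` is a hypothetical helper.

```python
def load_weights(mem, weight_addr, height, width):
    """Read a height x width weight matrix stored row-major starting at weight_addr."""
    end = weight_addr + height * width
    if end > len(mem):
        raise IndexError("weight block runs past the end of memory")
    flat = mem[weight_addr:end]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


mem = list(range(20))
W = load_weights(mem, weight_addr=4, height=2, width=3)
```

The height and width fix both how many elements are read (height x width) and how they are folded back into rows.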
In one possible implementation, the method may further include using a storage module of the device to store the first data, the second data, and the weight data.
The storage module includes at least one of a register and a cache.
The cache stores the first data, the second data, and the weight data, and includes at least one neuron cache (NRAM).
The register stores the scalar data among the first data, the second data, and the weight data.
The neuron cache stores the neuron data among the first data, the second data, and the weight data; the neuron data includes neuron vector data.
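The split between register and neuron cache can be illustrated with a toy model. In this sketch, plain Python numbers count as scalar data and go to a register file, and sequences count as neuron vector data and go to an NRAM dictionary. The type test, the class `StorageModule`, and its methods are all hypothetical; the text only states which kind of data each store holds.

```python
class StorageModule:
    """Toy model: scalars go to a register file, vectors to a neuron cache (NRAM)."""

    def __init__(self):
        self.registers = {}   # scalar data
        self.nram = {}        # neuron (vector) data

    def store(self, name, value):
        if isinstance(value, (int, float)):
            self.registers[name] = value
        else:
            self.nram[name] = list(value)

    def load(self, name):
        if name in self.registers:
            return self.registers[name]
        return self.nram[name]


sm = StorageModule()
sm.store("first", [1.0, 2.0])
sm.store("second", 0.5)
sm.store("weights", [1.0, 0.0, 0.0, 1.0])
```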
In one possible implementation, parsing the acquired fully connected instruction to obtain its opcode and operand field may include:
storing the compiled fully connected instruction;
parsing the compiled fully connected instruction to obtain the opcode and operand field of the fully connected instruction;
storing an instruction queue, which contains multiple to-be-executed instructions arranged in execution order; these may include the compiled fully connected instruction.
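The three steps above (store, parse, queue) can be modeled in a few lines. Each compiled instruction is represented as a tuple whose first element is the opcode and whose remaining elements form the operand field. That representation, the mnemonics, and the class `ControlModule` are hypothetical.

```python
from collections import deque


class ControlModule:
    def __init__(self):
        self.instruction_store = []   # compiled instructions, kept for later fetches
        self.queue = deque()          # to-be-executed instructions in execution order

    def submit(self, compiled):
        self.instruction_store.append(compiled)
        self.queue.append(compiled)

    @staticmethod
    def parse(compiled):
        """Split a compiled instruction into (opcode, operand field)."""
        opcode, *operands = compiled
        return opcode, tuple(operands)

    def next_instruction(self):
        """Pop the oldest queued instruction and return it parsed."""
        return self.parse(self.queue.popleft())


cm = ControlModule()
cm.submit(("FC", 0, 2, 4, 8))
cm.submit(("ADD", 8, 10, 12))
first = cm.next_instruction()
```

A FIFO queue preserves the execution order that the text requires. Keeping a separate instruction store lets an instruction be held back and fetched again later, which is what the dependency handling below relies on.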
In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction that precedes it, caching the first to-be-executed instruction and executing it only after the zeroth to-be-executed instruction has finished executing.
The first to-be-executed instruction is associated with the zeroth to-be-executed instruction that precedes it when:
the first storage address interval, which stores the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
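The association test reduces to an interval-overlap check. The sketch assumes half-open address intervals [start, end). Under that convention, two intervals overlap exactly when each one starts before the other ends. `AddrRange` and `must_wait` are hypothetical names.

```python
from typing import NamedTuple


class AddrRange(NamedTuple):
    start: int  # inclusive
    end: int    # exclusive


def overlaps(a: AddrRange, b: AddrRange) -> bool:
    return a.start < b.end and b.start < a.end


def must_wait(first_ranges, zeroth_ranges) -> bool:
    """True if any data interval of the first instruction overlaps one of the zeroth's."""
    return any(overlaps(f, z) for f in first_ranges for z in zeroth_ranges)


zeroth = [AddrRange(0, 64), AddrRange(128, 192)]   # e.g. zeroth writes its result at 128..191
first_dep = [AddrRange(160, 224)]                  # reads part of that result
first_indep = [AddrRange(64, 128)]                 # touches only addresses 64..127
```

As stated in the text, the rule treats any overlap as an association. That is conservative: it also serializes two instructions that only read the same addresses, which would be safe to run in either order.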
In one possible implementation, compiling the acquired fully connected instruction to obtain the compiled fully connected instruction may include:
generating an assembly file from the fully connected instruction and translating the assembly file into a binary file, the binary file being the compiled fully connected instruction.
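The assembly-to-binary step can be illustrated with a tiny assembler. The encoding below is entirely hypothetical and is not the device's real instruction format. Each line such as `FC 0, 16, 32, 64` becomes a 1-byte opcode followed by four little-endian 32-bit addresses, packed with Python's `struct` module.

```python
import struct

OPCODES = {"FC": 0x01, "CONV": 0x02}  # hypothetical opcode numbering


def assemble_line(line: str) -> bytes:
    """Translate one assembly line, e.g. 'FC 0, 16, 32, 64', into a binary word.
    Hypothetical layout: 1-byte opcode + four little-endian uint32 addresses."""
    mnemonic, _, rest = line.strip().partition(" ")
    addrs = [int(tok, 0) for tok in rest.split(",")]   # base 0 also accepts 0x.. literals
    if len(addrs) != 4:
        raise ValueError("expected four address operands")
    return struct.pack("<B4I", OPCODES[mnemonic], *addrs)


def assemble(source: str) -> bytes:
    """Assemble every non-empty line and concatenate the binary words."""
    return b"".join(assemble_line(ln) for ln in source.splitlines() if ln.strip())


binary = assemble("FC 0, 16, 32, 64\n")
```

The `<` prefix disables struct padding, so each instruction occupies exactly 17 bytes under this made-up layout.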
It should be noted that although the fully connected instruction processing method has been described above using these embodiments as examples, those skilled in the art will understand that the present disclosure is not limited to them. In practice, the user can set each step flexibly according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
The fully connected instruction processing method provided by the embodiments of the present disclosure is widely applicable, and it processes fully connected instructions and performs fully connected operations with high efficiency and speed.
The foregoing can be better understood in light of the following clauses:
Clause G1. A fully connected instruction processing device, the device comprising:
a control module configured to compile an acquired fully connected instruction to obtain a compiled fully connected instruction, parse the compiled fully connected instruction to obtain an opcode and an operand field of the fully connected instruction, and obtain, according to the opcode and the operand field, first data, second data, weight data, and a target address required to execute the fully connected instruction; and
an operation module configured to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the operation performed on the data by the fully connected instruction is a fully connected operation, and the operand field includes a first data address, a second data address, a weight data address, and the target address.
Clause G2. The device according to Clause G1, wherein the operation module comprises:
multiple multipliers configured to perform the multiplications in the fully connected operation; and
multiple adders configured to perform the additions in the fully connected operation.
Clause G3. The device according to Clause G2, wherein the operation module comprises a master operation submodule and multiple slave operation submodules, each slave operation submodule comprising the multiple multipliers and the multiple adders;
the control module is further configured to parse the compiled fully connected instruction to obtain multiple operation instructions, and to send the first data, the second data, and the multiple operation instructions to the master operation submodule;
the master operation submodule is configured to preprocess the first data and the second data, and to exchange data and operation instructions with the multiple slave operation submodules;
the slave operation submodules are configured to perform intermediate operations in parallel, using the multiple multipliers and the multiple adders, according to the data and operation instructions transmitted from the master operation submodule, to obtain multiple intermediate results, and to transmit the multiple intermediate results to the master operation submodule; and
the master operation submodule is further configured to post-process the multiple intermediate results to obtain the operation result, and to store the operation result at the target address.
Clause G4. The device according to Clause G1, wherein the operand field further includes a weight width and a weight height,
and the control module is further configured to obtain the weight data from the weight data address according to the weight height and the weight width.
Clause G5. The device according to Clause G1, further comprising:
a storage module configured to store the first data, the second data, and the weight data,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the first data, the second data, and the weight data, and includes at least one neuron cache (NRAM);
the register is configured to store the scalar data among the first data, the second data, and the weight data; and
the neuron cache is configured to store the neuron data among the first data, the second data, and the weight data, the neuron data including neuron vector data.
Clause G6. The device according to Clause G1, wherein the control module comprises:
an instruction storage submodule configured to store the compiled fully connected instruction;
an instruction processing submodule configured to parse the compiled fully connected instruction to obtain the opcode and operand field of the fully connected instruction; and
a queue storage submodule configured to store an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled fully connected instruction.
Clause G7. The device according to Clause G6, wherein the control module further comprises:
a dependency processing submodule configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, cache the first to-be-executed instruction in the instruction storage submodule and, after the zeroth to-be-executed instruction has finished executing, fetch the first to-be-executed instruction from the instruction storage submodule and send it to the operation module,
wherein the first to-be-executed instruction is associated with the zeroth to-be-executed instruction preceding it when:
a first storage address interval storing the data required by the first to-be-executed instruction overlaps a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause G8. The device according to Clause G1, wherein
the control module is further configured to generate an assembly file according to the fully connected instruction and translate the assembly file into a binary file,
the binary file being the compiled fully connected instruction.
Clause G9. A machine learning operation device, comprising:
one or more fully connected instruction processing devices according to any one of Clauses G1 to G8, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
when the machine learning operation device includes multiple fully connected instruction processing devices, the multiple fully connected instruction processing devices may be connected and transmit data through a specific structure;
wherein the multiple fully connected instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the multiple fully connected instruction processing devices share the same control system or have their own control systems; the multiple fully connected instruction processing devices share memory or have their own memories; and the multiple fully connected instruction processing devices are interconnected in an arbitrary topology.
Clause G10. A combined processing device, comprising:
the machine learning operation device according to Clause G9, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
wherein the combined processing device further comprises a storage device connected to both the machine learning operation device and the other processing devices and configured to store data of the machine learning operation device and the other processing devices.
Clause G11. A machine learning chip, comprising:
the machine learning operation device according to Clause G9 or the combined processing device according to Clause G10.
Clause G12. An electronic device, comprising:
the machine learning chip according to Clause G11.
Clause G13. A board card, comprising: a memory device, an interface device, a control device, and the machine learning chip according to Clause G11;
wherein the machine learning chip is connected to each of the memory device, the control device, and the interface device;
the memory device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause G14. A fully connected instruction processing method, applied to a fully connected instruction processing device comprising a control module and an operation module, the method comprising:
using the control module to compile an acquired fully connected instruction to obtain a compiled fully connected instruction, parse the compiled fully connected instruction to obtain an opcode and an operand field of the fully connected instruction, and obtain, according to the opcode and the operand field, first data, second data, weight data, and a target address required to execute the fully connected instruction; and
using the operation module to perform a fully connected operation on the first data and the second data according to the weight data to obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the operation performed on the data by the fully connected instruction is a fully connected operation, and the operand field includes a first data address, a second data address, a weight data address, and the target address.
Clause G15. The method according to Clause G14, wherein performing the fully connected operation on the first data and the second data according to the weight data comprises:
using multiple multipliers in the operation module to perform the multiplications in the fully connected operation, and using multiple adders in the operation module to perform the additions in the fully connected operation.
Clause G16. The method according to Clause G15, wherein the operation module comprises a master operation submodule and multiple slave operation submodules, each slave operation submodule comprising multiple multipliers and multiple adders,
the method further comprising:
using the control module to parse the compiled fully connected instruction to obtain multiple operation instructions;
wherein performing the fully connected operation on the first data and the second data according to the weight data to obtain the operation result, and storing the operation result at the target address, comprises:
using the master operation submodule to preprocess the first data and the second data and to transmit data and operation instructions;
using the multiple multipliers and multiple adders in the slave operation submodules to perform intermediate operations in parallel, according to the transmitted data and operation instructions, to obtain multiple intermediate results; and
using the master operation submodule to post-process the multiple intermediate results to obtain the operation result, and storing the operation result at the target address.
Clause G17. The method according to Clause G14, wherein the operand field further includes a weight width and a weight height,
and obtaining the first data, the second data, the weight data, and the target address required to execute the fully connected instruction according to the opcode and the operand field comprises:
obtaining the weight data from the weight data address according to the weight height and the weight width.
Clause G18. The method according to Clause G14, further comprising:
using a storage module of the device to store the first data, the second data, and the weight data,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the first data, the second data, and the weight data, and includes at least one neuron cache (NRAM);
the register is configured to store the scalar data among the first data, the second data, and the weight data; and
the neuron cache is configured to store the neuron data among the first data, the second data, and the weight data, the neuron data including neuron vector data.
Clause G19. The method according to Clause G14, wherein parsing the acquired fully connected instruction to obtain the opcode and operand field of the fully connected instruction comprises:
storing the compiled fully connected instruction;
parsing the compiled fully connected instruction to obtain the opcode and operand field of the fully connected instruction; and
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled fully connected instruction.
Clause G20. The method according to Clause G19, further comprising:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and, after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,
wherein the first to-be-executed instruction is associated with the zeroth to-be-executed instruction preceding it when:
a first storage address interval storing the data required by the first to-be-executed instruction overlaps a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause G21. The method according to Clause G14, wherein compiling the acquired fully connected instruction to obtain the compiled fully connected instruction comprises:
generating an assembly file according to the fully connected instruction, and translating the assembly file into a binary file,
the binary file being the compiled fully connected instruction.
Clause G22. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method according to any one of Clauses G14 to G21.
As neural network algorithms have come into wide use and the computing power of computer hardware has kept improving, the variety and volume of data operations in practical applications keep growing. Convolution is a mathematical operator that produces a third function from two functions f and g; it characterizes the area of overlap between f and g after one of them is flipped and shifted. Programming languages are highly varied, and the related art currently offers no convolution instruction that applies broadly across them. Developers must therefore define several instructions specific to their own language environment to implement a convolution, which makes convolution slow and inefficient. The present disclosure provides a convolution instruction processing method and device, computer equipment, and a storage medium that implement a convolution with a single instruction, significantly improving the efficiency and speed of convolution.
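The flip-and-shift description corresponds to the discrete definition (f * g)[n] = sum over k of f[k] * g[n - k]. For each output index n, g is read backwards (flipped) and offset by n (shifted), and the products over the overlapping region are summed. The sketch below is a plain reference implementation of that definition for finite sequences. It illustrates the mathematics only and says nothing about how the disclosed hardware computes it.

```python
def convolve_1d(f, g):
    """Full discrete convolution: (f * g)[n] = sum_k f[k] * g[n - k]."""
    n_out = len(f) + len(g) - 1
    out = [0] * n_out
    for n in range(n_out):
        for k in range(len(f)):
            if 0 <= n - k < len(g):       # g indexed by n - k: flipped, then shifted by n
                out[n] += f[k] * g[n - k]
    return out


result = convolve_1d([1, 2, 3], [0, 1, 0.5])
```

The output has len(f) + len(g) - 1 entries because it covers every shift at which the two sequences overlap at least partially.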
FIG. 8-1 shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-1, the device includes a control module 4-11 and an operation module 4-12.
The control module 4-11 is configured to compile an acquired convolution instruction to obtain a compiled convolution instruction, parse the compiled convolution instruction to obtain its opcode and operand field, and obtain, according to the opcode and operand field, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction. The opcode indicates that the operation the convolution instruction performs on the data is a convolution. The operand field includes a to-be-operated data address, a convolution kernel address, and the target address.
The operation module 4-12 is configured to perform a convolution (convolutional computation) on the data to be operated on according to the convolution kernel to obtain an operation result, and to store the operation result at the target address.
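The single-instruction flow above (fetch operands by address, convolve, store at the target address) can be sketched as follows. This sketch assumes a "valid" 2-D convolution with stride 1 and no padding, and models memory as a dictionary keyed by integer addresses. Neither detail comes from the text, and `conv2d_valid`/`execute_conv` are hypothetical names. The `flip_kernel` switch is included because many neural-network libraries compute cross-correlation (no kernel flip) under the name "convolution".

```python
def conv2d_valid(data, kernel, flip_kernel=True):
    """'Valid' 2-D convolution, stride 1, no padding.
    flip_kernel=True gives mathematical convolution; False gives the
    cross-correlation that many neural-network libraries call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    if flip_kernel:
        kernel = [row[::-1] for row in kernel[::-1]]   # rotate the kernel by 180 degrees
    oh, ow = len(data) - kh + 1, len(data[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(data[i + u][j + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out


def execute_conv(memory, data_addr, kernel_addr, target_addr):
    """Hypothetical single-instruction flow: fetch operands by address, compute, store."""
    memory[target_addr] = conv2d_valid(memory[data_addr], memory[kernel_addr])


memory = {
    0x100: [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # data to be operated on
    0x200: [[1, 0], [0, -1]],                   # convolution kernel
}
execute_conv(memory, 0x100, 0x200, 0x300)
```

For this input, flipping the kernel changes the sign of every output (4 versus -4). The convention therefore has to be fixed by the instruction's definition.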
In this embodiment, the convolution instruction acquired by the control module is an uncompiled software instruction that hardware cannot execute directly, so the control module must first compile it. The compiled convolution instruction can be parsed only after compilation. The compiled convolution instruction is a hardware instruction that the hardware can execute directly. The control module obtains the data to be operated on and the convolution kernel from the to-be-operated data address and the convolution kernel address, respectively.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented as a code), that specifies the operation to be performed in a computer program. It serves as an instruction serial number that tells the device executing the instruction which instruction to execute. The operand field may be the source of all data required to execute the corresponding instruction. This includes the data to be operated on, parameters such as the convolution kernel, and the corresponding operation method. A convolution instruction must include an opcode and an operand field, and the operand field includes at least the to-be-operated data address, the convolution kernel address, and the target address.
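The division of labor described above, where the opcode selects which operation runs and the operand field supplies its inputs, maps naturally onto a dispatch table. In this sketch the numeric opcode 0x02, the tuple instruction format, the dictionary memory, and the use of a 1-D convolution as the handler are all hypothetical simplifications.

```python
from typing import Callable, Dict, Tuple

Memory = Dict[int, list]


def conv1d_valid(data, kernel):
    """'Valid' 1-D convolution (kernel flipped), stride 1."""
    k = kernel[::-1]
    return [sum(data[i + j] * k[j] for j in range(len(k)))
            for i in range(len(data) - len(k) + 1)]


def handle_conv(mem: Memory, operands: Tuple[int, ...]) -> None:
    # Minimum operand field named in the text: data address, kernel address, target address.
    data_addr, kernel_addr, target_addr = operands
    mem[target_addr] = conv1d_valid(mem[data_addr], mem[kernel_addr])


# Hypothetical opcode numbering: the opcode only selects which handler runs.
DISPATCH: Dict[int, Callable[[Memory, Tuple[int, ...]], None]] = {0x02: handle_conv}


def execute(mem: Memory, instruction: tuple) -> None:
    opcode, *operands = instruction
    handler = DISPATCH.get(opcode)
    if handler is None:
        raise ValueError(f"unknown opcode {opcode:#x}")
    handler(mem, tuple(operands))


mem = {0x10: [1, 2, 3, 4], 0x20: [1, -1]}
execute(mem, (0x02, 0x10, 0x20, 0x30))
```

Adding another instruction type would mean registering one more handler under a new opcode, without changing `execute`.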
It should be understood that those skilled in the art can set the instruction format of the convolution instruction, and the opcode and operand field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that control module can receive the convolution instruction and control one or more operation modules to perform the convolution. When the device includes multiple control modules, each control module can receive convolution instructions and control its corresponding one or more operation modules to perform convolutions.
The convolution instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module compiles an acquired convolution instruction to obtain a compiled convolution instruction, parses the compiled convolution instruction to obtain its opcode and operand field, and obtains, according to the opcode and operand field, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction. The operation module performs a convolution on the data to be operated on according to the convolution kernel to obtain an operation result, and stores the operation result at the target address. The device is widely applicable, and it processes convolution instructions and performs convolutions with high efficiency and speed.
FIG. 8-2a shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 8-2a, the operation module 4-12 may include multiple multipliers 4-120 and multiple adders 4-120'. The multipliers 4-120 perform the multiplications in the convolution, and the adders 4-120' perform the additions.
In this implementation, the operation module may alternatively include one adder and one multiplier, one adder and multiple multipliers, or multiple adders and one multiplier. The numbers of multipliers and adders can be set according to requirements such as the amount of data in the convolution and the desired processing speed and efficiency; the present disclosure does not limit this.
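The trade-off between the number of multipliers and processing speed can be made concrete with a simple throughput model. In this model, each "cycle" the available multipliers form one batch of products in parallel, and the adders fold that batch into a running sum. The model ignores pipelining, adder latency, and memory bandwidth, and `mac_with_units` is a hypothetical name. The inputs are assumed to be an already flattened input window and an already flipped, flattened kernel.

```python
def mac_with_units(window, kernel, num_multipliers):
    """Multiply-accumulate one convolution output using `num_multipliers`
    multipliers per cycle. `window` and `kernel` are assumed to be flattened,
    with the kernel already flipped. Returns (value, cycles)."""
    if num_multipliers < 1:
        raise ValueError("need at least one multiplier")
    acc = 0
    cycles = 0
    for start in range(0, len(window), num_multipliers):
        # One cycle: every multiplier forms one product in parallel.
        products = [w * k for w, k in zip(window[start:start + num_multipliers],
                                          kernel[start:start + num_multipliers])]
        # The adders fold this cycle's products into the running sum.
        acc += sum(products)
        cycles += 1
    return acc, cycles


window = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # a flattened 3 x 3 input window
kernel = [1] * 9                       # a flattened (already flipped) 3 x 3 kernel
results = {m: mac_with_units(window, kernel, m) for m in (1, 4, 9)}
```

Under this model, a 3 x 3 window takes 9, 3, or 1 cycles with 1, 4, or 9 multipliers. This is the kind of speed-versus-hardware trade-off the text leaves to the implementer.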
FIG. 8-2b shows a block diagram of a convolution instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 8-2b, the operation module 4-12 may include a master operation submodule 4-121 and multiple slave operation submodules 4-122. Each slave operation submodule 4-122 includes multiple multipliers 4-120 and multiple adders 4-120' (not shown in the figure).
The control module 4-11 is further configured to parse the compiled convolution instruction to obtain multiple operation instructions, and to send the data to be operated on, the convolution kernel, and the operation instructions to the master operation submodule 4-121.
The master operation submodule 4-121 is configured to perform preprocessing on the data to be operated on and the convolution kernel, and to exchange data and operation instructions with the slave operation submodules 4-122.
The slave operation submodules 4-122 are configured to use the multipliers 4-120 and adders 4-120' to perform intermediate operations in parallel on the data and operation instructions transmitted from the master operation submodule 4-121. They thereby obtain multiple intermediate results and transmit them to the master operation submodule 4-121.
The master operation submodule 4-121 is further configured to perform subsequent processing on the intermediate results to obtain the operation result, and to store the operation result at the target address.
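The master/slave division of labour can be modelled in software with a small sketch. The assumptions are as follows. The master's "preprocessing" is taken to mean partitioning the output rows into contiguous chunks, and each slave computes its chunk as an intermediate result. The master's "subsequent processing" is concatenation of those results. The function names are hypothetical. Python threads stand in for the hardware slaves only to show the structure; they do not give true parallel speed-up for CPU-bound Python code.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_rows(data, kernel, row_start, row_stop):
    """Slave work: valid, stride-1 2-D correlation for output rows [row_start, row_stop)."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_w = len(data[0]) - k_w + 1
    return [
        [sum(data[oy + ky][ox + kx] * kernel[ky][kx]
             for ky in range(k_h) for kx in range(k_w))
         for ox in range(out_w)]
        for oy in range(row_start, row_stop)
    ]


def master_conv(data, kernel, num_slaves=4):
    """Master: split the work, dispatch it to slaves in parallel, then merge."""
    out_h = len(data) - len(kernel) + 1
    if out_h <= 0:
        return []
    # Preprocessing: partition the output rows into one contiguous chunk per slave.
    chunk = -(-out_h // num_slaves)  # ceiling division
    ranges = [(start, min(start + chunk, out_h)) for start in range(0, out_h, chunk)]
    result = []
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # pool.map yields the intermediate results in submission order.
        for part in pool.map(lambda r: slave_rows(data, kernel, *r), ranges):
            result.extend(part)  # post-processing: concatenate the partial outputs
    return result
```

Because each slave writes a disjoint block of output rows, the merge step is a simple ordered concatenation, and the result does not depend on the number of slaves.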
In one possible implementation, the operation domain may further include an input height and an input width.
The control module is further configured to obtain, from the address of the data to be operated on, the data to be operated on that corresponds to the input width and input height.
In this implementation, the input height and input width define the amount and size of the data to be operated on that is fetched. The operation domain may give the input height and input width either as specific values or as storage addresses holding those values. When the operation domain contains the values directly, they are used as the input height and input width. When it contains storage addresses, the input height and input width are read from those addresses.
In one possible implementation, when the operation domain does not include the input height and/or the input width, the data to be operated on may be obtained using a preset default input height and/or default input width.
In this way, the amount and size of the data to be operated on are bounded. This helps ensure that the operation result is accurate and that the device can execute the convolution instruction.
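An operation-domain field can be given as an immediate value, as a storage address, or omitted so that a default applies. A minimal sketch of resolving such a field follows. The tagged-tuple representation (`("imm", value)` / `("addr", address)`), the dictionary standing in for memory, and the name `resolve_field` are all hypothetical, because the disclosure does not specify how the two cases are distinguished in the instruction.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical field encoding: ("imm", value), ("addr", address), or None if absent.
Field = Optional[Tuple[str, int]]


def resolve_field(field: Field, memory: Mapping[int, int], default: int) -> int:
    """Return the effective value of an operation-domain field such as the input height."""
    if field is None:
        return default            # field omitted: fall back to the preset default
    kind, payload = field
    if kind == "imm":
        return payload            # the value is carried in the instruction itself
    if kind == "addr":
        return memory[payload]    # the value is read from the given storage address
    raise ValueError(f"unknown field kind: {kind!r}")
```

For example, with `memory = {10: 32}`, `resolve_field(("addr", 10), memory, 1)` yields the input width 32 stored at address 10.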
In one possible implementation, the operation domain may further include a convolution kernel height and a convolution kernel width.
The control module 4-11 is further configured to fetch the convolution kernel from the convolution kernel address according to the convolution kernel height and the convolution kernel width.
In one possible implementation, the operation domain may further include a first stride. The operation module 4-12 is further configured to move the convolution kernel in the X direction of the data to be operated on by the first stride.
In one possible implementation, the operation domain may further include a second stride. The operation module 4-12 is further configured to move the convolution kernel in the Y direction of the data to be operated on by the second stride.
In one possible implementation, the operation domain may omit one or more of the convolution kernel height, the convolution kernel width, the first stride, and the second stride. In that case, the corresponding preset defaults (a default kernel height, a default kernel width, a default first stride, and a default second stride) may be used, so that the control module and the operation module can still execute the convolution instruction.
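How the two strides and the fallback defaults determine where the kernel is placed can be shown with a short sketch. It makes three assumptions that the text does not fix: the convolution is unpadded, X is the width (column) axis, and Y is the height (row) axis. The default values in `DEFAULTS` and the name `window_origins` are hypothetical; the disclosure says only that preset defaults exist.

```python
# Hypothetical default values; the disclosure only says that preset defaults exist.
DEFAULTS = {"kernelHeight": 1, "kernelWidth": 1, "strideX": 1, "strideY": 1}


def window_origins(in_h, in_w, fields):
    """Top-left (row, col) of every kernel position for an unpadded convolution.

    `fields` holds whichever kernel/stride fields the operation domain supplied;
    any missing field falls back to DEFAULTS.
    """
    p = {**DEFAULTS, **fields}
    k_h, k_w = p["kernelHeight"], p["kernelWidth"]
    s_x, s_y = p["strideX"], p["strideY"]
    rows = range(0, in_h - k_h + 1, s_y)  # move by the second stride along Y
    cols = range(0, in_w - k_w + 1, s_x)  # move by the first stride along X
    return [(r, c) for r in rows for c in cols]
```

With a 4×5 input, a 2×2 kernel and both strides equal to 2, the kernel lands at (0, 0), (0, 2), (2, 0) and (2, 2). With an empty `fields`, the hypothetical 1×1, stride-1 defaults visit every element.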
In one possible implementation, the operation domain may further include a number of convolution kernels. The operation module 4-12 is further configured to perform the convolution operation on the data to be operated on using that number of convolution kernels.
In this implementation, the number of convolution kernels corresponds to the data to be operated on. For example, when the number of convolution kernels is 5, the data to be operated on can be divided into five parts, and five convolution kernels are used to convolve the five parts respectively.
In this implementation, when the operation domain does not include the number of convolution kernels, it can be determined that the convolution operation on the data to be operated on requires only one convolution kernel.
In one possible implementation, the operation domain may further include a number of channels. The operation module 4-12 is further configured to convolve the data to be operated on over the corresponding channels, according to the number of channels, to obtain the operation result.
For example, when the number of channels is 3, the data to be operated on can be convolved on three channels to obtain the operation result.
In one possible implementation, as shown in FIGS. 8-2a and 8-2b, the device may further include a storage module 4-13, which stores the data to be operated on and the convolution kernel.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache stores the data to be operated on and the convolution kernel, and the register stores the scalar data in the data to be operated on.
In one possible implementation, the cache may include a neuron cache, which stores the neuron data in the data to be operated on; the neuron data includes neuron vector data.
In one possible implementation, the control module 4-11 may further be configured to generate an assembly file from the convolution instruction and translate the assembly file into a binary file, the binary file being the compiled convolution instruction.
In one possible implementation, the instruction format of the convolution instruction may be:
conv dst src0 kernel srcChannel srcHeigh srcWidth kernelHeight kernelWidth strideX strideY dstChannel
Here, conv is the opcode of the convolution instruction, and dst, src0, kernel, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, strideX, strideY, and dstChannel form its operation domain. Their meanings are as follows:
dst is the target address, src0 is the address of the data to be operated on, and kernel is the convolution kernel or the convolution kernel address.
srcChannel is the number of convolution kernels, srcHeigh is the input height of the data to be operated on, and srcWidth is its input width.
kernelHeight is the convolution kernel height and kernelWidth is the convolution kernel width.
strideX is the first stride, strideY is the second stride, and dstChannel is the number of channels.
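Decoding a textual instruction in the format above into named operation-domain fields can be sketched as follows. It assumes that operands are whitespace-separated integers in exactly the listed order. The class and function names are hypothetical. The field names are copied from the instruction format, including the spelling `srcHeigh`.

```python
from dataclasses import dataclass, fields


@dataclass
class ConvInstruction:
    """Operation domain of the conv instruction, in the order of the instruction format."""
    dst: int
    src0: int
    kernel: int
    srcChannel: int
    srcHeigh: int
    srcWidth: int
    kernelHeight: int
    kernelWidth: int
    strideX: int
    strideY: int
    dstChannel: int


def parse_conv(text: str) -> ConvInstruction:
    """Split off the opcode, check it, and bind the operands to their field names."""
    opcode, *operands = text.split()
    if opcode != "conv":
        raise ValueError(f"not a convolution instruction: {opcode!r}")
    expected = len(fields(ConvInstruction))
    if len(operands) != expected:
        raise ValueError(f"expected {expected} operands, got {len(operands)}")
    return ConvInstruction(*(int(x) for x in operands))
```

Parsing the disclosure's example `conv 500 100 200 5 64 32 2 1 2 3 3` gives dst = 500, src0 = 100, kernel = 200, srcChannel = 5, srcHeigh = 64, srcWidth = 32, kernelHeight = 2, kernelWidth = 1, strideX = 2, strideY = 3 and dstChannel = 3.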
It should be understood that those skilled in the art may set the opcode of the convolution instruction, and the positions of the opcode and the operation domain within the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the device may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that, although the convolution instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following application example is given according to an embodiment of the present disclosure. It uses "performing a convolution operation with the convolution instruction processing device" as an exemplary scenario, to help explain the device's workflow. Those skilled in the art should understand that this example is intended only to aid understanding of the embodiments and should not be regarded as limiting them.
FIG. 8-3 shows a schematic diagram of an application scenario of the convolution instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-3, the device processes a convolution instruction as follows:
The control module 4-11 compiles the acquired convolution instruction 1 to obtain compiled convolution instruction 1 (for example, conv 500 100 200 5 64 32 2 1 2 3 3). It then parses the compiled instruction to obtain its opcode and operation domain, which are as follows:
opcode conv; target address 500; address of the data to be operated on 100; convolution kernel address 200; number of convolution kernels 5;
input height 64; input width 32; convolution kernel height 2; convolution kernel width 1; first stride 2; second stride 3; number of channels 3.
The control module 4-11 then fetches 64×32 data to be operated on from address 100 and a 2×1 convolution kernel from address 200.
The operation module 4-12 convolves the data to be operated on using 5 convolution kernels, a first stride of 2, a second stride of 3, and 3 channels. It obtains the operation result and stores it at target address 500.
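As a worked check of the example, the spatial size of each output map can be computed from the decoded fields. This calculation rests on assumptions that the application example does not state: no padding is applied, the second stride (3) moves the 2-row kernel along the 64-row height (Y), and the first stride (2) moves the 1-column kernel along the 32-column width (X). Under those assumptions, each kernel and channel would produce an output map of

```latex
H_{out} = \left\lfloor \frac{64 - 2}{3} \right\rfloor + 1 = 20 + 1 = 21,
\qquad
W_{out} = \left\lfloor \frac{32 - 1}{2} \right\rfloor + 1 = 15 + 1 = 16
```

that is, 21×16 kernel positions per output map.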
For the working process of each of the above modules, refer to the relevant description above.
In this way, the convolution instruction processing device can process convolution instructions, and perform convolution operations, efficiently and quickly.
FIG. 8-4 shows a flowchart of a convolution instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment or similar apparatus that includes a memory and a processor. The memory stores the data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-4 and S52-4 below. As shown in FIG. 8-4, the method is applied to the convolution instruction processing device described above and includes step S51-4 and step S52-4.
In step S51-4, the control module performs the following operations:
(1) It compiles the acquired convolution instruction to obtain a compiled convolution instruction.
(2) It parses the compiled convolution instruction to obtain its opcode and operation domain.
(3) According to the opcode and operation domain, it obtains the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction.
The opcode indicates that the operation the convolution instruction performs on the data is a convolution operation. The operation domain includes the address of the data to be operated on, the convolution kernel address, and the target address.
In step S52-4, the operation module convolves the data to be operated on with the convolution kernel to obtain the operation result, and stores the operation result at the target address.
In one possible implementation, convolving the data to be operated on with the convolution kernel may include:
using multiple multipliers to perform the multiplication operations in the convolution operation, and using multiple adders to perform the addition operations in the convolution operation.
In one possible implementation, the operation module includes a master operation submodule and multiple slave operation submodules, and the slave operation submodules include multiple multipliers and multiple adders.
In that case, the method may further include parsing the compiled convolution instruction with the control module to obtain multiple operation instructions,
and step S52-4 may include:
performing, with the master operation submodule, preprocessing on the data to be operated on and the convolution kernel, and transmitting data and operation instructions;
performing, with the multipliers and adders in the slave operation submodules, intermediate operations in parallel on the transmitted data and operation instructions to obtain multiple intermediate results; and
performing, with the master operation submodule, subsequent processing on the intermediate results to obtain the operation result, and storing the operation result at the target address.
In one possible implementation, the operation domain may further include an input height and an input width. Obtaining, according to the opcode and operation domain, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction may include:
obtaining, from the address of the data to be operated on, the data to be operated on that corresponds to the input width and input height.
In one possible implementation, the operation domain may further include a convolution kernel height and a convolution kernel width. Obtaining, according to the opcode and operation domain, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction may include fetching the convolution kernel from the convolution kernel address according to the convolution kernel height and the convolution kernel width.
In one possible implementation, the operation domain may further include a first stride. Convolving the data to be operated on with the convolution kernel to obtain the operation result then includes moving the convolution kernel in the X direction of the data to be operated on by the first stride.
In one possible implementation, the operation domain may further include a second stride. Convolving the data to be operated on with the convolution kernel to obtain the operation result then includes moving the convolution kernel in the Y direction of the data to be operated on by the second stride.
In one possible implementation, the operation domain may further include a number of convolution kernels. Convolving the data to be operated on with the convolution kernel to obtain the operation result may then include:
convolving the data to be operated on with as many convolution kernels as the number of convolution kernels specifies.
In one possible implementation, the operation domain may further include a number of channels. Convolving the data to be operated on with the convolution kernel to obtain the operation result may then include:
convolving the data to be operated on over the corresponding channels, according to the number of channels, to obtain the operation result.
In one possible implementation, the method may further include storing the data to be operated on and the convolution kernel in the storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache stores the data to be operated on and the convolution kernel, and includes at least one neuron cache (NRAM);
the register stores scalar data in the data to be operated on; and
the neuron cache stores neuron data in the data to be operated on, the neuron data including neuron vector data.
In one possible implementation, parsing the acquired convolution instruction to obtain its opcode and operation domain may include:
storing the compiled convolution instruction;
parsing the compiled convolution instruction to obtain the opcode and operation domain of the convolution instruction; and
storing an instruction queue, the instruction queue containing multiple instructions to be executed arranged in execution order, which may include the compiled convolution instruction.
In one possible implementation, the method may further include:
determining whether a first instruction to be executed, among the multiple instructions to be executed, is associated with a zeroth instruction to be executed that precedes it. If it is, the first instruction to be executed is cached, and it is executed after the zeroth instruction to be executed has finished executing.
Here, the first instruction to be executed is associated with the preceding zeroth instruction to be executed when:
the first storage address interval, which holds the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which holds the data required by the zeroth instruction to be executed.
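The association test above reduces to an interval-overlap check on storage addresses. The sketch below assumes half-open address intervals [start, end) and allows an instruction to touch several intervals (for example, its input data, kernel, and target). The function names are hypothetical, and the disclosure does not specify how the hardware compares addresses.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open storage address interval [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def must_wait(first: List[Interval], zeroth: List[Interval]) -> bool:
    """True if the first instruction must be cached until the zeroth one finishes.

    This is the case when any interval used by the first instruction overlaps
    any interval used by the preceding (zeroth) instruction.
    """
    return any(overlaps(x, y) for x in first for y in zeroth)
```

Intervals that only touch at a boundary, such as [0, 10) and [10, 20), do not overlap under the half-open convention. In that case the first instruction would not have to wait.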
In one possible implementation, compiling the acquired convolution instruction to obtain the compiled convolution instruction may include:
generating an assembly file from the convolution instruction, and translating the assembly file into a binary file, the binary file being the compiled convolution instruction.
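The assembly-to-binary translation step can be illustrated with a toy assembler. The disclosure does not disclose the real binary layout, so everything about the encoding here is hypothetical. The opcode number 0x01 is invented. The layout is a 1-byte opcode followed by eleven little-endian unsigned 32-bit operand fields. A single text line stands in for the "assembly file". The sketch only shows the idea of turning the textual instruction into a fixed-layout binary and back.

```python
import struct

OPCODES = {"conv": 0x01}  # hypothetical opcode numbering
NAMES = {code: name for name, code in OPCODES.items()}
FIELD_COUNT = 11          # operands in the conv instruction format
LAYOUT = "<B%dI" % FIELD_COUNT  # 1-byte opcode + 11 uint32 fields, little-endian, no padding


def assemble(line: str) -> bytes:
    """Translate one textual instruction into its (hypothetical) binary encoding."""
    mnemonic, *operands = line.split()
    if len(operands) != FIELD_COUNT:
        raise ValueError(f"expected {FIELD_COUNT} operands, got {len(operands)}")
    return struct.pack(LAYOUT, OPCODES[mnemonic], *(int(x) for x in operands))


def disassemble(blob: bytes) -> str:
    """Inverse of assemble(), useful for checking the encoding."""
    opcode, *operands = struct.unpack(LAYOUT, blob)
    return " ".join([NAMES[opcode], *map(str, operands)])
```

Under this layout, the example instruction `conv 500 100 200 5 64 32 2 1 2 3 3` encodes to 45 bytes, and `disassemble` recovers the original text.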
It should be noted that, although the convolution instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The convolution instruction processing method provided by the embodiments of the present disclosure has a wide range of application. It processes convolution instructions, and performs convolution operations, with high efficiency and speed.
The foregoing can be better understood in light of the following clauses:
条款H1、一种卷积指令处理装置,所述装置包括:Clause H1, a convolution instruction processing device, the device comprising:
控制模块,用于对获取到的卷积指令进行编译,得到编译后的卷积指令,对所述编译后的卷积指令进行解析,得到卷积指令的操作码和操作域,并根据所述操作码和所述操作域获取执行卷积指令所需的待运算数据、卷积核和目标地址;The control module is used to compile the obtained convolution instruction to obtain the compiled convolution instruction, analyze the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction, and according to the The operation code and the operation domain obtain the data to be operated, the convolution kernel and the target address required to execute the convolution instruction;
运算模块,用于根据所述卷积核对所述待运算数据进行卷积运算,获得运算结果,并将所述运算结果存入所述目标地址中,An operation module, configured to perform a convolution operation on the data to be operated according to the convolution kernel, obtain an operation result, and store the operation result in the target address,
其中,所述操作码用于指示所述卷积指令对数据所进行的运算为卷积运算,所述操作域包括待运算数据地址、卷积核地址和所述目标地址。Wherein, the operation code is used to indicate that the operation performed by the convolution instruction on the data is a convolution operation, and the operation domain includes the data address to be operated, the convolution kernel address, and the target address.
条款H2、根据条款H1所述的装置,所述运算模块,包括:Clause H2. The device according to Clause H1, the arithmetic module includes:
多个乘法器,用于执行所述卷积运算中的乘法运算;Multiple multipliers for performing the multiplication operation in the convolution operation;
多个加法器,用于执行所述卷积运算中的加法运算。A plurality of adders are used to perform the addition operation in the convolution operation.
条款H3、根据条款H2所述的装置,所述运算模块包括主运算子模块和多个从运算子模块,所述从运算子模块包括所述多个乘法器和所述多个加法器,Clause H3. The device according to Clause H2, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-module includes the plurality of multipliers and the plurality of adders,
所述控制模块,还用于解析所述编译后的卷积指令得到多个运算指令,并将所述待运算数据、所述卷积核和所述多个运算指令发送至所述主运算子模块;The control module is further configured to parse the compiled convolution instruction to obtain a plurality of operation instructions, and send the data to be operated, the convolution kernel, and the plurality of operation instructions to the main operator Module
所述主运算子模块,用于对所述待运算数据和所述卷积核执行前序处理,以及与所述多个从运算子模块进行数据和运算指令的传输;The master operation sub-module is used to perform pre-processing on the data to be operated and the convolution kernel, and to transmit data and operation instructions with the plurality of slave operation sub-modules;
所述从运算子模块,用于基于所述多个乘法器和所述多个加法器,根据从所述主运算子模块传输的数据和运算指令并行执行中间运算得到多个中间结果,并将所述多个中间结果传输给所述主运算子模块;The slave operation sub-module is configured to execute intermediate operations in parallel based on the data and operation instructions transmitted from the master operation sub-module based on the multiple multipliers and the multiple adders to obtain multiple intermediate results, and Transmitting the plurality of intermediate results to the main operation submodule;
所述主运算子模块,还用于对所述多个中间结果执行后续处理,得到运算结果,并将所述运算结果存入所述目标地址中。The main operation sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain an operation result, and store the operation result in the target address.
条款H4、根据条款H1所述的装置,所述操作域还包括输入高度和输入宽度,Clause H4. The device according to Clause H1, the operation domain further includes an input height and an input width,
其中,所述控制模块,还用于从所述待运算数据地址中,获取对应所述输入宽度和所述输入高度的待运算数据。Wherein, the control module is also used to obtain the data to be calculated corresponding to the input width and the input height from the data to be calculated address.
条款H5、根据条款H1所述的装置,所述操作域还包括卷积核高度和卷积核宽度,Clause H5. The device according to Clause H1, the operation domain further includes a convolution kernel height and a convolution kernel width,
其中,所述控制模块,还用于按照所述卷积核高度和所述卷积核宽度从所述卷积核地址中获取所述卷积核。Wherein, the control module is further configured to obtain the convolution kernel from the convolution kernel address according to the height of the convolution kernel and the width of the convolution kernel.
条款H6、根据条款H1所述的装置,所述操作域还包括第一步幅,Clause H6. The device according to Clause H1, the operation domain further includes a first step,
所述运算模块,还用于在所述待运算数据的X方向上按照第一步幅移动所述卷积核。The calculation module is also used to move the convolution kernel in the X direction of the data to be calculated according to the first step.
条款H7、根据条款H1所述的装置,所述操作域还包括第二步幅,Clause H7. The device according to Clause H1, the operation domain further includes a second step,
所述运算模块,还用于在所述待运算数据的Y方向上按照第二步幅移动所述卷积核。The calculation module is also used to move the convolution kernel in the Y direction of the data to be calculated according to a second step.
条款H8、根据条款H1所述的装置,所述操作域还包括卷积核数量,Clause H8. The device according to Clause H1, the operation domain further includes the number of convolution kernels,
其中,所述运算模块,还用于通过数量为所述卷积核数量的多个卷积核,对所述待运算数据进行卷积运算。Wherein, the operation module is also used to perform convolution operation on the data to be operated through a plurality of convolution kernels whose number is the number of the convolution kernels.
条款H9、根据条款H1所述的装置,所述操作域还包括通道数,Clause H9. The device according to Clause H1, the operation domain further includes the number of channels,
其中,所述运算模块,还用于根据所述通道数,通过对应的通道对所述待运算数据进行卷积运算,得到运算结果。Wherein, the operation module is also used to perform convolution operation on the data to be operated through the corresponding channel according to the number of channels to obtain an operation result.
条款H10、根据条款H1所述的装置,所述装置还包括:Clause H10. The device according to Clause H1, the device further comprising:
存储模块,用于存储所述待运算数据和所述卷积核,A storage module for storing the data to be calculated and the convolution kernel,
其中,所述存储模块包括寄存器和缓存中的至少一种,Wherein, the storage module includes at least one of a register and a cache,
所述缓存,用于存储所述待运算数据和所述卷积核,所述缓存包括至少一个神经元缓存NRAM;The cache is used to store the data to be operated and the convolution kernel, and the cache includes at least one neuron cache NRAM;
所述寄存器,用于存储所述待运算数据中的标量数据;The register is used to store scalar data in the data to be calculated;
所述神经元缓存,用于存储所述待运算数据中的神经元数据,所述神经元数据包括神经元向量数据。The neuron cache is used to store neuron data in the data to be operated, and the neuron data includes neuron vector data.
条款H11、根据条款H1所述的装置,所述控制模块包括:Clause H11. The device according to Clause H1, the control module includes:
指令存储子模块,用于存储所述编译后的卷积指令;An instruction storage sub-module for storing the compiled convolution instruction;
指令处理子模块,用于对所述编译后的卷积指令进行解析,得到卷积指令的操作码和操作域;Instruction processing sub-module, which is used to parse the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction;
队列存储子模块,用于存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述编译后的卷积指令。A queue storage submodule is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to an execution order, and the plurality of instructions to be executed include the compiled convolutional instructions.
条款H12、根据条款H11所述的装置,所述控制模块,还包括:Clause H12. The device according to Clause H11, the control module, further comprising:
依赖关系处理子模块,用于在确定所述多个待执行指令中的第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系时,将所述第一待执行指令缓存在所述指令存储子模块中,在所述第零待执行指令执行完毕后,从所述指令存储子模块中提取所述第一待执行指令发送至所述运算 模块,The dependency processing sub-module is used to determine the first pending instruction when there is an association relationship between the first pending instruction in the plurality of pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the arithmetic module,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系包括:Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
Clause H13. The device according to Clause H1, wherein
the control module is further configured to generate an assembly file according to the convolution instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled convolution instruction.
Clause H14. A machine learning operation device, comprising:
one or more convolution instruction processing devices according to any one of Clauses H1 to H13, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of convolution instruction processing devices, the plurality of convolution instruction processing devices may be connected through a specific structure and transmit data among themselves;
wherein the plurality of convolution instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of convolution instruction processing devices share the same control system or have their own control systems; the plurality of convolution instruction processing devices share memory or have their own memories; and the plurality of convolution instruction processing devices may be interconnected in any interconnection topology.
Clause H15. A combined processing device, comprising:
the machine learning operation device according to Clause H14, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by a user,
wherein the combined processing device further includes a storage device, which is connected to the machine learning operation device and to the other processing devices, respectively, and is configured to store data of the machine learning operation device and the other processing devices.
Clause H16. A machine learning chip, comprising:
the machine learning operation device according to Clause H14 or the combined processing device according to Clause H15.
Clause H17. An electronic device, comprising:
the machine learning chip according to Clause H16.
Clause H18. A board card, comprising a storage device, an interface device, a control device, and the machine learning chip according to Clause H16;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device;
the control device is configured to monitor the state of the machine learning chip.
Clause H19. A convolution instruction processing method, applied to a convolution instruction processing device that includes a control module and an operation module, the method comprising:
compiling, by the control module, an obtained convolution instruction to obtain a compiled convolution instruction; parsing the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction; and obtaining, according to the operation code and the operation domain, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction;
performing, by the operation module, a convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result, and storing the operation result at the target address,
wherein the operation code indicates that the operation the convolution instruction performs on data is a convolution operation, and the operation domain includes a to-be-operated data address, a convolution kernel address, and the target address.
Clause H20. The method according to Clause H19, wherein performing the convolution operation on the data to be operated on according to the convolution kernel includes:
performing the multiplications of the convolution operation with a plurality of multipliers, and performing the additions of the convolution operation with a plurality of adders.
Clause H21. The method according to Clause H19, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the slave operation sub-modules include the plurality of multipliers and the plurality of adders,
wherein the method further includes:
parsing, by the control module, the compiled convolution instruction to obtain a plurality of operation instructions;
wherein performing the convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result, and storing the operation result at the target address, includes:
performing, by the master operation sub-module, pre-processing on the data to be operated on and the convolution kernel, and transmitting data and operation instructions;
performing, by the plurality of multipliers and the plurality of adders in the slave operation sub-modules, intermediate operations in parallel according to the transmitted data and operation instructions to obtain a plurality of intermediate results;
performing, by the master operation sub-module, subsequent processing on the plurality of intermediate results to obtain the operation result, and storing the operation result at the target address.
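The master/slave division of labour in Clause H21 can be pictured on the innermost piece of a convolution, a dot product. In this hypothetical sketch the master splits the operands into slices, each slave multiplies and accumulates its slice, and the master adds the partial sums. The slaves run one after another here, whereas the hardware would run them in parallel, and the slicing scheme itself is an assumption.

```python
from typing import List


def slave_mac(data: List[float], weights: List[float]) -> float:
    """One slave: multipliers form element-wise products and adders
    accumulate them into a partial sum (an intermediate result)."""
    acc = 0.0
    for x, w in zip(data, weights):
        acc += x * w
    return acc


def master_dot(data: List[float], weights: List[float], num_slaves: int) -> float:
    """Master: slice the operands (pre-processing), hand one slice to each
    slave, then add the partial sums (subsequent processing)."""
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    if num_slaves < 1:
        raise ValueError("need at least one slave")
    if not data:
        return 0.0
    chunk = -(-len(data) // num_slaves)  # ceiling division: slice length per slave
    partials = [
        slave_mac(data[i:i + chunk], weights[i:i + chunk])  # sequential here, parallel in hardware
        for i in range(0, len(data), chunk)
    ]
    return sum(partials)
```

For example, master_dot([1, 2, 3, 4, 5], [1, 1, 1, 1, 1], 2) gives one slave the slice [1, 2, 3] and the other [4, 5]; the master adds 6 and 9 to obtain 15.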
Clause H22. The method according to Clause H19, wherein the operation domain further includes an input height and an input width,
wherein obtaining, according to the operation code and the operation domain, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction includes:
obtaining, from the to-be-operated data address, the data to be operated on corresponding to the input width and the input height.
Clause H23. The method according to Clause H19, wherein the operation domain further includes a convolution kernel height and a convolution kernel width,
wherein obtaining, according to the operation code and the operation domain, the data to be operated on, the convolution kernel, and the target address required to execute the convolution instruction includes:
obtaining the convolution kernel from the convolution kernel address according to the convolution kernel height and the convolution kernel width.
Clause H24. The method according to Clause H19, wherein the operation domain further includes a first stride,
wherein performing the convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result includes:
moving the convolution kernel along the X direction of the data to be operated on by the first stride.
Clause H25. The method according to Clause H19, wherein the operation domain further includes a second stride,
wherein performing the convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result includes:
moving the convolution kernel along the Y direction of the data to be operated on by the second stride.
Clause H26. The method according to Clause H19, wherein the operation domain further includes a number of convolution kernels,
wherein performing the convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result includes:
performing the convolution operation on the data to be operated on with a plurality of convolution kernels, the count of which equals the number of convolution kernels.
Clause H27. The method according to Clause H19, wherein the operation domain further includes a number of channels,
wherein performing the convolution operation on the data to be operated on according to the convolution kernel to obtain an operation result includes:
performing, according to the number of channels, the convolution operation on the data to be operated on through the corresponding channels to obtain the operation result.
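To show how the operation-domain fields of Clauses H22 to H27 fit together, here is a minimal reference convolution. It assumes a channel-major [channel][row][col] layout, no padding, X as the width axis and Y as the height axis, and, as is usual for neural-network convolution, no kernel flip; the number of channels is taken as the input depth and the number of kernels as the number of output planes. None of these conventions is fixed by the clauses.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # [channel][row][col]


def conv2d(src: Tensor3, kernels: List[Tensor3], sx: int, sy: int) -> Tensor3:
    """Valid (unpadded) convolution without kernel flip.

    src:     channels x in_h x in_w input
    kernels: num_kernels kernels, each channels x k_h x k_w
    sx, sy:  strides along X (width) and Y (height)
    Returns num_kernels output planes, each out_h x out_w.
    """
    channels, in_h, in_w = len(src), len(src[0]), len(src[0][0])
    out = []
    for kern in kernels:  # one output plane per convolution kernel
        k_h, k_w = len(kern[0]), len(kern[0][0])
        out_h = (in_h - k_h) // sy + 1
        out_w = (in_w - k_w) // sx + 1
        plane = [[0.0] * out_w for _ in range(out_h)]
        for oy in range(out_h):
            for ox in range(out_w):
                acc = 0.0  # multiply-accumulate over channels and the kernel window
                for c in range(channels):
                    for ky in range(k_h):
                        for kx in range(k_w):
                            acc += src[c][oy * sy + ky][ox * sx + kx] * kern[c][ky][kx]
                plane[oy][ox] = acc
        out.append(plane)
    return out
```

For example, a single 3 x 3 channel [[1, 2, 3], [4, 5, 6], [7, 8, 9]] convolved with the kernel [[1, 0], [0, 1]] at strides 1 gives [[6, 8], [12, 14]]; with both strides set to 2 only the top-left window fits, giving [[6]].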
Clause H28. The method according to Clause H19, further comprising:
storing the data to be operated on and the convolution kernel in a storage module of the device,
wherein the storage module includes at least one of a register and a cache,
the cache is configured to store the data to be operated on and the convolution kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause H29. The method according to Clause H19, wherein parsing the obtained convolution instruction to obtain the operation code and operation domain of the convolution instruction includes:
storing the compiled convolution instruction;
parsing the compiled convolution instruction to obtain the operation code and operation domain of the convolution instruction;
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed include the compiled convolution instruction.
Clause H30. The method according to Clause H29, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after it is determined that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed being associated with the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause H31. The method according to Clause H19, wherein compiling the obtained convolution instruction to obtain the compiled convolution instruction includes:
generating an assembly file according to the convolution instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled convolution instruction.
Clause H32. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of Clauses H19 to H31.
With the widespread use of neural network algorithms and the continuous growth in the computing power of computer hardware, the types and volume of data operations involved in practical applications keep increasing. Max pooling (max-pooling) is an operation that obtains the maximum value of all data in a local region. Programming languages are diverse, and there is currently no max pooling instruction that applies broadly across them, so in the related art technicians must define multiple instructions for their own programming language environment to implement the max pooling operation, which makes max pooling inefficient and slow. The present disclosure provides a max pooling instruction processing method and device, computer equipment, and a storage medium that implement the max pooling operation with a single instruction, which can significantly improve the efficiency and speed of max pooling.
Figure 9-1 shows a block diagram of a max pooling instruction processing device according to an embodiment of the present disclosure. As shown in Figure 9-1, the device includes a control module 5-11 and an operation module 5-12.
The control module 5-11 is configured to compile an obtained max pooling instruction to obtain a compiled max pooling instruction, parse the compiled max pooling instruction to obtain the operation code and operation domain of the max pooling instruction, and obtain, according to the operation code and the operation domain, the data to be operated on, the pooling kernel, and the target address required to execute the max pooling instruction. The operation code indicates that the operation the max pooling instruction performs on data is a max pooling operation, and the operation domain includes a to-be-operated data address, a pooling kernel address, and the target address.
The operation module 5-12 is configured to perform the max pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result, and store the operation result at the target address.
In this embodiment, the max pooling instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module first compiles it; only after the compiled max pooling instruction is obtained can it be parsed. The compiled max pooling instruction is a hardware instruction that the hardware can execute directly. The control module can obtain the data to be operated on and the pooling kernel from the to-be-operated data address and the pooling kernel address, respectively. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or the field (usually represented in code), that specifies the operation to be performed in a computer program; it serves as an instruction identifier that tells the executing device which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including the data to be operated on, parameters such as the pooling kernel, and the corresponding operation method. A max pooling instruction must include an operation code and an operation domain, and the operation domain includes at least the to-be-operated data address, the pooling kernel address, and the target address.
It should be understood that those skilled in the art can set the instruction format of the max pooling instruction, as well as the operation code and operation domain it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that control module can receive the max pooling instruction and control one or more operation modules to perform the max pooling operation. When the device includes multiple control modules, each control module can receive a max pooling instruction and control its corresponding one or more operation modules to perform the max pooling operation.
The max pooling instruction processing device provided by embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained max pooling instruction to obtain a compiled max pooling instruction, parse the compiled instruction to obtain its operation code and operation domain, and obtain, according to the operation code and operation domain, the data to be operated on, the pooling kernel, and the target address required to execute the instruction; the operation module is configured to perform the max pooling operation on the data to be operated on according to the pooling kernel, obtain an operation result, and store it at the target address. The device is widely applicable, processes max pooling instructions efficiently and quickly, and performs max pooling operations efficiently and quickly.
Figure 9-2a shows a block diagram of a max pooling instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Figure 9-2a, the operation module 5-12 may include a plurality of comparators 5-120, which are configured to compare the data to be operated on within the region covered by the pooling kernel to obtain the operation result.
In this implementation, the operation module may instead include a single comparator. The number of comparators can be set according to the amount of data to be compared and the required speed and efficiency of the comparison; the present disclosure does not limit this.
Figure 9-2b shows a block diagram of a max pooling instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Figure 9-2b, the operation module 5-12 may include a master operation sub-module 5-121 and a plurality of slave operation sub-modules 5-122, and the master operation sub-module 5-121 includes a plurality of comparators.
The master operation sub-module 5-121 is configured to use the plurality of comparators to compare the data to be operated on within the region covered by the pooling kernel, obtain the operation result, and store the operation result at the target address.
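The disclosure only states that several comparators examine the values inside a pooling window. One common way to use many comparators at once, shown here as an assumption rather than the disclosed circuit, is a reduction tree: each pass compares disjoint pairs, so n values need about ceil(log2 n) passes instead of n - 1 sequential comparisons. The function names are hypothetical.

```python
from typing import List


def comparator(a: float, b: float) -> float:
    """A single two-input comparator: passes the larger operand through."""
    return a if a >= b else b


def window_max(values: List[float]) -> float:
    """Reduce one pooling window with a tree of comparators.

    Each pass compares disjoint pairs (work that independent comparators
    could do at the same time) and roughly halves the candidate list."""
    if not values:
        raise ValueError("empty pooling window")
    level = list(values)
    while len(level) > 1:
        nxt = [comparator(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value passes through to the next pass
        level = nxt
    return level[0]
```

For an eight-value window such as [3, 1, 4, 1, 5, 9, 2, 6], this takes three passes of four, two, and one comparison and returns 9.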
In a possible implementation, the operation domain may further include an input height and an input width.
The control module is further configured to obtain, from the to-be-operated data address, the data to be operated on corresponding to the input width and input height.
In this implementation, the input height and input width define the amount and dimensions of the data to be operated on that is obtained. The input height and input width in the operation domain may be specific values, or storage addresses at which the input height and input width are stored. When the operation domain directly includes specific values, those values are used as the input height and input width. When the operation domain includes storage addresses, the input height and input width are read from those addresses, respectively.
In a possible implementation, when the operation domain does not include the input height and/or the input width, the data to be operated on may be obtained according to a preset default input height and default input width.
In this manner, the amount and dimensions of the data to be operated on are bounded, which ensures the accuracy of the operation result and ensures that the device can execute the max pooling instruction.
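The rule that an operation-domain field may carry either a value or the address of a value, with a preset default when the field is absent, can be sketched as follows. The ("imm", value) / ("addr", address) tagging and the dictionary standing in for memory are hypothetical conventions for illustration only; the disclosure does not define how the two cases are distinguished.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding of one operation-domain field:
# ("imm", value) carries the number itself, ("addr", address) points at it.
Operand = Tuple[str, int]


def resolve(field: Optional[Operand], memory: Dict[int, int], default: int) -> int:
    """Return the numeric value of an operation-domain field."""
    if field is None:       # field omitted: fall back to the preset default
        return default
    kind, payload = field
    if kind == "imm":       # the field holds the value directly
        return payload
    if kind == "addr":      # the field holds the address where the value is stored
        return memory[payload]
    raise ValueError(f"unknown operand kind: {kind!r}")
```

With memory = {0x40: 32} and a hypothetical default of 16, resolve(("imm", 28), memory, 16) returns 28, resolve(("addr", 0x40), memory, 16) returns 32, and resolve(None, memory, 16) returns 16.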
In a possible implementation, the operation domain may further include a pooling kernel height and a pooling kernel width.
The control module 5-11 is further configured to obtain the pooling kernel from the pooling kernel address according to the pooling kernel height and pooling kernel width.
In a possible implementation, the operation domain may further include a first stride, and the operation module 5-12 may further be configured to move the pooling kernel in the x direction by the first stride.
In a possible implementation, the operation domain may further include a second stride, and the operation module 5-12 may further be configured to move the pooling kernel in the y direction by the second stride.
In this implementation, the stride of the max pooling operation is the distance the pooling kernel moves at each step of the operation. The first stride may be the distance the pooling kernel moves in the x direction, and the second stride may be the distance it moves in the y direction.
It should be noted that the present disclosure uses a two-dimensional pooling kernel only as an example when describing the height, width, first stride, and second stride required for the max pooling operation. If the pooling kernel is multi-dimensional, its parameters correspondingly include the size and stride in each dimension.
In a possible implementation, when the first stride and the second stride are not given in the operation domain of the max pooling instruction, the operation module may use the width and height of the pooling kernel as the strides in the corresponding dimensions, so that the max pooling operation can still proceed normally. For example, the operation module 5-12 may further be configured to move the pooling kernel over the data to be operated on without overlap and compare the data within the region covered by the pooling kernel to obtain the operation result.
In a possible implementation, when the operation domain does not include the pooling kernel height or the pooling kernel width, a preset default pooling kernel height and default pooling kernel width can be used, so that the control module and the operation module can still execute the max pooling instruction.
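A minimal single-plane max pooling sketch ties together the kernel size, the strides sx and sy, and the fallback to non-overlapping windows when the strides are absent. It assumes x runs along columns and y along rows, uses no padding, and takes plain nested lists as input; it is an illustration, not the device's implementation.

```python
from typing import List, Optional

Plane = List[List[float]]


def max_pool2d(src: Plane, k_h: int, k_w: int,
               sx: Optional[int] = None, sy: Optional[int] = None) -> Plane:
    """Max pooling over one 2-D plane without padding.

    sx is the stride along x (columns) and sy the stride along y (rows).
    A missing stride defaults to the kernel size in that dimension, so the
    windows tile the input without overlapping."""
    sx = k_w if sx is None else sx
    sy = k_h if sy is None else sy
    in_h, in_w = len(src), len(src[0])
    out_h = (in_h - k_h) // sy + 1  # only windows that fit entirely are used
    out_w = (in_w - k_w) // sx + 1
    return [
        [max(src[oy * sy + ky][ox * sx + kx]
             for ky in range(k_h) for kx in range(k_w))
         for ox in range(out_w)]
        for oy in range(out_h)
    ]
```

For the 4 x 4 input [[1, 3, 2, 4], [5, 6, 7, 8], [9, 2, 1, 0], [3, 4, 5, 6]] with a 2 x 2 kernel and no strides given, the windows do not overlap and the result is [[6, 8], [9, 6]]; with sx = sy = 1 the windows overlap and the result is [[6, 7, 8], [9, 7, 8], [9, 5, 6]].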
In a possible implementation, the operation domain may further include a number of pooling kernels. The operation module 5-12 is further configured to perform the max pooling operation on the data to be operated on with as many pooling kernels as this number specifies.
In this implementation, the number of pooling kernels corresponds to the data to be operated on. For example, when the number of pooling kernels is 5, the data to be operated on can be divided into five parts, and five pooling kernels are needed, each performing the max pooling operation on one of the five parts.
In this implementation, when the operation domain does not include the number of pooling kernels, it can be determined that a single pooling kernel suffices to perform the max pooling operation on the data to be operated on.
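One reading of the five-part example is that the buffer at the data address holds several equally sized parts back to back, each pooled by its own kernel. The sketch below follows that reading under stated assumptions: parts are stored consecutively in row-major order, every part has the same height and width, and each part is pooled with non-overlapping windows. The helper names are hypothetical.

```python
from typing import List

Plane = List[List[float]]


def unpack_parts(flat: List[float], num_kernels: int, h: int, w: int) -> List[Plane]:
    """View a flat buffer as num_kernels consecutive h x w parts (row-major)."""
    if len(flat) != num_kernels * h * w:
        raise ValueError("buffer size does not match num_kernels * h * w")
    size = h * w
    return [
        [flat[p * size + r * w: p * size + (r + 1) * w] for r in range(h)]
        for p in range(num_kernels)
    ]


def pool_part(part: Plane, k_h: int, k_w: int) -> Plane:
    """Non-overlapping max pooling of one part (stride equals kernel size)."""
    return [
        [max(part[y + ky][x + kx] for ky in range(k_h) for kx in range(k_w))
         for x in range(0, len(part[0]) - k_w + 1, k_w)]
        for y in range(0, len(part) - k_h + 1, k_h)
    ]


def max_pool_parts(flat: List[float], num_kernels: int, h: int, w: int,
                   k_h: int, k_w: int) -> List[Plane]:
    """Pool each part with its own kernel; num_kernels = 1 pools the whole buffer."""
    return [pool_part(p, k_h, k_w) for p in unpack_parts(flat, num_kernels, h, w)]
```

For example, list(range(16)) split into two 2 x 4 parts and pooled with a 2 x 2 kernel gives [[[5, 7]], [[13, 15]]].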
In a possible implementation, the operation module 5-12 is further configured to, when the size of the data to be operated on is not an integer multiple of the pooling kernel size, perform the max pooling operation only on the portion of the data whose size is an integer multiple of the pooling kernel size. The size of the data to be operated on being a non-integer multiple of the pooling kernel size may include at least one of the following: the input width of the data is not an integer multiple of the pooling kernel width; the input height of the data is not an integer multiple of the pooling kernel height.
In this implementation, the max pooling operation may be skipped for the remaining data that does not fill a whole pooling kernel.
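The number of complete windows follows from the usual unpadded pooling output size. Assuming no padding, an input of height H_in and width W_in, a kernel of k_h x k_w, and strides s_y and s_x:

```latex
H_{\text{out}} = \left\lfloor \frac{H_{\text{in}} - k_h}{s_y} \right\rfloor + 1,
\qquad
W_{\text{out}} = \left\lfloor \frac{W_{\text{in}} - k_w}{s_x} \right\rfloor + 1

% Non-overlapping case s_y = k_h, s_x = k_w:
\left\lfloor \frac{H_{\text{in}} - k_h}{k_h} \right\rfloor + 1
  = \left\lfloor \frac{H_{\text{in}}}{k_h} \right\rfloor,
\qquad
\left\lfloor \frac{W_{\text{in}} - k_w}{k_w} \right\rfloor + 1
  = \left\lfloor \frac{W_{\text{in}}}{k_w} \right\rfloor
```

With the non-overlapping strides, the last H_in mod k_h rows and W_in mod k_w columns are therefore left unpooled; for example, H_in = 5 and k_h = 2 give two output rows and leave one input row unused.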
In a possible implementation, as shown in Figures 9-2a and 9-2b, the device may further include a storage module 5-13, which is configured to store the data to be operated on and the pooling kernel.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store the data to be operated on and the pooling kernel, and the register can be used to store scalar data in the data to be operated on.
In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the neuron random access memory described above, can be used to store neuron data in the data to be operated on, and the neuron data may include neuron vector data.
In a possible implementation, the device may further include a direct memory access (DMA) module for reading data from or storing data to the storage module.
In a possible implementation, the control module 5-11 may further be configured to generate an assembly file according to the max pooling instruction and translate the assembly file into a binary file, where the binary file is the compiled max pooling instruction.
在一种可能的实现方式中,最大池化指令的指令格式可以是:In a possible implementation manner, the instruction format of the maximum pooling instruction may be:
maxpool dst src0 srcChannel srcHeigh srcWidth kernelHeight kernelWidth sx symaxpool dst src0 srcChannel srcHeigh srcWidth kernelHeight kernelWidth sxsy
其中,maxpool为最大池化指令的操作码,dst、src0、srcChannel、srcHeigh、srcWidth、kernelHeight、kernelWidth、sx、sy为最大池化指令的操作域。其中,dst为目标地址,src0为待运算数据地址,srcChannel为池化核数量,srcHeigh为输入高度,srcWidth为输入宽度,kernelHeight为池化核高度,kernelWidth为池化核宽度,sx为池化核在x方向上进行移动的第一步幅,sy为池化核在y方向上进行移动的第二步幅。Among them, maxpool is the operation code of the largest pooling instruction, and dst, src0, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, sx, and sy are the operation domains of the largest pooling instruction. Where dst is the target address, src0 is the data address to be calculated, srcChannel is the number of pooled cores, srcHeigh is the input height, srcWidth is the input width, kernelHeight is the pooled core height, kernelWidth is the pooled core width, and sx is the pooled core The first step of the movement in the x direction, sy is the second step of the movement of the pooling core in the y direction.
It should be understood that those skilled in the art may set the opcode of the max pooling instruction, and the positions of the opcode and operand fields within the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the apparatus may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the max pooling instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application example
The following takes "performing a max pooling operation with the max pooling instruction processing apparatus" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to make the workflow of the apparatus easier to understand. Those skilled in the art should understand that the following application example serves only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
Figure 9-3 is a schematic diagram of an application scenario of the max pooling instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 9-3, the apparatus processes a max pooling instruction as follows:
The control module 5-11 compiles the obtained max pooling instruction 1 to obtain the compiled max pooling instruction 1 (for example, max pooling instruction 1 is maxpool 500 100 200 5 64 32 2 1 2 1), and parses the compiled instruction to obtain the opcode and operand fields of max pooling instruction 1. The opcode of max pooling instruction 1 is maxpool, the destination address is 500, the address of the data to be operated on is 100, the pooling kernel address is 200, the number of pooling kernels is 5, the input height is 64, the input width is 32, the pooling kernel height is 2, the pooling kernel width is 1, the first stride is 2, and the second stride is 1. The control module 5-11 fetches 64 × 32 data to be operated on from address 100 and a 2 × 1 pooling kernel from pooling kernel address 200.
The operation module 5-12 performs the max pooling operation on the data to be operated on using the 5 pooling kernels, obtains the operation result, and stores it at destination address 500.
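The application example does not state the size of the result. Assuming the usual "valid" sliding-window rule (the window is only placed where it fits entirely inside the input) and assuming x runs along the width and y along the height, the number of window positions per axis is (input − kernel) // stride + 1. The helper `pooled_size` below is a hypothetical illustration of that arithmetic, not something the disclosure specifies.

```python
def pooled_size(in_h: int, in_w: int, k_h: int, k_w: int, sx: int, sy: int) -> tuple[int, int]:
    """Return (out_h, out_w) for valid max pooling of one input plane.

    Assumes sx is the stride along the width (x) and sy along the height (y).
    """
    if k_h > in_h or k_w > in_w:
        raise ValueError("pooling kernel larger than input")
    out_h = (in_h - k_h) // sy + 1
    out_w = (in_w - k_w) // sx + 1
    return out_h, out_w


# Operands of max pooling instruction 1: 64 x 32 input, 2 x 1 kernel, sx = 2, sy = 1.
print(pooled_size(64, 32, 2, 1, sx=2, sy=1))  # (63, 16)
```

Under the opposite axis convention (x along the height), the same operands would give a different shape, which is why the axis assumption is stated explicitly.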
For the working process of each of the above modules, refer to the related descriptions above.
In this way, max pooling instructions can be processed efficiently and quickly, and the efficiency and speed of the max pooling operation are also significantly improved.
Figure 9-4 is a flowchart of a max pooling instruction processing method according to an embodiment of the present disclosure. The method may be applied to, for example, a computer device including a memory and a processor, where the memory stores the data used while the method is executed and the processor performs the related processing and operation steps, such as steps S51-5 and S52-5 below. As shown in Figure 9-4, the method is applied to the max pooling instruction processing apparatus described above and includes step S51-5 and step S52-5.
In step S51-5, the control module compiles the obtained max pooling instruction to obtain a compiled max pooling instruction, parses the compiled instruction to obtain its opcode and operand fields, and, according to the opcode and operand fields, obtains the data to be operated on, the pooling kernel, and the destination address required to execute the instruction. The opcode indicates that the operation the max pooling instruction performs on the data is a max pooling operation, and the operand fields include the address of the data to be operated on, the pooling kernel address, and the destination address.
In step S52-5, the operation module performs the max pooling operation on the data to be operated on according to the pooling kernel, obtains the operation result, and stores it at the destination address.
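Step S52-5 can be illustrated with a plain-Python reference model of max pooling on a single two-dimensional input plane. It is only a behavioral sketch: the function name `max_pool_2d`, the list-of-lists data layout, and the convention that x indexes columns and y indexes rows are assumptions, and it models neither the comparator hardware nor the address-based data fetch.

```python
from typing import List

Matrix = List[List[float]]


def max_pool_2d(data: Matrix, k_h: int, k_w: int, sx: int, sy: int) -> Matrix:
    """Valid max pooling of one H x W plane with a k_h x k_w window.

    The window moves sx elements along x (columns) and sy elements along y (rows),
    and is only placed where it fits entirely inside the plane.
    """
    in_h, in_w = len(data), len(data[0])
    out: Matrix = []
    for top in range(0, in_h - k_h + 1, sy):
        row = []
        for left in range(0, in_w - k_w + 1, sx):
            # Gather every element covered by the window and keep the largest.
            window = (data[top + i][left + j] for i in range(k_h) for j in range(k_w))
            row.append(max(window))
        out.append(row)
    return out


grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(max_pool_2d(grid, 2, 2, 2, 2))  # [[6, 8], [14, 16]]
```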
In one possible implementation, performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
using a plurality of comparators in the operation module to compare the data to be operated on within the region covered by the pooling kernel, to obtain the operation result.
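One common way to organize several comparators is a reduction tree, in which each level compares values pairwise and forwards the larger one, so n values need ceil(log2 n) levels. The disclosure does not say how its comparators are arranged; the sketch below (`tree_max`, a hypothetical name) simply models such a tree in software and reports how many comparator levels were used.

```python
from typing import List, Sequence, Tuple


def tree_max(values: Sequence[float]) -> Tuple[float, int]:
    """Reduce values with pairwise comparisons arranged as a binary tree.

    Returns (maximum, number_of_levels). Each level models one rank of
    comparators working in parallel; an unpaired value is passed through.
    """
    if not values:
        raise ValueError("empty pooling window")
    level: List[float] = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd element has no partner at this level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth


print(tree_max([3, 9, 1, 7, 5]))  # (9, 3)
```

The level count is what a parallel comparator array would care about: five window elements need three comparison stages rather than four sequential comparisons.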
In one possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and the master operation submodule includes a plurality of comparators.
In this case, performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result, and storing the result at the destination address, includes:
using the plurality of comparators to compare the data to be operated on within the region covered by the pooling kernel to obtain the operation result, and storing the result at the destination address.
In one possible implementation, the operand fields may further include an input height and an input width. In this case, obtaining the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction according to the opcode and operand fields may include:
fetching, from the address of the data to be operated on, data whose size corresponds to the input width and the input height.
In one possible implementation, the operand fields may further include a pooling kernel height and a pooling kernel width. In this case, obtaining the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction according to the opcode and operand fields may include:
fetching the pooling kernel from the pooling kernel address according to the pooling kernel height and the pooling kernel width.
In one possible implementation, performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
moving the pooling kernel over the data to be operated on without overlap, and comparing the data within the region covered by the pooling kernel, to obtain the operation result.
In one possible implementation, performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, performing the max pooling operation only on the portion of the data whose size is an integer multiple of the pooling kernel size,
where the size of the data to be operated on not being an integer multiple of the pooling kernel size may include at least one of the following: the input width of the data is not an integer multiple of the pooling kernel width, and the input height of the data is not an integer multiple of the pooling kernel height.
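In the non-overlapping case the window stride equals the kernel size, and when the input height or width is not an integer multiple of the kernel size, only the largest region that is an integer multiple takes part. A sketch under those assumptions follows; the function name `max_pool_non_overlapping` is hypothetical, and the choice to drop the leftover rows and columns from the bottom and right edges is also an assumption, since the disclosure does not say which edge is trimmed.

```python
from typing import List

Matrix = List[List[float]]


def max_pool_non_overlapping(data: Matrix, k_h: int, k_w: int) -> Matrix:
    """Non-overlapping max pooling that ignores a partial border.

    Only the top-left (in_h // k_h * k_h) x (in_w // k_w * k_w) region is used;
    leftover rows at the bottom and columns at the right are skipped.
    """
    in_h, in_w = len(data), len(data[0])
    used_h = in_h // k_h * k_h  # largest multiple of k_h not exceeding in_h
    used_w = in_w // k_w * k_w  # largest multiple of k_w not exceeding in_w
    return [
        [
            max(data[r][c] for r in range(top, top + k_h) for c in range(left, left + k_w))
            for left in range(0, used_w, k_w)  # stride equals kernel width
        ]
        for top in range(0, used_h, k_h)       # stride equals kernel height
    ]


plane = [[5 * r + c for c in range(5)] for r in range(5)]  # 5 x 5, not a multiple of 2
print(max_pool_non_overlapping(plane, 2, 2))  # [[6, 8], [16, 18]]
```

With a 5 × 5 input and a 2 × 2 kernel, the last row and column are ignored and the output is 2 × 2.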
In one possible implementation, the operand fields may further include the number of pooling kernels. In this case, performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
performing the max pooling operation on the data to be operated on with as many pooling kernels as the number of pooling kernels specified.
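When the operand fields carry a pooling kernel count (srcChannel in the instruction format), one plausible reading is that the same pooling window is applied independently to that many input planes. The sketch below assumes exactly that, plus a channel-first C × H × W layout; `max_pool_channels` and `_pool_plane` are hypothetical names, and the inner helper repeats the valid, strided pooling rule so the block stays self-contained.

```python
from typing import List

Plane = List[List[float]]


def _pool_plane(plane: Plane, k_h: int, k_w: int, sx: int, sy: int) -> Plane:
    """Valid strided max pooling of one plane (x = columns, y = rows)."""
    return [
        [
            max(plane[top + i][left + j] for i in range(k_h) for j in range(k_w))
            for left in range(0, len(plane[0]) - k_w + 1, sx)
        ]
        for top in range(0, len(plane) - k_h + 1, sy)
    ]


def max_pool_channels(data: List[Plane], num_kernels: int,
                      k_h: int, k_w: int, sx: int, sy: int) -> List[Plane]:
    """Apply the pooling window to each of num_kernels channel planes."""
    if len(data) != num_kernels:
        raise ValueError("channel count does not match the pooling kernel count")
    return [_pool_plane(plane, k_h, k_w, sx, sy) for plane in data]


two_channels = [[[1, 2], [3, 4]], [[8, 7], [6, 5]]]
print(max_pool_channels(two_channels, 2, 2, 2, 2, 2))  # [[[4]], [[8]]]
```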
In one possible implementation, the method may further include: storing the data to be operated on and the pooling kernel in the storage module of the apparatus. The storage module may include at least one of a register and a cache. The cache stores the data to be operated on and the pooling kernel and may include at least one neuron cache (NRAM); the register stores scalar data among the data to be operated on; and the neuron cache stores neuron data among the data to be operated on, where the neuron data may include neuron vector data.
In one possible implementation, parsing the obtained max pooling instruction to obtain its opcode and operand fields may include:
storing the compiled max pooling instruction;
parsing the compiled max pooling instruction to obtain its opcode and operand fields; and
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, arranged in execution order, and these instructions may include the compiled max pooling instruction.
In one possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency on a zeroth instruction to be executed that precedes it, caching the first instruction and executing it after the zeroth instruction has finished executing,
where the first instruction having a dependency on the zeroth instruction that precedes it includes:
the first storage address range, which holds the data required by the first instruction, overlaps the zeroth storage address range, which holds the data required by the zeroth instruction.
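The dependency condition above reduces to an interval-overlap test on storage address ranges. The sketch below models it under stated assumptions: each queued instruction is described by a hypothetical `PendingInstr` record listing the half-open [start, end) address ranges it touches, the queue is held in execution order, and an instruction is held back while any earlier, still-unfinished instruction has an overlapping range. All names are illustrative only.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Range = Tuple[int, int]  # half-open address interval [start, end)


def overlaps(a: Range, b: Range) -> bool:
    """True when two half-open address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


@dataclass
class PendingInstr:
    name: str
    ranges: List[Range] = field(default_factory=list)  # addresses read or written


def depends_on(later: PendingInstr, earlier: PendingInstr) -> bool:
    """The later instruction depends on the earlier one if any ranges overlap."""
    return any(overlaps(a, b) for a in later.ranges for b in earlier.ranges)


def issuable(queue: Deque[PendingInstr]) -> List[str]:
    """Names of queued instructions with no dependency on an earlier unfinished one."""
    ready: List[str] = []
    ahead: List[PendingInstr] = []
    for instr in queue:  # queue is in execution order
        if not any(depends_on(instr, prev) for prev in ahead):
            ready.append(instr.name)
        ahead.append(instr)
    return ready


q = deque([
    PendingInstr("maxpool0", [(100, 200), (500, 600)]),
    PendingInstr("maxpool1", [(550, 650)]),   # overlaps maxpool0's output range
    PendingInstr("other", [(900, 950)]),
])
print(issuable(q))  # ['maxpool0', 'other']
```

Here "maxpool1" stays cached until "maxpool0" finishes, because its range [550, 650) overlaps [500, 600).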
In one possible implementation, compiling the obtained max pooling instruction to obtain the compiled max pooling instruction may include:
generating an assembly file from the max pooling instruction and translating the assembly file into a binary file, where the binary file is the compiled max pooling instruction.
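To make the assembly-to-binary step concrete, here is a toy assembler that packs the textual instruction into a fixed-width binary record and decodes it back. The encoding is entirely hypothetical: an 8-bit opcode value of 0x01 for maxpool followed by nine unsigned 32-bit little-endian operands in the order of the nine-field instruction format given earlier. The real instruction encoding is not described in this section.

```python
import struct

OPCODES = {"maxpool": 0x01}  # hypothetical opcode table
FORMAT = "<B9I"              # 1 opcode byte + 9 unsigned 32-bit operands, no padding


def assemble(line: str) -> bytes:
    """Translate one textual maxpool instruction into a binary record."""
    mnemonic, *operands = line.split()
    if mnemonic not in OPCODES:
        raise ValueError(f"unknown mnemonic: {mnemonic!r}")
    if len(operands) != 9:
        raise ValueError(f"maxpool expects 9 operands, got {len(operands)}")
    return struct.pack(FORMAT, OPCODES[mnemonic], *(int(tok) for tok in operands))


def disassemble(blob: bytes) -> str:
    """Inverse of assemble(), useful for checking the round trip."""
    opcode, *operands = struct.unpack(FORMAT, blob)
    names = {code: name for name, code in OPCODES.items()}
    return " ".join([names[opcode], *map(str, operands)])


record = assemble("maxpool 500 100 5 64 32 2 1 2 1")
print(len(record), disassemble(record))  # 37 maxpool 500 100 5 64 32 2 1 2 1
```

An assembly file would then be a sequence of such lines, and the binary file the concatenation of their records.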
It should be noted that although the max pooling instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
The max pooling instruction processing method provided by the embodiments of the present disclosure has a wide range of application, processes max pooling instructions efficiently and quickly, and performs max pooling operations efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause I1. A max pooling instruction processing apparatus, the apparatus comprising:
a control module, configured to compile an obtained max pooling instruction to obtain a compiled max pooling instruction, parse the compiled max pooling instruction to obtain an opcode and operand fields of the max pooling instruction, and obtain, according to the opcode and the operand fields, the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction; and
an operation module, configured to perform a max pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result, and store the operation result at the destination address,
wherein the opcode indicates that the operation performed by the max pooling instruction on data is a max pooling operation, and the operand fields include an address of the data to be operated on, a pooling kernel address, and the destination address.
Clause I2. The apparatus according to Clause I1, wherein the operation module comprises:
a plurality of comparators, configured to compare a plurality of data to be operated on in the region covered by the pooling kernel to obtain the operation result.
Clause I3. The apparatus according to Clause I2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, and the master operation submodule comprises the plurality of comparators,
the master operation submodule being configured to use the plurality of comparators to compare a plurality of data to be operated on in the region covered by the pooling kernel to obtain the operation result, and store the operation result at the destination address.
Clause I4. The apparatus according to Clause I1, wherein the operand fields further include an input height and an input width,
and the control module is further configured to fetch, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.
Clause I5. The apparatus according to Clause I1, wherein the operand fields further include a pooling kernel height and a pooling kernel width,
and the control module is further configured to fetch the pooling kernel from the pooling kernel address according to the pooling kernel height and the pooling kernel width.
Clause I6. The apparatus according to Clause I1, wherein the operand fields further include a first stride,
and the operation module is further configured to move the pooling kernel in the x direction by the first stride.
Clause I7. The apparatus according to Clause I1, wherein the operand fields further include a second stride,
and the operation module is further configured to move the pooling kernel in the y direction by the second stride.
Clause I8. The apparatus according to Clause I1,
wherein the operation module is further configured to move the pooling kernel over the data to be operated on without overlap, and compare a plurality of data to be operated on in the region covered by the pooling kernel to obtain the operation result.
Clause I9. The apparatus according to Clause I1,
wherein the operation module is further configured to, when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, perform the max pooling operation on the portion of the data to be operated on whose size is an integer multiple of the pooling kernel size,
wherein the size of the data to be operated on not being an integer multiple of the pooling kernel size includes at least one of the following: the input width of the data to be operated on is not an integer multiple of the width of the pooling kernel, and the input height of the data to be operated on is not an integer multiple of the height of the pooling kernel.
Clause I10. The apparatus according to Clause I1, wherein the operand fields further include a number of pooling kernels,
and the operation module is further configured to perform the max pooling operation on the data to be operated on with a plurality of pooling kernels equal in number to the number of pooling kernels.
Clause I11. The apparatus according to Clause I1, further comprising:
a storage module, configured to store the data to be operated on and the pooling kernel,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the pooling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on; and
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause I12. The apparatus according to Clause I1, wherein the control module comprises:
an instruction storage submodule, configured to store the compiled max pooling instruction;
an instruction processing submodule, configured to parse the compiled max pooling instruction to obtain the opcode and operand fields of the max pooling instruction; and
a queue storage submodule, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled max pooling instruction.
Clause I13. The apparatus according to Clause I12, wherein the control module further comprises:
a dependency processing submodule, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency on a zeroth instruction to be executed that precedes the first instruction, cache the first instruction in the instruction storage submodule, and, after the zeroth instruction has finished executing, fetch the first instruction from the instruction storage submodule and send it to the operation module,
wherein the first instruction to be executed having a dependency on the zeroth instruction to be executed that precedes it includes:
a first storage address range storing data required by the first instruction overlaps a zeroth storage address range storing data required by the zeroth instruction.
Clause I14. The apparatus according to Clause I1,
wherein the control module is further configured to generate an assembly file according to the max pooling instruction and translate the assembly file into a binary file,
the binary file being the compiled max pooling instruction.
Clause I15. A machine learning computing apparatus, comprising:
one or more max pooling instruction processing apparatuses according to any one of Clauses I1 to I14, configured to obtain data to be operated on and control information from other processing apparatuses, perform a specified machine learning operation, and pass the execution result to the other processing apparatuses through an I/O interface;
when the machine learning computing apparatus includes a plurality of the max pooling instruction processing apparatuses, the plurality of max pooling instruction processing apparatuses may be connected through a specific structure and transmit data;
wherein the plurality of max pooling instruction processing apparatuses are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of max pooling instruction processing apparatuses share the same control system or have their own control systems; the plurality of max pooling instruction processing apparatuses share memory or have their own memories; and the plurality of max pooling instruction processing apparatuses may be interconnected in any interconnection topology.
Clause I16. A combined processing apparatus, comprising:
the machine learning computing apparatus according to Clause I15, a universal interconnect interface, and other processing apparatuses;
the machine learning computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
wherein the combined processing apparatus further comprises a storage apparatus, connected to the machine learning computing apparatus and to the other processing apparatuses, and configured to store data of the machine learning computing apparatus and the other processing apparatuses.
Clause I17. A machine learning chip, comprising:
the machine learning computing apparatus according to Clause I15 or the combined processing apparatus according to Clause I16.
Clause I18. An electronic device, comprising:
the machine learning chip according to Clause I17.
Clause I19. A board card, comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause I17;
wherein the machine learning chip is connected to the storage device, the control device, and the interface apparatus respectively;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor a state of the machine learning chip.
Clause I20. A max pooling instruction processing method, applied to a max pooling instruction processing apparatus comprising a control module and an operation module, the method comprising:
compiling, by the control module, an obtained max pooling instruction to obtain a compiled max pooling instruction, parsing the compiled max pooling instruction to obtain an opcode and operand fields of the max pooling instruction, and obtaining, according to the opcode and the operand fields, the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction; and
performing, by the operation module, a max pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result, and storing the operation result at the destination address,
wherein the opcode indicates that the operation performed by the max pooling instruction on data is a max pooling operation, and the operand fields include an address of the data to be operated on, a pooling kernel address, and the destination address.
Clause I21. The method according to Clause I20, wherein performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result comprises:
using a plurality of comparators in the operation module to operate on a plurality of data to be operated on in the region covered by the pooling kernel to obtain the operation result.
Clause I22. The method according to Clause I21, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, and the master operation submodule comprises the plurality of comparators,
and performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result, and storing the operation result at the destination address, comprises:
using the plurality of comparators to compare a plurality of data to be operated on in the region covered by the pooling kernel to obtain the operation result, and storing the operation result at the destination address.
Clause I23. The method according to Clause I20, wherein the operand fields further include an input height and an input width,
and obtaining, according to the opcode and the operand fields, the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction comprises:
fetching, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.
Clause I24. The method according to Clause I20, wherein the operand fields further include a pooling kernel height and a pooling kernel width,
and obtaining, according to the opcode and the operand fields, the data to be operated on, the pooling kernel, and the destination address required to execute the max pooling instruction comprises:
fetching the pooling kernel from the pooling kernel address according to the pooling kernel height and the pooling kernel width.
Clause I25. The method according to Clause I20, wherein the operand fields further include a first stride,
and performing the max pooling operation on the data to be operated on according to the pooling kernel comprises:
moving the pooling kernel in the x direction by the first stride.
Clause I26. The method according to Clause I20, wherein the operand fields further include a second stride,
and performing the max pooling operation on the data to be operated on according to the pooling kernel comprises:
moving the pooling kernel in the y direction by the second stride.
Clause I27. The method according to Clause I20, wherein performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result includes:
moving the pooling kernel over the data to be operated on without overlap, and comparing the multiple data items in the region covered by the pooling kernel to obtain the operation result.
Clause I28. The method according to Clause I20, wherein performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result includes:
when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, performing the max pooling operation on the portion of the data to be operated on whose size is an integer multiple of the pooling kernel size,
wherein the size of the data to be operated on not being an integer multiple of the size of the pooling kernel includes at least one of the following: the input width of the data to be operated on is not an integer multiple of the pooling kernel width; the input height of the data to be operated on is not an integer multiple of the pooling kernel height.
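Clauses I27 and I28 together describe non-overlapping pooling that handles only the part of the input that forms whole windows. Below is a minimal sketch of that idea. The clauses do not say which rows and columns are dropped, so this sketch assumes the trailing remainder (bottom rows, right columns) is ignored; the function name is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def max_pool_non_overlapping(data: Matrix, kernel_h: int, kernel_w: int) -> Matrix:
    """Non-overlapping max pooling (stride equals kernel size).

    If the input height or width is not an integer multiple of the kernel
    height or width, only the leading rows/columns that form whole windows
    are pooled; the remainder is ignored (assumed behavior).
    """
    in_h, in_w = len(data), len(data[0])
    usable_h = (in_h // kernel_h) * kernel_h  # largest multiple of kernel_h <= in_h
    usable_w = (in_w // kernel_w) * kernel_w  # largest multiple of kernel_w <= in_w
    out: Matrix = []
    for top in range(0, usable_h, kernel_h):
        row = []
        for left in range(0, usable_w, kernel_w):
            row.append(max(data[top + i][left + j]
                           for i in range(kernel_h) for j in range(kernel_w)))
        out.append(row)
    return out


# 5x5 input with a 2x2 kernel: only the top-left 4x4 block is pooled.
data = [[r * 5 + c for c in range(5)] for r in range(5)]
print(max_pool_non_overlapping(data, 2, 2))  # [[6, 8], [16, 18]]
```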
Clause I29. The method according to Clause I20, wherein the operation field further includes a pooling kernel count,
and performing the max pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result includes:
performing the max pooling operation on the data to be operated on using a number of pooling kernels equal to the pooling kernel count.
Clause I30. The method according to Clause I20, further comprising:
storing the data to be operated on and the pooling kernel in a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the pooling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause I31. The method according to Clause I20, wherein parsing the obtained max pooling instruction to obtain the opcode and operation field of the max pooling instruction includes:
storing the compiled max pooling instruction;
parsing the compiled max pooling instruction to obtain the opcode and operation field of the max pooling instruction;
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled max pooling instruction.
Clause I32. The method according to Clause I31, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and after determining that execution of the zeroth instruction to be executed has completed, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed being associated with the preceding zeroth instruction to be executed includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
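The association test in Clause I32 reduces to an interval-overlap check on storage addresses. The sketch below applies it to an in-order queue. It simplifies under two stated assumptions: each instruction's data occupies one half-open interval `[start, end)`, and every earlier instruction is treated as still executing. The instruction names and addresses are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QueuedInstruction:
    name: str
    start: int  # first address of the data this instruction needs
    end: int    # one past the last address (half-open interval [start, end))


def intervals_overlap(a: QueuedInstruction, b: QueuedInstruction) -> bool:
    """Two half-open address intervals overlap iff each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def issue_decisions(queue: List[QueuedInstruction]) -> List[Tuple[str, str]]:
    """For each instruction, in execution order, decide whether it can be issued
    or must be held until the earlier instructions it overlaps with finish.

    Simplification: every earlier instruction is treated as still executing.
    """
    decisions = []
    for idx, current in enumerate(queue):
        blockers = [prev.name for prev in queue[:idx] if intervals_overlap(current, prev)]
        if blockers:
            decisions.append((current.name, "wait for " + ", ".join(blockers)))
        else:
            decisions.append((current.name, "issue"))
    return decisions


queue = [
    QueuedInstruction("maxpool0", 100, 200),
    QueuedInstruction("pad1", 150, 250),   # overlaps maxpool0 at [150, 200)
    QueuedInstruction("add2", 300, 400),   # disjoint from both
]
for name, decision in issue_decisions(queue):
    print(name, "->", decision)
```

Here `pad1` is held until `maxpool0` completes, while `add2` touches an unrelated address range and can proceed.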
Clause I33. The method according to Clause I20, wherein compiling the obtained max pooling instruction to obtain the compiled max pooling instruction includes:
generating an assembly file according to the max pooling instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled max pooling instruction.
Clause I34. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method according to any one of Clauses I20 to I33.
With the widespread use of neural network algorithms and the continuing growth in the computing capability of computer hardware, the variety and volume of data operations in practical applications keep increasing. Programming languages are diverse, and at present no padding instruction is broadly applicable across them. In the related art, developers must therefore write one or more custom instructions for each programming-language environment to implement a padding operation, which makes padding inefficient and slow. The present disclosure provides a padding instruction processing method and apparatus, a computer device, and a storage medium that implement the padding operation with a single instruction, significantly improving the efficiency and speed of padding operations.
FIG. 10-1 shows a block diagram of a padding instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 10-1, the apparatus includes a control module 9-11 and an operation module 9-12.
The control module 9-11 is configured to compile an obtained padding instruction to obtain a compiled padding instruction, parse the compiled padding instruction to obtain its opcode and operation field, and obtain, according to the opcode and operation field, the data to be operated on, the padding kernel, and the target address required to execute the padding instruction. The opcode indicates that the operation the padding instruction performs on the data is a padding operation, and the operation field includes the address of the data to be operated on, the padding kernel address, and the target address.
The operation module 9-12 is configured to perform a padding operation (pad) on the data to be operated on according to the padding kernel to obtain an operation result, and store the operation result at the target address.
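The text does not define the padding semantics or how the contents of the padding kernel are used. The sketch below therefore assumes conventional constant padding: padHeight rows are added above and below the data, padWidth columns are added on the left and right, and every added element takes a fill value (0 by default). The function name `pad_2d` and the `fill` parameter are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def pad_2d(data: Sequence[Sequence[float]], pad_h: int, pad_w: int,
           fill: float = 0) -> Matrix:
    """Surround `data` with pad_h rows above and below and pad_w columns
    on the left and right, all set to `fill` (assumed semantics)."""
    out_w = len(data[0]) + 2 * pad_w
    top_rows = [[fill] * out_w for _ in range(pad_h)]
    middle = [[fill] * pad_w + list(row) + [fill] * pad_w for row in data]
    bottom_rows = [[fill] * out_w for _ in range(pad_h)]
    return top_rows + middle + bottom_rows


print(pad_2d([[1, 2], [3, 4]], pad_h=1, pad_w=1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]

big = pad_2d([[1] * 32 for _ in range(64)], pad_h=2, pad_w=1)
print(len(big), len(big[0]))  # 68 34
```

Under this assumption, the 64×32 input with a 2×1 padding kernel from the application example would grow to 68×34.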
In this embodiment, the padding instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module first compiles it. The compiled padding instruction can be parsed only after it has been obtained. The compiled padding instruction is a hardware instruction that the hardware can execute directly. The control module can obtain the data to be operated on and the padding kernel from the address of the data to be operated on and the padding kernel address, respectively. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented as a code), that a computer program specifies for the operation to be performed. It is an instruction sequence number that tells the device executing the instruction which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, including the data to be operated on, parameters such as the padding kernel, and the corresponding operation method. A padding instruction must include an opcode and an operation field, and the operation field includes at least the address of the data to be operated on, the padding kernel address, and the target address.
It should be understood that those skilled in the art can set the instruction format of the padding instruction, and the opcode and operation field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and their numbers can be set according to actual needs; the present disclosure does not limit this. When the apparatus includes one control module, that control module can receive padding instructions and control one or more operation modules to perform padding operations. When the apparatus includes multiple control modules, each can receive padding instructions and control its corresponding one or more operation modules to perform padding operations.
The padding instruction processing apparatus provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained padding instruction to obtain a compiled padding instruction, parse the compiled padding instruction to obtain its opcode and operation field, and obtain, according to the opcode and operation field, the data to be operated on, the padding kernel, and the target address required to execute the padding instruction. The operation module is configured to perform the padding operation on the data to be operated on according to the padding kernel, obtain the operation result, and store it at the target address. The apparatus is widely applicable and processes padding instructions, and performs padding operations, efficiently and quickly.
FIG. 10-2a shows a block diagram of a padding instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 10-2a, the operation module 9-12 may include a plurality of comparators 9-120, which are configured to perform the padding operation on the data to be operated on according to the padding kernel.
In this implementation, the operation module may alternatively include a single comparator. The number of comparators can be set according to the amount of data to be padded and the required processing speed and efficiency of the padding operation; the present disclosure does not limit this.
FIG. 10-2b shows a block diagram of a padding instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 10-2b, the operation module 9-12 may include a master operation submodule 9-121 and a plurality of slave operation submodules 9-122, and the master operation submodule 9-121 includes a plurality of comparators 9-120 (not shown in the figure).
The master operation submodule 9-121 is configured to use the plurality of comparators 9-120 to perform the padding operation on the data to be operated on according to the padding kernel, obtain the operation result, and store it at the target address.
In one possible implementation, the operation field may further include an input height and an input width.
The control module is further configured to obtain, from the address of the data to be operated on, data to be operated on of the corresponding input width and input height.
In this implementation, the input height and input width define the amount and size of the data to be operated on that is obtained. The input height and input width in the operation field may be specific values, or they may be addresses at which the input height and input width are stored. When the operation field directly contains the values, those values are used as the input height and input width. When the operation field contains storage addresses, the input height and input width are read from those addresses, respectively.
In one possible implementation, when the operation field does not include the input height and/or the input width, the data to be operated on may be obtained according to a preset default input height and default input width.
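An operand field can take three forms: it can hold the value itself, hold the address where the value is stored, or be absent so that a preset default applies. The sketch below resolves each case. The `Addr` marker type, the memory dictionary, and the default values are hypothetical illustrations and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Addr:
    """Marks an operand field that holds a storage address, not the value itself."""
    address: int


Operand = Union[int, Addr]


def resolve(field: Optional[Operand], memory: Dict[int, int], default: int) -> int:
    if field is None:              # field absent from the instruction: use the preset default
        return default
    if isinstance(field, Addr):    # field is an address: read the value stored there
        return memory[field.address]
    return field                   # field is the value itself (an immediate)


memory = {40: 64}                        # hypothetical: input height 64 stored at address 40
DEFAULT_HEIGHT, DEFAULT_WIDTH = 16, 16   # hypothetical preset defaults
print(resolve(Addr(40), memory, DEFAULT_HEIGHT))  # 64
print(resolve(32, memory, DEFAULT_WIDTH))         # 32
print(resolve(None, memory, DEFAULT_WIDTH))       # 16
```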
In this way, the amount and size of the data to be operated on are bounded, which helps ensure the accuracy of the operation result and ensures that the apparatus can execute the padding instruction.
In one possible implementation, the operation field may further include a padding kernel height and a padding kernel width.
The control module 9-11 is further configured to obtain, from the padding kernel address, a padding kernel of the corresponding padding kernel height and padding kernel width.
In one possible implementation, when the operation field does not include the padding kernel height and padding kernel width, a preset default padding kernel height and default padding kernel width may be used, so that the control module and the operation module can still execute the padding instruction.
In one possible implementation, the operation field may further include a padding kernel count. The operation module 9-12 is further configured to perform the padding operation on the data to be operated on using a number of padding kernels equal to the padding kernel count.
In this implementation, the padding kernel count corresponds to the data to be operated on. For example, when the padding kernel count is 5, the data to be operated on can be divided into five parts, and five padding kernels perform the padding operation on the five parts respectively.
In this implementation, when the operation field does not include a padding kernel count, a single padding kernel is sufficient to complete the padding operation on the data to be operated on.
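The padding kernel count determines how the data is divided into parts, one per kernel. The text does not say along which axis the data is divided or how uneven sizes are handled. This sketch assumes contiguous row blocks whose sizes differ by at most one, and falls back to a single part when no count is given. Splitting along a channel axis would be an equally plausible reading. The function name is hypothetical.

```python
from typing import List, Optional, Sequence


def split_for_kernels(rows: Sequence, kernel_count: Optional[int] = None) -> List[list]:
    """Split `rows` into kernel_count contiguous parts whose sizes differ by at most one.

    When no kernel count is given, the whole input is handled by a single kernel.
    """
    n = 1 if kernel_count is None else kernel_count
    base, extra = divmod(len(rows), n)
    parts, start = [], 0
    for k in range(n):
        size = base + (1 if k < extra else 0)  # the first `extra` parts get one more row
        parts.append(list(rows[start:start + size]))
        start += size
    return parts


parts = split_for_kernels(list(range(64)), kernel_count=5)
print([len(p) for p in parts])                  # [13, 13, 13, 13, 12]
print(len(split_for_kernels(list(range(64)))))  # 1
```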
In one possible implementation, as shown in FIGS. 10-2a and 10-2b, the apparatus may further include a storage module 9-13, which is configured to store the data to be operated on and the padding kernel.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store the data to be operated on and the padding kernel, and the register can be used to store scalar data in the data to be operated on.
In one possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory described above, can be used to store neuron data in the data to be operated on, and the neuron data may include neuron vector data.
In one possible implementation, the apparatus may further include a direct memory access (DMA) module configured to read data from, or store data to, the storage module.
In one possible implementation, the control module 9-11 may further be configured to generate an assembly file according to the padding instruction and translate the assembly file into a binary file, the binary file being the compiled padding instruction.
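The step from assembly text to a binary file can be sketched with Python's standard `struct` module. The disclosure gives neither opcode values nor a bit layout. The layout used here is therefore entirely hypothetical: a 1-byte opcode (0x21 is an arbitrary choice), a 1-byte operand count, and each operand as an unsigned 32-bit little-endian integer. A disassembler is included so the round trip can be checked.

```python
import struct

# Hypothetical opcode table: the disclosure does not give the real binary opcode.
OPCODES = {"pad": 0x21}
MNEMONICS = {code: name for name, code in OPCODES.items()}


def assemble(line: str) -> bytes:
    """Translate one assembly line into a hypothetical binary encoding:
    1-byte opcode, 1-byte operand count, then each operand as an
    unsigned 32-bit little-endian integer."""
    mnemonic, *operands = line.split()
    values = [int(tok) for tok in operands]
    header = struct.pack("<BB", OPCODES[mnemonic], len(values))
    return header + struct.pack(f"<{len(values)}I", *values)


def disassemble(blob: bytes) -> str:
    """Inverse of assemble(), useful for checking the round trip."""
    opcode, count = struct.unpack_from("<BB", blob, 0)
    values = struct.unpack_from(f"<{count}I", blob, 2)
    return " ".join([MNEMONICS[opcode], *map(str, values)])


binary = assemble("pad 500 100 200 5 64 32 2 1")
print(len(binary))          # 34 bytes: 2-byte header + 8 operands x 4 bytes
print(disassemble(binary))  # pad 500 100 200 5 64 32 2 1
```

A real toolchain would pack fields into fixed-width bit slots rather than whole 32-bit words. The fixed-width words here keep the sketch easy to inspect.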
In one possible implementation, the instruction format of the padding instruction may be:
pad dst src0 src channel srcHeight srcWidth padHeight padWidth
Here, pad is the opcode of the padding instruction, and dst, src0, src, channel, srcHeight, srcWidth, padHeight, and padWidth are its operation fields: dst is the target address, src0 is the address of the data to be operated on, src is the padding kernel or the padding kernel address, channel is the padding kernel count, srcHeight is the input height of the data to be operated on, srcWidth is the input width of the data to be operated on, padHeight is the padding kernel height, and padWidth is the padding kernel width.
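Parsing splits an instruction into its opcode and named operation fields. The sketch below assumes the eight-operand order implied by the field descriptions and the worked example in the application section (dst, src0, src, channel, srcHeight, srcWidth, padHeight, padWidth). It treats the instruction as whitespace-separated text with integer operands, which is a simplification for illustration.

```python
from dataclasses import dataclass
from typing import Dict

# Assumed operand order, matching the worked example: pad 500 100 200 5 64 32 2 1
FIELDS = ("dst", "src0", "src", "channel",
          "srcHeight", "srcWidth", "padHeight", "padWidth")


@dataclass
class PadInstruction:
    opcode: str
    fields: Dict[str, int]


def parse_pad(text: str) -> PadInstruction:
    """Split a textual pad instruction into its opcode and operation fields."""
    opcode, *operands = text.split()
    if opcode != "pad":
        raise ValueError(f"not a pad instruction: {opcode!r}")
    if len(operands) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} operands, got {len(operands)}")
    return PadInstruction(opcode, dict(zip(FIELDS, map(int, operands))))


ins = parse_pad("pad 500 100 200 5 64 32 2 1")
print(ins.fields["dst"], ins.fields["src0"], ins.fields["src"])  # 500 100 200
print(ins.fields["padHeight"], ins.fields["padWidth"])           # 2 1
```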
It should be understood that those skilled in the art can set the opcode of the padding instruction, and the positions of the opcode and operation fields in the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the apparatus may be disposed in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the padding instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, provided the configuration conforms to the technical solutions of the present disclosure.
Application examples
The following application example according to an embodiment of the present disclosure takes "performing a padding operation with the padding instruction processing apparatus" as an exemplary application scenario, to help explain the flow of the padding instruction processing apparatus. Those skilled in the art should understand that the following application example is provided only to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 10-3 shows a schematic diagram of an application scenario of the padding instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 10-3, the padding instruction processing apparatus processes a padding instruction as follows:
The control module 9-11 compiles the obtained padding instruction 1 to obtain the compiled padding instruction 1 (for example, padding instruction 1 is pad 500 100 200 5 64 32 2 1), and parses the compiled padding instruction 1 to obtain its opcode and operation field. For padding instruction 1, the opcode is pad, the target address is 500, the address of the data to be operated on is 100, the padding kernel address is 200, the padding kernel count is 5, the input height is 64, the input width is 32, the padding kernel height is 2, and the padding kernel width is 1. The control module 9-11 obtains 64×32 data to be operated on from address 100 and a 2×1 padding kernel from padding kernel address 200.
The operation module 9-12 performs the padding operation on the data to be operated on using 5 padding kernels, obtains the operation result, and stores it at target address 500.
For the working process of each module above, refer to the related descriptions earlier in this section.
In this way, the padding instruction processing apparatus can process padding instructions, and perform padding operations, efficiently and quickly.
FIG. 10-4 shows a flowchart of a padding instruction processing method according to an embodiment of the present disclosure. The method can be applied to a device, such as computer equipment, that includes a memory and a processor. The memory stores data used while the method is performed, and the processor performs the related processing and operation steps, such as steps S51-9 and S52-9 below. As shown in FIG. 10-4, the method is applied to the padding instruction processing apparatus described above and includes step S51-9 and step S52-9.
In step S51-9, the control module compiles the obtained padding instruction to obtain a compiled padding instruction, parses the compiled padding instruction to obtain its opcode and operation field, and obtains, according to the opcode and operation field, the data to be operated on, the padding kernel, and the target address required to execute the padding instruction. The opcode indicates that the operation the padding instruction performs on the data is a padding operation, and the operation field includes the address of the data to be operated on, the padding kernel address, and the target address.
In step S52-9, the operation module performs the padding operation on the data to be operated on according to the padding kernel, obtains the operation result, and stores it at the target address.
In one possible implementation, performing the padding operation on the data to be operated on according to the padding kernel to obtain the operation result may include:
performing the padding operation on the data to be operated on according to the padding kernel using a plurality of comparators in the operation module.
In one possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and the master operation submodule includes a plurality of comparators. Step S52-9 may include:
using the plurality of comparators in the master operation submodule to perform the padding operation on the data to be operated on according to the padding kernel, obtain the operation result, and store it at the target address.
In one possible implementation, the operation field may further include an input height and an input width. Obtaining, according to the opcode and operation field, the data to be operated on, the padding kernel, and the target address required to execute the padding instruction may include:
obtaining, from the address of the data to be operated on, data to be operated on of the corresponding input width and input height.
In one possible implementation, the operation field may further include a padding kernel height and a padding kernel width.
Obtaining, according to the opcode and operation field, the data to be operated on, the padding kernel, and the target address required to execute the padding instruction may include: obtaining, from the padding kernel address, a padding kernel of the corresponding padding kernel height and padding kernel width.
In one possible implementation, the operation field may further include a padding kernel count. Performing the padding operation on the data to be operated on according to the padding kernel to obtain the operation result may include:
performing the padding operation on the data to be operated on using a number of padding kernels equal to the padding kernel count.
In one possible implementation, the method may further include: storing the data to be operated on and the padding kernel in a storage module of the apparatus,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the padding kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In one possible implementation, parsing the compiled padding instruction to obtain its opcode and operation field may include:
storing the compiled padding instruction;
parsing the compiled padding instruction to obtain its opcode and operation field;
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, which may include the compiled padding instruction.
In one possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and executing it after execution of the zeroth instruction to be executed has completed,
wherein the first instruction to be executed being associated with the preceding zeroth instruction to be executed includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
It should be noted that although the padding instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, provided the configuration conforms to the technical solutions of the present disclosure.
The padding instruction processing method provided by the embodiments of the present disclosure is widely applicable and processes padding instructions, and performs padding operations, efficiently and quickly.
The present disclosure further provides a non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the padding instruction processing method described above.
The foregoing can be better understood in light of the following clauses:
条款J1、一种填充指令处理装置,所述装置包括:Clause J1, a filling instruction processing device, the device comprising:
控制模块,用于对获取到的填充指令进行编译,得到编译后的填充指令,对所述编译后的填充指令进行解析,得到填充指令的操作码和操作域,所述操作码用于指示所述填充指令对数据所进行的运算为填充运算,所述操作域包括待运算数据地址、填充核地址和目标地址,并根据所述操作码和所述操作域获取执行填充指令所需的待运算数据、填充核和目标地址;The control module is used to compile the obtained filling instruction to obtain a compiled filling instruction, and parse the compiled filling instruction to obtain an operation code and an operation field of the filling instruction. The operation code is used to indicate The operation performed by the filling instruction on the data is a filling operation, and the operation domain includes a data address to be operated, a filling core address, and a target address, and obtains a pending operation required to execute the filling instruction according to the operation code and the operation domain Data, padding core and target address;
运算模块,用于根据所述填充核对所述待运算数据进行填充运算,得到运算结果,并将所述运算结果存入所述目标地址中。The operation module is configured to perform filling operation on the data to be operated according to the filling check, obtain an operation result, and store the operation result in the target address.
Clause J2. The device according to Clause J1, wherein the operation module includes:
a plurality of comparators, configured to perform the filling operation on the data to be operated on according to the filling kernel.
Clause J3. The device according to Clause J2, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule including the plurality of comparators, and
the master operation submodule is configured to use the plurality of comparators to perform the filling operation on the data to be operated on according to the filling kernel to obtain an operation result, and store the operation result at the target address.
Clause J4. The device according to Clause J1, wherein the operation domain further includes an input height and an input width, and
the control module is further configured to obtain, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.
Clause J5. The device according to Clause J1, wherein the operation domain further includes a filling kernel height and a filling kernel width, and
the control module is further configured to obtain, from the filling kernel address, the filling kernel corresponding to the filling kernel height and the filling kernel width.
Clause J6. The device according to Clause J1, wherein the operation domain further includes a number of filling kernels, and
the operation module is further configured to perform the filling operation on the data to be operated on using that number of filling kernels.
Clause J7. The device according to Clause J1, further comprising:
a storage module, configured to store the data to be operated on and the filling kernel,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the filling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause J8. The device according to Clause J1, wherein the control module includes:
an instruction storage submodule, configured to store the compiled filling instruction;
an instruction processing submodule, configured to parse the compiled filling instruction to obtain the opcode and the operation domain of the filling instruction; and
a queue storage submodule, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled filling instruction.
Clause J9. The device according to Clause J8, wherein the control module further includes:
a dependency processing submodule, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage submodule and, after execution of the zeroth instruction to be executed is completed, fetch the first instruction to be executed from the instruction storage submodule and send it to the operation module,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval, which stores the data required by the first instruction to be executed, overlapping a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
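The overlap condition in Clause J9 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that the data of each instruction occupies half-open address intervals [start, end); the class `PendingInstr` and the functions below are hypothetical names, not part of the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address interval [start, end)

@dataclass
class PendingInstr:
    # Hypothetical model of an instruction waiting in the instruction queue.
    name: str
    intervals: List[Interval] = field(default_factory=list)

def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if [a0, a1) and [b0, b1) share at least one address."""
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(first: PendingInstr, zeroth: PendingInstr) -> bool:
    """First depends on the earlier zeroth instruction if any data intervals overlap."""
    return any(intervals_overlap(x, y)
               for x in first.intervals for y in zeroth.intervals)

def blocking_instrs(first: PendingInstr,
                    unfinished: List[PendingInstr]) -> List[PendingInstr]:
    """Earlier, not-yet-completed instructions that `first` must wait for.

    While this list is non-empty, `first` stays cached in the instruction
    storage submodule; once it is empty, `first` can be sent to the
    operation module.
    """
    return [z for z in unfinished if has_dependency(first, z)]

zeroth = PendingInstr("fill", [(0, 256)])
first = PendingInstr("transpose", [(128, 384)])
print([z.name for z in blocking_instrs(first, [zeroth])])  # ['fill']
```

Treating the intervals as half-open means two instructions whose regions merely touch (one ends where the other begins) are not considered dependent.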
Clause J10. The device according to Clause J1, wherein
the control module is further configured to generate an assembly file according to the filling instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled filling instruction.
Clause J11. A machine learning operation device, comprising:
one or more filling instruction processing devices according to any one of Clauses J1 to J10, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and transfer the execution result to the other processing devices through an I/O interface;
wherein, when the machine learning operation device includes a plurality of the filling instruction processing devices, the plurality of filling instruction processing devices may be connected to one another and transfer data through a specific structure;
wherein the plurality of filling instruction processing devices are interconnected and transfer data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of filling instruction processing devices share the same control system or have their own control systems; the plurality of filling instruction processing devices share memory or have their own memories; and the plurality of filling instruction processing devices are interconnected in an arbitrary interconnection topology.
Clause J12. A combined processing device, comprising:
the machine learning operation device according to Clause J11, a universal interconnection interface, and other processing devices;
wherein the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
and wherein the combined processing device further includes a storage device, which is connected to the machine learning operation device and to the other processing devices respectively, and is configured to store data of the machine learning operation device and the other processing devices.
Clause J13. A machine learning chip, comprising:
the machine learning operation device according to Clause J11 or the combined processing device according to Clause J12.
Clause J14. An electronic device, comprising:
the machine learning chip according to Clause J13.
Clause J15. A board card, comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause J13;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause J16. A filling instruction processing method, applied to a filling instruction processing device that includes a control module and an operation module, the method comprising:
compiling, by the control module, an obtained filling instruction to obtain a compiled filling instruction, parsing the compiled filling instruction to obtain an opcode and an operation domain of the filling instruction, and obtaining, according to the opcode and the operation domain, the data to be operated on, the filling kernel, and the target address required to execute the filling instruction; and
performing, by the operation module, a filling operation on the data to be operated on according to the filling kernel to obtain an operation result, and storing the operation result at the target address,
wherein the opcode indicates that the operation performed by the filling instruction on the data is a filling operation, and the operation domain includes the address of the data to be operated on, a filling kernel address, and the target address.
Clause J17. The method according to Clause J16, wherein performing the filling operation on the data to be operated on according to the filling kernel to obtain the operation result includes:
performing, by a plurality of comparators in the operation module, the filling operation on the data to be operated on according to the filling kernel to obtain the operation result.
Clause J18. The method according to Clause J17, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule including the plurality of comparators,
and wherein performing the filling operation on the data to be operated on according to the filling kernel to obtain the operation result, and storing the operation result at the target address, includes:
performing, by the plurality of comparators in the master operation submodule, the filling operation on the data to be operated on according to the filling kernel to obtain the operation result, and storing the operation result at the target address.
Clause J19. The method according to Clause J16, wherein the operation domain further includes an input height and an input width,
and wherein obtaining, according to the opcode and the operation domain, the data to be operated on, the filling kernel, and the target address required to execute the filling instruction includes:
obtaining, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.
Clause J20. The method according to Clause J16, wherein the operation domain further includes a filling kernel height and a filling kernel width,
and wherein obtaining, according to the opcode and the operation domain, the data to be operated on, the filling kernel, and the target address required to execute the filling instruction includes:
obtaining, from the filling kernel address, the filling kernel corresponding to the filling kernel height and the filling kernel width.
Clause J21. The method according to Clause J16, wherein the operation domain further includes a number of filling kernels,
and wherein performing the filling operation on the data to be operated on according to the filling kernel to obtain the operation result includes:
performing the filling operation on the data to be operated on using that number of filling kernels.
Clause J22. The method according to Clause J16, further comprising:
storing the data to be operated on and the filling kernel in a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the filling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause J23. The method according to Clause J16, wherein parsing the compiled filling instruction to obtain the opcode and the operation domain of the filling instruction includes:
storing the compiled filling instruction;
parsing the compiled filling instruction to obtain the opcode and the operation domain of the filling instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled filling instruction.
Clause J24. The method according to Clause J23, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after it is determined that execution of the zeroth instruction to be executed is completed, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval, which stores the data required by the first instruction to be executed, overlapping a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause J25. The method according to Clause J16, wherein compiling the obtained filling instruction to obtain the compiled filling instruction includes:
generating an assembly file according to the filling instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled filling instruction.
Clause J26. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method according to any one of Clauses J16 to J25.
With the widespread use of neural network algorithms and the continuous improvement of computer hardware computing capability, the variety and number of data operations involved in practical applications keep growing. Programming languages are diverse, and in the related art there is at present no matrix transposition instruction that is broadly applicable across them, so technicians must define multiple instructions specific to their own programming language environment to implement a matrix transposition operation, which makes matrix transposition inefficient and slow. The present disclosure provides a matrix transposition instruction processing method and device, a computer device, and a storage medium, with which a matrix transposition operation can be implemented by a single instruction, significantly improving the efficiency and speed of matrix transposition.
FIG. 11-1 shows a block diagram of a matrix transposition instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 11-1, the device includes a control module 10-11 and an operation module 10-12.
The control module 10-11 is configured to compile an obtained matrix transposition instruction to obtain a compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain an opcode and an operation domain of the matrix transposition instruction, and obtain, according to the opcode and the operation domain, the data to be operated on, the target address, and the input height and input width of the data to be operated on, as required to execute the matrix transposition instruction. The opcode indicates that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes the address of the data to be operated on, the input height, the input width, and the target address.
The operation module 10-12 is configured to perform a matrix transposition operation on the data to be operated on according to the input height and the input width to obtain transposed data, and store the transposed data at the target address.
In this embodiment, the matrix transposition instruction obtained by the control module is an uncompiled software instruction that cannot be executed directly by hardware, so the control module must first compile it. Only after the compiled matrix transposition instruction is obtained can it be parsed. The compiled matrix transposition instruction is a hardware instruction that can be executed directly by hardware.
In this embodiment, the control module can obtain the data to be operated on from the address of the data to be operated on. The operation domain may contain the input height and input width themselves, or the storage addresses at which the input height and input width of the data to be operated on are stored. When the operation domain directly contains specific values for the input height and input width, those values are taken as the input height and input width. When the operation domain contains the storage addresses of the input height and input width, they are read from the corresponding storage addresses. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
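The two ways of supplying the input height and width can be sketched as follows. This minimal Python illustration assumes that a boolean flag tells the decoder whether a field is an immediate value or an address; how the real instruction encodes that distinction is not specified here, and `memory` is a hypothetical stand-in for on-chip storage.

```python
from typing import Mapping, Tuple

def resolve_field(value: int, is_address: bool, memory: Mapping[int, int]) -> int:
    """Return an immediate field as-is, or load the value stored at that address."""
    return memory[value] if is_address else value

def resolve_shape(height_field: int, width_field: int, is_address: bool,
                  memory: Mapping[int, int]) -> Tuple[int, int]:
    """Resolve the input height and input width from the operation domain."""
    return (resolve_field(height_field, is_address, memory),
            resolve_field(width_field, is_address, memory))

# Immediate values: the fields are the shape itself.
print(resolve_shape(64, 32, False, {}))             # (64, 32)
# Addresses: the fields point at where the shape is stored.
print(resolve_shape(8, 12, True, {8: 64, 12: 32}))  # (64, 32)
```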
In this embodiment, the opcode may be the part of an instruction, or the field (usually represented by a code), specified in a computer program that designates the operation to be performed; it serves as an instruction serial number that tells the executing device which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including the data to be operated on, parameters such as its input height and input width, the corresponding operation method, and so on. A matrix transposition instruction must include an opcode and an operation domain, where the operation domain includes at least the address of the data to be operated on, the input height, the input width, and the target address.
It should be understood that those skilled in the art may set the instruction format of the matrix transposition instruction, as well as the opcode and operation domain it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that control module can receive the matrix transposition instruction and control one or more operation modules to perform the matrix transposition operation. When the device includes multiple control modules, each control module can receive matrix transposition instructions and control its corresponding one or more operation modules to perform matrix transposition operations.
The matrix transposition instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained matrix transposition instruction to obtain a compiled matrix transposition instruction, parse the compiled instruction to obtain its opcode and operation domain, and obtain, according to the opcode and operation domain, the data to be operated on, the target address, and the input height and input width of the data to be operated on, as required to execute the instruction. The operation module is configured to perform a matrix transposition operation on the data to be operated on according to the input height and input width to obtain transposed data, and store the transposed data at the target address. The device has a wide range of applications, and it both processes matrix transposition instructions and performs matrix transposition operations efficiently and quickly.
FIG. 11-2a shows a block diagram of a matrix transposition instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 11-2a, the operation module 10-12 may include a plurality of matrix transposition operators 10-120, configured to perform the matrix transposition operation on the data to be operated on according to the input height and input width. The height of the transposed data equals the input width, and the width of the transposed data equals the input height.
In this implementation, the operation module may alternatively include a single matrix transposition operator. The number of matrix transposition operators can be set according to requirements such as the amount of data to be transposed and the desired processing speed and efficiency of the matrix transposition operation; the present disclosure does not limit this.
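The shape relation stated above (output height equals input width, output width equals input height) can be checked with a reference transpose. This Python sketch assumes the data is stored row-major in a flat buffer, which is an illustrative assumption; it models only the result, not how the matrix transposition operators 10-120 divide the work among themselves.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def transpose_row_major(data: Sequence[T], height: int, width: int) -> List[T]:
    """Transpose a height x width matrix stored row-major in a flat sequence.

    The result is a flat row-major list of shape width x height.
    """
    if len(data) != height * width:
        raise ValueError("data length must equal height * width")
    out: List[T] = [None] * (height * width)  # placeholder, fully overwritten
    for r in range(height):
        for c in range(width):
            # Input element (r, c) becomes output element (c, r);
            # each output row has `height` elements.
            out[c * height + r] = data[r * width + c]
    return out

# 2 x 3 input [[1, 2, 3], [4, 5, 6]] -> 3 x 2 output [[1, 4], [2, 5], [3, 6]]
print(transpose_row_major([1, 2, 3, 4, 5, 6], 2, 3))  # [1, 4, 2, 5, 3, 6]
```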
FIG. 11-2b shows a block diagram of a matrix transposition instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 11-2b, the operation module 10-12 may include a master operation submodule 10-121 and a plurality of slave operation submodules 10-122, where the master operation submodule 10-121 includes a plurality of matrix transposition operators 10-120 (not shown in the figure).
The master operation submodule 10-121 is configured to use the plurality of matrix transposition operators 10-120 to perform the matrix transposition operation on the data to be operated on according to the input height and input width to obtain transposed data, and store the transposed data at the target address.
In one possible implementation, as shown in FIGS. 11-2a and 11-2b, the device may further include a storage module 10-13, configured to store the data to be operated on.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a high-speed scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache can be used to store the data to be operated on, and the register can be used to store scalar data in the data to be operated on.
In one possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory described above, can be used to store neuron data in the data to be operated on, and the neuron data may include neuron vector data.
In one possible implementation, the device may further include a direct memory access (DMA) module, configured to read data from or store data to the storage module.
In one possible implementation, the control module 10-11 may further be configured to generate an assembly file according to the matrix transposition instruction and translate the assembly file into a binary file, where the binary file is the compiled matrix transposition instruction.
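As a rough illustration of the assembly-to-binary step, the Python sketch below packs one assembly line into a fixed-width binary word. The opcode value 0x2A and the layout (one opcode byte followed by four unsigned 32-bit little-endian operands) are purely hypothetical; the disclosure does not give the real encoding.

```python
import struct

# Hypothetical opcode table; the actual encoding is not specified in the disclosure.
OPCODES = {"transpose": 0x2A}
# Assumed layout: 1-byte opcode + four unsigned 32-bit little-endian operands.
LAYOUT = "<B4I"

def assemble(line: str) -> bytes:
    """Translate 'transpose dst src srcHeight srcWidth' into binary."""
    mnemonic, *operands = line.split()
    if mnemonic not in OPCODES or len(operands) != 4:
        raise ValueError(f"unsupported instruction: {line!r}")
    return struct.pack(LAYOUT, OPCODES[mnemonic], *(int(x) for x in operands))

def disassemble(word: bytes) -> tuple:
    """Inverse of assemble(), useful for checking the round trip."""
    return struct.unpack(LAYOUT, word)

binary = assemble("transpose 500 100 64 32")
print(len(binary))          # 17
print(disassemble(binary))  # (42, 500, 100, 64, 32)
```

A fixed-width layout keeps decoding simple: the hardware can slice every field at a known offset without scanning for separators.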
In one possible implementation, the instruction format of the matrix transposition instruction may be:
transpose dst src srcHeight srcWidth
Here, transpose is the opcode of the matrix transposition instruction, and dst, src, srcHeight, and srcWidth form its operation domain: dst is the target address, src is the address of the data to be operated on, srcHeight is the input height of the data to be operated on, and srcWidth is its input width.
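Splitting this textual format into the opcode and the four operation-domain fields can be sketched in Python as below. The `TransposeInstr` type and `parse_transpose` function are hypothetical names for illustration, and operands are assumed to be decimal integers, as in the application example of FIG. 11-3.

```python
from typing import NamedTuple

class TransposeInstr(NamedTuple):
    opcode: str
    dst: int         # target address
    src: int         # address of the data to be operated on
    src_height: int  # input height
    src_width: int   # input width

def parse_transpose(text: str) -> TransposeInstr:
    """Split 'transpose dst src srcHeight srcWidth' into opcode and operation domain."""
    tokens = text.split()
    if len(tokens) != 5 or tokens[0] != "transpose":
        raise ValueError(f"not a transpose instruction: {text!r}")
    dst, src, height, width = (int(t) for t in tokens[1:])
    return TransposeInstr(tokens[0], dst, src, height, width)

print(parse_transpose("transpose 500 100 64 32"))
```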
It should be understood that those skilled in the art may set the opcode of the matrix transposition instruction, and the positions of the opcode and operation domain within the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the matrix transposition instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, provided that the technical solutions of the present disclosure are satisfied.
Application example
The following takes "performing a matrix transposition operation with the matrix transposition instruction processing device" as an exemplary application scenario and gives an application example according to an embodiment of the present disclosure, to make the flow of the matrix transposition instruction processing device easier to understand. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
图11-3示出根据本公开一实施例的矩阵转置指令处理装置的应用场景的示意图。如图11-3所示,矩阵转置指令处理装置对矩阵转置指令进行处理的过程如下:11-3 shows a schematic diagram of an application scenario of a matrix transposition instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 11-3, the matrix transposition instruction processing device processes the matrix transposition instruction as follows:
控制模块10-11对获取到的矩阵转置指令1进行编译,得到编译后的矩阵转置指令1(如矩阵转置指令1为transpose 500 100 64 32)。对编译后的矩阵转置指令1进行解析,得到矩阵转置指令1的操作码和操作域。其中,矩阵转置指令1的操作码为transpose,目标地址为500,待运算数据地址为100,输入高度为64,输入宽度为32。控制模块10-11从待运算数据地址100中获取64×32的待运算数据。The control module 10-11 compiles the obtained matrix transposition instruction 1 to obtain the compiled matrix transposition instruction 1 (for example, the matrix transposition instruction 1 is transpose 500, 100, 64, 32). The compiled matrix transposition instruction 1 is analyzed to obtain the operation code and operation domain of the matrix transposition instruction 1. The operation code of the matrix transposition instruction 1 is transpose, the target address is 500, the data address to be calculated is 100, the input height is 64, and the input width is 32. The control module 10-11 acquires 64 × 32 data to be calculated from the data address 100 to be calculated.
运算模块10-12对待运算数据进行矩阵转置运算,得到32×64的转置数据,并将转置数据存入目标地址500中。The operation module 10-12 performs matrix transposition operation on the data to be operated to obtain 32 × 64 transposed data, and stores the transposed data in the target address 500.
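To make the data movement in this example concrete, here is a minimal Python sketch that simulates the two steps. It assumes a flat, element-addressed memory (a Python list), row-major storage, and the field order "target, source, height, width" taken from the example instruction; the helper names parse_instruction and execute_transpose are hypothetical, not part of the disclosure.

```python
# Hypothetical simulation: memory is a flat Python list indexed by element,
# matrices are stored row-major, and the field order follows the example
# instruction "transpose <target> <source> <height> <width>".

def parse_instruction(text):
    """Split the instruction text into its opcode and operation-domain fields."""
    opcode, *domain = text.split()
    dst, src, height, width = (int(f) for f in domain)
    return opcode, {"dst": dst, "src": src, "height": height, "width": width}


def execute_transpose(memory, fields):
    """Transpose a height x width row-major matrix at src into a
    width x height row-major matrix at dst."""
    h, w = fields["height"], fields["width"]
    src, dst = fields["src"], fields["dst"]
    # Snapshot the input first: with element-granular addresses the
    # 64 x 32 source at 100 and the destination at 500 overlap.
    data = memory[src:src + h * w]
    for i in range(h):
        for j in range(w):
            # input element (i, j) becomes output element (j, i)
            memory[dst + j * h + i] = data[i * w + j]


memory = [0] * 4096
for k in range(64 * 32):  # fill the 64 x 32 input with distinct values
    memory[100 + k] = k

opcode, fields = parse_instruction("transpose 500 100 64 32")
execute_transpose(memory, fields)
print(opcode, memory[500], memory[501], memory[564])  # transpose 0 32 1
```

Because addresses 100 and 500 would overlap if they counted single elements, the sketch copies the source before writing; real hardware may use different address units or separate storage regions.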
For the working process of each of the above modules, refer to the relevant description above.
In this way, the matrix transposition instruction processing device can process matrix transposition instructions efficiently and quickly, and performs matrix transposition operations with high efficiency and speed.
FIG. 11-4 shows a flowchart of a matrix transposition instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment or the like that includes a memory and a processor, where the memory stores the data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-10 and S52-10 below. As shown in FIG. 11-4, the method is applied to the above matrix transposition instruction processing device and includes steps S51-10 and S52-10.
In step S51-10, the control module compiles the acquired matrix transposition instruction to obtain a compiled matrix transposition instruction, parses the compiled instruction to obtain its opcode and operation domain, and, according to the opcode and operation domain, obtains the data to be operated on, the target address, and the input height and input width of that data, as required to execute the matrix transposition instruction. The opcode indicates that the operation the matrix transposition instruction performs on the data is a matrix transposition operation, and the operation domain includes the address of the data to be operated on, the input height, the input width, and the target address.
In step S52-10, the operation module performs a matrix transposition operation on the data to be operated on according to the input height and input width to obtain transposed data, and stores the transposed data at the target address.
In one possible implementation, performing the matrix transposition operation on the data to be operated on according to the input height and input width to obtain the transposed data may include:
Using multiple matrix transposers in the operation module to perform the matrix transposition operation on the data to be operated on according to the input height and input width. The height of the transposed data equals the input width, and the width of the transposed data equals the input height.
In one possible implementation, the operation module includes a master operation submodule and multiple slave operation submodules, and the master operation submodule includes the multiple matrix transposers. Step S52-10 may then include:
Using the multiple matrix transposers in the master operation submodule to perform the matrix transposition operation on the data to be operated on according to the input height and input width, obtaining the transposed data, and storing the transposed data at the target address.
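One plausible way to spread a transposition across several transposers is to cut the input into tiles and give each transposer a share of them. The Python sketch below assumes a list-of-lists matrix, a square tile size, and round-robin tile assignment; none of these details come from the disclosure, and the function names are hypothetical.

```python
# Hypothetical sketch: each "transposer" handles square tiles of the input,
# and tiles are assigned round-robin. The tile size and the number of
# transposers are illustrative parameters, not values from the disclosure.

def transpose_tile(matrix, r0, c0, tile, out):
    """Work done by one transposer: copy the tile at (r0, c0) into its transposed position."""
    h, w = len(matrix), len(matrix[0])
    for i in range(r0, min(r0 + tile, h)):
        for j in range(c0, min(c0 + tile, w)):
            out[j][i] = matrix[i][j]


def tiled_transpose(matrix, num_transposers=4, tile=8):
    """Transpose an h x w list-of-lists into a w x h result, tile by tile."""
    h, w = len(matrix), len(matrix[0])
    out = [[None] * h for _ in range(w)]  # transposed height = input width
    tiles = [(r, c) for r in range(0, h, tile) for c in range(0, w, tile)]
    for t in range(num_transposers):
        # Tiles never overlap, so in hardware the transposers could run in
        # parallel; here they simply run one after another.
        for r0, c0 in tiles[t::num_transposers]:
            transpose_tile(matrix, r0, c0, tile, out)
    return out


a = [[r * 32 + c for c in range(32)] for r in range(64)]
t = tiled_transpose(a)
print(len(t), len(t[0]))  # 32 64
```

Tiling keeps each transposer's working set small and independent of the others, which is why it is a common choice for splitting transposition work; whether the device partitions work this way is not stated in the source.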
In one possible implementation, the method may further include: storing the data to be operated on using a storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache is used to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is used to store scalar data in the data to be operated on;
the neuron cache is used to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In one possible implementation, parsing the compiled matrix transposition instruction to obtain the opcode and operation domain of the matrix transposition instruction includes:
storing the compiled matrix transposition instruction;
parsing the compiled matrix transposition instruction to obtain the opcode and operation domain of the matrix transposition instruction; and
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, which may include the compiled matrix transposition instruction.
In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and executing it after the zeroth to-be-executed instruction has finished executing,
where the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
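The association check reduces to an interval-overlap test. The following Python sketch assumes each queued instruction carries the half-open [start, end) address intervals of the data it needs (a hypothetical representation) and reports, for each instruction, which earlier instructions it would have to wait for. The class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PendingInstr:
    name: str
    # Half-open [start, end) address intervals holding the data this
    # instruction needs (hypothetical representation).
    intervals: list


def intervals_overlap(a, b):
    """True if half-open intervals a and b share at least one address."""
    return a[0] < b[1] and b[0] < a[1]


def is_associated(first, zeroth):
    """The later ('first') instruction is associated with an earlier
    ('zeroth') one when any of their storage address intervals overlap."""
    return any(intervals_overlap(x, y)
               for x in first.intervals for y in zeroth.intervals)


def blocking_predecessors(queue):
    """For each instruction in execution order, list the earlier instructions
    that must finish before it can be dispatched."""
    return {
        instr.name: [prev.name for prev in queue[:i] if is_associated(instr, prev)]
        for i, instr in enumerate(queue)
    }


queue = [
    PendingInstr("instr0", [(100, 2148), (500, 2548)]),
    PendingInstr("instr1", [(3000, 3100)]),
    PendingInstr("instr2", [(2500, 2600), (4000, 4100)]),
]
print(blocking_predecessors(queue))
# {'instr0': [], 'instr1': [], 'instr2': ['instr0']}
```

Here instr2 is held back only because its interval [2500, 2600) touches instr0's [500, 2548); instr1 touches nothing earlier and could be dispatched immediately.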
In one possible implementation, compiling the acquired matrix transposition instruction to obtain the compiled matrix transposition instruction may include:
generating an assembly file according to the matrix transposition instruction and translating the assembly file into a binary file, where the binary file is the compiled matrix transposition instruction.
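The disclosure does not define the binary format, so the following is only an illustration of the two-stage translation (instruction to assembly text, assembly text to binary). It assumes a made-up opcode value and a fixed layout of a 1-byte opcode followed by four 32-bit operands, and returns bytes rather than writing an actual file. OPCODES, LAYOUT and the three functions are hypothetical.

```python
import struct

# Hypothetical opcode table and binary layout: the disclosure does not
# specify the actual instruction encoding.
OPCODES = {"transpose": 0x21}
LAYOUT = "<B4I"  # 1-byte opcode + four little-endian unsigned 32-bit operands


def to_assembly(dst, src, height, width):
    """Emit the assembly text for one matrix transposition instruction."""
    return f"transpose {dst} {src} {height} {width}"


def assemble(line):
    """Translate one assembly line into its binary form."""
    mnemonic, *operands = line.split()
    return struct.pack(LAYOUT, OPCODES[mnemonic], *(int(x) for x in operands))


def disassemble(blob):
    """Inverse of assemble(), handy for checking the encoding."""
    opcode, *operands = struct.unpack(LAYOUT, blob)
    mnemonic = {v: k for k, v in OPCODES.items()}[opcode]
    return " ".join([mnemonic, *map(str, operands)])


binary = assemble(to_assembly(500, 100, 64, 32))
print(len(binary), disassemble(binary))  # 17 transpose 500 100 64 32
```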
It should be noted that although the matrix transposition instruction processing method has been described above using the foregoing embodiment as an example, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
The matrix transposition instruction processing method provided by the embodiments of the present disclosure has a wide range of applications, processes matrix transposition instructions efficiently and quickly, and performs matrix transposition operations efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause K1. A matrix transposition instruction processing device, the device including:
a control module configured to compile an acquired matrix transposition instruction to obtain a compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain an opcode and an operation domain of the matrix transposition instruction, and obtain, according to the opcode and the operation domain, the data to be operated on, the target address, and the input height and input width of the data to be operated on that are required to execute the matrix transposition instruction, where the opcode indicates that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes the address of the data to be operated on, the input height, the input width, and the target address;
an operation module configured to perform a matrix transposition operation on the data to be operated on according to the input height and the input width to obtain transposed data, and store the transposed data at the target address.
Clause K2. The device according to Clause K1, where the operation module includes:
multiple matrix transposers configured to perform the matrix transposition operation on the data to be operated on according to the input height and the input width,
where the height of the transposed data equals the input width, and the width of the transposed data equals the input height.
Clause K3. The device according to Clause K2, where the operation module includes a master operation submodule and multiple slave operation submodules, and the master operation submodule includes the multiple matrix transposers;
the master operation submodule is configured to use the multiple matrix transposers to perform the matrix transposition operation on the data to be operated on according to the input height and the input width to obtain the transposed data, and store the transposed data at the target address.
Clause K4. The device according to Clause K1, the device further including:
a storage module configured to store the data to be operated on,
where the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause K5. The device according to Clause K1, where the control module includes:
an instruction storage submodule configured to store the compiled matrix transposition instruction;
an instruction processing submodule configured to parse the compiled matrix transposition instruction to obtain the opcode and operation domain of the matrix transposition instruction;
a queue storage submodule configured to store an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled matrix transposition instruction.
Clause K6. The device according to Clause K5, where the control module further includes:
a dependency processing submodule configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, cache the first to-be-executed instruction in the instruction storage submodule and, after the zeroth to-be-executed instruction has finished executing, fetch the first to-be-executed instruction from the instruction storage submodule and send it to the operation module,
where the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause K7. The device according to Clause K1,
where the control module is further configured to generate an assembly file according to the matrix transposition instruction and translate the assembly file into a binary file,
the binary file being the compiled matrix transposition instruction.
Clause K8. A machine learning operation device, the device including:
one or more matrix transposition instruction processing devices according to any one of Clauses K1 to K7, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning operation device includes multiple matrix transposition instruction processing devices, the multiple matrix transposition instruction processing devices may be connected to one another and transmit data through a specific structure;
where the multiple matrix transposition instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the multiple matrix transposition instruction processing devices share the same control system or have their own control systems; the multiple matrix transposition instruction processing devices share memory or have their own memories; and the multiple matrix transposition instruction processing devices are interconnected in an arbitrary interconnection topology.
Clause K9. A combined processing device, the combined processing device including:
the machine learning operation device according to Clause K8, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
where the combined processing device further includes a storage device connected to the machine learning operation device and to the other processing devices, respectively, and configured to store data of the machine learning operation device and of the other processing devices.
Clause K10. A machine learning chip, the machine learning chip including:
the machine learning operation device according to Clause K8 or the combined processing device according to Clause K9.
Clause K11. An electronic device, the electronic device including:
the machine learning chip according to Clause K10.
Clause K12. A board card, the board card including a storage component, an interface device, a control component, and the machine learning chip according to Clause K10;
where the machine learning chip is connected to the storage component, the control component, and the interface device, respectively;
the storage component is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device;
the control component is configured to monitor the state of the machine learning chip.
Clause K13. A matrix transposition instruction processing method, the method being applied to a matrix transposition instruction processing device that includes a control module and an operation module, the method including:
using the control module to compile an acquired matrix transposition instruction to obtain a compiled matrix transposition instruction, parse the compiled matrix transposition instruction to obtain an opcode and an operation domain of the matrix transposition instruction, and obtain, according to the opcode and the operation domain, the data to be operated on, the target address, and the input height and input width of the data to be operated on that are required to execute the matrix transposition instruction;
using the operation module to perform a matrix transposition operation on the data to be operated on according to the input height and the input width to obtain transposed data, and store the transposed data at the target address,
where the opcode indicates that the operation performed by the matrix transposition instruction on the data is a matrix transposition operation, and the operation domain includes the address of the data to be operated on, the input height, the input width, and the target address.
Clause K14. The method according to Clause K13, where performing the matrix transposition operation on the data to be operated on according to the input height and the input width to obtain transposed data includes:
using multiple matrix transposers in the operation module to perform the matrix transposition operation on the data to be operated on according to the input height and the input width,
where the height of the transposed data equals the input width, and the width of the transposed data equals the input height.
Clause K15. The method according to Clause K14, where the operation module includes a master operation submodule and multiple slave operation submodules, and the master operation submodule includes the multiple matrix transposers,
where performing the matrix transposition operation on the data to be operated on according to the input height and the input width to obtain transposed data, and storing the transposed data at the target address, includes:
using the multiple matrix transposers in the master operation submodule to perform the matrix transposition operation on the data to be operated on according to the input height and the input width to obtain the transposed data, and storing the transposed data at the target address.
Clause K16. The method according to Clause K13, the method further including:
storing the data to be operated on using a storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause K17. The method according to Clause K13, where parsing the compiled matrix transposition instruction to obtain the opcode and operation domain of the matrix transposition instruction includes:
storing the compiled matrix transposition instruction;
parsing the compiled matrix transposition instruction to obtain the opcode and operation domain of the matrix transposition instruction;
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled matrix transposition instruction.
Clause K18. The method according to Clause K17, the method further including:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and, after it is determined that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,
where the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause K19. The method according to Clause K13, where compiling the acquired matrix transposition instruction to obtain the compiled matrix transposition instruction includes:
generating an assembly file according to the matrix transposition instruction and translating the assembly file into a binary file,
where the binary file is the compiled matrix transposition instruction.
Clause K20. A non-volatile computer-readable storage medium having computer program instructions stored thereon, where the computer program instructions, when executed by a processor, implement the method according to any one of Clauses K13 to K19.
With the widespread use of neural network algorithms and the continuous improvement in the computing capability of computer hardware, the types and volume of data operations involved in practical applications keep increasing. Average pooling computes the mean of all data within a local region. Because programming languages are diverse and there is currently no average pooling instruction that applies broadly across them, technicians in the related art must define multiple instructions specific to their own programming language environment to implement the average pooling operation, which makes average pooling inefficient and slow. The present disclosure provides an average pooling instruction processing method and device, computer equipment, and a storage medium, with which the average pooling operation can be implemented with a single instruction, significantly improving the efficiency and speed of average pooling operations.
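For reference, a common formulation of 2-D average pooling is given below. The disclosure only speaks of a "local region"; the window size k_h × k_w and strides s_h, s_w are assumed symbols, not parameters named in the source. For example, a 2 × 2 window over the values 1, 2, 5, 6 yields (1 + 2 + 5 + 6) / 4 = 3.5.

```latex
y_{i,j} = \frac{1}{k_h k_w} \sum_{m=0}^{k_h-1} \sum_{n=0}^{k_w-1} x_{i s_h + m,\; j s_w + n}
```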
FIG. 12-1 shows a block diagram of an average pooling instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 12-1, the device includes a control module 11-11 and an operation module 11-12.
The control module 11-11 is configured to compile an acquired average pooling instruction to obtain a compiled average pooling instruction, parse the compiled average pooling instruction to obtain its opcode and operation domain, and obtain, according to the opcode and the operation domain, the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction. The opcode indicates that the operation performed by the average pooling instruction on the data is an average pooling operation, and the operation domain includes the address of the data to be operated on, the pooling kernel address, and the target address.
The operation module 11-12 is configured to perform the average pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result, and store the operation result at the target address.
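A minimal Python sketch of what the operation module computes, under the assumption that the pooling kernel supplies a window size (kh, kw) and that the stride defaults to the window size (non-overlapping windows). The function name and stride parameters are illustrative, not taken from the disclosure.

```python
def average_pool(x, kh, kw, stride_h=None, stride_w=None):
    """2-D average pooling over a list-of-lists input with a kh x kw window.
    Strides default to the window size, giving non-overlapping windows."""
    sh = stride_h or kh
    sw = stride_w or kw
    h, w = len(x), len(x[0])
    out = []
    for r in range(0, h - kh + 1, sh):
        row = []
        for c in range(0, w - kw + 1, sw):
            total = 0
            for m in range(kh):
                for n in range(kw):
                    total += x[r + m][c + n]  # addition step
            row.append(total / (kh * kw))     # one division per window
        out.append(row)
    return out


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
print(average_pool(x, 2, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```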
In this embodiment, the average pooling instruction acquired by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile the (uncompiled) average pooling instruction. Only after the compiled average pooling instruction is obtained can it be parsed. The compiled average pooling instruction is a hardware instruction that the hardware can execute directly. The control module can obtain the data to be operated on and the pooling kernel from the address of the data to be operated on and the pooling kernel address, respectively. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or the field (usually represented by a code), specified in a computer program to identify the operation to be performed; it serves as an instruction sequence number that tells the device executing the instruction which instruction it needs to execute. The operation domain may be the source of all the data required to execute the corresponding instruction, including the data to be operated on, parameters such as the pooling kernel, the corresponding operation method, and so on. An average pooling instruction must include an opcode and an operation domain, where the operation domain includes at least the address of the data to be operated on, the pooling kernel address, and the target address.
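To illustrate the split between opcode and operation domain, the sketch below decodes a textual average pooling instruction and fetches its operands. The mnemonic avgpool, the field order (target, data, kernel, mirroring the transposition example), the AvgPoolInstr class, and the dict-based memory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgPoolInstr:
    opcode: str
    src_addr: int     # address of the data to be operated on
    kernel_addr: int  # address of the pooling kernel
    dst_addr: int     # target address for the result


def decode(text):
    """Split a textual instruction into opcode and operation domain.
    Assumed (hypothetical) syntax: avgpool <target> <data> <kernel>."""
    opcode, *domain = text.split()
    if opcode != "avgpool":
        raise ValueError(f"not an average pooling instruction: {opcode!r}")
    if len(domain) < 3:
        raise ValueError("operation domain needs data, kernel and target addresses")
    dst, src, kernel = (int(f) for f in domain[:3])
    return AvgPoolInstr(opcode, src, kernel, dst)


def fetch_operands(instr, memory):
    """Control-module step: read the input data and the pooling kernel."""
    return memory[instr.src_addr], memory[instr.kernel_addr]


memory = {100: [[1, 2], [3, 4]], 200: (2, 2)}
instr = decode("avgpool 300 100 200")
print(instr.opcode, fetch_operands(instr, memory))
```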
应当理解的是,本领域技术人员可以根据需要对平均池化指令的指令格式以及所包含的操作码和操作域进行设置,本公开对此不作限制。It should be understood that those skilled in the art can set the instruction format of the average pooling instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.
在本实施例中,该装置可以包括一个或多个控制模块,以及一个或多个运算模块,可以根据实际需要对控制模块和运算模块的数量进行设置,本公开对此不作限制。在装置包括一个控制模块时,该控制模块可以接收平均池化指令,并控制一个或多个运算模块进行平均池化运算。在装置包括多个控制模块时,多个控制模块可以分别接收平均池化指令,并控制对应的一个或多个运算模块进行平均池化运算。In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive the average pooling instruction and control one or more arithmetic modules to perform the average pooling operation. When the device includes multiple control modules, the multiple control modules may respectively receive the average pooling instruction and control the corresponding one or more arithmetic modules to perform the average pooling operation.
本公开实施例所提供的平均池化指令处理装置,该装置包括控制模块和运算模块,控制模块用于对获取到的平均池化指令进行编译,得到编译后的平均池化指令,对编译后的平均池化指令进行解析,得到平均池化指令的操作码和操作域,并根据操作码和操作域获取执行平均池化指令所需的待运算数据、池化核和目标地址;运算模块用于根据池化核对待运算数据进行平均池化运算,获得运算结果,并将运算结果存入目标地址中。本公开实施例所提供的平均池化指令处理装置的适用范围广,对平均池化指令的处理效率高、处理速度快,进行平均池化运算的处理效率高、速度快。An average pooling instruction processing device provided by an embodiment of the present disclosure includes a control module and an arithmetic module. The control module is used to compile the obtained average pooling instruction to obtain the compiled average pooling instruction. The average pooling instruction is parsed to obtain the operation code and operation domain of the average pooling instruction, and according to the operation code and operation domain, the data to be calculated, the pooling core and the target address required to execute the average pooling instruction are obtained; Based on the pooling core, the data to be calculated is averagely pooled to obtain the operation result, and the operation result is stored in the target address. The average pooling instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the average pooling instruction, and high processing efficiency and speed for performing the average pooling operation.
Fig. 12-2a shows a block diagram of an average pooling instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Fig. 12-2a, the operation module 11-12 may include a plurality of adders 11-120 and a plurality of dividers 11-120'. The adders 11-120 perform the addition operations of the average pooling operation, and the dividers 11-120' perform its division operations.
In this implementation, the operation module may instead include one adder and one divider, one adder and multiple dividers, or multiple adders and one divider. The numbers of adders and dividers can be set according to the amount of data involved in the average pooling operation and the required processing speed and efficiency, which is not limited in the present disclosure.
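To make the split between adders and dividers concrete, here is a minimal Python sketch, assuming a row-major 2-D input, no padding, and strides sx (along the width) and sy (along the height). The names `window_sums` and `divide_by_area` are hypothetical: the first phase stands in for the adders (one addition per window element), the second for the dividers (one division per output element). It does not model how the hardware schedules its adders and dividers.

```python
from typing import List

Matrix = List[List[float]]


def window_sums(data: Matrix, kh: int, kw: int, sy: int, sx: int) -> Matrix:
    """Adder phase: sum every kh x kw window (no padding)."""
    out_h = (len(data) - kh) // sy + 1
    out_w = (len(data[0]) - kw) // sx + 1
    sums: Matrix = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            acc = 0.0
            for dy in range(kh):
                for dx in range(kw):
                    acc += data[oy * sy + dy][ox * sx + dx]  # one addition per element
            row.append(acc)
        sums.append(row)
    return sums


def divide_by_area(sums: Matrix, kh: int, kw: int) -> Matrix:
    """Divider phase: one division per output element."""
    area = kh * kw
    return [[s / area for s in row] for row in sums]


data = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
result = divide_by_area(window_sums(data, 2, 2, 2, 2), 2, 2)
print(result)  # [[3.5, 5.5]]
```

Keeping the two phases separate mirrors the text's description of distinct adder and divider units; the number of each only changes how many windows can be handled at once.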
Fig. 12-2b shows a block diagram of an average pooling instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Fig. 12-2b, the operation module 11-12 may include a master operation submodule 11-121 and a plurality of slave operation submodules 11-122. The master operation submodule 11-121 may include multiple adders and multiple dividers.
The master operation submodule 11-121 uses the adders and dividers to perform, respectively, the addition and division operations of the average pooling operation, obtains the operation result, and stores it at the target address.
In a possible implementation, the operation domain may further include an input height and an input width.
The control module is further configured to obtain, from the address of the data to be operated on, the data to be operated on corresponding to the input width and input height.
In this implementation, the input height and input width define the amount and size of the data to be operated on that is obtained. The input height and input width in the operation domain may be specific values, or they may be storage addresses at which the input height and input width are stored. When the operation domain directly contains the values, those values are used as the input height and input width. When it contains storage addresses, the input height and input width are read from those addresses respectively.
In a possible implementation, when the operation domain does not include the input height and/or input width, the data to be operated on may be obtained according to a preset default input height and default input width.
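A minimal sketch of how a field such as the input height or width might be resolved in the three cases above (immediate value, storage address, or absent). The tagged-tuple operand encoding, the `resolve` helper, and the default values are hypothetical; the text does not specify how an immediate value is distinguished from an address, nor what the preset defaults are.

```python
from typing import Dict, Optional, Tuple

# Hypothetical operand encoding:
#   ("imm", value)    -> the field carries the number itself
#   ("addr", address) -> the field points to a memory cell holding the number
Operand = Tuple[str, int]

DEFAULT_INPUT_HEIGHT = 16  # placeholder default, not taken from the text
DEFAULT_INPUT_WIDTH = 16   # placeholder default, not taken from the text


def resolve(operand: Optional[Operand], memory: Dict[int, int], default: int) -> int:
    if operand is None:  # field missing from the operation domain
        return default
    kind, payload = operand
    if kind == "imm":
        return payload
    if kind == "addr":
        return memory[payload]
    raise ValueError(f"unknown operand kind: {kind}")


memory = {40: 64}
height = resolve(("addr", 40), memory, DEFAULT_INPUT_HEIGHT)   # read from address 40
width = resolve(("imm", 32), memory, DEFAULT_INPUT_WIDTH)      # given directly
fallback = resolve(None, memory, DEFAULT_INPUT_WIDTH)          # preset default
print(height, width, fallback)  # 64 32 16
```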
In this way, the amount and size of the data to be operated on can be limited, which helps ensure the accuracy of the operation result and ensures that the device can execute the average pooling instruction.
In a possible implementation, the operation domain may further include a pooling kernel height and a pooling kernel width.
The control module 11-11 is further configured to obtain the pooling kernel from the pooling kernel address according to the pooling kernel height and pooling kernel width.
In a possible implementation, the operation domain may further include a first stride, and the operation module 11-12 may further be configured to move the pooling kernel in the x direction according to the first stride.
In a possible implementation, the operation domain may further include a second stride, and the operation module 11-12 may further be configured to move the pooling kernel in the y direction according to the second stride.
In this implementation, the stride of the average pooling operation is the distance the pooling kernel moves at each step. The first stride is the distance moved in the x direction, and the second stride is the distance moved in the y direction.
It should be noted that the present disclosure describes the pooling kernel parameters required for the average pooling operation (height, width, first stride, and second stride) using a two-dimensional pooling kernel only as an example. If the pooling kernel is multi-dimensional, its parameters correspondingly include the size and stride of each dimension.
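For a kernel of any dimensionality, the number of kernel positions along each axis follows from that axis's size, kernel size, and stride: floor((n - k) / s) + 1. The sketch below assumes no padding and that a window is placed only where it fits entirely inside the input; `pooled_shape` is a hypothetical helper, and in the 2-D case the axes are listed as (height, width), so the strides are (sy, sx).

```python
from typing import Optional, Sequence, Tuple


def pooled_shape(input_shape: Sequence[int],
                 kernel_shape: Sequence[int],
                 strides: Optional[Sequence[int]] = None) -> Tuple[int, ...]:
    """Output size per axis: floor((n - k) / s) + 1, no padding."""
    if strides is None:
        strides = kernel_shape  # non-overlapping windows when no stride is given
    if not (len(input_shape) == len(kernel_shape) == len(strides)):
        raise ValueError("input, kernel and stride ranks must match")
    out = []
    for n, k, s in zip(input_shape, kernel_shape, strides):
        if k <= 0 or s <= 0 or k > n:
            raise ValueError("kernel and stride must be positive and the kernel must fit")
        out.append((n - k) // s + 1)
    return tuple(out)


print(pooled_shape((64, 32), (2, 2), (1, 2)))  # (63, 16): 2-D, sy = 1, sx = 2
print(pooled_shape((4, 6, 6), (2, 2, 2)))      # (2, 3, 3): 3-D, default strides
```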
In a possible implementation, when the first stride and second stride are not given in the operation domain of the average pooling instruction, the operation module may use the width and height of the pooling kernel as the strides of the corresponding dimensions, so that the average pooling operation can still proceed normally. For example, the operation module 11-12 may also be configured to move the pooling kernel over the data to be operated on without overlap and average the data in the region covered by the pooling kernel to obtain the operation result.
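The default-stride behaviour can be sketched as follows: when sx or sy is absent, the stride falls back to the kernel width or height, so adjacent windows tile the input without overlap, and each output element is the mean of its window. This is a plain-Python illustration with a hypothetical `avg_pool2d` helper, assuming a row-major list-of-rows input and no padding; it is not the device's implementation.

```python
from typing import List, Optional

Matrix = List[List[float]]


def avg_pool2d(data: Matrix, kh: int, kw: int,
               sx: Optional[int] = None, sy: Optional[int] = None) -> Matrix:
    sx = kw if sx is None else sx  # x stride defaults to the kernel width
    sy = kh if sy is None else sy  # y stride defaults to the kernel height
    h, w = len(data), len(data[0])
    out = []
    for top in range(0, h - kh + 1, sy):
        row = []
        for left in range(0, w - kw + 1, sx):
            window = [data[top + i][left + j] for i in range(kh) for j in range(kw)]
            row.append(sum(window) / (kh * kw))  # mean of the covered region
        out.append(row)
    return out


grid = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(avg_pool2d(grid, 2, 2))  # [[3.5, 5.5], [11.5, 13.5]]
```

Passing explicit strides, for example `avg_pool2d(grid, 2, 2, sx=1, sy=1)`, gives overlapping windows and a 3 × 3 result instead.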
In a possible implementation, when the operation domain does not include the pooling kernel height and pooling kernel width, a preset default pooling kernel height and default pooling kernel width may be used, so that the control module and operation module can still execute the average pooling instruction.
In a possible implementation, the operation domain may further include a number of pooling kernels. The operation module 11-12 is further configured to perform the average pooling operation on the data to be operated on using that number of pooling kernels.
In this implementation, the number of pooling kernels corresponds to the data to be operated on. For example, when the number of pooling kernels is 5, the data to be operated on can be divided into five parts, and five pooling kernels perform the average pooling operation on the five parts respectively.
In this implementation, when the operation domain does not include the number of pooling kernels, it can be determined that a single pooling kernel suffices to perform the average pooling operation on the data.
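One way to read the pooling-kernel count is to treat the data as that many equally sized parts and pool each part independently, using a single part when the field is absent. The contiguous part layout, the use of the kernel purely as a window shape, and the helper names below are all assumptions for illustration; the text does not specify how the data is partitioned.

```python
from typing import List, Optional

Matrix = List[List[float]]


def split_parts(flat: List[float], num_kernels: Optional[int], h: int, w: int) -> List[Matrix]:
    """Cut a flat buffer into num_kernels contiguous h x w parts (assumed layout)."""
    parts = 1 if num_kernels is None else num_kernels  # absent field -> one kernel
    if len(flat) != parts * h * w:
        raise ValueError("buffer length must equal parts * h * w")
    planes = []
    for p in range(parts):
        base = p * h * w
        planes.append([flat[base + r * w: base + (r + 1) * w] for r in range(h)])
    return planes


def avg_pool_part(plane: Matrix, kh: int, kw: int) -> Matrix:
    """Non-overlapping average pooling of a single part."""
    return [[sum(plane[r + i][c + j] for i in range(kh) for j in range(kw)) / (kh * kw)
             for c in range(0, len(plane[0]) - kw + 1, kw)]
            for r in range(0, len(plane) - kh + 1, kh)]


def pool_with_kernels(flat, num_kernels, h, w, kh, kw):
    """Apply one pooling kernel per part."""
    return [avg_pool_part(p, kh, kw) for p in split_parts(flat, num_kernels, h, w)]


print(pool_with_kernels(list(range(8)), 2, 2, 2, 2, 2))  # [[[1.5]], [[5.5]]]
print(pool_with_kernels([1, 3, 5, 7], None, 2, 2, 2, 2))  # [[[4.0]]]
```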
In a possible implementation, the operation module 11-12 is further configured to, when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, perform the average pooling operation only on the portion of the data whose size is an integer multiple of the pooling kernel size. The size of the data not being an integer multiple of the pooling kernel size may include at least one of the following: the input width of the data is not an integer multiple of the pooling kernel width; the input height of the data is not an integer multiple of the pooling kernel height.
In this implementation, the remaining data outside that integer-multiple portion need not undergo the average pooling operation.
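The effect of skipping the leftover data can be illustrated by computing which part of the input the non-overlapping windows actually cover. `covered_region` and `crop_to_multiple` are hypothetical helpers; they assume windows start at the top-left corner and step by the kernel size, as in the non-overlapping case.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def covered_region(h: int, w: int, kh: int, kw: int) -> Tuple[int, int, int, int]:
    """Return (rows used, cols used, rows skipped, cols skipped)."""
    used_h = (h // kh) * kh  # largest multiple of kh not exceeding h
    used_w = (w // kw) * kw  # largest multiple of kw not exceeding w
    return used_h, used_w, h - used_h, w - used_w


def crop_to_multiple(data: Matrix, kh: int, kw: int) -> Matrix:
    """Keep only the integer-multiple portion that is actually pooled."""
    used_h, used_w, _, _ = covered_region(len(data), len(data[0]), kh, kw)
    return [row[:used_w] for row in data[:used_h]]


print(covered_region(5, 7, 2, 3))    # (4, 6, 1, 1): last row and last column skipped
print(covered_region(64, 32, 2, 2))  # (64, 32, 0, 0): nothing skipped
```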
In a possible implementation, as shown in Figs. 12-2a and 12-2b, the device may further include a storage module 11-13, which is used to store the data to be operated on and the pooling kernel.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may include at least one NRAM (Neuron Random Access Memory). The cache may store the data to be operated on and the pooling kernel, and the register may store scalar data within the data to be operated on.
In a possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory described above, may store the neuron data within the data to be operated on, and the neuron data may include neuron vector data.
In a possible implementation, the device may further include a direct memory access module for reading data from and storing data to the storage module.
In a possible implementation, the control module 11-11 may further be configured to generate an assembly file according to the average pooling instruction and translate the assembly file into a binary file, where the binary file is the compiled average pooling instruction.
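The assembly-to-binary step can be pictured with a toy assembler. Everything about the encoding here is hypothetical: the opcode value 0x2A, the layout of a one-byte opcode followed by ten little-endian 32-bit operands, and the names `assemble` and `disassemble` are invented for illustration, and the sketch goes straight from one line of assembly text to bytes rather than through an assembly file.

```python
import struct

# Hypothetical encoding: 1-byte opcode + ten unsigned 32-bit operands, little-endian.
OPCODES = {"avgpool": 0x2A}
LAYOUT = "<B10I"


def assemble(line: str) -> bytes:
    mnemonic, *fields = line.split()
    if mnemonic not in OPCODES:
        raise ValueError(f"unknown mnemonic: {mnemonic}")
    if len(fields) != 10:
        raise ValueError("avgpool takes 10 operands in this sketch")
    return struct.pack(LAYOUT, OPCODES[mnemonic], *(int(f) for f in fields))


def disassemble(blob: bytes) -> tuple:
    return struct.unpack(LAYOUT, blob)


binary = assemble("avgpool 500 100 200 5 64 32 2 2 2 1")
print(len(binary))          # 41 bytes: 1 + 10 * 4
print(disassemble(binary))  # (42, 500, 100, 200, 5, 64, 32, 2, 2, 2, 1)
```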
In a possible implementation, the instruction format of the average pooling instruction may be:
avgpool dst src0 src1 srcChannel srcHeigh srcWidth kernelHeight kernelWidth sx sy
Here, avgpool is the operation code of the average pooling instruction, and dst, src0, src1, srcChannel, srcHeigh, srcWidth, kernelHeight, kernelWidth, sx, and sy form its operation domain: dst is the target address, src0 is the address of the data to be operated on, src1 is the pooling kernel address, srcChannel is the number of pooling kernels, srcHeigh is the input height, srcWidth is the input width, kernelHeight is the pooling kernel height, kernelWidth is the pooling kernel width, sx is the first stride with which the pooling kernel moves in the x direction, and sy is the second stride with which it moves in the y direction.
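Parsing the textual form above into its operation code and named operation-domain fields might look like this. The `AvgPoolInstr` class and `parse` function are hypothetical; the field names are copied verbatim from the format (including `srcHeigh`), and every operand is assumed to be a plain integer.

```python
from dataclasses import dataclass
from typing import Dict

FIELDS = ("dst", "src0", "src1", "srcChannel", "srcHeigh", "srcWidth",
          "kernelHeight", "kernelWidth", "sx", "sy")


@dataclass
class AvgPoolInstr:
    opcode: str
    operands: Dict[str, int]


def parse(text: str) -> AvgPoolInstr:
    opcode, *values = text.split()
    if opcode != "avgpool":
        raise ValueError(f"not an average pooling instruction: {opcode}")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} operands, got {len(values)}")
    return AvgPoolInstr(opcode, dict(zip(FIELDS, (int(v) for v in values))))


ins = parse("avgpool 500 100 200 5 64 32 2 2 2 1")
print(ins.operands["dst"], ins.operands["sx"], ins.operands["sy"])  # 500 2 1
```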
It should be understood that those skilled in the art can set the operation code of the average pooling instruction, and the positions of the operation code and operation domain in the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the average pooling instruction processing device has been introduced above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
Application examples
The following gives an application example according to an embodiment of the present disclosure, taking "performing an average pooling operation with the average pooling instruction processing device" as an exemplary application scenario, to help explain the workflow of the device. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
Fig. 12-3 shows a schematic diagram of an application scenario of an average pooling instruction processing device according to an embodiment of the present disclosure. As shown in Fig. 12-3, the device processes an average pooling instruction as follows:
The control module 11-11 compiles the obtained average pooling instruction 1 to obtain the compiled average pooling instruction 1 (for example, avgpool 500 100 200 5 64 32 2 2 2 1) and parses it to obtain the operation code and operation domain of instruction 1: the operation code is avgpool, the target address is 500, the address of the data to be operated on is 100, the pooling kernel address is 200, the number of pooling kernels is 5, the input height is 64, the input width is 32, the pooling kernel height is 2, the pooling kernel width is 2, the first stride is 2, and the second stride is 1. The control module 11-11 obtains 64 × 32 data to be operated on from address 100 and a 2 × 2 pooling kernel from address 200.
The operation module 11-12 uses 5 pooling kernels to perform the average pooling operation on the data to be operated on, obtains the operation result, and stores it at target address 500.
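The sizes in this example can be checked with a little arithmetic. Assuming no padding and that each of the five parts is a 64 × 32 plane pooled with the 2 × 2 kernel (the text does not spell out the layout), the kernel takes (32 - 2) / 2 + 1 = 16 positions along x with the first stride and (64 - 2) / 1 + 1 = 63 positions along y with the second stride:

```python
instr = "avgpool 500 100 200 5 64 32 2 2 2 1"
opcode, *fields = instr.split()
dst, src0, src1, parts, in_h, in_w, k_h, k_w, sx, sy = map(int, fields)

out_w = (in_w - k_w) // sx + 1  # positions along x with the first stride
out_h = (in_h - k_h) // sy + 1  # positions along y with the second stride

print(opcode, dst, (out_h, out_w))  # avgpool 500 (63, 16)
```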
For the working process of each module above, refer to the related description above.
In this way, average pooling instructions can be processed efficiently and quickly, and the efficiency and speed of the average pooling operation are significantly improved.
Fig. 12-4 shows a flowchart of an average pooling instruction processing method according to an embodiment of the present disclosure. The method can be applied to, for example, computer equipment including a memory and a processor, where the memory stores data used while executing the method and the processor performs the related processing and operation steps, such as steps S51-11 and S52-11 below. As shown in Fig. 12-4, the method is applied to the average pooling instruction processing device described above and includes steps S51-11 and S52-11.
In step S51-11, the control module compiles the obtained average pooling instruction to obtain a compiled average pooling instruction, parses the compiled instruction to obtain its operation code and operation domain, and obtains, according to the operation code and operation domain, the data to be operated on, the pooling kernel, and the target address required to execute the instruction. The operation code indicates that the operation the instruction performs on the data is an average pooling operation, and the operation domain includes the address of the data to be operated on, the pooling kernel address, and the target address.
In step S52-11, the operation module performs the average pooling operation on the data to be operated on according to the pooling kernel, obtains the operation result, and stores it at the target address.
In a possible implementation, performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include: performing the addition operations of the average pooling operation with the multiple adders in the operation module, and performing its division operations with the multiple dividers in the operation module.
In a possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and the master operation submodule includes multiple adders and multiple dividers. Step S52-11 may then include:
using the adders and dividers in the master operation submodule to perform, respectively, the addition and division operations of the average pooling operation, obtaining the operation result, and storing it at the target address.
In a possible implementation, the operation domain may further include an input height and an input width, and obtaining the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction according to the operation code and operation domain may include:
obtaining, from the address of the data to be operated on, the data to be operated on corresponding to the input width and input height.
In a possible implementation, the operation domain may further include a pooling kernel height and a pooling kernel width, and obtaining the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction according to the operation code and operation domain may include:
obtaining the pooling kernel from the pooling kernel address according to the pooling kernel height and pooling kernel width.
In a possible implementation, the operation domain may further include a first stride, and performing the average pooling operation on the data to be operated on according to the pooling kernel may include: moving the pooling kernel in the x direction according to the first stride.
In a possible implementation, the operation domain may further include a second stride, and performing the average pooling operation on the data to be operated on according to the pooling kernel may include: moving the pooling kernel in the y direction according to the second stride.
In a possible implementation, performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
moving the pooling kernel over the data to be operated on without overlap, and averaging the data in the region covered by the pooling kernel to obtain the operation result.
In a possible implementation, performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, performing the average pooling operation on the portion of the data whose size is an integer multiple of the pooling kernel size,
where the size of the data not being an integer multiple of the pooling kernel size may include at least one of the following: the input width of the data is not an integer multiple of the pooling kernel width; the input height of the data is not an integer multiple of the pooling kernel height.
In a possible implementation, the operation domain may further include a number of pooling kernels, and performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result may include:
performing the average pooling operation on the data to be operated on using as many pooling kernels as that number specifies.
In a possible implementation, the method may further include: storing the data to be operated on and the pooling kernel in the storage module of the device. The storage module may include at least one of a register and a cache. The cache stores the data to be operated on and the pooling kernel and may include at least one neuron cache (NRAM); the register stores scalar data within the data to be operated on; the neuron cache stores neuron data within the data to be operated on, and the neuron data may include neuron vector data.
In a possible implementation, parsing the obtained average pooling instruction to obtain its operation code and operation domain may include:
storing the compiled average pooling instruction;
parsing the compiled average pooling instruction to obtain its operation code and operation domain;
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the compiled average pooling instruction.
In a possible implementation, the method may further include: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction and executing it after execution of the zeroth instruction is completed,
where the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval, which stores the data required by the first instruction, overlapping with a zeroth storage address interval, which stores the data required by the zeroth instruction.
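The association check reduces to an interval-overlap test on the address ranges each instruction uses. The sketch assumes half-open [start, end) address ranges and treats any overlap as a dependency, as the text describes; `overlaps`, `depends_on`, and `can_issue` are hypothetical names, and `in_flight` stands for the earlier instructions that have not yet finished executing.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address range [start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def depends_on(later: List[Interval], earlier: List[Interval]) -> bool:
    """True if any address range of the later instruction overlaps one of the earlier."""
    return any(overlaps(x, y) for x in later for y in earlier)


def can_issue(candidate: List[Interval], in_flight: List[List[Interval]]) -> bool:
    """Issue now only if no unfinished earlier instruction shares an address range;
    otherwise the candidate stays buffered until those instructions complete."""
    return not any(depends_on(candidate, earlier) for earlier in in_flight)


zeroth = [(100, 200), (500, 600)]  # ranges used by the zeroth instruction
first = [(150, 180)]               # overlaps (100, 200), so it must wait
print(can_issue(first, [zeroth]))         # False
print(can_issue([(200, 300)], [zeroth]))  # True: ranges only touch at 200
```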
In a possible implementation, compiling the obtained average pooling instruction to obtain the compiled average pooling instruction may include:
generating an assembly file according to the average pooling instruction and translating the assembly file into a binary file, where the binary file is the compiled average pooling instruction.
It should be noted that although the average pooling instruction processing method has been introduced above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
The average pooling instruction processing method provided by the embodiments of the present disclosure has a wide range of application, and both processes average pooling instructions and performs average pooling operations with high efficiency and speed.
The foregoing can be better understood in light of the following clauses:
Clause L1. An average pooling instruction processing device, comprising:
a control module, configured to compile an obtained average pooling instruction to obtain a compiled average pooling instruction, parse the compiled average pooling instruction to obtain an operation code and an operation domain of the average pooling instruction, and obtain, according to the operation code and the operation domain, the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction; and
an operation module, configured to perform an average pooling operation on the data to be operated on according to the pooling kernel, obtain an operation result, and store the operation result at the target address,
wherein the operation code indicates that the operation performed on the data by the average pooling instruction is an average pooling operation, and the operation domain includes an address of the data to be operated on, a pooling kernel address, and the target address.
Clause L2. The device according to Clause L1, wherein the operation module comprises:
a plurality of adders, configured to perform the addition operations in the average pooling operation; and
a plurality of dividers, configured to perform the division operations in the average pooling operation.
Clause L3. The device according to Clause L2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodule comprising the plurality of adders and the plurality of dividers,
the master operation submodule being configured to use the plurality of adders and the plurality of dividers to perform, respectively, the addition operations and division operations in the average pooling operation, obtain an operation result, and store the operation result at the target address.
Clause L4. The device according to Clause L1, wherein the operation domain further includes an input height and an input width,
and the control module is further configured to obtain, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.
Clause L5. The device according to Clause L1, wherein the operation domain further includes a pooling kernel height and a pooling kernel width,
and the control module is further configured to obtain the pooling kernel from the pooling kernel address according to the pooling kernel height and the pooling kernel width.
Clause L6. The device according to Clause L1, wherein the operation domain further includes a first stride,
and the operation module is further configured to move the pooling kernel in the x direction according to the first stride.
Clause L7. The device according to Clause L1, wherein the operation domain further includes a second stride,
and the operation module is further configured to move the pooling kernel in the y direction according to the second stride.
Clause L8. The device according to Clause L1, wherein
the operation module is further configured to move the pooling kernel over the data to be operated on without overlap and average the data in the region covered by the pooling kernel to obtain the operation result.
Clause L9. The device according to Clause L1, wherein
the operation module is further configured to, when the size of the data to be operated on is not an integer multiple of the size of the pooling kernel, perform the average pooling operation on the portion of the data whose size is an integer multiple of the pooling kernel size,
wherein the size of the data not being an integer multiple of the pooling kernel size includes at least one of the following: the input width of the data is not an integer multiple of the pooling kernel width; the input height of the data is not an integer multiple of the pooling kernel height.
Clause L10. The device according to Clause L1, wherein the operation domain further includes a number of pooling kernels,
and the operation module is further configured to perform the average pooling operation on the data to be operated on using as many pooling kernels as that number specifies.
Clause L11. The device according to Clause L1, further comprising:
a storage module configured to store the data to be operated on and the pooling kernel,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the pooling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on;
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause L12. The device according to Clause L1, wherein the control module includes:
an instruction storage sub-module configured to store the compiled average pooling instruction;
an instruction processing sub-module configured to parse the compiled average pooling instruction to obtain the opcode and operation field of the average pooling instruction;
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled average pooling instruction.
Clause L13. The device according to Clause L12, wherein the control module further includes:
a dependency processing sub-module configured to, upon determining that a first instruction to be executed among the plurality of instructions to be executed has an association with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module, and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the arithmetic module,
wherein the first instruction to be executed having an association with the preceding zeroth instruction to be executed includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
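The overlap test in Clause L13 amounts to an interval-intersection check. The sketch below assumes that each instruction's data occupies one contiguous half-open address range `[start, end)`. The `Instr` class and both function names are hypothetical. A real dependency processing sub-module would also track, in hardware, when the zeroth instruction completes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval)


def overlaps(a: Instr, b: Instr) -> bool:
    """True if the two address intervals share at least one address."""
    return a.start < b.end and b.start < a.end


def must_wait(first: Instr, unfinished_earlier: List[Instr]) -> bool:
    """True if `first` must be held in instruction storage until the
    earlier instructions whose data intervals it overlaps have finished."""
    return any(overlaps(first, zeroth) for zeroth in unfinished_earlier)


load = Instr("load", 0x000, 0x100)
pool = Instr("avgpool", 0x080, 0x180)   # touches part of the data used by `load`
other = Instr("other", 0x200, 0x240)
print(must_wait(pool, [load]), must_wait(other, [load]))  # True False
```

Half-open intervals make adjacent but disjoint ranges, such as `[0, 64)` and `[64, 128)`, count as non-overlapping.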
Clause L14. The device according to Clause L1,
wherein the control module is further configured to generate an assembly file from the average pooling instruction and translate the assembly file into a binary file,
wherein the binary file is the compiled average pooling instruction.
Clause L15. A machine learning computing device, comprising:
one or more average pooling instruction processing devices according to any one of Clauses L1 to L14, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
when the machine learning computing device includes a plurality of the average pooling instruction processing devices, the plurality of average pooling instruction processing devices may be connected and transmit data through a specific structure;
wherein the plurality of average pooling instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of average pooling instruction processing devices share the same control system or have their own control systems; they share memory or have their own memories; and they may be interconnected in any interconnection topology.
Clause L16. A combined processing device, comprising:
the machine learning computing device according to Clause L15, a universal interconnection interface, and other processing devices;
the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by the user,
wherein the combined processing device further includes a storage device, connected to the machine learning computing device and to the other processing devices respectively, for storing data of the machine learning computing device and the other processing devices.
Clause L17. A machine learning chip, comprising:
the machine learning computing device according to Clause L15 or the combined processing device according to Clause L16.
Clause L18. An electronic device, comprising:
the machine learning chip according to Clause L17.
Clause L19. A board card, comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause L17;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device;
the control device is configured to monitor the state of the machine learning chip.
Clause L20. An average pooling instruction processing method, applied to an average pooling instruction processing device that includes a control module and an arithmetic module, the method comprising:
compiling, by the control module, an obtained average pooling instruction to obtain a compiled average pooling instruction; parsing the compiled average pooling instruction to obtain the opcode and operation field of the average pooling instruction; and obtaining, according to the opcode and the operation field, the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction;
performing, by the arithmetic module, an average pooling operation on the data to be operated on according to the pooling kernel to obtain an operation result, and storing the operation result at the target address,
wherein the opcode indicates that the operation the average pooling instruction performs on data is an average pooling operation, and the operation field includes a to-be-operated data address, a pooling kernel address, and the target address.
Clause L21. The method according to Clause L20, wherein performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result includes:
performing the addition operations of the average pooling operation with a plurality of adders in the arithmetic module, and performing the division operations of the average pooling operation with a plurality of dividers in the arithmetic module.
Clause L22. The method according to Clause L21, wherein the arithmetic module includes a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module includes the plurality of adders and the plurality of dividers,
wherein performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result, and storing the operation result at the target address, includes:
performing, with the plurality of adders and the plurality of dividers in the master operation sub-module, the addition and division operations of the average pooling operation respectively to obtain the operation result, and storing the operation result at the target address.
Clause L23. The method according to Clause L20, wherein the operation field further includes an input height and an input width,
wherein obtaining, according to the opcode and the operation field, the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction includes:
obtaining, from the to-be-operated data address, the data to be operated on that corresponds to the input width and the input height.
Clause L24. The method according to Clause L20, wherein the operation field further includes a pooling kernel height and a pooling kernel width,
wherein obtaining, according to the opcode and the operation field, the data to be operated on, the pooling kernel, and the target address required to execute the average pooling instruction includes:
obtaining the pooling kernel from the pooling kernel address according to the pooling kernel height and the pooling kernel width.
Clause L25. The method according to Clause L20, wherein the operation field further includes a first stride,
wherein performing the average pooling operation on the data to be operated on according to the pooling kernel includes:
moving the pooling kernel in the x direction by the first stride.
Clause L26. The method according to Clause L20, wherein the operation field further includes a second stride,
wherein performing the average pooling operation on the data to be operated on according to the pooling kernel includes:
moving the pooling kernel in the y direction by the second stride.
Clause L27. The method according to Clause L20, wherein performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result includes:
moving the pooling kernel over the data to be operated on without overlap, and comparing the plurality of data items in the region covered by the pooling kernel to obtain the operation result.
Clause L28. The method according to Clause L20, wherein performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result includes:
when the size of the data to be operated on is a non-integer multiple of the size of the pooling kernel, performing the average pooling operation on the portion of the data to be operated on whose size is an integer multiple of the size of the pooling kernel,
wherein the size of the data to be operated on being a non-integer multiple of the size of the pooling kernel includes at least one of the following: the input width of the data to be operated on is a non-integer multiple of the width of the pooling kernel; the input height of the data to be operated on is a non-integer multiple of the height of the pooling kernel.
Clause L29. The method according to Clause L20, wherein the operation field further includes a number of pooling kernels,
wherein performing the average pooling operation on the data to be operated on according to the pooling kernel to obtain the operation result includes:
performing the average pooling operation on the data to be operated on using as many pooling kernels as that number specifies.
Clause L30. The method according to Clause L20, further comprising:
storing the data to be operated on and the pooling kernel in a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and the pooling kernel, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on;
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause L31. The method according to Clause L20, wherein parsing the obtained average pooling instruction to obtain the opcode and operation field of the average pooling instruction includes:
storing the compiled average pooling instruction;
parsing the compiled average pooling instruction to obtain the opcode and operation field of the average pooling instruction;
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled average pooling instruction.
Clause L32. The method according to Clause L31, further comprising:
upon determining that a first instruction to be executed among the plurality of instructions to be executed has an association with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after determining that the zeroth instruction to be executed has finished executing, controlling the execution of the first instruction to be executed,
wherein the first instruction to be executed having an association with the preceding zeroth instruction to be executed includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause L33. The method according to Clause L20, wherein compiling the obtained average pooling instruction to obtain the compiled average pooling instruction includes:
generating an assembly file from the average pooling instruction and translating the assembly file into a binary file,
wherein the binary file is the compiled average pooling instruction.
Clause L34. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of Clauses L20 to L33.
Neural network algorithms are now widely used and the computing power of computer hardware keeps improving, so the types and quantity of data operations in practical applications keep increasing. Programming languages are diverse, and implementing scalar operations consistently across language environments is difficult. In the related art, no scalar instruction is broadly applicable across programming languages, so developers must define multiple custom instructions for their own language environment to implement different types of scalar operations. As a result, scalar operations are inefficient and slow. The present disclosure provides a scalar instruction processing method and device, computer equipment, and a storage medium that implement a scalar operation with a single instruction. This can significantly improve the efficiency and speed of scalar operations.
Fig. 13-1 shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. As shown in Fig. 13-1, the device includes a control module 13-11 and an arithmetic module 13-12.
The control module 13-11 is configured to perform the following steps. It compiles an obtained scalar instruction to obtain a compiled scalar instruction, and parses the compiled scalar instruction to obtain the opcode and operation field of the scalar instruction. From the opcode and operation field, it obtains the scalars to be operated on and the target address required to execute the scalar instruction, and it determines the scalar operation type of the scalar instruction. The opcode indicates that the operation the scalar instruction performs on data is a scalar operation. The scalar operation type indicates both the kind of scalar operation to perform and the data type of the scalars to be operated on. The operation field includes the to-be-operated scalar address and the target address.
The arithmetic module 13-12 is configured to perform a scalar operation on the scalars to be operated on according to the scalar operation type, obtain an operation result, and store the operation result at the target address.
In this embodiment, there may be one or more scalars to be operated on. The kind of operation indicated by the scalar operation type specifies which arithmetic or logical operation is performed on the scalars, for example an addition or a logical left shift. The data type indicated by the scalar operation type may be the storage type of the scalars to be operated on. Data types applicable to scalars can include 16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, 48-bit signed, and pointer types. Those skilled in the art can set the operation kinds and data types according to actual needs, and the present disclosure does not limit them.
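The sketch below makes the listed data types concrete by wrapping an integer result into each fixed-width type. It assumes two's-complement representation and modulo-2^n wraparound on overflow; the disclosure specifies neither. The type tags reuse the u16 to s48 names that the instruction format uses for its `type` field. `wrap` is a hypothetical helper. The pointer type is omitted because its width is not stated.

```python
# (width in bits, signed?) for the scalar data types named in the text.
TYPES = {
    "u16": (16, False), "u32": (32, False), "u48": (48, False),
    "s16": (16, True),  "s32": (32, True),  "s48": (48, True),
}


def wrap(value: int, dtype: str) -> int:
    """Reduce `value` to the representable range of `dtype`, assuming
    two's-complement storage and wraparound on overflow."""
    bits, signed = TYPES[dtype]
    value &= (1 << bits) - 1                 # keep only the low `bits` bits
    if signed and value >> (bits - 1):       # sign bit set -> negative value
        value -= 1 << bits
    return value


print(wrap(0xFFFF + 1, "u16"), wrap(0x7FFF + 1, "s16"))  # 0 -32768
```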
In this embodiment, the scalar instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. Only after the compiled scalar instruction is obtained can it be parsed. The compiled scalar instruction is a hardware instruction that the hardware can execute directly. The control module can obtain each scalar to be operated on from the to-be-operated scalar address. The control module can obtain instructions and data through a data input/output unit, which can be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or a field (usually represented as a code), that a computer program uses to specify the operation to perform. It serves as an instruction serial number that tells the executing device which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction. This data includes parameters such as the scalars to be operated on and the scalar operation type, as well as the corresponding operation method. A scalar instruction must include an opcode and an operation field, and the operation field includes at least the to-be-operated scalar address and the target address.
It should be understood that those skilled in the art can set the instruction format of the scalar instruction, and the opcode and operation field it contains, as needed, and the present disclosure does not limit them.
In this embodiment, the device may include one or more control modules and one or more arithmetic modules. Their numbers can be set according to actual needs, and the present disclosure does not limit them. When the device includes one control module, that control module can receive scalar instructions and control one or more arithmetic modules to perform scalar operations. When the device includes multiple control modules, each control module can receive scalar instructions and control its corresponding one or more arithmetic modules to perform scalar operations.
The scalar instruction processing device provided by the embodiments of the present disclosure includes a control module and an arithmetic module. The control module compiles an obtained scalar instruction and parses the compiled scalar instruction to obtain its opcode and operation field. From these, it obtains the scalars to be operated on and the target address required to execute the instruction, and it determines the scalar operation type of the instruction. The arithmetic module performs a scalar operation on the scalars to be operated on according to the scalar operation type, obtains an operation result, and stores it at the target address. The device has a wide range of application and processes both scalar instructions and scalar operations efficiently and quickly.
Fig. 13-2a shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Fig. 13-2a, the arithmetic module 13-12 may include a plurality of scalar operators 13-120. These scalar operators are configured to perform the scalar operation corresponding to the scalar operation type.
In this implementation, the scalar operators may include adders, dividers, multipliers, and other operators capable of performing arithmetic, logical, and similar operations on scalars. The kinds and numbers of scalar operators can be set according to requirements such as the amount of data in the scalar operations, the scalar operation types, and the required processing speed and efficiency. The present disclosure does not limit them.
Fig. 13-2b shows a block diagram of a scalar instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Fig. 13-2b, the arithmetic module 13-12 may include a master operation sub-module 13-121 and a plurality of slave operation sub-modules 13-122. The master operation sub-module 13-121 may include a plurality of scalar operators (not shown in the figure).
The master operation sub-module 13-121 is configured to perform scalar operations with the plurality of scalar operators, obtain the operation result, and store it at the target address.
In a possible implementation, the operation field may further include the scalar operation type.
In that case, the control module 13-11 may further be configured to determine the scalar operation type from the operation field.
In a possible implementation, the operation kind may be at least one of the following: addition, subtraction, multiplication, bitwise AND, bitwise remainder, bitwise absolute value, bitwise division, bitwise OR, bitwise XOR, bitwise NOT, bitwise maximum, bitwise minimum, logical left shift, logical right shift, arithmetic right shift, logical AND, logical OR, logical XOR, and logical NOT.
In this implementation, different operation field codes can be set for different scalar operation types to distinguish the operation kinds. For example, the codes may be set as follows: add for addition, sub for subtraction, mul for multiplication, and for bitwise AND, rem for bitwise remainder, abs for bitwise absolute value, div for bitwise division, or for bitwise OR, xor for bitwise XOR, not for bitwise NOT, max for bitwise maximum, min for bitwise minimum, sll for logical left shift, srl for logical right shift, sra for arithmetic right shift, land for logical AND, lor for logical OR, lxo for logical XOR, and lnot for logical NOT. Those skilled in the art can set the codes for the operation kinds according to actual needs, and the present disclosure does not limit them.
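A software model of an arithmetic module could map these mnemonics to operations with a dispatch table. The sketch below is an illustration built on assumptions, not the disclosure's implementation:

- It covers only mnemonics whose meaning is clear from their names.
- It omits rem and div, because their rounding behavior is not specified.
- It takes the shift count for sll, srl and sra from a second operand, since the text says shift counts may come from the operation parameter pa.
- It treats operands as Python integers of a given bit width, and expects sra's operand to already be a signed value.

`OPS` and `execute` are hypothetical names.

```python
from typing import Callable, Dict


def _mask(bits: int) -> int:
    return (1 << bits) - 1


# Each handler takes (a, b, bits). Unary operations ignore b; for shifts,
# b is the shift count and bits is the operand width.
OPS: Dict[str, Callable[[int, int, int], int]] = {
    "add":  lambda a, b, bits: a + b,
    "sub":  lambda a, b, bits: a - b,
    "mul":  lambda a, b, bits: a * b,
    "and":  lambda a, b, bits: a & b,
    "or":   lambda a, b, bits: a | b,
    "xor":  lambda a, b, bits: a ^ b,
    "not":  lambda a, b, bits: ~a,
    "max":  lambda a, b, bits: max(a, b),
    "min":  lambda a, b, bits: min(a, b),
    "abs":  lambda a, b, bits: abs(a),
    "sll":  lambda a, b, bits: (a << b) & _mask(bits),   # bits shifted out are lost
    "srl":  lambda a, b, bits: (a & _mask(bits)) >> b,   # zero-fill from the left
    "sra":  lambda a, b, bits: a >> b,                   # Python >> preserves the sign
    "land": lambda a, b, bits: int(bool(a) and bool(b)),
    "lor":  lambda a, b, bits: int(bool(a) or bool(b)),
    "lxo":  lambda a, b, bits: int(bool(a) != bool(b)),
    "lnot": lambda a, b, bits: int(not a),
}


def execute(mnemonic: str, a: int, b: int = 0, bits: int = 32) -> int:
    """Look up and apply the operation named by `mnemonic`."""
    return OPS[mnemonic](a, b, bits)


print(execute("sra", -8, 1, 16), execute("srl", -8, 1, 16))  # -4 32764
```

The example shows the difference between the two right shifts. sra keeps the sign of -8. srl first reinterprets -8 as the unsigned 16-bit value 65528 and then zero-fills.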
In a possible implementation, the operation field may further include an operation parameter. The control module 13-11 is further configured to determine the operation parameter from the operation field. The arithmetic module 13-12 is further configured to perform the scalar operation on the scalars to be operated on according to the scalar operation type and the operation parameter, obtain an operation result, and store the operation result at the target address.
In a possible implementation, as shown in Figs. 13-2a and 13-2b, the device may further include a storage module 13-13 configured to store the scalars to be operated on.
In this implementation, the storage module may include a cache, a register, or both. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache can store the data to be operated on, and the register can store the scalars to be operated on.
In a possible implementation, the cache may include a neuron cache, that is, the neuron random access memory mentioned above. It can store neuron data among the data to be operated on, and the neuron data can include neuron vector data. The data to be operated on includes data related to scalar operations and/or data related to the operations of other computing instructions.
In a possible implementation, the device may further include a direct memory access (DMA) module for reading data from, or storing data to, the storage module.
In a possible implementation, the control module 13-11 may further be configured to generate an assembly file from the scalar instruction and translate the assembly file into a binary file, where the binary file is the compiled scalar instruction.
In one possible implementation, the instruction format of the scalar instruction may be:
scalar dst src opcode.type pa
Here, scalar is the opcode of the scalar instruction, and dst, src, opcode.type and pa form its operation domain. dst is the target address. src is the address of the scalar to be operated on. When there are several scalars to be operated on, src may comprise several scalar addresses src0, src1, ..., srcn; the present disclosure does not limit this. opcode.type is the scalar operation type: its opcode part gives the kind of scalar operation, and its type part gives the data type of the scalars to be operated on. pa is an operation parameter, such as a shift count.
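As an illustration of how the fields of this format might be separated, here is a minimal Python parser. It assumes the addresses are whitespace-separated decimal numbers. It treats the first token containing a dot as opcode.type and any token after it as pa. ScalarInstr and parse_format1 are hypothetical names, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScalarInstr:
    dst: int            # target address
    srcs: List[int]     # addresses of the scalars to be operated on
    opcode: str         # kind of scalar operation, e.g. "add"
    dtype: str          # data type of the operands, e.g. "u16"
    pa: Optional[int]   # operation parameter (e.g. shift count), if present


def parse_format1(text: str) -> ScalarInstr:
    """Parse 'scalar dst src0 [src1 ... srcn] opcode.type [pa]'."""
    tokens = text.split()
    if not tokens or tokens[0] != "scalar":
        raise ValueError("the opcode must be 'scalar'")
    # The first token containing a dot is opcode.type; the addresses precede it.
    split_at = next((i for i, t in enumerate(tokens) if "." in t), None)
    if split_at is None or split_at < 3:
        raise ValueError("expected dst, at least one src, then opcode.type")
    dst, *srcs = (int(t) for t in tokens[1:split_at])
    opcode, dtype = tokens[split_at].split(".")
    rest = tokens[split_at + 1:]
    pa = int(rest[0]) if rest else None
    return ScalarInstr(dst, srcs, opcode, dtype, pa)


print(parse_format1("scalar 500 101 102 add.u16"))
print(parse_format1("scalar 600 110 sll.u32 3"))
```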
In one possible implementation, the instruction format of the scalar instruction may be:
opcode.scalar.type dst src pa
Here, opcode.scalar.type is the opcode of the scalar instruction, and dst, src and pa form its operation domain. dst is the target address. src is the address of the scalar to be operated on. When there are several scalars to be operated on, src may comprise several scalar addresses src0, src1, ..., srcn; the present disclosure does not limit this. Alternatively, several scalars to be operated on may be obtained from a single src. pa is an operation parameter, such as a shift count. Within the opcode opcode.scalar.type, opcode indicates the kind of scalar operation and type indicates the data type of the scalars to be operated on. type may be u16, u32, u48, s16, s32, s48 or ptr. u16, u32 and u48 denote unsigned scalars 16, 32 and 48 bits wide. s16, s32 and s48 denote signed scalars 16, 32 and 48 bits wide. ptr denotes a pointer-type scalar.
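The type field fixes the signedness and width of the operands. The following Python sketch shows one way to reduce an arbitrary integer result to a given type. It assumes two's-complement wraparound, which the text does not state. ptr is omitted because its width is not given.

```python
# (signed, width in bits) for each numeric type named in the text.
TYPES = {
    "u16": (False, 16), "u32": (False, 32), "u48": (False, 48),
    "s16": (True, 16), "s32": (True, 32), "s48": (True, 48),
}


def wrap(value: int, dtype: str) -> int:
    """Reduce an arbitrary Python int to the range of dtype (two's complement for sNN)."""
    signed, bits = TYPES[dtype]
    value &= (1 << bits) - 1             # keep the low `bits` bits
    if signed and value >> (bits - 1):   # sign bit set -> negative value
        value -= 1 << bits
    return value


print(wrap(70000, "u16"))   # 4464 (70000 mod 65536)
print(wrap(0xFFFF, "s16"))  # -1
print(wrap(-1, "u48"))      # 281474976710655
```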
In one possible implementation, the scalar instruction for scalar addition may have the format add.scalar.type dst src0 src1. It adds the first scalar to be operated on (data type type, stored at src0) to the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for scalar summation may have the format sub.scalar.type dst src0. It sums the multiple scalars to be operated on (data type type) stored at src0, and stores the result at the target address dst. Alternatively, the format may be sub.scalar.type dst src0 src1 ... srcn. It sums the scalars to be operated on (data type type) stored at src0, src1, ..., srcn, and stores the result at the target address dst.
In one possible implementation, the scalar instruction for scalar multiplication may have the format mul.scalar.type dst src0 src1. It multiplies the first scalar to be operated on (data type type, stored at src0) by the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
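The addition, summation and multiplication formats above can be illustrated with the following Python sketch. It makes three simplifying choices:

- results wrap around to the width of the data type on overflow (the text does not say how overflow is handled);
- memory is modeled as a dict from address to value;
- summation uses the multi-address form.

The names execute and wrap are illustrative only.

```python
# (signed, width) per data type; wraparound on overflow is an assumption.
TYPES = {"u16": (False, 16), "u32": (False, 32), "u48": (False, 48),
         "s16": (True, 16), "s32": (True, 32), "s48": (True, 48)}


def wrap(v: int, dtype: str) -> int:
    """Reduce v to the range of dtype (two's complement for signed types)."""
    signed, bits = TYPES[dtype]
    v &= (1 << bits) - 1
    return v - (1 << bits) if signed and v >> (bits - 1) else v


def execute(mem: dict, opcode: str, dtype: str, dst: int, srcs: list) -> None:
    """Fetch the operands at srcs, apply opcode, store the wrapped result at dst."""
    vals = [mem[a] for a in srcs]
    if opcode == "add":        # add.scalar.type dst src0 src1
        result = vals[0] + vals[1]
    elif opcode == "sub":      # the text's mnemonic for the n-ary sum: src0 ... srcn
        result = sum(vals)
    elif opcode == "mul":      # mul.scalar.type dst src0 src1
        result = vals[0] * vals[1]
    else:
        raise ValueError(f"unsupported opcode {opcode!r}")
    mem[dst] = wrap(result, dtype)


mem = {101: 40000, 102: 30000, 103: 7, 104: 9, 105: 11}  # hypothetical contents
execute(mem, "add", "u16", 500, [101, 102])
execute(mem, "sub", "u16", 501, [103, 104, 105])
execute(mem, "mul", "s16", 502, [103, 104])
print(mem[500], mem[501], mem[502])  # 4464 27 63
```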
In one possible implementation, the scalar instruction for bitwise AND may have the format and.scalar.type dst src0 src1. It performs a bitwise AND of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for the bitwise remainder operation may have the format rem.scalar.type dst src0. It performs a bitwise remainder operation on the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for the bitwise absolute-value operation may have the format abs.scalar.type dst src0. It performs a bitwise absolute-value operation on the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for bitwise division may have the format div.scalar.type dst src0 src1. It divides the first scalar to be operated on (data type type, stored at src0) by the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for bitwise OR may have the format or.scalar.type dst src0 src1. It performs a bitwise OR of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for bitwise XOR may have the format xor.scalar.type dst src0 src1. It performs a bitwise XOR of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for bitwise NOT may have the format not.scalar.type dst src0. It inverts every bit of the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for the bitwise maximum operation may have the format max.scalar.type dst src0. It performs a bitwise maximum operation on the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for the bitwise minimum operation may have the format min.scalar.type dst src0. It performs a bitwise minimum operation on the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
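The two-operand and one-operand bitwise formats above can be sketched in the same way. This Python sketch assumes two's-complement wraparound and truncating (C-style) division; the text specifies neither. The single-operand rem, max and min forms are not modeled, because the text does not say how one stored operand yields their result.

```python
# (signed, width) per data type; wraparound is an assumption.
TYPES = {"u16": (False, 16), "u32": (False, 32), "u48": (False, 48),
         "s16": (True, 16), "s32": (True, 32), "s48": (True, 48)}


def wrap(v: int, dtype: str) -> int:
    """Reduce v to the range of dtype (two's complement for signed types)."""
    signed, bits = TYPES[dtype]
    v &= (1 << bits) - 1
    return v - (1 << bits) if signed and v >> (bits - 1) else v


def bitwise(opcode: str, dtype: str, a: int, b: int = 0) -> int:
    """Apply one of the and/or/xor/not/abs/div operations and wrap to dtype."""
    if opcode == "and":
        r = a & b
    elif opcode == "or":
        r = a | b
    elif opcode == "xor":
        r = a ^ b
    elif opcode == "not":      # one operand: invert every bit of a
        r = ~a
    elif opcode == "abs":      # one operand
        r = abs(a)
    elif opcode == "div":      # truncating division toward zero (assumed)
        if b == 0:
            raise ZeroDivisionError("division by zero")
        q = abs(a) // abs(b)
        r = -q if (a < 0) != (b < 0) else q
    else:
        raise ValueError(f"unsupported opcode {opcode!r}")
    return wrap(r, dtype)


print(bitwise("and", "u16", 0b1100, 0b1010))  # 8
print(bitwise("not", "u16", 0))               # 65535
print(bitwise("div", "s32", -7, 2))           # -3
```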
In one possible implementation, the scalar instruction for logical left shift may have the format sll.scalar.type dst src0 pa. It logically shifts the scalar to be operated on (data type type, stored at src0) left by pa bits, and stores the result at the target address dst.
In one possible implementation, the scalar instruction for logical right shift may have the format srl.scalar.type dst src0 pa. It logically shifts the scalar to be operated on (data type type, stored at src0) right by pa bits, and stores the result at the target address dst.
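The difference between the shift forms shows up only for signed types. In this Python sketch, pa is the shift count taken from the operation domain. The list of operation kinds also includes an arithmetic right shift, but the text gives it no mnemonic, so the name sra is hypothetical. Wraparound to the operand width is an assumption.

```python
# (signed, width) per data type; wraparound is an assumption.
TYPES = {"u16": (False, 16), "u32": (False, 32), "u48": (False, 48),
         "s16": (True, 16), "s32": (True, 32), "s48": (True, 48)}


def wrap(v: int, dtype: str) -> int:
    """Reduce v to the range of dtype (two's complement for signed types)."""
    signed, bits = TYPES[dtype]
    v &= (1 << bits) - 1
    return v - (1 << bits) if signed and v >> (bits - 1) else v


def shift(opcode: str, dtype: str, a: int, pa: int) -> int:
    """Shift a by pa bits; pa is the operation parameter from the operation domain."""
    _, bits = TYPES[dtype]
    pattern = a & ((1 << bits) - 1)    # raw bit pattern of the operand
    if opcode == "sll":                # logical left shift: zeros enter from the right
        r = pattern << pa
    elif opcode == "srl":              # logical right shift: zeros enter from the left
        r = pattern >> pa
    elif opcode == "sra":              # hypothetical mnemonic: arithmetic right shift
        r = wrap(a, dtype) >> pa       # Python's >> on ints replicates the sign
    else:
        raise ValueError(f"unsupported opcode {opcode!r}")
    return wrap(r, dtype)


print(shift("srl", "s16", -8, 1))  # 32764
print(shift("sra", "s16", -8, 1))  # -4
```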
In one possible implementation, the scalar instruction for logical AND may have the format land.scalar.type dst src0 src1. It performs a logical AND of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for logical OR may have the format lor.scalar.type dst src0 src1. It performs a logical OR of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for logical XOR may have the format lxo.scalar.type dst src0 src1. It performs a logical XOR of the first scalar to be operated on (data type type, stored at src0) and the second scalar to be operated on (data type type, stored at src1), and stores the result at the target address dst.
In one possible implementation, the scalar instruction for logical NOT may have the format lnot.scalar.type dst src0. It performs a logical NOT of the scalar to be operated on (data type type, stored at src0), and stores the result at the target address dst.
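The logical forms differ from the bitwise ones in that they test the truth of each whole operand rather than individual bits. This Python sketch assumes that any nonzero operand counts as true and that the result is 1 or 0. The text does not define the result encoding.

```python
def logical(opcode: str, a: int, b: int = 0) -> int:
    """Logical operations: any nonzero operand counts as true; the result is 1 or 0."""
    if opcode == "land":
        r = bool(a) and bool(b)
    elif opcode == "lor":
        r = bool(a) or bool(b)
    elif opcode == "lxo":            # logical XOR
        r = bool(a) != bool(b)
    elif opcode == "lnot":           # one operand
        r = not a
    else:
        raise ValueError(f"unsupported opcode {opcode!r}")
    return int(r)


# Contrast with bitwise AND: 2 & 1 == 0, although both operands are nonzero.
print(logical("land", 2, 1), 2 & 1)  # 1 0
```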
It should be understood that those skilled in the art may set, as needed, the opcode of the scalar instruction and the positions of the opcode and the operation domain within the instruction format; the present disclosure does not limit this.
In one possible implementation, the device may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that the scalar instruction processing device has been described above using the foregoing embodiments as examples, but those skilled in the art will understand that the present disclosure is not limited to them. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application examples
The following application example according to an embodiment of the present disclosure uses "performing scalar operations with the scalar instruction processing device" as an exemplary scenario, to help explain the flow of the device. Those skilled in the art should understand that these application examples serve only to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 13-3a and FIG. 13-3b are schematic diagrams of application scenarios of a scalar instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 13-3a and FIG. 13-3b, the device processes scalar instructions as follows:
Example 1
As shown in FIG. 13-3a, the control module 13-11 compiles the obtained scalar instruction 1 (for example, scalar 500 101 102 add.u16) to obtain the compiled scalar instruction 1, and parses the compiled instruction to obtain its opcode and operation domain. The opcode of scalar instruction 1 is scalar and the target address is 500. The address of the first scalar to be operated on is 101, and the address of the second scalar to be operated on is 102. The scalar operation type is add.u16: the kind of operation is addition (add), and the data type is a 16-bit unsigned scalar. The control module 13-11 fetches the 16-bit unsigned first scalar from address 101 and the 16-bit unsigned second scalar from address 102.
The operation module 13-12 adds the first and second scalars to obtain operation result 1, and stores operation result 1 at target address 500.
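Example 1 can be traced with the following Python sketch. It models memory as a dict from address to value and assumes wraparound for the 16-bit unsigned result. The operand values at addresses 101 and 102 are made up for illustration, and execute_scalar is a hypothetical name.

```python
def execute_scalar(mem: dict, text: str) -> None:
    """Decode 'scalar dst src0 src1 opcode.type' and execute it on mem (address -> value)."""
    op, dst, src0, src1, op_type = text.split()
    if op != "scalar":
        raise ValueError("expected the 'scalar' opcode")
    kind, dtype = op_type.split(".")           # "add.u16" -> ("add", "u16")
    if dtype not in ("u16", "u32", "u48"):
        raise ValueError("this sketch only models the unsigned types")
    mask = (1 << int(dtype[1:])) - 1           # u16 -> 0xFFFF
    a, b = mem[int(src0)], mem[int(src1)]      # control module fetches both operands
    if kind == "add":
        result = a + b
    elif kind == "mul":
        result = a * b
    else:
        raise ValueError(f"unsupported operation {kind!r}")
    mem[int(dst)] = result & mask              # operation module stores the result at dst


mem = {101: 1200, 102: 34}                     # hypothetical operand values
execute_scalar(mem, "scalar 500 101 102 add.u16")
print(mem[500])  # 1234
```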
Example 2
As shown in FIG. 13-3b, the control module 13-11 compiles the obtained scalar instruction 2 (for example, mul.scalar.u16 501 103 104) to obtain the compiled scalar instruction 2, and parses the compiled instruction to obtain its opcode and operation domain. The opcode of scalar instruction 2 is mul.scalar.u16 and the target address is 501. The address of the third scalar to be operated on is 103, and the address of the fourth scalar to be operated on is 104. The control module 13-11 fetches the 16-bit unsigned third scalar from address 103 and the 16-bit unsigned fourth scalar from address 104.
The operation module 13-12 multiplies the third and fourth scalars to obtain operation result 2, and stores operation result 2 at target address 501.
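Example 2 uses the second instruction format, in which the operation kind and data type sit in the opcode itself. In this Python sketch, the operand values at addresses 103 and 104 are made up for illustration and wraparound on overflow is assumed. execute_typed is a hypothetical name.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def execute_typed(mem: dict, text: str) -> None:
    """Decode 'opcode.scalar.type dst src0 src1' and execute it on mem (address -> value)."""
    mnemonic, dst, src0, src1 = text.split()
    kind, marker, dtype = mnemonic.split(".")   # "mul.scalar.u16" -> ("mul", "scalar", "u16")
    if marker != "scalar" or dtype not in ("u16", "u32", "u48"):
        raise ValueError("this sketch only models unsigned scalar instructions")
    mask = (1 << int(dtype[1:])) - 1
    a, b = mem[int(src0)], mem[int(src1)]       # fetch both operands (103 and 104 in Example 2)
    mem[int(dst)] = OPS[kind](a, b) & mask      # store the wrapped result at dst


mem = {103: 300, 104: 200}                      # hypothetical operand values
execute_typed(mem, "mul.scalar.u16 501 103 104")
print(mem[501])  # 60000
```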
For the working process of each module above, refer to the related description earlier in this disclosure.
In this way, the scalar instruction processing device can process scalar instructions, and thus perform scalar operations, efficiently and quickly.
FIG. 13-4 shows a flowchart of a scalar instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment that includes a memory and a processor. The memory stores the data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-13 and S52-13 below. As shown in FIG. 13-4, the method is applied to the scalar instruction processing device described above and includes steps S51-13 and S52-13.
In step S51-13, the control module:
- compiles the obtained scalar instruction to obtain a compiled scalar instruction;
- parses the compiled scalar instruction to obtain its opcode and operation domain;
- obtains, according to the opcode and the operation domain, the scalars to be operated on and the target address required to execute the scalar instruction;
- determines the scalar operation type of the scalar instruction.
The opcode indicates that the operation the scalar instruction performs on the data is a scalar operation. The scalar operation type indicates the kind of scalar operation and the data type of the scalars to be operated on. The operation domain includes the addresses of the scalars to be operated on and the target address.
In step S52-13, the operation module performs the scalar operation on the scalars to be operated on according to the scalar operation type, obtains an operation result, and stores the operation result at the target address.
In one possible implementation, performing the scalar operation on the scalars to be operated on according to the scalar operation type to obtain the operation result may include: performing, with a plurality of scalar operators in the operation module, the scalar operation corresponding to the scalar operation type.
In one possible implementation, the operation module includes a master operation submodule and a plurality of slave operation submodules, and the master operation submodule includes the plurality of scalar operators. Step S52-13 may then include:
performing, with the plurality of scalar operators in the master operation submodule, the scalar operation corresponding to the scalar operation type, including pre-processing the scalars to be operated on, to obtain the operation result, and storing the operation result at the target address.
In one possible implementation, the operation domain may further include the scalar operation type. Determining the scalar operation type of the scalar instruction may then include:
determining the scalar operation type according to the operation domain.
In one possible implementation, the operation domain may further include an operation parameter. Obtaining the scalars to be operated on and the target address according to the opcode and the operation domain may then further include: determining the operation parameter according to the operation domain.
Performing the scalar operation on the scalars to be operated on according to the scalar operation type may then include:
performing the scalar operation on the scalars to be operated on according to the operation parameter and the scalar operation type.
In one possible implementation, the kind of operation includes at least one of the following: addition, summation, multiplication, bitwise AND, bitwise remainder, bitwise absolute value, bitwise division, bitwise OR, bitwise XOR, bitwise NOT, bitwise maximum, bitwise minimum, logical left shift, logical right shift, arithmetic right shift, logical AND, logical OR, logical XOR and logical NOT.
In one possible implementation, the method may further include: storing the scalars to be operated on in a storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache stores the data to be operated on and includes at least one neuron cache (NRAM);
the register stores the scalars to be operated on; and
the neuron cache stores the neuron data within the data to be operated on, the neuron data including neuron vector data.
In one possible implementation, parsing the compiled scalar instruction to obtain its opcode and operation domain may include:
storing the compiled scalar instruction;
parsing the compiled scalar instruction to obtain the opcode and operation domain of the scalar instruction; and
storing an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged in execution order, which may include the compiled scalar instruction.
In one possible implementation, the method may further include the following. When it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, the first to-be-executed instruction is cached. After it is determined that the zeroth to-be-executed instruction has finished executing, execution of the first to-be-executed instruction is started,
where the association relationship between the first to-be-executed instruction and the compiled zeroth to-be-executed instruction preceding it may include the following: the first storage address interval, which stores the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
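The overlap test that defines the association relationship can be sketched in Python as follows. The sketch rests on two simplifying assumptions: each instruction's required data occupies a single half-open address interval, and every instruction finishes in one step. Real hardware may track intervals per operand. Instr, overlaps and schedule are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval [start, end))


def overlaps(a: Instr, b: Instr) -> bool:
    """Half-open intervals overlap exactly when each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def schedule(queue: List[Instr]) -> List[int]:
    """Return an issue step for each queued instruction, assuming each takes one step.

    An instruction whose interval overlaps an earlier one in the queue is held back
    (cached) until that earlier instruction has finished; the others issue at step 0.
    """
    steps: List[int] = []
    for i, cur in enumerate(queue):
        step = 0
        for j in range(i):
            if overlaps(cur, queue[j]):
                step = max(step, steps[j] + 1)
        steps.append(step)
    return steps


queue = [Instr("i0", 100, 103), Instr("i1", 102, 105), Instr("i2", 200, 201)]
print(schedule(queue))  # [0, 1, 0]
```

Here i1 shares address 102 with i0 and therefore waits one step, while i2 touches unrelated addresses and issues immediately.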
In one possible implementation, compiling the obtained scalar instruction to obtain the compiled scalar instruction may include:
generating an assembly file according to the scalar instruction, and translating the assembly file into a binary file, where the binary file is the compiled scalar instruction.
It should be noted that the scalar instruction processing method has been described above using the foregoing embodiments as examples, but those skilled in the art will understand that the present disclosure is not limited to them. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The scalar instruction processing method provided by the embodiments of the present disclosure has a wide range of application and processes scalars, including performing scalar operations, efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause M1. A scalar instruction processing device, comprising:
a control module configured to:
- compile an obtained scalar instruction to obtain a compiled scalar instruction;
- parse the compiled scalar instruction to obtain an opcode and an operation domain of the scalar instruction;
- obtain, according to the opcode and the operation domain, the scalars to be operated on and the target address required to execute the scalar instruction;
- determine a scalar operation type of the scalar instruction; and
an operation module configured to perform a scalar operation on the scalars to be operated on according to the scalar operation type, obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the operation performed by the scalar instruction on the data is a scalar operation, the scalar operation type indicates the kind of scalar operation and the data type of the scalars to be operated on, and the operation domain includes the addresses of the scalars to be operated on and the target address.
Clause M2. The device according to Clause M1, wherein the operation module comprises:
a plurality of scalar operators configured to perform the scalar operation corresponding to the scalar operation type.
Clause M3. The device according to Clause M2, wherein the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodule comprising the plurality of scalar operators,
the master operation submodule being configured to perform the scalar operation using the plurality of scalar operators, pre-process the scalars to be operated on, obtain an operation result, and store the operation result at the target address.
Clause M4. The device according to Clause M1, wherein the operation domain further includes the scalar operation type,
and the control module is further configured to determine the scalar operation type according to the operation domain.
Clause M5. The device according to Clause M1, wherein the operation domain further includes an operation parameter,
the control module is further configured to determine the operation parameter according to the operation domain; and
the operation module is further configured to perform the scalar operation on the scalars to be operated on according to the operation parameter and the scalar operation type.
Clause M6. The device according to Clause M1, wherein the opcode further indicates the scalar operation type,
and the control module is further configured to determine the scalar operation type according to the opcode.
Clause M7. The device according to Clause M1, wherein the kind of operation includes at least one of the following:
addition, summation, multiplication, bitwise AND, bitwise remainder, bitwise absolute value, bitwise division, bitwise OR, bitwise XOR, bitwise NOT, bitwise maximum, bitwise minimum, logical left shift, logical right shift, arithmetic right shift, logical AND, logical OR, logical XOR and logical NOT.
Clause M8. The device according to Clause M1, further comprising:
a storage module configured to store the scalars to be operated on,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store the scalars to be operated on; and
the neuron cache is configured to store the neuron data within the data to be operated on, the neuron data including neuron vector data.
Clause M9. The device according to Clause M1, wherein the control module includes:
an instruction storage sub-module configured to store the compiled scalar instruction;
an instruction processing sub-module configured to parse the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction; and
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the instructions to be executed including the compiled scalar instruction.
Clause M10. The device according to Clause M9, wherein the control module further includes:
a dependency processing sub-module configured to, when it determines that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
wherein the first instruction to be executed being associated with the zeroth instruction to be executed that precedes it means that:
a first storage address interval, which stores the data required by the first instruction to be executed, overlaps a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
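The association test in Clause M10 reduces to an interval-overlap check. In the sketch below, each interval is assumed to be a half-open byte range `[start, end)`; this is an illustrative choice, since the disclosure does not fix a representation. `AddressInterval`, `depends_on`, and `can_issue` are hypothetical names. When `can_issue` returns False, the first instruction would stay buffered until the earlier instruction finishes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AddressInterval:
    """Half-open byte range [start, end) holding data an instruction needs (assumed form)."""
    start: int
    end: int


def intervals_overlap(a: AddressInterval, b: AddressInterval) -> bool:
    # Two half-open ranges share an address iff each starts before the other ends.
    return a.start < b.end and b.start < a.end


def depends_on(first: Sequence[AddressInterval], zeroth: Sequence[AddressInterval]) -> bool:
    """True if any interval used by the first instruction overlaps one used by the zeroth."""
    return any(intervals_overlap(f, z) for f in first for z in zeroth)


def can_issue(first: Sequence[AddressInterval],
              in_flight: List[Sequence[AddressInterval]]) -> bool:
    """Issue `first` only if it overlaps no earlier instruction that is still executing.

    `in_flight` holds one interval list per earlier, unfinished instruction.
    """
    return not any(depends_on(first, earlier) for earlier in in_flight)
```

Ranges that merely touch, such as `[0, 16)` and `[16, 32)`, share no address and so create no dependency under this convention.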
Clause M11. The device according to Clause M1,
wherein the control module is further configured to generate an assembly file according to the scalar instruction and translate the assembly file into a binary file,
the binary file being the compiled scalar instruction.
Clause M12. A machine learning operation device, the device comprising:
one or more scalar instruction processing devices according to any one of Clauses M1 to M11, configured to obtain scalars to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of the scalar instruction processing devices, the plurality of scalar instruction processing devices may be connected through a specific structure and transmit data to one another;
wherein the plurality of scalar instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of scalar instruction processing devices share one control system or have their own control systems; the plurality of scalar instruction processing devices share memory or have their own memories; and the plurality of scalar instruction processing devices may be interconnected in any interconnection topology.
Clause M13. A combined processing device, the combined processing device comprising:
the machine learning operation device according to Clause M12, a universal interconnection interface, and other processing devices;
the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
wherein the combined processing device further includes a storage device, connected to both the machine learning operation device and the other processing devices, for storing data of the machine learning operation device and the other processing devices.
Clause M14. A machine learning chip, the machine learning chip comprising:
the machine learning operation device according to Clause M12 or the combined processing device according to Clause M13.
Clause M15. An electronic device, the electronic device comprising:
the machine learning chip according to Clause M14.
Clause M16. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause M14;
wherein the machine learning chip is connected to each of the storage device, the control device, and the interface device;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause M17. A scalar instruction processing method, applied to a scalar instruction processing device that includes a control module and an operation module, the method comprising:
compiling, by the control module, an obtained scalar instruction to obtain a compiled scalar instruction; parsing the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction; obtaining, according to the operation code and the operation domain, the scalar to be operated on and the target address required to execute the scalar instruction; and determining the scalar operation type of the scalar instruction;
performing, by the operation module, a scalar operation on the scalar to be operated on according to the scalar operation type to obtain an operation result, and storing the operation result at the target address,
wherein the operation code indicates that the operation the scalar instruction performs on data is a scalar operation, the scalar operation type indicates the kind of scalar operation to perform and the data type of the scalar to be operated on, and the operation domain includes the address of the scalar to be operated on and the target address.
Clause M18. The method according to Clause M17, wherein performing the scalar operation on the scalar to be operated on according to the scalar operation type includes:
performing, with a plurality of scalar operators in the operation module, the scalar operation corresponding to the scalar operation type.
Clause M19. The method according to Clause M18, wherein the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of scalar operators,
and performing the scalar operation on the scalar to be operated on according to the scalar operation type to obtain an operation result, and storing the operation result at the target address, includes:
performing, with the plurality of scalar operators in the master operation sub-module, the scalar operation corresponding to the scalar operation type to obtain the operation result, and storing the operation result at the target address.
Clause M20. The method according to Clause M17, wherein the operation domain further includes the scalar operation type,
and determining the scalar operation type of the scalar instruction includes:
determining the scalar operation type according to the operation domain.
Clause M21. The method according to Clause M17, wherein the operation domain further includes an operation parameter,
obtaining, according to the operation code and the operation domain, the scalar to be operated on and the target address required to execute the scalar instruction further includes:
determining the operation parameter according to the operation domain;
and performing the scalar operation on the scalar to be operated on according to the scalar operation type includes:
performing the scalar operation on the scalar to be operated on according to the operation parameter and the scalar operation type.
Clause M22. The method according to Clause M17, wherein the operation code is further used to indicate the scalar operation type, and determining the scalar operation type of the scalar instruction includes:
determining the scalar operation type according to the operation code.
Clause M23. The method according to Clause M17, wherein the operation kind includes at least one of the following:
addition, summation, multiplication, bitwise AND, bitwise remainder, bitwise absolute value, bitwise division, bitwise OR, bitwise XOR, bitwise NOT, bitwise maximum, bitwise minimum, logical left shift, logical right shift, arithmetic right shift, logical AND, logical OR, logical XOR, and logical NOT.
Clause M24. The method according to Clause M17, the method further comprising:
storing the scalar to be operated on in a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store the scalar to be operated on; and
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause M25. The method according to Clause M17, wherein parsing the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction includes:
storing the compiled scalar instruction;
parsing the compiled scalar instruction to obtain the operation code and operation domain of the scalar instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the instructions to be executed including the compiled scalar instruction.
Clause M26. The method according to Clause M25, the method further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after it is determined that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed being associated with the zeroth instruction to be executed that precedes it means that:
a first storage address interval, which stores the data required by the first instruction to be executed, overlaps a zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause M27. The method according to Clause M17, wherein compiling the obtained scalar instruction to obtain the compiled scalar instruction includes:
generating an assembly file according to the scalar instruction and translating the assembly file into a binary file,
the binary file being the compiled scalar instruction.
Clause M28. A non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the method according to any one of Clauses M17 to M27.
FIG. 14-1 shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 14-1, the device includes a control module 14-11 and an operation module 14-12.
The control module 14-11 is configured to:
- compile an obtained scalar type conversion instruction to obtain a compiled scalar type conversion instruction;
- parse the compiled instruction to obtain the operation code and operation domain of the scalar type conversion instruction;
- obtain, according to the operation code and the operation domain, the scalar to be operated on and the target address required to execute the instruction; and
- determine the target data type and the initial data type of the scalar to be operated on.
The operation code indicates that the operation the scalar type conversion instruction performs on data is a scalar type conversion. The operation domain includes the address of the scalar to be operated on and the target address.
The operation module 14-12 is configured to perform a scalar type conversion, according to the target data type, on the scalar to be operated on, which has the initial data type. It obtains an operation result and stores it at the target address. The data type of the operation result is the target data type.
In this embodiment, the scalar type conversion instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. The compiled instruction can be parsed only after compilation. The compiled scalar type conversion instruction is a hardware instruction that the hardware can execute directly. The control module may obtain the scalar to be operated on from the address of the scalar to be operated on. The control module may obtain the scalar type conversion instruction and the scalar to be operated on through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), that specifies the operation to be performed in a computer program. It serves as the instruction sequence number, telling the device that executes the instruction which instruction to execute. The operation domain may be the source of all the data required to execute the corresponding instruction, including parameter data, the scalar to be operated on, and the corresponding operation method. A scalar type conversion instruction must include an operation code and an operation domain. The operation domain includes at least the address of the scalar to be operated on and the target address.
It should be understood that those skilled in the art may set the instruction format of the scalar type conversion instruction, and the operation code and operation domain it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that module may receive the scalar type conversion instruction and control one or more operation modules to perform the scalar type conversion. When the device includes a plurality of control modules, each control module may receive a scalar type conversion instruction and control its corresponding one or more operation modules to perform the scalar type conversion.
The scalar type conversion instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module, which work as follows:
- The control module compiles an obtained scalar type conversion instruction to obtain a compiled scalar type conversion instruction, and parses the compiled instruction to obtain its operation code and operation domain.
- According to the operation code and the operation domain, the control module obtains the scalar to be operated on and the target address required to execute the instruction, and determines the target data type and the initial data type of the scalar to be operated on.
- The operation module performs a scalar type conversion, according to the target data type, on the scalar to be operated on, which has the initial data type. It obtains an operation result and stores it at the target address.
The device has a wide range of application. It processes scalar type conversion instructions, and therefore performs scalar type conversions, efficiently and quickly.
FIG. 14-2a shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 14-2a, the operation module 14-12 may include a plurality of scalar operators 14-120 for performing the scalar type conversion.
In this implementation, the operation module may also include a single scalar operator. The number of scalar operators may be set according to the amount of data to be converted and the required processing speed and efficiency of the scalar type conversion; the present disclosure does not limit this.
FIG. 14-2b shows a block diagram of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 14-2b, the operation module 14-12 may include a master operation sub-module 14-121 and a plurality of slave operation sub-modules 14-122. The master operation sub-module 14-121 may include a plurality of scalar operators 14-120 (not shown in the figure).
The master operation sub-module 14-121 is configured to perform the scalar type conversion using the plurality of scalar operators 14-120 to obtain an operation result, and to store the operation result at the target address.
In a possible implementation, the operation domain may further include the initial data type and the target data type. The control module 14-11 is further configured to determine the target data type and the initial data type of the scalar to be operated on according to the operation domain.
In a possible implementation, the operation code may also indicate the initial data type and the target data type. The control module 14-11 is further configured to determine the target data type and the initial data type of the scalar to be operated on according to the operation code.
In a possible implementation, the initial data type and/or the target data type may not be determinable from either the operation code or the operation domain. In that case, they may be determined from a preset default initial data type and a preset default target data type. The preset default initial data type may be taken as the initial data type of the current scalar type conversion instruction, and the preset default target data type as its target data type. Those skilled in the art may set how the target data type and the initial data type are determined according to actual needs; the present disclosure does not limit this.
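The fallback rule above can be sketched as a priority lookup. The disclosure gives no priority between the operation code and the operation domain, so this sketch simply assumes the order operation code, then operation domain, then preset defaults. The default values `u32` and `cvtf32` are hypothetical placeholders chosen only for illustration.

```python
from typing import Optional, Tuple

# Hypothetical preset defaults; the disclosure does not name specific default types.
DEFAULT_INITIAL_TYPE = "u32"
DEFAULT_TARGET_TYPE = "cvtf32"


def first_known(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is not None."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def resolve_types(opcode_initial: Optional[str] = None,
                  opcode_target: Optional[str] = None,
                  field_initial: Optional[str] = None,
                  field_target: Optional[str] = None) -> Tuple[str, str]:
    """Return (initial type, target type).

    Assumed order: operation code, then operation domain, then preset defaults.
    """
    initial = first_known(opcode_initial, field_initial, DEFAULT_INITIAL_TYPE)
    target = first_known(opcode_target, field_target, DEFAULT_TARGET_TYPE)
    return initial, target
```

Each type is resolved independently, so an instruction that names only its initial type still receives the default target type.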
In a possible implementation, the target data type may be any one of the following: 16-bit, 32-bit, or 48-bit floating point, or 16-bit, 32-bit, or 48-bit integer. The initial data type may be any one of the following: 16-bit, 32-bit, or 48-bit signed numbers; 16-bit, 32-bit, or 48-bit unsigned numbers; or the pointer data type.
In this implementation, the target data type and the initial data type may also be other data types, such as a 64-bit integer. Those skilled in the art may set the target data type and the initial data type according to actual needs, as long as the data type indicated by the target data type differs from that indicated by the initial data type; the present disclosure does not limit this.
In this implementation, identifiers (or codes), such as numbers or names, may be assigned to the target and initial data types above. The target and initial data types indicated by a scalar type conversion instruction can then be determined from the identifiers (or codes) in the instruction. For example:
- 16-bit, 32-bit, and 48-bit floating-point numbers may be identified as cvtf16, cvtf32, and cvtf48.
- 16-bit, 32-bit, and 48-bit integers may be identified as cvti16, cvti32, and cvti48.
- 16-bit, 32-bit, and 48-bit signed numbers may be identified as s16, s32, and s48.
- 16-bit, 32-bit, and 48-bit unsigned numbers may be identified as u16, u32, and u48.
- The pointer data type may be identified as ptr.
Those skilled in the art may set these identifiers according to actual needs; the present disclosure does not limit this.
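The example identifiers above can be held in two lookup tables, one for target types and one for initial types. The sketch below assumes an illustrative `(kind, bit width)` description for each type. The text does not state a pointer width, so the pointer entry is left as `None`. `lookup_types` is a hypothetical helper name.

```python
from typing import Dict, Optional, Tuple

TypeDesc = Tuple[str, Optional[int]]  # (kind, bit width); assumed representation

# Target data type identifiers from the example above.
TARGET_TYPES: Dict[str, TypeDesc] = {
    "cvtf16": ("float", 16), "cvtf32": ("float", 32), "cvtf48": ("float", 48),
    "cvti16": ("int", 16), "cvti32": ("int", 32), "cvti48": ("int", 48),
}

# Initial data type identifiers; the pointer width is not specified, hence None.
INITIAL_TYPES: Dict[str, TypeDesc] = {
    "s16": ("signed", 16), "s32": ("signed", 32), "s48": ("signed", 48),
    "u16": ("unsigned", 16), "u32": ("unsigned", 32), "u48": ("unsigned", 48),
    "ptr": ("pointer", None),
}


def lookup_types(target_id: str, initial_id: str) -> Tuple[TypeDesc, TypeDesc]:
    """Translate identifiers taken from an instruction into type descriptions."""
    if target_id not in TARGET_TYPES:
        raise ValueError(f"unknown target data type identifier: {target_id}")
    if initial_id not in INITIAL_TYPES:
        raise ValueError(f"unknown initial data type identifier: {initial_id}")
    return TARGET_TYPES[target_id], INITIAL_TYPES[initial_id]
```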
In a possible implementation, as shown in FIGS. 14-2a and 14-2b, the device may further include a storage module 14-13, which is configured to store the scalar to be operated on.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache may be used to store data to be operated on, and the register may be used to store the scalar to be operated on.
In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the neuron random access memory mentioned above, may be used to store neuron data among the data to be operated on, and the neuron data may include neuron vector data. The data to be operated on includes data related to the scalar type conversion and/or data related to operations of other computing instructions.
In a possible implementation, the device may further include a direct memory access (DMA) module for reading data from and storing data to the storage module.
In a possible implementation, the control module 14-11 may be further configured to generate an assembly file according to the scalar type conversion instruction and translate the assembly file into a binary file, the binary file being the compiled scalar type conversion instruction.
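To make the assembly-to-binary step concrete, the sketch below packs one textual instruction into a single integer word and decodes it again. The binary layout is entirely hypothetical, as are the numeric codes assigned to opcodes and types; the disclosure does not specify any encoding. The assumed layout is an 8-bit opcode, two 4-bit type codes, and two 24-bit addresses, 64 bits in total.

```python
# Hypothetical 64-bit layout (NOT specified by the disclosure):
# [63:56] opcode | [55:52] target type | [51:48] initial type | [47:24] dst | [23:0] src0
OPCODES = {"scalar": 0x01}
TARGET_CODES = {"cvtf16": 0, "cvtf32": 1, "cvtf48": 2, "cvti16": 3, "cvti32": 4, "cvti48": 5}
INITIAL_CODES = {"s16": 0, "s32": 1, "s48": 2, "u16": 3, "u32": 4, "u48": 5, "ptr": 6}
ADDR_BITS = 24
ADDR_MASK = (1 << ADDR_BITS) - 1


def assemble(line: str) -> int:
    """Translate 'scalar dst src0 target.initial' into one binary instruction word."""
    mnemonic, dst, src0, type_spec = line.split()
    target, initial = type_spec.split(".")
    dst_addr, src_addr = int(dst), int(src0)
    if not (0 <= dst_addr <= ADDR_MASK and 0 <= src_addr <= ADDR_MASK):
        raise ValueError("address does not fit in the hypothetical 24-bit field")
    word = OPCODES[mnemonic]
    word = (word << 4) | TARGET_CODES[target]
    word = (word << 4) | INITIAL_CODES[initial]
    word = (word << ADDR_BITS) | dst_addr
    word = (word << ADDR_BITS) | src_addr
    return word


def disassemble(word: int) -> str:
    """Inverse of assemble(): recover the textual instruction from a binary word."""
    src_addr = word & ADDR_MASK
    dst_addr = (word >> ADDR_BITS) & ADDR_MASK
    initial_code = (word >> (2 * ADDR_BITS)) & 0xF
    target_code = (word >> (2 * ADDR_BITS + 4)) & 0xF
    opcode = (word >> (2 * ADDR_BITS + 8)) & 0xFF
    mnemonic = {v: k for k, v in OPCODES.items()}[opcode]
    target = {v: k for k, v in TARGET_CODES.items()}[target_code]
    initial = {v: k for k, v in INITIAL_CODES.items()}[initial_code]
    return f"{mnemonic} {dst_addr} {src_addr} {target}.{initial}"
```

Under this assumed layout, `assemble("scalar 500 100 cvtf16.u32")` produces `0x01040001F4000064`, and `disassemble` reverses it.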
In a possible implementation, the instruction format of the scalar type conversion instruction may be:
scalar dst src0 opcode.type
Here, scalar is the operation code of the scalar type conversion instruction. The fields dst, src0, and opcode.type form its operation domain:
- dst is the target address.
- src0 is the address of the scalar to be operated on.
- The opcode part of opcode.type is the target data type.
- The type part of opcode.type is the initial data type of the scalar to be operated on.
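A decoder for this first format only needs to split on whitespace and then split the last field at the dot. In the sketch below, addresses are assumed to be written as decimal integers, and `ScalarConvert` and `parse_scalar_dst_src` are hypothetical names.

```python
from typing import NamedTuple


class ScalarConvert(NamedTuple):
    dst: int           # target address
    src0: int          # address of the scalar to be operated on
    target_type: str   # opcode part of opcode.type, e.g. "cvtf16"
    initial_type: str  # type part of opcode.type, e.g. "u32"


def parse_scalar_dst_src(text: str) -> ScalarConvert:
    """Split 'scalar dst src0 opcode.type' into its operation-domain fields."""
    fields = text.split()
    if len(fields) != 4 or fields[0] != "scalar":
        raise ValueError(f"not a 'scalar dst src0 opcode.type' instruction: {text!r}")
    _, dst, src0, type_spec = fields
    target, sep, initial = type_spec.partition(".")
    if not sep or not target or not initial:
        raise ValueError(f"malformed opcode.type field: {type_spec!r}")
    # Decimal addresses are an assumption made for this sketch.
    return ScalarConvert(int(dst), int(src0), target, initial)
```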
In a possible implementation, the instruction format of the scalar type conversion instruction may also be:
opcode.scalar.type dst src0
Here, opcode.scalar.type is the operation code of the scalar type conversion instruction, and dst and src0 form its operation domain. Within opcode.scalar.type:
- opcode indicates the target data type.
- type indicates the initial data type of the scalar to be operated on.
- scalar indicates that the instruction is a scalar type conversion instruction.
In the operation domain, dst is the target address and src0 is the address of the scalar to be operated on.
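In this second format, the type information moves into the dotted mnemonic. The sketch below decodes it under the same assumption of decimal addresses. The sample instruction `cvtf16.scalar.u32 500 100` is hypothetical: it simply re-expresses the application example below in this format.

```python
from typing import NamedTuple


class ScalarConvert(NamedTuple):
    dst: int           # target address
    src0: int          # address of the scalar to be operated on
    target_type: str   # the opcode part of opcode.scalar.type
    initial_type: str  # the type part of opcode.scalar.type


def parse_dotted_opcode(text: str) -> ScalarConvert:
    """Split 'opcode.scalar.type dst src0' into types and addresses."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 'opcode.scalar.type dst src0': {text!r}")
    mnemonic, dst, src0 = fields
    parts = mnemonic.split(".")
    # The middle component marks the instruction as a scalar type conversion.
    if len(parts) != 3 or parts[1] != "scalar":
        raise ValueError(f"not a scalar type conversion mnemonic: {mnemonic!r}")
    target, _, initial = parts
    return ScalarConvert(int(dst), int(src0), target, initial)
```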
It should be understood that those skilled in the art may set the operation code of the scalar type conversion instruction, and the positions of the operation code and operation domain within the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the scalar type conversion instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application example
The following takes "performing a scalar type conversion operation with the scalar type conversion instruction processing device" as an exemplary application scenario. It gives an application example according to an embodiment of the present disclosure to aid understanding of the device's processing flow. Those skilled in the art should understand that this application example is provided solely to facilitate understanding of the embodiments of the present disclosure and should not be construed as limiting them.
FIG. 14-3 shows a schematic diagram of an application scenario of a scalar type conversion instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 14-3, the device processes a scalar type conversion instruction as follows:
The control module 14-11 compiles the obtained scalar type conversion instruction 1 to obtain the compiled scalar type conversion instruction 1. It then parses the compiled instruction (for example, scalar 500 100 cvtf16.u32) to obtain its opcode and operation field. In this instruction, the opcode is scalar, the target address is 500, and the address of the scalar to be operated on is 100. The target data type is cvtf16 (that is, a 16-bit floating-point number), and the initial data type of the scalar is u32 (that is, a 32-bit unsigned number). The control module 14-11 fetches the scalar to be operated on from address 100.
The operation module 14-12 performs the scalar type conversion operation on the scalar of the initial data type according to the target data type, that is, it converts the 32-bit unsigned scalar into a 16-bit floating-point number. It then stores the operation result at target address 500.
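The application example can be modeled in software roughly as follows. This is a sketch only. It assumes a word-addressed memory represented as a Python dict, IEEE 754 binary16 for "16-bit floating-point number", and Python's round-to-nearest-even behavior for `struct`'s `e` format. The disclosure does not say how the hardware rounds or handles overflow. The function names are hypothetical.

```python
import struct


def u32_to_f16_bits(value: int) -> int:
    """Convert a 32-bit unsigned integer to IEEE 754 half-precision bits.

    struct's 'e' format rounds to nearest, ties to even, and raises
    OverflowError above the fp16 range (max 65504). Real hardware might
    saturate or return infinity instead; the text does not say.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value is not a 32-bit unsigned number")
    return struct.unpack("<H", struct.pack("<e", float(value)))[0]


def f16_bits_to_float(bits: int) -> float:
    """Decode half-precision bits back to a Python float (for inspection)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def run_scalar_convert(memory: dict, dst: int, src0: int) -> None:
    """Model of 'scalar 500 100 cvtf16.u32' on a word-addressed memory."""
    scalar = memory[src0]                  # control module: fetch operand at src0
    memory[dst] = u32_to_f16_bits(scalar)  # operation module: convert, store at dst


memory = {100: 1000}
run_scalar_convert(memory, dst=500, src0=100)
print(hex(memory[500]), f16_bits_to_float(memory[500]))  # 0x63d0 1000.0
```

Because binary16 has only 11 significant bits, large u32 inputs lose precision. For example, 4097 becomes 4096.0, and values above 65504 cannot be represented at all.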
For the working process of each of the above modules, refer to the related description above.
In this way, the scalar type conversion instruction processing device can process scalar type conversion instructions efficiently and quickly, so scalar type conversion is performed with high efficiency and at high speed.
FIG. 14-4 shows a flowchart of a scalar type conversion instruction processing method according to an embodiment of the present disclosure. The method can be applied to computer equipment or similar apparatus that includes a memory and a processor. The memory stores data used while executing the method, and the processor performs the related processing and operation steps, such as steps S51-14 and S52-14 below. As shown in FIG. 14-4, the method is applied to the scalar type conversion instruction processing device described above and includes steps S51-14 and S52-14.
In step S51-14, the control module compiles the obtained scalar type conversion instruction to obtain a compiled instruction and parses it to obtain the instruction's opcode and operation field. According to the opcode and operation field, it obtains the scalar to be operated on and the target address required to execute the instruction, and determines the target data type and the initial data type of the scalar. The opcode indicates that the operation the instruction performs on the data is a scalar type conversion operation, and the operation field includes the address of the scalar to be operated on and the target address.
In step S52-14, the operation module performs the scalar type conversion operation on the scalar of the initial data type according to the target data type to obtain an operation result, and stores the result at the target address. The data type of the result is the target data type.
In a possible implementation, performing the scalar type conversion operation on the scalar of the initial data type according to the target data type may include:
performing the scalar type conversion operation using multiple scalar operators in the operation module.
In a possible implementation, the operation module includes a master operation submodule and multiple slave operation submodules, and the master operation submodule includes multiple scalar operators. Step S52-14 may then include:
performing the scalar type conversion operation using the multiple scalar operators in the master operation submodule to obtain the operation result, and storing the result at the target address.
In a possible implementation, the operation field may further include the initial data type and the target data type, and step S51-14 may include: determining the target data type and the initial data type of the scalar to be operated on according to the operation field.
In a possible implementation, the opcode also indicates the initial data type and the target data type, and step S51-14 may include: determining the target data type and the initial data type of the scalar to be operated on according to the opcode.
In a possible implementation, the target data type may be any one of a 16-bit floating-point number, a 32-bit floating-point number, a 48-bit floating-point number, a 16-bit integer, a 32-bit integer, and a 48-bit integer. The initial data type may be any one of a 16-bit signed number, a 32-bit signed number, a 48-bit signed number, a 16-bit unsigned number, a 32-bit unsigned number, a 48-bit unsigned number, and a pointer data type.
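The type combinations listed above can be illustrated with a small lookup-table converter. This is a hedged sketch under several assumptions. It covers only the 16- and 32-bit types that Python's `struct` module can encode. It leaves out the 48-bit types and the pointer type, whose encodings the text does not specify. It treats the target "integer" types as signed and assumes little-endian byte order. All type names are hypothetical.

```python
import struct

# Hypothetical type names; widths follow the text, byte order assumed little-endian.
INITIAL_TYPES = {"s16": "<h", "s32": "<i", "u16": "<H", "u32": "<I"}
TARGET_TYPES = {"f16": "<e", "f32": "<f", "i16": "<h", "i32": "<i"}


def convert_scalar(raw: bytes, initial_type: str, target_type: str) -> bytes:
    """Decode raw bytes as initial_type and re-encode the value as target_type."""
    (value,) = struct.unpack(INITIAL_TYPES[initial_type], raw)
    fmt = TARGET_TYPES[target_type]
    if target_type.startswith("f"):
        # Integer to float: may round; struct raises OverflowError if out of range.
        return struct.pack(fmt, float(value))
    # Integer to signed integer: reject values outside the target range.
    width = struct.calcsize(fmt) * 8
    low, high = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not low <= value <= high:
        raise OverflowError(f"{value} does not fit in {target_type}")
    return struct.pack(fmt, value)
```

Because all listed initial types are integer or pointer types, the only lossy cases are rounding into a narrow floating-point type and range overflow. Whether the hardware saturates or reports an error in those cases is not stated in the disclosure.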
In a possible implementation, the method may further include: storing the scalar to be operated on using a storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store the scalar to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In a possible implementation, step S51-14 may include:
storing the compiled scalar type conversion instruction;
parsing the compiled scalar type conversion instruction to obtain its opcode and operation field; and
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, which may include the compiled scalar type conversion instruction.
In a possible implementation, the method may further include:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and, after determining that execution of the zeroth to-be-executed instruction has completed, controlling execution of the first to-be-executed instruction,
where the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
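The association test described above comes down to an interval-overlap check. The sketch below assumes that each instruction's data occupies one contiguous, half-open address interval [start, end). The disclosure does not state whether interval ends are inclusive. `PendingInstruction` and `must_wait` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingInstruction:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval, an assumption)


def ranges_overlap(a: PendingInstruction, b: PendingInstruction) -> bool:
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def must_wait(first: PendingInstruction, unfinished: list) -> bool:
    """True if `first` must be cached until an earlier (zeroth) instruction completes."""
    return any(ranges_overlap(first, zeroth) for zeroth in unfinished)
```

For example, an instruction reading addresses 102 to 109 must wait for an unfinished conversion that uses 100 to 103. One that starts at 104 can proceed.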
In a possible implementation, compiling the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction may include:
generating an assembly file from the scalar type conversion instruction and translating the assembly file into a binary file, where the binary file is the compiled scalar type conversion instruction.
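To make the assembly-to-binary step concrete, here is a toy assembler for a single instruction line in the application-example syntax. The 64-bit layout is entirely hypothetical: an 8-bit opcode, 4-bit target and initial type codes, and 24-bit dst and src0 addresses. The opcode value and type codes are invented as well. The disclosure does not define an encoding.

```python
import struct

# Hypothetical 64-bit layout: [opcode:8][target type:4][initial type:4][dst:24][src0:24]
OPCODES = {"scalar": 0x2A}
TYPE_CODES = {"f16": 0, "f32": 1, "i16": 2, "i32": 3,
              "s16": 4, "s32": 5, "u16": 6, "u32": 7}


def assemble(line: str) -> bytes:
    """Translate one assembly line such as 'scalar 500 100 cvtf16.u32' into 8 bytes."""
    op, dst_text, src_text, types = line.split()
    target, initial = types.removeprefix("cvt").split(".")
    dst, src0 = int(dst_text), int(src_text)
    if not (0 <= dst < 1 << 24 and 0 <= src0 < 1 << 24):
        raise ValueError("address does not fit in 24 bits")
    word = ((OPCODES[op] << 56) | (TYPE_CODES[target] << 52)
            | (TYPE_CODES[initial] << 48) | (dst << 24) | src0)
    return struct.pack("<Q", word)


def disassemble(blob: bytes) -> str:
    """Inverse of assemble(), useful for checking the encoding."""
    (word,) = struct.unpack("<Q", blob)
    types = {code: name for name, code in TYPE_CODES.items()}
    ops = {code: name for name, code in OPCODES.items()}
    return (f"{ops[word >> 56]} {(word >> 24) & 0xFFFFFF} {word & 0xFFFFFF} "
            f"cvt{types[(word >> 52) & 0xF]}.{types[(word >> 48) & 0xF]}")
```

A round trip such as `disassemble(assemble("scalar 500 100 cvtf16.u32"))` returns the original line, which is a quick way to check that any chosen field layout is lossless.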
It should be noted that although the scalar type conversion instruction processing method has been described above using the foregoing embodiment as an example, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, provided that the configuration conforms to the technical solutions of the present disclosure.
The scalar type conversion instruction processing method provided by the embodiments of the present disclosure has a wide range of application and processes scalar type conversion instructions, and hence performs scalar type conversion, with high efficiency and at high speed.
The foregoing may be better understood in light of the following clauses:
Clause N1. A scalar type conversion instruction processing device, the device comprising:
a control module configured to compile an obtained scalar type conversion instruction to obtain a compiled scalar type conversion instruction and parse it to obtain the opcode and operation field of the scalar type conversion instruction. The control module is further configured to obtain, according to the opcode and the operation field, the scalar to be operated on and the target address required to execute the instruction, and to determine a target data type and an initial data type of the scalar to be operated on; and
an operation module configured to perform a scalar type conversion operation on the scalar to be operated on of the initial data type according to the target data type, obtain an operation result, and store the operation result at the target address, the data type of the operation result being the target data type,
wherein the opcode indicates that the operation performed on the data by the scalar type conversion instruction is a scalar type conversion operation, and the operation field includes the address of the scalar to be operated on and the target address.
Clause N2. The device according to Clause N1, wherein the operation module comprises:
multiple scalar operators configured to perform the scalar type conversion operation.
Clause N3. The device according to Clause N2, wherein the operation module comprises a master operation submodule and multiple slave operation submodules, the master operation submodule comprising the multiple scalar operators,
and the master operation submodule is configured to perform the scalar type conversion operation using the multiple scalar operators, obtain an operation result, and store the operation result at the target address.
Clause N4. The device according to Clause N1, wherein the operation field further includes the initial data type and the target data type,
and the control module is further configured to determine the target data type and the initial data type of the scalar to be operated on according to the operation field.
Clause N5. The device according to Clause N1, wherein the opcode further indicates the initial data type and the target data type,
and the control module is further configured to determine the target data type and the initial data type of the scalar to be operated on according to the opcode.
Clause N6. The device according to Clause N1, wherein the target data type is any one of a 16-bit floating-point number, a 32-bit floating-point number, a 48-bit floating-point number, a 16-bit integer, a 32-bit integer, and a 48-bit integer. The initial data type is any one of a 16-bit signed number, a 32-bit signed number, a 48-bit signed number, a 16-bit unsigned number, a 32-bit unsigned number, a 48-bit unsigned number, and a pointer data type.
Clause N7. The device according to Clause N1, the device further comprising:
a storage module configured to store the scalar to be operated on,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store the scalar to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause N8. The device according to Clause N1, wherein the control module comprises:
an instruction storage submodule configured to store the compiled scalar type conversion instruction;
an instruction processing submodule configured to parse the compiled scalar type conversion instruction to obtain its opcode and operation field; and
a queue storage submodule configured to store an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled scalar type conversion instruction.
Clause N9. The device according to Clause N8, wherein the control module further comprises:
a dependency processing submodule configured to act when it determines that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it. In that case, it caches the first to-be-executed instruction in the instruction storage submodule. After execution of the zeroth to-be-executed instruction has completed, it fetches the first to-be-executed instruction from the instruction storage submodule and sends it to the operation module,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause N10. The device according to Clause N1,
wherein the control module is further configured to generate an assembly file from the scalar type conversion instruction and translate the assembly file into a binary file,
the binary file being the compiled scalar type conversion instruction.
Clause N11. A machine learning computing device, the device comprising:
one or more scalar type conversion instruction processing devices according to any one of Clauses N1 to N10, configured to obtain the scalar to be operated on and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning computing device includes multiple scalar type conversion instruction processing devices, the multiple devices may be connected to one another through a specific structure and transmit data;
wherein the multiple scalar type conversion instruction processing devices are interconnected and transmit data via a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations. The multiple devices either share the same control system or have their own control systems, and either share memory or have their own memories. They may be interconnected in any interconnection topology.
Clause N12. A combined processing device, the combined processing device comprising:
the machine learning computing device according to Clause N11, a universal interconnection interface, and other processing devices;
the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user,
wherein the combined processing device further comprises a storage device connected to the machine learning computing device and to the other processing devices respectively, configured to store data of the machine learning computing device and of the other processing devices.
Clause N13. A machine learning chip, the machine learning chip comprising:
the machine learning computing device according to Clause N11 or the combined processing device according to Clause N12.
Clause N14. An electronic device, the electronic device comprising:
the machine learning chip according to Clause N13.
Clause N15. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause N13;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause N16. A scalar type conversion instruction processing method, applied to a scalar type conversion instruction processing device that comprises a control module and an operation module, the method comprising:
compiling, using the control module, an obtained scalar type conversion instruction to obtain a compiled scalar type conversion instruction; parsing the compiled instruction to obtain the opcode and operation field of the scalar type conversion instruction; obtaining, according to the opcode and the operation field, the scalar to be operated on and the target address required to execute the instruction; and determining a target data type and an initial data type of the scalar to be operated on;
performing, using the operation module, a scalar type conversion operation on the scalar to be operated on of the initial data type according to the target data type to obtain an operation result, and storing the operation result at the target address, the data type of the operation result being the target data type,
wherein the opcode indicates that the operation performed on the data by the scalar type conversion instruction is a scalar type conversion operation, and the operation field includes the address of the scalar to be operated on and the target address.
Clause N17. The method according to Clause N16, wherein performing the scalar type conversion operation on the scalar to be operated on of the initial data type according to the target data type comprises:
performing the scalar type conversion operation using multiple scalar operators in the operation module.
Clause N18. The method according to Clause N17, wherein the operation module comprises a master operation submodule and multiple slave operation submodules, the master operation submodule comprising the multiple scalar operators,
and wherein performing the scalar type conversion operation on the scalar to be operated on of the initial data type according to the target data type to obtain an operation result, and storing the operation result at the target address, comprises:
performing the scalar type conversion operation using the multiple scalar operators in the master operation submodule to obtain the operation result, and storing the operation result at the target address.
Clause N19. The method according to Clause N16, wherein the operation field further includes the initial data type and the target data type,
and determining the target data type and the initial data type of the scalar to be operated on comprises:
determining the target data type and the initial data type of the scalar to be operated on according to the operation field.
Clause N20. The method according to Clause N16, wherein the opcode further indicates the initial data type and the target data type,
and determining the target data type and the initial data type of the scalar to be operated on comprises:
determining the target data type and the initial data type of the scalar to be operated on according to the opcode.
Clause N21. The method according to Clause N16, wherein the target data type is any one of a 16-bit floating-point number, a 32-bit floating-point number, a 48-bit floating-point number, a 16-bit integer, a 32-bit integer, and a 48-bit integer. The initial data type is any one of a 16-bit signed number, a 32-bit signed number, a 48-bit signed number, a 16-bit unsigned number, a 32-bit unsigned number, a 48-bit unsigned number, and a pointer data type.
Clause N22. The method according to Clause N17, the method further comprising:
storing the scalar to be operated on using a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store the scalar to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause N23. The method according to Clause N16, wherein parsing the compiled scalar type conversion instruction to obtain the opcode and operation field of the scalar type conversion instruction comprises:
storing the compiled scalar type conversion instruction;
parsing the compiled scalar type conversion instruction to obtain its opcode and operation field; and
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the compiled scalar type conversion instruction.
Clause N24. The method according to Clause N23, the method further comprising:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction and, after determining that execution of the zeroth to-be-executed instruction has completed, controlling execution of the first to-be-executed instruction,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping a zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause N25. The method according to Clause N16, wherein compiling the obtained scalar type conversion instruction to obtain the compiled scalar type conversion instruction comprises:
generating an assembly file from the scalar type conversion instruction and translating the assembly file into a binary file,
the binary file being the compiled scalar type conversion instruction.
Clause N26. A non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the method according to any one of Clauses N16 to N25.
With the widespread use of neural network algorithms and the continuing improvement in the computing capability of computer hardware, the types and volume of data operations involved in practical applications keep increasing. Programming languages are diverse, and there is currently no address fetch instruction broadly applicable across them. In the related art, technicians must therefore define instructions specific to their own programming language environment to implement address fetch processing, which makes that processing inefficient and slow. The present disclosure provides an address fetch instruction processing method and device, computer equipment, and a storage medium that implement address fetch processing with a single instruction. This can significantly improve the efficiency and speed of address fetch processing.
FIG. 15-1 shows a block diagram of an address fetch instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 15-1, the device includes a control module 15-11 and a processing module 15-12.
The control module 15-11 is configured to compile an obtained address fetch instruction to obtain a compiled address fetch instruction, and to parse the compiled instruction to obtain its opcode and operation field. According to the opcode and operation field, it obtains the address data to be stored and the target address required to execute the address fetch instruction. The opcode indicates that the processing the address fetch instruction performs on the data is address fetch processing, and the operation field includes the initial address of the address data to be stored and the target address.
The processing module 15-12 is configured to process the address data to be stored, obtain the processed address data, and store the processed address data at the target address.
在本实施例中,待存储地址数据可以是表示一个待存储地址或多个待存储地址的数据。取地址指令所指示的取地址处理,可以是获取待存储地址数据并重新存储,以便于在新的地址中可以获取待存储地址数据,进而获得待存储地址数据中记载的待存储地址中的数据。In this embodiment, the address data to be stored may be data representing one address to be stored or a plurality of addresses to be stored. The address fetch processing indicated by the address fetch instruction may be to obtain the address data to be stored and re-store it, so that the address data to be stored can be obtained at the new address, and then the data in the address to be stored recorded in the address data to be stored can be obtained .
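The re-storing semantics can be illustrated with a minimal Python sketch that models memory as a dictionary from address to value. It assumes that each item of address data holds a single address and treats the unspecified "processing" step as an identity copy; the names `lda` and `load_indirect` are hypothetical and are not defined by the disclosure.

```python
from typing import Dict

# Hypothetical flat memory model: address -> stored value.
Memory = Dict[int, int]


def lda(memory: Memory, dst: int, src0: int) -> None:
    """Copy the address data held at src0 into dst.

    The disclosure leaves the "processing" step open; here it is assumed
    to be an identity copy.
    """
    address_data = memory[src0]   # address data to be stored, read from the initial address
    memory[dst] = address_data    # re-stored at the target address


def load_indirect(memory: Memory, pointer_location: int) -> int:
    """Read the address stored at pointer_location, then return the data at that address."""
    return memory[memory[pointer_location]]


mem: Memory = {100: 42, 42: 7}   # address 100 holds address 42; address 42 holds the value 7
lda(mem, dst=500, src0=100)
print(mem[500])                  # 42
print(load_indirect(mem, 500))   # 7
```

After the copy, the data at address 42 is reachable through the new location 500 as well as through the original location 100.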
In this embodiment, the address fetch instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it; only after the compiled address fetch instruction is obtained can it be parsed. The compiled address fetch instruction is a hardware instruction that the hardware can execute directly. The control module may obtain the address data to be stored from the initial address at which it is stored. The control module may obtain the address fetch instruction and the address data to be stored through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), specified in a computer program to identify the operation to be performed. It serves as an instruction serial number that tells the device executing the instruction which instruction to execute. The operation field may be the source of all data required to execute the corresponding instruction, including the address data to be stored, the initial address at which that data is stored, the target address, and so on. An address fetch instruction must include an operation code and an operation field, and the operation field includes at least the initial address of the address data to be stored and the target address.
It should be understood that those skilled in the art can set the instruction format of the address fetch instruction, and the operation code and operation field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more processing modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the apparatus includes one control module, that control module may receive address fetch instructions and control one or more processing modules to perform address fetch processing. When the apparatus includes multiple control modules, each of them may receive address fetch instructions and control its corresponding one or more processing modules to perform address fetch processing.
An embodiment of the present disclosure provides an address fetch instruction processing apparatus that includes a control module and a processing module. The control module is configured to compile an obtained address fetch instruction to obtain a compiled address fetch instruction, parse the compiled instruction to obtain its operation code and operation field, and obtain, according to the operation code and operation field, the address data to be stored and the target address required to execute the instruction. The processing module is configured to process the address data to be stored to obtain processed address data, and to store the processed address data in the target address. The apparatus is widely applicable, and it processes address fetch instructions, and performs address fetch processing, efficiently and quickly.
FIG. 15-2 shows a block diagram of an address fetch instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 15-2, the processing module 15-12 may include a master processing sub-module 15-121 and a plurality of slave processing sub-modules 15-122.
The master processing sub-module 15-121 is configured to process the address data to be stored to obtain processed address data, and to store the processed address data in the target address.
In a possible implementation, the operation field may further include an initial storage space identifier and a target storage space identifier. The control module 15-11 is further configured to determine, according to the operation field, the initial storage space identifier, the target storage space identifier, the initial address, and the target address, and to obtain the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier. In this case, storing the processed address data in the target address may include storing it at the target address in the target storage space identified by the target storage space identifier.
In this implementation, the initial storage space identifier may be a number, name, or other identifier of the initial storage space, and the target storage space identifier may likewise be a number, name, or other identifier of the target storage space. The target storage space may differ from the initial storage space: the target storage space may be a storage space such as a cache of the apparatus, and the initial storage space may be a storage space of the apparatus other than the cache, such as its NRAM, WRAM, or DDR. Here, NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (CNT); WRAM (Window RAM) is a type of VRAM (Video RAM); and DDR (DDR SDRAM) is double data rate synchronous dynamic random access memory. The target storage space may also be the same as the initial storage space; in either case, the address fetch instruction can change the storage location of the address data to be stored, or add a further storage location for it.
In a possible implementation, the operation code may also indicate the initial storage space identifier and the target storage space identifier. The control module 15-11 is further configured to determine the initial storage space identifier and the target storage space identifier according to the operation code, determine the initial address and the target address according to the operation field, and obtain the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier. In this case, storing the processed address data in the target address may include storing it at the target address in the target storage space identified by the target storage space identifier.
In a possible implementation, the initial address may also be marked with the initial storage space in which it lies, so that the control module can obtain the address data to be stored from that initial storage space according to the initial address. Likewise, the target address may be marked with the target storage space in which it lies, so that the control module can determine the target address and its target storage space from the operation field, and the processing module can store the processed address data at the target address in that target storage space.
In a possible implementation, a default initial storage space and a default target storage space may be preset. When the initial storage space and/or the target storage space cannot be determined from either the operation field or the operation code of the address fetch instruction, the default initial storage space may be taken as the initial storage space in which the initial address of the current address fetch instruction lies, and the default target storage space may be taken as the target storage space in which its target address lies.
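A minimal Python sketch of how storage spaces might be resolved, including the fallback to preset defaults. The disclosure presents the operation code, the operation field, and the defaults as alternatives; the lookup order used here (operation code, then operation field, then default) is an assumption, as are the default names `ddr` and `cache` and the helper names `pick` and `resolve_spaces`. The initial and target spaces are resolved independently, matching the "and/or" wording above.

```python
from typing import Optional, Tuple

# Hypothetical preset defaults; the disclosure only says defaults "may be preset".
DEFAULT_INITIAL_SPACE = "ddr"
DEFAULT_TARGET_SPACE = "cache"


def pick(from_opcode: Optional[str], from_field: Optional[str], default: str) -> str:
    """Return the first identifier that could be determined, else the preset default."""
    if from_opcode is not None:
        return from_opcode
    if from_field is not None:
        return from_field
    return default


def resolve_spaces(opcode_initial: Optional[str] = None,
                   opcode_target: Optional[str] = None,
                   field_initial: Optional[str] = None,
                   field_target: Optional[str] = None) -> Tuple[str, str]:
    """Return (initial_space, target_space) for one address fetch instruction."""
    initial = pick(opcode_initial, field_initial, DEFAULT_INITIAL_SPACE)
    target = pick(opcode_target, field_target, DEFAULT_TARGET_SPACE)
    return initial, target


print(resolve_spaces(opcode_initial="n1", opcode_target="g1"))  # ('n1', 'g1')
print(resolve_spaces(field_target="g2"))                        # ('ddr', 'g2')
print(resolve_spaces())                                         # ('ddr', 'cache')
```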
In a possible implementation, as shown in FIG. 15-2, the apparatus may further include a storage module 15-13, which is configured to store the address data to be stored.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may further include at least one NRAM (Neuron Random Access Memory). The cache is configured to store data to be operated on and address data to be stored. The register is configured to store scalar data in the data to be operated on. The data to be operated on includes data related to the execution of the above computation instruction and/or the address fetch instruction.
In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the neuron random access memory described above, may be configured to store neuron data in the data to be operated on, and the neuron data may include neuron vector data.
In a possible implementation, the apparatus may further include a direct memory access module configured to read data from, or store data to, the storage module.
In a possible implementation, the control module 15-11 may be further configured to generate an assembly file from the address fetch instruction and translate the assembly file into a binary file, the binary file being the compiled address fetch instruction.
In a possible implementation, the instruction format of the address fetch instruction may be:
lda.space1.space2 dst src0
Here, lda.space1.space2 is the operation code of the address fetch instruction, and dst and src0 form its operation field: dst is the target address, and src0 is the initial address at which the address data to be stored is held. Within lda.space1.space2, lda indicates that the instruction is an address fetch instruction, space1 is the target storage space identifier, and space2 is the initial storage space identifier.
In a possible implementation, the instruction format of the address fetch instruction may also be:
lda dst src0 space1 space2
Here, lda is the operation code of the address fetch instruction and indicates that the instruction is an address fetch instruction, while dst, src0, space1, and space2 form its operation field: dst is the target address, src0 is the initial address at which the address data to be stored is held, space1 is the target storage space identifier, and space2 is the initial storage space identifier.
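A minimal Python parser for the two textual formats just described, assigning space1 to the target storage space and space2 to the initial storage space as the format definitions state. It works on assembly-style text rather than on the binary compiled form, which is a simplification; the class and function names and the example identifiers `cache` and `ddr` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DecodedLda:
    dst: int            # target address
    src0: int           # initial address holding the address data to be stored
    target_space: str   # space1
    initial_space: str  # space2


def parse_lda(text: str) -> DecodedLda:
    """Parse either 'lda.space1.space2 dst src0' or 'lda dst src0 space1 space2'."""
    opcode, *operands = text.split()
    mnemonic, *opcode_spaces = opcode.split(".")
    if mnemonic != "lda":
        raise ValueError(f"not an address fetch instruction: {opcode}")
    if len(opcode_spaces) == 2 and len(operands) == 2:    # spaces carried in the operation code
        space1, space2 = opcode_spaces
        dst, src0 = operands
    elif not opcode_spaces and len(operands) == 4:        # spaces carried in the operation field
        dst, src0, space1, space2 = operands
    else:
        raise ValueError(f"malformed address fetch instruction: {text}")
    return DecodedLda(int(dst), int(src0), target_space=space1, initial_space=space2)


# Both formats decode to the same fields.
print(parse_lda("lda.cache.ddr 500 100") == parse_lda("lda 500 100 cache ddr"))  # True
```

Accepting both layouts in one function reflects the point that the position of the storage space identifiers is a format choice, while the decoded fields are the same.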
It should be understood that those skilled in the art can set the operation code of the address fetch instruction, and the positions of the operation code and operation field within the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the address fetch instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application examples
The following application example according to an embodiment of the present disclosure takes "performing address fetch processing with the address fetch instruction processing apparatus" as an exemplary application scenario, to aid understanding of the apparatus's workflow. Those skilled in the art should understand that the following application example only serves to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 15-3a and FIG. 15-3b are schematic diagrams of application scenarios of an address fetch instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 15-3a and FIG. 15-3b, the apparatus processes address fetch instructions as follows:
Example one
As shown in FIG. 15-3a, the control module 15-11 compiles the obtained address fetch instruction 1 to obtain compiled address fetch instruction 1, and parses the compiled instruction (for example, address fetch instruction 1 is lda.n1.g1 500 100) to obtain its operation code and operation field. The operation code of address fetch instruction 1 is lda.n1.g1, from which the initial storage space identifier n1 and the target storage space identifier g1 are determined. The target address is 500, and the initial address at which the address data to be stored is held is 100. The control module 15-11 obtains the address data to be stored from address 100 in the initial storage space identified by n1.
The processing module 15-12 processes the address data to be stored to obtain processed address data 1, and stores it at target address 500 in the target storage space identified by g1.
Example two
As shown in FIG. 15-3b, the control module 15-11 compiles the obtained address fetch instruction 2 to obtain compiled address fetch instruction 2, and parses the compiled instruction (for example, address fetch instruction 2 is lda 501 101 n2 g2) to obtain its operation code and operation field. The operation code of address fetch instruction 2 is lda. The target address is 501, the initial address at which the address data to be stored is held is 101, the initial storage space identifier is n2, and the target storage space identifier is g2. The control module 15-11 obtains the address data to be stored from address 101 in the initial storage space identified by n2.
The processing module 15-12 processes the address data to be stored to obtain processed address data 2, and stores it at target address 501 in the target storage space identified by g2.
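The two examples can be replayed with a short Python sketch that starts from the already-decoded fields stated in the examples (initial space n1, address 100, target space g1, address 500; initial space n2, address 101, target space g2, address 501). The storage contents 0x2000 and 0x3000 are made up for illustration, the default `process` step is an assumed identity function, and `execute_lda` is a hypothetical name.

```python
from typing import Callable, Dict

# Hypothetical storage spaces keyed by identifier; each maps address -> value.
# The address data values 0x2000 and 0x3000 are invented for illustration.
spaces: Dict[str, Dict[int, int]] = {
    "n1": {100: 0x2000},
    "g1": {},
    "n2": {101: 0x3000},
    "g2": {},
}


def execute_lda(spaces: Dict[str, Dict[int, int]],
                initial_space: str, src0: int,
                target_space: str, dst: int,
                process: Callable[[int], int] = lambda x: x) -> None:
    """Fetch address data from (initial_space, src0), process it, and store it at (target_space, dst)."""
    address_data = spaces[initial_space][src0]         # control module: obtain the address data
    spaces[target_space][dst] = process(address_data)  # processing module: process and store


execute_lda(spaces, "n1", 100, "g1", 500)   # example one
execute_lda(spaces, "n2", 101, "g2", 501)   # example two
print(hex(spaces["g1"][500]), hex(spaces["g2"][501]))  # 0x2000 0x3000
```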
For the working process of each of the above modules, refer to the related description above.
In this way, the address fetch instruction processing apparatus can process address fetch instructions, and perform address fetch processing, efficiently and quickly.
FIG. 15-4 shows a flowchart of an address fetch instruction processing method according to an embodiment of the present disclosure. The method may be applied to a device such as a computer device that includes a memory and a processor, where the memory stores data used while the method is performed and the processor performs the related processing and computation steps, such as steps S51-15 and S52-15 below. As shown in FIG. 15-4, the method is applied to the above address fetch instruction processing apparatus and includes steps S51-15 and S52-15.
In step S51-15, the control module compiles the obtained address fetch instruction to obtain a compiled address fetch instruction, parses the compiled instruction to obtain its operation code and operation field, and obtains, according to the operation code and the operation field, the address data to be stored and the target address required to execute the instruction. The operation code indicates that the processing the address fetch instruction performs on data is address fetch processing, and the operation field includes the initial address at which the address data to be stored is held and the target address.
In step S52-15, the processing module processes the address data to be stored to obtain processed address data, and stores the processed address data in the target address.
In a possible implementation, the processing module includes a master processing sub-module and a plurality of slave processing sub-modules, and step S52-15 may include:
processing, by the master processing sub-module, the address data to be stored to obtain processed address data, and storing the processed address data in the target address.
In a possible implementation, the operation field may further include an initial storage space identifier and a target storage space identifier. Obtaining, according to the operation code and the operation field, the address data to be stored and the target address required to execute the address fetch instruction may then include: determining the initial storage space identifier, the target storage space identifier, the initial address, and the target address according to the operation field, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier.
Storing the processed address data in the target address may then include: storing the processed address data at the target address in the target storage space identified by the target storage space identifier.
In a possible implementation, the operation code further indicates the initial storage space identifier and the target storage space identifier. Obtaining, according to the operation code and the operation field, the address data to be stored and the target address required to execute the address fetch instruction may then include: determining the initial storage space identifier and the target storage space identifier according to the operation code, determining the initial address and the target address according to the operation field, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier.
Storing the processed address data in the target address may then include: storing the processed address data at the target address in the target storage space identified by the target storage space identifier.
In a possible implementation, the method may further include: storing the address data to be stored by using the storage module of the apparatus,
where the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on and the address data to be stored, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In a possible implementation, step S51-15 may include:
storing the compiled address fetch instruction;
parsing the compiled address fetch instruction to obtain the operation code and operation field of the address fetch instruction; and
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the compiled address fetch instruction.
In a possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after it is determined that the zeroth instruction to be executed has finished executing, controlling the execution of the first instruction to be executed,
where the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
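The overlap test and the hold-until-finished rule can be sketched in Python as follows. Each queued instruction is reduced to a hypothetical name plus one closed address interval for the data it needs; in this toy model an earlier instruction is treated as finished only at the moment a later, overlapping instruction waits for it. The names `PendingInstruction`, `overlaps`, and `schedule` are illustrative assumptions, not part of the disclosure.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class PendingInstruction:
    name: str
    lo: int  # first address of the data this instruction needs (inclusive)
    hi: int  # last address of the data this instruction needs (inclusive)


def overlaps(a: PendingInstruction, b: PendingInstruction) -> bool:
    """Closed intervals [a.lo, a.hi] and [b.lo, b.hi] overlap iff each starts before the other ends."""
    return a.lo <= b.hi and b.lo <= a.hi


def schedule(queue: Deque[PendingInstruction]) -> List[str]:
    """Issue instructions in queue order; an instruction that overlaps an unfinished
    earlier one is held until that earlier instruction is treated as complete."""
    in_flight: List[PendingInstruction] = []
    log: List[str] = []
    while queue:
        current = queue.popleft()
        for earlier in [p for p in in_flight if overlaps(current, p)]:
            in_flight.remove(earlier)  # simulate waiting until the zeroth instruction finishes
            log.append(f"{current.name} waits for {earlier.name}")
        in_flight.append(current)
        log.append(f"issue {current.name}")
    return log


queue = deque([PendingInstruction("A", 100, 199),
               PendingInstruction("B", 200, 299),
               PendingInstruction("C", 150, 160)])
print(schedule(queue))
# ['issue A', 'issue B', 'C waits for A', 'issue C']
```

Instruction C needs addresses 150 to 160, which fall inside A's interval, so C is held until A completes; B's interval is disjoint from A's, so B is issued without waiting.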
In a possible implementation, compiling the obtained address fetch instruction to obtain the compiled address fetch instruction may include:
generating an assembly file from the address fetch instruction and translating the assembly file into a binary file, the binary file being the compiled address fetch instruction.
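The assembly-to-binary step can be pictured with a small Python sketch built on the standard `struct` module. The disclosure does not define any binary encoding, so everything concrete here is a hypothetical assumption: the assembly text uses the `lda dst src0 space1 space2` form, the numeric codes in `OPCODES` and `SPACES` are invented, and each instruction is packed into one 8-byte little-endian word.

```python
import struct

# Hypothetical numeric codes; the disclosure does not define a binary encoding.
OPCODES = {"lda": 0x01}
SPACES = {"nram": 0, "wram": 1, "ddr": 2, "cache": 3}

# One 8-byte little-endian word: opcode, space1, space2, one padding byte, dst, src0.
WORD = struct.Struct("<BBBxHH")


def assemble_line(line: str) -> bytes:
    """Translate 'lda dst src0 space1 space2' into one fixed-width binary word."""
    mnemonic, dst, src0, space1, space2 = line.split()
    return WORD.pack(OPCODES[mnemonic], SPACES[space1], SPACES[space2], int(dst), int(src0))


def assemble_file(assembly_text: str) -> bytes:
    """Translate a whole assembly file (one instruction per non-blank line) into a binary blob."""
    return b"".join(assemble_line(line) for line in assembly_text.splitlines() if line.strip())


binary = assemble_file("lda 500 100 cache ddr\nlda 501 101 cache nram\n")
print(len(binary))              # 16
print(WORD.unpack(binary[:8]))  # (1, 3, 2, 500, 100)
```

A fixed-width word keeps decoding trivial for the hardware side; a real toolchain would choose field widths to match its address and storage-space ranges.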
It should be noted that although the address fetch instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The address fetch instruction processing method provided by the embodiments of the present disclosure is widely applicable, and it processes address fetch instructions, and performs address fetch processing, efficiently and quickly.
The foregoing may be better understood in light of the following clauses:
Clause O1. An address fetch instruction processing apparatus, comprising:
a control module configured to compile an obtained address fetch instruction to obtain a compiled address fetch instruction, parse the compiled address fetch instruction to obtain an operation code and an operation field of the address fetch instruction, and obtain, according to the operation code and the operation field, address data to be stored and a target address required to execute the address fetch instruction; and
a processing module configured to process the address data to be stored to obtain processed address data, and store the processed address data in the target address,
wherein the operation code indicates that the processing performed by the address fetch instruction on data is address fetch processing, and the operation field includes an initial address at which the address data to be stored is held and the target address.
Clause O2. The apparatus according to Clause O1, wherein the processing module includes a master processing sub-module and a plurality of slave processing sub-modules, and
the master processing sub-module is configured to process the address data to be stored to obtain processed address data, and store the processed address data in the target address.
Clause O3. The apparatus according to Clause O1, wherein the operation field further includes an initial storage space identifier and a target storage space identifier,
the control module is further configured to determine, according to the operation field, the initial storage space identifier, the target storage space identifier, the initial address, and the target address, and obtain the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier; and
storing the processed address data in the target address includes:
storing the processed address data at the target address in the target storage space identified by the target storage space identifier.
Clause O4. The apparatus according to Clause O1, wherein the operation code further indicates an initial storage space identifier and a target storage space identifier,
the control module is further configured to determine the initial storage space identifier and the target storage space identifier according to the operation code, determine the initial address and the target address according to the operation field, and obtain the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier; and
storing the processed address data in the target address includes:
storing the processed address data at the target address in the target storage space identified by the target storage space identifier.
Clause O5. The device according to Clause O1, further comprising:
a storage module configured to store the address data to be stored,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on and the address data to be stored, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause O6. The device according to Clause O1, wherein the control module includes:
an instruction storage submodule configured to store the compiled address fetch instruction;
an instruction processing submodule configured to parse the compiled address fetch instruction to obtain the operation code and the operation domain of the address fetch instruction; and
a queue storage submodule configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled address fetch instruction.
Clause O7. The device according to Clause O6, wherein the control module further includes:
a dependency processing submodule configured to, upon determining that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage submodule and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage submodule and send it to the processing module,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause O8. The device according to Clause O1,
wherein the control module is further configured to generate an assembly file according to the address fetch instruction and translate the assembly file into a binary file,
the binary file being the compiled address fetch instruction.
Clause O9. A machine learning computing device, comprising:
one or more address fetch instruction processing devices according to any one of Clauses O1 to O8, configured to obtain the address data to be stored and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of address fetch instruction processing devices, the plurality of address fetch instruction processing devices may be connected through a specific structure and transmit data to one another;
wherein the plurality of address fetch instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of address fetch instruction processing devices share the same control system or have their own control systems, and share memory or have their own memories; and the plurality of address fetch instruction processing devices may be interconnected in any interconnection topology.
Clause O10. A combined processing device, comprising:
the machine learning computing device according to Clause O9, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user,
and wherein the combined processing device further includes a storage device, which is connected to the machine learning computing device and to the other processing devices, respectively, and is configured to store data of the machine learning computing device and the other processing devices.
Clause O11. A machine learning chip, comprising:
the machine learning computing device according to Clause O9 or the combined processing device according to Clause O10.
Clause O12. An electronic device, comprising:
the machine learning chip according to Clause O11.
Clause O13. A board card, comprising a storage device, an interface device, a control device, and the machine learning chip according to Clause O11;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device, respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause O14. An address fetch instruction processing method, applied to an address fetch instruction processing device that includes a control module and a processing module, the method comprising:
compiling, by the control module, an obtained address fetch instruction to obtain a compiled address fetch instruction; parsing the compiled address fetch instruction to obtain the operation code and the operation domain of the address fetch instruction; and obtaining, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction; and
processing, by the processing module, the address data to be stored to obtain processed address data to be stored, and storing the processed address data to be stored in the target address,
wherein the operation code indicates that the processing the address fetch instruction performs on the data is address fetch processing, and the operation domain includes the initial address at which the address data to be stored is stored and the target address.
Clause O15. The method according to Clause O14, wherein the processing module includes a master processing submodule and a plurality of slave processing submodules,
and wherein processing the address data to be stored to obtain processed address data to be stored, and storing the processed address data to be stored in the target address, includes:
processing, by the master processing submodule, the address data to be stored to obtain processed address data to be stored, and storing the processed address data to be stored in the target address.
Clause O16. The method according to Clause O14, wherein the operation domain further includes an initial storage space identifier and a target storage space identifier;
wherein obtaining, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction includes:
determining the initial storage space identifier, the target storage space identifier, the initial address, and the target address according to the operation domain, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier; and
wherein storing the processed address data to be stored in the target address includes:
storing the processed address data to be stored in the target address in the target storage space identified by the target storage space identifier.
Clause O17. The method according to Clause O14, wherein the operation code is further used to indicate an initial storage space identifier and a target storage space identifier;
wherein obtaining, according to the operation code and the operation domain, the address data to be stored and the target address required to execute the address fetch instruction includes:
determining the initial storage space identifier and the target storage space identifier according to the operation code, determining the initial address and the target address according to the operation domain, and obtaining the address data to be stored from the initial address in the initial storage space identified by the initial storage space identifier; and
wherein storing the processed address data to be stored in the target address includes:
storing the processed address data to be stored in the target address in the target storage space identified by the target storage space identifier.
Clause O18. The method according to Clause O14, further comprising:
storing the address data to be stored by using a storage module of the device,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on and the address data to be stored, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause O19. The method according to Clause O14, wherein parsing the compiled address fetch instruction to obtain the operation code and the operation domain of the address fetch instruction includes:
storing the compiled address fetch instruction;
parsing the compiled address fetch instruction to obtain the operation code and the operation domain of the address fetch instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled address fetch instruction.
Clause O20. The method according to Clause O19, further comprising:
upon determining that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after determining that the zeroth instruction to be executed has finished executing, controlling the execution of the first instruction to be executed,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause O21. The method according to Clause O14, wherein compiling the obtained address fetch instruction to obtain the compiled address fetch instruction includes:
generating an assembly file according to the address fetch instruction, and translating the assembly file into a binary file,
the binary file being the compiled address fetch instruction.
Clause O22. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of Clauses O14 to O21.
With the widespread use of neural network algorithms and the continuing growth in the computing power of computer hardware, the variety and number of data operations involved in practical applications keep increasing. Programming languages are diverse, and there is currently no scalar data migration instruction that applies broadly across them. In the related art, therefore, technicians who need to migrate scalar data must define one or more custom instructions for their particular programming language environment, which makes scalar data migration inefficient and slow. The present disclosure provides a scalar data migration instruction processing method and device, computer equipment, and a storage medium that perform scalar data migration with a single instruction, significantly improving the efficiency and speed of scalar data migration.
FIG. 16-1 shows a block diagram of a scalar data migration instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 16-1, the device includes a control module 16-11 and a processing module 16-12.
The control module 16-11 is configured to compile an obtained scalar data migration instruction to obtain a compiled scalar data migration instruction; parse the compiled scalar data migration instruction to obtain its operation code and operation domain; obtain, according to the operation code and the operation domain, the scalar data to be migrated and the target address required to execute the scalar data migration instruction; and determine the migration parameters required for the migration processing. The operation code indicates that the processing the scalar data migration instruction performs on the scalar data is migration processing. The operation domain includes the address of the scalar data to be migrated and the target address. The migration parameters may include the initial storage space in which the address of the scalar data to be migrated is located, the target storage space in which the target address is located, and the migration type of the migration processing.
The processing module 16-12 is configured to store the scalar data to be migrated in the target address according to the migration parameters.
In this embodiment, there may be one or more pieces of scalar data to be migrated. The migration type may indicate the scalar data storage speed of the initial storage space, the scalar data storage speed of the target storage space, and which of the two is faster. In the scalar data migration instruction, a different code can be assigned to each speed relationship between the initial storage space and the target storage space. For example, the code for the migration type "the storage speed of the initial storage space is greater than that of the target storage space" can be set to "st"; the code for "the storage speed of the initial storage space is equal to that of the target storage space" can be set to "mv"; and the code for "the storage speed of the initial storage space is less than that of the target storage space" can be set to "ld". Those skilled in the art may set the migration types and their codes according to actual needs, which is not limited in the present disclosure.
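The mapping from a speed relationship to a migration-type code can be sketched as follows. The disclosure fixes only the three codes; the relative speed ranks (register faster than NRAM, NRAM faster than DDR) and the space names `reg`, `nram`, and `ddr` are illustrative assumptions, not values from the source.

```python
# Hypothetical relative storage-speed ranks (higher = faster). These are
# assumptions for illustration; the disclosure gives no numeric speeds.
SPEED_RANK = {"reg": 3, "nram": 2, "ddr": 1}


def migration_type(space1: str, space2: str) -> str:
    """Return the migration-type code for moving data from space1 to space2.

    "st": the initial space is faster than the target space
    "mv": both spaces have the same speed
    "ld": the initial space is slower than the target space
    """
    src_speed = SPEED_RANK[space1]
    dst_speed = SPEED_RANK[space2]
    if src_speed > dst_speed:
        return "st"
    if src_speed == dst_speed:
        return "mv"
    return "ld"


print(migration_type("ddr", "nram"))   # ld
print(migration_type("nram", "ddr"))   # st
print(migration_type("nram", "nram"))  # mv
```

Encoding the relationship rather than the individual spaces keeps the code set at three values regardless of how many storage spaces a device exposes.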
In this embodiment, the migration parameters may include identifiers of the initial storage space and the target storage space, such as their names or numbers.
In this embodiment, the initial storage space may be the NRAM, the DDR, a register, or the like of the device, and the target storage space may be the NRAM or the DDR of the device. NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (CNT). DDR (also called DDR SDRAM) is Double Data Rate Synchronous Dynamic Random Access Memory.
In this embodiment, the scalar data migration instruction obtained by the control module is an uncompiled software instruction that hardware cannot execute directly, so the control module must first compile it; only then can the compiled scalar data migration instruction be parsed. The compiled scalar data migration instruction is a hardware instruction that hardware can execute directly. The control module may obtain the scalar data to be migrated from the address of the scalar data to be migrated. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or the field (usually represented as a code), that specifies the operation to be performed in a computer program; it is the instruction number that tells the device executing the instruction which instruction to execute. The operation domain may be the source of all the data required to execute the corresponding instruction, including the target address, the address of the scalar data to be migrated, the initial storage space in which that address is located, the target storage space in which the target address is located, the migration parameters for the migration processing, and so on. A scalar data migration instruction must include an operation code and an operation domain, and the operation domain includes at least the scalar data to be migrated and the target address.
It should be understood that those skilled in the art may set the format of the scalar data migration instruction, as well as the operation code and operation domain it contains, as needed; this is not limited in the present disclosure.
In this embodiment, the device may include one or more control modules and one or more processing modules, and their numbers may be set according to actual needs; this is not limited in the present disclosure. When the device includes one control module, that control module may receive scalar data migration instructions and control one or more processing modules to perform scalar data migration. When the device includes multiple control modules, each control module may receive scalar data migration instructions and control its corresponding one or more processing modules to perform scalar data migration.
The scalar data migration instruction processing device provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to compile an obtained scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled instruction to obtain its operation code and operation domain, obtain according to them the scalar data to be migrated and the target address required to execute the instruction, and determine the migration parameters required for the migration processing. The processing module is configured to store the scalar data to be migrated in the target address according to the migration parameters. The device is widely applicable and processes scalar data migration instructions efficiently and quickly, so scalar data migration itself is efficient and fast.
FIG. 16-2 shows a block diagram of a scalar data migration instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 16-2, the processing module 16-12 may include a master processing submodule 16-121 and a plurality of slave processing submodules 16-122.
The master processing submodule 16-121 is configured to process the scalar data to be migrated to obtain processed scalar data to be migrated, and to store the processed scalar data in the target address. The processing performed on the scalar data to be migrated includes conversions such as data type conversion, which is not limited in the present disclosure.
In a possible implementation, the operation domain may further include a scalar data migration amount. The control module 16-11 is further configured to determine the scalar data migration amount according to the operation domain, and to obtain, from the address of the scalar data to be migrated, the scalar data to be migrated corresponding to that amount.
In this implementation, the scalar data migration amount may be the amount of scalar data to be migrated that is obtained.
In a possible implementation, a default scalar data migration amount may be preset. When the operation domain does not contain a scalar data migration amount, the default amount may be used as the scalar data migration amount of the current scalar data migration instruction, and the corresponding amount of scalar data to be migrated is then obtained from the address of the scalar data to be migrated.
In a possible implementation, when the operation domain does not contain a scalar data migration amount, all of the scalar data to be migrated that is stored at the address of the scalar data to be migrated may be obtained directly.
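The handling of a missing migration amount can be sketched as follows. The disclosure presents the preset default and "take everything stored at the address" as alternative implementations; this sketch combines them (default if one is configured, otherwise everything), which is an illustrative choice. It also assumes, purely for illustration, that each address maps to a list of the scalars stored there; `fetch_scalars` and `default_size` are hypothetical names.

```python
from typing import Dict, List, Optional

# Hypothetical model: each address maps to the scalars stored there.
Storage = Dict[int, List[float]]


def fetch_scalars(storage: Storage, src: int,
                  size: Optional[int] = None,
                  default_size: Optional[int] = None) -> List[float]:
    """Fetch the scalars to migrate from address src.

    - size given in the operation domain: take that many scalars.
    - size absent, default preset: take default_size scalars.
    - size absent, no default: take all scalars stored at src.
    """
    data = storage[src]
    if size is None:
        size = default_size if default_size is not None else len(data)
    return data[:size]


storage = {0x10: [1.0, 2.0, 3.0]}
print(fetch_scalars(storage, 0x10, size=2))          # [1.0, 2.0]
print(fetch_scalars(storage, 0x10, default_size=1))  # [1.0]
print(fetch_scalars(storage, 0x10))                  # [1.0, 2.0, 3.0]
```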
In a possible implementation, the operation domain may further include the migration parameters. In this case, determining the migration parameters required for the migration processing may include determining them according to the operation domain.
In a possible implementation, the operation code may also be used to indicate the migration parameters. In this case, determining the migration parameters required for the migration processing may include determining them according to the operation code.
In a possible implementation, default migration parameters may also be set. When the migration parameters of the current scalar data migration instruction cannot be determined from either the operation domain or the operation code, the default migration parameters may be used as the migration parameters of the current scalar data migration instruction.
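The resolution of migration parameters described above can be sketched as a fallback chain. The disclosure treats the operation domain and the operation code as alternative sources, with the default as the last resort; checking the operation domain before the operation code is an assumption here, as are the hypothetical `MigrationParams` type and its default value.

```python
from typing import NamedTuple, Optional


class MigrationParams(NamedTuple):
    """Hypothetical container for type.space1.space2."""
    kind: str    # migration type: "ld", "st" or "mv"
    space1: str  # initial storage space
    space2: str  # target storage space


# Hypothetical preset default; the disclosure does not specify one.
DEFAULT_PARAMS = MigrationParams("mv", "nram", "nram")


def resolve_params(from_domain: Optional[MigrationParams],
                   from_opcode: Optional[MigrationParams],
                   default: MigrationParams = DEFAULT_PARAMS) -> MigrationParams:
    """Use the operation domain, then the operation code, then the default."""
    if from_domain is not None:
        return from_domain
    if from_opcode is not None:
        return from_opcode
    return default


print(resolve_params(None, MigrationParams("ld", "ddr", "nram")))
# MigrationParams(kind='ld', space1='ddr', space2='nram')
print(resolve_params(None, None))
# MigrationParams(kind='mv', space1='nram', space2='nram')
```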
In a possible implementation, the initial storage space and the target storage space may also be determined from the address of the scalar data to be migrated and the target address, respectively, and the migration parameters may then be determined from properties of the two storage spaces such as their storage speeds and storage space types.
In a possible implementation, as shown in FIG. 16-2, the device may further include a storage module 16-13, which is configured to store the scalar data to be migrated.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may further include at least one NRAM (Neuron Random Access Memory). The cache is configured to store data to be operated on. The register is configured to store the scalar data to be migrated and the scalar data in the data to be operated on. The data to be operated on may be data related to the execution of computing instructions and scalar data migration instructions.
In a possible implementation, the cache may include a neuron cache. The neuron cache, that is, the neuron random access memory described above, may be used to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In a possible implementation, the device may further include a direct memory access (DMA) module configured to read data from or store data into the storage module.
In a possible implementation, the control module 16-11 may be further configured to generate an assembly file according to the scalar data migration instruction and translate the assembly file into a binary file, the binary file being the compiled scalar data migration instruction.
在一种可能的实现方式中,标量数据迁移指令的指令格式可以是:In a possible implementation, the instruction format of the scalar data migration instruction may be:
migrate dst src type.space1.space2 sizemigratedstrcstype.space1.space2size
其中,migrate是标量数据迁移指令的操作码,dst、src0、type.space1.space2、size是标量数据迁移指令的操作域。其中,dst是目标地址,src是待迁移标量数据地址,在待迁移标量数据为多个时,src可以包括多个待迁移标量数据的地址src0,src1,...,srcn,本公开对此不作限制。type.space1.space2是迁移参数,type.space1.space2中的type表示迁移类型,type.space1.space2中的space1表示待迁移标量数据地址src所在的初始存储空间,type.space1.space2中的space2表示目标地址dst所在的目标存储空间。size是标量数据迁移量。Among them, migrate is the operation code of the scalar data migration instruction, and dst, src0, type.space1.space2, and size are the operation fields of the scalar data migration instruction. Where dst is the target address and src is the scalar data address to be migrated. When there are multiple scalar data to be migrated, src may include multiple addresses of scalar data to be migrated src0, src1, ..., srcn No restrictions. type.space1.space2 is the migration parameter, type in type.space1.space2 indicates the migration type, space1 in type.space1.space2 indicates the initial storage space where the scalar data address src to be migrated is located, and space2 in type.space1.space2 Indicates the target storage space where the target address dst is located. size is the amount of scalar data migration.
In a possible implementation, the instruction format of the scalar data migration instruction may alternatively be:
type.space1.space2 dst src size
Here, type.space1.space2 is the opcode of the scalar data migration instruction, and dst, src and size are its operation fields. dst is the target address, and src is the address of the scalar data to be migrated. When there are multiple pieces of scalar data to be migrated, src may include multiple addresses src0, src1, ..., srcn, which the present disclosure does not limit. size is the scalar data migration amount. In the opcode type.space1.space2, type indicates the migration type, space1 indicates the initial storage space in which the address src of the scalar data to be migrated is located, and space2 indicates the target storage space in which the target address dst is located.
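The two textual forms above can be illustrated with a small parser. This is a minimal Python sketch that rests on several assumptions. Operands are taken to be whitespace-separated decimal integers. Storage spaces are kept as opaque strings. The names ScalarMigration and parse are hypothetical. The disclosure does not fix a concrete syntax or encoding.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScalarMigration:
    # Hypothetical decoded form of a scalar data migration instruction.
    mig_type: str    # migration type: "ld", "st" or "mv"
    space1: str      # initial storage space containing src
    space2: str      # target storage space containing dst
    dst: int         # target address
    srcs: List[int]  # address(es) of the scalar data to be migrated: src0 ... srcn
    size: int        # scalar data migration amount

def _split_param(param: str) -> Tuple[str, str, str]:
    # "type.space1.space2" -> (type, space1, space2); raises ValueError if malformed.
    mig_type, space1, space2 = param.split(".")
    if mig_type not in ("ld", "st", "mv"):
        raise ValueError(f"unknown migration type: {mig_type!r}")
    return mig_type, space1, space2

def parse(text: str) -> ScalarMigration:
    """Accept either assumed form:
    'migrate dst src... type.space1.space2 size'  (parameter in an operation field)
    'type.space1.space2 dst src... size'           (parameter in the opcode)
    """
    tok = text.split()
    if tok and tok[0] == "migrate":
        if len(tok) < 5:
            raise ValueError(f"too few fields: {text!r}")
        mig_type, space1, space2 = _split_param(tok[-2])
        srcs = [int(t) for t in tok[2:-2]]
    else:
        if len(tok) < 4:
            raise ValueError(f"too few fields: {text!r}")
        mig_type, space1, space2 = _split_param(tok[0])
        srcs = [int(t) for t in tok[2:-1]]
    return ScalarMigration(mig_type, space1, space2, int(tok[1]), srcs, int(tok[-1]))

# Both spellings of the same instruction decode to the same fields.
a = parse("ld.200.300 500 400 5")
b = parse("migrate 500 400 ld.200.300 5")
print(a == b)  # True
```

Because dst is the first operand in both forms, the two forms differ only in where the migration parameter sits. The parser therefore needs just one branch on the leading token.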
type may be ld, st or mv. ld denotes the migration type "the storage speed of the initial storage space is lower than that of the target storage space"; st denotes "the storage speed of the initial storage space is higher than that of the target storage space"; and mv denotes "the storage speed of the initial storage space is equal to that of the target storage space".
In a possible implementation, the instruction format of a scalar data migration instruction whose migration type is "the storage speed of the initial storage space is lower than that of the target storage space" may be set to: ld.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2 and the migration type ld, scalar data to be migrated in an amount equal to size is obtained from the address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2. Here, the storage speed of space1 is lower than that of space2.
In a possible implementation, the instruction format of a scalar data migration instruction whose migration type is "the storage speed of the initial storage space is higher than that of the target storage space" may be set to: st.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2 and the migration type st, scalar data to be migrated in an amount equal to size is obtained from the address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2. Here, the storage speed of space1 is higher than that of space2.
In a possible implementation, the instruction format of a scalar data migration instruction whose migration type is "the storage speed of the initial storage space is equal to that of the target storage space" may be set to: mv.space1.space2 dst src0 size. According to the scalar data migration amount size, the initial storage space space1, the target storage space space2 and the migration type mv, scalar data to be migrated in an amount equal to size is obtained from the address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2. Here, the storage speed of space1 is equal to that of space2.
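The rule that links the migration type to the relative storage speeds of the two spaces can be sketched as follows. The numeric speed values are a hypothetical stand-in for whatever ordering the hardware defines; only the comparison between them matters. The function names are illustrative.

```python
def migration_type(speed_initial: float, speed_target: float) -> str:
    """Map the relative storage speeds of space1 and space2 to ld/st/mv."""
    if speed_initial < speed_target:
        return "ld"  # initial storage space slower than target storage space
    if speed_initial > speed_target:
        return "st"  # initial storage space faster than target storage space
    return "mv"      # equal storage speeds

def type_is_consistent(mig_type: str, speed_initial: float, speed_target: float) -> bool:
    """Check that an encoded migration type agrees with the two spaces' speeds."""
    return migration_type(speed_initial, speed_target) == mig_type

print(migration_type(1, 10), migration_type(10, 1), migration_type(5, 5))  # ld st mv
```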
It should be understood that those skilled in the art may set the opcode of the scalar data migration instruction, and the positions of the opcode and the operation fields in the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that although the scalar data migration instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following takes "data migration using the scalar data migration instruction processing device" as an exemplary application scenario. It gives an application example according to an embodiment of the present disclosure to help explain the processing flow of the device. Those skilled in the art should understand that this application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 16-3 is a schematic diagram of an application scenario of a scalar data migration instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 16-3, the device processes a scalar data migration instruction as follows:
The control module 16-11 compiles the acquired scalar data migration instruction 1 to obtain the compiled scalar data migration instruction 1. It then parses the compiled instruction 1 (for example, ld.200.300 500 400 5) to obtain its opcode and operation fields. For this instruction, the opcode is ld, the initial storage space is 200, the target storage space is 300, the target address is 500, the address of the scalar data to be migrated is 400, and the scalar data migration amount is 5. From the opcode ld it can be determined that the storage speed of the initial storage space 200 is lower than that of the target storage space 300. The control module 16-11 obtains the scalar data to be migrated, in an amount of 5, from address 400 in the initial storage space 200. The processing module 16-12 then stores this scalar data at the target address 500 in the target storage space 300 according to the migration parameters.
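The example can be traced end to end with a toy model. This sketch makes several assumptions purely for illustration, none of which the disclosure specifies:

- each storage space is a map from integer addresses to scalars;
- the size scalars occupy consecutive addresses starting at src;
- space 200 is assigned a lower relative speed than space 300.

The function name run_migration is hypothetical.

```python
from typing import Dict, List

Memory = Dict[str, Dict[int, float]]  # storage space -> {address: scalar}

def run_migration(mem: Memory, speeds: Dict[str, int], mig_type: str,
                  space1: str, space2: str, dst: int, src: int, size: int) -> List[float]:
    # Control module: confirm the opcode's migration type matches the relative speeds.
    s1, s2 = speeds[space1], speeds[space2]
    expected = "ld" if s1 < s2 else ("st" if s1 > s2 else "mv")
    if mig_type != expected:
        raise ValueError(f"{mig_type} does not match speeds {s1} -> {s2}")
    # Control module: read `size` scalars starting at src in the initial storage space.
    data = [mem[space1][src + i] for i in range(size)]
    # Processing module: write them starting at dst in the target storage space.
    for i, value in enumerate(data):
        mem[space2][dst + i] = value
    return data

# ld.200.300 500 400 5, with hypothetical contents at addresses 400..404 of space 200.
mem: Memory = {"200": {400 + i: float(i) for i in range(5)}, "300": {}}
speeds = {"200": 1, "300": 2}  # hypothetical: space 200 is slower than space 300
run_migration(mem, speeds, "ld", "200", "300", dst=500, src=400, size=5)
print(mem["300"])  # {500: 0.0, 501: 1.0, 502: 2.0, 503: 3.0, 504: 4.0}
```

The model copies the data and leaves the source unchanged. This reflects the reading that "migration" stores the data at the target address without clearing the source, which is itself an assumption.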
Besides the form ld.200.300 500 400 5 above, scalar data migration instruction 1 may also be written as migrate 500 400 ld.200.300 5, among other forms. The two are processed similarly, so the details are not repeated here.
For the working process of each of the above modules, refer to the related description above.
In this way, the scalar data migration instruction processing device can process scalar data migration instructions efficiently and quickly, so that scalar data migration is performed with high efficiency and speed.
FIG. 16-4 is a flowchart of a scalar data migration instruction processing method according to an embodiment of the present disclosure. The method may be applied to a computer device or the like that includes a memory and a processor. The memory is configured to store data used while the method is executed, and the processor is configured to perform the related processing and operation steps, such as steps S51-16 and S52-16 below. As shown in FIG. 16-4, the method is applied to the above scalar data migration instruction processing device and includes steps S51-16 and S52-16.
In step S51-16, the control module first compiles the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, and parses the compiled instruction to obtain its opcode and operation fields. According to the opcode and operation fields, it then obtains the scalar data to be migrated and the target address required to execute the instruction, and determines the migration parameters required for the migration processing. The opcode indicates that the processing performed by the scalar data migration instruction on scalar data is migration processing. The operation fields include the address of the scalar data to be migrated and the target address. The migration parameters include the initial storage space in which the address of the scalar data to be migrated is located, the target storage space in which the target address is located, and the migration type of the migration processing.
In step S52-16, the processing module stores the scalar data to be migrated at the target address according to the migration parameters.
In a possible implementation, the processing module may include a master processing sub-module and a plurality of slave processing sub-modules. Step S52-16 may then include:
processing, by the master processing sub-module, the scalar data to be migrated to obtain processed scalar data to be migrated, and storing the processed scalar data to be migrated at the target address.
In a possible implementation, the operation fields may further include a scalar data migration amount. Obtaining, according to the opcode and the operation fields, the scalar data to be migrated and the target address required to execute the scalar data migration instruction may then include:
determining the scalar data migration amount according to the operation fields, and obtaining, from the address of the scalar data to be migrated, the scalar data to be migrated corresponding to that amount.
In a possible implementation, the operation fields may further include the migration parameters. Determining the migration parameters required for the migration processing may then include determining them according to the operation fields.
In a possible implementation, the opcode may also indicate the migration parameters. Determining the migration parameters required for the migration processing may then include determining them according to the opcode.
In a possible implementation, the method further includes storing the scalar data to be migrated by using a storage module of the device,
where the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on, and includes at least one neuron cache (NRAM);
the register is configured to store the scalar data to be migrated and the scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, and the neuron data includes neuron vector data.
In a possible implementation, step S51-16 may include:
storing the compiled scalar data migration instruction;
parsing the compiled scalar data migration instruction to obtain its opcode and operation fields;
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, arranged in execution order, and the plurality of instructions to be executed include the compiled scalar data migration instruction.
In a possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and after determining that execution of the zeroth instruction to be executed is complete, controlling execution of the first instruction to be executed,
where the association relationship between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
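The association test above reduces to an interval-overlap check. The sketch below assumes half-open address intervals [start, end) and uses hypothetical function names. It shows only the condition under which the first instruction to be executed must wait for the zeroth; it does not model the instruction storage sub-module itself.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open address interval [start, end)

def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap iff each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def must_wait(first: List[Interval], zeroth: List[Interval]) -> bool:
    """True if any interval used by the first instruction overlaps one used by the
    zeroth, i.e. the first instruction is cached until the zeroth has finished."""
    return any(overlaps(x, y) for x in first for y in zeroth)

zeroth = [(0, 16), (64, 80)]           # hypothetical address intervals of the zeroth instruction
print(must_wait([(72, 90)], zeroth))   # True: shares addresses 72..79
print(must_wait([(16, 32)], zeroth))   # False: starts exactly where (0, 16) ends
```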
In a possible implementation, compiling the acquired scalar data migration instruction to obtain the compiled scalar data migration instruction may include:
generating an assembly file according to the scalar data migration instruction, and translating the assembly file into a binary file,
where the binary file is the compiled scalar data migration instruction.
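The disclosure does not define the binary layout produced from the assembly file. As a purely hypothetical illustration of what "translating into a binary file" could involve, the sketch below uses Python's struct module to pack the decoded fields into a fixed 18-byte little-endian record. The type codes, field widths and function names are all invented here.

```python
import struct
from typing import Tuple

# Hypothetical numbering of migration types (not from the disclosure).
TYPE_CODES = {"ld": 0, "st": 1, "mv": 2}
TYPE_NAMES = {code: name for name, code in TYPE_CODES.items()}

# Hypothetical layout: type, space1, space2 as u16; dst, src, size as u32 (18 bytes).
LAYOUT = struct.Struct("<HHHIII")

def assemble(mig_type: str, space1: int, space2: int, dst: int, src: int, size: int) -> bytes:
    """Encode one scalar data migration instruction into its binary record."""
    return LAYOUT.pack(TYPE_CODES[mig_type], space1, space2, dst, src, size)

def disassemble(record: bytes) -> Tuple[str, int, int, int, int, int]:
    """Decode a binary record back into (type, space1, space2, dst, src, size)."""
    code, space1, space2, dst, src, size = LAYOUT.unpack(record)
    return TYPE_NAMES[code], space1, space2, dst, src, size

record = assemble("ld", 200, 300, 500, 400, 5)
print(len(record), disassemble(record))  # 18 ('ld', 200, 300, 500, 400, 5)
```

A fixed-width record keeps decoding trivial. Real hardware would more likely pack fields into bit-level instruction words.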
It should be noted that although the scalar data migration instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The scalar data migration instruction processing method provided by the embodiments of the present disclosure has a wide range of applications. It processes scalar data migration instructions, and therefore performs scalar data migration, with high efficiency and speed.
The foregoing may be better understood in light of the following clauses:
Clause P1. A scalar data migration instruction processing device, the device comprising:
a control module, configured to compile an acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parse the compiled scalar data migration instruction to obtain an opcode and operation fields of the scalar data migration instruction, obtain, according to the opcode and the operation fields, scalar data to be migrated and a target address required to execute the scalar data migration instruction, and determine migration parameters required for migration processing; and
a processing module, configured to store the scalar data to be migrated at the target address according to the migration parameters,
wherein the opcode is used to indicate that the processing performed by the scalar data migration instruction on scalar data is migration processing, the operation fields comprise an address of the scalar data to be migrated and the target address, and the migration parameters comprise an initial storage space in which the address of the scalar data to be migrated is located, a target storage space in which the target address is located, and a migration type of the migration processing.
Clause P2. The device according to Clause P1, wherein the processing module comprises a master processing sub-module and a plurality of slave processing sub-modules,
the master processing sub-module being configured to process the scalar data to be migrated to obtain processed scalar data to be migrated, and to store the processed scalar data to be migrated at the target address.
Clause P3. The device according to Clause P1, wherein the operation fields further comprise a scalar data migration amount,
and the control module is further configured to determine the scalar data migration amount according to the operation fields, and to obtain, from the address of the scalar data to be migrated, the scalar data to be migrated corresponding to the scalar data migration amount.
Clause P4. The device according to Clause P1, wherein the operation fields further comprise the migration parameters,
and determining the migration parameters required for the migration processing comprises:
determining, according to the operation fields, the migration parameters required for the migration processing.
Clause P5. The device according to Clause P1, wherein the opcode is further used to indicate the migration parameters,
and determining the migration parameters required for the migration processing comprises:
determining, according to the opcode, the migration parameters required for the migration processing.
Clause P6. The device according to Clause P1, further comprising:
a storage module, configured to store the scalar data to be migrated,
wherein the storage module comprises at least one of a register and a cache;
the cache is configured to store data to be operated on and comprises at least one neuron cache (NRAM);
the register is configured to store the scalar data to be migrated and the scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data comprising neuron vector data.
Clause P7. The device according to Clause P1, wherein the control module comprises:
an instruction storage sub-module, configured to store the compiled scalar data migration instruction;
an instruction processing sub-module, configured to parse the compiled scalar data migration instruction to obtain the opcode and operation fields of the scalar data migration instruction; and
a queue storage sub-module, configured to store an instruction queue, the instruction queue comprising a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed comprising the compiled scalar data migration instruction.
Clause P8. The device according to Clause P7, wherein the control module further comprises:
a dependency processing sub-module, configured to cache a first instruction to be executed in the instruction storage sub-module when it is determined that the first instruction to be executed, among the plurality of instructions to be executed, has an association relationship with a zeroth instruction to be executed preceding it, and, after execution of the zeroth instruction to be executed is complete, to fetch the first instruction to be executed from the instruction storage sub-module and send it to the processing module,
wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed preceding it comprises:
a first storage address interval storing the data required by the first instruction to be executed overlapping a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause P9. The device according to Clause P1,
wherein the control module is further configured to generate an assembly file according to the scalar data migration instruction and translate the assembly file into a binary file,
the binary file being the compiled scalar data migration instruction.
Clause P10. A machine learning computing device, the device comprising:
one or more scalar data migration instruction processing devices according to any one of Clauses P1 to P9, configured to obtain scalar data to be migrated and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
wherein, when the machine learning computing device comprises a plurality of the scalar data migration instruction processing devices, the plurality of scalar data migration instruction processing devices may be connected to one another and transmit data through a specific structure;
wherein the plurality of scalar data migration instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of scalar data migration instruction processing devices share one control system or have their own control systems; the plurality of scalar data migration instruction processing devices share memory or have their own memories; and the plurality of scalar data migration instruction processing devices may be interconnected in any interconnection topology.
Clause P11. A combined processing device, the combined processing device comprising:
the machine learning computing device according to Clause P10, a universal interconnection interface, and other processing devices;
the machine learning computing device interacting with the other processing devices to jointly complete a computing operation specified by a user,
wherein the combined processing device further comprises a storage device, which is connected to both the machine learning computing device and the other processing devices and is configured to store data of the machine learning computing device and of the other processing devices.
Clause P12. A machine learning chip, the machine learning chip comprising:
the machine learning computing device according to Clause P10 or the combined processing device according to Clause P11.
Clause P13. An electronic device, the electronic device comprising:
the machine learning chip according to Clause P12.
Clause P14. A board card, the board card comprising a storage device, an interface device, a control device, and the machine learning chip according to Clause P12;
wherein the machine learning chip is connected to each of the storage device, the control device and the interface device;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause P15. A scalar data migration instruction processing method, applied to a scalar data migration instruction processing device, the device comprising a control module and a processing module, the method comprising:
compiling, by the control module, an acquired scalar data migration instruction to obtain a compiled scalar data migration instruction, parsing the compiled scalar data migration instruction to obtain an opcode and operation fields of the scalar data migration instruction, obtaining, according to the opcode and the operation fields, scalar data to be migrated and a target address required to execute the scalar data migration instruction, and determining migration parameters required for migration processing; and
storing, by the processing module, the scalar data to be migrated at the target address according to the migration parameters,
wherein the opcode is used to indicate that the processing performed by the scalar data migration instruction on scalar data is migration processing, the operation fields comprise an address of the scalar data to be migrated and the target address, and the migration parameters comprise an initial storage space in which the address of the scalar data to be migrated is located, a target storage space in which the target address is located, and a migration type of the migration processing.
Clause P16. The method according to Clause P15, wherein the processing module comprises a master processing sub-module and a plurality of slave processing sub-modules,
and storing the scalar data to be migrated at the target address according to the migration parameters comprises:
processing, by the master processing sub-module, the scalar data to be migrated to obtain processed scalar data to be migrated, and storing the processed scalar data to be migrated at the target address.
Clause P17. The method according to Clause P15, wherein the operation fields further comprise a scalar data migration amount,
and obtaining, according to the opcode and the operation fields, the scalar data to be migrated and the target address required to execute the scalar data migration instruction comprises:
determining the scalar data migration amount according to the operation fields, and obtaining, from the address of the scalar data to be migrated, the scalar data to be migrated corresponding to the scalar data migration amount.
Clause P18. The method according to Clause P15, wherein the operation fields further comprise the migration parameters,
and determining the migration parameters required for the migration processing comprises:
determining, according to the operation fields, the migration parameters required for the migration processing.
Clause P19. The method according to Clause P15, wherein the opcode is further used to indicate the migration parameters,
and determining the migration parameters required for the migration processing comprises:
determining, according to the opcode, the migration parameters required for the migration processing.
Clause P20. The method according to Clause P15, further comprising:
storing the scalar data to be migrated by using a storage module of the device,
wherein the storage module comprises at least one of a register and a cache;
the cache is configured to store data to be operated on and comprises at least one neuron cache (NRAM);
the register is configured to store the scalar data to be migrated and the scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data comprising neuron vector data.
Clause P21. The method according to Clause P15, wherein parsing the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction includes:
storing the compiled scalar data migration instruction;
parsing the compiled scalar data migration instruction to obtain the operation code and operation domain of the scalar data migration instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled scalar data migration instruction.
Clause P22. The method according to Clause P21, the method further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and, after it is determined that execution of the zeroth instruction to be executed has completed, controlling execution of the first instruction to be executed,
wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
a first storage address interval storing the data required by the first instruction to be executed overlapping a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
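The association relationship in Clause P22 reduces to an interval-overlap test. The sketch below assumes that each instruction's data occupies a single half-open address range [start, end). Real hardware may track several ranges per instruction. The function names are illustrative and are not taken from the disclosure.

```python
from typing import Tuple

# A storage address interval, modelled as a half-open range [start, end).
Interval = Tuple[int, int]


def intervals_overlap(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one address."""
    a_start, a_end = a
    b_start, b_end = b
    if a_start >= a_end or b_start >= b_end:  # an empty interval overlaps nothing
        return False
    return a_start < b_end and b_start < a_end


def has_association(first: Interval, zeroth: Interval) -> bool:
    """Association relationship: the first instruction's data interval
    overlaps the interval of the earlier (zeroth) instruction."""
    return intervals_overlap(first, zeroth)
```

With half-open ranges, intervals that only touch, such as (100, 200) and (200, 300), are treated as independent. Only a shared address creates a dependency.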
Clause P23. The method according to Clause P15, wherein compiling the acquired scalar data migration instruction to obtain a compiled scalar data migration instruction includes:
generating an assembly file according to the scalar data migration instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled scalar data migration instruction.
Clause P24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of Clauses P15 to P23.
FIG. 17-1 shows a block diagram of a scalar control flow instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 17-1, the device includes a control module 17-11. The control module 17-11 includes an instruction compilation submodule 17-111, a data acquisition submodule 17-112, and a jump control submodule 17-113.
The instruction compilation submodule 17-111 compiles the acquired scalar control flow instruction to obtain a compiled scalar control flow instruction.
The data acquisition submodule 17-112 acquires, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determines the jump condition corresponding to the scalar control flow instruction.
When the scalar to be judged meets the jump condition, the jump control submodule 17-113 controls the instruction flow to jump to the target jump address.
The operation code indicates that the processing the scalar control flow instruction performs on the data is scalar jump processing. The operation domain includes the address of the scalar to be judged and the target jump address.
In this embodiment, there may be one or more scalars to be judged. The operation domain may include the address of the scalar to be judged, or may directly include the scalar to be judged, so that the control module can obtain the scalar to be judged.
In this embodiment, the scalar control flow instruction acquired by the control module is an uncompiled software instruction that hardware cannot execute directly, so the control module first compiles it. The compiled scalar control flow instruction can be parsed only after it has been obtained. The compiled scalar control flow instruction is a hardware instruction that hardware can execute directly. The control module may obtain the scalar control flow instruction and the scalar to be judged through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually expressed as a code), that specifies the operation to be performed in a computer program. It serves as the instruction's serial number and tells the device executing the instruction which instruction is to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including the scalar to be judged, the address of the scalar to be judged, the target jump address, the jump condition, and so on. A scalar control flow instruction must include an operation code and an operation domain, and the operation domain includes at least the address storing the scalar to be judged and the target jump address.
It should be understood that those skilled in the art may set the instruction format of the scalar control flow instruction, and the operation code and operation domain it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules, and the number of control modules may be set according to actual needs, which is not limited in the present disclosure. The device may be used to perform computations for machine learning algorithms, such as neural network algorithms.
In this embodiment, the device may further include a processing module. The control module may also be configured to receive a calculation instruction and obtain data to be processed. The processing module is configured to perform operation processing on the data to be processed according to the calculation instruction to obtain an operation result.
The scalar control flow instruction processing device provided by the embodiments of the present disclosure includes a control module, and the control module includes the following submodules. An instruction compilation submodule compiles the acquired scalar control flow instruction to obtain a compiled scalar control flow instruction. A data acquisition submodule acquires, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the instruction, and determines the jump condition corresponding to the instruction. A jump control submodule controls the instruction flow to jump to the target jump address when the scalar to be judged meets the jump condition. The device has a wide range of applications and processes scalar control flow instructions efficiently and quickly.
In a possible implementation, the jump control submodule 17-113 may include:
at least one comparator, configured to compare the scalar(s) to be judged according to the jump condition to obtain a comparison result, the comparison result indicating whether the scalar to be judged meets the jump condition.
In a possible implementation, the operation domain may further include the jump condition. In this case, the data acquisition submodule 17-112 may be configured to determine the jump condition corresponding to the scalar control flow instruction according to the operation domain.
In a possible implementation, the operation code may also be used to indicate the jump condition. In this case, the data acquisition submodule 17-112 may be configured to determine the jump condition corresponding to the scalar control flow instruction according to the operation code.
In a possible implementation, the jump condition may include a judgment condition and the data type of the scalar to be judged. The judgment condition indicates the type of judgment or comparison that the scalar control flow instruction performs on the scalar to be judged.
In a possible implementation, the judgment condition may include any one of the following:
the first scalar to be judged is equal to the second scalar to be judged;
the first scalar to be judged is not equal to the second scalar to be judged;
the first scalar to be judged is less than the second scalar to be judged;
the first scalar to be judged is greater than or equal to the second scalar to be judged;
the scalar to be judged is greater than a specified value.
In this implementation, the judgment condition may also be another condition on the scalar(s) to be judged. For example, the judgment condition may be that the first scalar to be judged is less than the second scalar to be judged. It may also be that the scalar to be judged is less than, or equal to, a specified value, where the specified value may be preset. It may also be that the sum of the first and second scalars to be judged is greater than, equal to, less than, less than or equal to, greater than or equal to, or not equal to a third scalar to be judged. Those skilled in the art may set the judgment condition according to actual needs, and the present disclosure does not limit this.
In this implementation, different judgment condition identifiers may be set to distinguish different judgment conditions. For example, the identifier of the condition "the first scalar to be judged is equal to the second scalar to be judged" may be set to "beq". The identifier of "the first scalar to be judged is not equal to the second scalar to be judged" may be set to "bne". The identifier of "the first scalar to be judged is less than the second scalar to be judged" may be set to "blt". The identifier of "the first scalar to be judged is greater than or equal to the second scalar to be judged" may be set to "bge". The identifier of "the scalar to be judged is greater than a specified value" may be set to "blt.a", where a is the specified value.
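The condition identifiers above can be modelled as a lookup from identifier to comparison predicate. The sketch below is a hypothetical software model of the check the comparator performs, not a description of the hardware. For "blt.a", the specified value a is passed as a separate argument. That is an assumption about how a is supplied, and the sketch follows the text in reading "blt.a" as "greater than a".

```python
import operator
from typing import Callable, Dict, Optional

# Two-operand judgment conditions keyed by the identifiers named in the text.
TWO_OPERAND_CONDITIONS: Dict[str, Callable[[int, int], bool]] = {
    "beq": operator.eq,  # first == second
    "bne": operator.ne,  # first != second
    "blt": operator.lt,  # first <  second
    "bge": operator.ge,  # first >= second
}


def judge(condition: str, first: int, second: Optional[int] = None,
          specified_value: Optional[int] = None) -> bool:
    """Return True if the scalar(s) to be judged satisfy the condition."""
    if condition == "blt.a":
        # As described in the text: the scalar is greater than the specified value a.
        if specified_value is None:
            raise ValueError("blt.a needs a specified value")
        return first > specified_value
    if second is None:
        raise ValueError(f"{condition} needs two scalars to be judged")
    return TWO_OPERAND_CONDITIONS[condition](first, second)
```

A table lookup like this makes it easy to add further conditions, such as the sum-based conditions mentioned above, without changing the dispatch logic.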
In a possible implementation, the data type may be any one of the following: 16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, and 48-bit signed.
In this implementation, the scalar to be judged may be an integer (or similar) scalar of one of the above data types. Those skilled in the art may set the data type and kind of the scalar to be judged according to actual needs, which is not limited in the present disclosure.
In a possible implementation, a default data type may be preset. When the jump condition does not include a data type, the default data type may be used as the data type of the scalar to be judged.
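The data type has to be applied before the comparison, because the same bit pattern compares differently when read as signed or unsigned. The sketch below makes several assumptions. It assumes two's-complement encoding for signed types. It uses hypothetical type names such as "u16" and "s48"; the application example in this section uses "u16". It also uses a hypothetical default type of "s32", because the text does not say which default is preset.

```python
from typing import Dict, Optional, Tuple

# Hypothetical spellings for the six listed data types: name -> (width, signed).
TYPE_WIDTHS: Dict[str, Tuple[int, bool]] = {
    "u16": (16, False), "u32": (32, False), "u48": (48, False),
    "s16": (16, True), "s32": (32, True), "s48": (48, True),
}
DEFAULT_TYPE = "s32"  # hypothetical preset default, used when no type is given


def decode_scalar(raw: int, type_name: Optional[str] = None) -> int:
    """Interpret the low bits of `raw` as a scalar of the given data type."""
    width, signed = TYPE_WIDTHS[type_name or DEFAULT_TYPE]
    value = raw & ((1 << width) - 1)      # keep only this scalar's bits
    if signed and value >> (width - 1):   # sign bit is set
        value -= 1 << width               # convert to the two's-complement value
    return value
```

For example, the raw value 0xFFFF compared with 1 satisfies "bge" when read as u16 (65535 >= 1) but satisfies "blt" when read as s16 (-1 < 1).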
In a possible implementation, the instruction flow may be controlled to jump directly to the target jump address in any of the following cases: the scalar control flow instruction does not include the jump condition and the address of the scalar to be judged; the jump condition and the address of the scalar to be judged are empty; or they are set to specified content.
FIG. 17-2 shows a block diagram of a scalar control flow instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 17-2, the device may further include a storage module 17-13, which is configured to store the scalar to be judged.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratchpad cache. The scalar to be judged may be stored in the memory, cache, and/or register of the storage module as needed, which is not limited in the present disclosure.
In a possible implementation, the device may further include a direct memory access (DMA) module configured to read data from, or store data to, the storage module.
In a possible implementation, the instruction compilation submodule may compile the scalar control flow instruction into a form the control module needs for execution, for example, a binary or hexadecimal file that the control module can recognize. Compiling the acquired scalar control flow instruction to obtain the compiled scalar control flow instruction may include: generating an assembly file according to the scalar control flow instruction, and translating the assembly file into a binary file, where the binary file is the compiled scalar control flow instruction.
In a possible implementation, the instruction format of the scalar control flow instruction may be:
jump, src, label, type1.type2
Here, jump is the operation code of the scalar control flow instruction, and src, label, and type1.type2 form its operation domain. label is the target jump address. src is the address of the scalar to be judged. When there are multiple scalars to be judged, the instruction may include multiple such addresses, such as src1, src2, ..., srcn. type1.type2 represents the jump condition: type1 represents the judgment condition, and type2 represents the data type of the scalar to be judged.
When there are multiple scalars to be judged, the instruction format may include multiple scalar addresses. Taking two scalars to be judged as an example, the instruction format of the scalar control flow instruction may be:
jump, src0, src1, label, type1.type2
In a possible implementation, the instruction format of the scalar control flow instruction may also be:
type1.type2, src, label
Here, type1.type2 is the operation code of the scalar control flow instruction, and src and label form its operation domain. type1.type2 indicates that the instruction is a scalar control flow instruction: type1 represents the judgment condition, and type2 represents the data type of the scalar to be judged. src is the address of the scalar to be judged. When there are multiple scalars to be judged, the instruction may include multiple such addresses, such as src1, src2, ..., srcn.
When there are multiple scalars to be judged, the instruction format may include multiple scalar addresses. Taking two scalars to be judged as an example, the instruction format of the scalar control flow instruction may be:
type1.type2, src0, src1, label
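Both textual formats can be parsed into the same structure. The only difference between them is whether type1.type2 is the operation code or the last operand. The sketch below assumes the following: operands are decimal integers, fields are separated by commas and/or whitespace (the application example in this section uses spaces), and the last address-like operand is always the label. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ControlFlowInstr:
    condition: str            # type1, the judgment condition, e.g. "beq"
    data_type: Optional[str]  # type2, e.g. "u16"; None if omitted
    src: List[int]            # addresses of the scalars to be judged
    label: int                # target jump address


def parse(text: str) -> ControlFlowInstr:
    """Parse 'jump, src..., label, type1.type2' or 'type1.type2, src..., label'."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    opcode, operands = tokens[0], tokens[1:]
    if opcode == "jump":
        # First format: the jump condition is the last operand.
        cond_field, addrs = operands[-1], operands[:-1]
    else:
        # Second format: the opcode itself carries the jump condition.
        cond_field, addrs = opcode, operands
    condition, _, data_type = cond_field.partition(".")
    *src, label = [int(a) for a in addrs]
    return ControlFlowInstr(condition, data_type or None, src, label)
```

Normalising both formats into one record lets the later stages (operand fetch, comparison, jump) stay independent of which textual format was used.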
In a possible implementation, a corresponding instruction format may be set for each different scalar control flow instruction.
In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged is equal to the second scalar to be judged" may be set to: beq.type2, src0, src1, label. This instruction compares the first and second scalars to be judged, both of data type type2, stored at src0 and src1 respectively. When the first scalar to be judged is equal to the second, it controls the instruction flow to jump to the target jump address label.
In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged is not equal to the second scalar to be judged" may be set to: bne.type2, src0, src1, label. This instruction compares the first and second scalars to be judged, both of data type type2, stored at src0 and src1 respectively. When the first scalar to be judged is not equal to the second, it controls the instruction flow to jump to the target jump address label.
In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged is less than the second scalar to be judged" may be set to: blt.type2, src0, src1, label. This instruction compares the first and second scalars to be judged, both of data type type2, stored at src0 and src1 respectively. When the first scalar to be judged is less than the second, it controls the instruction flow to jump to the target jump address label.
In a possible implementation, the instruction format of the scalar control flow instruction whose judgment condition is "the first scalar to be judged is greater than or equal to the second scalar to be judged" may be set to: bge.type2, src0, src1, label. This instruction compares the first and second scalars to be judged, both of data type type2, stored at src0 and src1 respectively. When the first scalar to be judged is greater than or equal to the second, it controls the instruction flow to jump to the target jump address label.
In a possible implementation, the instruction format of a scalar control flow instruction that jumps directly, without any judgment, may be set to: jmp, label. On receiving this instruction, the device directly controls the instruction flow to jump to the target jump address label.
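The five instruction forms above can be summarised as a single step function that returns the address of the next instruction. This is a sketch under stated assumptions. Memory is modelled as a dict from address to an already-decoded scalar, so the type2 handling is not repeated here. Instructions are addressed by index, so falling through means pc + 1. The names `step` and `CONDITIONS` are hypothetical.

```python
import operator
from typing import Dict, List, Optional

# Comparison performed for each conditional form (beq/bne/blt/bge.type2).
CONDITIONS = {"beq": operator.eq, "bne": operator.ne,
              "blt": operator.lt, "bge": operator.ge}


def step(condition: Optional[str], operands: List[int], label: int,
         memory: Dict[int, int], pc: int) -> int:
    """Return the next program counter after one scalar control flow instruction."""
    if condition in (None, "jmp"):
        return label                       # jmp, label: jump without any judgment
    first, second = (memory[a] for a in operands)  # fetch from src0 and src1
    if CONDITIONS[condition](first, second):
        return label                       # condition met: jump to label
    return pc + 1                          # condition not met: fall through
```

Treating jmp as the case with no condition mirrors the earlier statement that an instruction without a jump condition and scalar addresses may jump directly.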
It should be understood that those skilled in the art may set the operation code of the scalar control flow instruction, and the positions of the operation code and operation domain in the instruction format, as needed; the present disclosure does not limit this.
In a possible implementation, the device may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the scalar control flow instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or actual application scenarios, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following application example takes "using the scalar control flow instruction processing device to perform address fetch processing" as an exemplary scenario, to help the reader understand the flow of the device according to an embodiment of the present disclosure. Those skilled in the art should understand that this example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 17-3 shows a schematic diagram of an application scenario of a scalar control flow instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 17-3, the device processes a scalar control flow instruction as follows:
As shown in FIG. 17-3, the control module 17-11 compiles the acquired scalar control flow instruction 1 to obtain the compiled scalar control flow instruction 1. It then parses the compiled instruction (for example, beq.u16 101 102 500) to obtain its operation code and operation domain. From these it determines three things: the judgment condition is "the first scalar to be judged is equal to the second scalar to be judged", the data type is 16-bit unsigned, and the target jump address is 500. It acquires the 16-bit unsigned first scalar to be judged, s1, from the first scalar address 101, and the 16-bit unsigned second scalar to be judged, s2, from the second scalar address 102. A comparator compares s1 with s2, and when s1 is equal to s2, the instruction flow is controlled to jump to the target jump address 500.
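The walkthrough above can be traced in a few lines of code. The sketch assumes the following: memory is a dict from address to word, only the low 16 bits of each word form the u16 scalar, and the next instruction on a failed comparison is pc + 1. None of these details is specified in the text, and the function name is hypothetical.

```python
from typing import Dict


def execute_beq_u16(instr: str, memory: Dict[int, int], pc: int) -> int:
    """Return the next instruction address after 'beq.u16 src0 src1 label'."""
    opcode, *fields = instr.split()
    condition, data_type = opcode.split(".")       # judgment condition, data type
    if (condition, data_type) != ("beq", "u16"):
        raise ValueError("this sketch only handles beq.u16")
    src0, src1, label = (int(f) for f in fields)   # 101, 102, 500 in the example
    s1 = memory[src0] & 0xFFFF                     # first scalar to be judged (u16)
    s2 = memory[src1] & 0xFFFF                     # second scalar to be judged (u16)
    return label if s1 == s2 else pc + 1           # comparator result picks the target
```

With equal values at addresses 101 and 102, the instruction flow moves to address 500. Otherwise it continues with the next instruction.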
For details of the working process of the control module, refer to the related description above.
In this way, the scalar control flow instruction processing device can process scalar control flow instructions efficiently and quickly.
FIG. 17-4 shows a flowchart of a scalar control flow instruction processing method according to an embodiment of the present disclosure. The method may be applied to computer equipment or similar devices that include a memory and a processor. The memory stores the data used while the method is executed, and the processor performs the related processing and operation steps, such as steps S51-17 to S53-17 below. As shown in FIG. 17-4, the method is applied to the above scalar control flow instruction processing device and includes steps S51-17 to S53-17.
In step S51-17, the acquired scalar control flow instruction is compiled to obtain a compiled scalar control flow instruction.
In step S52-17, the scalar to be judged and the target jump address required to execute the scalar control flow instruction are acquired according to the operation code and operation domain of the compiled instruction, and the jump condition corresponding to the instruction is determined. The operation code indicates that the processing the instruction performs on the data is scalar jump processing. The operation domain includes the address of the scalar to be judged and the target jump address.
In step S53-17, when the scalar to be judged meets the jump condition, the instruction flow is controlled to jump to the target jump address.
In a possible implementation, step S53-17 (controlling the instruction flow to jump to the target jump address when the scalar to be judged meets the jump condition) may include:
comparing the scalar(s) to be judged according to the jump condition by using at least one comparator to obtain a comparison result, the comparison result indicating whether the scalar to be judged meets the jump condition.
In a possible implementation, the operation domain may further include the jump condition. In this case, determining the jump condition corresponding to the scalar control flow instruction may include determining the jump condition according to the operation domain.
In a possible implementation, the operation code may also be used to indicate the jump condition. In this case, determining the jump condition corresponding to the scalar control flow instruction may include determining the jump condition according to the operation code.
In a possible implementation, the jump condition may include a judgment condition and the data type of the scalar(s) to be judged.
The judgment condition may be any one of the following:
the first of the scalars to be judged is equal to the second;
the first of the scalars to be judged is not equal to the second;
the first of the scalars to be judged is less than the second;
the first of the scalars to be judged is greater than or equal to the second;
the scalar to be judged is greater than a specified value.
The data type may be any one of the following: 16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, or 48-bit signed.
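The comparator's job can be illustrated with a short software model. The sketch below assumes the judgment conditions and data types listed above; the enum names, the short type names (`u16`, `s48`, and so on), and the `pc + 1` fall-through are hypothetical choices made for illustration, not the disclosed hardware. It shows why the data type is part of the jump condition: the same raw bits compare differently when read as signed or unsigned.

```python
from enum import Enum


class Cond(Enum):
    EQ = "eq"          # first scalar == second scalar
    NE = "ne"          # first scalar != second scalar
    LT = "lt"          # first scalar < second scalar
    GE = "ge"          # first scalar >= second scalar
    GT_IMM = "gt_imm"  # scalar > specified value


# (bit width, signed?) for the six data types; the short names are hypothetical
DATA_TYPES = {
    "u16": (16, False), "u32": (32, False), "u48": (48, False),
    "s16": (16, True), "s32": (32, True), "s48": (48, True),
}


def interpret(raw: int, dtype: str) -> int:
    """Reinterpret a raw register value as the given fixed-width integer type."""
    width, signed = DATA_TYPES[dtype]
    value = raw & ((1 << width) - 1)      # keep only the low `width` bits
    if signed and value >> (width - 1):   # sign bit set: two's-complement negative
        value -= 1 << width
    return value


def condition_met(cond: Cond, dtype: str, a_raw: int, b_raw: int = 0, specified: int = 0) -> bool:
    """Software stand-in for the comparator: decide whether the jump condition holds."""
    a = interpret(a_raw, dtype)
    if cond is Cond.GT_IMM:
        return a > specified
    b = interpret(b_raw, dtype)
    if cond is Cond.EQ:
        return a == b
    if cond is Cond.NE:
        return a != b
    if cond is Cond.LT:
        return a < b
    return a >= b  # Cond.GE


def next_pc(pc: int, target: int, cond: Cond, dtype: str,
            a_raw: int, b_raw: int = 0, specified: int = 0) -> int:
    """Jump to `target` if the condition holds; otherwise fall through to the next instruction."""
    if condition_met(cond, dtype, a_raw, b_raw, specified):
        return target
    return pc + 1  # assumes the program counter counts instructions, not bytes
```

For example, `condition_met(Cond.LT, "s16", 0xFFFF, 1)` is true because 0xFFFF reads as -1, while the same call with `"u16"` is false because 0xFFFF reads as 65535.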
In a possible implementation, the method may further include: storing the scalar(s) to be judged.
In a possible implementation, the method may further include:
storing the compiled scalar control flow instruction;
parsing the compiled scalar control flow instruction to obtain its operation code and operation domain;
storing an instruction queue, where the instruction queue includes a plurality of to-be-executed instructions arranged in execution order, and the plurality of to-be-executed instructions may include the compiled scalar control flow instruction.
In a possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and controlling execution of the first to-be-executed instruction after it is determined that the zeroth to-be-executed instruction has finished executing.
The first to-be-executed instruction is associated with the zeroth to-be-executed instruction preceding it when the first storage address interval, which stores the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
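The association test reduces to an interval-overlap check. The sketch below assumes, for illustration only, that each instruction's required data occupies one contiguous half-open address range `[start, end)`; the class and function names are hypothetical. Like the rule described above, it is conservative: any overlap counts as a dependency, whether the accesses are reads or writes.

```python
from dataclasses import dataclass


@dataclass
class PendingInstr:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval)


def overlaps(first: PendingInstr, zeroth: PendingInstr) -> bool:
    """True if [first.start, first.end) and [zeroth.start, zeroth.end) share any address."""
    return first.start < zeroth.end and zeroth.start < first.end


def dependencies(queue: list, index: int) -> list:
    """Names of earlier instructions in the queue that queue[index] must wait for."""
    first = queue[index]
    return [zeroth.name for zeroth in queue[:index] if overlaps(first, zeroth)]
```

With a queue of `i0` over `[0, 64)`, `i1` over `[64, 128)`, and `i2` over `[32, 96)`, `i1` can issue immediately, while `i2` would be cached until both `i0` and `i1` finish.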
In a possible implementation, compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction may include: generating an assembly file from the scalar control flow instruction and translating the assembly file into a binary file, where the binary file is the compiled scalar control flow instruction.
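The assembly-to-binary step and the later parsing into an operation code and operation domain are inverses of each other. The sketch below is entirely hypothetical: the disclosure does not give an instruction encoding, so the 64-bit field layout, the `JMPC` mnemonic, and the numeric codes are invented solely to show the round trip.

```python
# Hypothetical 64-bit layout (most significant field first):
# opcode(8) | cond(4) | dtype(4) | addr_a(16) | addr_b(16) | target(16)
FIELDS = [("opcode", 8), ("cond", 4), ("dtype", 4),
          ("addr_a", 16), ("addr_b", 16), ("target", 16)]

OPCODES = {"JMPC": 0x21}  # hypothetical opcode for a scalar conditional jump
CONDS = {"eq": 0, "ne": 1, "lt": 2, "ge": 3, "gt_imm": 4}
DTYPES = {"u16": 0, "u32": 1, "u48": 2, "s16": 3, "s32": 4, "s48": 5}


def assemble(line: str) -> int:
    """Translate one assembly line, e.g. 'JMPC lt s32 0x10 0x14 0x200', into a binary word."""
    mnemonic, cond, dtype, a, b, target = line.split()
    values = {
        "opcode": OPCODES[mnemonic], "cond": CONDS[cond], "dtype": DTYPES[dtype],
        "addr_a": int(a, 0), "addr_b": int(b, 0), "target": int(target, 0),
    }
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if value >= 1 << width:
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value  # append the field below the ones already packed
    return word


def decode(word: int) -> dict:
    """Split a binary word back into the operation code and operation-domain fields."""
    fields = {}
    for name, width in reversed(FIELDS):  # peel fields off from the least significant end
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields
```

Decoding the word produced by `assemble("JMPC lt s32 0x10 0x14 0x200")` recovers the opcode `0x21` plus the condition, data type, two scalar addresses, and target jump address.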
It should be noted that although the scalar control flow instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In practice, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The scalar control flow instruction processing method provided by the embodiments of the present disclosure has a wide range of application and processes scalar control flow instructions efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause Q1. A scalar control flow instruction processing device, the device comprising a control module, the control module comprising:
an instruction compilation sub-module, configured to compile an obtained scalar control flow instruction to obtain a compiled scalar control flow instruction;
a data acquisition sub-module, configured to obtain, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and to determine the jump condition corresponding to the scalar control flow instruction; and
a jump control sub-module, configured to control the instruction flow to jump to the target jump address when the scalar to be judged satisfies the jump condition,
wherein the operation code indicates that the processing performed on data by the scalar control flow instruction is scalar jump processing, and the operation domain includes the address of the scalar to be judged and the target jump address.
Clause Q2. The device according to Clause Q1, wherein the jump control sub-module comprises:
at least one comparator, configured to compare the scalar to be judged according to the jump condition to obtain a comparison result, the comparison result indicating whether the scalar to be judged satisfies the jump condition.
Clause Q3. The device according to Clause Q1, wherein the operation domain further includes the jump condition,
and the data acquisition sub-module is configured to determine, when the operation domain includes the jump condition, the jump condition corresponding to the scalar control flow instruction according to the operation domain.
Clause Q4. The device according to Clause Q1, wherein the operation code further indicates the jump condition,
and the data acquisition sub-module is configured to determine, when the operation code indicates the jump condition, the jump condition corresponding to the scalar control flow instruction according to the operation code.
Clause Q5. The device according to Clause Q1, wherein the jump condition includes a judgment condition and the data type of the scalar to be judged,
wherein the judgment condition is any one of the following:
a first scalar of the scalars to be judged is equal to a second scalar of the scalars to be judged;
the first scalar to be judged is not equal to the second scalar to be judged;
the first scalar to be judged is less than the second scalar to be judged;
the first scalar to be judged is greater than or equal to the second scalar to be judged;
the scalar to be judged is greater than a specified value;
and the data type is any one of the following:
16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, or 48-bit signed.
Clause Q6. The device according to Clause Q1, the device further comprising:
a storage module, configured to store the scalar to be judged.
Clause Q7. The device according to Clause Q1, wherein the control module comprises:
an instruction storage sub-module, configured to store the compiled scalar control flow instruction;
an instruction processing sub-module, configured to parse the compiled scalar control flow instruction to obtain its operation code and operation domain; and
a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled scalar control flow instruction.
Clause Q8. The device according to Clause Q7, wherein the control module further comprises:
a dependency processing sub-module, configured to, when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, cache the first to-be-executed instruction in the instruction storage sub-module and, after the zeroth to-be-executed instruction has finished executing, fetch the first to-be-executed instruction from the instruction storage sub-module and control its execution,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval storing the data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause Q9. The device according to Clause Q1, wherein compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction comprises:
generating an assembly file from the scalar control flow instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled scalar control flow instruction.
Clause Q10. A machine learning computing device, the device comprising:
one or more scalar control flow instruction processing devices according to any one of Clauses Q1 to Q9, configured to obtain the scalar to be judged and control information from other processing devices, perform a specified machine learning operation, and pass the execution result to the other processing devices through an I/O interface;
when the machine learning computing device includes a plurality of the scalar control flow instruction processing devices, the plurality of scalar control flow instruction processing devices may be connected and transmit data through a specific structure;
wherein the plurality of scalar control flow instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of scalar control flow instruction processing devices share the same control system or have their own control systems; the plurality of scalar control flow instruction processing devices share memory or have their own memories; and the plurality of scalar control flow instruction processing devices may be interconnected in any interconnection topology.
Clause Q11. A combined processing device, the combined processing device comprising:
the machine learning computing device according to Clause Q10, a universal interconnection interface, and other processing devices;
the machine learning computing device interacting with the other processing devices to jointly complete a computing operation specified by the user,
wherein the combined processing device further comprises a storage device, the storage device being connected to the machine learning computing device and to the other processing devices respectively and configured to store data of the machine learning computing device and the other processing devices.
Clause Q12. A machine learning chip, the machine learning chip comprising:
the machine learning computing device according to Clause Q10 or the combined processing device according to Clause Q11.
Clause Q13. An electronic device, the electronic device comprising:
the machine learning chip according to Clause Q12.
Clause Q14. A board card, the board card comprising a storage device, an interface device, a control device, and the machine learning chip according to Clause Q12;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause Q15. A scalar control flow instruction processing method, the method comprising:
compiling an obtained scalar control flow instruction to obtain a compiled scalar control flow instruction;
obtaining, according to the operation code and operation domain of the compiled scalar control flow instruction, the scalar to be judged and the target jump address required to execute the scalar control flow instruction, and determining the jump condition corresponding to the scalar control flow instruction; and
controlling the instruction flow to jump to the target jump address when the scalar to be judged satisfies the jump condition,
wherein the operation code indicates that the processing performed on data by the scalar control flow instruction is scalar jump processing, and the operation domain includes the address of the scalar to be judged and the target jump address.
Clause Q16. The method according to Clause Q15, wherein controlling the instruction flow to jump to the target jump address when the scalar to be judged satisfies the jump condition comprises:
comparing the scalar to be judged by using at least one comparator according to the jump condition to obtain a comparison result, the comparison result indicating whether the scalar to be judged satisfies the jump condition.
Clause Q17. The method according to Clause Q15, wherein the operation domain further includes the jump condition,
and determining the jump condition corresponding to the scalar control flow instruction comprises:
when the operation domain includes the jump condition, determining the jump condition corresponding to the scalar control flow instruction according to the operation domain.
Clause Q18. The method according to Clause Q15, wherein the operation code further indicates the jump condition,
and determining the jump condition corresponding to the scalar control flow instruction comprises:
when the operation code indicates the jump condition, determining the jump condition corresponding to the scalar control flow instruction according to the operation code.
Clause Q19. The method according to Clause Q15, wherein the jump condition includes a judgment condition and the data type of the scalar to be judged,
wherein the judgment condition is any one of the following:
a first scalar of the scalars to be judged is equal to a second scalar of the scalars to be judged;
the first scalar to be judged is not equal to the second scalar to be judged;
the first scalar to be judged is less than the second scalar to be judged;
the first scalar to be judged is greater than or equal to the second scalar to be judged;
the scalar to be judged is greater than a specified value;
and the data type is any one of the following:
16-bit unsigned, 32-bit unsigned, 48-bit unsigned, 16-bit signed, 32-bit signed, or 48-bit signed.
Clause Q20. The method according to Clause Q15, the method further comprising:
storing the scalar to be judged.
Clause Q21. The method according to Clause Q15, the method further comprising:
storing the compiled scalar control flow instruction;
parsing the compiled scalar control flow instruction to obtain its operation code and operation domain; and
storing an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled scalar control flow instruction.
Clause Q22. The method according to Clause Q21, the method further comprising:
when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and controlling execution of the first to-be-executed instruction after it is determined that the zeroth to-be-executed instruction has finished executing,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval storing the data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause Q23. The method according to Clause Q15, wherein compiling the obtained scalar control flow instruction to obtain the compiled scalar control flow instruction comprises:
generating an assembly file from the scalar control flow instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled scalar control flow instruction.
With the widespread use of neural network algorithms and the continuous improvement in the computing capability of computer hardware, the variety and volume of data operations involved in practical applications keep growing. Programming languages are diverse, and the related art currently lacks a vector-operation instruction that is broadly applicable across programming languages, so technicians must define one or more custom instructions for their particular language environment in order to implement vector operations, which makes vector operations inefficient and slow. The present disclosure provides a vector instruction processing method and device, computer equipment, and a storage medium with which a vector operation can be implemented using a single instruction, significantly improving the efficiency and speed of vector operations.
Figure 18-1 shows a block diagram of a vector instruction processing device according to an embodiment of the present disclosure. As shown in Figure 18-1, the device includes a control module 18-11 and an operation module 18-12.
The control module 18-11 is configured to compile an obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain its operation code and operation domain, obtain, according to the operation code and operation domain, the vector(s) to be operated on and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction. The operation code indicates that the operation performed on data by the vector instruction is a vector operation, and the operation domain includes the address of the vector(s) to be operated on and the target address.
The operation module 18-12 is configured to perform the vector operation on the vector(s) to be operated on according to the vector operation type, obtain an operation result, and store the operation result at the target address.
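The division of labor between the control module and the operation module can be sketched in software. The model below is a minimal illustration under stated assumptions: memory is a flat Python list, the parsed operation domain is a dict whose keys (`src_a`, `src_b`, `length`, `dst`) are hypothetical, and the mnemonics `VADD` and `VMUL` are invented. It fetches both vectors from their addresses, applies the operation selected by the operation type, and writes the result to the target address.

```python
import operator

# Hypothetical opcode mnemonics mapped to element-wise operations
VECTOR_OPS = {
    "VADD": operator.add,
    "VMUL": operator.mul,
}


def execute_vector_instruction(memory: list, instr: dict) -> list:
    """Fetch both source vectors, apply the operation, and write the result to the target address.

    `instr` stands in for the parsed operation domain; its keys are hypothetical.
    """
    n = instr["length"]
    a = memory[instr["src_a"]:instr["src_a"] + n]   # first vector to be operated on
    b = memory[instr["src_b"]:instr["src_b"] + n]   # second vector to be operated on
    op = VECTOR_OPS[instr["opcode"]]                 # selected by the vector operation type
    result = [op(x, y) for x, y in zip(a, b)]
    memory[instr["dst"]:instr["dst"] + n] = result   # store at the target address
    return result
```

Because Python slices copy, the source vectors are read in full before the result is written, so the target range may safely overlap a source range in this model.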
In this embodiment, there may be one or more vectors to be operated on. The vector operation type indicates the kind of arithmetic or logical operation to be performed on the vector(s), for example, vector addition. Those skilled in the art can define vector operation types according to actual needs, and the present disclosure does not limit this.
In this embodiment, the vector instruction obtained by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. Only after the compiled vector instruction is obtained can it be parsed. The compiled vector instruction is a hardware instruction that the hardware can execute directly. The control module can obtain each vector to be operated on from its corresponding address. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of the instruction, or the field (usually represented as a code), specified in a computer program that designates the operation to be performed; it identifies the instruction and tells the executing device which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including parameters such as the vector(s) to be operated on and the vector operation type, as well as the corresponding operation method. A vector instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the vector(s) to be operated on and the target address.
It should be understood that those skilled in the art can define the instruction format of the vector instruction, and the operation code and operation domain it contains, as needed, and the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and their numbers can be set according to actual needs; the present disclosure does not limit this. When the device includes one control module, that control module can receive vector instructions and control one or more operation modules to perform vector operations. When the device includes multiple control modules, each control module can receive vector instructions and control its corresponding one or more operation modules to perform vector operations.
The vector instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile an obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain its operation code and operation domain, obtain, according to the operation code and operation domain, the vector(s) to be operated on and the target address required to execute the vector instruction, and determine the vector operation type of the vector instruction. The operation module is configured to perform the vector operation on the vector(s) to be operated on according to the vector operation type, obtain an operation result, and store the operation result at the target address. The device has a wide range of application and processes vector instructions, and performs vector operations, efficiently and quickly.
Figure 18-2a shows a block diagram of a vector instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Figure 18-2a, the operation module 18-12 may include a plurality of vector operators 18-120, which are configured to perform the vector operation corresponding to the vector operation type.
In this implementation, the vector operators may include adders, dividers, multipliers, comparators, and other operators capable of performing arithmetic, logical, and similar operations on vectors. The type and number of vector operators can be set according to the data volume of the vector operations to be performed, the vector operation type, and the required processing speed and efficiency; the present disclosure does not limit this.
Figure 18-2b shows a block diagram of a vector instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Figure 18-2b, the operation module 18-12 may include a master operation sub-module 18-121 and a plurality of slave operation sub-modules 18-122. The master operation sub-module 18-121 may include a plurality of vector operators (not shown in the figure) and is configured to perform the vector operation using these vector operators, obtain an operation result, and store the operation result at the target address.
In a possible implementation, as shown in Figure 18-2b, the operation module 18-12 may include a master operation sub-module 18-121 and a plurality of slave operation sub-modules 18-122, where each slave operation sub-module 18-122 may include a plurality of vector operators (not shown in the figure). The slave operation sub-modules 18-122 are configured to perform their corresponding vector operations in parallel using the vector operators they contain, obtain operation results, store the operation results in their corresponding sub-cache spaces, and send the operation results to the master operation sub-module 18-121. The master operation sub-module 18-121 is further configured to receive the operation results and store them at the target address.
In this implementation, the control module may decide, according to the vector operation type, the workload of the operation, and so on, whether the currently received vector instruction is executed by the master operation sub-module or by the plurality of slave operation sub-modules. For example, when it is determined that the vectors to be operated on need to be summed, the master operation sub-module can be controlled to perform the operation; when it is determined that they need to be multiplied, the plurality of slave operation sub-modules can be controlled to perform the operation.
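The master/slave arrangement resembles a scatter/gather pattern. The sketch below assumes an even contiguous split of the vectors across slaves and uses Python threads (`concurrent.futures.ThreadPoolExecutor`) only to mirror the structure; under the CPython GIL this will not speed up arithmetic. The function names and the `dispatch` policy, which follows the sum-versus-multiply example above, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_multiply(a_chunk: list, b_chunk: list) -> list:
    """One slave sub-module: multiply its slice element by element (its sub-cache result)."""
    return [x * y for x, y in zip(a_chunk, b_chunk)]


def slaves_multiply(a: list, b: list, num_slaves: int = 4) -> list:
    """Split the vectors across slaves, run them concurrently, and gather results in order."""
    n = len(a)
    if n == 0:
        return []
    step = -(-n // num_slaves)  # ceiling division: elements per slave
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        parts = list(pool.map(lambda r: slave_multiply(a[r[0]:r[1]], b[r[0]:r[1]]), bounds))
    result = []
    for part in parts:  # master sub-module concatenates partial results in slice order
        result.extend(part)
    return result


def dispatch(op: str, a: list, b=None, num_slaves: int = 4):
    """Hypothetical control-module policy: sums on the master, products on the slaves."""
    if op == "sum":
        return sum(a)  # master sub-module alone
    if op == "mul":
        return slaves_multiply(a, b, num_slaves)  # slave sub-modules in parallel
    raise ValueError(f"unsupported operation: {op}")
```

`pool.map` returns results in input order, which is what lets the master place each slave's slice back at the right offset before storing the full result at the target address.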
In a possible implementation, the operation domain may further include the vector operation type.
In that case, the control module 18-11 may be further configured to determine the vector operation type according to the operation domain.
In a possible implementation, the vector operation type may include at least one of the following: vector-vector multiplication, vector-scalar multiplication, vector addition, vector summation, storing a specified value when an operation condition is satisfied, bitwise AND, bitwise OR, bitwise XOR, bitwise NOT, element-wise maximum, and element-wise minimum. The operation condition may be any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, or element-wise less than or equal. The specified value may be a number such as 0 or 1; the present disclosure does not limit this.
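One way to picture how a decoded vector operation type selects behavior is a dispatch table. The sketch below is hypothetical throughout: the type names are invented, "vector multiplication" is assumed to mean element-wise multiplication (the text does not say whether a dot product is meant), and bitwise NOT assumes unsigned 16-bit elements. The conditional-store type is omitted here because it takes extra parameters.

```python
import operator


def elementwise(f):
    """Lift a binary scalar function to corresponding elements of two vectors."""
    return lambda a, b: [f(x, y) for x, y in zip(a, b)]


WIDTH = 16  # hypothetical element width, used only by bitwise NOT

VECTOR_OP_TYPES = {
    "mul": elementwise(operator.mul),                       # vector x vector (element-wise)
    "mul_scalar": lambda a, s: [x * s for x in a],          # vector x scalar
    "add": elementwise(operator.add),                       # vector + vector
    "sum": lambda a: sum(a),                                # sum of all elements
    "and": elementwise(operator.and_),                      # bitwise AND
    "or": elementwise(operator.or_),                        # bitwise OR
    "xor": elementwise(operator.xor),                       # bitwise XOR
    "not": lambda a: [~x & ((1 << WIDTH) - 1) for x in a],  # bitwise NOT, unsigned WIDTH bits
    "max": elementwise(max),                                # element-wise maximum
    "min": elementwise(min),                                # element-wise minimum
}
```

For instance, `VECTOR_OP_TYPES["max"]([1, 5], [3, 2])` yields `[3, 5]`, while `VECTOR_OP_TYPES["sum"]` reduces a single vector to one scalar.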
Conditional store on element-wise equality may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where they are equal, store the specified value; where they are not equal, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
Conditional store on element-wise inequality may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where they are not equal, store the specified value; where they are equal, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
Conditional store on element-wise less-than may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where the element of the first vector is less than that of the second, store the specified value; where it is greater than or equal to that of the second, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
Conditional store on element-wise greater-than-or-equal may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where the element of the first vector is greater than or equal to that of the second, store the specified value; where it is less than that of the second, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
Conditional store on element-wise greater-than may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where the element of the first vector is greater than that of the second, store the specified value; where it is less than or equal to that of the second, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
Conditional store on element-wise less-than-or-equal may work as follows: compare the corresponding elements of the first and second vectors to be operated on. Where the element of the first vector is less than or equal to that of the second, store the specified value; where it is greater than that of the second, store the element of the first or second vector at that position, or a value different from the specified value, such as 0.
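The six conditional-store operations differ only in the comparison used, so they can share one routine. The sketch below is illustrative only. The function name `store_if` and the `otherwise` parameter are hypothetical. The text allows several fallbacks where the condition fails (the first vector's element, the second vector's element, or a value such as 0), and here `otherwise` selects among them.

```python
import operator
from typing import List, Optional

# Comparison applied to each pair of corresponding elements, keyed by condition code.
CONDITIONS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
    "gt": operator.gt,
    "le": operator.le,
}


def store_if(cond: str, a: List[float], b: List[float],
             specified: float = 1, otherwise: Optional[str] = None) -> List[float]:
    """Element-wise conditional store.

    Where cond(a[i], b[i]) holds, the specified value is written. Otherwise:
    otherwise="a" keeps a[i], otherwise="b" keeps b[i], and None writes 0.
    """
    test = CONDITIONS[cond]
    out: List[float] = []
    for x, y in zip(a, b):
        if test(x, y):
            out.append(specified)
        elif otherwise == "a":
            out.append(x)
        elif otherwise == "b":
            out.append(y)
        else:
            out.append(0)
    return out
```

For example, `store_if("lt", [1, 5, 3], [2, 4, 3])` marks only the first position, because 1 < 2 while 5 < 4 and 3 < 3 are false.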
In this implementation, a different type code can be assigned to each vector operation type to distinguish them. For example: vector multiplication may be coded "mult"; vector-scalar multiplication "mult.const"; vector addition "add"; vector summation "sub"; bitwise AND "and"; bitwise OR "or"; bitwise XOR "xor"; bitwise NOT "not"; element-wise maximum "max"; and element-wise minimum "min". The conditional stores of the specified value 1 may be coded "eq" (equal), "ne" (not equal), "lt" (less than), "ge" (greater than or equal to), "gt" (greater than), and "le" (less than or equal to).
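A decoder can map each type code to the element-wise function it selects. The table below is a hypothetical software reference model built from the codes listed above; the element width and the `apply` helper are assumptions. A fixed width (32-bit unsigned) is assumed so that "not" has a well-defined result. "mult.const" and "sub" are left out because their operands are shaped differently (a vector with a scalar, and a variable number of vectors).

```python
import operator
from typing import Callable, Dict, List, Tuple

MASK32 = 0xFFFFFFFF  # assumed 32-bit unsigned elements; gives "not" a finite width


def _store_one_if(test: Callable[[int, int], bool]) -> Callable[[int, int], int]:
    # Conditional store: specified value 1 where the test holds, 0 elsewhere.
    return lambda x, y: 1 if test(x, y) else 0


# type code -> (number of operand vectors, element-wise function)
TYPE_TABLE: Dict[str, Tuple[int, Callable[..., int]]] = {
    "mult": (2, operator.mul),
    "add": (2, operator.add),
    "and": (2, operator.and_),
    "or": (2, operator.or_),
    "xor": (2, operator.xor),
    "not": (1, lambda x: ~x & MASK32),
    "max": (2, max),
    "min": (2, min),
    "eq": (2, _store_one_if(operator.eq)),
    "ne": (2, _store_one_if(operator.ne)),
    "lt": (2, _store_one_if(operator.lt)),
    "ge": (2, _store_one_if(operator.ge)),
    "gt": (2, _store_one_if(operator.gt)),
    "le": (2, _store_one_if(operator.le)),
}


def apply(code: str, *vectors: List[int]) -> List[int]:
    """Look up the type code and apply its function to corresponding elements."""
    arity, fn = TYPE_TABLE[code]
    if len(vectors) != arity:
        raise ValueError(f"{code!r} expects {arity} operand vector(s), got {len(vectors)}")
    return [fn(*elems) for elems in zip(*vectors)]
```

Keeping the codes in one table mirrors the point that the codes are a configurable choice: adding or renaming an operation only changes the table.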
Those skilled in the art may define the operation types and their codes according to actual needs; the present disclosure does not limit this.
In one possible implementation, the operation domain may further include an input amount. The control module 18-11 is further configured to determine the input amount from the operation domain and to fetch, from the to-be-operated data address, a vector to be operated on whose data amount equals the input amount.
In this implementation, the input amount may be any parameter that characterizes the data amount of the vector to be operated on, such as the vector length or width.
In one possible implementation, a default input amount may be set. When the input amount cannot be determined from the operation domain, the default input amount is used as the input amount of the current vector instruction, and a vector whose data amount equals the default input amount is fetched from the to-be-operated data address.
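Operand fetching with an optional input amount can be sketched as follows. This is a minimal sketch under two assumptions that the text does not state: memory is modeled as a dictionary from address to buffer, so each address acts as an opaque buffer handle, and `DEFAULT_SIZE` is a placeholder value. Both `fetch_operand` and `DEFAULT_SIZE` are hypothetical names.

```python
from typing import Dict, List, Optional

DEFAULT_SIZE = 8  # hypothetical default input amount; the text does not fix a value


def fetch_operand(memory: Dict[int, List[float]], addr: int,
                  size: Optional[int] = None) -> List[float]:
    """Read `size` elements from the buffer at `addr`, falling back to DEFAULT_SIZE
    when the operation domain carries no input amount."""
    n = DEFAULT_SIZE if size is None else size
    buf = memory[addr]
    if n > len(buf):
        raise ValueError(f"address {addr} holds {len(buf)} elements, {n} requested")
    return buf[:n]
```

Rejecting an over-long request, instead of silently returning a short vector, is a design choice of this sketch.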
In one possible implementation, as shown in FIGS. 18-2a and 18-2b, the apparatus may further include a storage module 18-13 configured to store the vectors to be operated on.
In this implementation, the storage module may include a cache, a register, or both. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache stores the data and vectors to be operated on; the register stores the scalar data among the data to be operated on.
In one possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory described above, may store the neuron data among the data to be operated on, and the neuron data may include neuron vector data.
In one possible implementation, the apparatus may further include a direct memory access (DMA) module for reading data from and storing data to the storage module.
In one possible implementation, the control module 18-11 may further be configured to generate an assembly file from the vector instruction and translate the assembly file into a binary file, the binary file being the compiled vector instruction.
In one possible implementation, the instruction format of the vector instruction may be:
opcode dst src type size
where opcode is the operation code of the vector instruction, and dst, src, type, and size form its operation domain. dst is the target address. src is the address of the vector to be operated on; when there are several vectors to be operated on, src may comprise several data addresses src0, src1, ..., srcn, which the present disclosure does not limit. type is the vector operation type, and size is the input amount. type may be a vector operation type code such as mult, mult.const, add, sub, eq, ne, lt, ge, gt, le, and, or, xor, not, max, or min.
When there are several vectors to be operated on, the instruction format may include several data addresses. Taking two vectors as an example, the instruction format of the vector instruction may be:
opcode dst src0 src1 type size
In another possible implementation, the instruction format of the vector instruction may be:
type, dst, src, size
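The formats above can all be decoded by one small parser. The sketch below is hypothetical and works on the textual (assembly-level) form rather than any binary encoding, which the text does not specify. It assumes that a leading token naming a known type code indicates the `type dst src... size` format; otherwise the first token is the generic opcode, and the type code is the second-to-last field.

```python
from dataclasses import dataclass
from typing import List

TYPE_CODES = {"mult", "mult.const", "add", "sub", "and", "or", "xor", "not",
              "max", "min", "eq", "ne", "lt", "ge", "gt", "le"}


@dataclass
class VectorInstr:
    op_type: str      # vector operation type code
    dst: int          # target address
    srcs: List[int]   # one or more to-be-operated data addresses
    size: int         # input amount


def parse(text: str) -> VectorInstr:
    """Decode `opcode dst src0 [src1 ...] type size` or `type dst src0 [src1 ...] size`.
    Commas between fields are optional."""
    tokens = text.replace(",", " ").split()
    if tokens[0] in TYPE_CODES:
        # The mnemonic itself names the vector operation type.
        op_type, fields = tokens[0], tokens[1:]
    else:
        # Generic opcode first; the type code sits just before the size field.
        op_type, fields = tokens[-2], tokens[1:-2] + tokens[-1:]
        if op_type not in TYPE_CODES:
            raise ValueError(f"unknown vector operation type {op_type!r}")
    dst, *srcs, size = (int(f) for f in fields)
    return VectorInstr(op_type, dst, srcs, size)
```

Both spellings of the same addition instruction, `opcode 500 101 102 add 1024` and `add 500 101 102 1024`, decode to the same record.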
In one possible implementation, the instruction format of the vector instruction for vector multiplication may be set to: mult dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, multiply the two vectors to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for vector-scalar multiplication may be set to: mult.const dst src0 src1 size. This means: fetch a vector of size elements from the first data address src0 and the scalar to be operated on from the second data address src1, multiply the vector by the scalar to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for vector addition may be set to: add dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, add the two vectors to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for vector summation may be set to: sub dst src size. This means: fetch several vectors of size elements each from the address src, sum them to obtain the operation result, and store the result at the target address dst.
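mult and add pair two equal-length vectors element by element, while mult.const and sub have differently shaped operands. The sketch below covers those two cases under stated assumptions. For mult.const, a single scalar is assumed to multiply every element. For the summation type (code "sub"), the text does not say whether the vectors are summed element-wise or reduced to one scalar; this sketch assumes element-wise addition of same-length vectors. The function names are hypothetical.

```python
from typing import List, Sequence


def mult_const(vec: Sequence[float], scalar: float) -> List[float]:
    """mult.const: every element of the vector is multiplied by one scalar."""
    return [x * scalar for x in vec]


def vector_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Summation type (code "sub"): adds several same-length vectors element-wise.
    Reading it as element-wise rather than a reduction to a scalar is an assumption."""
    if not vectors:
        raise ValueError("need at least one vector")
    size = len(vectors[0])
    if any(len(v) != size for v in vectors):
        raise ValueError("all vectors must have the same size")
    return [sum(col) for col in zip(*vectors)]
```

For example, summing three two-element vectors `[1, 2]`, `[3, 4]`, `[5, 6]` gives `[9, 12]` under this reading.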
In one possible implementation, the instruction format of the vector instruction for bitwise AND may be set to: and dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, perform a bitwise AND on the two vectors to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for bitwise OR may be set to: or dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, perform a bitwise OR on the two vectors to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for bitwise XOR may be set to: xor dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, perform a bitwise XOR on the two vectors to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for bitwise NOT may be set to: not dst src size. This means: fetch a vector of size elements from the address src, invert it bitwise to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for element-wise maximum may be set to: max dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, take the maximum of each pair of corresponding elements to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for element-wise minimum may be set to: min dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, take the minimum of each pair of corresponding elements to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where elements are equal may be set to: eq dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the corresponding elements of the two vectors are equal to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where elements are not equal may be set to: ne dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the corresponding elements of the two vectors are not equal to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where an element is less than its counterpart may be set to: lt dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the element of the first vector is less than that of the second to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where an element is greater than or equal to its counterpart may be set to: ge dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the element of the first vector is greater than or equal to that of the second to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where an element is greater than its counterpart may be set to: gt dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the element of the first vector is greater than that of the second to obtain the operation result, and store the result at the target address dst.
In one possible implementation, the instruction format of the vector instruction for storing the specified value 1 where an element is less than or equal to its counterpart may be set to: le dst src0 src1 size. This means: fetch a first vector of size elements from the first address src0 and a second vector of size elements from the second address src1, compare them element by element, store the specified value 1 where the element of the first vector is less than or equal to that of the second to obtain the operation result, and store the result at the target address dst.
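Executed end to end, each comparison instruction reads two operands, compares them element by element, and writes a 0/1 mask to dst. The following sketch is a hypothetical software model. Memory is again a dictionary from address to buffer (an assumption), and positions where the condition fails receive 0, which is one of the fallback choices the text allows. The name `exec_compare` is hypothetical.

```python
import operator
from typing import Dict, List

COMPARE = {"eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
           "ge": operator.ge, "gt": operator.gt, "le": operator.le}


def exec_compare(memory: Dict[int, List[float]], text: str) -> None:
    """Execute `<cond> dst src0 src1 size`: write 1 where the condition holds
    between corresponding elements, 0 elsewhere, into the buffer at dst."""
    cond, dst, src0, src1, size = text.split()
    n = int(size)
    a = memory[int(src0)][:n]   # first vector, size elements
    b = memory[int(src1)][:n]   # second vector, size elements
    test = COMPARE[cond]
    memory[int(dst)] = [1 if test(x, y) else 0 for x, y in zip(a, b)]
```

For instance, with `[1, 5, 3]` at one address and `[2, 4, 3]` at another, an `le` instruction over three elements writes `[1, 0, 1]`.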
It should be understood that those skilled in the art may set the operation code of the vector instruction, and the positions of the operation code and operation domain within the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the vector instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In practice, users may configure each module flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
Taking "performing vector operations with the vector instruction processing apparatus" as an exemplary application scenario, an application example according to an embodiment of the present disclosure is given below to help understand the flow of the apparatus. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 18-3 shows a schematic diagram of an application scenario of the vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 18-3, the apparatus processes a vector instruction as follows. The control module 18-11 compiles the acquired vector instruction 1 to obtain the compiled vector instruction 1 (for example, opcode 500 101 102 add 1024) and parses it to obtain its operation code and operation domain: the operation code is opcode, the target address is 500, the first to-be-operated vector address is 101, the second to-be-operated data address is 102, the vector operation type is add (vector addition), and the input amount is 1024. The control module 18-11 fetches a first vector whose data amount equals the input amount 1024 from address 101, and a second vector whose data amount equals 1024 from address 102. The operation module 18-12 adds the two vectors to obtain operation result 1 and stores it at the target address 500.
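The walk-through above can be replayed in software as a sanity check. This is a sketch under the same assumptions as elsewhere: addresses 101, 102, and 500 are treated as buffer handles in a dictionary rather than flat memory offsets, and the compile step is skipped because the text does not define the binary form. Only the add case from the example is implemented, and `execute` is a hypothetical name.

```python
from typing import Dict, List


def execute(memory: Dict[int, List[float]], text: str) -> None:
    """Replay the application example: decode `opcode dst src0 src1 type size`,
    fetch both operands, add them, and write the result to dst."""
    _opcode, dst, src0, src1, op_type, size = text.split()
    if op_type != "add":
        raise NotImplementedError("this sketch only covers the add example")
    n = int(size)                                   # input amount
    a = memory[int(src0)][:n]                       # first vector from address 101
    b = memory[int(src1)][:n]                       # second vector from address 102
    memory[int(dst)] = [x + y for x, y in zip(a, b)]  # result 1 stored at address 500
```

Example usage: with `memory = {101: [1.0] * 1024, 102: list(range(1024))}`, calling `execute(memory, "opcode 500 101 102 add 1024")` leaves a 1024-element result at key 500.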
Besides the form opcode 500 101 102 add 1024 above, vector instruction 1 may also be written as add 500 101 102 1024. Vector instructions in different formats are processed in a similar way, so the details are not repeated here.
The working process of each module above is as described in the related description earlier.
In this way, the vector instruction processing apparatus can process vector instructions efficiently and quickly, so vector operations are performed with high efficiency and speed.
FIG. 18-4 shows a flowchart of a vector instruction processing method according to an embodiment of the present disclosure. The method may be applied to, for example, a computer device including a memory and a processor, where the memory stores the data used while executing the method and the processor performs the related processing and operation steps, such as steps S51-18 and S52-18 below. As shown in FIG. 18-4, the method is applied to the vector instruction processing apparatus described above and includes steps S51-18 and S52-18.
In step S51-18, the control module compiles the acquired vector instruction to obtain a compiled vector instruction, parses the compiled instruction to obtain its operation code and operation domain, obtains the vectors to be operated on and the target address required to execute the instruction according to the operation code and operation domain, and determines the vector operation type of the instruction. The operation code indicates that the operation the vector instruction performs on the data is a vector operation, and the operation domain includes the to-be-operated vector address and the target address.
In step S52-18, the operation module performs the vector operation on the vectors to be operated on according to the vector operation type to obtain an operation result, and stores the operation result at the target address.
In one possible implementation, performing the vector operation on the vectors according to the vector operation type to obtain the operation result may include: executing, with a plurality of vector operators in the operation module, the vector operation corresponding to the vector operation type.
In one possible implementation, the operation module may include a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module may include the plurality of vector operators. Step S52-18 may then include: executing, with the vector operators in the master operation sub-module, the vector operation corresponding to the vector operation type to obtain the operation result, and storing the operation result at the target address.
In one possible implementation, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and each slave operation sub-module includes a plurality of vector operators. Performing the vector operation according to the vector operation type, obtaining the operation result, and storing it at the target address then includes: executing the corresponding vector operations in parallel with the vector operators of each slave operation sub-module to obtain operation results, storing the results in the corresponding sub-cache spaces, and sending them to the master operation sub-module; and receiving the operation results with the master operation sub-module and storing them at the target address.
In one possible implementation, the operand field may also include the vector operation type. Determining the vector operation type of the vector instruction may then include: when the operand field includes the vector operation type, determining the vector operation type from the operand field.
In one possible implementation, the operand field may also include an input size. Obtaining, according to the opcode and the operand field, the operand vector and target address required to execute the vector instruction may then further include: determining the input size from the operand field, and fetching, from the operand data address, an operand vector whose data amount equals the input size.
In one possible implementation, the opcode also indicates the vector operation type. Determining the vector operation type of the vector instruction may then include: when the opcode indicates the vector operation type, determining the vector operation type from the opcode.
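The three options above (operation type carried in the operand field, operation type encoded in the opcode, and an input size that bounds how much data is fetched) can be sketched as follows. Everything concrete here is a hypothetical illustration: the dictionary keys `op_type`, `src_addr`, and `input_size`, the `"VEC.ADD"`-style opcode convention, and the choice to check the operand field before the opcode are assumptions, since the text does not fix an encoding or a precedence.

```python
from typing import Dict, List, Union

OperandField = Dict[str, Union[int, str]]


def resolve_op_type(opcode: str, operand_field: OperandField) -> str:
    # Variant 1: the operand field names the operation type explicitly.
    if "op_type" in operand_field:
        return str(operand_field["op_type"])
    # Variant 2: the opcode itself encodes the type, e.g. "VEC.ADD".
    if "." in opcode:
        return opcode.split(".", 1)[1]
    raise ValueError("instruction does not specify a vector operation type")


def fetch_operand(memory: List[int], operand_field: OperandField) -> List[int]:
    # Read exactly `input_size` elements starting at the operand data address.
    addr = int(operand_field["src_addr"])
    count = int(operand_field["input_size"])
    if addr < 0 or count < 0 or addr + count > len(memory):
        raise IndexError("operand range falls outside memory")
    return memory[addr:addr + count]
```

For instance, `fetch_operand([5, 6, 7, 8, 9], {"src_addr": 1, "input_size": 3})` reads three elements starting at address 1.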
In one possible implementation, the vector operation type may include at least one of the following: vector multiplication, vector-scalar multiplication, vector addition, vector summation, conditional store (storing a specified value where an operation condition is met), bitwise AND, bitwise OR, bitwise XOR, bitwise NOT, element-wise maximum, and element-wise minimum. The operation condition may be any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, and element-wise less than or equal.
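A dispatch table is one way to picture the list of operation types above. The sketch below is illustrative only and rests on several assumptions the text does not state: elements are unsigned 32-bit integers, "vector multiplication" is element-wise (not a dot product), "vector summation" reduces a vector to one scalar, and the conditional store writes 0 wherever the condition fails. The key names such as `"ge"` and the helper names are hypothetical.

```python
import operator
from typing import Callable, Dict, List

Vector = List[int]
MASK32 = 0xFFFFFFFF  # assumption: elements are unsigned 32-bit integers


def elementwise(fn: Callable[[int, int], int]) -> Callable[[Vector, Vector], Vector]:
    """Lift a scalar binary function to equal-length vectors."""
    def apply(a: Vector, b: Vector) -> Vector:
        if len(a) != len(b):
            raise ValueError("operand vectors must have the same length")
        return [fn(x, y) for x, y in zip(a, b)]
    return apply


# Two-vector operation types, all treated as element-wise.
BINARY_OPS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "mul": elementwise(operator.mul),   # assumed element-wise, not a dot product
    "add": elementwise(operator.add),
    "and": elementwise(operator.and_),
    "or": elementwise(operator.or_),
    "xor": elementwise(operator.xor),
    "max": elementwise(max),
    "min": elementwise(min),
}

# The six operation conditions for the conditional-store operation.
CONDITIONS: Dict[str, Callable[[int, int], bool]] = {
    "eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
    "ge": operator.ge, "gt": operator.gt, "le": operator.le,
}


def scalar_mul(a: Vector, s: int) -> Vector:
    return [x * s for x in a]


def vector_sum(a: Vector) -> int:
    return sum(a)


def bitwise_not(a: Vector) -> Vector:
    return [(~x) & MASK32 for x in a]  # mask keeps the result in 32 bits


def store_if(a: Vector, b: Vector, cond: str, value: int) -> Vector:
    # Hypothetical semantics: `value` where the condition holds, else 0.
    test = CONDITIONS[cond]
    return elementwise(lambda x, y: value if test(x, y) else 0)(a, b)
```

For example, `store_if([1, 5, 3], [2, 5, 1], "ge", 9)` keeps 9 at the two positions where the first vector is greater than or equal to the second.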
In one possible implementation, the method may further include: storing the operand vector in a storage module of the device, where the storage module includes at least one of a register and a cache. The cache stores the operand data and the operand vector and includes at least one neuron cache (NRAM); the register stores scalar data within the operand data; and the neuron cache stores neuron data within the operand data, the neuron data including neuron vector data.
In one possible implementation, parsing the compiled vector instruction to obtain its opcode and operand field may include: storing the compiled vector instruction; parsing the compiled vector instruction to obtain the opcode and operand field of the vector instruction; and storing an instruction queue that contains a plurality of to-be-executed instructions arranged in execution order, which may include the compiled vector instruction.
In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction that precedes it, caching the first to-be-executed instruction, and, after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction.
Here, the first to-be-executed instruction being associated with the preceding zeroth to-be-executed instruction may mean that the first storage address interval, which holds the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which holds the data required by the zeroth to-be-executed instruction.
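The overlap rule above can be sketched as a simple in-order dispatcher. This is a simplified model under stated assumptions: each instruction's data is described by a single half-open address interval `[start, end)`, an instruction is held back if its interval overlaps that of any earlier instruction that has not finished, and the names `PendingInstr` and `dispatch` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class PendingInstr:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval)


def overlaps(a: PendingInstr, b: PendingInstr) -> bool:
    # Half-open intervals [start, end) share at least one address.
    return a.start < b.end and b.start < a.end


def dispatch(queue: List[PendingInstr], finished: Set[str]) -> Tuple[List[str], List[str]]:
    """Split the not-yet-finished instructions into those that may be sent to
    the operation module now and those held back by an earlier overlap."""
    ready: List[str] = []
    held: List[str] = []
    for i, first in enumerate(queue):
        if first.name in finished:
            continue
        blocked = any(
            zeroth.name not in finished and overlaps(first, zeroth)
            for zeroth in queue[:i]
        )
        (held if blocked else ready).append(first.name)
    return ready, held
```

With instructions touching `[0, 64)`, `[32, 96)`, and `[128, 160)`, the second one is held until the first finishes, while the third can be dispatched immediately. The rule is conservative: it treats any overlap as a dependency, even when both instructions only read the shared addresses.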
In one possible implementation, compiling the obtained vector instruction to obtain the compiled vector instruction may include: generating an assembly file from the vector instruction and translating the assembly file into a binary file, the binary file being the compiled vector instruction.
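To make the assembly-to-binary step concrete, the sketch below translates lines of a made-up assembly syntax into fixed-width 64-bit words. The mnemonics `VADD`/`VMUL`, their opcode values, and the field layout (8-bit opcode, three 16-bit addresses, 8-bit element count, least significant field first) are all hypothetical; the disclosure does not specify an instruction encoding.

```python
from typing import Dict

# Hypothetical instruction set and 64-bit layout (least significant field first).
OPCODES: Dict[str, int] = {"VADD": 0x01, "VMUL": 0x02}
FIELDS = [("opcode", 8), ("src0", 16), ("src1", 16), ("dst", 16), ("count", 8)]


def assemble_line(line: str) -> bytes:
    """Translate one line such as 'VADD 0x10, 0x20, 0x30, 4' into 8 bytes."""
    mnemonic, rest = line.split(None, 1)
    values = [OPCODES[mnemonic]] + [int(tok, 0) for tok in rest.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS) - 1} operands: {line!r}")
    word, shift = 0, 0
    for (name, width), value in zip(FIELDS, values):
        if not 0 <= value < (1 << width):
            raise ValueError(f"field {name} out of range: {value}")
        word |= value << shift
        shift += width
    return word.to_bytes(8, "little")


def assemble(source: str) -> bytes:
    """'Assembly file' text to 'binary file' bytes, one word per non-blank line."""
    return b"".join(assemble_line(ln) for ln in source.splitlines() if ln.strip())


def disassemble(word: bytes) -> Dict[str, int]:
    value = int.from_bytes(word, "little")
    fields, shift = {}, 0
    for name, width in FIELDS:
        fields[name] = (value >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

A fixed-width encoding keeps decoding trivial: the parser in the control module only needs to shift and mask each field, which `disassemble` mirrors.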
It should be noted that although the vector instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the technical solutions of the present disclosure are followed.
The vector instruction processing method provided by the embodiments of the present disclosure is widely applicable and performs vector operations with high efficiency and speed.
The foregoing can be better understood in light of the following clauses:
Clause R1. A vector instruction processing apparatus, comprising:
a control module, configured to compile an obtained vector instruction to obtain a compiled vector instruction, parse the compiled vector instruction to obtain an opcode and an operand field of the vector instruction, obtain, according to the opcode and the operand field, an operand vector and a target address required to execute the vector instruction, and determine a vector operation type of the vector instruction; and
an operation module, configured to perform a vector operation on the operand vector according to the vector operation type to obtain an operation result, and store the operation result at the target address,
wherein the opcode indicates that the operation performed on data by the vector instruction is a vector operation, and the operand field includes an operand vector address and the target address.
Clause R2. The apparatus according to Clause R1, wherein the operation module includes:
a plurality of vector arithmetic units, configured to perform the vector operation corresponding to the vector operation type.
Clause R3. The apparatus according to Clause R2, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule including the plurality of vector arithmetic units, and
the master operation submodule is configured to perform the vector operation using the plurality of vector arithmetic units to obtain an operation result, and store the operation result at the target address.
Clause R4. The apparatus according to Clause R2, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the slave operation submodules including the plurality of vector arithmetic units;
each slave operation submodule is configured to perform the corresponding vector operation in parallel using its vector arithmetic units to obtain an operation result, store the operation result in a corresponding sub-buffer space, and send the operation result to the master operation submodule; and
the master operation submodule is further configured to receive the operation results and store them at the target address.
Clause R5. The apparatus according to Clause R1, wherein the operand field further includes a vector operation type, and
the control module is further configured to determine the vector operation type according to the operand field when the operand field includes the vector operation type.
Clause R6. The apparatus according to Clause R1, wherein the operand field further includes an input size, and
the control module is further configured to determine the input size according to the operand field, and obtain, from the operand data address, an operand vector whose data amount equals the input size.
Clause R7. The apparatus according to Clause R1, wherein the opcode further indicates the vector operation type, and
the control module is further configured to determine the vector operation type according to the opcode when the opcode indicates the vector operation type.
Clause R8. The apparatus according to Clause R1, wherein the vector operation type includes at least one of the following:
vector multiplication, vector-scalar multiplication, vector addition, vector summation, conditional store (storing a specified value where an operation condition is met), bitwise AND, bitwise OR, bitwise XOR, bitwise NOT, element-wise maximum, and element-wise minimum,
wherein the operation condition is any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, and element-wise less than or equal.
Clause R9. The apparatus according to Clause R1, further comprising:
a storage module, configured to store the operand vector,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store operand data and the operand vector, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the operand data; and
the neuron cache is configured to store neuron data in the operand data, the neuron data including neuron vector data.
Clause R10. The apparatus according to Clause R1, wherein the control module includes:
an instruction storage submodule, configured to store the compiled vector instruction;
an instruction processing submodule, configured to parse the compiled vector instruction to obtain the opcode and operand field of the vector instruction; and
a queue storage submodule, configured to store an instruction queue containing a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled vector instruction.
Clause R11. The apparatus according to Clause R10, wherein the control module further includes:
a dependency processing submodule, configured to: when determining that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, cache the first to-be-executed instruction in the instruction storage submodule, and, after the zeroth to-be-executed instruction finishes executing, fetch the first to-be-executed instruction from the instruction storage submodule and send it to the operation module,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval storing data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
Clause R12. The apparatus according to Clause R1, wherein
the control module is further configured to generate an assembly file according to the vector instruction and translate the assembly file into a binary file,
the binary file being the compiled vector instruction.
Clause R13. A machine learning operation apparatus, comprising:
one or more vector instruction processing apparatuses according to any one of Clauses R1 to R12, configured to obtain operand vectors and control information from other processing apparatuses, perform specified machine learning operations, and pass execution results to the other processing apparatuses through an I/O interface;
wherein, when the machine learning operation apparatus includes a plurality of the vector instruction processing apparatuses, the plurality of vector instruction processing apparatuses may be connected and transmit data through a specific structure;
the plurality of vector instruction processing apparatuses are interconnected and transmit data over a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of vector instruction processing apparatuses share one control system or have their own control systems; the plurality of vector instruction processing apparatuses share memory or have their own memories; and the plurality of vector instruction processing apparatuses are interconnected in an arbitrary interconnection topology.
Clause R14. A combined processing apparatus, comprising:
the machine learning operation apparatus according to Clause R13, a universal interconnection interface, and other processing apparatuses;
the machine learning operation apparatus interacting with the other processing apparatuses to jointly complete a computing operation specified by a user,
wherein the combined processing apparatus further includes a storage apparatus connected to both the machine learning operation apparatus and the other processing apparatuses and configured to store data of the machine learning operation apparatus and the other processing apparatuses.
Clause R15. A machine learning chip, comprising:
the machine learning operation apparatus according to Clause R13 or the combined processing apparatus according to Clause R14.
Clause R16. An electronic device, comprising:
the machine learning chip according to Clause R15.
Clause R17. A board card, comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause R15;
wherein the machine learning chip is connected to the storage device, the control device, and the interface apparatus respectively;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor a state of the machine learning chip.
Clause R18. A vector instruction processing method, applied to a vector instruction processing apparatus that includes a control module and an operation module, the method comprising:
compiling, by the control module, an obtained vector instruction to obtain a compiled vector instruction, parsing the compiled vector instruction to obtain an opcode and an operand field of the vector instruction, obtaining, according to the opcode and the operand field, an operand vector and a target address required to execute the vector instruction, and determining a vector operation type of the vector instruction; and
performing, by the operation module, a vector operation on the operand vector according to the vector operation type to obtain an operation result, and storing the operation result at the target address,
wherein the opcode indicates that the operation performed on data by the vector instruction is a vector operation, and the operand field includes an operand vector address and the target address.
Clause R19. The method according to Clause R18, wherein performing the vector operation on the operand vector according to the vector operation type includes:
performing, using a plurality of vector arithmetic units in the operation module, the vector operation corresponding to the vector operation type.
Clause R20. The method according to Clause R19, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the master operation submodule including the plurality of vector arithmetic units,
and performing the vector operation on the operand vector according to the vector operation type to obtain the operation result and storing the operation result at the target address includes:
performing, using the plurality of vector arithmetic units in the master operation submodule, the vector operation corresponding to the vector operation type to obtain the operation result, and storing the operation result at the target address.
Clause R21. The method according to Clause R19, wherein the operation module includes a master operation submodule and a plurality of slave operation submodules, the slave operation submodules including the plurality of vector arithmetic units,
and performing the vector operation on the operand vector according to the vector operation type to obtain the operation result and storing the operation result at the target address includes:
performing, using the vector arithmetic units in each slave operation submodule, the corresponding vector operations in parallel to obtain operation results, storing the operation results in corresponding sub-buffer spaces, and sending the operation results to the master operation submodule; and
receiving, by the master operation submodule, the operation results and storing them at the target address.
Clause R22. The method according to Clause R18, wherein the operand field further includes a vector operation type,
and determining the vector operation type of the vector instruction includes:
when the operand field includes the vector operation type, determining the vector operation type according to the operand field.
Clause R23. The method according to Clause R18, wherein the operand field further includes an input size,
and obtaining, according to the opcode and the operand field, the operand vector and target address required to execute the vector instruction further includes:
determining the input size according to the operand field, and obtaining, from the operand data address, an operand vector whose data amount equals the input size.
Clause R24. The method according to Clause R18, wherein the opcode further indicates the vector operation type,
and determining the vector operation type of the vector instruction includes:
when the opcode indicates the vector operation type, determining the vector operation type according to the opcode.
Clause R25. The method according to Clause R18, wherein the vector operation type includes at least one of the following:
vector multiplication, vector-scalar multiplication, vector addition, vector summation, conditional store (storing a specified value where an operation condition is met), bitwise AND, bitwise OR, bitwise XOR, bitwise NOT, element-wise maximum, and element-wise minimum,
wherein the operation condition is any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, and element-wise less than or equal.
Clause R26. The method according to Clause R18, further comprising:
storing the operand vector using a storage module of the apparatus,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store operand data and the operand vector, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the operand data; and
the neuron cache is configured to store neuron data in the operand data, the neuron data including neuron vector data.
Clause R27. The method according to Clause R18, wherein parsing the compiled vector instruction to obtain the opcode and operand field of the vector instruction includes:
storing the compiled vector instruction;
parsing the compiled vector instruction to obtain the opcode and operand field of the vector instruction; and
storing an instruction queue containing a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled vector instruction.
Clause R28. The method according to Clause R27, further comprising:
when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and, after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,
wherein the first to-be-executed instruction being associated with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval storing data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
Clause R29. The method according to Clause R18, wherein compiling the obtained vector instruction to obtain the compiled vector instruction includes:
generating an assembly file according to the vector instruction, and translating the assembly file into a binary file,
the binary file being the compiled vector instruction.
Clause R30. A non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the method according to any one of Clauses R18 to R29.
With the widespread use of neural network algorithms and the steadily increasing computing power of computer hardware, the variety and volume of data operations in practical applications keep growing. Programming languages are diverse, and there is currently no loop vector instruction that applies broadly across them; in the related art, technicians must therefore define one or more custom instructions for their own programming-language environment to implement loop vector operations, which makes those operations inefficient and slow. The present disclosure provides a loop vector instruction processing method, apparatus, computer device, and storage medium that implement a loop vector operation with a single instruction, which can significantly improve the efficiency and speed of loop vector operations.
FIG. 19-1 shows a block diagram of a loop vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 19-1, the apparatus includes a control module 19-11 and an operation module 19-12.
The control module 19-11 is configured to compile an obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain its opcode and operand field, obtain, according to the opcode and operand field, a first operand vector, a second operand vector, and a target address required to execute the loop vector instruction, and determine the vector operation type of the loop vector instruction. The opcode indicates that the operation performed on data by the loop vector instruction is a loop vector operation, and the operand field includes a first operand vector address, a second operand vector address, and the target address.
The operation module 19-12 is configured to split the first operand vector into a plurality of sub-vectors according to the second operand vector, perform the vector operation of the given vector operation type between each sub-vector and the second operand vector to obtain the operation result, and store the operation result at the target address.
In this embodiment, a loop vector operation may split the vector with the larger data amount into a plurality of sub-vectors, each having the same data amount as the other, smaller vector, and then perform the operation of the corresponding vector operation type between each sub-vector and the smaller vector to obtain the operation result. The vector operation type indicates the kind of arithmetic or logical operation performed on the sub-vectors and the second operand vector, for example vector addition. Those skilled in the art may set the vector operation type as needed; the present disclosure does not limit this.
In this embodiment, performing the vector operation of the given type between each sub-vector and the second operand vector yields a plurality of partial results, one per sub-vector. These partial results are stored at the target address as the operation result of the loop vector instruction, that is, as the result of the vector operation between the first and second operand vectors. The data amount of the first operand vector may be an integer multiple of that of the second operand vector, so that every sub-vector can be operated on with the second operand vector.
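The split-and-repeat behaviour described above can be sketched in a few lines. This is an illustration rather than the disclosed implementation: it assumes element-wise binary operations on integer lists, returns the concatenated partial results instead of writing them to a target address, and the name `loop_vector_op` is hypothetical. The integer-multiple requirement becomes an explicit check.

```python
from typing import Callable, List

Vector = List[int]


def loop_vector_op(first: Vector, second: Vector,
                   op: Callable[[int, int], int]) -> Vector:
    """Split `first` into sub-vectors of len(second), apply `op` element-wise
    between each sub-vector and `second`, and concatenate the partial results."""
    n, m = len(first), len(second)
    if m == 0 or n % m != 0:
        raise ValueError("len(first) must be a nonzero integer multiple of len(second)")
    result: Vector = []
    for start in range(0, n, m):
        sub_vector = first[start:start + m]
        result.extend(op(x, y) for x, y in zip(sub_vector, second))
    return result
```

For example, adding `[10, 20, 30]` to `[1, 2, 3, 4, 5, 6]` splits the first vector into two sub-vectors and yields `[11, 22, 33, 14, 25, 36]`, which is the same as operating on the second vector repeated twice.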
In this embodiment, the loop vector instruction obtained by the control module is an uncompiled software instruction that hardware cannot execute directly, so the control module must first compile it; only after the compiled loop vector instruction is obtained can it be parsed. The compiled loop vector instruction is a hardware instruction that the hardware can execute directly. The control module may obtain the first operand vector and the second operand vector from the first operand vector address and the second operand vector address, respectively. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the opcode may be the part of an instruction, or the field (usually represented as a code), that specifies the operation to be performed in a computer program. It serves as the instruction's serial number, informing the device that executes the instruction which instruction is to be executed. The operation field may be the source of all data required to execute the corresponding instruction, including parameters such as the first to-be-operated vector, the second to-be-operated vector, and the vector operation type, as well as the corresponding operation method. A loop vector instruction must include an opcode and an operation field, where the operation field includes at least the first to-be-operated vector address, the second to-be-operated vector address, and the target address.
It should be understood that those skilled in the art may set the instruction format of the loop vector instruction, as well as the opcode and operation field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules; their numbers may be set according to actual needs, and the present disclosure does not limit this. When the device includes one control module, that control module may receive the loop vector instruction and control one or more operation modules to perform the loop vector operation. When the device includes multiple control modules, each control module may receive a loop vector instruction and control its corresponding one or more operation modules to perform the loop vector operation.
The loop vector instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to compile the obtained loop vector instruction to obtain a compiled loop vector instruction; parse the compiled loop vector instruction to obtain its opcode and operation field; obtain, according to the opcode and operation field, the first to-be-operated vector, the second to-be-operated vector, and the target address required to execute the loop vector instruction; and determine the vector operation type of the loop vector instruction. The operation module is configured to split the first to-be-operated vector into multiple split vectors according to the second to-be-operated vector, perform the vector operation of that type between each split vector and the second to-be-operated vector to obtain an operation result, and store the operation result at the target address. The loop vector instruction processing device provided by the embodiments of the present disclosure has a wide range of application and processes loop vector instructions, and the operations they specify, efficiently and quickly.
FIG. 19-2a shows a block diagram of a loop vector instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 19-2a, the operation module 19-12 may include multiple vector operators 19-120, which are used to perform the vector operation corresponding to the vector operation type.
In this implementation, the vector operators may include adders, dividers, multipliers, comparators, and other operators capable of performing arithmetic, logical, and similar operations on vectors. The type and number of vector operators may be set according to requirements such as the data amount of the vector operations to be performed, the vector operation types, and the required processing speed and efficiency; the present disclosure does not limit this.
FIG. 19-2b shows a block diagram of a loop vector instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 19-2b, the operation module 19-12 may include a master operation sub-module 19-121 and multiple slave operation sub-modules 19-122. The master operation sub-module 19-121 may include multiple vector operators (not shown in the figure).
The master operation sub-module 19-121 is configured to perform the vector operation using its multiple vector operators to obtain the operation result, and to store the operation result at the target address.
In one possible implementation, as shown in FIG. 19-2b, the operation module 19-12 may include the master operation sub-module 19-121 and multiple slave operation sub-modules 19-122, and each slave operation sub-module 19-122 may include multiple vector operators (not shown in the figure). The slave operation sub-modules 19-122 are configured to perform the corresponding vector operations in parallel using their vector operators, obtain operation results, store the operation results in the corresponding sub-cache spaces, and send the operation results to the master operation sub-module 19-121. The master operation sub-module 19-121 is further configured to receive the operation results and store them at the target address.
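The master/slave arrangement above can be imitated in software to show the data flow. In this sketch, Python threads merely stand in for the slave operation sub-modules, each slave writes its results into a private dictionary that plays the role of its sub-cache space, and the master copies every partial result into a list standing in for the target address. The round-robin assignment of split vectors to slaves and the name run_on_slaves are assumptions for illustration, not part of the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def run_on_slaves(src0: Sequence[int], src1: Sequence[int],
                  op: Callable[[int, int], int], num_slaves: int) -> List[int]:
    n1 = len(src1)
    if n1 == 0 or len(src0) % n1 != 0:
        raise ValueError("len(src0) must be a positive integer multiple of len(src1)")
    chunks = [list(src0[i:i + n1]) for i in range(0, len(src0), n1)]

    def slave(slave_id: int) -> Dict[int, List[int]]:
        # Each simulated slave owns a private sub-cache and (assumed round-robin)
        # processes split vectors slave_id, slave_id + num_slaves, ...
        sub_cache: Dict[int, List[int]] = {}
        for k in range(slave_id, len(chunks), num_slaves):
            sub_cache[k] = [op(a, b) for a, b in zip(chunks[k], src1)]
        return sub_cache

    target = [0] * len(src0)  # stands in for the memory at the target address
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # The master gathers each slave's partial results into the target buffer.
        for sub_cache in pool.map(slave, range(num_slaves)):
            for k, part in sub_cache.items():
                target[k * n1:(k + 1) * n1] = part
    return target


print(run_on_slaves([1, 2, 3, 4, 5, 6], [1, 1, 1], lambda a, b: a + b, num_slaves=2))
# [2, 3, 4, 5, 6, 7]
```

Because each split vector is written to a fixed, non-overlapping slice of the target buffer, the result does not depend on the order in which the slaves finish.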
In this implementation, the control module may decide, according to the vector operation type, the workload of the operation, and so on, whether the currently received vector instruction is executed by the master operation sub-module or by the multiple slave operation sub-modules. For example, when the vector operation type is determined to be vector addition, the master operation sub-module may be controlled to perform the operation; when it is determined to be vector multiplication, the multiple slave operation sub-modules may be controlled to perform the operation.
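One way the control module's choice could be expressed is sketched below. Only the two examples from the text (addition on the master, multiplication on the slaves) come from the disclosure; the size threshold large_task, its value 4096, and the rule that large tasks of any type go to the slaves are hypothetical stand-ins for "the workload of the operation".

```python
def choose_executor(op_type: str, num_elements: int, large_task: int = 4096) -> str:
    """Return "master" or "slaves" for a loop vector instruction (hypothetical policy)."""
    if op_type == "mult.cycle":
        return "slaves"  # example from the text: multiplication goes to the slaves
    if op_type == "add.cycle" and num_elements <= large_task:
        return "master"  # example from the text: addition stays on the master
    # Assumed workload rule for every other case: big jobs are spread over the slaves.
    return "slaves" if num_elements > large_task else "master"


print(choose_executor("add.cycle", 64))   # master
print(choose_executor("mult.cycle", 64))  # slaves
```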
In one possible implementation, the operation field may further include the vector operation type.
In this case, the control module 19-11 may also be configured to determine the vector operation type according to the operation field.
In one possible implementation, the vector operation type may include at least one of the following: vector multiplication, vector addition, vector summation, storing a specified value when an operation condition is met, bitwise AND, bitwise OR, and bitwise XOR. The operation condition may be any one of the following: bitwise equal, bitwise not equal, bitwise less than, bitwise greater than or equal to, bitwise greater than, and bitwise less than or equal to. The specified value may be a number such as 0 or 1; the present disclosure does not limit this.
Here, the operation of storing the specified value when bitwise equal may be: determine whether the corresponding bits of the split vector and the second to-be-operated vector are equal; when they are equal, store the specified value; when they are not equal, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
The operation of storing the specified value when bitwise not equal may be: determine whether the corresponding bits of the split vector and the second to-be-operated vector are equal; when they are not equal, store the specified value; when they are equal, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
The operation of storing the specified value when bitwise less than may be: compare the values of the split vector and the second to-be-operated vector at each corresponding bit; when the split vector's value at that bit is less than the second to-be-operated vector's value, store the specified value; when it is greater than or equal to it, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
The operation of storing the specified value when bitwise greater than or equal to may be: compare the values of the split vector and the second to-be-operated vector at each corresponding bit; when the split vector's value at that bit is greater than or equal to the second to-be-operated vector's value, store the specified value; when it is less, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
The operation of storing the specified value when bitwise greater than may be: compare the values of the split vector and the second to-be-operated vector at each corresponding bit; when the split vector's value at that bit is greater than the second to-be-operated vector's value, store the specified value; when it is less than or equal to it, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
The operation of storing the specified value when bitwise less than or equal to may be: compare the values of the split vector and the second to-be-operated vector at each corresponding bit; when the split vector's value at that bit is less than or equal to the second to-be-operated vector's value, store the specified value; when it is greater, store the value of the split vector or of the second to-be-operated vector at that bit, or store a value different from the specified value, such as 0.
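The six "store a specified value when the condition is met" operations differ only in the comparison they apply, so they can share one routine. The sketch below reads "bitwise" in these paragraphs as a position-by-position (element-wise) comparison and, of the fallback options the text allows, picks storing a different constant (0 by default); both choices, and the names CONDITIONS and store_if, are assumptions for illustration.

```python
import operator
from typing import Callable, Dict, List, Sequence

# Element-wise conditions, keyed by the short names used in the type codes.
CONDITIONS: Dict[str, Callable[[int, int], bool]] = {
    "eq": operator.eq, "ne": operator.ne,
    "lt": operator.lt, "ge": operator.ge,
    "gt": operator.gt, "le": operator.le,
}


def store_if(cond: str, split_vec: Sequence[int], src1: Sequence[int],
             specified: int = 1, otherwise: int = 0) -> List[int]:
    """Store `specified` where the condition holds at a position, else `otherwise`."""
    test = CONDITIONS[cond]
    return [specified if test(a, b) else otherwise
            for a, b in zip(split_vec, src1)]


print(store_if("lt", [1, 5, 3], [2, 5, 1]))  # [1, 0, 0]
print(store_if("le", [1, 5, 3], [2, 5, 1]))  # [1, 1, 0]
```

To use the other fallback the text mentions (keeping the split vector's own value), the `otherwise` branch could return `a` instead of a constant.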
In this implementation, different operation field codes may be set for different vector operation types to distinguish them. For example, the code for vector multiplication may be set to "mult.cycle"; vector addition to "add.cycle"; vector summation to "sub.cycle"; bitwise AND to "and.cycle"; bitwise OR to "or.cycle"; bitwise XOR to "xor.cycle"; storing the specified value 1 when bitwise equal to "eq.cycle"; storing the specified value 1 when bitwise not equal to "ne.cycle"; storing the specified value 1 when bitwise less than to "lt.cycle"; storing the specified value 1 when bitwise greater than or equal to "ge.cycle"; storing the specified value 1 when bitwise greater than to "gt.cycle"; and storing the specified value 1 when bitwise less than or equal to "le.cycle".
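The code table above maps naturally onto a dispatch table. In this sketch each type code selects an element-wise function, comparison codes yield the specified value 1 or else 0 (the fallback is an assumption), and the split is expressed by indexing the second vector modulo its length, which is equivalent to pairing every split vector with the whole second vector. sub.cycle is deliberately left out: the text calls it "vector summation" while the mnemonic suggests subtraction, so its element-wise meaning is not settled here. The names OPS and execute are hypothetical.

```python
import operator
from typing import Callable, Dict, List, Sequence


def _flag(pred: Callable[[int, int], bool]) -> Callable[[int, int], int]:
    # Comparison codes store the specified value 1 when the condition holds, 0 otherwise.
    return lambda a, b: 1 if pred(a, b) else 0


OPS: Dict[str, Callable[[int, int], int]] = {
    "mult.cycle": operator.mul,
    "add.cycle": operator.add,
    "and.cycle": operator.and_,
    "or.cycle": operator.or_,
    "xor.cycle": operator.xor,
    "eq.cycle": _flag(operator.eq),
    "ne.cycle": _flag(operator.ne),
    "lt.cycle": _flag(operator.lt),
    "ge.cycle": _flag(operator.ge),
    "gt.cycle": _flag(operator.gt),
    "le.cycle": _flag(operator.le),
}


def execute(type_code: str, src0: Sequence[int], src1: Sequence[int]) -> List[int]:
    op = OPS[type_code]  # raises KeyError for codes not in this sketch (e.g. sub.cycle)
    n1 = len(src1)
    if n1 == 0 or len(src0) % n1 != 0:
        raise ValueError("len(src0) must be a positive integer multiple of len(src1)")
    # Position i of src0 lies in split i // n1 and pairs with src1[i % n1].
    return [op(src0[i], src1[i % n1]) for i in range(len(src0))]


print(execute("xor.cycle", [0b1100, 0b1010, 0b1111, 0b0000], [0b1010, 0b1010]))
# [6, 0, 5, 10]
```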
Those skilled in the art may set the operation types and their corresponding codes according to actual needs; the present disclosure does not limit this.
In one possible implementation, the operation field may further include a first input amount and a second input amount. The control module 19-11 may also be configured to determine the first input amount and the second input amount according to the operation field, obtain from the first to-be-operated vector address a first to-be-operated vector whose data amount equals the first input amount, and obtain from the second to-be-operated vector address a second to-be-operated vector whose data amount equals the second input amount.
In one possible implementation, splitting the first to-be-operated vector into n split vectors according to the second to-be-operated vector may include: determining the split data amount of each split vector according to the second input amount, and splitting the first to-be-operated vector into n split vectors according to that split data amount.
In this implementation, the second input amount may be used as the split data amount of each split vector.
In this implementation, the first input amount and the second input amount may be parameters that characterize the data amounts of the first and second to-be-operated vectors, such as vector length or width.
In one possible implementation, a default first input amount and a default second input amount may be set. When the first and second input amounts cannot be determined from the operation field, the defaults may be used as the first and second input amounts of the current loop vector instruction, and a first to-be-operated vector of the default first input amount and a second to-be-operated vector of the default second input amount are obtained from the respective to-be-operated vector addresses.
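The input-amount handling in the preceding paragraphs (fetching by address and amount, falling back to defaults, and splitting by the second input amount) is sketched below. It assumes memory is a flat Python list addressed by element index rather than by byte, and the default values 8 and 4 as well as the name fetch_and_split are hypothetical; the disclosure does not state any particular defaults.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical defaults, used when the operation field carries no input amounts.
DEFAULT_SRC0_SIZE = 8
DEFAULT_SRC1_SIZE = 4


def fetch_and_split(memory: Sequence[int], src0: int, src1: int,
                    src0_size: Optional[int] = None,
                    src1_size: Optional[int] = None) -> Tuple[List[List[int]], List[int]]:
    if src0_size is None:
        src0_size = DEFAULT_SRC0_SIZE
    if src1_size is None:
        src1_size = DEFAULT_SRC1_SIZE
    if src1_size <= 0 or src0_size % src1_size != 0:
        raise ValueError("src0_size must be a positive integer multiple of src1_size")
    vec0 = list(memory[src0:src0 + src0_size])  # first to-be-operated vector
    vec1 = list(memory[src1:src1 + src1_size])  # second to-be-operated vector
    n = src0_size // src1_size                  # number of split vectors
    # The second input amount serves as the split data amount of every split vector.
    splits = [vec0[i * src1_size:(i + 1) * src1_size] for i in range(n)]
    return splits, vec1


memory = list(range(16))
print(fetch_and_split(memory, 0, 8, 6, 3))  # ([[0, 1, 2], [3, 4, 5]], [8, 9, 10])
print(fetch_and_split(memory, 0, 8))        # ([[0, 1, 2, 3], [4, 5, 6, 7]], [8, 9, 10, 11])
```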
In one possible implementation, as shown in FIGS. 19-2a and 19-2b, the device may further include a storage module 19-13, which is configured to store the first to-be-operated vector and the second to-be-operated vector.
In this implementation, the storage module may include one or more of memory, cache, and registers. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache is used to store the data to be operated on, the first to-be-operated vector, and the second to-be-operated vector. The registers are used to store scalar data among the data to be operated on.
In one possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory mentioned above, may be used to store neuron data among the data to be operated on; the neuron data may include neuron vector data.
In one possible implementation, the device may further include a direct memory access (DMA) module for reading data from and storing data into the storage module.
In one possible implementation, the control module 19-11 may also be configured to generate an assembly file according to the loop vector instruction and translate the assembly file into a binary file, where the binary file is the compiled loop vector instruction.
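The assembly-to-binary step can be illustrated with Python's struct module. The disclosure does not define any binary encoding, so everything about the layout here is hypothetical: the numeric values assigned to the type codes, the choice of a 1-byte type code followed by five unsigned 32-bit fields, and little-endian byte order. Fields are written as integer literals, with the 0x prefix accepted for addresses.

```python
import struct
from typing import Tuple

# Hypothetical numbering of the type codes; not taken from the disclosure.
TYPE_CODES = {
    "mult.cycle": 0x01, "add.cycle": 0x02, "sub.cycle": 0x03,
    "and.cycle": 0x04, "or.cycle": 0x05, "xor.cycle": 0x06,
    "eq.cycle": 0x07, "ne.cycle": 0x08, "lt.cycle": 0x09,
    "ge.cycle": 0x0A, "gt.cycle": 0x0B, "le.cycle": 0x0C,
}
# 1-byte type code + five 32-bit fields, little-endian, no padding: 21 bytes.
FORMAT = "<B5I"


def assemble(line: str) -> bytes:
    """Encode 'type.cycle dst src0 src1 src0_size src1_size' into the hypothetical layout."""
    mnemonic, *fields = line.replace(",", " ").split()
    # Raises ValueError unless exactly five numeric fields follow the mnemonic.
    dst, src0, src1, src0_size, src1_size = (int(f, 0) for f in fields)
    return struct.pack(FORMAT, TYPE_CODES[mnemonic], dst, src0, src1, src0_size, src1_size)


def disassemble(blob: bytes) -> Tuple[int, ...]:
    return struct.unpack(FORMAT, blob)


blob = assemble("add.cycle 0x100 0x0 0x40 64 16")
print(len(blob), disassemble(blob))  # 21 (2, 256, 0, 64, 64, 16)
```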
In one possible implementation, the instruction format of the loop vector instruction may be:
opcode dst src0 src1 src0_size src1_size type.cycle
Here, opcode is the opcode of the loop vector instruction, and dst, src0, src1, src0_size, src1_size, and type are its operation fields: dst is the target address; src0 is the first to-be-operated vector address; src1 is the second to-be-operated vector address; type is the vector operation type; src0_size is the first input amount; and src1_size is the second input amount. type.cycle may be the code of a vector operation type, such as mult.cycle, add.cycle, sub.cycle, eq.cycle, ne.cycle, lt.cycle, ge.cycle, gt.cycle, le.cycle, and.cycle, or.cycle, or xor.cycle.
In one possible implementation, the instruction format of the loop vector instruction may also be:
type.cycle dst src0 src1 src0_size src1_size
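Both textual formats above carry the same operation fields and differ only in whether a separate opcode leads the line and where type.cycle sits. A parser can tell them apart by checking whether the first token ends in ".cycle". The sketch below assumes whitespace-separated tokens (tolerating stray commas), integer input amounts, and leaves addresses as symbolic strings; the class and function names, and the opcode "vop" in the example, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopVectorInstr:
    opcode: Optional[str]  # None for the second format, which has no separate opcode
    op_type: str           # e.g. "mult.cycle"
    dst: str               # target address
    src0: str              # first to-be-operated vector address
    src1: str              # second to-be-operated vector address
    src0_size: int         # first input amount
    src1_size: int         # second input amount


def parse_loop_vector(text: str) -> LoopVectorInstr:
    tokens = text.replace(",", " ").split()
    opcode: Optional[str] = None
    if tokens[0].endswith(".cycle"):
        # Format 2: type.cycle dst src0 src1 src0_size src1_size
        op_type, dst, src0, src1, n0, n1 = tokens
    else:
        # Format 1: opcode dst src0 src1 src0_size src1_size type.cycle
        opcode, dst, src0, src1, n0, n1, op_type = tokens
    return LoopVectorInstr(opcode, op_type, dst, src0, src1, int(n0), int(n1))


print(parse_loop_vector("mult.cycle dst src0 src1 64 16"))
print(parse_loop_vector("vop dst src0 src1 8 4 add.cycle"))
```

A token count that matches neither format makes the tuple unpacking raise ValueError, which is a simple way to reject malformed instructions in this sketch.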
In one possible implementation, the instruction format of the loop vector instruction for vector multiplication may be set to: mult.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; multiply each split vector by the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for vector addition may be set to: add.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; add each split vector to the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for vector summation may be set to: sub.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; perform the summation operation on each split vector and the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for bitwise AND may be set to: and.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; perform a bitwise AND of each split vector with the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for bitwise OR may be set to: or.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; perform a bitwise OR of each split vector with the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for bitwise XOR may be set to: xor.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; perform a bitwise XOR of each split vector with the second to-be-operated vector to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for storing the specified value 1 when bitwise equal may be set to: eq.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; determine whether the corresponding bits of each split vector and the second to-be-operated vector are equal and, where they are equal, store the specified value 1 to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for storing the specified value 1 when bitwise not equal may be set to: ne.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; determine whether the corresponding bits of each split vector and the second to-be-operated vector are equal and, where they are not equal, store the specified value 1 to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for storing the specified value 1 when bitwise less than may be set to: lt.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; compare each split vector with the second to-be-operated vector at each corresponding bit and, where the split vector's value is less than the second to-be-operated vector's value, store the specified value 1 to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for storing the specified value 1 when bitwise greater than or equal to may be set to: ge.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; compare each split vector with the second to-be-operated vector at each corresponding bit and, where the split vector's value is greater than or equal to the second to-be-operated vector's value, store the specified value 1 to obtain the operation result; and store the operation result at the target address dst.
In one possible implementation, the instruction format of the loop vector instruction for storing the specified value 1 when bitwise greater than may be set to: gt.cycle dst src0 src1 src0_size src1_size. This means: obtain a first to-be-operated vector of size src0_size from the first to-be-operated address src0, and a second to-be-operated vector of size src1_size from the second to-be-operated address src1; split the first to-be-operated vector into multiple split vectors, each with the same data amount as src1_size; compare each split vector with the second to-be-operated vector at each corresponding bit and, where the split vector's value is greater than the second to-be-operated vector's value, store the specified value 1 to obtain the operation result; and store the operation result at the target address dst.
在一种可能的实现方式中,可以将用于“满足按位小于或等于则存储指定值1运算”的循环向量指令的指令格式设置为:le.cycle dst src0 src1 src0_size src1_size。其表示:从第一待运算地址src0获取 src0_size大小的第一待运算向量、从第二待运算地址src1中获取src1_size大小的第二待运算向量。将第一待运算向量切分为多个切分向量,每个切分向量的数据量与src1_size相同。判断切分向量与第二待运算向量的对应位的大小关系,在对应位上的切分向量的值小于第二待运算向量的值时,存储指定值1,得到运算结果。并将运算结果存储到目标地址dst中。In a possible implementation, the instruction format of the loop vector instruction used for "meet bitwise less than or equal to store specified value 1 operation" can be set as: le.cycle dst src0 src1 src0_size src1_size. It means that the first to-be-operated vector of size src0_size is obtained from the first to-be-operated address src0, and the second to-be-calculated vector of size src1_size is obtained from the second to-be-operated address src1. The first to-be-operated vector is divided into multiple divided vectors, and the data amount of each divided vector is the same as src1_size. The size relationship between the segment vector and the second to-be-calculated vector is determined. When the value of the segment vector on the corresponding bit is smaller than the value of the second to-be-calculated vector, the specified value 1 is stored to obtain the operation result. And store the operation result to the target address dst.
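The following minimal Python sketch illustrates the three comparison instructions above. It rests on two assumptions that the text does not state. First, src0_size is an exact multiple of src1_size. Second, positions where the condition fails receive 0, because the text specifies only the value 1. The function name compare_cycle and the predicate table are hypothetical and do not represent the actual hardware interface.

```python
from typing import Callable, Dict, List

# Hypothetical predicate table keyed by the mnemonic prefix (ge/gt/le).
COMPARE: Dict[str, Callable[[float, float], bool]] = {
    "ge": lambda a, b: a >= b,
    "gt": lambda a, b: a > b,
    "le": lambda a, b: a <= b,
}


def compare_cycle(op: str, src0: List[float], src1: List[float]) -> List[int]:
    """Sketch of ge.cycle / gt.cycle / le.cycle.

    Assumptions: len(src0) is a non-zero multiple of len(src1), and positions
    that fail the condition store 0 (only the value 1 is specified).
    """
    pred = COMPARE[op]
    n = len(src1)
    if n == 0 or len(src0) % n != 0:
        raise ValueError("src0_size must be a non-zero multiple of src1_size")
    result: List[int] = []
    # Walk the first operand vector one sub-vector (src1_size elements) at a time.
    for start in range(0, len(src0), n):
        segment = src0[start:start + n]
        # Compare position by position against the second operand vector.
        result.extend(1 if pred(a, b) else 0 for a, b in zip(segment, src1))
    return result
```

For example, compare_cycle("ge", [1, 5, 3, 7], [2, 4]) compares the sub-vectors [1, 5] and [3, 7] against [2, 4] and yields [0, 1, 1, 1].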
It should be understood that those skilled in the art can set the opcode of the loop vector instruction, and the positions of the opcode and the operation field within the instruction format, as needed; the present disclosure does not limit this.
In one possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the loop vector instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
Application example
The following gives an application example according to an embodiment of the present disclosure, using "performing vector operations with the loop vector instruction processing apparatus" as an exemplary application scenario, to help understand the flow of the loop vector instruction processing apparatus. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 19-3 shows a schematic diagram of an application scenario of the loop vector instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 19-3, the loop vector instruction processing apparatus processes a loop vector instruction as follows:
The control module 19-11 compiles the obtained loop vector instruction 1 to obtain compiled loop vector instruction 1 (for example, loop vector instruction 1 is opcode 500 101 102 add.cycle 64 16), and parses the compiled instruction to obtain its opcode and operation field. Here, the opcode of loop vector instruction 1 is opcode, the target address is 500, the first operand vector address is 101, and the second operand vector address is 102. The vector operation type is add.cycle (vector addition). The first input size is 64 and the second input size is 16. The control module 19-11 fetches, from first operand vector address 101, a first operand vector whose data amount equals the first input size 64, and fetches, from second operand vector address 102, a second operand vector whose data amount equals the second input size 16.
The operation module 19-12 splits the first operand vector into 4 sub-vectors, namely sub-vectors 1, 2, 3, and 4 shown in FIG. 19-3, each with a data amount of 16. It then adds each sub-vector to the second operand vector to obtain the corresponding sub-results, namely sub-results 1, 2, 3, and 4 shown in FIG. 19-3. Sub-results 1 to 4 together form operation result 1 of loop vector instruction 1, and operation result 1 is stored at target address 500.
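The following Python sketch walks through the add.cycle example above, using 64 first-operand elements, 16 second-operand elements, 4 sub-vectors, and target address 500. It is a hypothetical illustration only. Memory is modelled as a dictionary from a symbolic address to a list of elements. Real address units and element widths are not specified in this passage.

```python
from typing import Dict, List


def add_cycle(memory: Dict[int, List[int]], dst: int, src0: int, src1: int,
              src0_size: int, src1_size: int) -> int:
    """Hypothetical model of add.cycle; returns the number of sub-vectors."""
    a = memory[src0][:src0_size]  # first operand vector (64 elements in the example)
    b = memory[src1][:src1_size]  # second operand vector (16 elements in the example)
    if src1_size == 0 or src0_size % src1_size != 0:
        raise ValueError("src0_size must be a non-zero multiple of src1_size")
    # Split the first operand vector into sub-vectors of src1_size elements.
    segments = [a[i:i + src1_size] for i in range(0, src0_size, src1_size)]
    # Add the second operand vector to every sub-vector, giving one sub-result each.
    sub_results = [[x + y for x, y in zip(seg, b)] for seg in segments]
    # Concatenate the sub-results and store them at the target address.
    memory[dst] = [v for sub in sub_results for v in sub]
    return len(segments)


mem = {101: list(range(64)), 102: list(range(16))}
count = add_cycle(mem, 500, 101, 102, 64, 16)
```

Because the second operand vector is reused for every sub-vector, element i of the result equals a[i] + b[i % 16].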
Besides the format opcode 500 101 102 add.cycle 64 16 above, loop vector instruction 1 can also be written as add.cycle 500 101 102 64 16. Loop vector instructions in different formats are processed similarly, so this is not repeated here.
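The two textual formats above differ only in where the vector operation type is placed. In the first format, the type is a field of the operation field. In the second format, it is the opcode itself. The following Python sketch decodes both formats into the same set of fields. The field order is taken from the example. The class and function names are hypothetical, and the real compiled instructions are binary rather than text.

```python
from dataclasses import dataclass


@dataclass
class LoopVectorInstr:
    op_type: str
    dst: int
    src0: int
    src1: int
    src0_size: int
    src1_size: int


def parse(text: str) -> LoopVectorInstr:
    """Decode either textual format of a loop vector instruction (hypothetical)."""
    tokens = text.split()
    if tokens[0] == "opcode":
        # Format A: opcode dst src0 src1 <type> src0_size src1_size
        # (the vector operation type is carried in the operation field)
        _, dst, src0, src1, op_type, s0, s1 = tokens
    else:
        # Format B: <type> dst src0 src1 src0_size src1_size
        # (the opcode itself indicates the vector operation type)
        op_type, dst, src0, src1, s0, s1 = tokens
    return LoopVectorInstr(op_type, int(dst), int(src0), int(src1), int(s0), int(s1))
```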
For the working process of each of the above modules, refer to the relevant description above.
In this way, the loop vector instruction processing apparatus can process loop vector instructions efficiently and quickly.
FIG. 19-4 shows a flowchart of a loop vector instruction processing method according to an embodiment of the present disclosure. The method can be applied to a device, such as a computer device, that includes a memory and a processor. The memory stores the data used while executing the method, and the processor performs the related processing and operation steps, such as steps S51-19 and S52-19 below. As shown in FIG. 19-4, the method is applied to the above loop vector instruction processing apparatus and includes steps S51-19 and S52-19.
In step S51-19, the control module compiles the obtained loop vector instruction to obtain a compiled loop vector instruction and parses the compiled instruction to obtain its opcode and operation field. According to the opcode and operation field, it obtains the first operand vector, the second operand vector, and the target address required to execute the instruction, and it determines the vector operation type of the instruction. The opcode indicates that the operation the loop vector instruction performs on the data is a loop vector operation, and the operation field includes the first operand vector address, the second operand vector address, and the target address.
In step S52-19, the operation module splits the first operand vector into multiple sub-vectors according to the second operand vector, performs a vector operation of the determined type between each sub-vector and the second operand vector to obtain an operation result, and stores the operation result at the target address.
In one possible implementation, performing the vector operation between each sub-vector and the second operand vector according to the vector operation type may include: using multiple vector operators in the operation module to perform the vector operation corresponding to the vector operation type.
In one possible implementation, the operation module may include a master operation sub-module and multiple slave operation sub-modules, where the master operation sub-module includes multiple vector operators. Step S52-19 may then include: using the multiple vector operators in the master operation sub-module to perform the vector operation corresponding to the vector operation type, obtain the operation result, and store the operation result at the target address.
In one possible implementation, the operation module includes a master operation sub-module and multiple slave operation sub-modules, where each slave operation sub-module includes multiple vector operators. Step S52-19 may then include: using the multiple vector operators in each slave operation sub-module to perform the corresponding vector operations in parallel, obtain operation results, store them in the corresponding sub-cache space, and send them to the master operation sub-module; and using the master operation sub-module to receive the operation results and store them at the target address.
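The following Python sketch models the master/slave arrangement described above. The master splits the first operand vector, assigns contiguous groups of sub-vectors to the slaves, and gathers their results in order. Each slave keeps its results in its own sub-cache list before returning them. Threads from concurrent.futures stand in for the parallel slave sub-modules, and this is an assumption made for illustration. Under CPython, the threads show the data flow but do not give a real speedup. All names are hypothetical.

```python
import operator
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Op = Callable[[int, int], int]


def slave_compute(segments: List[List[int]], b: List[int], op: Op) -> List[List[int]]:
    """One slave sub-module: apply op to each assigned sub-vector and keep the
    results in its own (modelled) sub-cache space before returning them."""
    sub_cache: List[List[int]] = []
    for seg in segments:
        sub_cache.append([op(x, y) for x, y in zip(seg, b)])
    return sub_cache


def master_run(a: List[int], b: List[int], op: Op, num_slaves: int) -> List[int]:
    """Master sub-module: split a into sub-vectors of len(b), hand contiguous
    groups to the slaves, then gather the results in order as the target data."""
    n = len(b)
    if n == 0 or len(a) % n != 0:
        raise ValueError("len(a) must be a non-zero multiple of len(b)")
    segments = [a[i:i + n] for i in range(0, len(a), n)]
    per_slave = -(-len(segments) // num_slaves)  # ceiling division
    groups = [segments[i:i + per_slave] for i in range(0, len(segments), per_slave)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        futures = [pool.submit(slave_compute, g, b, op) for g in groups]
        # Collect in submission order so the result layout matches the first operand.
        sub_caches = [f.result() for f in futures]
    target: List[int] = []
    for cache in sub_caches:
        for part in cache:
            target.extend(part)
    return target
```

For example, master_run(list(range(64)), list(range(16)), operator.add, 4) reproduces the add.cycle result of the application example, with one sub-vector per slave.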
In one possible implementation, the operation field may further include the vector operation type. Determining the vector operation type of the loop vector instruction may then include: when the operation field includes the vector operation type, determining the vector operation type from the operation field.
In one possible implementation, the operation field may further include a first input size and a second input size. Obtaining the first operand vector, the second operand vector, and the target address according to the opcode and operation field may then further include: determining the first and second input sizes from the operation field, fetching from the first operand vector address a first operand vector whose data amount equals the first input size, and fetching from the second operand vector address a second operand vector whose data amount equals the second input size. Splitting the first operand vector into multiple sub-vectors according to the second operand vector may include: determining the data amount of each sub-vector from the second input size, and splitting the first operand vector into multiple sub-vectors of that data amount.
In one possible implementation, the opcode also indicates the vector operation type. Determining the vector operation type of the loop vector instruction may then include: when the opcode indicates the vector operation type, determining the vector operation type from the opcode.
In one possible implementation, the vector operation type may include at least one of the following: vector multiplication, vector addition, vector summation, storing a specified value when an operation condition is met, bitwise AND, bitwise OR, and bitwise XOR. The operation condition may be any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, and element-wise less than or equal.
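The operation types listed above can be viewed as a dispatch table of element-wise functions. Pairing element i of the first operand vector with element i mod src1_size of the second operand vector is equivalent to splitting the first vector into sub-vectors. The following Python sketch relies on several assumptions that the text does not confirm. The mnemonics are hypothetical, except add, ge, gt, and le, which appear above. Conditions that fail store 0. Vector summation is omitted because its exact cyclic semantics are not specified here.

```python
import operator
from typing import Callable, Dict, List

# Hypothetical mnemonics; only add/ge/gt/le appear in the text above.
ELEMENTWISE: Dict[str, Callable[[int, int], int]] = {
    "mul": operator.mul,
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    # "Store specified value 1 when the condition is met"; 0 otherwise is an assumption.
    "eq": lambda x, y: int(x == y),
    "ne": lambda x, y: int(x != y),
    "lt": lambda x, y: int(x < y),
    "ge": lambda x, y: int(x >= y),
    "gt": lambda x, y: int(x > y),
    "le": lambda x, y: int(x <= y),
}
# Vector summation is intentionally left out: its cyclic semantics are not given here.


def run_cycle(op_type: str, a: List[int], b: List[int]) -> List[int]:
    """Apply one element-wise operation type cyclically (hypothetical sketch)."""
    fn = ELEMENTWISE[op_type]
    n = len(b)
    if n == 0 or len(a) % n != 0:
        raise ValueError("len(a) must be a non-zero multiple of len(b)")
    # a[i] pairs with b[i % n], which is the same as splitting a into sub-vectors of length n.
    return [fn(x, b[i % n]) for i, x in enumerate(a)]
```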
In one possible implementation, the method may further include: storing the first operand vector and the second operand vector in a storage module of the apparatus. The storage module includes at least one of a register and a cache. The cache stores the data to be operated on, the first operand vector, and the second operand vector, and includes at least one neuron cache (NRAM). The register stores scalar data among the data to be operated on, and the neuron cache stores neuron data among the data to be operated on, where the neuron data includes neuron vector data.
In one possible implementation, parsing the compiled loop vector instruction to obtain its opcode and operation field may include: storing the compiled loop vector instruction; parsing the compiled loop vector instruction to obtain its opcode and operation field; and storing an instruction queue, which contains multiple pending instructions arranged in execution order, where the pending instructions may include the compiled loop vector instruction.
In one possible implementation, the method may further include: when it is determined that a first pending instruction among the multiple pending instructions is associated with a zeroth pending instruction preceding it, caching the first pending instruction, and, after determining that the zeroth pending instruction has finished executing, controlling execution of the first pending instruction.
The first pending instruction being associated with the zeroth pending instruction preceding it may include: the first storage address interval, which stores the data required by the first pending instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth pending instruction.
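The association test above reduces to checking whether address intervals overlap. The following Python sketch makes two assumptions. First, each instruction is described by the half-open intervals [start, end) of all data it uses. Second, a candidate instruction is held while any earlier, unfinished instruction overlaps it. The interval representation and all names are hypothetical and are not the disclosed circuit.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) address range (assumption)


def overlaps(x: Interval, y: Interval) -> bool:
    """True if two half-open intervals share at least one address."""
    return x[0] < y[1] and y[0] < x[1]


@dataclass
class PendingInstr:
    name: str
    ranges: List[Interval]  # storage address intervals of the data the instruction needs


def is_associated(first: PendingInstr, zeroth: PendingInstr) -> bool:
    """The association relation: some pair of the two instructions' intervals overlaps."""
    return any(overlaps(r1, r0) for r1 in first.ranges for r0 in zeroth.ranges)


def can_issue(candidate: PendingInstr, unfinished: List[PendingInstr]) -> bool:
    """Hold the candidate while any earlier, still-running instruction is associated with it."""
    return not any(is_associated(candidate, prev) for prev in unfinished)
```

For example, an instruction that reads the interval an earlier instruction writes is held until that earlier instruction finishes. An instruction whose intervals are disjoint from all unfinished instructions can be issued at once.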
In one possible implementation, compiling the obtained loop vector instruction to obtain the compiled loop vector instruction may include: generating an assembly file from the loop vector instruction and translating the assembly file into a binary file, where the binary file is the compiled loop vector instruction.
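The following Python sketch illustrates the second step above, which translates assembly text into a binary file. It uses the standard struct module. The opcode numbers and the 11-byte layout are hypothetical. The layout consists of a 1-byte opcode followed by five little-endian 16-bit fields. The actual encoding is not described in this passage. A matching disassembler is included so that the round trip can be checked.

```python
import struct
from typing import List, Tuple

# Hypothetical opcode numbering and layout: 1-byte opcode + five unsigned 16-bit fields
# (dst, src0, src1, src0_size, src1_size), little-endian, 11 bytes per instruction.
OPCODES = {"add.cycle": 0x01, "ge.cycle": 0x02, "gt.cycle": 0x03, "le.cycle": 0x04}
LAYOUT = struct.Struct("<B5H")


def assemble(asm_text: str) -> bytes:
    """Translate assembly lines such as 'add.cycle 500 101 102 64 16' into bytes."""
    out = bytearray()
    for line in asm_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mnemonic, *fields = line.split()
        out += LAYOUT.pack(OPCODES[mnemonic], *(int(f) for f in fields))
    return bytes(out)


def disassemble(blob: bytes) -> List[Tuple]:
    """Inverse of assemble, for checking the round trip."""
    names = {v: k for k, v in OPCODES.items()}
    return [(names[op], *rest) for op, *rest in LAYOUT.iter_unpack(blob)]
```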
It should be noted that although the loop vector instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solution of the present disclosure.
The loop vector instruction processing method provided by the embodiments of the present disclosure has a wide range of applications and processes vectors, and the operations performed on them, efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause S1. A loop vector instruction processing apparatus, the apparatus comprising:
a control module, configured to compile an obtained loop vector instruction to obtain a compiled loop vector instruction, parse the compiled loop vector instruction to obtain an opcode and an operation field of the loop vector instruction, obtain, according to the opcode and the operation field, a first operand vector, a second operand vector, and a target address required to execute the loop vector instruction, and determine a vector operation type of the loop vector instruction;
an operation module, configured to split the first operand vector into multiple sub-vectors according to the second operand vector, perform a vector operation of the vector operation type between each sub-vector and the second operand vector to obtain an operation result, and store the operation result at the target address,
wherein the opcode is used to indicate that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation field includes a first operand vector address, a second operand vector address, and the target address.
Clause S2. The apparatus according to Clause S1, wherein the operation module comprises:
multiple vector operators, configured to perform the vector operation corresponding to the vector operation type.
Clause S3. The apparatus according to Clause S2, wherein the operation module comprises a master operation sub-module and multiple slave operation sub-modules, and the master operation sub-module comprises the multiple vector operators,
the master operation sub-module being configured to perform the vector operation using the multiple vector operators to obtain an operation result, and store the operation result at the target address.
Clause S4. The apparatus according to Clause S2, wherein the operation module comprises a master operation sub-module and multiple slave operation sub-modules, and the slave operation sub-modules comprise the multiple vector operators,
each slave operation sub-module being configured to perform corresponding vector operations in parallel using the multiple vector operators it contains to obtain operation results, store the operation results in a corresponding sub-cache space, and send the operation results to the master operation sub-module;
the master operation sub-module being further configured to receive the operation results and store them at the target address.
Clause S5. The apparatus according to Clause S1, wherein the operation field further includes the vector operation type,
and the control module is further configured to determine the vector operation type from the operation field when the operation field includes the vector operation type.
Clause S6. The apparatus according to Clause S1, wherein the operation field further includes a first input size and a second input size,
the control module is further configured to determine the first input size and the second input size from the operation field, fetch from the first operand vector address a first operand vector whose data amount equals the first input size, and fetch from the second operand vector address a second operand vector whose data amount equals the second input size,
and splitting the first operand vector into multiple sub-vectors according to the second operand vector comprises:
determining the data amount of each sub-vector from the second input size, and splitting the first operand vector into multiple sub-vectors according to that data amount.
Clause S7. The apparatus according to Clause S1, wherein the opcode is further used to indicate the vector operation type,
and the control module is further configured to determine the vector operation type from the opcode when the opcode indicates the vector operation type.
Clause S8. The apparatus according to Clause S1, wherein the vector operation type includes at least one of the following:
vector multiplication, vector addition, vector summation, storing a specified value when an operation condition is met, bitwise AND, bitwise OR, and bitwise XOR,
wherein the operation condition includes any one of the following: element-wise equal, element-wise not equal, element-wise less than, element-wise greater than or equal, element-wise greater than, and element-wise less than or equal.
Clause S9. The apparatus according to Clause S1, the apparatus further comprising:
a storage module, configured to store the first operand vector and the second operand vector,
wherein the storage module includes at least one of a register and a cache,
the cache is configured to store data to be operated on, the first operand vector, and the second operand vector, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on;
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause S10. The apparatus according to Clause S1, wherein the control module comprises:
an instruction storage sub-module, configured to store the compiled loop vector instruction;
an instruction processing sub-module, configured to parse the compiled loop vector instruction to obtain the opcode and operation field of the loop vector instruction;
a queue storage sub-module, configured to store an instruction queue, the instruction queue including multiple pending instructions arranged in execution order, the multiple pending instructions including the compiled loop vector instruction.
Clause S11. The apparatus according to Clause S10, wherein the control module further comprises:
a dependency processing sub-module, configured to, when it is determined that a first pending instruction among the multiple pending instructions is associated with a zeroth pending instruction preceding the first pending instruction, cache the first pending instruction in the instruction storage sub-module, and, after the zeroth pending instruction has finished executing, fetch the first pending instruction from the instruction storage sub-module and send it to the operation module,
wherein the first pending instruction being associated with the zeroth pending instruction preceding it comprises:
a first storage address interval storing data required by the first pending instruction overlaps a zeroth storage address interval storing data required by the zeroth pending instruction.
Clause S12. The apparatus according to Clause S1,
wherein the control module is further configured to generate an assembly file from the loop vector instruction and translate the assembly file into a binary file,
the binary file being the compiled loop vector instruction.
Clause S13. A machine learning operation apparatus, the apparatus comprising:
one or more loop vector instruction processing apparatuses according to any one of Clauses S1 to S12, configured to obtain vectors to be operated on and control information from other processing apparatuses, perform specified machine learning operations, and pass execution results to the other processing apparatuses through an I/O interface;
when the machine learning operation apparatus includes multiple loop vector instruction processing apparatuses, the multiple loop vector instruction processing apparatuses may be connected and transmit data through a specific structure;
wherein the multiple loop vector instruction processing apparatuses are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the multiple loop vector instruction processing apparatuses share the same control system or have their own control systems; the multiple loop vector instruction processing apparatuses share memory or have their own memory; and the multiple loop vector instruction processing apparatuses are interconnected in any interconnection topology.
Clause S14. A combined processing apparatus, the combined processing apparatus comprising:
the machine learning operation apparatus according to Clause S13, a universal interconnection interface, and other processing apparatuses;
the machine learning operation apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
wherein the combined processing apparatus further comprises a storage apparatus, which is connected to both the machine learning operation apparatus and the other processing apparatuses and is configured to store data of the machine learning operation apparatus and the other processing apparatuses.
Clause S15. A machine learning chip, the machine learning chip comprising:
the machine learning operation apparatus according to Clause S13 or the combined processing apparatus according to Clause S14.
Clause S16. An electronic device, the electronic device comprising:
the machine learning chip according to Clause S15.
Clause S17. A board card, the board card comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause S15;
wherein the machine learning chip is connected to each of the storage device, the control device, and the interface apparatus;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device;
the control device is configured to monitor the state of the machine learning chip.
Clause S18. A loop vector instruction processing method, applied to a loop vector instruction processing apparatus comprising a control module and an operation module, the method comprising:
compiling, by the control module, an obtained loop vector instruction to obtain a compiled loop vector instruction, parsing the compiled loop vector instruction to obtain an opcode and an operation field of the loop vector instruction, obtaining, according to the opcode and the operation field, a first operand vector, a second operand vector, and a target address required to execute the loop vector instruction, and determining a vector operation type of the loop vector instruction;
splitting, by the operation module, the first operand vector into multiple sub-vectors according to the second operand vector, performing a vector operation of the vector operation type between each sub-vector and the second operand vector to obtain an operation result, and storing the operation result at the target address,
wherein the opcode is used to indicate that the operation performed on data by the loop vector instruction is a loop vector operation, and the operation field includes a first operand vector address, a second operand vector address, and the target address.
Clause S19. The method according to Clause S18, wherein performing, according to the vector operation type, a vector operation between each split vector and the second to-be-operated vector comprises:
performing, by a plurality of vector operators in the operation module, the vector operation corresponding to the vector operation type.
Clause S20. The method according to Clause S19, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, and the master operation sub-module comprises the plurality of vector operators,
wherein dividing the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector, performing, according to the vector operation type, a vector operation between each split vector and the second to-be-operated vector to obtain an operation result, and storing the operation result at the target address comprises:
performing, by the plurality of vector operators in the master operation sub-module, the vector operation corresponding to the vector operation type to obtain the operation result, and storing the operation result at the target address.
Clause S21. The method according to Clause S19, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, and the slave operation sub-modules comprise the plurality of vector operators,
wherein dividing the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector, performing, according to the vector operation type, a vector operation between each split vector and the second to-be-operated vector to obtain an operation result, and storing the operation result at the target address comprises:
performing in parallel, by the plurality of vector operators included in each slave operation sub-module, the corresponding vector operations to obtain operation results, storing the operation results in the corresponding sub-cache spaces, and sending the operation results to the master operation sub-module; and
receiving, by the master operation sub-module, the operation results, and storing the operation results at the target address.
Clause S22. The method according to Clause S18, wherein the operation domain further includes the vector operation type,
and determining the vector operation type of the cyclic vector instruction comprises:
when the operation domain includes the vector operation type, determining the vector operation type according to the operation domain.
Clause S23. The method according to Clause S18, wherein the operation domain further includes a first input amount and a second input amount,
and obtaining, according to the operation code and the operation domain, the first to-be-operated vector, the second to-be-operated vector, and the target address required to execute the cyclic vector instruction further comprises:
determining the first input amount and the second input amount according to the operation domain; obtaining, from the first to-be-operated vector address, a first to-be-operated vector whose data amount is the first input amount; and obtaining, from the second to-be-operated vector address, a second to-be-operated vector whose data amount is the second input amount,
and dividing the first to-be-operated vector into a plurality of split vectors according to the second to-be-operated vector comprises:
determining a split data amount of each split vector according to the second input amount, and dividing the first to-be-operated vector into a plurality of split vectors according to the split data amount.
Clause S24. The method according to Clause S18, wherein the operation code further indicates the vector operation type,
and determining the vector operation type of the cyclic vector instruction comprises:
when the operation code indicates the vector operation type, determining the vector operation type according to the operation code.
Clause S25. The method according to Clause S18, wherein the vector operation type includes at least one of the following:
vector multiplication, vector addition, vector summation, storing a specified value when an operation condition is satisfied, bitwise AND, bitwise OR, and bitwise XOR,
wherein the operation condition includes any one of the following: bitwise equal, bitwise not equal, bitwise less than, bitwise greater than or equal, bitwise greater than, and bitwise less than or equal.
Clause S26. The method according to Clause S18, further comprising:
storing, by a storage module of the apparatus, the first to-be-operated vector and the second to-be-operated vector,
wherein the storage module includes at least one of a register and a cache,
the cache is configured to store data to be operated on, the first to-be-operated vector, and the second to-be-operated vector, and includes at least one neuron cache NRAM;
the register is configured to store scalar data in the data to be operated on; and
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause S27. The method according to Clause S18, wherein parsing the compiled cyclic vector instruction to obtain the operation code and the operation domain of the cyclic vector instruction comprises:
storing the compiled cyclic vector instruction;
parsing the compiled cyclic vector instruction to obtain the operation code and the operation domain of the cyclic vector instruction; and
storing an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged sequentially in execution order, the plurality of to-be-executed instructions including the compiled cyclic vector instruction.
Clause S28. The method according to Clause S27, further comprising:
when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and controlling execution of the first to-be-executed instruction after it is determined that execution of the zeroth to-be-executed instruction is complete,
wherein the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction preceding it comprises:
a first storage address interval storing the data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause S29. The method according to Clause S18, wherein compiling the obtained cyclic vector instruction to obtain the compiled cyclic vector instruction comprises:
generating an assembly file according to the cyclic vector instruction, and translating the assembly file into a binary file,
wherein the binary file is the compiled cyclic vector instruction.
Clause S30. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method according to any one of Clauses S18 to S29.
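A minimal Python sketch of the cyclic vector operation in Clauses S18, S23, and S25, offered only as an illustration under stated assumptions: memory is modeled as a flat list indexed by address, the split data amount is taken to equal the second input amount, the first input amount is assumed to be a multiple of it, the comparison conditions are read element-wise, and elements that fail the condition receive a hypothetical fallback value. The operation-type names (mul, add, select, and so on) are illustrative labels rather than mnemonics from the disclosure, and vector summation is omitted because the text does not define its exact semantics.

```python
import operator
from typing import Callable, Dict, List

# Element-wise operations; the keys are hypothetical labels for the listed operation types.
ELEMENTWISE: Dict[str, Callable[[int, int], int]] = {
    "mul": operator.mul,
    "add": operator.add,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}

# Conditions for "store a specified value when the condition is satisfied", read element-wise.
CONDITIONS: Dict[str, Callable[[int, int], bool]] = {
    "eq": operator.eq, "ne": operator.ne, "lt": operator.lt,
    "ge": operator.ge, "gt": operator.gt, "le": operator.le,
}


def cyclic_vector_op(memory: List[int], addr1: int, n1: int, addr2: int, n2: int,
                     op_type: str, cond: str = "eq", value: int = 1,
                     fallback: int = 0) -> List[int]:
    """Apply op_type between every n2-element split of the first vector and the second vector."""
    first = memory[addr1:addr1 + n1]   # first input amount of elements
    second = memory[addr2:addr2 + n2]  # second input amount of elements
    if n2 <= 0 or n1 % n2 != 0:
        raise ValueError("this sketch assumes n1 is a positive multiple of n2")
    result: List[int] = []
    for start in range(0, n1, n2):  # assumed: split data amount equals n2
        split = first[start:start + n2]
        if op_type == "select":
            test = CONDITIONS[cond]
            result.extend(value if test(a, b) else fallback for a, b in zip(split, second))
        else:
            op = ELEMENTWISE[op_type]
            result.extend(op(a, b) for a, b in zip(split, second))
    return result


mem = [1, 2, 3, 4, 5, 6, 10, 20]
print(cyclic_vector_op(mem, 0, 6, 6, 2, "add"))  # [11, 22, 13, 24, 15, 26]
```

The second vector is reused against every split of the first, which is what makes the operation "cyclic".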
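The dependency test in Clause S28 reduces to an interval-overlap check between the data address intervals of a queued instruction and those of earlier instructions. The sketch below is an illustration, assuming each instruction's required data is described by half-open [start, end) address intervals; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) storage address interval


@dataclass
class PendingInstr:
    name: str
    intervals: List[Interval]  # address intervals of the data this instruction needs


def overlaps(a: Interval, b: Interval) -> bool:
    """Two half-open intervals overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def must_wait_for(queue: List[PendingInstr], i: int) -> List[str]:
    """Names of earlier queued instructions whose data intervals overlap instruction i's."""
    current = queue[i]
    return [
        earlier.name
        for earlier in queue[:i]
        if any(overlaps(x, y) for x in current.intervals for y in earlier.intervals)
    ]


queue = [
    PendingInstr("i0", [(0, 64)]),
    PendingInstr("i1", [(32, 96)]),
    PendingInstr("i2", [(64, 128)]),
]
print(must_wait_for(queue, 2))  # ['i1']
```

An instruction with a non-empty result would be cached until every listed instruction has finished executing.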
With the widespread use of neural network algorithms and the continuous improvement in the computing power of computer hardware, the variety and volume of data operations involved in practical applications keep increasing. Programming languages are diverse, and there is currently no vector data migration instruction that applies broadly across them, so engineers who need to migrate vector data in a given language environment must define one or more custom instructions for that environment, which makes vector data migration inefficient and slow. The present disclosure provides a vector data migration instruction processing method, apparatus, computer device, and storage medium with which vector data migration can be performed with a single instruction, significantly improving the efficiency and speed of vector data migration.
FIG. 20-1 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 20-1, the apparatus includes a control module 20-11 and a processing module 20-12.
The control module 20-11 is configured to compile an obtained vector data migration instruction to obtain a compiled vector data migration instruction; parse the compiled vector data migration instruction to obtain the operation code and operation domain of the vector data migration instruction; obtain, according to the operation code and the operation domain, the to-be-migrated vector data and the target address required to execute the vector data migration instruction; and determine the migration parameters required for the migration process. The operation code indicates that the processing performed by the vector data migration instruction on the vector data is migration. The operation domain includes the to-be-migrated vector data address and the target address. The migration parameters may include the initial storage space in which the to-be-migrated vector data address is located, the target storage space in which the target address is located, and the migration type of the migration process.
The processing module 20-12 is configured to store the to-be-migrated vector data at the target address according to the migration parameters.
In this embodiment, there may be one or more pieces of to-be-migrated vector data. The migration type may indicate the vector data storage speed of the initial storage space, the vector data storage speed of the target storage space, and the relationship between the two speeds. In the vector data migration instruction, a different code can be assigned to each speed relationship between the initial storage space and the target storage space. For example, the migration type "the storage speed of the initial storage space is greater than that of the target storage space" can be coded as "st"; the migration type "the storage speed of the initial storage space is equal to that of the target storage space" can be coded as "mv"; and the migration type "the storage speed of the initial storage space is less than that of the target storage space" can be coded as "ld". Those skilled in the art may set the migration types and their codes according to actual needs, and the present disclosure does not limit this.
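A small sketch of the st/mv/ld coding described above, assuming storage speeds can be compared as plain numbers. The speed values and the opcode_for helper are hypothetical; they only illustrate how the speed relation between the two spaces could select the code.

```python
from typing import Dict


def migration_type(src_speed: float, dst_speed: float) -> str:
    """Encode the speed relation between the initial and target storage spaces."""
    if src_speed > dst_speed:
        return "st"  # initial space faster than target space
    if src_speed < dst_speed:
        return "ld"  # initial space slower than target space
    return "mv"      # equal speeds


# Hypothetical relative speeds (larger means faster); not figures from the disclosure.
SPEEDS: Dict[str, float] = {"gdram": 1.0, "nram": 4.0, "wram": 4.0}


def opcode_for(space1: str, space2: str) -> str:
    """Build a hypothetical type.space1.space2 string for a pair of spaces."""
    return f"{migration_type(SPEEDS[space1], SPEEDS[space2])}.{space1}.{space2}"


print(opcode_for("gdram", "nram"))  # ld.gdram.nram
```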
In this embodiment, the migration parameters may include identifiers of the initial storage space and the target storage space, such as their names or numbers, to represent the initial storage space and the target storage space.
In this embodiment, the initial storage space and the target storage space may be data storage spaces of the apparatus such as NRAM, WRAM, DRAM, and registers, where DRAM may include LDRAM, GDRAM, and the like. NRAM (Nanotube Random Access Memory) is a non-volatile memory based on carbon nanotubes (CNT). WRAM (Window RAM) is a type of VRAM (Video RAM). DRAM (Dynamic Random Access Memory) is dynamic random access memory. LDRAM (Local DRAM) may be DRAM exclusive to one computing core in the apparatus. GDRAM (Global DRAM) may be DRAM shared by multiple computing cores in the apparatus. A computing core is a unit, module, or the like in the apparatus that performs data operations, such as the processing module described below.
In this embodiment, the vector data migration instruction obtained by the control module is an uncompiled software instruction that cannot be executed directly by hardware, so the control module must first compile it; only after the compiled vector data migration instruction is obtained can it be parsed. The compiled vector data migration instruction is a hardware instruction that can be executed directly by hardware. The control module may obtain the to-be-migrated vector data from the to-be-migrated vector data address. The control module may obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or the field (usually represented by a code), specified in a computer program that designates the operation to be performed; it is an instruction sequence number that tells the device executing the instruction which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including the target address, the to-be-migrated vector data address, the initial storage space in which the to-be-migrated vector data address is located, the target storage space in which the target address is located, the migration parameters for the migration process, and so on. A vector data migration instruction must include an operation code and an operation domain, and the operation domain includes at least the to-be-migrated vector data address and the target address.
It should be understood that those skilled in the art may set the format of the vector data migration instruction, and the operation code and operation domain it contains, as needed; the present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more processing modules, whose numbers can be set according to actual needs; the present disclosure does not limit this. When the apparatus includes one control module, that control module can receive vector data migration instructions and control one or more processing modules to perform vector data migration. When the apparatus includes multiple control modules, each control module can receive vector data migration instructions and control its corresponding one or more processing modules to perform vector data migration.
The vector data migration instruction processing apparatus provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to compile an obtained vector data migration instruction to obtain a compiled vector data migration instruction; parse the compiled instruction to obtain its operation code and operation domain; obtain, according to the operation code and the operation domain, the to-be-migrated vector data and the target address required to execute the instruction; and determine the migration parameters required for the migration process. The processing module is configured to store the to-be-migrated vector data at the target address according to the migration parameters. The apparatus is widely applicable, and it processes vector data migration instructions and performs vector data migration efficiently and quickly.
FIG. 20-2 shows a block diagram of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 20-2, the processing module 20-12 may include a master processing sub-module 20-121 and a plurality of slave processing sub-modules 20-122.
The master processing sub-module 20-121 is configured to process the to-be-migrated vector data to obtain processed to-be-migrated vector data, and to store the processed data at the target address. The processing performed on the to-be-migrated vector data includes data type conversion and the like; alternatively, the to-be-migrated vector data may be stored directly without processing. The present disclosure does not limit this.
In a possible implementation, the operation domain may further include a vector data migration amount, and the control module 20-11 is further configured to determine the vector data migration amount according to the operation domain and to obtain, from the to-be-migrated vector data address, to-be-migrated vector data corresponding to the vector data migration amount.
In this implementation, the vector data migration amount may be the data amount of the to-be-migrated vector data that is obtained.
In a possible implementation, a default vector data migration amount may be preset. When the operation domain does not include a vector data migration amount, the default vector data migration amount may be taken as the vector data migration amount of the current vector data migration instruction, and to-be-migrated vector data corresponding to that amount is then obtained from the to-be-migrated vector data address.
In a possible implementation, when the operation domain does not include a vector data migration amount, all to-be-migrated vector data stored at the to-be-migrated vector data address may be obtained directly.
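The ways of determining the migration amount described above can be sketched as follows, assuming a storage space is modeled as a Python list and addresses are list indices. Treating "all data stored at the address" as everything from src to the end of the space is an assumption, since the text does not define that extent, and fetch_to_migrate is a hypothetical name.

```python
from typing import List, Optional


def fetch_to_migrate(space: List[int], src: int, size: Optional[int],
                     default_size: Optional[int] = None) -> List[int]:
    """Return the to-be-migrated vector data starting at address src."""
    if size is None:
        size = default_size  # no amount in the operation domain: use the preset default
    if size is None:
        return space[src:]   # no default either: take everything from src onward (assumed extent)
    return space[src:src + size]


space = list(range(10))
print(fetch_to_migrate(space, 2, 3))  # [2, 3, 4]
```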
In a possible implementation, the operation domain may further include the migration parameters, and determining the migration parameters required for the migration process may include: determining the migration parameters according to the operation domain.
In a possible implementation, the operation code may also indicate the migration parameters, and determining the migration parameters required for the migration process may include: determining the migration parameters according to the operation code.
In a possible implementation, default migration parameters may also be set. When the migration parameters of the current vector data migration instruction can be determined from neither the operation domain nor the operation code, the default migration parameters may be used as the migration parameters of the current vector data migration instruction.
In a possible implementation, the initial storage space and the target storage space may also be determined from the to-be-migrated vector data address and the target address, respectively, and the migration parameters may then be determined from parameters of the two spaces such as their storage speeds and storage space types.
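One way to combine the alternatives above into a single resolution step is sketched below. The precedence order (operation domain, operation code, preset default, then derivation from the addresses), the address map, and the speed table are all assumptions made for illustration; the disclosure presents these as separate possible implementations without fixing an order.

```python
from typing import Dict, Optional, Tuple

# Hypothetical address map: half-open [start, end) ranges -> storage space name.
ADDRESS_MAP: Dict[Tuple[int, int], str] = {
    (0x0000, 0x1000): "nram",
    (0x1000, 0x8000): "gdram",
}
# Hypothetical relative storage speeds; larger means faster.
SPEED: Dict[str, int] = {"nram": 3, "gdram": 1}


def space_of(addr: int) -> str:
    for (lo, hi), name in ADDRESS_MAP.items():
        if lo <= addr < hi:
            return name
    raise ValueError(f"address {addr:#x} is not mapped")


def derive_params(src: int, dst: int) -> str:
    """Derive type.space1.space2 from the spaces that hold src and dst."""
    s1, s2 = space_of(src), space_of(dst)
    if SPEED[s1] < SPEED[s2]:
        kind = "ld"
    elif SPEED[s1] > SPEED[s2]:
        kind = "st"
    else:
        kind = "mv"
    return f"{kind}.{s1}.{s2}"


def resolve_params(from_domain: Optional[str], from_opcode: Optional[str],
                   src: int, dst: int, default: Optional[str] = None) -> str:
    # Assumed precedence: operation domain, operation code, preset default, then derivation.
    for candidate in (from_domain, from_opcode, default):
        if candidate is not None:
            return candidate
    return derive_params(src, dst)


print(resolve_params(None, None, src=0x2000, dst=0x0100))  # ld.gdram.nram
```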
In a possible implementation, as shown in FIG. 20-2, the apparatus may further include a storage module 20-13 configured to store the to-be-migrated vector data.
In this implementation, the storage module may include one or more of a cache and a register. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache is configured to store data to be operated on and to-be-migrated vector data. The register is configured to store scalar data in the data to be operated on.
In a possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory described above, may be used to store neuron data in the data to be operated on, the neuron data including neuron vector data.
In a possible implementation, the apparatus may further include a direct memory access (DMA) module configured to read data from or store data to the storage module.
In a possible implementation, the control module 20-11 may further be configured to generate an assembly file according to the vector data migration instruction and translate the assembly file into a binary file, the binary file being the compiled vector data migration instruction.
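The assembly-to-binary step could be sketched as below. The disclosure does not specify any binary encoding, so the numeric codes, the 16-byte little-endian layout, and the assemble helper are all hypothetical; only the textual operand order follows the "type.space1.space2 dst src size" form, and the sketch assumes a single source address.

```python
import struct
from typing import Dict

# Hypothetical numeric codes; the disclosure does not define a binary layout.
TYPE_CODES: Dict[str, int] = {"ld": 0, "st": 1, "mv": 2}
SPACE_CODES: Dict[str, int] = {"nram": 0, "wram": 1, "ldram": 2, "gdram": 3}

# Assumed layout: type, space1, space2 (1 byte each), one pad byte,
# then dst, src, size as little-endian unsigned 32-bit integers (16 bytes total).
LAYOUT = struct.Struct("<BBBxIII")


def assemble(line: str) -> bytes:
    """Translate one 'type.space1.space2 dst src size' line into a 16-byte word."""
    opcode, dst, src, size = line.replace(",", " ").split()
    kind, space1, space2 = opcode.split(".")
    return LAYOUT.pack(TYPE_CODES[kind], SPACE_CODES[space1], SPACE_CODES[space2],
                       int(dst, 0), int(src, 0), int(size, 0))


word = assemble("ld.gdram.nram 0x100 0x2000 64")
print(len(word), LAYOUT.unpack(word))  # 16 (0, 3, 0, 256, 8192, 64)
```

A fixed-width layout keeps decoding trivial for hardware; a real encoding would be dictated by the target instruction set.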
In a possible implementation, the instruction format of the vector data migration instruction may be:
vector dst src type.space1.space2 size
Here, vector is the operation code of the vector data migration instruction, and dst, src, type.space1.space2, and size form its operation domain. dst is the target address and src is the to-be-migrated vector data address; when there are multiple pieces of to-be-migrated vector data, src may include multiple addresses src0, src1, ..., srcn, which the present disclosure does not limit. type.space1.space2 is the migration parameter: type denotes the migration type, space1 denotes the initial storage space in which the to-be-migrated vector data address src is located, and space2 denotes the target storage space in which the target address dst is located. size is the vector data migration amount.
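A parser for this format makes the split between operation code and operation domain concrete. The sketch assumes whitespace-separated tokens and that multiple source addresses, when present, appear one after another between dst and the migration parameter; the dataclass and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MigrationInstr:
    opcode: str
    dst: str
    srcs: List[str]
    mig_type: str
    space1: str
    space2: str
    size: int


def parse_format1(text: str) -> MigrationInstr:
    """Parse 'vector dst src... type.space1.space2 size' (one or more source operands)."""
    tokens = text.split()
    if len(tokens) < 5 or tokens[0] != "vector":
        raise ValueError("expected: vector dst src... type.space1.space2 size")
    mig_type, space1, space2 = tokens[-2].split(".")
    return MigrationInstr(
        opcode=tokens[0],
        dst=tokens[1],
        srcs=tokens[2:-2],  # src, or src0 ... srcn when several vectors are migrated
        mig_type=mig_type,
        space1=space1,
        space2=space2,
        size=int(tokens[-1]),
    )


print(parse_format1("vector d0 s0 s1 ld.gdram.nram 64").srcs)  # ['s0', 's1']
```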
In a possible implementation, the instruction format of the vector data migration instruction may alternatively be:
type.space1.space2 dst src size
Here, type.space1.space2 is the operation code of the vector data migration instruction, and dst, src, and size form its operation domain. dst is the target address and src is the to-be-migrated vector data address; when there are multiple pieces of to-be-migrated vector data, src may include multiple addresses src0, src1, ..., srcn, which the present disclosure does not limit. size is the vector data migration amount. In the operation code type.space1.space2, type denotes the migration type, space1 denotes the initial storage space in which the to-be-migrated vector data address src is located, and space2 denotes the target storage space in which the target address dst is located.
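Similarly, a sketch of parsing this second format, in which the migration parameter is itself the operation code. It assumes whitespace or commas separate the fields, so the comma-separated form used in the examples of this section also parses, and it returns a plain dictionary with hypothetical keys.

```python
from typing import Dict, List, Union

VALID_TYPES = {"ld", "st", "mv"}


def parse_format2(text: str) -> Dict[str, Union[str, int, List[str]]]:
    """Parse 'type.space1.space2 dst src... size'; commas are treated as separators."""
    tokens = text.replace(",", " ").split()
    if len(tokens) < 4:
        raise ValueError("expected: type.space1.space2 dst src... size")
    mig_type, space1, space2 = tokens[0].split(".")
    if mig_type not in VALID_TYPES:
        raise ValueError(f"unknown migration type: {mig_type}")
    return {
        "type": mig_type,
        "space1": space1,
        "space2": space2,
        "dst": tokens[1],
        "srcs": tokens[2:-1],  # one or more source addresses
        "size": int(tokens[-1]),
    }


print(parse_format2("ld.gdram.nram, dst, src0, 64"))
```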
Here, type may be ld, st, or mv. ld denotes the migration type "the storage speed of the initial storage space is less than that of the target storage space"; st denotes "the storage speed of the initial storage space is greater than that of the target storage space"; and mv denotes "the storage speed of the initial storage space is equal to that of the target storage space".
In a possible implementation, the instruction format of a vector data migration instruction whose migration type is "the storage speed of the initial storage space is less than that of the target storage space" may be set as: ld.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2, and the migration type ld, to-be-migrated vector data whose data amount equals size is obtained from the to-be-migrated vector data address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2, where the storage speed of space1 is less than that of space2.
In a possible implementation, the instruction format of a vector data migration instruction whose migration type is "the storage speed of the initial storage space is greater than that of the target storage space" may be set as: st.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2, and the migration type st, to-be-migrated vector data whose data amount equals size is obtained from the to-be-migrated vector data address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2, where the storage speed of space1 is greater than that of space2.
In a possible implementation, the instruction format of a vector data migration instruction whose migration type is "the storage speed of the initial storage space is equal to that of the target storage space" may be set as: mv.space1.space2, dst, src0, size. According to the vector data migration amount size, the initial storage space space1, the target storage space space2, and the migration type mv, to-be-migrated vector data whose data amount equals size is obtained from the to-be-migrated vector data address src0 in the initial storage space space1 and stored at the target address dst in the target storage space space2, where the storage speed of space1 is equal to that of space2.
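The three cases can be simulated as below, assuming each storage space is a Python list indexed by address and using hypothetical relative speeds to check that the migration type matches the two spaces. This is a behavioral sketch only, not the hardware data path, and execute_migration is a hypothetical name.

```python
from typing import Dict, List

# Hypothetical relative speeds of the simulated storage spaces; larger means faster.
SPEED: Dict[str, int] = {"gdram": 1, "ldram": 2, "nram": 3, "wram": 3}


def execute_migration(spaces: Dict[str, List[int]], mig_type: str, space1: str,
                      space2: str, dst: int, src: int, size: int) -> None:
    """Copy size elements from spaces[space1] at src to spaces[space2] at dst."""
    s1, s2 = SPEED[space1], SPEED[space2]
    consistent = {"ld": s1 < s2, "st": s1 > s2, "mv": s1 == s2}
    if not consistent[mig_type]:
        raise ValueError(f"{mig_type} does not match the speeds of {space1} and {space2}")
    data = spaces[space1][src:src + size]
    if len(data) != size or dst + size > len(spaces[space2]):
        raise ValueError("source or target range is out of bounds")
    spaces[space2][dst:dst + size] = data  # same-length slice assignment keeps the target size fixed


spaces = {"gdram": list(range(10)), "nram": [0] * 4, "wram": [0] * 4}
execute_migration(spaces, "ld", "gdram", "nram", dst=0, src=2, size=3)
execute_migration(spaces, "mv", "nram", "wram", dst=1, src=0, size=2)
print(spaces["nram"], spaces["wram"])  # [2, 3, 4, 0] [0, 2, 3, 0]
```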
It should be understood that those skilled in the art may set the opcode of the vector data migration instruction, and the positions of the opcode and the operation field in the instruction format, as needed. The present disclosure does not limit this.
In one possible implementation, the apparatus may be disposed in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and an embedded neural-network processing unit (NPU).
It should be noted that the vector data migration instruction processing apparatus has been described above using the foregoing embodiments as examples. Those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may configure each module flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application example
The following application example uses "data migration with the vector data migration instruction processing apparatus" as an exemplary scenario, to help readers understand the apparatus's processing flow. Those skilled in the art should understand that this example is provided only to aid understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 20-3 shows a schematic diagram of an application scenario of a vector data migration instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 20-3, the apparatus processes a vector data migration instruction as follows:
The control module 20-11 compiles the acquired vector data migration instruction 1 to obtain compiled vector data migration instruction 1. It then parses the compiled instruction (for example, ld.200.300 500 400 5) to obtain its opcode and operation field. The fields of instruction 1 are:

- opcode: ld;
- initial storage space: 200;
- target storage space: 300;
- target address: 500;
- address of the vector data to be migrated: 400;
- vector data migration amount: 5.

The opcode ld indicates that the storage speed of initial storage space 200 is lower than that of target storage space 300. The control module 20-11 fetches vector data to be migrated, in the migration amount 5, from address 400 in initial storage space 200. The processing module 20-12 then stores this data at target address 500 in target storage space 300 according to the migration parameters.
Vector data migration instruction 1 may be written as ld.200.300 500 400 5 or, for example, as vector 500 400 ld.200.300 5. The two forms are processed similarly, so the details are not repeated here.
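The application example can be replayed on a toy memory model. This minimal sketch makes several assumptions:

- Each storage space is a Python list indexed by element address.
- The migration amount counts elements.
- The spaces are keyed by the identifiers "200" and "300" from the example.
- The function execute_migration is hypothetical.

```python
def execute_migration(text, spaces):
    """Replay 'op.space1.space2 dst src0 size' on a dict of list-backed spaces."""
    head, dst, src0, size = text.replace(",", " ").split()
    opcode, space1, space2 = head.split(".")
    if opcode not in ("ld", "st", "mv"):
        raise ValueError(f"unknown migration opcode: {opcode!r}")
    dst, src0, size = int(dst), int(src0), int(size)
    src_mem, dst_mem = spaces[space1], spaces[space2]
    if src0 + size > len(src_mem) or dst + size > len(dst_mem):
        raise IndexError("migration range out of bounds")
    data = src_mem[src0:src0 + size]   # control module: fetch `size` elements at src0
    dst_mem[dst:dst + size] = data     # processing module: store them at dst
    return data


spaces = {"200": list(range(1000)), "300": [0] * 1000}
execute_migration("ld.200.300 500 400 5", spaces)
print(spaces["300"][500:505])  # [400, 401, 402, 403, 404]
```

In this model, the opcode is only validated, because both memories are plain lists. In hardware, the opcode would select the transfer path between storage spaces of different speeds.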
For the working process of each module above, refer to the relevant description earlier in this disclosure.
In this way, the vector data migration instruction processing apparatus can process vector data migration instructions efficiently and quickly, so vector data migration is carried out with high efficiency and speed.
FIG. 20-4 shows a flowchart of a vector data migration instruction processing method according to an embodiment of the present disclosure. The method may be applied to a device, such as computer equipment, that includes a memory and a processor. The memory stores data used while the method is executed, and the processor performs the related processing and computation steps, such as steps S51-20 and S52-20 below. As shown in FIG. 20-4, the method is applied to the vector data migration instruction processing apparatus described above and includes steps S51-20 and S52-20.
In step S51-20, the control module performs the following:

- It compiles the acquired vector data migration instruction to obtain a compiled vector data migration instruction.
- It parses the compiled instruction to obtain its opcode and operation field.
- It obtains, according to the opcode and the operation field, the vector data to be migrated and the target address required to execute the instruction.
- It determines the migration parameters required for the migration processing.

The opcode indicates that the instruction performs migration processing on vector data. The operation field includes the address of the vector data to be migrated and the target address. The migration parameters include the initial storage space containing the address of the vector data to be migrated, the target storage space containing the target address, and the migration type of the migration processing.
In step S52-20, the processing module stores the vector data to be migrated at the target address according to the migration parameters.
In one possible implementation, the processing module may include a master processing sub-module and a plurality of slave processing sub-modules. Step S52-20 may then include:
processing the vector data to be migrated to obtain processed vector data, and storing the processed vector data at the target address.
In one possible implementation, the operation field may further include a vector data migration amount. Obtaining the vector data to be migrated and the target address according to the opcode and the operation field may then include:
determining the vector data migration amount according to the operation field, and fetching vector data to be migrated in that amount from the address of the vector data to be migrated.
In one possible implementation, the operation field may further include the migration parameters. Determining the migration parameters required for the migration processing may then include determining them according to the operation field.
In one possible implementation, the opcode may also indicate the migration parameters. Determining the migration parameters required for the migration processing may then include determining them according to the opcode.
In one possible implementation, the method further includes storing the vector data to be migrated in a storage module of the apparatus, where:
the storage module includes at least one of a register and a cache;
the cache stores data to be operated on and the vector data to be migrated, and includes at least one neuron cache (NRAM);
the register stores scalar data among the data to be operated on;
the neuron cache stores neuron data among the data to be operated on, and the neuron data includes neuron vector data.
In one possible implementation, step S51-20 may include:
storing the compiled vector data migration instruction;
parsing the compiled vector data migration instruction to obtain its opcode and operation field; and
storing an instruction queue. The queue holds a plurality of instructions to be executed, arranged in execution order, and these include the compiled vector data migration instruction.
In one possible implementation, the method may further include:
when a first instruction to be executed among the plurality of instructions to be executed is determined to have an association relationship with a preceding zeroth instruction to be executed, caching the first instruction, and controlling execution of the first instruction after the zeroth instruction is determined to have finished executing,
where the first instruction to be executed has an association relationship with the preceding zeroth instruction to be executed when:
the first storage address interval, which stores data required by the first instruction to be executed, overlaps the zeroth storage address interval, which stores data required by the zeroth instruction to be executed.
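The association test above reduces to an interval-overlap check between the first instruction and each earlier, unfinished instruction. This minimal sketch makes two assumptions that the text does not state:

- Address intervals are half-open [start, end).
- All addresses belong to one flat address space.

The names PendingInstr, intervals_overlap and must_wait are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingInstr:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval, assumed)


def intervals_overlap(a: PendingInstr, b: PendingInstr) -> bool:
    """True if [a.start, a.end) and [b.start, b.end) share at least one address."""
    return a.start < b.end and b.start < a.end


def must_wait(first: PendingInstr, earlier: list) -> bool:
    """Hold the first instruction while it overlaps any earlier, unfinished instruction."""
    return any(intervals_overlap(first, zeroth) for zeroth in earlier)


zeroth = PendingInstr("zeroth", 500, 505)
dependent = PendingInstr("first", 502, 510)
independent = PendingInstr("other", 600, 610)
print(must_wait(dependent, [zeroth]))    # True: cache it until "zeroth" finishes
print(must_wait(independent, [zeroth]))  # False: it can be issued immediately
```

Two half-open intervals that merely touch, such as [500, 505) and [505, 510), are treated as independent in this sketch.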
In one possible implementation, compiling the acquired vector data migration instruction to obtain the compiled vector data migration instruction may include:
generating an assembly file according to the vector data migration instruction and translating the assembly file into a binary file,
where the binary file is the compiled vector data migration instruction.
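The disclosure does not specify the binary encoding produced in this step. The sketch below only illustrates the idea of translating one assembly line into a fixed-width binary record and back. The opcode numbers, field widths and byte order are all hypothetical. assemble and disassemble are illustrative names, and struct is Python's standard packing module.

```python
import struct

OPCODES = {"ld": 0x01, "st": 0x02, "mv": 0x03}          # hypothetical opcode numbers
MNEMONICS = {code: op for op, code in OPCODES.items()}
# Hypothetical little-endian layout, no padding:
# opcode (1 B), space1 (2 B), space2 (2 B), dst (4 B), src0 (4 B), size (4 B) = 17 bytes
LAYOUT = struct.Struct("<BHHIII")


def assemble(text: str) -> bytes:
    """Translate 'op.space1.space2 dst src0 size' into one binary record."""
    head, dst, src0, size = text.replace(",", " ").split()
    op, space1, space2 = head.split(".")
    return LAYOUT.pack(OPCODES[op], int(space1), int(space2),
                       int(dst), int(src0), int(size))


def disassemble(record: bytes) -> str:
    """Inverse of assemble, useful for checking the round trip."""
    code, s1, s2, dst, src0, size = LAYOUT.unpack(record)
    return f"{MNEMONICS[code]}.{s1}.{s2} {dst} {src0} {size}"


binary = assemble("ld.200.300 500 400 5")
print(len(binary), disassemble(binary))  # 17 ld.200.300 500 400 5
```

A real instruction set would pack these fields into a hardware word whose format is defined by the processor, not by this sketch.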
It should be noted that the vector data migration instruction processing method has been described above using the foregoing embodiments as examples. Those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may configure each step flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The vector data migration instruction processing method provided by the embodiments of the present disclosure has a wide range of application. It processes vector data migration instructions quickly and efficiently, and therefore carries out vector data migration quickly and efficiently.
The foregoing may be better understood in light of the following clauses:
Clause T1. A vector data migration instruction processing apparatus, comprising:
a control module configured to: compile an acquired vector data migration instruction to obtain a compiled vector data migration instruction; parse the compiled instruction to obtain an opcode and an operation field of the vector data migration instruction; obtain, according to the opcode and the operation field, vector data to be migrated and a target address required to execute the vector data migration instruction; and determine migration parameters required for migration processing; and
a processing module configured to store the vector data to be migrated at the target address according to the migration parameters,
wherein the opcode indicates that the vector data migration instruction performs migration processing on vector data; the operation field includes an address of the vector data to be migrated and the target address; and the migration parameters include an initial storage space containing the address of the vector data to be migrated, a target storage space containing the target address, and a migration type of the migration processing.
Clause T2. The apparatus according to Clause T1, wherein the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,
the master processing sub-module being configured to process the vector data to be migrated to obtain processed vector data, and store the processed vector data at the target address.
Clause T3. The apparatus according to Clause T1, wherein the operation field further includes a vector data migration amount,
and the control module is further configured to determine the vector data migration amount according to the operation field and fetch vector data to be migrated in that amount from the address of the vector data to be migrated.
Clause T4. The apparatus according to Clause T1, wherein the operation field further includes the migration parameters,
and determining the migration parameters required for the migration processing includes:
determining the migration parameters required for the migration processing according to the operation field.
Clause T5. The apparatus according to Clause T1, wherein the opcode further indicates the migration parameters,
and determining the migration parameters required for the migration processing includes:
determining the migration parameters required for the migration processing according to the opcode.
Clause T6. The apparatus according to Clause T1, further comprising:
a storage module configured to store the vector data to be migrated,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on and the vector data to be migrated, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on; and
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause T7. The apparatus according to Clause T1, wherein the control module includes:
an instruction storage sub-module configured to store the compiled vector data migration instruction;
an instruction processing sub-module configured to parse the compiled vector data migration instruction to obtain the opcode and the operation field of the vector data migration instruction; and
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled vector data migration instruction.
Clause T8. The apparatus according to Clause T7, wherein the control module further includes:
a dependency processing sub-module configured to, when a first instruction to be executed among the plurality of instructions to be executed is determined to have an association relationship with a preceding zeroth instruction to be executed: cache the first instruction in the instruction storage sub-module; and, after the zeroth instruction has finished executing, fetch the first instruction from the instruction storage sub-module and send it to the processing module,
wherein the first instruction to be executed has an association relationship with the preceding zeroth instruction to be executed when:
a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
Clause T9. The apparatus according to Clause T1,
wherein the control module is further configured to generate an assembly file according to the vector data migration instruction and translate the assembly file into a binary file,
the binary file being the compiled vector data migration instruction.
Clause T10. A machine learning computing apparatus, comprising:
one or more vector data migration instruction processing apparatuses according to any one of Clauses T1 to T9, configured to obtain vector data to be migrated and control information from other processing apparatuses, perform specified machine learning operations, and pass execution results to the other processing apparatuses through an I/O interface;
when the machine learning computing apparatus includes a plurality of the vector data migration instruction processing apparatuses, these apparatuses may be connected to one another and transmit data through a specific structure;
wherein:
the plurality of vector data migration instruction processing apparatuses are interconnected, and transmit data, through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations;
they share one control system or each have their own control system;
they share memory or each have their own memory;
they are interconnected in an arbitrary interconnection topology.
Clause T11. A combined processing apparatus, comprising:
the machine learning computing apparatus according to Clause T10, a universal interconnection interface, and other processing apparatuses;
the machine learning computing apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
wherein the combined processing apparatus further includes a storage apparatus connected to both the machine learning computing apparatus and the other processing apparatuses, and configured to store data of the machine learning computing apparatus and the other processing apparatuses.
Clause T12. A machine learning chip, comprising:
the machine learning computing apparatus according to Clause T10 or the combined processing apparatus according to Clause T11.
Clause T13. An electronic device, comprising:
the machine learning chip according to Clause T12.
Clause T14. A board card, comprising: a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause T12;
wherein the machine learning chip is connected to each of the storage device, the control device and the interface apparatus;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor a state of the machine learning chip.
Clause T15. A vector data migration instruction processing method, applied to a vector data migration instruction processing apparatus that includes a control module and a processing module, the method comprising:
by the control module: compiling an acquired vector data migration instruction to obtain a compiled vector data migration instruction; parsing the compiled instruction to obtain an opcode and an operation field of the vector data migration instruction; obtaining, according to the opcode and the operation field, vector data to be migrated and a target address required to execute the vector data migration instruction; and determining migration parameters required for migration processing; and
storing, by the processing module, the vector data to be migrated at the target address according to the migration parameters,
wherein the opcode indicates that the vector data migration instruction performs migration processing on vector data; the operation field includes an address of the vector data to be migrated and the target address; and the migration parameters include an initial storage space containing the address of the vector data to be migrated, a target storage space containing the target address, and a migration type of the migration processing.
Clause T16. The method according to Clause T15, wherein the processing module includes a master processing sub-module and a plurality of slave processing sub-modules,
and storing the vector data to be migrated at the target address according to the migration parameters includes:
processing the vector data to be migrated to obtain processed vector data, and storing the processed vector data at the target address.
Clause T17. The method according to Clause T15, wherein the operation field further includes a vector data migration amount,
and obtaining the vector data to be migrated and the target address according to the opcode and the operation field includes:
determining the vector data migration amount according to the operation field, and fetching vector data to be migrated in that amount from the address of the vector data to be migrated.
Clause T18. The method according to Clause T15, wherein the operation field further includes the migration parameters,
and determining the migration parameters required for the migration processing includes:
determining the migration parameters required for the migration processing according to the operation field.
Clause T19. The method according to Clause T15, wherein the opcode further indicates the migration parameters,
and determining the migration parameters required for the migration processing includes:
determining the migration parameters required for the migration processing according to the opcode.
Clause T20. The method according to Clause T15, further comprising:
storing the vector data to be migrated in a storage module of the apparatus,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store data to be operated on and the vector data to be migrated, and includes at least one neuron cache (NRAM);
the register is configured to store scalar data among the data to be operated on; and
the neuron cache is configured to store neuron data among the data to be operated on, the neuron data including neuron vector data.
Clause T21. The method according to Clause T15, wherein parsing the compiled vector data migration instruction to obtain the opcode and the operation field of the vector data migration instruction includes:
storing the compiled vector data migration instruction;
parsing the compiled vector data migration instruction to obtain the opcode and the operation field of the vector data migration instruction; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled vector data migration instruction.
Clause T22. The method according to Clause T21, further comprising:
when a first instruction to be executed among the plurality of instructions to be executed is determined to have an association relationship with a preceding zeroth instruction to be executed, caching the first instruction, and controlling execution of the first instruction after the zeroth instruction is determined to have finished executing,
wherein the first instruction to be executed has an association relationship with the preceding zeroth instruction to be executed when:
a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
Clause T23. The method according to Clause T15, wherein compiling the acquired vector data migration instruction to obtain the compiled vector data migration instruction includes:
generating an assembly file according to the vector data migration instruction and translating the assembly file into a binary file,
the binary file being the compiled vector data migration instruction.
Clause T24. A non-volatile computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of Clauses T15 to T23.
Neural network algorithms are now widely used and the computing capability of computer hardware keeps improving, so the types and number of data operations in practical applications keep increasing. Programming languages are diverse, and the related art has no synchronization control instruction that applies broadly across them. Technicians who need synchronization control in a given language environment must therefore define several custom instructions for that environment, which makes synchronization control inefficient and slow. The present disclosure provides a synchronization control instruction processing method and apparatus, computer equipment and a storage medium that achieve synchronization control with a single instruction. This can significantly improve the efficiency and speed of synchronization control.
FIG. 21-1a shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-1a, the apparatus includes a control module 21-11, a plurality of computing modules 21-12, and a compilation module 21-13.
The compilation module 21-13 is configured to compile an acquired synchronization control instruction to obtain a compiled synchronization control instruction.
The control module 21-11 is configured to parse the compiled synchronization control instruction to obtain its opcode, and to determine the target computing modules that need to execute the synchronization control instruction. The opcode indicates that the instruction performs synchronization control over the plurality of computing modules of the apparatus.
A target computing module among the plurality of computing modules 21-12 is configured to enter a suspended state when it executes the compiled synchronization control instruction. In the suspended state, the target computing module stops working: it performs no further data operations and cannot continue executing its own pending computation instructions.
The control module 21-11 is further configured to monitor the operating states of the plurality of computing modules 21-12 and, upon determining that all target computing modules are in the suspended state, to switch the suspended target computing modules into the working state. In the working state, a target computing module can perform data operations and execute its pending computation instructions.
In this embodiment, the synchronization control instruction synchronizes the computing modules as they execute computation instructions: a computing module that reaches the synchronization control instruction suspends work and waits for the control module to issue an instruction to resume, which achieves synchronization control.
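As a rough illustration of this suspend-and-resume behavior, the following Python sketch models the control module with a condition variable and each computing module with a thread. The class name `ControlModule`, the method `sync`, and the use of software threads are assumptions made for illustration only; the disclosure describes hardware modules, not threads.

```python
import threading

class ControlModule:
    """Hypothetical model of control module 21-11: it resumes the suspended
    target modules only once every target module has reached the sync point."""

    def __init__(self, target_ids):
        self.targets = set(target_ids)
        self.paused = set()
        self.cond = threading.Condition()
        self.generation = 0  # incremented each time the group is resumed

    def sync(self, module_id):
        """Called by a target module when it executes the compiled sync instruction."""
        with self.cond:
            gen = self.generation
            self.paused.add(module_id)          # this module enters the suspended state
            if self.paused == self.targets:     # all target modules are now suspended
                self.paused.clear()
                self.generation += 1            # switch all of them back to working
                self.cond.notify_all()
            else:
                while gen == self.generation:   # wait until the group is resumed
                    self.cond.wait()

log = []
log_lock = threading.Lock()

def worker(ctrl, module_id):
    with log_lock:
        log.append(("before", module_id))       # work done before the sync point
    ctrl.sync(module_id)
    with log_lock:
        log.append(("after", module_id))        # work done after being resumed

ctrl = ControlModule(range(4))
threads = [threading.Thread(target=worker, args=(ctrl, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each module logs "before" prior to calling `sync`, and no module returns from `sync` until all four have arrived, every "before" entry precedes every "after" entry.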
In this embodiment, the synchronization control instruction acquired by the apparatus is an uncompiled software instruction that the hardware cannot execute directly. The compilation module must first compile this uncompiled instruction; the compiled synchronization control instruction is a hardware instruction that the hardware can execute directly. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
Optionally, the instruction processing apparatus may include a general-purpose processor and an artificial intelligence processor. The compilation module may reside on a general-purpose processor such as a CPU, which then performs the compilation. The artificial intelligence processor may contain the control module and computing modules described above; its specific structure is described below. The artificial intelligence processor parses the compiled synchronization control instruction it receives and executes the corresponding instructions.
In this embodiment, the opcode may be the part of an instruction, or the field (usually represented as a code), that specifies the operation to be performed in a computer program; it serves as the instruction's identifying number and tells the executing device which instruction to execute. The operand field may be the source of all data required to execute the instruction, including the data to be operated on, parameters such as a quantity threshold, and the corresponding operation method. A synchronization control instruction may thus include an opcode and an operand field.
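To make the opcode/operand-field split concrete, the sketch below parses a textual instruction such as "barrier 16" into an opcode and a tuple of operands. The whitespace-separated text form, the `Instruction` dataclass, and the `parse` function are hypothetical; the disclosure leaves the actual encoding open.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Instruction:
    opcode: str                # which operation to perform, e.g. "barrier"
    operands: Tuple[str, ...]  # operand field: data sources and parameters such as a threshold

def parse(text: str) -> Instruction:
    """Split a hypothetical textual instruction into opcode and operand field."""
    parts = text.split()
    if not parts:
        raise ValueError("empty instruction")
    return Instruction(opcode=parts[0], operands=tuple(parts[1:]))
```

For example, `parse("barrier 16")` yields opcode `"barrier"` with the single operand `"16"`, while `parse("sync_all()")` yields an instruction that consists of an opcode only.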
It should be understood that those skilled in the art may define the instruction format of the synchronization control instruction, and the opcode and operand field it contains, as needed; the present disclosure does not limit this.
In this embodiment, the apparatus may include one or more control modules and one or more computing modules, and their numbers may be set according to actual needs; the present disclosure does not limit this. When the apparatus includes one control module, that control module receives the synchronization control instruction and synchronizes its corresponding one or more computing modules. When the apparatus includes multiple control modules, each control module receives synchronization control instructions separately and synchronizes its own corresponding computing modules.
In this embodiment, a computing module may be any component or module capable of executing computation instructions, such as a core of the apparatus or a processor within the apparatus; the present disclosure does not limit this.
The synchronization control instruction processing apparatus provided by the embodiments of the present disclosure includes a compilation module, a control module, and a plurality of computing modules. The compilation module compiles the acquired synchronization control instruction to obtain a compiled synchronization control instruction. The control module parses the compiled instruction to obtain its opcode and determines the target computing modules that need to execute it. Each target computing module enters the suspended state when it executes the compiled instruction. The control module also monitors the operating states of the computing modules and, once it determines that all target computing modules are suspended, switches them back into the working state simultaneously. The synchronization control instruction processing method, apparatus, and related products provided by the embodiments of the present disclosure are widely applicable, process synchronization control instructions efficiently and quickly, and improve the efficiency and speed with which the corresponding computing modules are synchronized, which in turn improves the efficiency and speed of data operations.
In a possible implementation, the opcode may indicate the target signal that the target computing modules need to synchronize, or the operand field may contain that target signal, so that the control module can determine the target signal from the opcode or the operand field. The target computing module is further configured, when executing the compiled synchronization control instruction, to suspend the processing corresponding to the target signal determined by the control module.
The target signal may include at least one of the following: a compute queue signal, an IO signal, and an arrival signal. An arrival signal is a class of signals that arrive in parallel across computing modules, covering all signals that the computing modules need to execute synchronously. A compute queue signal may be the signal of the queue of computation tasks awaiting execution in a computing module, and an IO signal may be an input and/or output signal of a computing module.
In this implementation, when the synchronization control instruction does not indicate a target signal, a preset default target signal may be used as the target signal; the present disclosure does not limit this.
In this implementation, specific opcodes or operand fields may be assigned to different target signals, so that the processor can determine from them which signals in the target computing module need to be suspended. The target signal may also be any other signal related to the operation, control, or computation of a computing module; the present disclosure does not limit this.
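The three signal classes and the fallback to a preset default can be sketched with a flag enumeration, since an instruction may name more than one signal. The member names and the choice of the arrival signal as the default are assumptions; the disclosure does not say which signal the default is.

```python
from enum import Flag, auto

class TargetSignal(Flag):
    COMPUTE_QUEUE = auto()  # queue of computation tasks awaiting execution
    IO = auto()             # input and/or output of the computing module
    ARRIVAL = auto()        # signals that arrive in parallel across modules

# Assumed preset default, used when the instruction indicates no target signal.
DEFAULT_SIGNAL = TargetSignal.ARRIVAL

def resolve_target(indicated=None):
    """Return the signals to suspend, falling back to the preset default."""
    return indicated if indicated else DEFAULT_SIGNAL
```

Using `Flag` lets one value carry a combination such as `TargetSignal.COMPUTE_QUEUE | TargetSignal.IO`, and the `in` operator tests whether a particular signal is included.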
In a possible implementation, determining the target computing modules that need to execute the synchronization control instruction may include: according to the identifier of a target task, determining those of the plurality of computing modules that execute the target task as the target computing modules. The identifier of the target task includes at least one of a task name, a task type, and a task number, and may also include any other information that characterizes the target task; the present disclosure does not limit this.
Within the apparatus, the control module assigns one or more computing modules to each task (including the target task above) according to the task type and the working states of the computing modules, and those modules then execute the task.
In this way, all computing modules executing the target task can be synchronized.
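A minimal sketch of selecting target modules by task identifier, assuming the control module keeps a table that maps each computing module to the task it is executing. The table layout and the function name `modules_for_task` are hypothetical.

```python
def modules_for_task(assignments, task_id):
    """Return the IDs of computing modules currently executing `task_id`.

    `assignments` is a hypothetical bookkeeping table mapping
    module ID -> ID of the task that module is executing (None if idle).
    """
    return sorted(m for m, t in assignments.items() if t == task_id)

assignments = {0: "conv1", 1: "conv1", 2: "pool", 3: None}
targets = modules_for_task(assignments, "conv1")
```

Here modules 0 and 1 become the target computing modules; module 2 runs a different task and module 3 is idle, so neither is synchronized.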
In a possible implementation, the control module may determine the target computing modules from a preset target task identifier. In this case, the synchronization control instruction may contain only an opcode, and its instruction format may be "sync_all()". With this instruction, the apparatus can synchronize all computing modules of the target task identified by the preset target task identifier.
Optionally, the synchronization control instruction may be contained in a kernel function. The general-purpose processor of the apparatus can compile the instructions and other code in the kernel function and, by calling the corresponding interface, send the compiled kernel function to the corresponding computing modules on the artificial intelligence processor for execution. The apparatus can also determine the identifier of the target task from the characteristics of the kernel function, so that the synchronization control instruction in the kernel function can determine the target computing modules from that identifier.
FIG. 21-1b shows a schematic structural diagram of the module clusters in a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-1b, the apparatus includes a compilation module 100, a control module 200, an interconnect module 500, a global memory 400, a plurality of module clusters, and a shared storage 600 for each module cluster. Each shared storage 600 includes at least one DMA 610. Only module cluster 1 and module cluster 2 are shown in FIG. 21-1b; the remaining module clusters are structured similarly and are not shown. Each module cluster includes four computing modules: computing module 1, computing module 2, computing module 3, and computing module 4. The interconnect module 500 provides the communication interconnection among the global memory 400, the control module 200, and the module clusters. The global memory 400 provides storage for the control module 200 and the module clusters in the apparatus.
For example, assume that a target task identifier UNION specifies the corresponding computing modules by specifying a number of module clusters. UNION1 means that a task executed by calling the kernel function occupies 1 module cluster, i.e., 4 computing modules in total. UNION2 occupies 2 module clusters, i.e., 8 computing modules in total. UNION4 occupies 4 module clusters, i.e., 16 cores in total. UNION8 occupies 8 module clusters, i.e., 32 cores in total. Taking UNION1 as an example, the control module 200 may, according to "UNION1", designate module cluster 1, which is idle or otherwise able to execute the task, so that module cluster 1 executes the task "UNION1".
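The UNION arithmetic above (n clusters, 4n computing modules) and the assignment of idle clusters can be sketched as follows. The function names, the "UNIONn" string form, and the first-fit choice of idle clusters are assumptions; only the four-modules-per-cluster figure comes from FIG. 21-1b.

```python
import re

MODULES_PER_CLUSTER = 4  # each module cluster in FIG. 21-1b has four computing modules

def union_footprint(task_type: str):
    """Map 'UNIONn' to (module clusters occupied, computing modules in total)."""
    m = re.fullmatch(r"UNION(\d+)", task_type)
    if m is None:
        raise ValueError(f"not a UNION task type: {task_type!r}")
    clusters = int(m.group(1))
    return clusters, clusters * MODULES_PER_CLUSTER

def allocate(task_type: str, idle_clusters):
    """Hypothetical first-fit choice of idle clusters for the task."""
    clusters, _ = union_footprint(task_type)
    if len(idle_clusters) < clusters:
        raise RuntimeError("not enough idle module clusters")
    return list(idle_clusters[:clusters])
```

With this sketch, `union_footprint("UNION8")` gives 8 clusters and 32 computing modules, matching the counts listed above.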
In a possible implementation, the plurality of computing modules are divided into a plurality of module clusters, each including one or more computing modules (as shown in FIG. 21-1b). Determining the target computing modules that need to execute the synchronization control instruction may then include: according to the identifier of the target task, determining all computing modules in the target module clusters (those module clusters involved in executing the target task) as the target computing modules, where all or some of the computing modules in a target cluster execute the target task.
In this way, all computing modules in the target module clusters involved in executing the target task can be synchronized.
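The cluster-level rule differs from the per-module rule: every module in a cluster that touches the task is synchronized, even modules in that cluster that are not running the task. A minimal sketch, assuming two hypothetical lookup tables kept by the control module:

```python
def cluster_targets(cluster_of, task_of, task_id):
    """Return all modules in every cluster that has at least one module on `task_id`.

    cluster_of: hypothetical table, module ID -> cluster ID
    task_of:    hypothetical table, module ID -> task ID (None if idle)
    """
    clusters = {cluster_of[m] for m, t in task_of.items() if t == task_id}
    return sorted(m for m, c in cluster_of.items() if c in clusters)

cluster_of = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
task_of = {0: "A", 1: "A", 2: None, 3: None, 4: "B", 5: None, 6: None, 7: None}
targets_a = cluster_targets(cluster_of, task_of, "A")
```

Task "A" runs only on modules 0 and 1, yet all four modules of cluster 0 become target computing modules.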
For example, the control module may determine, from a preset target task identifier, the number of module clusters the identified task occupies, and from that determine the target computing modules that need to execute the synchronization control instruction. In this case, the synchronization control instruction may contain only an opcode, and its instruction format may be "sync_all0()". With this instruction, the apparatus can synchronize the computing modules in all module clusters of the target task identified by the preset target task identifier.
Optionally, the synchronization control instruction may be contained in a kernel function. The general-purpose processor of the apparatus can compile the instructions and other code in the kernel function and, by calling the corresponding interface, send the compiled kernel function to the corresponding computing modules on the artificial intelligence processor for execution. The apparatus can also determine the identifier of the target task from the characteristics of the kernel function, so that the synchronization control instruction in the kernel function can determine the target module clusters from that identifier. The control module can then treat every computing module in those target clusters as a target computing module.
In a possible implementation, the opcode or the operand field may indicate the identifier of the target task.
In a possible implementation, the instruction format of the synchronization control instruction may be "sync_sign2_all1()", where sign2 is the identifier of the target task. With this instruction, the apparatus can synchronize all computing modules in the target module clusters involved in executing the target task identified as sign2.
In a possible implementation, the plurality of computing modules are divided into a plurality of module clusters, each including one or more computing modules, and the opcode or the operand field indicates the identifier of a target module cluster. Determining the target computing modules may then include: according to the identifier of the target module cluster, determining the computing modules that belong to that cluster as the target computing modules.
In this implementation, the identifier of a target module cluster may be its number, serial number, name, or any other information that distinguishes it among the module clusters; the present disclosure does not limit this.
In this way, all computing modules in one or more target module clusters can be synchronized.
In a possible implementation, the instruction format of the synchronization control instruction may be "sync_cluster", where cluster is the identifier of the target module cluster. With this instruction, the apparatus can synchronize all computing modules in the target cluster identified as cluster. When there are multiple target clusters, the instruction format may be "sync_cluster0 cluster1...clustern", where cluster0, cluster1, ..., clustern are the identifiers of the first, second, ..., n-th target module clusters, respectively; this synchronizes all computing modules in those module clusters.
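One way to read the multi-cluster form is sketched below: strip the "sync_" prefix, treat each remaining token as a cluster identifier, and expand each cluster into its computing modules. The "clusterK" naming with a numeric index, and the consecutive numbering of modules within a cluster, are assumptions made for this sketch only.

```python
import re

MODULES_PER_CLUSTER = 4  # assumption taken from FIG. 21-1b

def parse_sync_cluster(text):
    """Return sorted module IDs covered by 'sync_clusterA clusterB ...'.

    Assumes hypothetical identifiers 'clusterK' and that cluster K holds
    modules K*4 .. K*4+3.
    """
    tokens = text.split()
    if not tokens or not tokens[0].startswith("sync_"):
        raise ValueError(f"not a sync_cluster instruction: {text!r}")
    tokens[0] = tokens[0][len("sync_"):]
    modules = set()
    for tok in tokens:
        m = re.fullmatch(r"cluster(\d+)", tok)
        if m is None:
            raise ValueError(f"bad cluster identifier: {tok!r}")
        base = int(m.group(1)) * MODULES_PER_CLUSTER
        modules.update(range(base, base + MODULES_PER_CLUSTER))
    return sorted(modules)
```

Under these assumptions, `parse_sync_cluster("sync_cluster0 cluster2")` selects modules 0 to 3 and 8 to 11.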
In a possible implementation, the opcode or the operand field may indicate the identifiers of the target computing modules. Determining the target computing modules from the opcode or the operand field may then include: determining the target computing modules among the plurality of computing modules according to their identifiers.
In this implementation, the identifier of a target computing module may be its number, serial number, name, or any other information that distinguishes it among the computing modules; the present disclosure does not limit this.
In this way, one or more target computing modules can be synchronized.
In a possible implementation, the instruction format of the synchronization control instruction may be "syn_sign3_0 sign3_1...sign3_n", where sign3_0, sign3_1, ..., sign3_n are the identifiers of the first, second, ..., n-th target computing modules, respectively. With this instruction, the apparatus can synchronize the target computing modules identified as sign3_0, sign3_1, ..., sign3_n.
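For the explicit per-module form, the main extra step is checking that each named identifier actually refers to a computing module. The sketch assumes a hypothetical lookup table from module identifier to module index; the function name is likewise illustrative.

```python
def parse_syn_modules(text, module_index):
    """Resolve 'syn_<id0> <id1> ... <idn>' into module indices.

    module_index: hypothetical table mapping module identifier -> module index.
    """
    tokens = text.split()
    if not tokens or not tokens[0].startswith("syn_"):
        raise ValueError(f"not a syn_ instruction: {text!r}")
    tokens[0] = tokens[0][len("syn_"):]
    unknown = [t for t in tokens if t not in module_index]
    if unknown:
        raise KeyError(f"unknown module identifiers: {unknown}")
    return [module_index[t] for t in tokens]

index = {"sign3_0": 0, "sign3_1": 1, "sign3_2": 2}
targets_syn = parse_syn_modules("syn_sign3_0 sign3_2", index)
```

An instruction naming a module that does not exist is rejected instead of being silently ignored, which keeps the set of modules that must reach the sync point well defined.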
In a possible implementation, when neither the opcode nor the operand field of the synchronization control instruction indicates the target computing modules, the control module is further configured to determine the kernel function (kernel) containing the synchronization control instruction, and to determine the computing modules that call that kernel function as the target computing modules.
In this implementation, the apparatus can invoke one or more kernel functions, and a computing module can call a kernel function to execute a task that requires it. The synchronization control instruction can be written into the kernel function in advance, and the control module can determine the target computing modules from the records of which computing modules called the kernel function.
When the control module directs multiple computing modules to execute tasks, each computing module, either on its own or under the control of the control module, can determine which kernel function to call from task information such as the task type and the degree of task parallelism, in order to complete the task.
In this way, the target computing modules that call the kernel function containing the synchronization control instruction can be synchronized.
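The kernel-based rule only needs a record of which computing module called which kernel. A minimal sketch, assuming the record is a list of (module ID, kernel name) pairs; that record format and the kernel names are hypothetical.

```python
def kernel_targets(call_log, kernel_name):
    """Return the modules that called `kernel_name`, each listed once.

    call_log: hypothetical iterable of (module_id, kernel_name) records.
    """
    return sorted({m for m, k in call_log if k == kernel_name})

call_log = [(0, "matmul"), (1, "matmul"), (2, "relu"), (1, "matmul")]
targets_kernel = kernel_targets(call_log, "matmul")
```

Module 1 appears twice in the log but is counted once, so the target set is modules 0 and 1.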
In a possible implementation, the operand field includes a quantity threshold. The control module is further configured to switch the suspended target computing modules into the working state once it determines that the number of target computing modules in the suspended state has reached the quantity threshold.
In this implementation, once the target computing modules have been determined, the number of target computing modules to be synchronized can additionally be limited: the quantity threshold may be less than or equal to the number of target computing modules already determined.
In a possible implementation, the instruction format of the synchronization control instruction may be "barrier N", where N is the quantity threshold. "barrier" serves only to indicate that the instruction is a synchronization control instruction and that its target signal is the arrival signal. With this instruction, the apparatus can synchronize the target computing modules that call the kernel function containing the synchronization control instruction and, once it determines that the number of suspended target computing modules has reached the quantity threshold, switch the suspended target computing modules into the working state.
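The quantity-threshold check for "barrier N" can be modeled as a small counter, without threads: each arriving module is recorded as suspended, and once N have arrived the whole group is released and counting restarts. The class and method names are assumptions for illustration.

```python
class ThresholdBarrier:
    """Hypothetical model of 'barrier N': release suspended modules once N have arrived."""

    def __init__(self, threshold):
        if threshold <= 0:
            raise ValueError("quantity threshold must be positive")
        self.threshold = threshold
        self.suspended = []

    def arrive(self, module_id):
        """Record that `module_id` executed 'barrier N'.

        Returns the modules to switch back to the working state, or an
        empty list if the threshold has not been reached yet.
        """
        self.suspended.append(module_id)
        if len(self.suspended) < self.threshold:
            return []
        released, self.suspended = self.suspended, []
        return released

b = ThresholdBarrier(3)
results = [b.arrive(m) for m in range(4)]
```

With a threshold of 3, the first two arrivals return nothing, the third releases modules 0, 1, and 2 together, and the fourth arrival starts a new round.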
For the synchronization control instructions above, namely "sync_sign1_all0()", "sync_sign2_all1()", "sync_cluster", "sync_cluster0 cluster1...clustern", "syn_sign3_0 sign3_1...sign3_n", and "barrier N", the components sync, syn, and barrier can also indicate the target signal: sync may indicate the compute queue signal and the IO signal, while syn and barrier may indicate the arrival signal. The components sync, syn, barrier, all0(), all1(), cluster, cluster0 cluster1...clustern, sign3_0 sign3_1...sign3_n, and N may each be assigned to the opcode or the operand field as needed, and their positions within the instruction may likewise be chosen as needed; the present disclosure does not limit this. Moreover, these instructions are only a few examples of the technical solutions of the present disclosure; those skilled in the art may define instruction formats according to those solutions as needed, and the present disclosure does not limit this.
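The prefix-to-signal mapping described above can be sketched as a simple classifier over the leading component of a textual instruction. Matching on string prefixes is an assumption, since the disclosure allows these components to sit in either the opcode or the operand field. The order of the checks matters because every "sync..." string also begins with "syn".

```python
def target_signals(instruction):
    """Infer target signals from the leading component of a textual sync instruction."""
    head = instruction.split()[0]
    # Check "sync" before "syn": every "sync..." string also starts with "syn".
    if head.startswith("sync"):
        return frozenset({"compute_queue", "io"})
    if head.startswith("syn") or head.startswith("barrier"):
        return frozenset({"arrival"})
    raise ValueError(f"not a synchronization control instruction: {instruction!r}")
```

Under this sketch, "sync_cluster0 cluster1" maps to the compute queue and IO signals, while "syn_sign3_0 sign3_1" and "barrier 16" map to the arrival signal.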
FIG. 21-2 shows a block diagram of a synchronization control instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 21-2, the computing module 21-12 may include multiple arithmetic units 21-120, which perform the operations corresponding to the operation types of the computation instructions.
In this implementation, the arithmetic units may include adders, dividers, multipliers, comparators, and other units capable of performing arithmetic, logical, and similar operations on data. The types and numbers of arithmetic units may be chosen according to requirements such as the volume of data to be processed, the operation types, and the required processing speed and efficiency; the present disclosure does not limit this.
In a possible implementation, as shown in FIG. 21-2, the apparatus may further include a storage module 21-13, which stores the data to be operated on.
In this implementation, the storage module may include one or more of a cache and registers. The cache may include a scratchpad cache and may also include at least one NRAM (Neuron Random Access Memory). The cache can store the data to be operated on, and the registers can store scalar data within that data.
In a possible implementation, the cache may include a neuron cache. The neuron cache, i.e., the neuron random access memory mentioned above, can store neuron data within the data to be operated on, and the neuron data may include neuron vector data. The data to be operated on includes data related to scalar type conversion and/or data related to the operations of other computation instructions.
In a possible implementation, the apparatus may further include a direct memory access module for reading data from, or storing data to, the storage module.
In a possible implementation, the control module 21-11 may also generate a first assembly file from the synchronization control instruction and translate the first assembly file into a first binary file, where the first binary file is the compiled synchronization control instruction.
In a possible implementation, the control module 21-11 may also generate a second assembly file from a computation instruction and translate the second assembly file into a second binary file, where the second binary file is the compiled computation instruction.
In this implementation, the assembly file may also be translated into files in number bases other than binary; the present disclosure does not limit this.
In a possible implementation, the apparatus may be arranged in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the synchronization control instruction processing apparatus has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited to them. In fact, users may configure each module flexibly according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
应用示例Application examples
The following gives an application example according to an embodiment of the present disclosure, taking "synchronization control using the synchronization control instruction processing apparatus" as an exemplary application scenario, to make the flow of the apparatus easier to understand. Those skilled in the art should understand that the following application example is provided only to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 21-3 shows a schematic diagram of an application scenario of the synchronization control instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 21-3, the apparatus processes a synchronization control instruction as follows:
The compilation module 21-13 compiles the acquired synchronization control instruction 1 (for example, barrier 16) to obtain the compiled synchronization control instruction 1. The control module 21-11 parses the compiled instruction to obtain its opcode: the opcode of synchronization control instruction 1 is barrier, the count threshold is 16, and the target signal is determined from barrier to be the arrival signal. The control module 21-11 then sends the compiled synchronization control instruction to all operation modules of the apparatus.
When a target operation module among the plurality of operation modules 21-12 executes the compiled synchronization control instruction, it places the processing related to the arrival signal in the paused state.
The control module 21-11 also monitors the running states of the operation modules 21-12. When it determines that the number of operation modules in the paused state has reached the count threshold of 16, it causes the arrival-signal-related processing of those 16 paused operation modules 21-12 to enter the working state synchronously.
In this way, the synchronization control instruction processing apparatus can process synchronization control instructions efficiently and quickly. For the working process of each module above, see the related description earlier in this document.
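A minimal Python sketch of this application example follows. It assumes the instruction arrives as the text "barrier 16", that the only opcode-to-signal mapping is barrier to the arrival signal (as stated above), and that operation modules reach the instruction in a known order. The names parse_sync_instruction and run_barrier are hypothetical; this is an illustration of the counting logic, not a model of the hardware.

```python
from dataclasses import dataclass

# Hypothetical opcode table; only "barrier" -> arrival signal comes from the example.
OPCODE_TO_SIGNAL = {"barrier": "arrival"}


@dataclass
class SyncInstruction:
    opcode: str
    count_threshold: int
    target_signal: str


def parse_sync_instruction(text: str) -> SyncInstruction:
    """Split an assembly-style line such as 'barrier 16' into its fields."""
    opcode, *operands = text.split()
    if opcode not in OPCODE_TO_SIGNAL:
        raise ValueError(f"not a synchronization opcode: {opcode!r}")
    if len(operands) != 1:
        raise ValueError("expected exactly one operand: the count threshold")
    return SyncInstruction(opcode, int(operands[0]), OPCODE_TO_SIGNAL[opcode])


def run_barrier(inst: SyncInstruction, arrival_order: list) -> list:
    """Return the modules that resume together once the threshold is reached.

    Each module that reaches the instruction pauses its processing for the
    target signal; the control module counts paused modules and resumes all
    of them in one step when the count hits the threshold. Later arrivals
    are not modelled.
    """
    paused = []
    for module_id in arrival_order:
        paused.append(module_id)  # this module is now paused
        if len(paused) == inst.count_threshold:
            return sorted(paused)  # all paused modules resume together
    return []  # threshold never reached: every arrival stays paused


inst = parse_sync_instruction("barrier 16")
# 32 operation modules reaching the instruction in reverse index order.
released = run_barrier(inst, list(range(31, -1, -1)))
print(inst.target_signal, released)
```

Whether the hardware resets the count and reuses the barrier for modules arriving after the sixteenth is not stated in the source, so the sketch simply stops at the first release.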
FIG. 21-4 shows a flowchart of a synchronization control instruction processing method according to an embodiment of the present disclosure. The method can be applied to a computer device or similar equipment that includes a memory and a processor, where the memory stores the data used while executing the method and the processor performs the related processing and computation steps, such as steps S51-21 to S54-21 below. As shown in FIG. 21-4, the method is applied to the synchronization control instruction processing apparatus described above and includes steps S51-21 to S54-21.
In step S51-21, the control module is controlled to compile the acquired synchronization control instruction to obtain a compiled synchronization control instruction.
In step S52-21, the control module is controlled to parse the compiled synchronization control instruction to obtain its opcode and to determine the target operation modules that need to execute the instruction. The opcode indicates that the processing the synchronization control instruction performs on the plurality of operation modules of the apparatus is synchronization control.
In step S53-21, each target operation module is controlled to enter the paused state when it executes the compiled synchronization control instruction.
In step S54-21, the control module is controlled to monitor the running states of the plurality of operation modules and, upon determining that all target operation modules are in the paused state, to cause the paused target operation modules to enter the working state synchronously.
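Steps S53-21 and S54-21 behave like a barrier over the set of target modules. The sketch below models each operation module as a Python thread and the control module as an object guarding a threading.Condition. The class and method names are illustrative assumptions; a hardware control module would monitor module states directly rather than rely on OS-level condition variables.

```python
import threading


class ControlModule:
    """Resumes a fixed set of target modules once all of them have paused."""

    def __init__(self, target_ids):
        self.targets = frozenset(target_ids)
        self.paused = set()
        self.generation = 0  # incremented on each release so waiters can tell
        self.cond = threading.Condition()

    def reach_sync_instruction(self, module_id):
        """Called by an operation module when it executes the sync instruction."""
        with self.cond:
            if module_id not in self.targets:
                return  # non-target modules are not held
            generation = self.generation
            self.paused.add(module_id)  # step S53-21: enter the paused state
            if self.paused == self.targets:
                # Step S54-21: every target is paused, so release them together.
                self.paused.clear()
                self.generation += 1
                self.cond.notify_all()
            else:
                while generation == self.generation:
                    self.cond.wait()


events = []
events_lock = threading.Lock()
control = ControlModule(target_ids=[0, 1, 2, 3])


def operation_module(module_id):
    with events_lock:
        events.append(("arrive", module_id))
    control.reach_sync_instruction(module_id)
    with events_lock:
        events.append(("resume", module_id))


# Modules 0-3 are targets; module 4 is not and passes straight through.
threads = [threading.Thread(target=operation_module, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(events)
```

Every target's resume event is logged only after all four target arrive events, regardless of thread scheduling; the generation counter keeps a spurious wake-up from releasing a module early.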
In one possible implementation, the synchronization control instruction further includes an operand field. Either the opcode indicates the target signal on which the target operation modules need to synchronize, or the operand field contains that target signal, so that the control module determines the target signal from the opcode or the operand field.
The target operation module is further configured to, upon executing the compiled synchronization control instruction, place the processing corresponding to the target signal determined by the control module in the paused state.
The target signal includes at least one of the following: a compute queue signal, an IO signal, and an arrival signal.
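The sketch below illustrates pausing only the processing that matches the target signal while the module's other processing keeps running. The three signal kinds follow the text; the opcodes sync_io and sync_compute, and the rule that an explicit operand-field signal overrides the opcode's default, are assumptions made for illustration only.

```python
from enum import Enum


class TargetSignal(Enum):
    COMPUTE_QUEUE = "compute_queue"
    IO = "io"
    ARRIVAL = "arrival"


# Hypothetical opcode defaults; only barrier -> ARRIVAL appears in the source.
OPCODE_DEFAULT_SIGNAL = {
    "barrier": TargetSignal.ARRIVAL,
    "sync_io": TargetSignal.IO,
    "sync_compute": TargetSignal.COMPUTE_QUEUE,
}


def resolve_target_signal(opcode, operand_signal=None):
    """Operand-field signal wins if present (an assumption); else use the opcode."""
    if operand_signal is not None:
        return TargetSignal(operand_signal)
    return OPCODE_DEFAULT_SIGNAL[opcode]


class OperationModule:
    def __init__(self):
        # One flag per kind of processing; True means it is running.
        self.running = {signal: True for signal in TargetSignal}

    def execute_sync(self, signal):
        self.running[signal] = False  # pause only the matching processing

    def resume(self, signal):
        self.running[signal] = True  # invoked when the control module releases


m = OperationModule()
m.execute_sync(resolve_target_signal("barrier"))
print({signal.name: flag for signal, flag in m.running.items()})
```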
In one possible implementation, determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to an identifier of a target task, the operation modules among the plurality of operation modules that execute the target task as the target operation modules, where the identifier of the target task includes at least one of the following: a task name, a task type, and a task number.
In one possible implementation, the plurality of operation modules are divided into a plurality of module clusters, each containing one or more operation modules.
Determining the target operation modules that need to execute the synchronization control instruction then includes:
determining, according to the identifier of the target task, all operation modules in the target module cluster, that is, the cluster among the plurality of module clusters that is involved in executing the target task, as the target operation modules, where all or some of the operation modules in the target cluster are used to execute the target task, and the identifier of the target task includes at least one of the following: a task name, a task type, and a task number.
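The two task-based selection rules can be contrasted in a short Python sketch. It assumes each module records its cluster and the task it is running, and treats a task as matching when every supplied identifier field (name, type, number) agrees; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    name: str
    kind: str
    number: int


@dataclass
class Module:
    module_id: int
    cluster_id: int
    task: Optional[Task]  # None when the module is idle


def matches(task, name=None, kind=None, number=None):
    """True if the task agrees with every identifier field that was supplied."""
    if task is None:
        return False
    return ((name is None or task.name == name)
            and (kind is None or task.kind == kind)
            and (number is None or task.number == number))


def targets_by_task(modules, **task_id):
    """Rule 1: only the modules that execute the identified task."""
    return [m.module_id for m in modules if matches(m.task, **task_id)]


def targets_by_task_cluster(modules, **task_id):
    """Rule 2: every module in any cluster involved in the identified task."""
    clusters = {m.cluster_id for m in modules if matches(m.task, **task_id)}
    return [m.module_id for m in modules if m.cluster_id in clusters]


conv = Task("conv1", "convolution", 7)
pool = Task("pool1", "pooling", 8)
modules = [
    Module(0, 0, conv), Module(1, 0, None),
    Module(2, 1, pool), Module(3, 1, pool),
    Module(4, 2, conv), Module(5, 2, pool),
]
print(targets_by_task(modules, name="conv1"))          # [0, 4]
print(targets_by_task_cluster(modules, name="conv1"))  # [0, 1, 4, 5]
```

Module 1 is idle, yet the cluster rule still selects it because it shares cluster 0 with module 0; this matches the text's note that only all or some of a target cluster's modules actually run the task.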
In one possible implementation, the opcode or the operand field indicates the identifier of the target task.
In one possible implementation, the plurality of operation modules are divided into a plurality of module clusters, each containing one or more operation modules, and the opcode or the operand field indicates the identifier of a target module cluster.
Determining the target operation modules that need to execute the synchronization control instruction then includes:
determining, according to the identifier of the target module cluster, the operation modules among the plurality of operation modules that belong to the target module cluster as the target operation modules.
In one possible implementation, the opcode or the operand field indicates the identifiers of the target operation modules.
Determining the target operation modules that need to execute the synchronization control instruction then includes:
determining the target operation modules from among the plurality of operation modules according to those identifiers.
In one possible implementation, the control module is further configured to determine the kernel function in which the synchronization control instruction is located, and to determine the operation modules among the plurality of operation modules that call that kernel function as the target operation modules.
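The remaining selection rules (by cluster identifier, by module identifiers, and by the kernel function that contains the instruction) reduce to simple filters. In this sketch, kernel bodies are modelled as lists of instruction strings and each module records the kernels it calls; these representations and all names are assumptions, not the source's data structures.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    module_id: int
    cluster_id: int
    kernels: set = field(default_factory=set)  # kernel functions it calls


def targets_by_cluster_id(modules, cluster_id):
    return [m.module_id for m in modules if m.cluster_id == cluster_id]


def targets_by_module_ids(modules, module_ids):
    wanted = set(module_ids)
    return [m.module_id for m in modules if m.module_id in wanted]


def kernel_containing(kernel_bodies, instruction):
    """Name of the kernel function whose body contains the instruction."""
    for name, body in kernel_bodies.items():
        if instruction in body:
            return name
    raise LookupError(f"no kernel contains {instruction!r}")


def targets_by_kernel(modules, kernel_name):
    return [m.module_id for m in modules if kernel_name in m.kernels]


kernel_bodies = {
    "matmul_kernel": ["load", "mac", "barrier 4", "store"],
    "relu_kernel": ["load", "max", "store"],
}
modules = [
    Module(0, 0, {"matmul_kernel"}), Module(1, 0, {"relu_kernel"}),
    Module(2, 1, {"matmul_kernel"}), Module(3, 1),
]
kernel = kernel_containing(kernel_bodies, "barrier 4")
print(targets_by_cluster_id(modules, 1))       # [2, 3]
print(targets_by_module_ids(modules, [3, 0]))  # [0, 3]
print(targets_by_kernel(modules, kernel))      # [0, 2]
```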
In one possible implementation, the operand field includes a count threshold.
The control module is further configured to cause the paused target operation modules to enter the working state upon determining that the number of target operation modules in the paused state has reached the count threshold.
In one possible implementation, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, and the method further includes:
controlling the compilation module to compile the acquired computation instruction to obtain a compiled computation instruction;
controlling the control module to obtain the data to be operated on that is required to execute the compiled computation instruction, and to parse the compiled computation instruction into a plurality of operation instructions;
controlling the master operation sub-module to perform pre-processing on the data to be operated on and to exchange data and operation instructions with the plurality of slave operation sub-modules;
controlling the slave operation sub-modules to perform intermediate operations in parallel on the transmitted data and operation instructions to obtain a plurality of intermediate results;
controlling the master operation sub-module to perform subsequent processing on the plurality of intermediate results to obtain an operation result.
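The master/slave data flow can be sketched with a matrix-vector product: the master splits the rows (pre-processing), each slave computes the dot products for its rows (intermediate results), and the master concatenates them (subsequent processing). The choice of operation and all function names are illustrative. Python threads show the data flow rather than a real speed-up, since in CPython the global interpreter lock typically serializes pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def master_preprocess(matrix, vector, num_slaves):
    """Split rows into contiguous chunks, one per slave; each gets the full vector."""
    chunk = max(1, -(-len(matrix) // num_slaves))  # ceiling division, at least 1
    return [(matrix[i:i + chunk], vector) for i in range(0, len(matrix), chunk)]


def slave_compute(rows, vector):
    """Intermediate result: the dot product of each assigned row with the vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in rows]


def master_postprocess(partials):
    """Subsequent processing: concatenate the intermediate results in order."""
    return [y for part in partials for y in part]


def matvec(matrix, vector, num_slaves=4):
    tasks = master_preprocess(matrix, vector, num_slaves)
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # pool.map keeps results in submission order, so rows stay aligned.
        partials = list(pool.map(lambda t: slave_compute(*t), tasks))
    return master_postprocess(partials)


A = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
x = [2, 1]
print(matvec(A, x, num_slaves=2))  # [4, 10, 16, 22, 28]
```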
In one possible implementation, the method may further include storing the data to be operated on,
where the storage module includes at least one of a register and a cache;
the cache stores the data to be operated on and includes at least one neuron cache (NRAM);
the register stores scalar data in the data to be operated on;
the neuron cache stores neuron data in the data to be operated on, the neuron data including neuron vector data.
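To illustrate the split between registers and the neuron cache, the following sketch routes scalar operands to a small register file and vector operands to an NRAM-like buffer with simple bump allocation. The capacities, the allocation policy, and the returned location tuples are hypothetical.

```python
from numbers import Number


class StorageModule:
    """Scalars go to a register file; vectors go to an NRAM-like buffer."""

    def __init__(self, num_registers=8, nram_words=64):
        self.registers = [0] * num_registers
        self.nram = [0] * nram_words
        self.next_reg = 0   # next free register
        self.nram_top = 0   # next free NRAM word (simple bump allocation)

    def store(self, value):
        """Return ('reg', index) for a scalar or ('nram', offset, length) for a vector."""
        if isinstance(value, Number):
            if self.next_reg == len(self.registers):
                raise MemoryError("register file full")
            index = self.next_reg
            self.registers[index] = value
            self.next_reg += 1
            return ("reg", index)
        data = list(value)
        if self.nram_top + len(data) > len(self.nram):
            raise MemoryError("NRAM full")
        start = self.nram_top
        self.nram[start:start + len(data)] = data
        self.nram_top += len(data)
        return ("nram", start, len(data))


storage = StorageModule()
print(storage.store(0.5))           # ('reg', 0)
print(storage.store([1, 2, 3, 4]))  # ('nram', 0, 4)
print(storage.store([5, 6]))        # ('nram', 4, 2)
```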
In one possible implementation, the method may further include:
controlling the control module to store the compiled synchronization control instruction and the compiled computation instruction;
controlling the control module to parse the compiled synchronization control instruction and the compiled computation instruction separately to obtain the corresponding opcodes and operand fields;
controlling the control module to store an instruction queue, the instruction queue containing a plurality of to-be-executed instructions arranged in execution order, including the compiled synchronization control instruction and the compiled computation instruction.
In one possible implementation, the method may further include: upon determining that a first to-be-executed instruction among the plurality of to-be-executed instructions has a dependency on a zeroth to-be-executed instruction preceding it, buffering the first to-be-executed instruction, and executing it only after determining that the zeroth to-be-executed instruction has finished executing.
A dependency between the first to-be-executed instruction and the zeroth to-be-executed instruction preceding it may include the case where the first storage address interval, which holds the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which holds the data required by the zeroth to-be-executed instruction.
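The dependency rule reduces to an interval-overlap test. The sketch below assumes half-open address ranges [start, end) and that a group of mutually independent instructions finishes before the next group starts. It issues instructions in queue order and starts a new group (that is, holds the instruction back) whenever its range overlaps one still in the current group. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last address (half-open interval)


def overlaps(a, b):
    """True if [a.start, a.end) and [b.start, b.end) share at least one address."""
    return a.start < b.end and b.start < a.end


def schedule_waves(queue):
    """Group instructions, in queue order, into waves that run one after another.

    An instruction whose data range overlaps any instruction in the current
    wave depends on it, so it is buffered and starts a new wave, i.e. it
    runs only after the earlier instruction has finished.
    """
    waves = []
    for inst in queue:
        if not waves or any(overlaps(inst, prev) for prev in waves[-1]):
            waves.append([])
        waves[-1].append(inst)
    return [[inst.name for inst in wave] for wave in waves]


queue = [
    Instruction("load_a", 0, 64),
    Instruction("load_b", 64, 128),
    Instruction("add_ab", 32, 96),
    Instruction("store_c", 128, 192),
]
print(schedule_waves(queue))  # [['load_a', 'load_b'], ['add_ab', 'store_c']]
```

Here add_ab overlaps load_a, so it waits for the first wave; store_c overlaps nothing still in flight, so it issues alongside add_ab.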
In one possible implementation, controlling the control module to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction may include: generating a first assembly file from the synchronization control instruction and translating the first assembly file into a first binary file, the first binary file being the compiled synchronization control instruction.
In one possible implementation, compiling the acquired computation instruction to obtain the compiled computation instruction may include: generating a second assembly file from the computation instruction and translating the second assembly file into a second binary file, the second binary file being the compiled computation instruction.
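Translating an assembly file into a binary file amounts to encoding each instruction as a fixed-width word. The source does not give the instruction format, so the sketch below invents one (an 8-bit opcode in the high byte, a 24-bit operand below it, packed as little-endian 32-bit words) purely to show the step; the opcode value 0x21 is hypothetical.

```python
import struct

# Hypothetical encoding: 8-bit opcode in the high byte, 24-bit operand below.
OPCODES = {"barrier": 0x21}


def assemble_line(line):
    """Encode one assembly line such as 'barrier 16' as a 32-bit word."""
    mnemonic, *operands = line.split()
    operand = int(operands[0]) if operands else 0
    if not 0 <= operand < (1 << 24):
        raise ValueError(f"operand does not fit in 24 bits: {operand}")
    return (OPCODES[mnemonic] << 24) | operand


def assemble(asm_text):
    """Translate an assembly listing into a little-endian binary image.

    Blank lines and lines starting with ';' (comments) are skipped.
    """
    words = [
        assemble_line(line)
        for line in asm_text.splitlines()
        if line.strip() and not line.strip().startswith(";")
    ]
    return b"".join(struct.pack("<I", word) for word in words)


binary = assemble("; first assembly file\nbarrier 16\n")
print(binary.hex())  # 10000021
```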
It should be noted that although the synchronization control instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The synchronization control instruction processing method provided by the embodiments of the present disclosure has a wide range of application, processes synchronization control instructions efficiently and quickly, and improves the efficiency and speed of operations on data.
The foregoing may be better understood in light of the following clauses:
Clause U1. A synchronization control instruction processing apparatus, the apparatus comprising a compilation module, a control module, and a plurality of operation modules, wherein
the compilation module is configured to compile an acquired synchronization control instruction to obtain a compiled synchronization control instruction;
the control module is configured to parse the compiled synchronization control instruction to obtain an opcode of the synchronization control instruction, and to determine target operation modules that need to execute the synchronization control instruction;
the target operation modules are configured to enter a paused state upon executing the compiled synchronization control instruction;
the control module is further configured to monitor running states of the plurality of operation modules and, upon determining that all target operation modules are in the paused state, to cause the paused target operation modules to enter a working state synchronously;
wherein the opcode indicates that the processing the synchronization control instruction performs on the plurality of operation modules of the apparatus is synchronization control.
Clause U2. The apparatus according to Clause U1, wherein the synchronization control instruction further includes an operand field; the opcode indicates a target signal on which the target operation modules need to synchronize, or the operand field contains that target signal, so that the control module determines the target signal from the opcode or the operand field;
the target operation modules are further configured to, upon executing the compiled synchronization control instruction, place the processing corresponding to the target signal determined by the control module in the paused state;
wherein the target signal includes at least one of the following: a compute queue signal, an IO signal, and an arrival signal.
Clause U3. The apparatus according to Clause U1 or Clause U2, wherein determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to an identifier of a target task, the operation modules among the plurality of operation modules that execute the target task as the target operation modules, wherein the identifier of the target task includes at least one of the following: a task name, a task type, and a task number.
Clause U4. The apparatus according to Clause U1 or Clause U2, wherein the plurality of operation modules are divided into a plurality of module clusters, each module cluster including one or more operation modules,
and determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to an identifier of a target task, all operation modules in the target module cluster, among the plurality of module clusters, that is involved in executing the target task as the target operation modules, wherein all or some of the operation modules in the target cluster are used to execute the target task, and the identifier of the target task includes at least one of the following: a task name, a task type, and a task number.
Clause U5. The apparatus according to Clause U3 or Clause U4, wherein the opcode or the operand field indicates the identifier of the target task.
Clause U6. The apparatus according to Clause U1 or Clause U2, wherein the plurality of operation modules are divided into a plurality of module clusters, each module cluster including one or more operation modules, and the opcode or the operand field indicates an identifier of a target module cluster,
and determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to the identifier of the target module cluster, the operation modules among the plurality of operation modules that belong to the target module cluster as the target operation modules.
Clause U7. The apparatus according to Clause U1 or Clause U2, wherein the opcode or the operand field indicates identifiers of the target operation modules,
and determining the target operation modules that need to execute the synchronization control instruction includes:
determining the target operation modules from among the plurality of operation modules according to the identifiers of the target operation modules.
Clause U8. The apparatus according to Clause U1 or Clause U2, wherein the control module is further configured to determine the kernel function in which the synchronization control instruction is located, and to determine the operation modules among the plurality of operation modules that call the kernel function as the target operation modules.
Clause U9. The apparatus according to Clause U3 or Clause U8, wherein the operand field includes a count threshold,
and the control module is further configured to cause the paused target operation modules to enter the working state upon determining that the number of target operation modules in the paused state has reached the count threshold.
Clause U10. The apparatus according to Clause U1, wherein each operation module includes a master operation sub-module and a plurality of slave operation sub-modules;
the compilation module is further configured to compile an acquired computation instruction to obtain a compiled computation instruction;
the control module is configured to obtain the data to be operated on that is required to execute the compiled computation instruction, parse the compiled computation instruction into a plurality of operation instructions, and send the data to be operated on and the plurality of operation instructions to the master operation sub-module;
the master operation sub-module is configured to perform pre-processing on the data to be operated on and to exchange data and operation instructions with the plurality of slave operation sub-modules;
the slave operation sub-modules are configured to perform intermediate operations in parallel on the data and operation instructions transmitted from the master operation sub-module to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master operation sub-module;
the master operation sub-module is further configured to perform subsequent processing on the plurality of intermediate results to obtain an operation result.
Clause U11. The apparatus according to Clause U10, further comprising:
a storage module configured to store the data to be operated on,
wherein the storage module includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one neuron cache (NRAM);
the register is configured to store scalar data in the data to be operated on;
the neuron cache is configured to store neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause U12. The apparatus according to Clause U10, wherein the control module includes:
an instruction storage sub-module configured to store the compiled synchronization control instruction and the compiled computation instruction;
an instruction processing sub-module configured to parse the compiled synchronization control instruction and the compiled computation instruction separately to obtain the corresponding opcodes and operand fields;
a queue storage sub-module configured to store an instruction queue, the instruction queue including a plurality of to-be-executed instructions arranged in execution order, the plurality of to-be-executed instructions including the compiled synchronization control instruction and the compiled computation instruction.
Clause U13. The apparatus according to Clause U12, wherein the control module further includes:
a dependency processing sub-module configured to, upon determining that a first to-be-executed instruction among the plurality of to-be-executed instructions has a dependency on a zeroth to-be-executed instruction preceding the first to-be-executed instruction, buffer the first to-be-executed instruction in the instruction storage sub-module and, after the zeroth to-be-executed instruction has finished executing, fetch the first to-be-executed instruction from the instruction storage sub-module and send it to the operation modules,
wherein the first to-be-executed instruction having a dependency on the zeroth to-be-executed instruction preceding it includes:
the first storage address interval, which stores the data required by the first to-be-executed instruction, overlapping the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction.
Clause U14. The apparatus according to Clause U1,
wherein the control module is further configured to generate a first assembly file from the synchronization control instruction and translate the first assembly file into a first binary file, the first binary file being the compiled synchronization control instruction.
Clause U15. A machine learning operation apparatus, comprising:
one or more synchronization control instruction processing apparatuses according to any one of Clauses U1 to U14, configured to obtain data to be operated on and control information from other processing apparatuses, perform specified machine learning operations, and pass the execution results to the other processing apparatuses through an I/O interface;
when the machine learning operation apparatus includes a plurality of the synchronization control instruction processing apparatuses, the plurality of synchronization control instruction processing apparatuses may be connected to one another and transmit data through a specific structure;
wherein the plurality of synchronization control instruction processing apparatuses are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of synchronization control instruction processing apparatuses share a single control system or have their own control systems; the plurality of synchronization control instruction processing apparatuses share memory or have their own memories; and the plurality of synchronization control instruction processing apparatuses may be interconnected in any interconnection topology.
Clause U16. A combined processing apparatus, comprising:
the machine learning operation apparatus according to Clause U15, a universal interconnection interface, and other processing apparatuses;
wherein the machine learning operation apparatus interacts with the other processing apparatuses to jointly complete a computing operation specified by a user,
and wherein the combined processing apparatus further includes a storage apparatus connected to the machine learning operation apparatus and to the other processing apparatuses, respectively, and configured to store data of the machine learning operation apparatus and the other processing apparatuses.
Clause U17. A machine learning chip, comprising:
the machine learning operation apparatus according to Clause U15 or the combined processing apparatus according to Clause U16.
Clause U18. An electronic device, comprising:
the machine learning chip according to Clause U17.
Clause U19. A board card, comprising a storage device, an interface apparatus, a control device, and the machine learning chip according to Clause U17;
wherein the machine learning chip is connected to the storage device, the control device, and the interface apparatus, respectively;
the storage device is configured to store data;
the interface apparatus is configured to implement data transmission between the machine learning chip and an external device;
the control device is configured to monitor the state of the machine learning chip.
条款U20.一种同步控制指令处理方法,所述方法应用于同步控制指令处理装置,所述装置包括多个运算模块、编译模块和控制模块,所述方法包括:Clause U20. A synchronous control instruction processing method, the method is applied to a synchronous control instruction processing device, the device includes a plurality of arithmetic modules, a compilation module, and a control module, the method includes:
控制所述编译模块对获取到的同步控制指令进行编译,得到编译后的同步控制指令;Controlling the compilation module to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction;
控制所述控制模块对所述编译后的同步控制指令进行解析,得到同步控制指令的操作码,并确定需要执行同步控制指令的目标运算模块;Controlling the control module to parse the compiled synchronous control instruction, obtain the operation code of the synchronous control instruction, and determine the target operation module that needs to execute the synchronous control instruction;
控制所述目标运算模块在执行到所述编译后的同步控制指令时,进入暂停状态;Controlling the target operation module to enter a suspended state when the compiled synchronous control instruction is executed;
控制所述控制模块监测所述多个运算模块的运行状态,在确定目标运算模块均处于暂停状态时,控制处于暂停状态的目标运算模块同步进入工作状态,Controlling the control module to monitor the operating states of the plurality of computing modules, and when it is determined that the target computing modules are all in the suspended state, controlling the target computing modules in the suspended state to enter the working state synchronously,
其中,所述操作码用于指示所述同步控制指令对装置的多个运算模块所进行的处理为同步控制。Wherein, the operation code is used to instruct the synchronization control instruction to process the multiple arithmetic modules of the device as synchronization control.
Clause U21. The method according to Clause U20, wherein the synchronization control instruction further includes an operation domain; the operation code indicates the target signal that the target operation modules need to synchronize, or the operation domain includes that target signal, so that the control module determines the target signal from the operation code or the operation domain,
wherein controlling the target operation module to enter the paused state when it reaches the compiled synchronization control instruction includes:
controlling the target operation module, when it reaches the compiled synchronization control instruction, to pause the processing corresponding to the target signal determined by the control module,
wherein the target signal includes at least one of the following: a compute queue signal, an IO signal, and an arrival signal.
Clause U22. The method according to Clause U20 or Clause U21, wherein determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to an identifier of a target task, the operation modules among the plurality of operation modules that execute the target task as the target operation modules, where the identifier of the target task includes at least one of the following: task name, task type, and task number.
Clause U23. The method according to Clause U19 or Clause U20, wherein the plurality of operation modules are divided into a plurality of module clusters, each module cluster including one or more operation modules,
wherein determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to an identifier of a target task, all operation modules in the target module cluster that is involved in executing the target task, among the plurality of module clusters, as the target operation modules, where all or some of the operation modules in the target cluster are used to execute the target task, and the identifier of the target task includes at least one of the following: task name, task type, and task number.
Clause U24. The method according to Clause U20 or U21, wherein the plurality of operation modules are divided into a plurality of module clusters, each module cluster including one or more operation modules, and the operation code or the operation domain indicates an identifier of a target module cluster,
wherein determining the target operation modules that need to execute the synchronization control instruction includes:
determining, according to the identifier of the target module cluster, the operation modules among the plurality of operation modules that belong to the target module cluster as the target operation modules.
Clause U25. The method according to Clause U20 or U21, wherein the operation code or the operation domain indicates identifiers of the target operation modules,
wherein determining the target operation modules that need to execute the synchronization control instruction includes:
determining the target operation modules from the plurality of operation modules according to their identifiers.
Clause U26. The method according to Clause U20 or U21, further comprising:
controlling the control module to determine the kernel function in which the synchronization control instruction resides, and determining the operation modules among the plurality of operation modules that call that kernel function as the target operation modules.
Clause U27. The method according to Clause U22 or U26, wherein the operation domain includes a quantity threshold, the method further comprising:
controlling the control module, upon determining that the number of target operation modules in the paused state has reached the quantity threshold, to control the paused target operation modules to enter the working state.
Clause U28. The method according to Clause U20, wherein each operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the method further comprising:
controlling the compilation module to compile an acquired calculation instruction to obtain a compiled calculation instruction;
controlling the control module to obtain the data to be operated on that is required to execute the compiled calculation instruction, and to parse the compiled calculation instruction into a plurality of operation instructions;
controlling the master operation sub-module to perform preprocessing on the data to be operated on and to exchange data and operation instructions with the slave operation sub-modules;
controlling the slave operation sub-modules to perform intermediate operations in parallel on the transmitted data and operation instructions to obtain a plurality of intermediate results;
controlling the master operation sub-module to perform subsequent processing on the plurality of intermediate results to obtain an operation result.
Clause U29. The method according to Clause U28, further comprising:
controlling a storage module of the device to store the data to be operated on,
wherein the storage module includes at least one of a register and a cache;
the cache stores the data to be operated on and includes at least one neuron cache (NRAM);
the register stores the scalar data in the data to be operated on;
the neuron cache stores the neuron data in the data to be operated on, the neuron data including neuron vector data.
Clause U30. The method according to Clause U28, further comprising:
controlling the control module to store the compiled synchronization control instruction and the compiled calculation instruction;
controlling the control module to parse the compiled synchronization control instruction and the compiled calculation instruction separately to obtain the corresponding operation codes and operation domains;
controlling the control module to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled synchronization control instruction and the compiled calculation instruction.
Clause U31. The method according to Clause U30, further comprising:
controlling the control module, upon determining that a first instruction to be executed among the plurality of instructions to be executed has a dependency on a zeroth instruction to be executed that precedes it, to cache the first instruction and, after determining that the zeroth instruction has finished executing, to execute the first instruction,
wherein the first instruction to be executed having a dependency on the zeroth instruction to be executed that precedes it includes:
the first storage address interval, which holds the data required by the first instruction to be executed, overlaps the zeroth storage address interval, which holds the data required by the zeroth instruction to be executed.
Clause U32. The method according to Clause U20, wherein
controlling the compilation module to compile the acquired synchronization control instruction to obtain the compiled synchronization control instruction includes:
generating a first assembly file from the synchronization control instruction and translating the first assembly file into a first binary file, the first binary file being the compiled synchronization control instruction.
Clause U33. The method according to Clause U22 or Clause U23, wherein the operation code or the operation domain indicates the identifier of the target task.
Clause U34. A non-volatile computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method of any one of Clauses U20 to U33.
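The clauses leave the hardware mechanism open. The following rough software model, in Python, sketches two parts of them: the barrier of Clauses U20 and U27, and the address-overlap dependency test of Clause U31. All names (ControlModule, on_sync_instruction, ranges_overlap) are hypothetical, and address intervals are assumed to be half-open. This is an illustrative sketch, not the disclosed implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    WORKING = "working"
    PAUSED = "paused"


@dataclass
class ControlModule:
    """Hypothetical software model of the control module's barrier logic."""

    targets: set[int]                # IDs of the target operation modules
    threshold: int | None = None     # optional quantity threshold (Clause U27)
    states: dict[int, State] = field(default_factory=dict)

    def on_sync_instruction(self, module_id: int) -> list[int]:
        """A module reached the sync instruction; return the IDs released, if any."""
        if module_id not in self.targets:
            return []                # non-target modules are not held at the barrier
        self.states[module_id] = State.PAUSED
        paused = sorted(m for m in self.targets if self.states.get(m) is State.PAUSED)
        needed = len(self.targets) if self.threshold is None else self.threshold
        if len(paused) < needed:
            return []                # keep waiting
        for m in paused:             # release every paused target together
            self.states[m] = State.WORKING
        return paused


def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end
```

Under Clause U31, a later instruction whose data interval overlaps an earlier one's would be cached until the earlier instruction finishes. Non-overlapping instructions need no such wait.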
FIG. 22-1 shows a block diagram of an interrupt storage instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 22-1, the device includes a control module 22-11, which includes an instruction compilation sub-module 22-111, a parameter acquisition sub-module 22-112, and an interrupt storage sub-module 22-113.
The instruction compilation sub-module 22-111 compiles an acquired interrupt storage instruction to obtain a compiled interrupt storage instruction.
The parameter acquisition sub-module 22-112 determines the storage parameters needed to handle an interrupt exit from the operation domain and operation code of the compiled interrupt storage instruction. The operation code indicates that the processing the interrupt storage instruction performs when the device interrupts and exits is interrupt storage processing. The storage parameters indicate which data must be stored when the device interrupts and exits. Interrupt storage processing consists of the device's interrupt exit and the storage of data according to the storage parameters.
The interrupt storage sub-module 22-113, when the compiled interrupt storage instruction is executed, causes the device to interrupt and exit, and stores data according to the storage parameters.
In this embodiment, while the device is being debugged or tested, the interrupt storage instruction allows the device to be interrupted and exited in real time. At that point it stores data that reflects the device's operating state, that is, the data selected by the storage parameters. Relevant personnel can then use the stored data to determine debugging and test results and to analyze how the device is running.
In this embodiment, the data to be stored can be determined from the storage parameters and then stored in the device's memory or in another location; the present disclosure does not limit this.
In this embodiment, the interrupt storage instruction acquired by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. The compiled interrupt storage instruction is a hardware instruction that the hardware can execute directly. The control module can obtain instructions and data through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented as a code), that specifies the operation to be performed in a computer program. It serves as the instruction's identifier and tells the executing device which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including the data to be operated on, the storage parameters, the corresponding operation method, and so on. An interrupt storage instruction must include both an operation code and an operation domain.
It should be understood that those skilled in the art can set the instruction format of the interrupt storage instruction, and the operation code and operation domain it contains, as required; the present disclosure does not limit this.
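Because the instruction format is left open, the Python sketch below uses an entirely hypothetical 32-bit layout to show how an operation code and operation-domain fields might be packed into, and split out of, one instruction word. The field widths, the opcode value 0x5A, and the names encode/decode are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical 32-bit layout; the disclosure does not fix any encoding.
#   bits 31..24  operation code
#   bits 23..16  storage space identifier  (operation domain)
#   bits 15..0   target storage amount     (operation domain, 0 = default)
OPCODE_INTERRUPT_STORE = 0x5A        # made-up opcode value


@dataclass(frozen=True)
class InterruptStoreInstr:
    opcode: int
    space_id: int
    amount: int


def encode(space_id: int, amount: int) -> int:
    """Pack an interrupt storage instruction into one 32-bit word."""
    if not (0 <= space_id <= 0xFF and 0 <= amount <= 0xFFFF):
        raise ValueError("field out of range for this layout")
    return (OPCODE_INTERRUPT_STORE << 24) | (space_id << 16) | amount


def decode(word: int) -> InterruptStoreInstr:
    """Split a 32-bit word into its operation code and operation-domain fields."""
    return InterruptStoreInstr(
        opcode=(word >> 24) & 0xFF,
        space_id=(word >> 16) & 0xFF,
        amount=word & 0xFFFF,
    )
```

For example, encode(3, 256) gives 0x5A030100, and decode recovers the same three fields.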
In this embodiment, the device may include one or more control modules, and their number may be set according to actual needs; the present disclosure does not limit this.
The interrupt storage instruction processing device provided by the embodiments of the present disclosure includes a control module with three sub-modules. The instruction compilation sub-module compiles an acquired interrupt storage instruction to obtain a compiled interrupt storage instruction. The parameter acquisition sub-module determines, from the operation domain and operation code of the compiled instruction, the storage parameters needed to handle an interrupt exit. The interrupt storage sub-module, when the compiled interrupt storage instruction is executed, causes the device to interrupt and exit and stores data according to the storage parameters. The interrupt storage instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications. They process interrupt storage instructions efficiently and quickly, respond efficiently and quickly to interrupt exits of the device, and improve the efficiency and speed of data operations.
FIG. 22-2a shows a block diagram of an interrupt storage instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 22-2a, the device may further include an operation module 22-12.
The control module 22-11 is further configured to compile an acquired calculation instruction to obtain a compiled calculation instruction, and to obtain the data to be operated on that is required to execute it.
The operation module 22-12 is configured to operate on the data to be operated on according to the compiled calculation instruction to obtain an operation result.
In this case, causing the device to interrupt and exit when the compiled interrupt storage instruction is executed may include: when the compiled interrupt storage instruction is executed, controlling the operation module to interrupt execution of the compiled calculation instruction currently in progress.
The operation module 22-12 may include a plurality of operators 22-120, which perform the operations corresponding to the operation type of the calculation instruction.
In this implementation, if the control module reaches an interrupt storage instruction while a calculation instruction is being executed, the calculation currently in progress is interrupted and data is stored according to the storage parameters.
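As an illustration of interrupting a calculation mid-flight, the Python model below represents a calculation instruction as a generator that yields after each element. An interrupt storage instruction is assumed to arrive after a given number of elements. At that point the loop abandons the calculation and saves only the registers named by the storage parameters. The names vector_add, execute, interrupt_after, and store_keys are hypothetical, and the model does not describe real hardware timing.

```python
from __future__ import annotations

from typing import Iterator


def vector_add(a: list[int], b: list[int], out: list[int]) -> Iterator[int]:
    """A calculation instruction modelled as a generator that yields after each element."""
    for i in range(len(a)):
        out[i] = a[i] + b[i]
        yield i


def execute(a: list[int], b: list[int], interrupt_after: int | None,
            regs: dict[str, int], store_keys: list[str]) -> dict:
    """Run vector_add; if an interrupt storage instruction arrives after
    `interrupt_after` elements, abandon the calculation and save the
    registers named by the (hypothetical) storage parameters `store_keys`."""
    out = [0] * len(a)
    steps = 0
    for _ in vector_add(a, b, out):
        steps += 1
        if interrupt_after is not None and steps == interrupt_after:
            saved = {k: regs[k] for k in store_keys}
            return {"interrupted": True, "steps": steps, "out": out, "saved": saved}
    return {"interrupted": False, "steps": steps, "out": out, "saved": {}}
```

Interrupting after two of four elements leaves a partially written output, [11, 22, 0, 0] for the inputs used in the check below. Only the selected register is saved.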
In this implementation, a calculation instruction is an instruction, distinct from the interrupt storage instruction, that performs arithmetic, logical, or other operations on data such as scalars, vectors, matrices, and tensors. Examples include a scalar calculation instruction and a convolution calculation instruction. Those skilled in the art may define calculation instructions according to actual needs; the present disclosure does not limit this. The calculation instruction acquired by the control module is an uncompiled software instruction that the hardware cannot execute directly, so the control module must first compile it. The compiled calculation instruction is a hardware instruction that the hardware can execute directly.
In this implementation, the control module is further configured to parse the compiled calculation instruction to obtain its operation code and operation domain, and to obtain the data to be operated on according to them.
In this implementation, the operators may include adders, dividers, multipliers, comparators, and other units capable of performing arithmetic, logical, and similar operations on data. The type and number of operators can be chosen according to the volume of data to be processed, the operation types, and the required processing speed and efficiency; the present disclosure does not limit this.
FIG. 22-2b shows a block diagram of an interrupt storage instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 22-2b, the operation module 22-12 may include a master operation sub-module 22-121 and a plurality of slave operation sub-modules 22-122. The master operation sub-module 22-121 may include a plurality of operators, and/or each slave operation sub-module 22-122 may include a plurality of operators (not shown in the figure).
The control module 22-11 is further configured to parse the compiled calculation instruction into a plurality of operation instructions, and to send the data to be operated on and the plurality of operation instructions to the master operation sub-module 22-121.
The master operation sub-module 22-121 is configured to perform preprocessing on the data to be operated on and to exchange data and operation instructions with the plurality of slave operation sub-modules 22-122.
The slave operation sub-modules 22-122 are configured to perform intermediate operations in parallel on the data and operation instructions received from the master operation sub-module 22-121 to obtain a plurality of intermediate results, and to transmit the intermediate results to the master operation sub-module 22-121.
The master operation sub-module 22-121 is further configured to perform subsequent processing on the plurality of intermediate results to obtain the operation result.
In this implementation, when the calculation instruction operates on scalar or vector data, the device may have the master operation sub-module perform the corresponding operation with its own operators. When the calculation instruction operates on data with two or more dimensions, such as matrices and tensors, the device may have the slave operation sub-modules perform the corresponding operation with their operators.
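The master/slave split above can be sketched in Python, with the following assumptions:

- Threads from concurrent.futures stand in for the slave operation sub-modules.
- A matrix-vector product serves as an arbitrary example of a two-dimensional operation.
- route reflects the dimension rule in the preceding paragraph.
- The names route, slave_matvec, and master_matvec are hypothetical.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def route(ndim: int) -> str:
    """Scalars (0-D) and vectors (1-D) stay on the master; 2-D and higher go to slaves."""
    return "master" if ndim < 2 else "slaves"


def slave_matvec(rows: list[list[float]], x: list[float]) -> list[float]:
    """Intermediate operation on one slave: multiply its block of rows by x."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def master_matvec(matrix: list[list[float]], x: list[float], n_slaves: int = 4) -> list[float]:
    if not matrix:
        return []
    # Preprocessing on the master: split the matrix into contiguous row blocks.
    size = -(-len(matrix) // n_slaves)           # ceiling division
    blocks = [matrix[i:i + size] for i in range(0, len(matrix), size)]
    # The slaves compute their intermediate results in parallel.
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(slave_matvec, blocks, [x] * len(blocks)))
    # Subsequent processing on the master: concatenate the partial results in order.
    return [y for part in partials for y in part]
```

Executor.map returns results in submission order, so concatenating the partial results restores the original row order.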
In one possible implementation, the storage parameters may include a storage space type and a storage space identifier. The operation code may also indicate the storage space type, and the operation domain may include the storage space identifier. Storing data according to the storage parameters may then include:
determining at least one target storage space that matches both the storage space type and the storage space identifier, and storing the data in that target storage space.
In this implementation, the target storage space may be the device's internal memory, such as caches and registers located in the device, or other internal storage related to the device's operation. The storage space type can convey information such as the location and access speed of the storage space. Each type of storage space can be assigned a code for use in the interrupt storage instruction; for example, registers may be coded "gpr" and NRAM "nram". The cache may include NRAM (Neuron Random Access Memory), a memory dedicated to storing neuron data. The device's memory may be used to store the data to be operated on and other data required to execute calculation instructions; the present disclosure does not limit this.
In this implementation, the storage space identifier may be any information that characterizes the storage space, such as its number or name in the device, or its type.
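The Python sketch below selects target storage spaces by type code and identifier. The "gpr" and "nram" codes come from the text above. The StorageSpace record, the interrupt_store function, and the use of a set of identifiers (to allow "at least one" target) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StorageSpace:
    type_code: str    # storage space type, e.g. "gpr" (registers) or "nram" (NRAM)
    space_id: int     # storage space identifier, e.g. its number in the device
    data: bytes


def interrupt_store(spaces: list[StorageSpace], type_code: str,
                    space_ids: set[int]) -> dict[tuple[str, int], bytes]:
    """Snapshot every space whose type and identifier both match."""
    targets = [s for s in spaces if s.type_code == type_code and s.space_id in space_ids]
    if not targets:
        raise LookupError(f"no {type_code} space with id in {sorted(space_ids)}")
    return {(s.type_code, s.space_id): s.data for s in targets}
```

A register and an NRAM bank can share the number 0 and still be told apart, because both the type code and the identifier must match.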
In one possible implementation, when there are multiple target storage spaces, the data in each target storage space forms one group of data to be stored, and the groups correspond to at least one data format.
In this implementation, the data format of data stored in a target storage space is not restricted. Data within one group share the same format, while different groups may use the same or different formats. A data format may include at least one of a data type and a data length; for example, the data in a target storage space may be 16-bit integers, 32-bit unsigned integers, and so on. Suppose that, after the device interrupts and exits, the storage parameters specify that the contents of registers 1, 2, and 3 must be stored. Register 1 holds 16-bit integer data, register 2 holds 32-bit unsigned integer data, and register 3 holds 8-bit integer data. The interrupt storage module then stores the 16-bit integer data from register 1, the 32-bit unsigned integer data from register 2, and the 8-bit integer data from register 3.
In this implementation, the interrupt storage module may store the data in the format it has in the target storage space. Alternatively, it may first convert all data in the target storage spaces to a specified data format and then store it.
In this implementation, the interrupt storage module may store all or only part of the data in a target storage space; the present disclosure does not limit this.
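Using the three-register example above, this Python sketch contrasts the two options: storing each group in its own format, or converting everything to one specified format first. The register values are made up, the 64-bit signed target format is an arbitrary choice, and the byte-level packing uses the standard struct module.

```python
import struct

# Formats follow the example in the text; the stored values are made up.
REGISTERS = {
    "reg1": ("<h", -1234),           # 16-bit signed integer
    "reg2": ("<I", 4_000_000_000),   # 32-bit unsigned integer
    "reg3": ("<b", -7),              # 8-bit signed integer
}


def dump_native(regs: dict) -> dict:
    """Store each group of data in the format it has in its target storage space."""
    return {name: struct.pack(fmt, value) for name, (fmt, value) in regs.items()}


def dump_unified(regs: dict, fmt: str = "<q") -> dict:
    """Convert every group to one specified format (64-bit signed here) before storing."""
    return {name: struct.pack(fmt, value) for name, (_, value) in regs.items()}
```

The native dump keeps the 2-, 4-, and 1-byte widths. The unified dump spends 8 bytes per register but gives a reader a single format to parse.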
In one possible implementation, the storage parameters may include a storage space identifier and an address of the data to be stored. The operation code may also indicate the storage space identifier, and the operation domain may include the address of the data to be stored. Storing data according to the storage parameters may then include:
determining the target storage space corresponding to the storage space identifier, reading the data to be stored from the given address in that target storage space, and storing it.
In one possible implementation, the storage spaces related to the device may contain only one storage space corresponding to the storage space identifier, and that storage space may be of only one type. In that case, the storage parameters may instead include the storage space type and the address of the data to be stored. The operation code may also indicate the storage space type, and the operation domain may include the address of the data to be stored. Storing data according to the storage parameters may then include: determining the target storage space corresponding to the storage space type, reading the data to be stored from the given address in that target storage space, and storing it.
In other words, whenever the address of the data to be stored, together with the storage space type, uniquely identifies the data to be stored, the storage parameters may consist of the storage space type and that address.
In one possible implementation, the storage parameters may further include a target storage amount, which the operation domain may also include. Reading the data to be stored from the given address in the target storage space and storing it may then include:
reading an amount of data equal to the target storage amount from the given address in the target storage space, and storing that data.
In one possible implementation, a default target storage amount may be set. When the target storage amount cannot be determined from the operation domain of the interrupt storage instruction, the default is used as the target storage amount for the current interrupt storage instruction. An amount of data equal to that default is then read from the given address in the target storage space and stored.
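The Python sketch below shows the address-plus-amount variant with a fallback to a default amount. The storage spaces are modelled as byte arrays keyed by identifier. The name read_for_store, the identifier "nram0", and the default of 16 bytes are hypothetical.

```python
from __future__ import annotations

DEFAULT_TARGET_AMOUNT = 16    # hypothetical default, in bytes


def read_for_store(spaces: dict[str, bytearray], space_id: str,
                   address: int, amount: int | None = None) -> bytes:
    """Read `amount` bytes (or the default) starting at `address` in the space `space_id`."""
    memory = spaces[space_id]                   # target space chosen by its identifier
    n = DEFAULT_TARGET_AMOUNT if amount is None else amount
    if address < 0 or n < 0 or address + n > len(memory):
        raise ValueError("requested range lies outside the target storage space")
    return bytes(memory[address:address + n])
```

Passing amount=None models an operation domain that carries no target storage amount.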
It should be understood that those skilled in the art can configure which operation module receives the interrupt storage instruction according to actual needs; the present disclosure does not limit this.
In a possible implementation, the operation domain may include an identifier indicating an interrupt storage space. The interrupt storage sub-module 22-113 may then store the data to be saved, as determined by the storage parameters, into the interrupt storage space corresponding to that identifier.
In a possible implementation, the interrupt storage space may include off-chip storage and/or on-chip storage of the device. The off-chip storage may include at least one DDR (i.e., DDR SDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), and each DDR may include at least one LDRAM (Local DRAM). The on-chip storage may include registers and/or NRAM, and there may be one or more of each type of on-chip storage space. The available storage space of the off-chip storage is less than or equal to a specified storage capacity.
In this implementation, the available storage space of the off-chip storage may be the portion of off-chip storage that the device can use to save data after an interrupt exit occurs while an interrupt storage instruction is being executed. Vector data may be stored in NRAM or LDRAM. The specified storage capacity may be set in advance according to the data operations performed by the device, for example, 1024 KB; the present disclosure does not limit this.
In this implementation, the identifier of the interrupt storage space in the operation domain may be any parameter that characterizes the interrupt storage space, such as its number, name, or start address; the present disclosure does not limit this.
In a possible implementation, if the operation domain does not include an identifier of the interrupt storage space, different kinds of data may be stored according to a preset default storage scheme. For example, by default, scalar data may be stored in registers and vector data in NRAM. Those skilled in the art may configure the default storage scheme according to actual needs; the present disclosure does not limit this.
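The fallback rule described above can be sketched as a small selection function. This is an illustrative sketch only: the function name choose_dump_space and the string names "register" and "nram" are hypothetical, and the defaults mirror the example in the text (scalars to registers, vectors to NRAM) rather than any fixed behavior of the device.

```python
from typing import Optional

# Example defaults taken from the text; a real device may configure these differently.
DEFAULT_SCALAR_SPACE = "register"
DEFAULT_VECTOR_SPACE = "nram"


def choose_dump_space(explicit_id: Optional[str], is_scalar: bool) -> str:
    """Return where interrupt-exit data should be saved.

    If the operation domain names an interrupt storage space, that space is used.
    Otherwise the preset default scheme applies: scalar data goes to registers
    and vector data goes to NRAM.
    """
    if explicit_id is not None:
        return explicit_id
    return DEFAULT_SCALAR_SPACE if is_scalar else DEFAULT_VECTOR_SPACE
```

For instance, choose_dump_space("ldram0", True) returns "ldram0" because an explicit identifier always takes precedence over the defaults.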
In a possible implementation, as shown in FIG. 22-2a, the device may further include a storage module 22-13. The storage module 22-13 may include off-chip storage and/or on-chip storage, and the on-chip storage may be used to store data to be operated on. The on-chip storage may include at least one of a register and a cache. The cache stores data to be operated on and includes at least one NRAM; the register stores the scalar data among the data to be operated on. The data to be operated on may include scalar data, vector data, tensor data, and other types of data, and may be the data used for operations in machine learning. Machine learning operations may include neural network operations.
In a possible implementation, the cache may include a neuron cache, which may be used to store the neuron data among the data to be operated on. Neuron data may be data used in neural network operations, such as vector data.
In a possible implementation, the device may further include a direct memory access (DMA) module for reading data from and storing data into the storage module.
In a possible implementation, the control module 22-11 may further be used to generate a first assembly file from the interrupt storage instruction and translate the first assembly file into a first binary file, the first binary file being the compiled interrupt storage instruction.
In a possible implementation, the control module 22-11 may further be used to generate a second assembly file from the calculation instruction and translate the second assembly file into a second binary file, the second binary file being the compiled calculation instruction.
In this implementation, the assembly file may also be translated into a file in a radix other than binary; the present disclosure does not limit this.
In a possible implementation, the instruction format of the interrupt storage instruction may be:
breakdump.type addrSpace sign0
Here, breakdump.type is the opcode of the interrupt storage instruction, and addrSpace and sign0 form its operation domain. The type in breakdump.type denotes the storage space type and may be, for example, gpr (registers). sign0 is a storage space identifier; when several storage spaces are involved, there may be several storage space identifiers. addrSpace is the identifier of the interrupt storage space. The instruction means: upon an interrupt exit of the device, the storage space of type type identified by sign0 is determined as the target storage space, and all data in the target storage space is saved to the interrupt storage space identified by addrSpace.
For example, to save data held in registers, suppose six registers need to be saved and their storage space identifiers are sign0, sign1, sign2, sign3, sign4, and sign5. The corresponding interrupt storage instruction may be: breakdump.gpr nram0 sign0 sign1 sign2 sign3 sign4 sign5. It means: upon an interrupt exit of the device, all data in the six registers identified as sign0 through sign5 is saved to the interrupt storage space identified by nram0.
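To make the split between opcode and operation domain concrete, the first format can be parsed as in the following minimal Python sketch. It assumes the instruction is available as whitespace-separated text exactly as in the example above; the class BreakdumpType and the function parse_breakdump_type are hypothetical names, and the patent does not define an assembler syntax beyond these examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BreakdumpType:
    """Fields of 'breakdump.<type> <addrSpace> <sign0> [<sign1> ...]'."""
    space_type: str      # opcode suffix, e.g. "gpr" for registers
    addr_space: str      # identifier of the interrupt storage space
    sources: List[str]   # storage space identifiers of the spaces to save


def parse_breakdump_type(text: str) -> BreakdumpType:
    """Split the instruction into its opcode and its operation domain."""
    opcode, *operands = text.split()
    prefix, dot, space_type = opcode.partition(".")
    if prefix != "breakdump" or not dot or not space_type:
        raise ValueError(f"not a breakdump opcode: {opcode!r}")
    if len(operands) < 2:
        raise ValueError("expected addrSpace and at least one storage space identifier")
    addr_space, *sources = operands
    return BreakdumpType(space_type, addr_space, sources)
```

Applied to the example instruction, the parser yields space type "gpr", interrupt storage space "nram0", and the six identifiers sign0 to sign5.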
In a possible implementation, the instruction format of the interrupt storage instruction may also be:
breakdump.sign addrSpace src size
Here, breakdump.sign is the opcode of the interrupt storage instruction, and addrSpace, src, and size form its operation domain. src is the address of the data to be saved, size is the target storage amount, and addrSpace is the identifier of the interrupt storage space. The instruction means: upon an interrupt exit of the device, the target storage space corresponding to the storage space identifier sign is determined, data of amount size is read starting at address src in the target storage space, and that data is saved to the interrupt storage space identified by addrSpace.
In a possible implementation, when data in memory such as NRAM needs to be saved, the storage parameters may include the storage space type, the address of the data to be saved, and the target storage amount. The interrupt storage instruction may then be: breakdump.nram ldram0 src size. It means: upon an interrupt exit of the device, data of amount size is read starting at address src in NRAM and saved to the interrupt storage space identified by ldram0.
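The effect of breakdump.nram ldram0 src size can be modeled as a bounded slice copy. This is a minimal sketch under stated assumptions: NRAM is modeled as a Python bytearray, interrupt storage spaces as a dict keyed by identifier, and size is taken to be a byte count (the text does not state the unit). The function name breakdump_nram is hypothetical.

```python
def breakdump_nram(nram: bytearray, src: int, size: int,
                   dump_spaces: dict, addr_space: str) -> None:
    """Model 'breakdump.nram <addrSpace> <src> <size>' on interrupt exit.

    Reads `size` units (assumed bytes) starting at address `src` of the NRAM
    model and saves them into the interrupt storage space named `addr_space`.
    """
    if src < 0 or size < 0 or src + size > len(nram):
        raise ValueError("source range lies outside NRAM")
    dump_spaces[addr_space] = bytes(nram[src:src + size])
```

With src = 500 and size = 1024, as in Example 2 below, the function saves NRAM addresses 500 to 1523 under the key "ldram0".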
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and an embedded neural-network processing unit (NPU).
It should be noted that although the interrupt storage instruction processing device has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each module according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
Application examples
An application example according to an embodiment of the present disclosure is given below, taking "executing an interrupt storage instruction with the interrupt storage instruction processing device" as an exemplary scenario, to aid understanding of the device's workflow. Those skilled in the art should understand that the following application examples are only intended to facilitate understanding of the embodiments of the present disclosure and should not be regarded as limiting them.
FIG. 22-3a and FIG. 22-3b are schematic diagrams of application scenarios of an interrupt storage instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 22-3a and FIG. 22-3b, the device processes an interrupt storage instruction as follows:
Example 1
As shown in FIG. 22-3a, the control module 22-11 compiles the obtained interrupt storage instruction 1 (for example, breakdump.gpr nram0 r0 r1 r2 r3 r4 r5) to obtain the compiled interrupt storage instruction 1, and parses it to obtain its opcode and operation domain. The opcode of interrupt storage instruction 1 is breakdump.gpr, where gpr is the storage space type and denotes registers; r0, r1, r2, r3, r4, and r5 are storage space identifiers; and nram0 is the identifier of the interrupt storage space. When the compiled interrupt storage instruction 1 is reached during execution, the device is controlled to interrupt and exit, and the data in the six registers identified as r0 through r5 is saved to the interrupt storage space identified by nram0.
Here, controlling the device to interrupt and exit may include interrupting the execution of the current calculation instruction when the compiled interrupt storage instruction 1 is reached.
Example 2
As shown in FIG. 22-3b, the control module 22-11 compiles the obtained interrupt storage instruction 2 (for example, breakdump.nram ldram0 500 1024) to obtain the compiled interrupt storage instruction 2, and parses it to obtain its opcode and operation domain. The opcode of interrupt storage instruction 2 is breakdump.nram, where nram is the storage space type and denotes NRAM; 500 is the address of the data to be saved; 1024 is the target storage amount; and ldram0 is the identifier of the interrupt storage space. When the compiled interrupt storage instruction 2 is reached during execution, the device is controlled to interrupt and exit, data of target storage amount 1024 is read starting at address 500 in NRAM, and that data is saved to the interrupt storage space identified by ldram0.
Here, controlling the device to interrupt and exit may include interrupting the execution of the current calculation instruction when the compiled interrupt storage instruction 2 is reached.
In this way, the interrupt storage instruction processing device can process interrupt storage instructions efficiently and quickly, respond efficiently and quickly to interrupt exits of the device, and improve the efficiency and speed of data operations. For the working process of each of the above modules, refer to the related description above.
FIG. 22-4 is a flowchart of an interrupt storage instruction processing method according to an embodiment of the present disclosure. The method may be applied to, for example, a computer device comprising a memory and a processor, where the memory stores data used while executing the method and the processor performs the related processing and operation steps, such as steps S51-22 to S53-22 below. As shown in FIG. 22-4, the method is applied to the interrupt storage instruction processing device described above and includes steps S51-22 to S53-22.
In step S51-22, the obtained interrupt storage instruction is compiled to obtain a compiled interrupt storage instruction.
In step S52-22, the storage parameters required for handling the interrupt exit are determined according to the operation domain and opcode of the compiled interrupt storage instruction. The opcode indicates that the processing performed by the interrupt storage instruction upon an interrupt exit of the device is interrupt storage processing, and the storage parameters indicate the data that needs to be saved upon the interrupt exit.
In step S53-22, when the compiled interrupt storage instruction is reached during execution, the device is controlled to interrupt and exit, and data is stored according to the storage parameters.
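The control flow of steps S51-22 to S53-22 can be illustrated with a toy interpreter, under several assumptions: compilation (step S51-22) is omitted and instructions are plain text; "add rd ra rb" is a hypothetical stand-in for a compiled calculation instruction; and the register file and interrupt storage spaces are Python dicts. The sketch shows only that reaching the interrupt storage instruction stops further execution and saves the data named by its storage parameters; it is not the device's actual pipeline.

```python
from typing import Dict, List


def run_until_breakdump(program: List[str], regs: Dict[str, int]) -> Dict[str, List[int]]:
    """Execute instructions in order until a breakdump.gpr instruction is reached."""
    dump: Dict[str, List[int]] = {}
    for line in program:
        op, *args = line.split()
        if op == "add":                      # hypothetical calculation instruction
            rd, ra, rb = args
            regs[rd] = regs[ra] + regs[rb]
        elif op == "breakdump.gpr":          # S52-22: storage parameters from the operands
            addr_space, *names = args
            dump[addr_space] = [regs[n] for n in names]   # S53-22: save the registers
            break                            # interrupt exit: later instructions do not run
        else:
            raise ValueError(f"unknown instruction: {op}")
    return dump
```

In a usage such as run_until_breakdump(["add r2 r0 r1", "breakdump.gpr nram0 r0 r1 r2", "add r0 r0 r0"], regs), the final add is never executed because the interrupt exit happens first.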
In a possible implementation, the operation domain may include an identifier indicating the interrupt storage space. Storing data according to the storage parameters may include:
saving the data to be saved, as determined by the storage parameters, into the interrupt storage space corresponding to the identifier of the interrupt storage space.
In a possible implementation, the interrupt storage space may include off-chip storage and/or on-chip storage of the device. The off-chip storage may include at least one DDR, each DDR may include at least one LDRAM, the on-chip storage may include at least one of registers and NRAM, and the available storage space of the off-chip storage is less than or equal to a specified storage capacity.
In a possible implementation, the storage parameters include a storage space type and a storage space identifier. The opcode also indicates the storage space type, and the operation domain includes the storage space identifier. Storing data according to the storage parameters may include:
determining at least one target storage space matching the storage space type and the storage space identifier, and saving the data in the target storage space.
In a possible implementation, when there are multiple target storage spaces, the data in each target storage space forms one group of data to be saved, and the multiple groups of data to be saved correspond to at least one data format.
In a possible implementation, the storage parameters may include a storage space identifier and the address of the data to be saved. The opcode may also indicate the storage space identifier, and the operation domain may include the address of the data to be saved. Storing data according to the storage parameters may include:
determining the target storage space corresponding to the storage space identifier, reading the data to be saved from the given address in the target storage space, and saving that data.
In a possible implementation, the storage parameters may further include a target storage amount, which the operation domain may also include. Reading the data to be saved from the given address in the target storage space and saving it may include:
reading data of the target storage amount starting at the given address in the target storage space, and saving that data.
In a possible implementation, the method may further include:
obtaining a calculation instruction, compiling it to obtain a compiled calculation instruction, and obtaining the data to be operated on that is required to execute the compiled calculation instruction; and
executing the compiled calculation instruction on the data to be operated on to obtain an operation result.
Here, controlling the device to interrupt and exit when the compiled interrupt storage instruction is reached may include: controlling the operation module to stop executing the current compiled calculation instruction when the compiled interrupt storage instruction is reached.
In a possible implementation, the method may further include:
parsing the compiled calculation instruction to obtain multiple operation instructions.
Executing the compiled calculation instruction on the data to be operated on to obtain the operation result may include:
performing pre-processing on the data to be operated on, and transmitting data and operation instructions between the master and slave operation sub-modules;
executing intermediate operations in parallel based on the transmitted data and operation instructions to obtain multiple intermediate results; and
performing post-processing on the multiple intermediate results to obtain the operation result.
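The pre-processing, parallel intermediate operation, and post-processing steps above can be illustrated with a dot product. The choice of workload is arbitrary, since the text does not specify what the intermediate operations compute. Threads from Python's standard concurrent.futures module stand in for the slave operation sub-modules, and the function names master_dot and slave_partial_dot are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def slave_partial_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Intermediate operation run by one slave: a partial dot product."""
    return sum(x * y for x, y in zip(a, b))


def master_dot(a: Sequence[float], b: Sequence[float], n_slaves: int = 4) -> float:
    if len(a) != len(b):
        raise ValueError("length mismatch")
    if n_slaves < 1:
        raise ValueError("need at least one slave")
    # Pre-processing: split the data into at most n_slaves contiguous chunks.
    step = max(1, -(-len(a) // n_slaves))  # ceiling division
    a_chunks = [a[i:i + step] for i in range(0, len(a), step)]
    b_chunks = [b[i:i + step] for i in range(0, len(b), step)]
    # Slaves execute the intermediate operations in parallel.
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials: List[float] = list(pool.map(slave_partial_dot, a_chunks, b_chunks))
    # Post-processing: the master combines the intermediate results.
    return sum(partials)
```

Here the master splits the input into contiguous chunks and sums the partial results. A real device might instead broadcast or tile the data, which this sketch does not attempt to model.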
In a possible implementation, the method may further include:
storing the data to be operated on in the on-chip storage of a storage module,
where the storage module may include off-chip storage and/or on-chip storage, and the on-chip storage may include at least one of a register and a cache;
the cache stores the data to be operated on and may include at least one NRAM; and
the register stores the scalar data among the data to be operated on.
In a possible implementation, the cache may include a neuron cache, which stores the neuron data among the data to be operated on.
In a possible implementation, the method may further include:
storing the compiled interrupt storage instruction and the compiled calculation instruction;
parsing the compiled interrupt storage instruction and the compiled calculation instruction separately to obtain the corresponding opcodes and operation domains; and
storing an instruction queue, which includes multiple instructions to be executed arranged in execution order, among them the compiled interrupt storage instruction and the compiled calculation instruction.
In a possible implementation, the method may further include: when it is determined that a first instruction to be executed among the multiple instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction and, after determining that the zeroth instruction has finished executing, controlling the execution of the first instruction.
Here, the first instruction to be executed having an association relationship with the preceding zeroth instruction to be executed may include: the first storage address interval, which stores the data required by the first instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth instruction.
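The association relationship above reduces to an interval-overlap test. The following is a minimal sketch that assumes each instruction's data occupies one contiguous, half-open address interval [start, end). The names AddrRange, overlaps, and must_wait are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddrRange:
    start: int  # first address used (inclusive)
    end: int    # one past the last address used (exclusive)


def overlaps(a: AddrRange, b: AddrRange) -> bool:
    """True if the two half-open address intervals share at least one address."""
    return a.start < b.end and b.start < a.end


def must_wait(first: AddrRange, zeroth: AddrRange, zeroth_done: bool) -> bool:
    """The first instruction stays cached while an earlier zeroth instruction
    that touches an overlapping address interval has not finished executing."""
    return overlaps(first, zeroth) and not zeroth_done
```

With half-open intervals, [0, 100) and [100, 200) are adjacent but do not overlap, so no dependency is reported between them.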
In a possible implementation, compiling the obtained interrupt storage instruction to obtain the compiled interrupt storage instruction may include: generating a first assembly file from the interrupt storage instruction and translating the first assembly file into a first binary file, the first binary file being the compiled interrupt storage instruction.
In a possible implementation, compiling the obtained calculation instruction to obtain the compiled calculation instruction may include: generating a second assembly file from the calculation instruction and translating the second assembly file into a second binary file, the second binary file being the compiled calculation instruction.
It should be noted that although the interrupt storage instruction processing method has been described above using the foregoing embodiments as examples, those skilled in the art will understand that the present disclosure is not limited thereto. In fact, users may flexibly configure each step according to personal preference and/or the actual application scenario, as long as the configuration conforms to the technical solutions of the present disclosure.
The interrupt storage instruction processing method provided by the embodiments of the present disclosure has a wide range of application, processes interrupt storage instructions efficiently and quickly, responds efficiently and quickly to interrupt exits of the device, and improves the efficiency and speed of data operations.
The foregoing may be better understood in light of the following clauses:
Clause V1. An interrupt storage instruction processing device, the device comprising a control module, the control module comprising:
an instruction compilation sub-module, configured to compile an obtained interrupt storage instruction to obtain a compiled interrupt storage instruction;
a parameter acquisition sub-module, configured to determine, according to the operation domain and opcode of the compiled interrupt storage instruction, the storage parameters required for handling an interrupt exit; and
an interrupt storage sub-module, configured to, when the compiled interrupt storage instruction is reached during execution, control the device to interrupt and exit and store data according to the storage parameters,
wherein the opcode indicates that the processing performed by the interrupt storage instruction upon an interrupt exit of the device is interrupt storage processing, and the storage parameters indicate the data that needs to be saved upon the interrupt exit of the device.
Clause V2. The device according to Clause V1, wherein the operation domain includes an identifier indicating an interrupt storage space,
and the interrupt storage sub-module is further configured to save the data to be saved, as determined by the storage parameters, into the interrupt storage space corresponding to the identifier of the interrupt storage space.
Clause V3. The device according to Clause V2, wherein the interrupt storage space includes off-chip storage and/or on-chip storage of the device,
the off-chip storage includes at least one DDR, the DDR includes at least one LDRAM, the on-chip storage includes at least one of registers and NRAM, and the available storage space of the off-chip storage is less than or equal to a specified storage capacity.
Clause V4. The device according to Clause V1, wherein the storage parameters include a storage space type and a storage space identifier,
the opcode further indicates the storage space type, and the operation domain includes the storage space identifier,
and storing data according to the storage parameters includes:
determining at least one target storage space matching the storage space type and the storage space identifier, and saving the data in the target storage space.
Clause V5. The device according to Clause V4, wherein, when there are multiple target storage spaces, the data in each target storage space forms one group of data to be saved, and the multiple groups of data to be saved correspond to at least one data format.
Clause V6. The device according to Clause V1, wherein the storage parameters include a storage space identifier and the address of the data to be saved,
the opcode further indicates the storage space identifier, and the operation domain includes the address of the data to be saved,
and storing data according to the storage parameters includes:
determining the target storage space corresponding to the storage space identifier, reading the data to be saved from the given address in the target storage space, and saving the data to be saved.
Clause V7. The device according to Clause V6, wherein the storage parameters further include a target storage amount, and the operation domain further includes the target storage amount,
and reading the data to be saved from the given address in the target storage space and saving it includes:
reading data of the target storage amount starting at the given address in the target storage space, and saving the data to be saved.
条款V8、根据条款V1所述的装置,所述装置还包括运算模块,Clause V8. The device according to Clause V1, the device further includes an arithmetic module,
所述控制模块,还用于获取计算指令,并对所述计算指令进行编译,得到编译后的计算指令,获取执行所述编译后的计算指令所需的待运算数据,并将所述待运算数据和所述编译后的计算指令发送至所述运算模块;The control module is also used to obtain calculation instructions and compile the calculation instructions to obtain the compiled calculation instructions, obtain the data to be calculated required to execute the compiled calculation instructions, and convert the calculation instructions The data and the compiled calculation instruction are sent to the calculation module;
所述运算模块,用于根据所述待运算数据,执行所述编译后的计算指令,得到运算结果,The calculation module is configured to execute the compiled calculation instruction according to the data to be calculated to obtain an operation result,
其中,在执行到所述编译后的中断存储指令时,控制所述装置中断退出,包括:Wherein, when the compiled interrupt storage instruction is executed, controlling the device to interrupt and exit includes:
在执行到编译后的中断存储指令时,控制所述运算模块中断当前所述编译后的计算指令的执行。When the compiled interruption storage instruction is executed, the operation module is controlled to interrupt the execution of the currently compiled calculation instruction.
条款V9、根据条款V8所述的装置,所述运算模块包括主运算子模块和多个从运算子模块,Clause V9. The device according to Clause V8, the operation module includes a master operation sub-module and a plurality of slave operation sub-modules,
所述控制模块,还用于解析所述编译后的计算指令得到多个运算指令,并将所述待运算数据和所 述多个运算指令发送至所述主运算子模块;The control module is further configured to parse the compiled calculation instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation submodule;
所述主运算子模块,用于对所述待运算数据执行前序处理,以及与所述多个从运算子模块进行数据和运算指令的传输;The master operation sub-module is used for performing pre-processing on the data to be operated, and transmitting data and operation instructions with the plurality of slave operation sub-modules;
所述从运算子模块,用于根据从所述主运算子模块传输的数据和运算指令并行执行中间运算得到多个中间结果,并将所述多个中间结果传输给所述主运算子模块;The slave operation sub-module is configured to execute intermediate operations in parallel according to data and operation instructions transmitted from the master operation sub-module to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master operation sub-module;
所述主运算子模块,还用于对所述多个中间结果执行后续处理,得到运算结果。The main operation sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain operation results.
Clause V10. The device according to Clause V8, further comprising a storage module, the storage module including off-chip storage and/or on-chip storage,
wherein the on-chip storage is configured to store the data to be operated on,
the on-chip storage includes at least one of a register and a cache,
the cache is configured to store the data to be operated on and includes at least one NRAM, and
the register is configured to store scalar data among the data to be operated on.
Clause V11. The device according to Clause V10, wherein the cache includes a neuron cache,
and the neuron cache is configured to store neuron data among the data to be operated on.
Clause V12. The device according to Clause V8, wherein the control module includes:
an instruction storage sub-module, configured to store the compiled interrupt-store instruction and the compiled calculation instruction;
an instruction processing sub-module, configured to parse the compiled interrupt-store instruction and the compiled calculation instruction separately to obtain the corresponding opcode and operation field of each; and
a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled interrupt-store instruction and the compiled calculation instruction.
Clause V13. The device according to Clause V12, wherein the control module further includes:
a dependency processing sub-module, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage sub-module, and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
wherein the first instruction to be executed having an association relationship with the preceding zeroth instruction to be executed includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause V14. The device according to Clause V1, wherein the control module is further configured to generate a first assembly file from the interrupt-store instruction and to translate the first assembly file into a first binary file, the first binary file being the compiled interrupt-store instruction.
Clause V15. A machine learning operation device, comprising:
one or more interrupt-store instruction processing devices according to any one of Clauses V1 to V14, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
when the machine learning operation device includes a plurality of interrupt-store instruction processing devices, the plurality of interrupt-store instruction processing devices may be connected through a specific structure and transmit data to one another;
wherein the plurality of interrupt-store instruction processing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; they share the same control system or have their own control systems; they share memory or have their own memories; and they may be interconnected in any interconnection topology.
Clause V16. A combined processing device, comprising:
the machine learning operation device according to Clause V15, a universal interconnection interface, and other processing devices;
wherein the machine learning operation device interacts with the other processing devices to jointly complete a computing operation specified by the user,
and wherein the combined processing device further includes a storage device, which is connected to both the machine learning operation device and the other processing devices and is configured to store data of the machine learning operation device and the other processing devices.
Clause V17. A machine learning chip, comprising:
the machine learning operation device according to Clause V15 or the combined processing device according to Clause V16.
Clause V18. An electronic device, comprising:
the machine learning chip according to Clause V17.
Clause V19. A board card, comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause V17;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause V20. An interrupt-store instruction processing method, applied to an interrupt-store instruction processing device, the method comprising:
compiling an obtained interrupt-store instruction to obtain a compiled interrupt-store instruction;
determining, according to the operation field and opcode of the compiled interrupt-store instruction, the storage parameters required for handling an interrupt exit;
when the compiled interrupt-store instruction is reached during execution, controlling the device to perform an interrupt exit and storing data according to the storage parameters,
wherein the opcode indicates that the processing the interrupt-store instruction performs when the device exits on an interrupt is interrupt-store processing, and the storage parameters indicate the data that needs to be stored when the device exits on an interrupt.
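The interrupt-store flow of Clause V20 can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: the opcode name `INT_STORE`, the operand keys `spaces` and `dest_id`, and the dictionary memory model are hypothetical, because the disclosure does not define an instruction encoding.

```python
from dataclasses import dataclass, field

# Hypothetical opcode name; the disclosure does not define an encoding.
INT_STORE = "INT_STORE"


@dataclass
class Instruction:
    opcode: str
    operands: dict = field(default_factory=dict)  # stands in for the operation field


@dataclass
class Device:
    memory: dict                                  # storage-space name -> list of values
    saved: dict = field(default_factory=dict)     # interrupt storage spaces, keyed by id
    executed: list = field(default_factory=list)  # opcodes that ran before the exit

    @staticmethod
    def storage_params(inst):
        # Storage parameters taken from the operation field (assumed keys):
        # which spaces to save and which interrupt storage space receives them.
        return inst.operands["spaces"], inst.operands["dest_id"]

    def run(self, program):
        for inst in program:
            if inst.opcode == INT_STORE:
                spaces, dest_id = self.storage_params(inst)
                # Interrupt exit: copy the requested data out, then stop executing.
                self.saved[dest_id] = {s: list(self.memory[s]) for s in spaces}
                return "interrupted"
            self.executed.append(inst.opcode)     # placeholder for real computation
        return "completed"


dev = Device(memory={"NRAM": [1, 2, 3], "REG": [7]})
program = [
    Instruction("ADD"),
    Instruction(INT_STORE, {"spaces": ["NRAM"], "dest_id": 0}),
    Instruction("MUL"),
]
status = dev.run(program)  # "MUL" never runs; only NRAM is saved under id 0
```

Here the identifier `dest_id` plays the role that Clause V21 gives to the interrupt storage space identifier.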
Clause V21. The method according to Clause V20, wherein the operation field includes an identifier indicating an interrupt storage space,
and storing data according to the storage parameters includes:
storing the data to be saved, obtained according to the storage parameters, into the interrupt storage space corresponding to the identifier.
Clause V22. The method according to Clause V21, wherein the interrupt storage space includes off-chip storage and/or on-chip storage of the device,
wherein the off-chip storage includes at least one DDR memory, the DDR memory includes at least one LDRAM, the on-chip storage includes at least one of a register and an NRAM, and the available storage space of the off-chip storage is less than or equal to a specified storage capacity.
Clause V23. The method according to Clause V20, wherein the storage parameters include a storage space type and a storage space identifier,
the opcode further indicates the storage space type, and the operation field includes the storage space identifier,
and storing data according to the storage parameters includes:
determining at least one target storage space matching the storage space type and the storage space identifier, and storing the data in the target storage space.
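Clause V23 selects target storage spaces by a type carried in the opcode and an identifier carried in the operation field. The sketch below assumes hypothetical opcode names and models each storage space as a list keyed by `(space_type, space_id)`. The float and int groups in the usage lines echo Clause V24: different groups may use different data formats.

```python
# Hypothetical mapping from opcode variants to storage-space types.
OPCODE_TO_TYPE = {"INT_STORE_NRAM": "NRAM", "INT_STORE_REG": "REG"}


def interrupt_store_by_type(opcode, space_ids, spaces):
    """Copy out every storage space whose type and identifier both match.

    spaces maps (space_type, space_id) -> list of values; the result maps
    each matching key to a copy of its data (one group per target space).
    """
    space_type = OPCODE_TO_TYPE[opcode]
    return {
        (stype, sid): list(data)
        for (stype, sid), data in spaces.items()
        if stype == space_type and sid in space_ids
    }


spaces = {("NRAM", 0): [1.5, 2.5], ("NRAM", 1): [3, 4], ("REG", 0): [9]}
saved = interrupt_store_by_type("INT_STORE_NRAM", {0, 1}, spaces)
# saved == {("NRAM", 0): [1.5, 2.5], ("NRAM", 1): [3, 4]}
```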
Clause V24. The method according to Clause V23, wherein, when there are multiple target storage spaces, the data in each target storage space forms one group of data to be stored, and the multiple groups of data to be stored correspond to at least one data format.
Clause V25. The method according to Clause V20, wherein the storage parameters include a storage space identifier and an address of the data to be stored,
the opcode further indicates the storage space identifier, and the operation field includes the address of the data to be stored,
and storing data according to the storage parameters includes:
determining the target storage space corresponding to the storage space identifier, obtaining the data to be stored from the address of the data to be stored in the target storage space, and storing that data.
Clause V26. The method according to Clause V25, wherein the storage parameters further include a target storage amount, and the operation field further includes the target storage amount,
and obtaining the data to be stored from the address of the data to be stored in the target storage space and storing that data includes:
obtaining, from the address of the data to be stored in the target storage space, an amount of data equal to the target storage amount, and storing that data.
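Clauses V25 and V26 narrow the store to a range: the opcode names the target space, and the operation field supplies a start address and a target storage amount. The following is a minimal sketch. It assumes a flat list per space and element-granular addresses, and the space name `LDRAM0` is hypothetical.

```python
def interrupt_store_range(space_id, address, amount, spaces):
    """Return a copy of `amount` elements starting at `address` in one space."""
    target = spaces[space_id]                # target space selected by identifier
    if address < 0 or amount < 0 or address + amount > len(target):
        raise ValueError("requested range lies outside the target storage space")
    return target[address:address + amount]  # slicing a list yields a copy


spaces = {"LDRAM0": list(range(10))}
print(interrupt_store_range("LDRAM0", 2, 3, spaces))  # [2, 3, 4]
```

The bounds check rejects a request that would read past the end of the space. The clauses leave this case unspecified, so the check is a design choice of the sketch.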
Clause V27. The method according to Clause V20, further comprising:
obtaining a calculation instruction, compiling the calculation instruction to obtain a compiled calculation instruction, and obtaining the data to be operated on that is required to execute the compiled calculation instruction; and
executing the compiled calculation instruction on the data to be operated on to obtain an operation result,
wherein controlling the device to perform an interrupt exit when the compiled interrupt-store instruction is reached includes:
when the compiled interrupt-store instruction is reached during execution, controlling the operation module to interrupt execution of the compiled calculation instruction currently being executed.
Clause V28. The method according to Clause V27, further comprising:
parsing the compiled calculation instruction to obtain a plurality of operation instructions;
wherein executing the compiled calculation instruction on the data to be operated on to obtain an operation result includes:
preprocessing the data to be operated on, and transmitting the data and operation instructions for parallel processing;
performing intermediate operations in parallel on the transmitted data and operation instructions to obtain a plurality of intermediate results; and
post-processing the plurality of intermediate results to obtain the operation result.
Clause V29. The method according to Clause V27, further comprising:
storing the data to be operated on in the on-chip storage of a storage module,
wherein the storage module includes off-chip storage and/or on-chip storage, and the on-chip storage includes at least one of a register and a cache;
the cache is configured to store the data to be operated on and includes at least one NRAM; and
the register is configured to store scalar data among the data to be operated on.
Clause V30. The method according to Clause V29, wherein the cache includes a neuron cache configured to store neuron data among the data to be operated on.
Clause V31. The method according to Clause V29, further comprising:
storing the compiled interrupt-store instruction and the compiled calculation instruction;
parsing the compiled interrupt-store instruction and the compiled calculation instruction separately to obtain the corresponding opcode and operation field of each; and
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the compiled interrupt-store instruction and the compiled calculation instruction.
Clause V32. The method according to Clause V31, further comprising:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed and, after determining that the zeroth instruction to be executed has finished executing, executing the first instruction to be executed,
wherein the first instruction to be executed having an association relationship with the preceding zeroth instruction to be executed includes:
the first storage address interval, which stores the data required by the first instruction to be executed, overlapping the zeroth storage address interval, which stores the data required by the zeroth instruction to be executed.
Clause V33. The method according to Clause V20, wherein compiling the obtained interrupt-store instruction to obtain the compiled interrupt-store instruction includes:
generating a first assembly file from the interrupt-store instruction and translating the first assembly file into a first binary file, the first binary file being the compiled interrupt-store instruction.
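Clause V33 compiles the interrupt-store instruction by generating an assembly file and translating it into a binary file. The disclosure gives no instruction format, so the following Python sketch invents one for illustration only. Each instruction becomes a 32-bit big-endian word made of an 8-bit opcode (the value 0x7E is arbitrary), an 8-bit storage space identifier, and a 16-bit address.

```python
import struct

# Hypothetical encoding: [opcode:8 | space_id:8 | address:16], big-endian.
OPCODES = {"INT_STORE": 0x7E}


def assemble(line):
    """Translate one assembly line such as 'INT_STORE 1, 0x40' into 4 bytes."""
    mnemonic, rest = line.split(maxsplit=1)
    space_id, address = (int(tok.strip(), 0) for tok in rest.split(","))
    if not (0 <= space_id < 2**8 and 0 <= address < 2**16):
        raise ValueError("operand does not fit its field")
    word = (OPCODES[mnemonic] << 24) | (space_id << 16) | address
    return struct.pack(">I", word)


def assemble_file(lines):
    """Translate an 'assembly file' (a list of lines) into one binary blob."""
    return b"".join(assemble(line) for line in lines if line.strip())


binary = assemble_file(["INT_STORE 1, 0x40", "", "INT_STORE 2, 0x10"])
print(binary.hex())  # 7e0100407e020010
```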
The operation module in any of the above devices (or the processing module in any of the above devices) may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122, so that calculation instructions can be processed alongside the instructions described above.
In one possible implementation, the control module is further configured to parse an obtained calculation instruction to obtain its operation field and opcode, and to obtain, according to the operation field and opcode, the data to be operated on that is required to execute the calculation instruction. The operation module is further configured to operate on that data according to the calculation instruction to obtain the calculation result of the calculation instruction. The operation module may include a plurality of operators for performing operations corresponding to the operation type of the calculation instruction.
In this implementation, the calculation instruction may be any other instruction that performs arithmetic, logical, or similar operations on data such as scalars, vectors, matrices, and tensors. Those skilled in the art may define calculation instructions according to actual needs; the present disclosure does not limit this.
In this implementation, the operators may include adders, dividers, multipliers, comparators, and other operators capable of performing arithmetic, logical, and similar operations on data. The types and number of operators may be chosen according to requirements such as the amount of data to be processed, the operation types, and the required processing speed and efficiency; the present disclosure does not limit this.
In one possible implementation, the control module is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and to send the data to be operated on and the plurality of operation instructions to the master operation sub-module 121.
The master operation sub-module 121 is configured to preprocess the data to be operated on and to exchange data and operation instructions with the plurality of slave operation sub-modules 122.
The slave operation sub-modules 122 are configured to perform intermediate operations in parallel on the data and operation instructions received from the master operation sub-module 121 to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the master operation sub-module 121.
The master operation sub-module 121 is further configured to post-process the plurality of intermediate results to obtain the calculation result of the calculation instruction, and to store the calculation result at the corresponding address.
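The master/slave division of labour described above can be sketched in Python, using a thread pool as a stand-in for the slave operation sub-modules. Matrix-vector multiplication, row-wise splitting, and the function names are assumptions made for illustration; the disclosure does not prescribe how work is partitioned.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_compute(rows, vector):
    """Intermediate operation on one slice: a dot product per matrix row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in rows]


def master_matvec(matrix, vector, num_slaves=4):
    # Preprocessing: split the rows into at most num_slaves contiguous slices.
    chunk = max(1, -(-len(matrix) // num_slaves))  # ceiling division
    slices = [matrix[i:i + chunk] for i in range(0, len(matrix), chunk)]
    # Distribution: each slice is handled by one worker in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_compute, slices, [vector] * len(slices)))
    # Subsequent processing: gather the intermediate results in order.
    return [y for part in partials for y in part]


print(master_matvec([[1, 2], [3, 4], [5, 6]], [1, 1], num_slaves=2))  # [3, 7, 11]
```

`pool.map` returns results in input order, so the master can concatenate the partial results directly.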
In this implementation, when the calculation instruction operates on scalar or vector data, the device may have the master operation sub-module perform the corresponding operation with its own operators. When the calculation instruction operates on data with two or more dimensions, such as matrices and tensors, the device may have the slave operation sub-modules perform the corresponding operation with their operators.
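The routing rule in the preceding paragraph sends scalars and vectors to the master and data of two or more dimensions to the slaves, so it reduces to a dimensionality check. The following is a minimal sketch, assuming data is represented as nested Python lists.

```python
def ndim(x):
    """Number of nesting levels of a (possibly nested) list; scalars have 0."""
    depth = 0
    while isinstance(x, (list, tuple)):
        depth += 1
        x = x[0] if x else None
    return depth


def choose_unit(operand):
    """Return which unit would execute an operation on `operand`."""
    return "master" if ndim(operand) <= 1 else "slaves"


print(choose_unit(3.0), choose_unit([1, 2]), choose_unit([[1, 2], [3, 4]]))
# master master slaves
```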
It should be noted that those skilled in the art may configure the connections between the master operation sub-module and the plurality of slave operation sub-modules according to actual needs, thereby defining the architecture of the operation module. For example, the operation module may use an "H"-type architecture, an array architecture, or a tree architecture; the present disclosure does not limit this.
FIG. 23a shows a block diagram of an operation module according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 23a, the operation module may further include one or more branch operation sub-modules 123, which forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. The master operation sub-module 121 is connected to the one or more branch operation sub-modules 123. The master, branch, and slave operation sub-modules are thus connected in an "H"-type architecture. Forwarding data and/or operation instructions through the branch operation sub-modules reduces the resources consumed on the master operation sub-module, which in turn increases instruction processing speed.
FIG. 23b shows a block diagram of an operation module according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 23b, the plurality of slave operation sub-modules 122 are arranged in an array of m rows and n columns.
Each slave operation sub-module 122 is connected to its adjacent slave operation sub-modules 122. The master operation sub-module 121 is connected to k of the slave operation sub-modules 122, namely: the n slave operation sub-modules 122 in row 1, the n slave operation sub-modules 122 in row m, and the m slave operation sub-modules 122 in column 1.
As shown in FIG. 23b, the k slave operation sub-modules consist only of the n slave operation sub-modules in row 1, the n in row m, and the m in column 1; that is, they are the slave operation sub-modules connected directly to the master operation sub-module. These k slave operation sub-modules forward data and instructions between the master operation sub-module and the other slave operation sub-modules. Arranging the slave operation sub-modules in an array in this way increases the speed at which the master operation sub-module sends data and/or operation instructions to the slave operation sub-modules, and thus increases instruction processing speed.
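For an array of m rows and n columns, the set of slave operation sub-modules wired directly to the master can be enumerated as below. Row 1 and row m each share a corner with column 1, so when m ≥ 2 the k directly connected sub-modules number 2n + m - 2, not 2n + m. The 1-indexed (row, column) positions and the reading of "adjacent" as the four orthogonal neighbours are assumptions of this sketch.

```python
def directly_connected(m, n):
    """(row, col) positions, 1-indexed, of slaves wired to the master:
    every slave in row 1, every slave in row m, and every slave in column 1."""
    first_row = {(1, c) for c in range(1, n + 1)}
    last_row = {(m, c) for c in range(1, n + 1)}
    first_col = {(r, 1) for r in range(1, m + 1)}
    return first_row | last_row | first_col


def neighbours(r, c, m, n):
    """Adjacent slaves (up, down, left, right) inside an m x n array."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 1 <= i <= m and 1 <= j <= n]


k_set = directly_connected(4, 3)
print(len(k_set))  # 8, i.e. 2*3 + 4 - 2
```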
FIG. 23c shows a block diagram of an operation module according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 23c, the operation module may further include a tree sub-module 124, which includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation sub-module 121, and the branch ports 402 are each connected to one of the slave operation sub-modules 122. The tree sub-module 124 can both send and receive, and it forwards data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. The tree sub-module thus connects the operation module in a tree architecture. Its forwarding function increases the speed at which the master operation sub-module sends data and/or operation instructions to the slave operation sub-modules, which in turn increases instruction processing speed.
In one possible implementation, the tree sub-module 124 is an optional structure of the device and may include at least one layer of nodes. Each node is a wiring structure with a forwarding function and has no computing function of its own. The nodes in the lowest layer are connected to the slave operation sub-modules to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. In the special case where the tree sub-module has zero layers of nodes, the device does not need the tree sub-module.
In one possible implementation, the tree sub-module 124 may include a plurality of nodes arranged in an n-ary tree structure, and these nodes may span multiple layers.
For example, FIG. 23d shows a block diagram of an operation module according to an embodiment of the present disclosure. As shown in FIG. 23d, the n-ary tree structure may be a binary tree, with the tree sub-module including two layers of nodes 01. The lowest-layer nodes 01 are connected to the slave operation sub-modules 122 to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122.
In this implementation, the n-ary tree structure may also be a ternary tree or the like, where n is a positive integer greater than or equal to 2. Those skilled in the art may choose n and the number of node layers in the n-ary tree structure as needed; the present disclosure does not limit this.
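How many node layers an n-ary tree needs depends on how many slave operation sub-modules it must reach. The disclosure does not state the following, so it is an assumption: the root port feeds a single top-layer node, and every node forwards to n children. Under that assumption L layers reach n^L slaves, and the smallest sufficient L can be computed as follows. A result of 0 loosely corresponds to the zero-layer case described above, in which no tree sub-module is needed.

```python
def tree_layers(n, num_slaves):
    """Smallest L such that an n-ary tree with L node layers reaches num_slaves."""
    if n < 2:
        raise ValueError("n must be a positive integer >= 2")
    layers, reach = 0, 1
    while reach < num_slaves:
        reach *= n  # each extra layer multiplies the fan-out by n
        layers += 1
    return layers


print(tree_layers(2, 4), tree_layers(2, 5), tree_layers(3, 9))  # 2 3 2
```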
FIG. 23e shows a block diagram of a control module according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 23e, the control module may include an instruction storage sub-module 114, an instruction processing sub-module 115, and a queue storage sub-module 116.
The instruction storage sub-module 114 is configured to store the above instructions and the calculation instructions.
The instruction processing sub-module 115 is configured to parse the above instructions and the calculation instructions separately to obtain the corresponding opcodes and operation fields. That is, it parses each of the above instructions to obtain its opcode and operation field, and it parses each calculation instruction to obtain its opcode and operation field.
The queue storage sub-module 116 is configured to store an instruction queue, which includes a plurality of instructions to be executed arranged in execution order; these may include the above instructions and the calculation instructions.
In this implementation, the execution order of the instructions to be executed may be determined by their receive times, priority levels, and so on to form the instruction queue, so that the instructions are executed one after another according to the queue.
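The queue can be ordered by priority first and receive time second, which can be sketched with Python's `heapq`. Two details are assumptions of this sketch: a smaller priority value runs first, and instructions are represented by string names.

```python
import heapq
import itertools


class InstructionQueue:
    """Pending instructions ordered by (priority, receive order)."""

    def __init__(self):
        self._heap = []
        self._receive_order = itertools.count()  # tie-breaker: earlier arrival first

    def push(self, instruction, priority=0):
        heapq.heappush(self._heap, (priority, next(self._receive_order), instruction))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


q = InstructionQueue()
q.push("CONV", priority=1)
q.push("INT_STORE", priority=0)
q.push("ADD", priority=1)
print([q.pop() for _ in range(len(q))])  # ['INT_STORE', 'CONV', 'ADD']
```

The receive counter is unique, so two entries never tie and the instruction objects themselves are never compared.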
In one possible implementation, as shown in FIG. 23e, the control module may further include a dependency processing sub-module 117. The dependency processing sub-module 117 may determine that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes it. In that case, it caches the first instruction to be executed in the instruction storage sub-module 114. After the zeroth instruction to be executed has finished executing, it fetches the first instruction to be executed from the instruction storage sub-module 114 and sends it to the operation module.
The first instruction to be executed has an association relationship with the preceding zeroth instruction to be executed when, for example, the first storage address interval overlaps the zeroth storage address interval. The first interval stores the data required by the first instruction to be executed, and the zeroth interval stores the data required by the zeroth instruction to be executed. Conversely, the two instructions may be considered unassociated when the first and zeroth storage address intervals do not overlap.
In this way, based on the dependency between the first instruction to be executed and the preceding zeroth instruction to be executed, the later first instruction is executed only after the earlier zeroth instruction has finished, which ensures the correctness of the operation results.
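The association test above reduces to an interval-overlap check on storage addresses. The following is a minimal sketch that assumes half-open intervals [start, end) and uses hypothetical instruction names. For each queued instruction, it lists the earlier instructions that instruction must wait for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pending:
    name: str
    start: int  # first address of the data the instruction needs
    end: int    # one past the last such address (half-open interval)


def overlaps(a, b):
    """Half-open intervals intersect iff each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def wait_lists(queue):
    """Map each instruction to the earlier instructions it depends on."""
    return {
        cur.name: [prev.name for prev in queue[:i] if overlaps(prev, cur)]
        for i, cur in enumerate(queue)
    }


queue = [Pending("I0", 0, 64), Pending("I1", 32, 96), Pending("I2", 128, 160)]
print(wait_lists(queue))  # {'I0': [], 'I1': ['I0'], 'I2': []}
```

Here I1 must wait for I0 because addresses 32 to 63 are shared. I2 touches a disjoint range and could be issued without waiting.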
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because some steps may be performed in other orders or simultaneously according to the present disclosure. Those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
It should be further noted that the steps in the flowcharts of FIGS. 1-4, 2-4, 3-4, 4-4, 5-4, 6-4, 7-4, 8-4, 9-4, 10-4, 11-4, 12-4, 13-4, 14-4, 15-4, 16-4, 17-4, 18-4, 19-4, 20-4, 21-4, and 22-4 are displayed in sequence as indicated by the arrows, but they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times. Nor are they necessarily executed sequentially: they may be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.
It should be understood that the device embodiments described above are merely illustrative, and the device of the present disclosure may also be implemented in other ways. For example, the division into units/modules in the above embodiments is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, and some features may be omitted or not implemented.
In addition, unless otherwise specified, the functional units/modules in the embodiments of the present disclosure may be integrated into one unit/module, may each exist physically on their own, or two or more of them may be integrated together. Such integrated units/modules may be implemented either in hardware or as software program modules.
If an integrated unit/module is implemented in hardware, the hardware may be a digital circuit, an analog circuit, or the like. Physical implementations of the hardware structure include, but are not limited to, transistors and memristors. Unless otherwise specified, the artificial intelligence processor may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit may be any suitable magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), or a hybrid memory cube (HMC).
If an integrated unit/module is implemented as a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features has been described; however, any combination of them that contains no contradiction should be regarded as within the scope of this specification.
The present disclosure provides a machine learning computing device, which may include one or more of the above instruction processing devices and is used to obtain data to be operated on and control information from other processing devices and to perform specified machine learning operations. The machine learning computing device may obtain instructions from other machine learning computing devices or from non-machine-learning computing devices, and transfer execution results to peripheral devices (also called other processing devices) through an I/O interface. Examples of peripheral devices include cameras, monitors, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one instruction processing device is included, the instruction processing devices may be linked through a specific structure to transmit data, for example interconnected through a PCIe bus, to support larger-scale neural network operations. In this case, they may share one control system or have separate control systems, and they may share memory or each accelerator may have its own memory. In addition, they may be interconnected in any topology.
The machine learning computing device is highly compatible and can be connected to various types of servers through a PCIe interface.
FIG. 24a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in FIG. 24a, the combined processing device includes the above machine learning computing device, a universal interconnection interface, and other processing devices. The machine learning computing device interacts with the other processing devices to jointly complete operations specified by the user.
The other processing devices include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, handling data transfer and basic control such as starting and stopping the machine learning computing device; they may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnection interface is used to transfer data and control instructions between the machine learning computing device and the other processing devices. The machine learning computing device obtains the required input data from the other processing devices and writes it into its on-chip storage; it may obtain control instructions from the other processing devices and write them into its on-chip control cache; and it may also read data from its storage module and transmit the data to the other processing devices.
FIG. 24b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 24b, the combined processing device may further include a storage device connected to both the machine learning computing device and the other processing devices. The storage device stores data of the machine learning computing device and the other processing devices, and is particularly suitable for data to be operated on that cannot be held entirely in the internal storage of the machine learning computing device or the other processing devices.
The combined processing device can serve as a system on chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the host device, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning computing device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
The present disclosure provides a board card. FIG. 25 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in FIG. 25, the board card includes the above machine learning chip package structure or the above machine learning chip. In addition to the machine learning chip 389, the board card may include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392.
The storage device 390 is connected to the machine learning chip 389 (or to the machine learning chip within the machine learning chip package structure) through a bus and is used to store data. The storage device 390 may include multiple groups of storage units 393, each of which is connected to the machine learning chip 389 through a bus. Each group of storage units 393 may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR doubles the speed of SDRAM without increasing the clock frequency: it transfers data on both the rising and the falling edge of each clock pulse, so at the same clock frequency DDR is twice as fast as standard SDRAM.
In one embodiment, the storage device 390 may include four groups of storage units 393, and each group may include multiple DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transfer and 8 bits for ECC (error-correcting code) checking. When DDR4-3200 chips are used in each group of storage units 393, the theoretical data transfer bandwidth of each group can reach 25600 MB/s.
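The 25600 MB/s figure follows directly from the DDR4-3200 transfer rate and the 64-bit data width. The short Python sketch below reproduces that arithmetic. The function name is illustrative, and the four-group aggregate at the end is an extrapolation that assumes all four controllers transfer concurrently at their peak rate, which the text above does not state.

```python
def ddr_peak_bandwidth_mb_s(transfer_rate_mt_s: int, data_bits: int) -> float:
    """Theoretical peak bandwidth in MB/s (10**6 bytes per second).

    transfer_rate_mt_s: million transfers per second (MT/s)
    data_bits: width of the data path in bits, ECC bits excluded
    """
    return transfer_rate_mt_s * data_bits / 8


# DDR4-3200: 1600 MHz I/O clock, data transferred on both clock edges.
io_clock_mhz = 1600
transfer_rate = io_clock_mhz * 2            # 3200 MT/s

# Each 72-bit controller carries 64 data bits plus 8 ECC bits.
controller_bits = 72
ecc_bits = 8
data_bits = controller_bits - ecc_bits      # 64

per_group = ddr_peak_bandwidth_mb_s(transfer_rate, data_bits)
# Assumption (not stated in the text): all four groups run concurrently at peak.
aggregate = 4 * per_group

print(per_group, aggregate)  # 25600.0 102400.0
```

The ECC bits occupy bus width but carry no payload, which is why the bandwidth is computed from 64 rather than 72 bits.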
In one embodiment, each group of storage units 393 includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 to control data transfer to, and data storage in, each storage unit 393.
The interface device 391 is electrically connected to the machine learning chip 389 (or to the machine learning chip within the machine learning chip package structure). The interface device 391 is used to transfer data between the machine learning chip 389 and an external device (such as a server or a computer). For example, in one embodiment the interface device 391 may be a standard PCIe interface, through which data to be processed is transferred from the server to the machine learning chip 389. Preferably, when a PCIe 3.0 x16 interface is used for transfer, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may be another type of interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface device can perform the transfer function. In addition, the computation results of the machine learning chip are likewise transmitted back to the external device (such as a server) by the interface device.
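The 16000 MB/s figure for PCIe 3.0 x16 is the raw per-direction rate: 8 GT/s per lane across 16 lanes. PCIe 3.0 also uses 128b/130b line encoding, so the usable rate is slightly lower. The Python sketch below, using an illustrative helper name, computes both numbers.

```python
def pcie_bandwidth_mb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 1, coded_bits: int = 1) -> float:
    """Per-direction link bandwidth in MB/s (10**6 bytes per second).

    gt_per_s: transfer rate per lane in GT/s (one bit per transfer per lane)
    payload_bits / coded_bits: line-encoding efficiency, e.g. 128/130 for PCIe 3.0
    """
    # GT/s -> Mbit/s per lane, times the lane count, /8 -> MB/s for the link.
    raw_mb_s = gt_per_s * 1000 * lanes / 8
    return raw_mb_s * payload_bits / coded_bits


raw = pcie_bandwidth_mb_s(8.0, 16)                  # ignores encoding overhead
encoded = pcie_bandwidth_mb_s(8.0, 16, 128, 130)    # after 128b/130b encoding
print(raw, round(encoded, 1))  # 16000.0 15753.8
```

Actual throughput is lower still, because packet headers and other protocol overhead also consume link capacity.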
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a microcontroller unit (MCU). Because the machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, it may drive multiple loads and can therefore be in different working states, such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing device, computer equipment, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a dashboard camera, a navigation device, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an airplane, a ship, and/or a car. The household appliance may include a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric light, a gas stove, and a range hood. The medical device may include a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
The present disclosure also provides a non-volatile computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above instruction processing method.
An embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to invoke the instructions stored in the memory to perform the above instruction processing method.
The embodiments of the present disclosure have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present disclosure, and the description of these embodiments is only intended to help in understanding the method of the present disclosure and its core idea. Meanwhile, any changes or modifications that those skilled in the art make to the specific implementations and scope of application based on the ideas of the present disclosure fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (25)

  1. An activation instruction processing device, characterized in that the device comprises:
    a control module, configured to compile an obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain an operation code and an operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction; and
    an operation module, configured to perform an activation operation on the data to be operated on to obtain an operation result, and store the operation result at the target address,
    wherein the operation code indicates that the operation performed on data by the activation instruction is an activation operation, and the operation domain includes an address of the data to be operated on and the target address.
  2. The device according to claim 1, characterized in that
    the control module is further configured to obtain an activation parameter table according to the operation code and/or the operation domain;
    the operation module is further configured to perform the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result,
    wherein the activation parameter table includes an activation table and a constant table.
  3. The device according to claim 1, characterized in that the operation module comprises:
    a plurality of activation operators, configured to perform the activation operation on the data to be operated on.
  4. The device according to claim 3, characterized in that the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of activation operators,
    the master operation sub-module being configured to use the plurality of activation operators to perform the activation operation on the data to be operated on to obtain the operation result, and store the operation result at the target address.
  5. The device according to claim 1, characterized in that the operation domain includes a read-in amount or a storage address of the read-in amount, wherein the control module is further configured to obtain the read-in amount and obtain the data to be operated on according to the read-in amount.
  6. The device according to claim 1, characterized in that the device further comprises:
    a storage module, configured to store the data to be operated on.
  7. The device according to claim 1, characterized in that the control module comprises:
    an instruction storage sub-module, configured to store the compiled activation instruction;
    an instruction processing sub-module, configured to parse the compiled activation instruction to obtain the operation code and the operation domain of the activation instruction; and
    a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged sequentially in execution order, the plurality of instructions to be executed including the compiled activation instruction.
  8. The device according to claim 7, characterized in that the control module further comprises:
    a dependency processing sub-module, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and, after execution of the zeroth instruction to be executed is completed, fetch the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
    wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it comprises:
    a first storage address interval storing data required by the first instruction to be executed and a zeroth storage address interval storing data required by the zeroth instruction to be executed have an overlapping area.
  9. The device according to claim 1, characterized in that
    the control module is further configured to generate an assembly file according to the activation instruction and translate the assembly file into a binary file,
    wherein the binary file is the compiled activation instruction.
  10. The device according to any one of claims 1 to 9, characterized in that the activation function used by the activation operation includes at least one of the following:
    a rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent function, a leaky rectified linear unit (Leaky ReLU) function, a maximum function, and a power function.
  11. A machine learning computing device, characterized in that the device comprises:
    one or more activation instruction processing devices according to any one of claims 1 to 10, configured to obtain data to be operated on and control information from other processing devices, perform specified machine learning operations, and transfer execution results to other processing devices through an I/O interface;
    when the machine learning computing device includes a plurality of the activation instruction processing devices, the plurality of activation instruction processing devices may be connected through a specific structure to transmit data;
    wherein the plurality of activation instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIe) bus to transmit data, so as to support larger-scale machine learning operations; the plurality of activation instruction processing devices share one control system or have their own control systems; the plurality of activation instruction processing devices share memory or have their own memories; and the plurality of activation instruction processing devices are interconnected in an arbitrary interconnection topology.
  12. A combined processing device, characterized in that the combined processing device comprises:
    the machine learning computing device according to claim 11, a universal interconnection interface, and other processing devices;
    the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user,
    wherein the combined processing device further comprises a storage device, which is connected to both the machine learning computing device and the other processing devices and is configured to store data of the machine learning computing device and the other processing devices.
  13. A machine learning chip, characterized in that the machine learning chip comprises:
    the machine learning computing device according to claim 11 or the combined processing device according to claim 12.
  14. An electronic device, characterized in that the electronic device comprises:
    the machine learning chip according to claim 13.
  15. A board card, characterized in that the board card comprises a storage device, an interface device, a control device, and the machine learning chip according to claim 13;
    wherein the machine learning chip is connected to the storage device, the control device, and the interface device, respectively;
    the storage device is configured to store data;
    the interface device is configured to implement data transmission between the machine learning chip and an external device; and
    the control device is configured to monitor the state of the machine learning chip.
  16. An activation instruction processing method, characterized in that the method is applied to an activation instruction processing device and comprises:
    using a control module to compile an obtained activation instruction to obtain a compiled activation instruction, parse the compiled activation instruction to obtain an operation code and an operation domain of the activation instruction, and obtain, according to the operation code and the operation domain, the data to be operated on and the target address required to execute the activation instruction; and
    using an operation module to perform an activation operation on the data to be operated on to obtain an operation result, and store the operation result at the target address,
    wherein the operation code indicates that the operation performed on data by the activation instruction is an activation operation, and the operation domain includes an address of the data to be operated on and the target address.
  17. The method according to claim 16, characterized in that the method further comprises:
    obtaining an activation parameter table according to the operation code and/or the operation domain;
    wherein using the operation module to perform the activation operation on the data to be operated on to obtain the operation result comprises:
    performing the activation operation on the data to be operated on according to the activation parameter table to obtain the operation result,
    wherein the activation parameter table includes an activation table and a constant table.
  18. The method according to claim 16, characterized in that using the operation module to perform the activation operation on the data to be operated on to obtain the operation result comprises:
    using a plurality of activation operators to perform the activation operation on the data to be operated on.
  19. The method according to claim 18, characterized in that the operation module includes a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module including the plurality of activation operators,
    wherein using the operation module to perform the activation operation on the data to be operated on to obtain the operation result comprises:
    using the plurality of activation operators in the master operation sub-module to perform the activation operation on the data to be operated on to obtain the operation result, and storing the operation result at the target address.
  20. The method according to claim 16, characterized in that the operation domain further includes a read-in amount or a storage address of the read-in amount,
    wherein obtaining, according to the operation code and the operation domain, the data to be operated on, the activation table, the constant table, and the target address required to execute the activation instruction comprises:
    obtaining the read-in amount, and obtaining the data to be operated on according to the read-in amount.
  21. The method according to claim 16, characterized in that the method further comprises:
    storing the data to be operated on.
  22. The method according to claim 16, characterized in that parsing the compiled activation instruction to obtain the operation code and the operation domain of the activation instruction comprises:
    storing the compiled activation instruction;
    parsing the compiled activation instruction to obtain the operation code and the operation domain of the activation instruction; and
    storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged sequentially in execution order, the plurality of instructions to be executed including the compiled activation instruction.
  23. The method according to claim 22, further comprising:
    when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and, after determining that execution of the zeroth instruction to be executed is complete, controlling execution of the first instruction to be executed,
    wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed that precedes it includes:
    a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed.
  24. The method according to claim 16, wherein compiling, by the control module, the acquired activation instruction to obtain the compiled activation instruction includes:
    generating an assembly file according to the activation instruction, and translating the assembly file into a binary file,
    wherein the binary file is the compiled activation instruction.
  25. The method according to any one of claims 16 to 24, wherein the activation function used by the activation operation includes at least one of the following:
    a rectified linear unit (ReLU) function, a sigmoid function, a hyperbolic tangent (tanh) function, a leaky rectified linear unit (leaky ReLU) function, a maximum function, and a power function.
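Claim 19 places several activation operators in the master operation sub-module but does not say how the data to be operated on is divided among them. The sketch below assumes, purely for illustration, that the data is split into contiguous chunks (one per operator) and that memory is modelled as a Python dict keyed by address; `split_among_operators`, `master_activate` and the dict-based memory are hypothetical names, not part of the patent.

```python
import math
from typing import Callable, Dict, List


def split_among_operators(data: List[float], num_operators: int) -> List[List[float]]:
    """Split data into contiguous chunks, one per activation operator (assumed policy)."""
    chunk = max(1, math.ceil(len(data) / num_operators))
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def master_activate(
    data: List[float],
    act: Callable[[float], float],
    num_operators: int,
    memory: Dict[int, List[float]],
    target_address: int,
) -> List[float]:
    """Apply the activation chunk by chunk and store the result at the target address."""
    result: List[float] = []
    for chunk in split_among_operators(data, num_operators):
        # In hardware each chunk would be handled by its own operator in parallel;
        # this sketch simply processes the chunks one after another.
        result.extend(act(x) for x in chunk)
    memory[target_address] = result  # hypothetical dict-based memory model
    return result


memory: Dict[int, List[float]] = {}
print(master_activate([-1.0, 2.0, -3.0, 4.0, 5.0], lambda x: max(x, 0.0), 2, memory, 0x200))
# [0.0, 2.0, 0.0, 4.0, 5.0]
```

Contiguous chunking is only one possible policy; an interleaved (round-robin) split would serve equally well for an element-wise activation.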
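Claim 20 allows the read-in amount to be carried either directly in the operation domain or indirectly through its storage address. A minimal sketch of both cases follows, assuming a flat, element-addressed data memory (a Python list) and a separate scalar store (a dict); `OperationDomain`, `resolve_read_in` and `fetch_operands` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class OperationDomain:
    source_address: int                    # index of the first element in data memory
    read_in: Optional[int] = None          # read-in amount carried directly, or ...
    read_in_address: Optional[int] = None  # ... the address where it is stored


def resolve_read_in(domain: OperationDomain, scalar_memory: Dict[int, int]) -> int:
    """Return the read-in amount, dereferencing its storage address if needed."""
    if domain.read_in is not None:
        return domain.read_in
    if domain.read_in_address is not None:
        return scalar_memory[domain.read_in_address]
    raise ValueError("operation domain carries neither a read-in amount nor its address")


def fetch_operands(
    domain: OperationDomain, data_memory: List[float], scalar_memory: Dict[int, int]
) -> List[float]:
    """Acquire exactly `read_in` elements starting at the source address."""
    n = resolve_read_in(domain, scalar_memory)
    return data_memory[domain.source_address:domain.source_address + n]
```

For example, with `data = [0.5, -1.0, 2.0, 3.0, -4.0]`, `fetch_operands(OperationDomain(1, read_in=3), data, {})` reads three elements starting at index 1.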
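The storing, parsing and queueing steps of claim 22 can be sketched as follows. These claims do not specify a binary instruction format, so the 14-byte little-endian layout used here (opcode, function id, source address, target address, read-in amount) is an assumption made only for illustration, as are the names `INSTR_FORMAT`, `parse` and `InstructionUnit`.

```python
import struct
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical encoding, not taken from the patent: opcode (u8), function id (u8),
# source address, target address and read-in amount (u32 each), little-endian.
INSTR_FORMAT = "<BBIII"

Parsed = Tuple[int, Dict[str, int]]


def parse(compiled: bytes) -> Parsed:
    """Split a compiled activation instruction into operation code and operation domain."""
    opcode, func, src, dst, n = struct.unpack(INSTR_FORMAT, compiled)
    return opcode, {"function": func, "source": src, "target": dst, "read_in": n}


class InstructionUnit:
    """Stores compiled instructions and queues the parsed ones in execution order."""

    def __init__(self) -> None:
        self.store: List[bytes] = []
        self.queue: Deque[Parsed] = deque()

    def receive(self, compiled: bytes) -> None:
        self.store.append(compiled)         # store the compiled instruction
        self.queue.append(parse(compiled))  # parse it and append to the queue tail

    def next_instruction(self) -> Parsed:
        return self.queue.popleft()         # first in, first out preserves execution order
```

A FIFO `deque` is the simplest structure that keeps the "arranged sequentially according to execution order" property; a real controller might also track issue state per entry.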
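Claim 23 defines the association between two queued instructions as an overlap between the address intervals holding their data. A minimal sketch, assuming half-open intervals `[start, end)` (the claim does not say whether the bounds are inclusive) and a simple "cache, then release" gate; `AddressInterval`, `has_association` and `DependencyGate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AddressInterval:
    start: int  # inclusive
    end: int    # exclusive (half-open interval is an assumption)

    def overlaps(self, other: "AddressInterval") -> bool:
        return self.start < other.end and other.start < self.end


def has_association(first: List[AddressInterval], zeroth: List[AddressInterval]) -> bool:
    """True if any interval used by the first instruction overlaps one used by the zeroth."""
    return any(a.overlaps(b) for a in first for b in zeroth)


class DependencyGate:
    """Caches a dependent instruction until the instruction it depends on has finished."""

    def __init__(self) -> None:
        self.cached: List[str] = []

    def submit(self, name: str, first: List[AddressInterval],
               zeroth: List[AddressInterval], zeroth_finished: bool) -> bool:
        if not zeroth_finished and has_association(first, zeroth):
            self.cached.append(name)  # hold the first instruction back
            return False
        return True  # no conflict: it may execute immediately

    def release(self) -> List[str]:
        """Call once the zeroth instruction completes; returns the instructions now free to run."""
        ready, self.cached = self.cached, []
        return ready
```

With half-open intervals, two regions that merely touch (for example `[0x100, 0x140)` and `[0x140, 0x180)`) are not treated as overlapping.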
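Claim 24 describes a two-step compilation: generating an assembly file from the activation instruction, then translating that file into a binary file. The toy assembler below illustrates only those two steps; the mnemonic `ACT`, the function-id table and the 14-byte record layout are hypothetical and are not the applicant's actual encoding.

```python
import struct

# Hypothetical tables and layout, for illustration only (not the applicant's encoding).
OPCODES = {"ACT": 0x21}
FUNCTIONS = {"relu": 0, "sigmoid": 1, "tanh": 2, "leaky_relu": 3, "max": 4, "power": 5}
INSTR_FORMAT = "<BBIII"  # opcode, function id, source, target, read-in amount


def to_assembly(function: str, source: int, target: int, read_in: int) -> str:
    """Step 1: emit one line of assembly text for an activation instruction."""
    return f"ACT {function} {source:#x} {target:#x} {read_in}"


def assemble(assembly: str) -> bytes:
    """Step 2: translate assembly text into binary, one fixed-width record per line."""
    out = bytearray()
    for line in assembly.splitlines():
        if not line.strip():
            continue  # skip blank lines
        mnemonic, func, src, dst, n = line.split()
        out += struct.pack(INSTR_FORMAT, OPCODES[mnemonic], FUNCTIONS[func],
                           int(src, 0), int(dst, 0), int(n, 0))
    return bytes(out)
```

The fixed-width record keeps the example short; `int(text, 0)` accepts both the hexadecimal addresses and the decimal read-in amount emitted in step 1.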
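For reference, the activation functions listed in claim 25 can be written as scalar functions, as below. The leaky-ReLU slope `alpha`, the constant `c` of the maximum function and the exponent `p` of the power function are assumed parameters that the claim does not specify, and "maximum function" is read here as an element-wise `max(x, c)`, which is one plausible interpretation.

```python
import math


def relu(x: float) -> float:
    return max(x, 0.0)


def sigmoid(x: float) -> float:
    # Branch on the sign so that math.exp never overflows for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x: float) -> float:
    return math.tanh(x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:  # alpha is an assumed default
    return x if x >= 0 else alpha * x


def maximum(x: float, c: float = 0.0) -> float:  # interpreted as element-wise max(x, c)
    return max(x, c)


def power(x: float, p: float = 2.0) -> float:  # exponent p is an assumed default
    # Note: a negative x with a non-integer p yields a complex number in Python.
    return x ** p


ACTIVATIONS = {"relu": relu, "sigmoid": sigmoid, "tanh": tanh,
               "leaky_relu": leaky_relu, "max": maximum, "power": power}

print([ACTIVATIONS["leaky_relu"](v) for v in (-2.0, 0.0, 3.0)])
```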
PCT/CN2019/110146 2018-10-09 2019-10-09 Operation method and device, computer equipment, and storage medium WO2020073923A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201980039761.9A CN113330460A (en) 2018-10-09 2019-10-09 Operation method, operation device, computer equipment and storage medium

Applications Claiming Priority (86)

Application Number Priority Date Filing Date Title
CN201811172140 2018-10-09
CN201811172140.1 2018-10-09
CN201811185729 2018-10-11
CN201811184749.0 2018-10-11
CN201811185715 2018-10-11
CN201811184749 2018-10-11
CN201811185715.3 2018-10-11
CN201811185729.5 2018-10-11
CN201811191931 2018-10-12
CN201811191931.9 2018-10-12
CN201811190934 2018-10-12
CN201811190928.5 2018-10-12
CN201811191930 2018-10-12
CN201811190928 2018-10-12
CN201811190934.0 2018-10-12
CN201811191930.4 2018-10-12
CN201811202572 2018-10-16
CN201811202571.8 2018-10-16
CN201811202572.2 2018-10-16
CN201811202571 2018-10-16
CN201811363029.0 2018-11-14
CN201811363029 2018-11-14
CN201811407882.8 2018-11-23
CN201811407882 2018-11-23
CN201811482764 2018-12-05
CN201811482764.3 2018-12-05
CN201811488009 2018-12-06
CN201811488009.6 2018-12-06
CN201811496153.4 2018-12-07
CN201811496153 2018-12-07
CN201811532846 2018-12-14
CN201811532846.4 2018-12-14
CN201811548631 2018-12-18
CN201811548631.1 2018-12-18
CN201811555975 2018-12-19
CN201811558860.1 2018-12-19
CN201811558860 2018-12-19
CN201811555975.5 2018-12-19
CN201811564591.X 2018-12-20
CN201811564195 2018-12-20
CN201811563547 2018-12-20
CN201811564195.7 2018-12-20
CN201811564591 2018-12-20
CN201811563547.7 2018-12-20
CN201910403433.4 2019-05-15
CN201910403427 2019-05-15
CN201910407483.X 2019-05-15
CN201910403433 2019-05-15
CN201910403240 2019-05-15
CN201910403427.9 2019-05-15
CN201910403246 2019-05-15
CN201910403240.9 2019-05-15
CN201910407483 2019-05-15
CN201910403246.6 2019-05-15
CN201910548315.2A CN110096283A (en) 2018-10-12 2019-06-24 Operation method, device, computer equipment and storage medium
CN201910548315.2 2019-06-24
CN201910548268.1 2019-06-24
CN201910548268.1A CN110096309B (en) 2018-11-14 2019-06-24 Operation method, operation device, computer equipment and storage medium
CN201910614665 2019-07-09
CN201910614665.4 2019-07-09
CN201910620828.X 2019-07-10
CN201910620828 2019-07-10
CN201910621592.1 2019-07-10
CN201910621592 2019-07-10
CN201910626430 2019-07-11
CN201910623276.8 2019-07-11
CN201910625446.6 2019-07-11
CN201910624935.X 2019-07-11
CN201910626214 2019-07-11
CN201910624935 2019-07-11
CN201910623276 2019-07-11
CN201910624949.1 2019-07-11
CN201910624949 2019-07-11
CN201910626430.7 2019-07-11
CN201910626214.2 2019-07-11
CN201910625446 2019-07-11
CN201910629416.2 2019-07-12
CN201910629416 2019-07-12
CN201910636335 2019-07-15
CN201910635636.6 2019-07-15
CN201910636335.5 2019-07-15
CN201910635636 2019-07-15
CN201910645024.5 2019-07-17
CN201910645024 2019-07-17
CN201910755313.0 2019-08-15
CN201910755313 2019-08-15

Publications (1)

Publication Number Publication Date
WO2020073923A1 true WO2020073923A1 (en) 2020-04-16

Family

ID=70163909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/110146 WO2020073923A1 (en) 2018-10-09 2019-10-09 Operation method and device, computer equipment, and storage medium

Country Status (1)

Country Link
WO (1) WO2020073923A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541814A (en) * 2010-12-27 2012-07-04 北京国睿中数科技股份有限公司 Matrix calculating device and matrix calculating method for data communication processor
CN104915322A (en) * 2015-06-09 2015-09-16 中国人民解放军国防科学技术大学 Method for accelerating convolution neutral network hardware and AXI bus IP core thereof
US20160342890A1 (en) * 2015-05-21 2016-11-24 Google Inc. Batch processing in a neural network processor
CN107992329A (en) * 2017-07-20 2018-05-04 上海寒武纪信息科技有限公司 A kind of computational methods and Related product
CN110096310A (en) * 2018-11-14 2019-08-06 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium
CN110119807A (en) * 2018-10-12 2019-08-13 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium

Legal Events

Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
  Ref document number: 19870695; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase
  Ref country code: DE
32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
  Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.07.2021)
122 EP: PCT application non-entry in European phase
  Ref document number: 19870695; Country of ref document: EP; Kind code of ref document: A1