WO2020108471A1 - Computing method and apparatus, and related product - Google Patents


Info

Publication number
WO2020108471A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
vector
checked
executed
processing
Prior art date
Application number
PCT/CN2019/120893
Other languages
French (fr)
Chinese (zh)
Inventor
张尧
陈煜�
刘少礼
曾洪博
于涌
陶劲桦
李震
韩栋
Original Assignee
上海寒武纪信息科技有限公司
Priority claimed from CN201811456735.XA external-priority patent/CN111258641B/en
Priority claimed from CN201910001865.2A external-priority patent/CN111400341B/en
Priority claimed from CN201910001855.9A external-priority patent/CN111399905B/en
Priority claimed from CN201910293190.3A external-priority patent/CN111813376A/en
Priority claimed from CN201910293748.8A external-priority patent/CN111813448A/en
Priority claimed from CN201910293770.2A external-priority patent/CN111813537A/en
Priority claimed from CN201910293777.4A external-priority patent/CN111813449A/en
Priority claimed from CN201910294130.3A external-priority patent/CN111813450A/en
Application filed by 上海寒武纪信息科技有限公司 filed Critical 上海寒武纪信息科技有限公司
Publication of WO2020108471A1 publication Critical patent/WO2020108471A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a computing method, an apparatus, and related products.
  • the present disclosure proposes a computing method, an apparatus, and related products.
  • a vector search instruction processing device includes:
  • the control module is used to parse the received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine, according to the operation code and the operation domain, the to-be-searched vector, the search condition, and the target address required to execute the vector search instruction;
  • the operation module is used to sequentially determine whether the plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine a to-be-checked number that satisfies the search condition as the target number, and store the storage address of the target number in the target address as the search result,
  • wherein the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the to-be-searched vector address and the target address.
  • a machine learning computing device including:
  • one or more of the above vector search instruction processing devices, configured to obtain the data to be operated on and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the vector search instruction processing devices, the plurality of vector search instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of vector search instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of vector search instruction processing devices share the same control system or have their own control systems; the plurality of vector search instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of vector search instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • the vector search instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and an arithmetic module.
  • the control module is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, search condition, and target address required to execute the vector search instruction according to the operation code and operation domain.
  • the operation module is used to sequentially determine whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number satisfying the search condition as the target number, and store the target number's storage address as the search result in the target address.
  • the vector search instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and offer high processing efficiency and fast processing speed for the vector search instruction.
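As an illustration only, the following Python sketch models the vector search semantics described above in software; the flat-list memory, the function name execute_vector_find, and the condition encoding are assumptions made for this example, not the patent's actual hardware or instruction format.

```python
# Hypothetical software model of the vector search instruction described above.
# Memory is simulated as a flat list; addresses and field names are illustrative.

def execute_vector_find(memory, vector_addr, length, target_addr, condition):
    """Scan the to-be-searched vector and store the storage address of the first
    to-be-checked number that satisfies `condition` into `target_addr`."""
    for offset in range(length):
        addr = vector_addr + offset
        if condition(memory[addr]):          # check the to-be-checked number
            memory[target_addr] = addr       # store its storage address as the result
            return addr
    memory[target_addr] = -1                 # no match found (illustrative convention)
    return -1

# Usage: find the first element equal to 1 in the vector stored at address 100.
memory = [0] * 200
memory[100:103] = [5, 1, 4]                  # to-be-searched vector (5, 1, 4)
execute_vector_find(memory, vector_addr=100, length=3,
                    target_addr=150, condition=lambda x: x == 1)
print(memory[150])                           # -> 101
```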
  • a scalar search instruction processing device includes:
  • the control module is used to parse the received scalar search instruction, obtain the operation code and the operation domain of the scalar search instruction, and determine, according to the operation code and the operation domain, the to-be-searched scalar, the specified value, the specified sort, and the target address required to execute the scalar search instruction;
  • the operation module is used to sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, determine the to-be-checked number whose value is equal to the specified value and whose ordering is the specified sort as the target number, and store the storage address of the target number in the target address as the search result,
  • wherein the operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation domain includes the to-be-searched scalar address and the target address.
  • a machine learning computing device including:
  • one or more of the above scalar search instruction processing devices, configured to obtain the data to be operated on and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the scalar search instruction processing devices, the plurality of scalar search instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of scalar search instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of scalar search instruction processing devices share the same control system or have their own control systems; the plurality of scalar search instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of scalar search instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a scalar search instruction processing method is provided.
  • the method is applied to a scalar search instruction processing device.
  • the method includes:
  • the operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation field includes the scalar address to be searched and the target address.
  • the scalar search instruction processing method, device, and related products provided by the embodiments of the present disclosure include a control module and an operation module.
  • the control module is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the to-be-searched scalar, the specified value, the specified sort, and the target address required to execute the scalar search instruction according to the operation code and the operation domain.
  • the operation module is used to sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, determine the to-be-checked number whose value is equal to the specified value and whose ordering is the specified sort as the target number, and store the storage address of the target number in the target address as the search result.
  • the scalar search instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and offer high processing efficiency and fast processing speed for the scalar search instruction.
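The scalar search behaviour, finding the n-th match from the front or the m-th match from the end among the to-be-checked numbers equal to a specified value, can be sketched the same way; the rank convention and names below are illustrative assumptions.

```python
# Hypothetical sketch of the scalar search semantics: among the to-be-checked
# numbers, find the one equal to `specified_value` whose rank is the n-th match
# (positive rank) or the |rank|-th match from the end (negative rank).

def scalar_find(memory, scalar_addr, length, target_addr, specified_value, rank):
    matches = [scalar_addr + i for i in range(length)
               if memory[scalar_addr + i] == specified_value]
    if not matches:
        memory[target_addr] = -1             # no match (illustrative convention)
        return -1
    addr = matches[rank - 1] if rank > 0 else matches[rank]
    memory[target_addr] = addr               # store the storage address as the result
    return addr

memory = [0] * 64
memory[10:16] = [1, 7, 1, 1, 7, 1]
scalar_find(memory, 10, 6, 40, specified_value=1, rank=-1)   # the last match of 1
print(memory[40])                                            # -> 15
```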
  • a resource lock instruction processing device, comprising:
  • the control module, configured to parse the received resource lock instruction, obtain the operation code and the operation domain of the resource lock instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock instruction, and determine the lock-and-release strategy required for the lock-and-release processing;
  • a processing module, configured to lock or release the resource to be processed according to the lock-and-release strategy to obtain the processed resource,
  • wherein the operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
  • a machine learning computing device including:
  • one or more of the above resource lock instruction processing devices, configured to obtain the data to be operated on and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the resource lock instruction processing devices, the plurality of resource lock instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of resource lock instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of resource lock instruction processing devices share the same control system or have their own control systems; the plurality of resource lock instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of resource lock instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a method for processing a resource lock instruction is provided.
  • the method is applied to a device for processing a resource lock instruction.
  • the method includes:
  • parsing the received resource lock instruction to obtain the operation code and the operation domain of the resource lock instruction, determining, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock instruction, and determining the lock-and-release strategy required for the lock-and-release processing;
  • wherein the operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
  • the resource lock instruction processing method, device, and related products provided by the embodiments of the present disclosure include a control module and a processing module.
  • the control module is used to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, and determine the lock-and-release strategy required for the lock-and-release processing.
  • the processing module is used to lock or release the resource to be processed according to the lock-and-release strategy to obtain the processed resource.
  • the resource lock instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and locking and releasing resources according to the resource lock instruction has high processing efficiency and fast processing speed.
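A minimal software analogue of the lock/release semantics, assuming the operation code selects locking or releasing and the operation domain carries a resource identifier; the class name and opcode strings are hypothetical.

```python
import threading

class ResourceLockUnit:
    """Hypothetical model of a resource lock instruction processing module."""

    def __init__(self):
        self._locks = {}                       # resource identifier -> lock object

    def execute(self, opcode, resource_id):
        lock = self._locks.setdefault(resource_id, threading.Lock())
        if opcode == "LOCK":                   # lock-and-release strategy: lock
            lock.acquire()                     # blocks until the resource is free
        elif opcode == "RELEASE":              # lock-and-release strategy: release
            lock.release()
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

unit = ResourceLockUnit()
unit.execute("LOCK", resource_id=3)            # lock shared resource 3
unit.execute("RELEASE", resource_id=3)         # release it for other instructions
```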
  • a tensor rearrangement instruction processing device includes:
  • the control module is configured to parse the received tensor rearrangement instruction, obtain the operation code and the operation domain of the tensor rearrangement instruction, determine, according to the operation code and the operation domain, the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction, and determine the rearrangement strategy required for the rearrangement processing;
  • the processing module is configured to perform rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearranged tensor, and store the rearranged tensor into the target address,
  • wherein the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on the tensor data is rearrangement processing, and the operation domain includes the to-be-processed tensor address and the target address.
  • a machine learning computing device including:
  • one or more of the above tensor rearrangement instruction processing devices, configured to obtain the to-be-processed tensor and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the tensor rearrangement instruction processing devices, the plurality of tensor rearrangement instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of tensor rearrangement instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of tensor rearrangement instruction processing devices share the same control system or have their own control systems; the plurality of tensor rearrangement instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of tensor rearrangement instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a tensor rearrangement instruction processing method is provided.
  • the method is applied to a tensor rearrangement instruction processing apparatus.
  • the method includes:
  • the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on the tensor data is rearrangement processing, and the operation field includes the to-be-processed tensor address and the target address.
  • the tensor rearrangement instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module.
  • the control module is used to parse the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, determine the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction according to the operation code and operation domain, and determine the rearrangement strategy required for the rearrangement processing.
  • the processing module is used to perform rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain the rearranged tensor, and store the rearranged tensor into the target address.
  • the tensor rearrangement instruction processing method, device, and related products provided by the embodiments of the present disclosure can realize the rearrangement of tensor data through a single tensor rearrangement instruction; compared with the related art, which implements tensor data rearrangement through multiple instructions, the rearrangement of tensor data has high processing efficiency, fast processing speed, and a wide application range.
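A hypothetical sketch of what a single tensor rearrangement instruction accomplishes, modelling the rearrangement strategy as a dimension permutation (using NumPy for brevity); the addresses and field names are illustrative assumptions.

```python
import numpy as np

def tensor_rearrange(memory, tensor_addr, shape, perm, target_addr):
    """Read the to-be-processed tensor, permute its dimensions according to
    `perm`, and store the rearranged tensor at `target_addr`."""
    size = int(np.prod(shape))
    tensor = np.asarray(memory[tensor_addr:tensor_addr + size]).reshape(shape)
    rearranged = np.transpose(tensor, perm)            # the rearrangement strategy
    memory[target_addr:target_addr + size] = rearranged.ravel().tolist()
    return rearranged

memory = list(range(6)) + [0] * 10                      # a 2x3 tensor at address 0
out = tensor_rearrange(memory, tensor_addr=0, shape=(2, 3), perm=(1, 0), target_addr=8)
print(out.shape)                                        # -> (3, 2)
```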
  • a data processing device is provided, which is used to perform machine learning calculations and includes a control module and a processing module, where the processing module includes a data transfer submodule and an accumulation submodule:
  • the control module is used to obtain a calculation instruction and obtain input data required to execute the calculation instruction
  • the data transfer sub-module is configured to process the input data according to the calculation instruction to obtain multiple intermediate results, and sequentially send the multiple intermediate results to the accumulation sub-module;
  • the accumulation submodule is used to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • a machine learning computing device including:
  • one or more of the above data processing devices, configured to obtain the input data and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the data processing devices, the plurality of data processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of data processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of data processing devices share the same control system or have their own control systems; the plurality of data processing devices share memory or have their own memories; and the interconnection method of the plurality of data processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a data processing method is provided.
  • the method is applied to a data processing device, and the device is used to perform machine learning calculations.
  • the method includes:
  • the data processing device, method and related products provided by the embodiments of the present disclosure include: a control module and a processing module.
  • the processing module includes a data transfer submodule and an accumulation submodule.
  • the control module is used to obtain calculation instructions and obtain input data required to execute the calculation instructions.
  • the data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, and send the multiple intermediate results to the accumulation sub-module in sequence.
  • the accumulation submodule is used to perform a cyclic accumulation operation on multiple intermediate results to obtain the calculation result of the calculation instruction.
  • the data processing device, method, and related products provided by the embodiments of the present disclosure reduce the amount of data access and computation by cyclically accumulating the multiple intermediate results, and can effectively increase the data processing speed while ensuring that calculation accuracy is not compromised.
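The split between the data transfer submodule and the accumulation submodule can be sketched as a producer/consumer pair, where intermediate results are folded into a running sum rather than stored and summed at the end; using per-element products as the intermediate results is an assumption made purely for illustration.

```python
def data_transfer_submodule(weights, inputs):
    """Yield intermediate results (here: per-element products) in sequence."""
    for w, x in zip(weights, inputs):
        yield w * x

def accumulation_submodule(intermediate_results):
    """Cyclically accumulate intermediate results into one calculation result."""
    acc = 0
    for r in intermediate_results:      # each result is consumed once and discarded
        acc += r                        # running accumulation keeps memory traffic low
    return acc

weights = [0.5, -1.0, 2.0]
inputs = [4.0, 3.0, 1.5]
print(accumulation_submodule(data_transfer_submodule(weights, inputs)))  # -> 2.0
```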
  • a matrix symmetric instruction processing device includes:
  • the control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, determine the to-be-processed matrix and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain, and determine the symmetry strategy required for the symmetric processing;
  • the processing module is configured to perform symmetric processing on the to-be-processed matrix according to the symmetry strategy to obtain a symmetric matrix, and store the symmetric matrix into the target address,
  • wherein the operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation domain includes the to-be-processed matrix address and the target address.
  • a machine learning computing device including:
  • one or more of the above matrix symmetric instruction processing devices, configured to obtain the to-be-processed matrix and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the matrix symmetric instruction processing devices, the plurality of matrix symmetric instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of matrix symmetric instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of matrix symmetric instruction processing devices share the same control system or have their own control systems; the plurality of matrix symmetric instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of matrix symmetric instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a matrix symmetric instruction processing method is provided.
  • the method is applied to a matrix symmetric instruction processing device.
  • the method includes:
  • the operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation field includes the to-be-processed matrix address and the target address.
  • the matrix symmetric instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module.
  • the control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, determine the to-be-processed matrix and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain, and determine the symmetry strategy required for the symmetric processing.
  • the processing module is used to perform symmetric processing on the to-be-processed matrix according to the symmetry strategy to obtain the symmetric matrix, and store the symmetric matrix into the target address.
  • the matrix symmetric instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and the symmetric processing of the matrix according to the matrix symmetric instruction has high processing efficiency and fast processing speed.
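A hedged sketch of the matrix symmetric processing, assuming the symmetry strategy selects reflection across the main diagonal or the anti-diagonal; the strategy names are hypothetical and other strategies could be encoded the same way.

```python
import numpy as np

def matrix_symmetric(matrix, strategy="main_diagonal"):
    """Return the symmetric counterpart of `matrix` under the chosen strategy."""
    m = np.asarray(matrix)
    if strategy == "main_diagonal":
        return m.T                      # reflection across the main diagonal
    if strategy == "anti_diagonal":
        return m[::-1, ::-1].T          # reflection across the anti-diagonal
    raise ValueError(f"unknown symmetry strategy {strategy!r}")

print(matrix_symmetric([[1, 2], [3, 4]]))        # [[1 3], [2 4]]
```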
  • a matrix mirroring instruction processing device includes:
  • the control module is used to parse the received matrix mirroring instruction, obtain the operation code and the operation domain of the matrix mirroring instruction, determine, according to the operation code and the operation domain, the to-be-mirrored matrix and the target address required to execute the matrix mirroring instruction, and determine the mirroring strategy required for the mirroring processing;
  • the processing module is configured to perform mirroring processing on the to-be-mirrored matrix according to the mirroring strategy to obtain a mirrored matrix, and store the mirrored matrix into the target address,
  • wherein the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix data is mirroring processing, and the operation domain includes the to-be-mirrored matrix address and the target address.
  • a machine learning computing device including:
  • one or more of the above matrix mirroring instruction processing devices, configured to obtain the to-be-mirrored matrix and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the matrix mirroring instruction processing devices, the plurality of matrix mirroring instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of matrix mirroring instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of matrix mirroring instruction processing devices share the same control system or have their own control systems; the plurality of matrix mirroring instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of matrix mirroring instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a matrix mirroring instruction processing method is provided.
  • the method is applied to a matrix mirroring instruction processing apparatus.
  • the method includes:
  • the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirror processing, and the operation domain includes the matrix address to be mirrored and the target address.
  • the matrix mirroring instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module.
  • the control module is used to parse the received matrix mirroring instruction, obtain the operation code and operation domain of the matrix mirroring instruction, determine the to-be-mirrored matrix and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain, and determine the mirroring strategy required for the mirroring processing.
  • the processing module is used to perform mirroring processing on the to-be-mirrored matrix according to the mirroring strategy to obtain the mirrored matrix, and store the mirrored matrix into the target address.
  • the matrix mirroring instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and the mirroring processing of the matrix according to the matrix mirroring instruction has high processing efficiency and fast processing speed.
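Similarly, a minimal sketch of two plausible mirroring strategies (left-right and up-down) on a small matrix; the strategy names are illustrative assumptions.

```python
import numpy as np

def matrix_mirror(matrix, strategy="horizontal"):
    """Return the mirrored matrix under the chosen mirroring strategy."""
    m = np.asarray(matrix)
    if strategy == "horizontal":
        return np.fliplr(m)             # mirror the columns (left-right)
    if strategy == "vertical":
        return np.flipud(m)             # mirror the rows (up-down)
    raise ValueError(f"unknown mirroring strategy {strategy!r}")

print(matrix_mirror([[1, 2, 3], [4, 5, 6]], "horizontal"))   # [[3 2 1], [6 5 4]]
```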
  • a matrix rotation instruction processing device includes:
  • the control module is used to parse the received matrix rotation instruction, obtain the operation code and the operation domain of the matrix rotation instruction, determine, according to the operation code and the operation domain, the to-be-rotated matrix and the target address required to execute the matrix rotation instruction, and determine the rotation angle of the to-be-rotated matrix;
  • the processing module is configured to perform rotation processing on the to-be-rotated matrix according to the rotation angle to obtain a rotated matrix, and store the rotated matrix into the target address,
  • wherein the operation code is used to indicate that the processing performed by the matrix rotation instruction on the matrix data is rotation processing, and the operation domain includes the to-be-rotated matrix address and the target address.
  • a machine learning computing device including:
  • one or more of the above matrix rotation instruction processing devices, configured to obtain the to-be-rotated matrix and the control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
  • when the machine learning operation device includes a plurality of the matrix rotation instruction processing devices, the plurality of matrix rotation instruction processing devices may be connected to and transmit data with each other through a specific structure,
  • for example, the plurality of matrix rotation instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIe) bus to support larger-scale machine learning operations; the plurality of matrix rotation instruction processing devices share the same control system or have their own control systems; the plurality of matrix rotation instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of matrix rotation instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • a machine learning chip including the above machine learning operation device or the above combined processing device.
  • a machine learning chip packaging structure including the above machine learning chip.
  • a board card including the above machine learning chip packaging structure.
  • an electronic device including the aforementioned machine learning chip or the aforementioned board.
  • a matrix rotation instruction processing method is provided.
  • the method is applied to a matrix rotation instruction processing apparatus.
  • the method includes:
  • the operation code is used to indicate that the processing performed by the matrix rotation instruction on the matrix data is rotation processing, and the operation domain includes the matrix address to be rotated and the target address.
  • the matrix rotation instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module.
  • the control module is used to parse the received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, determine the to-be-rotated matrix and the target address required to execute the matrix rotation instruction according to the operation code and operation domain, and determine the rotation angle of the to-be-rotated matrix.
  • the processing module is used to perform rotation processing on the to-be-rotated matrix according to the rotation angle to obtain the rotated matrix, and store the rotated matrix into the target address.
  • the matrix rotation instruction processing method, device, and related products provided by the embodiments of the present disclosure have a wide range of applications, and the rotation processing of the to-be-rotated matrix according to the matrix rotation instruction has high processing efficiency and fast processing speed.
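A minimal sketch of the rotation processing, assuming the rotation angle is restricted to multiples of 90 degrees; this is one plausible encoding for illustration only and is not confirmed by the text above.

```python
import numpy as np

def matrix_rotate(matrix, angle):
    """Rotate `matrix` counter-clockwise by `angle` degrees (multiples of 90)."""
    if angle % 90 != 0:
        raise ValueError("assumed encoding supports multiples of 90 degrees only")
    return np.rot90(np.asarray(matrix), k=(angle // 90) % 4)

print(matrix_rotate([[1, 2], [3, 4]], 90))    # [[2 4], [1 3]]
```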
  • the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a still camera, a video camera, a projector, a watch, headphones, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or medical equipment.
  • the vehicle includes an airplane, a ship, and/or a car;
  • the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood;
  • the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound machine, and/or an electrocardiograph.
  • FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure.
  • FIG. 2-1 illustrates a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 2-2 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 2-3a-2-3c illustrate schematic diagrams of application scenarios of a vector search instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 2-4 shows a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure.
  • FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.
  • FIG. 3-2 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.
  • FIGS. 3-3a-3-3c are schematic diagrams illustrating application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure.
  • FIG. 3-4 shows a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure.
  • FIG. 4-1 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 4-2 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.
  • FIGS. 4-3a-4-3b illustrate schematic diagrams of application scenarios of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 4-4 shows a flowchart of a resource lock instruction processing method according to an embodiment of the present disclosure.
  • FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5-2 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5-3 shows a schematic diagram of an application scenario of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 5-4 shows a flowchart of a tensor rearrangement instruction processing method according to an embodiment of the present disclosure.
  • FIG. 6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
  • FIG. 6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure.
  • FIG. 6-3 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
  • FIG. 6-4 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
  • FIGS. 6-5a-6-5d show block diagrams of processing modules in a data processing apparatus according to an embodiment of the present disclosure.
  • FIG. 6-6 shows a flowchart of a data processing method according to an embodiment of the present disclosure.
  • FIG. 7-1 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
  • FIG. 7-2 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
  • FIG. 7-3 shows a schematic diagram of an application scenario of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
  • FIG. 7-4 shows a flowchart of a matrix symmetric instruction processing method according to an embodiment of the present disclosure.
  • FIG. 8-1 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 8-2 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 8-3 shows a schematic diagram of an application scenario of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.
  • FIG. 8-4 shows a flowchart of a matrix mirroring instruction processing method according to an embodiment of the present disclosure.
  • FIG. 9-1 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 9-2 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 9-3 shows a schematic diagram of an application scenario of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
  • FIG. 9-4 shows a flowchart of a matrix rotation instruction processing method according to an embodiment of the present disclosure.
  • FIGS. 10a and 10b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
  • FIG. 11 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.
  • the term “if” may be interpreted as “when” or “once” or “in response to a determination” or “in response to a detection” depending on the context.
  • the phrase “if determined” or “if [described condition or event] is detected” can be interpreted in the context to mean “once determined” or “in response to a determination” or “once detected [described condition or event ]” or “In response to detection of [the described condition or event]”.
  • the present disclosure provides instruction processing methods and apparatuses corresponding to different operations or processes, as well as computer equipment and storage media corresponding to each instruction processing method and device. The instruction processing methods and devices corresponding to different operations or processes include: a vector search instruction processing method and device, a scalar search instruction processing method and device, a resource lock instruction processing method and device, a tensor rearrangement instruction processing method and device, a data processing method and device, a matrix symmetric instruction processing method and device, a matrix mirroring instruction processing method and device, and a matrix rotation instruction processing method and device.
  • the instruction processing method and instruction processing device described below may be any of the instruction processing methods and devices listed above.
  • the instruction processing method may be applied to a processor, which may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations.
  • artificial intelligence operations may include machine learning operations, brain-like operations, and so on. Machine learning operations include neural network operations, k-means operations, support vector machine operations, and the like.
  • the artificial intelligence processor may include, for example, one of, or a combination of, a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing) chip, and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not limit the specific type of processor.
  • the processor mentioned in the present disclosure may include multiple processing units, and each processing unit may independently run various assigned tasks, such as convolution operation tasks, pooling tasks, or fully connected tasks.
  • the present disclosure does not limit the processing unit and the tasks performed by the processing unit.
  • FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure.
  • the processor 100 includes a plurality of processing units 101 and a storage unit 102.
  • the plurality of processing units 101 are used to execute an instruction sequence
  • the storage unit 102 is used to store data, and may include a random access memory (RAM) and a register file.
  • the multiple processing units 101 in the processor 100 can share a part of the storage space, for example, share a part of the RAM storage space and the register file, and can also have their own storage spaces at the same time.
  • FIG. 2-1 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure.
  • the device includes a control module 11-2 and an arithmetic module 12-2.
  • the control module 11-2 is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, the search condition, and the target address required to execute the vector search instruction according to the operation code and operation domain.
  • the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the to-be-searched vector address and the target address.
  • the operation module 12-2 is used to sequentially determine whether the plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number that satisfies the search condition as the target number, and store the storage address of the target number in the target address as the search result.
  • the to-be-searched vector may be composed of multiple to-be-checked numbers.
  • for example, the decimal representation of the to-be-searched vector m is (5, 6, 4), and the multiple to-be-checked numbers of the to-be-searched vector m are "5", "6", and "4".
  • the to-be-searched vector can also be represented by a binary, hexadecimal, or other character string.
  • for example, the binary representation of the to-be-searched vector m is "101110100", where "101", "110", and "100" are the multiple to-be-checked numbers of the to-be-searched vector m, corresponding to the numbers 5, 6, and 4 when the to-be-searched vector m is converted to decimal.
  • the control module can obtain the vector to be found from the address of the vector to be found.
  • the address of the vector to be searched may be the first address for storing the vector to be searched, and so on.
  • the control module may obtain the vector search instruction and the vector to be searched through the data input/output unit.
  • the data input/output unit may be one or more data I/O interfaces or I/O pins.
  • the operation code may be the part of an instruction or field (usually represented by a code) that specifies the operation to be performed; it is the instruction serial number used to inform the device executing the instruction which instruction needs to be executed.
  • the operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the parameter data, the to-be-searched vector, the corresponding operation method, and the like, or the addresses at which the parameter data, the to-be-searched vector, the corresponding operation method, and the like are stored.
  • for a vector search instruction, it must include an operation code and an operation domain, where the operation domain includes at least the to-be-searched vector address and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • the vector search instruction processing device includes a control module and an operation module.
  • the control module is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, search condition, and target address required to execute the vector search instruction according to the operation code and operation domain.
  • the operation module is used to sequentially determine whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number satisfying the search condition as the target number, and store the target number's storage address as the search result in the target address.
  • the vector search instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the vector search instruction.
  • the to-be-checked number that satisfies the search condition may include at least one of the following:
  • a to-be-checked number whose value is a specified multiple of a specified value and whose ordering is a specified sort;
  • a to-be-checked number whose value is a specified multiple of a specified value.
  • the specified sort may include at least one of the following: the ordering of the to-be-checked number is the n-th among the to-be-checked numbers whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; the ordering of the to-be-checked number is the m-th from the end among the to-be-checked numbers whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1. Here, m and n are less than or equal to the number of to-be-checked numbers in the to-be-searched vector.
  • the ordering "the n-th among the to-be-checked numbers whose values are the specified multiple of the specified value" and the ordering "the m-th from the end among the to-be-checked numbers whose values are the specified multiple of the specified value" may be given different expressions in the instruction, so as to distinguish counting from the front from counting from the end.
  • for example, the ordering "the m-th from the end among the to-be-checked numbers whose values are the specified multiple of the specified value" may be expressed in the vector search instruction as "m0".
  • a person skilled in the art can set the expression of the specified order according to actual needs, and this disclosure does not limit this.
  • the specified value may be 0, 1, 2, 3, and so on.
  • the specified multiple can be 1 times (that is, the value is the same as the specified value), 2 times, 3 times and other multiples.
  • for example, the target number found by the vector search instruction may be the first to-be-checked number whose value is 1, the last to-be-checked number whose value is 1, the first to-be-checked number whose value is 3 times 2, the last to-be-checked number whose value is 3 times 2, a to-be-checked number less than 5, a to-be-checked number greater than 9, and so on.
  • the operation domain may also include the input length.
  • the control module 11-2 is also used to obtain the vector to be searched from the address of the vector to be searched according to the input length.
  • the length of the vector to be searched obtained from the address of the vector to be searched according to the input length needs to be equal to the input length, or needs to be less than the input length.
  • the vector to be searched may be obtained according to a preset default input length. It is also possible to obtain all data in the address of the vector to be searched as the vector to be searched.
  • the operation field may further include the width of the data to be checked.
  • the operation module 12-2 is also used to determine a plurality of numbers to be checked from the vector to be looked up according to the width of the numbers to be checked.
  • the width of the number to be checked may represent the width corresponding to each number to be checked in the character string of the vector to be looked up.
  • when the width of the to-be-checked number is included in the operation domain, a plurality of groups of character strings whose width is the to-be-checked width can be determined from the character string representing the to-be-searched vector, and each group of character strings corresponds to one to-be-checked number.
  • for example, the to-be-searched vector m (expressed as (5, 6, 4) when converted to decimal) is "101110100". When the to-be-checked width is 3, the to-be-checked numbers of the to-be-searched vector m are "101", "110", and "100", which correspond to the numbers 5, 6, and 4 when the to-be-searched vector m is converted to decimal.
  • if the to-be-checked width were 1, the to-be-checked numbers of the to-be-searched vector m would be "1", "0", "1", "1", "1", "0", "1", "0", and "0"; and if the to-be-checked width were 2, 4, or another width other than 3, the obtained to-be-checked numbers would merely be character-string groups that no longer correspond to the numbers 5, 6, and 4 of the to-be-searched vector m in decimal.
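The role of the to-be-checked width can be shown with a small helper that slices the vector's bit string into to-be-checked numbers; the helper name is hypothetical.

```python
def split_to_be_checked(bit_string, width):
    """Split the vector's bit string into to-be-checked numbers of `width` bits."""
    groups = [bit_string[i:i + width] for i in range(0, len(bit_string), width)]
    return [int(g, 2) for g in groups]

print(split_to_be_checked("101110100", 3))   # -> [5, 6, 4]
print(split_to_be_checked("101110100", 1))   # -> [1, 0, 1, 1, 1, 0, 1, 0, 0]
```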
  • the operation domain may further include search conditions.
  • the control module 11-2 is also used to determine search conditions according to the operation domain.
  • that is, when the operation domain includes the search condition, the search condition can be obtained directly from the operation domain.
  • alternatively, the control module 11-2 is also used to determine the search condition according to the operation code.
  • the opcode is also used to indicate the search condition of the vector search instruction.
  • different operation codes can be set to represent different search conditions.
  • the width of the data to be checked can also be determined according to the operation code or the default width.
  • the operation code "Find_vlast" is the last one of the multiple numbers to be searched to find the vector to be searched (the width of the number to be checked is greater than 1, the number of the number to be checked that meets the search condition is: the value is the number of the number to be checked that is double the specified value The penultimate number 1 to be checked).
  • the width of the number to be checked can be further determined according to the operation code, or the default width can be determined as the width of the number to be checked, and then multiple number of numbers to be checked with the width of the number to be checked can be obtained .
  • the operation module 12-2 may include at least one comparator 121-2 configured to compare the plurality of to-be-checked numbers with the search condition to obtain comparison results, so as to determine, according to the comparison results, whether each to-be-checked number satisfies the search condition.
  • take, as an example, searching for the first to-be-checked number whose value is 1 (that is, whose value is 1 times the specified value 1).
  • the comparator can sequentially compare the values of the multiple to-be-checked numbers with the specified value "1" to determine whether the value of each to-be-checked number is equal to the specified value "1".
  • the to-be-checked number whose value is equal to the specified value "1" and whose ordering is the first among such to-be-checked numbers is determined as the target number, and the storage address of the target number is stored in the target address as the search result.
  • the number of comparators can be set according to the size of the data amount to be compared, the processing speed, efficiency, etc. of the comparison, which is not limited in the present disclosure.
  • the device may further include a storage module 13-2.
  • the storage module 13-2 is used to store the vector to be searched.
  • the storage module may include one or more of a memory, a cache, and a register
  • the cache may include a temporary storage cache.
  • the vector to be searched can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in the present disclosure.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • control module 11-2 may include an instruction storage submodule 111-2, an instruction processing submodule 112-2, and a queue storage submodule 113-2.
  • the instruction storage submodule 111-2 is used to store vector search instructions.
  • the instruction processing sub-module 112-2 is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction.
  • the queue storage sub-module 113-2 is used to store an instruction queue.
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order.
  • the plurality of instructions to be executed may include vector search instructions.
  • the plurality of instructions to be executed may include other calculation instructions related to the vector search instruction.
  • the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
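The text does not fix a concrete priority scheme, so the sketch below is only one plausible model of the queue ordering described above: to-be-executed instructions are sorted by priority level first and reception time second (lower numbers meaning higher priority is an assumption made here for illustration).

```python
from dataclasses import dataclass


@dataclass
class PendingInstruction:
    name: str
    priority: int     # lower value = higher priority (assumed convention)
    received_at: int  # reception timestamp


def build_instruction_queue(pending):
    """Arrange to-be-executed instructions into an execution order."""
    return sorted(pending, key=lambda ins: (ins.priority, ins.received_at))


queue = build_instruction_queue([
    PendingInstruction("vector_search", priority=1, received_at=20),
    PendingInstruction("vector_add",    priority=0, received_at=30),
    PendingInstruction("vector_load",   priority=1, received_at=10),
])
print([ins.name for ins in queue])  # ['vector_add', 'vector_load', 'vector_search']
```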
  • control module 11-2 may further include a dependency processing sub-module 114-2.
  • the dependency processing sub-module 114-2 is configured to cache the first to-be-executed instruction in the instruction storage sub-module 111-2 when it is determined that there is an association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction, and, after the execution of the zeroth to-be-executed instruction is completed, to extract the first to-be-executed instruction from the instruction storage sub-module 111-2 and send it to the arithmetic module 12-2.
  • the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.
  • the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction before it means that the first storage address interval storing the data required by the first to-be-executed instruction
  • and the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area. Conversely, no association relationship between the first instruction to be executed and the zeroth instruction to be executed may mean that there is no overlapping area between the first storage address interval and the zeroth storage address interval.
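The association test described above reduces to an interval-overlap check. A minimal sketch, assuming each storage address interval is given as a half-open pair (start, end); the function name is hypothetical.

```python
def has_dependency(first_interval, zeroth_interval):
    """True if the first to-be-executed instruction's storage address interval
    overlaps the zeroth to-be-executed instruction's interval."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end


print(has_dependency((100, 128), (120, 160)))  # True  -> cache the first instruction
print(has_dependency((100, 128), (200, 232)))  # False -> no association, issue directly
```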
  • the instruction format of the vector search instruction may be as shown in Table 1 below, and the positions of the operation code and the operation domain may be set as required.
  • in Table 2, a conventional vector search instruction (Find) is given, with which any number in the vector to be searched can be found; Table 2 also gives an example of the vector search instruction and defines
  • two special types of vector search instructions (Find_vfirst, Find_vlast), which need to include the operation code and the operation domain. Using a special type of vector search instruction to search the to-be-searched vector can simplify the instruction processing process and save search time.
  • the to-be-checked number satisfying the search condition may refer to: a to-be-checked number whose value is a specified multiple of the specified value.
  • the to-be-checked number satisfying the search condition may also refer to: a to-be-checked number whose value is a specified multiple of the specified value and whose sorting is the specified sorting.
  • the vector search instruction whose operation code is "Find_vfirst” the number of the searched to meet the search conditions is: the value is double the specified value 1 (that is, the value is equal to the specified value 1), and the sorted value is The first of the counts to be checked that is double the specified value 1.
  • the width of the data to be checked is greater than 1.
  • the vector search instruction whose operation code is "Find_vlast" finds that the number of pending queries that meet the search conditions is: the value is double the specified value 1 (that is, the value is equal to the specified value 1) and the sorted value is the specified value
  • the width of the data to be checked is greater than 1.
  • the device may be provided in a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
  • FIGS. 2-3a-2-3c illustrate schematic diagrams of application scenarios of a vector search instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figures 2-3a-2-3c, the vector search instruction processing device processes the vector search instruction as follows.
  • the to-be-searched vector a is "0101 1011 0001 0101 1100 0001 1001".
  • every four binary numbers represent a number of the vector a to be searched in decimal, that is, the vector a to be searched in decimal is (5,11,1,5,12,1,9).
  • the storage address of the vector a to be searched in different vector search instructions is different.
  • the vector search instructions to be processed by the device include:
  • Vector search instruction 1 @Find#100#28#200#4#1#01
  • Vector search instruction 2 @Find_vfirst#101#28#201#4
  • Vector search instruction 3 @Find_vlast#102#28#202#4
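The instruction strings above follow an "@opcode#field#…#field" notation. The sketch below is only an illustration of that notation (not the device's actual decoder): it separates the operation code from the operation-domain fields, whose meanings (address, input length, target address, width, and so on) follow the surrounding description.

```python
def parse_search_instruction(text):
    """Split an '@opcode#field#...#field' instruction string into its
    operation code and operation-domain fields."""
    opcode, *fields = text.lstrip("@").split("#")
    return opcode, fields


print(parse_search_instruction("@Find#100#28#200#4#1#01"))
# ('Find', ['100', '28', '200', '4', '1', '01'])
print(parse_search_instruction("@Find_vfirst#101#28#201#4"))
# ('Find_vfirst', ['101', '28', '201', '4'])
```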
  • the control module 11-2 parses the vector search instruction 1 when receiving the vector search instruction 1, obtains the operation code of the vector search instruction 1 as Find, and determines the vector search instruction 1 according to the operation domain ,
  • the vector address to be searched is "100"
  • the input length is "28”
  • the target address is "200”
  • the specified sort is "the number of counts to be checked is equal to the specified value (because the specified multiple position in the vector search instruction 1 is empty , By default, the multiple is doubled)
  • the first of the number to be checked the specified value is "1”
  • the width of the number to be checked is "4".
  • the control module 11-2 obtains, from the to-be-searched vector address 100, the above-mentioned to-be-searched vector a with input length 28, namely "0101 1011 0001 0101 1100 0001 1001".
  • the arithmetic module 12-2 sequentially obtains a plurality of to-be-checked numbers from the to-be-searched vector according to the width of the number to be checked "4", and sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1",
  • and the first to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 200 as the search result.
  • the arithmetic module 12-2 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched vector a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "1011” from the vector to be looked up a, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1".
  • since the value of the number to be checked "1011" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the to-be-searched vector a, and judges whether the value of the number to be checked "0001" is equal to the specified value "1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the to-be-checked number is the first of the to-be-checked numbers equal to the specified value), the number to be checked "0001" is determined as the target number, and the storage address 500 of the number to be checked "0001" is stored in the target address 200 as the search result.
  • the control module 11-2 parses the vector search instruction 2 when receiving the vector search instruction 2, obtains the operation code of the vector search instruction 2 as Find_vfirst, and determines the vector search instruction 2 according to the operation domain
  • the address of the vector to be searched is "101", the input length is "28”, the target address is "201", and the width of the number to be searched is "4".
  • it is determined according to the operation code Find_vfirst that the specified value of the vector search instruction 2 is "1" and the specified sort is "the sort of the number to be checked is the first of the number to be checked equal to the specified value”.
  • the control module 11-2 obtains, from the to-be-searched vector address 101, the above-mentioned to-be-searched vector a with input length 28, namely "0101 1011 0001 0101 1100 0001 1001".
  • the arithmetic module 12-2 sequentially obtains a plurality of to-be-checked numbers from the to-be-searched vector according to the width of the number to be checked "4", and sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1",
  • and the first to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 201 as the search result.
  • the arithmetic module 12-2 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched vector a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "1011” from the vector to be looked up a, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1".
  • since the value of the number to be checked "1011" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the to-be-searched vector a, and judges whether the value of the number to be checked "0001" is equal to the specified value "1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the to-be-checked number is the first of the to-be-checked numbers equal to the specified value), the number to be checked "0001" is determined as the target number, and the storage address 501 of the number to be checked "0001" is stored in the target address 201 as the search result.
  • the control module 11-2 parses the vector search instruction 3 when receiving the vector search instruction 3, obtains the operation code of the vector search instruction 3 as Find_vlast, and determines the vector search instruction 3 according to the operation domain
  • the address of the vector to be searched is "102", the input length is "28”, the target address is "202", and the width of the data to be searched is "4".
  • it is determined according to the operation code Find_vlast that the specified value of the vector search instruction 3 is "1" and the specified sort is "the sorting of the to-be-checked number is the last of the to-be-checked numbers equal to the specified value".
  • the control module 11-2 obtains, from the to-be-searched vector address 102, the above-mentioned to-be-searched vector a with input length 28, namely "0101 1011 0001 0101 1100 0001 1001".
  • the arithmetic module 12-2 obtains a plurality of to-be-checked numbers from the to-be-searched vector according to the width of the number to be checked "4", sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1",
  • and determines the last to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" as the target number, and stores the storage address of the target number in the target address 202 as the search result.
  • the arithmetic module 12-2 first obtains the first to-be-checked number "1001" with a width of 4 from the to-be-searched vector a, and determines whether the value of the to-be-checked number "1001" is equal to the specified value "1" . Since the value of the number to be checked "1001" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the vector to be looked up a, and judges whether the value of the number to be checked "0001" is equal to the specified value " 1".
  • the number to be checked "0001" is determined As the target number, the storage address 502 of the number to be checked "0001" is stored in the target address 202 as a search result.
  • the vector search instruction processing device can process the vector search instruction quickly and efficiently.
  • FIG. 2-4 shows a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 2-4, the method is applied to the above vector search instruction processing device and includes step S51-2 and step S52-2.
  • in step S51-2, the received vector search instruction is parsed to obtain the operation code and the operation domain of the vector search instruction, and the vector to be searched, the search condition, and the target address required for executing the vector search instruction are determined according to the operation code and the operation domain.
  • the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address and the target address to be searched.
  • step S52-2 it is sequentially determined whether a plurality of numbers to be searched for the vector to be searched satisfy the search condition, and the number to be searched that meets the search condition is determined as the target number, and the storage address of the target number is stored as the search result in the target address.
  • the operation domain may also include the input length.
  • determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain may include: obtaining the vector to be searched from the vector address to be searched according to the input length.
  • the operation field may further include the width of the data to be checked.
  • the method may further include: determining a plurality of to-be-checked numbers from the to-be-checked vector according to the width of the to-be-checked numbers.
  • the operation domain may further include search conditions.
  • determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain may include: determining the search condition according to the operation domain.
  • determining the to-be-searched vector, the search condition, and the target address required to execute the vector search instruction according to the operation code and the operation domain may include:
  • the search condition is determined, and the operation code is also used to indicate the search condition of the vector search instruction.
  • sequentially determining whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition may include:
  • At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number meets the search condition according to the comparison result.
  • the number of to-be-checked that meets the search condition may include at least one of the following:
  • the numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting
  • the numeric value is the number to be checked for the specified multiple of the specified value.
  • the specified sorting may include at least one of the following: the sorting of the to-be-checked number is the n-th of the to-be-checked numbers whose value is the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; or the sorting of the to-be-checked number is the m-th from the last of the to-be-checked numbers whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1. Both m and n are less than or equal to the number of to-be-checked numbers in the to-be-searched vector.
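A small sketch of the specified-sorting semantics just described. For simplicity it tests equality with the specified value (i.e. a multiple of one) rather than a general specified multiple; n and m are 1-based as in the text, and the function name is hypothetical.

```python
def nth_match(check_numbers, specified_value, n=None, m=None):
    """Return the index of the n-th to-be-checked number equal to the specified
    value, or, if m is given instead, the m-th counting from the end."""
    matches = [i for i, x in enumerate(check_numbers) if x == specified_value]
    if n is not None:
        return matches[n - 1] if n <= len(matches) else None
    if m is not None:
        return matches[-m] if m <= len(matches) else None
    return None


values = [5, 11, 1, 5, 12, 1, 9]
print(nth_match(values, 1, n=1))  # 2 -> first to-be-checked number equal to 1
print(nth_match(values, 1, m=1))  # 5 -> last to-be-checked number equal to 1
```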
  • the method may further include: storing the to-be-searched vector.
  • parsing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction may include:
  • the instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include vector search instructions.
  • the method may further include: when it is determined that there is an association relationship between a first to-be-executed instruction among the plurality of to-be-executed instructions and a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,
  • wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • a first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
  • the vector search instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the vector search instruction.
  • a vector search instruction processing device comprising:
  • the control module is used to parse the received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine the standby required to execute the vector search instruction according to the operation code and the operation domain Search vector, search condition and target address;
  • the operation module is used to sequentially determine whether a plurality of check numbers representing the search vector satisfy the search condition, determine the check number satisfying the search condition as the target number, and store the storage address of the target number Store in the target address as a search result,
  • the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.
  • the operation field further includes an input length
  • the control module is further configured to obtain the vector to be searched from the address of the vector to be searched according to the input length.
  • Clause A3 The device according to Clause A1, the operation domain further includes the width of the data to be checked,
  • the calculation module is further configured to determine the plurality of to-be-checked numbers from the to-be-searched vector according to the width of the to-be-checked number.
  • the operation domain further includes a search condition
  • the control module is also used to determine the search condition according to the operation domain.
  • the control module is further used to determine the search condition according to the operation code, wherein the operation code is also used to indicate the search condition of the vector search instruction.
  • the arithmetic module includes:
  • At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number meets the search condition according to the comparison result.
  • Clause A7 The device according to any one of Clause A1-Clause A6, the number of to-be-checked satisfying the search condition includes at least one of the following:
  • the numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting
  • the value is the number to be checked for the specified multiple of the specified value
  • the designated order includes at least one of the following:
  • the sorting of the number to be checked is the nth of the number to be checked whose value is a specified multiple of the specified value, where n is a positive integer greater than or equal to 1;
  • the order of the numbers to be checked is the m-th to the last one of the numbers to be checked whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
  • m and n are less than or equal to the number of numbers to be checked in the vector to be looked up.
  • the device further includes a storage module for storing the to-be-searched vector,
  • control module includes:
  • Instruction storage sub-module for storing the vector search instruction
  • An instruction processing sub-module which is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction
  • a queue storage sub-module is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the vector search instruction,
  • control module also includes:
  • the dependency processing sub-module is used to cache the first to-be-executed instruction in the instruction storage submodule when there is an association relationship between a first to-be-executed instruction among the plurality of to-be-executed instructions and a zeroth to-be-executed instruction before the first to-be-executed instruction,
  • and, after the execution of the zeroth instruction to be executed is completed, to extract the first instruction to be executed from the instruction storage submodule and send it to the arithmetic module,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • a machine learning computing device comprising:
  • one or more vector search instruction processing devices as described in any one of Clause A1 to Clause A8, used to obtain data to be calculated and control information from other processing apparatuses, perform the specified machine learning operations, and pass the execution result to other processing devices through an I/O interface;
  • the machine learning operation device includes a plurality of the vector search instruction processing devices
  • the plurality of vector search instruction processing devices may be connected and transmit data through a specific structure
  • a plurality of the vector search instruction processing devices may interconnect and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; a plurality of the vector search instruction processing devices may share the same control system or have their own control systems; a plurality of the vector search instruction processing devices may share memory or have their own memories; and the interconnection method of the plurality of vector search instruction processing devices may be an arbitrary interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device as described in Clause A9, a universal interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip includes:
  • Clause A12 An electronic device, the electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause A11;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • a vector search instruction processing method is applied to a vector search instruction processing device.
  • the method includes:
  • the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.
  • the operation field further includes an input length
  • determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain include:
  • Clause A16 The method according to Clause A14, the operation domain further includes the width of the number to be checked, and the method further includes:
  • determining the plurality of to-be-checked numbers from the to-be-searched vector according to the width of the number to be checked.
  • the operation domain further includes a search condition
  • determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain include:
  • the search condition is determined.
  • a vector to be searched, a search condition, and a target address required to execute the vector search instruction are determined, including:
  • the search condition is determined according to the operation code, and the operation code is also used to indicate the search condition of the vector search instruction.
  • sequentially determining whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition includes:
  • At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number satisfies the search condition according to the comparison result.
  • the number of to-be-checked that meets the search condition includes at least one of the following:
  • the numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting
  • the value is the number to be checked for the specified multiple of the specified value
  • the designated order includes at least one of the following:
  • the sorting of the number to be checked is the nth of the number to be checked whose value is a specified multiple of the specified value, where n is a positive integer greater than or equal to 1;
  • the order of the numbers to be checked is the m-th to the last one of the numbers to be checked whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
  • m and n are less than or equal to the number of numbers to be checked in the vector to be looked up.
  • the method further includes: storing the to-be-searched vector,
  • analyzing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction includes:
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the vector search instruction,
  • the method further includes:
  • when it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in Figure 3-1, the device includes a control module 11-3 and an arithmetic module 12-3.
  • the control module 11-3 is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched and the specified value required to execute the scalar search instruction according to the operation code and operation domain , Specify sorting and target address.
  • the operation code is used to indicate that the operation performed by the scalar search instruction on the data is a search operation, and the operation domain includes the scalar address and the target address to be searched.
  • the arithmetic module 12-3 is used to sequentially determine whether the values of the plurality of check numbers representing the scalar to be searched are equal to the specified value, and determine the check number with the value equal to the specified value and sorted to the specified sort as the target number, and determine the target The storage address of the number is stored in the target address as the search result.
  • the scalar to be searched may be a character string in binary, hexadecimal, or the like.
  • the binary representation of the scalar 87 to be searched is "01010111”
  • the multiple queried numbers of the scalar 87 to be searched are "0", "1", "0", "1", "0", "1", "1” and "1".
  • the control module can obtain the scalar to be found from the scalar address to be found.
  • the scalar address to be searched may be the first address storing the scalar to be searched, and so on.
  • the control module may obtain the scalar search instruction and the scalar to be searched through the data input and output unit, and the data input and output unit may be one or more data I/O interfaces or I/O pins.
  • the operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is the instruction serial number used to inform the device that executes the instruction which instruction needs to be executed.
  • the operation domain may be the source of all data required to execute the corresponding instruction, including the parameter data, the scalar to be searched, and the corresponding operation method, or the addresses storing the parameter data, the scalar to be searched, the corresponding operation method, and so on.
  • a scalar search instruction must include an operation code and an operation domain, where the operation domain includes at least the scalar address to be searched and the target address.
  • the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.
  • a scalar search instruction processing device includes a control module and an arithmetic module.
  • the control module is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched, the specified value, the specified order and the required scalar search instruction according to the operation code and the operation domain. target address.
  • the arithmetic module is used to sequentially determine whether the values of the multiple check numbers representing the scalar to be searched are equal to the specified value, determine the check number that is equal to the specified value and sorted into the specified sort as the target number, and determine the storage address of the target number The target address is stored as the search result.
  • the scalar search instruction processing device provided by the embodiment of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the scalar search instruction.
  • the specified sorting may include at least one of the following: the sorting of the to-be-checked number is the n-th of the to-be-checked numbers equal to the specified value, where n is a positive integer greater than or equal to 1; or the sorting of the to-be-checked number is the m-th from the last of the to-be-checked numbers equal to the specified value, where m is a positive integer greater than or equal to 1. Both m and n are less than or equal to the number of to-be-checked numbers in the scalar to be searched.
  • the specified value may be 0, 1, 2, 3, and so on.
  • if the scalar to be searched is a hexadecimal character string, the specified value can be 0-9 or A-F; if the scalar to be searched is a binary character string, the specified value can be 0 or 1.
  • the target number found by the scalar search instruction may be the first one, the last one, etc. of the multiple queried numbers to be searched for.
  • the operation domain may also include the input length.
  • the control module 11-3 is also used to obtain the scalar to be found from the scalar address to be found according to the input length.
  • the length of the scalar to be searched obtained from the to-be-searched scalar address according to the input length may be equal to the input length, or may be less than the input length.
  • the scalar to be searched can be obtained according to a preset default input length. It is also possible to obtain all data in the scalar address to be searched as the scalar to be searched.
  • the operation field may further include the width of the data to be checked.
  • the operation module 12-3 is also used to determine a plurality of to-be-checked numbers from the to-be-searched scalars according to the to-be-checked width.
  • when the width of the number to be checked is not included in the operation domain (which may mean that the position corresponding to the width of the number to be checked in the scalar search instruction is empty, or that there is no such field), or when the width of the number to be checked is 1, the multiple to-be-checked numbers of the scalar to be searched are the individual characters of the character string. For example, when the scalar n to be searched is "01010111" and the width of the number to be checked is 1, the multiple to-be-checked numbers of the scalar n are "0", "1", "0", "1", "0", "1", "1", and "1".
  • when the width of the number to be checked is greater than 1, the multiple to-be-checked numbers of the scalar to be searched are multiple binary digit strings having the width of the number to be checked, and each binary digit string of that width
  • represents one to-be-checked number. For example, if the width of the number to be checked is 3 and the scalar m to be searched is "101110100", the multiple to-be-checked numbers of the scalar m are "101", "110", and "100".
  • the operation domain may further include a specified value and a specified order.
  • the control module 11-3 is also used to determine the specified value and specified order according to the operation domain.
  • the specified value and specified order in the operation domain can be directly obtained.
  • control module 11-3 is also used to determine the specified value and the specified order according to the operation code.
  • the opcode is also used to indicate the specified value and specified order of the scalar search instruction.
  • different operation codes can be set to represent different specified values and specified orders.
  • the width of the data to be checked can also be determined according to the operation code or the default width.
  • the operation code "Find_blast” is the last one of the multiple numbers to be found to find the scalar to be found (the width of the number to be checked is 1, the specified value is 1, the specified sort is the number of checked numbers, and the sort is equal to the specified value The penultimate number in the number).
  • the width of the number-to-be-checked can be further determined to be 1 according to the operation code, and then a plurality of number-to-be-checked with a width of 1 can be obtained.
  • the operation module 12-3 may include at least one comparator 121-3, which is used to compare the values of multiple numbers to be checked and the specified values to obtain a comparison The result, in order to determine whether the value of the number to be checked is equal to the specified value according to the comparison result.
  • the comparator may sequentially compare the value of each of the multiple to-be-checked numbers with the specified value "1" to obtain the comparison result.
  • according to the comparison result, the arithmetic module can determine whether the value of the to-be-checked number is equal to the specified value "1", and the to-be-checked number whose value is equal to the specified value "1" and whose sorting is the specified sorting
  • is determined as the target number, and the storage address of the target number is stored in the target address as the search result.
  • the number of comparators can be set according to the size of the data amount to be compared, the processing speed, efficiency, etc. of the comparison, which is not limited in the present disclosure.
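The number of comparators is left open above; the sketch below models K comparators, each handling one to-be-checked number per step, with the earliest match in a step winning. The value K=4 and the function name are assumptions for illustration; the input is the bit sequence of the example scalar "01010111".

```python
def find_first_with_comparators(check_numbers, specified_value, num_comparators=4):
    """Model of K comparators comparing K to-be-checked numbers per step;
    the earliest matching position across all steps is returned."""
    for start in range(0, len(check_numbers), num_comparators):
        batch = check_numbers[start:start + num_comparators]
        # each comparator produces one equal/not-equal result for its operand
        results = [value == specified_value for value in batch]
        if any(results):
            return start + results.index(True)
    return None


bits = [0, 1, 0, 1, 0, 1, 1, 1]  # the digits of the scalar "01010111"
print(find_first_with_comparators(bits, 1, num_comparators=4))  # 1
```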
  • the device may further include a storage module 13-3.
  • the storage module 13-3 is used to store the scalar to be found.
  • the storage module may include one or more of a memory, a cache, and a register
  • the cache may include a temporary storage cache.
  • the scalar to be searched can be stored in the memory, cache, and/or register in the storage module as needed, which is not limited in this disclosure.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • control module 11-3 may include an instruction storage sub-module 111-3, an instruction processing sub-module 112-3, and a queue storage sub-module 113-3.
  • the instruction storage submodule 111-3 is used to store scalar search instructions.
  • the instruction processing submodule 112-3 is used to parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction.
  • the queue storage sub-module 113-3 is used to store an instruction queue.
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order.
  • the plurality of instructions to be executed may include a scalar search instruction.
  • the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
  • control module 11-3 may further include a dependency processing sub-module 114-3.
  • the dependency processing sub-module 114-3 is configured to cache the first to-be-executed instruction in the instruction storage submodule 111-3 when it is determined that there is an association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction, and, after the execution of the zeroth to-be-executed instruction is completed, to extract the first to-be-executed instruction from the instruction storage submodule 111-3 and send it to the arithmetic module 12-3.
  • the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.
  • the first to-be-executed instruction having an association relationship with the zeroth to-be-executed instruction before it means that the first storage address interval storing the data required by the first to-be-executed instruction
  • and the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area. Conversely, no association relationship between the first instruction to be executed and the zeroth instruction to be executed may mean that there is no overlapping area between the first storage address interval and the zeroth storage address interval.
  • the instruction format of the scalar search instruction may be as shown in Table 3 below, and the positions of the operation code and the operation domain may be set as required.
  • in Table 4, a conventional scalar search instruction (Find) is given, with which any number in the scalar to be searched can be found; Table 4 also gives an example of the scalar search instruction and defines
  • two special types of scalar search instructions (Find_bfirst, Find_blast), which need to include the operation code and the operation domain. Using a special type of scalar search instruction to search the scalar to be searched can simplify the instruction processing process and save search time.
  • the scalar search instruction whose operation code is "Find_blast" corresponds to the width of the number to be checked is 1, the specified value is 1, and the specified order is the sort of the number to be checked is the penultimate number of the number to be checked equal to the specified value.
  • the device may be provided in a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
  • FIGS. 3-3a to 3-3c are schematic diagrams illustrating application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in FIGS. 3-3a to 3-3c, the scalar search instruction processing device processes the scalar search instruction as follows.
  • the scalar a to be searched is "010110110001".
  • the scalar a to be searched, when converted to decimal, is 1457.
  • the storage addresses of the scalar a to be searched in different scalar search instructions are different.
  • the scalar search instructions to be processed by the device include:
  • Scalar search instruction 1 @Find#1#100#12#200#01#4
  • Scalar search instruction 4 @Find_bfirst#103#12#203
  • Scalar search instruction 5 @Find_blast#104#12#204
  • the control module 11-3 parses the scalar search instruction 1 when receiving the scalar search instruction 1, obtains the operation code of the scalar search instruction 1 as Find, and determines the scalar search instruction 1 according to the operation domain
  • the specified value is "1”
  • the scalar address to be searched is "100”
  • the input length is "12”
  • the target address is "200”
  • the specified sort is "the number of counts to be checked is equal to the specified value.
  • the control module 11-3 obtains, from the to-be-searched scalar address 100, the above-mentioned to-be-searched scalar a with input length 12, namely "010110110001".
  • the arithmetic module 12-3 sequentially obtains a plurality of to-be-checked numbers from the scalar a to be searched according to the width of the number to be checked "4", and sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1",
  • and the first to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 200 as the search result.
  • the arithmetic module 12-3 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched scalar a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "1011" from the scalar a to be searched, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1".
  • since the value of the number to be checked "1011" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "0001" from the scalar a to be searched, and judges whether the value of the number to be checked "0001" is equal to the specified value "1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the to-be-checked number is the first of the to-be-checked numbers equal to the specified value), the number to be checked "0001" is determined as the target number, and the storage address 500 of the number to be checked "0001" is stored in the target address 200 as the search result.
  • the control module 11-3 parses the scalar search instruction 4 when receiving the scalar search instruction 4, obtains the operation code of the scalar search instruction 4 as Find_bfirst, and determines the scalar search instruction 4 according to the operation domain
  • the scalar address to be searched for is "103"
  • the input length is "12”
  • the target address is "203”.
  • it is determined according to the operation code Find_bfirst that the specified value of the scalar search instruction 4 is "1”
  • the specified sort is "the sort of the number to be checked is the first of the number to be checked equal to the specified value”.
  • the control module 11-3 obtains, from the to-be-searched scalar address 103, the above-mentioned to-be-searched scalar a whose length is 12, namely "010110110001".
  • the arithmetic module 12-3 sequentially obtains a plurality of to-be-checked numbers from the scalar a to be searched, sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1", determines the first to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" as the target number, and stores the storage address of the target number in the target address 203 as the search result.
  • the arithmetic module 12-3 first obtains the first to-be-checked number "0" from the to-be-searched scalar a, and judges whether the value of the to-be-checked number "0" is equal to the specified value "1". Since the value of the number to be checked "0" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "1" from the scalar a to be searched, and judges whether the value of the number to be checked "1" is equal to the specified value " 1".
  • the number "1" to be checked is determined as For the target number, the storage address 503 of the number to be checked "1" is stored in the target address 203 as the search result.
  • the control module 11-3 parses the scalar search instruction 5 when receiving the scalar search instruction 5, obtains the operation code of the scalar search instruction 5 as Find_blast, and determines the scalar search instruction 5 according to the operation domain
  • the scalar address to be searched for is "104"
  • the input length is "12”
  • the target address is "204”.
  • it is determined according to the operation code Find_blast that the specified value of the scalar search instruction 5 is "1”
  • the specified sort is "the sort of the number to be checked is the penultimate of the number to be checked equal to the specified value”.
  • the control module 11-3 obtains, from the to-be-searched scalar address 104, the above-mentioned to-be-searched scalar a whose length is 12, namely "010110110001".
  • the arithmetic module 12-3 sequentially obtains a plurality of to-be-checked numbers from the scalar a to be searched, starting from the end of the scalar, sequentially determines whether the values of the to-be-checked numbers are equal to the specified value "1", determines the last to-be-checked number among the to-be-checked numbers whose value is equal to the specified value "1" as the target number, and stores the storage address of the target number in the target address 204 as the search result.
  • since the search proceeds from the end of the scalar a to be searched, the arithmetic module 12-3 first obtains the last to-be-checked number "1", and judges whether the value of the number to be checked "1" is equal to the specified value "1". Since the value of the number to be checked "1" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the to-be-checked number is the last of the to-be-checked numbers equal to the specified value), the number to be checked "1" is determined as the target number, and the storage address 504 of the number to be checked "1" is stored in the target address 204 as the search result.
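The Find_bfirst and Find_blast walkthroughs above (instructions 4 and 5) amount to scanning the bits of the scalar a from the front and from the end respectively. A minimal sketch, returning bit positions rather than the example storage addresses 503 and 504:

```python
scalar_a = "010110110001"


def find_bfirst(bits, value="1"):
    """Find_bfirst: position of the first bit equal to the specified value."""
    for i, bit in enumerate(bits):
        if bit == value:
            return i
    return None


def find_blast(bits, value="1"):
    """Find_blast: position of the last bit equal to the specified value
    (the search proceeds from the end of the scalar)."""
    for i in range(len(bits) - 1, -1, -1):
        if bits[i] == value:
            return i
    return None


print(find_bfirst(scalar_a))  # 1  -> the second bit, as in the Find_bfirst walkthrough
print(find_blast(scalar_a))   # 11 -> the final bit, as in the Find_blast walkthrough
```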
  • the scalar search instruction processing device can process the scalar search instruction quickly and efficiently.
  • FIG. 3-4 shows a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, the method is applied to the above scalar search instruction processing device and includes step S51-3 and step S52-3.
  • step S51-3 the received scalar search instruction is parsed to obtain the operation code and operation domain of the scalar search instruction, and according to the operation code and operation domain, the to-be-searched scalar required to execute the scalar search instruction, the specified value, Specify sorting and destination addresses.
  • the operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation domain includes the scalar address to be searched and the target address.
  • in step S52-3, it is sequentially determined whether the values of the plurality of to-be-checked numbers representing the scalar to be searched are equal to the specified value, the to-be-checked number whose value is equal to the specified value and whose sorting is the specified sorting is determined as the target number,
  • and the storage address of the target number is stored in the target address as the search result.
  • the operation domain may also include the input length.
  • determining the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction according to the operation code and the operation domain may include: obtaining the scalar to be searched from the scalar address to be searched according to the input length.
  • the operation domain may further include a specified value and a specified order.
  • determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain may include: determining the specified value and the specified order according to the operation domain.
  • determining the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction according to the operation code and the operation field may include:
  • the specified value and specified order are determined.
  • the operation code is also used to indicate the specified value and specified order of the scalar search instruction.
  • sequentially determining whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value may include:
  • At least one comparator is used to compare the values of a plurality of to-be-checked numbers with a specified value to obtain a comparison result, so as to determine whether the to-be-checked value is equal to the specified value according to the comparison result.
  • the specified ordering may include at least one of the following:
  • the sorting of the to-be-checked number is the n-th of the to-be-checked numbers equal to the specified value, where n is a positive integer greater than or equal to 1; or the sorting of the to-be-checked number is the m-th from the last of the to-be-checked numbers equal to the specified value, where m is a positive integer greater than or equal to 1. Both m and n are less than or equal to the number of to-be-checked numbers in the scalar to be searched.
  • the method may further include: storing the scalar to be searched.
  • parsing the received scalar search instruction to obtain the operation code and operation domain of the scalar search instruction may include:
  • the instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include a scalar search instruction.
  • the method may further include: when it is determined that there is an association relationship between a first to-be-executed instruction among the plurality of to-be-executed instructions and a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,
  • wherein the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • a first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
  • the scalar search instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the scalar search instruction.
  • a scalar search instruction processing device comprising:
  • the control module is used to parse the received scalar search instruction, obtain the operation code and the operation domain of the scalar search instruction, and determine the standby required to execute the scalar search instruction according to the operation code and the operation domain Find scalar, specified value, specified sort and target address;
  • the arithmetic module is used to sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, and determine the to-be-checked numbers whose value is equal to the specified value and sorted into the specified sort are The target number, the storage address of the target number is stored in the target address as a search result,
  • the operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation field includes the scalar address to be searched and the target address.
  • the operation field further includes an input length
  • the control module is further configured to obtain the scalar to be searched from the scalar to be searched address according to the input length.
  • Clause B3 The device according to Clause B1, the operation domain further includes a specified value and a specified order,
  • the control module is also used to determine the specified value and the specified order according to the operation domain.
  • the control module is further configured to determine the specified value and the specified order according to the operation code, wherein the operation code is also used to indicate the specified value and the specified order of the scalar search instruction.
  • the arithmetic module includes:
  • At least one comparator is used to compare the values of the plurality of to-be-checked numbers with the specified value to obtain a comparison result, so as to determine whether the value of the to-be-checked number is equal to the specified value according to the comparison result.
• Clause B6 The device according to any one of Clause B1-Clause B5, the specified order includes at least one of the following:
• the order of the number to be checked is the n-th among the numbers to be checked that are equal to the specified value, where n is a positive integer greater than or equal to 1;
• the order of the number to be checked is the m-th from the last among the numbers to be checked that are equal to the specified value, where m is a positive integer greater than or equal to 1,
• both m and n are less than or equal to the total count of the numbers to be checked in the scalar to be searched.
  • the device also includes a storage module for storing the scalar to be searched.
  • control module includes:
  • Instruction processing sub-module which is used to parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction
  • a queue storage sub-module which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the scalar search instruction,
  • control module also includes:
• the dependency processing sub-module is used to, when there is an association relationship between a first instruction to be executed among the plurality of instructions to be executed and a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule and send it to the arithmetic module,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • a machine learning computing device comprising:
  • the machine learning computing device includes a plurality of the scalar search instruction processing devices
  • the plurality of scalar search instruction processing devices may be connected and transmit data through a specific structure
• a plurality of the scalar search instruction processing devices interconnect and transmit data through a PCIE bus (a fast external device interconnection bus) to support larger-scale machine learning operations; a plurality of the scalar search instruction processing devices share the same control system or have their own control systems; a plurality of the scalar search instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of scalar search instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
• the machine learning computing device as described in Clause B8, a general interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
• Clause B10 A machine learning chip, the machine learning chip includes:
• Clause B11 An electronic device, the electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause B10;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
• Clause B13 A scalar search instruction processing method.
  • the method is applied to a scalar search instruction processing device.
  • the method includes:
  • the operation code is used to indicate that the operation performed by the scalar search instruction on the data is a search operation, and the operation domain includes the scalar address to be searched and the target address.
  • the operation field further includes an input length
• the operation domain further includes a specified value and a specified order,
• determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain includes: determining the specified value and the specified order according to the operation domain.
  • Clause B16 According to the method described in Clause B13, determine the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction based on the operation code and the operation domain, including:
  • the specified value and the specified order are determined according to the operation code, and the operation code is also used to indicate the specified value and the specified order of the scalar search instruction.
  • Clause B17 According to the method described in Clause B13, sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, including:
  • At least one comparator is used to compare the values of the plurality of to-be-checked numbers with the specified value to obtain a comparison result, so as to determine whether the value of the to-be-checked number is equal to the specified value according to the comparison result.
• Clause B18 The method according to any one of Clause B13-Clause B17, the specified order includes at least one of the following:
• the order of the number to be checked is the n-th among the numbers to be checked that are equal to the specified value, where n is a positive integer greater than or equal to 1;
• the order of the number to be checked is the m-th from the last among the numbers to be checked that are equal to the specified value, where m is a positive integer greater than or equal to 1,
• both m and n are less than or equal to the total count of the numbers to be checked in the scalar to be searched.
  • the method further includes: storing the scalar to be found,
  • the received scalar search instruction is parsed to obtain the operation code and operation domain of the scalar search instruction, including:
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the scalar search instruction,
  • the method further includes:
• when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after the execution of the zeroth to-be-executed instruction is completed, controlling the execution of the first to-be-executed instruction,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing.
• In the process of performing data operations with a neural network algorithm, it is necessary to frequently lock and release resources to ensure reasonable use of the resources.
• The speed and efficiency of the existing way of locking and releasing resources are difficult to match the resource locking requirements during data computation; the locking speed is slow and the efficiency is low.
  • FIG. 4-1 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.
  • the device includes a control module 11-4 and an arithmetic module 12-4.
  • the control module 11-4 is used to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, And determine the lock strategy required for resource lock processing.
  • the operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
  • the operation module 12-4 is used to lock or release the resource to be processed according to the lock-and-release strategy to obtain the processed resource.
  • the lock and release strategy may indicate the manner of processing the resource to be processed, including locking the resource to be processed and releasing the resource to be processed.
  • the control module may determine the resource to be processed according to the identifier of the resource to be processed.
  • the identifier of the resource to be processed may be information such as a number and a name that identify the resource to be processed.
  • the control module can obtain the resource lock instruction and the resource to be processed through the data input/output unit.
  • the data input/output unit may be one or more data I/O interfaces or I/O pins.
• a resource lock instruction may include an operation code and an operation domain.
  • the operation code may be a part of an instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device executing the instruction which instruction needs to be executed.
• the operation domain may be the source of all data or resources required to execute the corresponding instruction; all data or resources required to execute the corresponding instruction include the resource to be processed, the corresponding lock-and-release strategy, and so on.
  • the operation domain may include at least a resource identifier to be processed.
  • the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure.
  • the control module can receive a resource lock instruction and control one or more processing modules to perform lock or release processing.
  • the multiple control modules may respectively receive resource lock and release instructions, and control the corresponding one or more processing modules to perform locking or releasing processing.
  • the resource lock instruction processing device includes a control module and a processing module.
• the control module is used to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, and determine the lock-and-release strategy required for the lock-and-release processing.
• the processing module is used for locking or releasing the resource to be processed according to the lock-and-release strategy, to obtain the processed resource.
  • the resource lock instruction processing device provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of locking and releasing resources according to the resource lock instruction is high and the processing speed is fast.
• the lock-and-release strategy may include at least one of locking the resource to be processed and releasing the resource to be processed. The resource to be processed cannot be assigned a task after it is locked, and can be assigned a task after it is released.
• different codes can be set in the resource lock instruction for different lock-and-release strategies.
• For example, "lock the resource to be processed" can be represented by the code PV0,
• and "release the resource to be processed" can be represented by the code PV1.
  • a person skilled in the art may set the lock and release strategy and the code of the lock and release strategy according to actual needs, and the disclosure does not limit this.
  • the operation domain can also be used to indicate the lock and release strategy.
  • the operation code can also be used to indicate the lock and release strategy.
  • a default lock and release strategy may be preset.
• the default lock-and-release strategy can be determined as the lock-and-release strategy of the current resource lock instruction.
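• A minimal sketch of how the three ways of obtaining the lock-and-release strategy described above (from the operation domain, from the operation code, or from a preset default) could be resolved in software is given below; the function and field names are assumptions for illustration only:

```python
# Illustrative sketch; the code values and field names are hypothetical, not from the disclosure.
LOCK, RELEASE = "PV0", "PV1"   # codes for "lock" / "release" as described above
DEFAULT_STRATEGY = LOCK        # an assumed preset default lock-and-release strategy

def resolve_strategy(opcode, operand_fields):
    """Pick the lock-and-release strategy for one resource lock instruction."""
    # 1) the operation domain may carry the strategy explicitly
    if "type" in operand_fields:
        return operand_fields["type"]
    # 2) otherwise the operation code itself may encode the strategy (e.g. PV0 / PV1)
    if opcode in (LOCK, RELEASE):
        return opcode
    # 3) otherwise fall back to the preset default strategy
    return DEFAULT_STRATEGY

# Usage: strategy carried in the operation domain of a "PV sign type" style instruction.
print(resolve_strategy("PV", {"sign": "r1", "type": "PV0"}))  # -> PV0
```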
  • the resources to be processed may include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
  • the IPU resource may be a storage resource of an IPU (Image Processing Unit).
  • the GPU resource may be a storage resource of a GPU (Graphics Processing Unit).
• the CPU resource may be a storage resource of a CPU (Central Processing Unit).
  • the memory access resource may be a memory resource such as a memory of the device that can be accessed by the resource lock instruction processing device.
  • the device may further include a storage module 13-4.
  • the storage module 13-4 is used to store the resource identifier to be processed.
  • the storage module may include one or more of a memory, a cache, and a register
  • the cache may include a temporary storage cache.
  • the resources to be processed can be stored in the memory, cache, and/or registers in the storage module according to needs, which is not limited in this disclosure.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • control module 11-4 may include an instruction storage submodule 111-4, an instruction processing submodule 112-4, and a queue storage submodule 113-4.
  • the instruction storage submodule 111-4 is used to store resource lock and release instructions.
  • the instruction processing submodule 112-4 is used to parse the resource lock instruction and obtain the operation code and operation domain of the resource lock instruction.
  • the queue storage submodule 113-4 is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include resource lock and release instructions.
  • the plurality of instructions to be executed may include other calculation instructions related to the resource lock instruction.
  • the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
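• The following sketch illustrates, under assumed conventions, how instructions to be executed might be arranged into an instruction queue according to priority level and reception time; the class and field names are hypothetical and only illustrate the ordering idea:

```python
# Minimal sketch of ordering pending instructions into an instruction queue.
# The (priority, reception order) convention is an assumption for illustration.
from dataclasses import dataclass, field
import itertools
import heapq

_arrival = itertools.count()   # monotonically increasing reception counter

@dataclass(order=True)
class PendingInstruction:
    priority: int                                                  # smaller value = executed earlier (assumed)
    reception_order: int = field(default_factory=lambda: next(_arrival))
    text: str = field(compare=False, default="")

def build_instruction_queue(instructions):
    """Arrange instructions first by priority level, then by reception time."""
    heap = list(instructions)
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]

queue = build_instruction_queue([
    PendingInstruction(priority=1, text="scalar search"),
    PendingInstruction(priority=0, text="resource lock"),
])
print([pi.text for pi in queue])   # ['resource lock', 'scalar search']
```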
  • control module 11-4 may further include a dependency processing sub-module 114-4.
• the dependency processing sub-module 114-4 is used to, when a first instruction to be executed has a dependency relationship with a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule 111-4, and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule 111-4 and send it to the processing module 12-4.
  • the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.
• the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction has an overlapping area with a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
  • there is no dependency relationship between the first instruction to be executed and the zeroth instruction to be executed may be that there is no overlapping area between the first storage address interval and the zeroth storage address interval.
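• A minimal sketch of the overlap test described above is shown below; the half-open interval representation is an assumption made only for illustration:

```python
# Sketch of the dependency test: two instructions are considered dependent when
# their storage address intervals overlap. Intervals are taken as
# (inclusive start, exclusive end), which is an assumed convention.
def intervals_overlap(first_interval, zeroth_interval):
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end

def has_dependency(first_instr_interval, zeroth_instr_interval):
    """The first instruction must wait for the zeroth one if their data regions overlap."""
    return intervals_overlap(first_instr_interval, zeroth_instr_interval)

# Example: [0x100, 0x200) overlaps [0x180, 0x240), so the first instruction
# would be cached until the zeroth instruction finishes executing.
print(has_dependency((0x100, 0x200), (0x180, 0x240)))  # True
print(has_dependency((0x100, 0x200), (0x200, 0x240)))  # False, no overlapping area
```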
  • the instruction format of the resource lock instruction may be:
  • PV is the operation code
  • sign and type are the operation domains.
• PV is used to indicate that the instruction is a resource lock instruction, and sign is the identifier of the resource to be processed.
• type is the lock-and-release strategy: the type for "locking the resource to be processed" is PV0, and the type for "releasing the resource to be processed" is PV1.
  • the instruction format of the resource lock instruction may also be:
  • PVx is the operation code
  • sign is the operation domain.
  • PVx is used to indicate that the instruction is a resource lock instruction.
  • sign is the identifier of the resource to be processed.
• the x in PVx can indicate the lock-and-release strategy: x is 0 for "locking the resource to be processed" and x is 1 for "releasing the resource to be processed".
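• Purely for illustration, and assuming a simple textual encoding of the two instruction formats described above, the following sketch parses a resource lock instruction and applies the lock-and-release strategy to a resource table; the tokenization and the resource-state dictionary are assumptions, not part of the disclosure:

```python
# Hedged sketch of parsing "PV sign type" and "PVx sign" style instructions.
def parse_resource_lock(instruction_text):
    opcode, *operands = instruction_text.split()
    if opcode == "PV":                      # format 1: PV <sign> <type>
        sign, lock_type = operands
    else:                                   # format 2: PV0/PV1 <sign>
        sign = operands[0]
        lock_type = opcode                  # strategy carried in the opcode itself
    return sign, lock_type

def apply_resource_lock(resources, instruction_text):
    """resources: dict mapping a resource identifier to 'locked' or 'idle'."""
    sign, lock_type = parse_resource_lock(instruction_text)
    resources[sign] = "locked" if lock_type == "PV0" else "idle"
    return resources

# Usage mirroring the application scenario below: lock r1, then release r2.
state = {"r1": "idle", "r2": "locked"}
apply_resource_lock(state, "PV0 r1")        # r1 becomes locked, cannot be assigned tasks
apply_resource_lock(state, "PV1 r2")        # r2 becomes idle, can be assigned tasks
print(state)                                # {'r1': 'locked', 'r2': 'idle'}
```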
• the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and a neural-network processing unit (NPU).
  • FIGS. 4-3a to 4-3b illustrate schematic diagrams of application scenarios of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.
  • the resource lock instruction processing device processes the resource lock instruction as follows.
• when the control module 11-4 receives the resource lock instruction 1 (e.g., PV0 r1), it parses the resource lock instruction 1 to obtain the operation code and operation domain of the resource lock instruction 1.
  • the operation code of the resource lock instruction 1 is PV0, that is, the resource to be processed is locked.
  • the identifier of the resource to be processed can be determined as r1 according to the operation domain.
  • the control module 11-4 can determine the resource 1 to be processed according to the resource identifier r1 to be processed.
  • the processing module 12-4 locks the resource 1 to be processed according to the lock and release strategy PV0, and obtains the processed resource 1'.
  • the processed resource 1' is in a locked state and cannot be assigned a task.
• when the control module 11-4 receives the resource lock instruction 2 (e.g., PV1 r2), it parses the resource lock instruction 2 to obtain the operation code and operation domain of the resource lock instruction 2.
• the operation code of the resource lock instruction 2 is PV1, and according to the operation code PV1, it can be determined that the lock-and-release strategy is to release the resource to be processed.
  • the resource identifier to be processed is r2.
  • the control module 11-4 may determine the resource to be processed 2 according to the resource identifier to be processed r2.
  • the processing module 12-4 releases the resource 2 to be processed according to the lock and release strategy PV1 to obtain the processed resource 2'.
  • the processed resource 2' is in an idle state and can be assigned tasks.
  • the resource lock instruction processing device can quickly and efficiently perform lock processing on the resource according to the resource lock instruction.
  • FIG. 4-4 shows a flowchart of a resource lock instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, this method is applied to the above resource lock instruction processing device. The method includes step S51-4 and step S52-4.
• In step S51-4, the received resource lock instruction is parsed to obtain the operation code and operation domain of the resource lock instruction, the resource to be processed indicated by the resource lock instruction is determined according to the operation code and operation domain, and the lock-and-release strategy required for the resource lock processing is determined.
  • the operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
  • step S52-4 according to the lock and release strategy, the resource to be processed is locked or released to obtain the processed resource.
  • the operation domain can also be used to indicate the lock and release strategy.
  • the operation code can also be used to indicate the lock and release strategy.
• the lock-and-release strategy may include at least one of locking the resource to be processed and releasing the resource to be processed. The resource to be processed cannot be assigned a task after it is locked, and can be assigned a task after it is released.
  • the resources to be processed may include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
  • the method may further include: storing the identifier of the resource to be processed.
  • parsing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction may include:
  • An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include resource lock and release instructions.
  • the method may further include:
  • the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • a first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
  • the processing method of the resource lock and release instruction provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of locking and releasing resources according to the resource lock and release instruction is high and the processing speed is fast.
• a resource lock instruction processing device, the device comprising:
  • the control module is configured to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the instruction indicated by the resource lock instruction according to the operation code and the operation domain Resources to be processed, and determine the lock strategy required for resource lock processing;
  • a processing module configured to lock or release the resource to be processed according to the lock and release strategy to obtain the processed resource
  • the operation code is used to indicate that the resource lock instruction performs processing on the resource as locking or releasing processing, and the operation domain includes the resource identifier to be processed.
  • Clause C2 The device according to Clause C1, the operation domain is also used to indicate a lock and release strategy.
  • Clause C3 The device according to Clause C1, the operation code is further used to indicate the lock and release strategy.
• the lock-and-release strategy includes at least one of locking the resource to be processed and releasing the resource to be processed,
  • the resources to be processed cannot be assigned tasks after being locked, and the resources to be processed can be assigned tasks after being released.
  • the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
  • the device also includes a storage module for storing the to-be-processed resource identifier
  • control module includes:
  • An instruction storage submodule used to store the resource lock instruction
  • An instruction processing submodule used for parsing the resource lock instruction, and obtaining an operation code and an operation domain of the resource lock instruction
  • a queue storage sub-module which is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the resource lock and release instruction
  • control module also includes:
• the dependency processing sub-module is used to, when there is a dependency relationship between a first instruction to be executed among the plurality of instructions to be executed and a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule and send it to the processing module,
  • the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • a machine learning computing device comprising:
  • the machine learning operation device includes a plurality of the resource lock instruction processing devices
  • the plurality of resource lock instruction processing devices may be connected and transmit data through a specific structure
• a plurality of the resource lock instruction processing apparatuses interconnect and transmit data through a PCIE bus (a fast external device interconnection bus) to support larger-scale machine learning operations; a plurality of the resource lock instruction processing apparatuses share the same control system or have their own control systems; a plurality of the resource lock instruction processing devices share memory or have their own memories; and the interconnection method of the plurality of resource lock instruction processing devices is an arbitrary interconnection topology.
  • a combined processing device comprising:
• the machine learning computing device as described in Clause C7, a general interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • a machine learning chip includes:
  • An electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause C9;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • Clause C12 A method for processing a resource lock instruction. The method is applied to a device for processing a resource lock instruction. The method includes:
• Parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, determine the resource to be processed indicated by the resource lock instruction according to the operation code and the operation domain, and determine the lock-and-release strategy required for the resource lock processing;
  • the operation code is used to indicate that the resource lock instruction performs processing on the resource as locking or releasing processing, and the operation domain includes the resource identifier to be processed.
  • the operation field is also used to indicate a lock and release strategy.
  • Clause C14 The method according to Clause C12, the operation code is also used to indicate the lock-and-release strategy.
  • the lock-and-release strategy includes at least one of locking the resource to be processed and releasing the resource to be processed
  • the resources to be processed cannot be assigned tasks after being locked, and the resources to be processed can be assigned tasks after being released.
  • the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
  • the method further includes: storing the resource identifier to be processed,
  • analyzing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction includes:
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the resource lock and release instruction,
  • the method further includes:
• when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed before the first instruction to be executed, caching the first instruction to be executed, and after the execution of the zeroth instruction to be executed is completed, controlling the execution of the first instruction to be executed,
  • the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
• A tensor is a relatively common data form in neural network algorithms, which is composed of numbers and/or characters. Since tensors have different dimensions, tensors meet the representation needs of various types of data in neural network algorithms. For example, a 0-dimensional tensor can be used to represent a scalar, a 1-dimensional tensor can be used to represent a vector, and a 2-dimensional tensor can be used to represent a matrix.
  • the processing of tensors in the neural network algorithm includes the rearrangement of tensors. In the related art, multiple instructions are required to achieve the rearrangement of tensor data, which is inefficient and slow.
  • FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-1, the device includes a control module 11-5 and a processing module 12-5.
  • the control module 11-5 is used to parse the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the required tensor rearrangement instruction according to the operation code and operation domain The tensor and target address to be processed, and the rearrangement strategy required for the rearrangement process.
  • the operation code is used to instruct the processing performed by the tensor rearrangement instruction on the tensor data to be rearrangement processing
  • the operation domain includes the to-be-processed tensor address and the target address.
  • the processing module 12-5 is configured to perform rearrangement processing on the tensor to be processed according to the rearrangement strategy to obtain the rearrangement tensor, and store the rearrangement tensor into the target address.
  • the tensor can contain multiple forms of data composition.
  • the more common tensor is the matrix form.
  • the tensor can be of different orders.
• a scalar can be regarded as a 0-dimensional tensor, a vector can be regarded as a 1-dimensional tensor, and a tensor of two or more dimensions is a two-dimensional or multi-dimensional matrix.
  • Tensor rearrangement refers to the method of rearranging tensors to obtain rearranged tensors.
• the method of rearranging a tensor may take a certain dimension as the priority for tensor rearrangement, or may take several dimensions as the priority.
• the rearrangement of a 2-dimensional tensor may include one or more of rearrangement by row, by column, by block, and the like.
• rearrangement by row may refer to inputting and/or outputting the data in the tensor in a row-first manner;
• rearrangement by column may refer to inputting and/or outputting the data in the tensor in a column-first manner;
• rearrangement by block may refer to inputting and/or outputting the data in the tensor in a block-first manner.
  • the method of rearranging tensors can be defined by a rearrangement strategy.
• the rearrangement strategy may indicate the relevant parameters for rearranging the tensor, including whether the tensor is input by row, by column, or by block, whether the tensor is output by row, by column, or by block, and the size of the block (or of more than two dimensions) when the input or output is performed by block or by more than two dimensions.
  • different codes can be set for different rearrangement strategies to distinguish different rearrangement strategies.
  • a person skilled in the art can set the rearrangement strategy and the code of the rearrangement strategy according to actual needs, which is not limited in the present disclosure.
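• As a non-authoritative sketch, a rearrangement strategy and its code could be represented as follows; the field names and code values are assumptions chosen only to illustrate the idea of distinguishing different strategies by code:

```python
# Hedged sketch of a rearrangement strategy and a code table; not the disclosed encoding.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class RearrangementStrategy:
    input_priority: str                             # 'row', 'column' or 'block'
    output_priority: str                            # 'row', 'column' or 'block'
    block_shape: Optional[Tuple[int, int]] = None   # only meaningful for block priority

# A possible code table distinguishing different rearrangement strategies.
STRATEGY_CODES = {
    0: RearrangementStrategy("row", "column"),
    1: RearrangementStrategy("column", "row"),
    2: RearrangementStrategy("row", "block", block_shape=(1, 2)),
}

print(STRATEGY_CODES[2])  # RearrangementStrategy(input_priority='row', output_priority='block', block_shape=(1, 2))
```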
  • the control module may obtain the to-be-processed tensor from the to-be-processed tensor address.
  • the to-be-processed tensor address may be a physical address such as the first address storing the to-be-processed tensor, or may be a logical address or a linear address.
  • the control module can store the rearrangement tensor in the target address.
  • the target address may be a physical address such as the first address storing the rearrangement tensor, or a logical address or a linear address.
  • the present disclosure does not limit the way in which tensor addresses and target addresses are treated.
  • the control module may obtain a tensor rearrangement instruction and a tensor to be processed through a data input/output unit.
  • the data input/output unit may be one or more data I/O interfaces or I/O pins.
• a tensor rearrangement instruction may include an operation code and an operation field.
  • the operation code may be a pre-configured instruction sequence number, which is used to inform the device executing the instruction which instruction needs to be executed.
  • the operation domain may include the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the tensor to be processed and the corresponding rearrangement strategy, or to store the tensor to be processed and the corresponding rearrangement strategy. Address and so on.
  • the operation domain may include a tensor address to be processed and a target address.
  • the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure.
  • the control module may receive a tensor rearrangement instruction and control one or more processing modules to perform rearrangement processing.
  • the multiple control modules may respectively receive tensor rearrangement instructions and control the corresponding one or more processing modules to perform rearrangement processing.
  • a tensor rearrangement instruction processing device includes a control module and a processing module.
  • the control module is used to analyze the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the pending tensor required to execute the tensor rearrangement instruction according to the operation code and operation domain And the target address, and determine the rearrangement strategy required for rearrangement processing.
  • the processing module is used to rearrange the processing tensor according to the rearrangement strategy to obtain the rearrangement tensor, and store the rearrangement tensor into the target address.
• Tensor data can be rearranged by a single tensor rearrangement instruction. Compared with the related-art process of implementing tensor data rearrangement by multiple instructions, the rearrangement of tensor data has high processing efficiency, fast processing speed, and a wide application range.
  • the operation domain may further include at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor.
• the processing module 12-5 is further configured to perform rearrangement processing on the tensor to be processed according to at least one of the input shape and the output shape, and the rearrangement strategy, to obtain the rearrangement tensor.
  • the operation domain may also include the shape of the tensor to be processed and/or the shape of the rearranged tensor.
  • the "shape" of the tensor can be the dimensions of the tensor to be processed and different dimensions. It is represented by the number of digits and/or characters present.
  • the shape of the tensor to be processed may represent the dimension of the tensor to be processed and the number of numbers and/or characters present in different dimensions.
  • the shape of the rearrangement tensor may be a dimension representing the rearrangement tensor and the number of numbers and/or characters present in different dimensions.
• the shape of the tensor to be processed is (2, 4), which means that the to-be-processed tensor is a two-dimensional tensor with 2 rows and 4 columns.
• the rearrangement processing of the to-be-processed tensor can be: input by row priority to obtain [1,3,5,7,2,4,6,8], and then output by column priority to obtain the rearrangement tensor [(1,3,5,7),(2,4,6,8)]; the shape of the rearrangement tensor is (4, 2), that is, the rearrangement tensor is a two-dimensional tensor with 4 rows and 2 columns.
  • rearrangement of the to-be-processed tensor can be: column-first input gets [1,2,3 ,4,5,6,7,8], and then output by column priority to get the rearrangement tensor [(1,2,3,4),(5,6,7,8)].
  • the shape of the rearrangement tensor is (4, 2), that is, the rearrangement tensor is a two-dimensional tensor with 4 rows and 2 columns.
• the rearrangement processing of the to-be-processed tensor can be: first input by row priority to obtain [1,3,5,7,2,4,6,8], and then output by block priority with block size (1,2) to obtain the rearrangement tensor [(1,5,2,6),(3,7,4,8)];
• the shape of the rearrangement tensor is (2, 4), that is, the rearrangement tensor is a two-dimensional tensor with 2 rows and 4 columns.
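• Assuming numpy is available, the following sketch illustrates the row-priority read / column-priority write pattern from the examples above, using its own small tensor rather than the disclosed values; it demonstrates the general idea only, not the disclosed implementation:

```python
# Illustrative row-first read, column-first write rearrangement of a 2-D tensor.
import numpy as np

def rearrange_2d(tensor, in_order, out_shape, out_order):
    """Read `tensor` with the given input priority, then write the flat data
    back under `out_shape` with the given output priority.
    in_order / out_order: 'C' = row-first, 'F' = column-first."""
    flat = tensor.flatten(order=in_order)              # input priority
    return flat.reshape(out_shape, order=out_order)    # output priority

t = np.arange(1, 9).reshape(2, 4)          # [[1,2,3,4],[5,6,7,8]], shape (2, 4)
print(rearrange_2d(t, "C", (4, 2), "F"))
# Row-first read, column-first write into shape (4, 2):
# [[1 5]
#  [2 6]
#  [3 7]
#  [4 8]]
```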
  • the default input shape of the tensor to be processed can be preset.
  • the default input shape of the tensor to be processed can be determined as the input shape of the tensor to be processed of the current tensor rearrangement instruction.
  • the default output shape of the rearranged tensor may be preset.
  • the default output shape of the rearranged tensor can be determined as the output shape of the rearranged tensor of the current tensor rearrangement instruction.
  • the dimension of the tensor to be processed and the dimension of the rearranged tensor may be different.
  • the dimensions of the to-be-processed tensor and the rearranged tensor may also be the same.
  • the dimension of the tensor to be processed and the dimension of the rearranged tensor can be set according to actual needs, and this disclosure does not limit this.
  • the input tensor with shape (2,8) is as follows:
• the rearrangement processing of the to-be-processed tensor can be: input by row priority to obtain [1,9,2,10,3,11,4,12,5,13,6,14,7,15,8,16], and then output by three-dimensional priority to obtain the rearrangement tensor [[(1,2,3,4),(5,6,7,8)],[(9,10,11,12),(13,14,15,16)]].
  • the operation domain may also be used to indicate the rearrangement strategy.
  • the operation code can also be used to indicate the rearrangement strategy.
  • a default rearrangement strategy can also be set.
  • the default rearrangement strategy can be determined as the rearrangement strategy of the current tensor rearrangement instruction.
  • the device may further include a storage module 13-5.
  • the storage module 13-5 is used to store the tensor to be rearranged.
  • the storage module may include one or more of a memory, a cache, and a register
  • the cache may include a temporary storage cache.
  • the tensor to be rearranged can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in the present disclosure.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • control module 11-5 may include an instruction storage submodule 111-5, an instruction processing submodule 112-5, and a queue storage submodule 113-5.
  • the instruction storage submodule 111-5 is used to store tensor rearrangement instructions.
  • the instruction processing sub-module 112-5 is used to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction.
  • the queue storage sub-module 113-5 is used to store an instruction queue.
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include tensor reordering instructions.
  • the plurality of instructions to be executed may include other calculation instructions related to the tensor rearrangement instruction.
  • the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
  • control module 11-5 may further include a dependency processing sub-module 114-5.
• when the first instruction to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, the dependency processing submodule 114-5 may cache the first instruction to be executed in the instruction storage submodule 111-5, and after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule 111-5 and send it to the processing module 12-5.
  • the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.
• the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction has an overlapping area with a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
  • there is no dependency relationship between the first instruction to be executed and the zeroth instruction to be executed may be that there is no overlapping area between the first storage address interval and the zeroth storage address interval.
  • the instruction format of the tensor rearrangement instruction may be:
  • Tiling is the operation code, dst, src, type, src_shape, dst_shape are the operation domain. Tiling is used to indicate that the instruction is a tensor rearrangement instruction. dst is the target address. src is the address of the tensor to be processed. type is the rearrangement strategy. src_shape is the input shape. dst_shape is the output shape.
  • the instruction format of the tensor rearrangement instruction may be:
  • Tiling.type is the operation code, dst, src, src_shape, dst_shape are the operation domain.
  • Tiling in Tiling.type is used to indicate that the instruction is a tensor rearrangement instruction, and type in Tiling.type is a rearrangement strategy.
  • dst is the target address.
  • src is the address of the tensor to be processed.
  • src_shape is the input shape.
  • dst_shape is the output shape.
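• For illustration only, and assuming a plain-text encoding of the two instruction formats above, the following sketch decodes a tensor rearrangement instruction into its operation code and operation domain fields; the tokenization is an assumption, not the disclosed binary format:

```python
# Hedged sketch of decoding "Tiling dst, src, type, src_shape, dst_shape"
# and "Tiling.type dst, src, src_shape, dst_shape" style instruction text.
def parse_tiling(instruction_text):
    opcode, rest = instruction_text.split(maxsplit=1)
    fields = [f.strip() for f in rest.split(",")]
    if opcode == "Tiling":                        # strategy carried in the operation domain
        dst, src, strategy, src_shape, dst_shape = fields
    else:                                         # e.g. "Tiling.row": strategy in the opcode
        strategy = opcode.split(".", 1)[1]
        dst, src, src_shape, dst_shape = fields
    return {"dst": dst, "src": src, "type": strategy,
            "src_shape": src_shape, "dst_shape": dst_shape}

# Example mirroring the application scenario below.
print(parse_tiling("Tiling 200, 100, type, S1, S2"))
# {'dst': '200', 'src': '100', 'type': 'type', 'src_shape': 'S1', 'dst_shape': 'S2'}
```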
• the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and a neural-network processing unit (NPU).
  • FIG. 5-3 shows a schematic diagram of an application scenario of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.
  • the tensor rearrangement instruction processing device processes the tensor rearrangement instruction as follows.
• when the control module 11-5 receives the tensor rearrangement instruction 1 (for example: Tiling 200, 100, type, S1, S2), it parses the tensor rearrangement instruction 1 to obtain the operation code and operation field of the tensor rearrangement instruction 1.
• the operation code of the tensor rearrangement instruction 1 is Tiling, and it can be determined according to the operation domain that: the rearrangement strategy is type, the tensor address to be processed is 100, the input shape is S1, the target address is 200, and the output shape is S2. Furthermore, the control module 11-5 acquires the to-be-processed tensor a whose input shape is S1 from the to-be-processed tensor address 100.
  • the processing module 12-5 performs rearrangement processing on the to-be-processed tensor a according to the rearrangement strategy type, the input shape, and the output shape to obtain the rearrangement tensor b, and stores the rearrangement tensor b into the target address 200.
• the tensor rearrangement instruction 1 can be Tiling 200, 100, type, S1, S2, or Tiling.type 200, 100, S1, S2.
  • the processing procedure of the tensor reordering commands in different command formats is similar and will not be repeated here.
  • the tensor rearrangement instruction processing device can quickly and efficiently process the tensor rearrangement instruction to complete the process of rearranging the tensor.
  • FIG. 5-4 shows a flowchart of a tensor reordering instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 5-4, this method is applied to the above-mentioned tensor rearrangement instruction processing apparatus. The method includes step S51-5 and step S52-5.
• In step S51-5, the received tensor rearrangement instruction is parsed to obtain the operation code and the operation domain of the tensor rearrangement instruction, the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction are determined according to the operation code and the operation domain, and the rearrangement strategy required for the rearrangement processing is determined.
  • the operation code is used to instruct the processing performed by the tensor rearrangement instruction on the tensor data to be rearrangement processing
  • the operation domain includes the to-be-processed tensor address and the target address.
  • step S52-5 rearrangement processing is performed on the tensor to be processed according to the rearrangement strategy to obtain the rearrangement tensor, and the rearrangement tensor is stored in the target address.
  • the operation domain may further include at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor.
  • rearranging the to-be-processed tensor according to the rearrangement strategy to obtain the rearrangement tensor may include: rearranging the to-be-processed tensor according to at least one of the input shape and the output shape and the rearrangement strategy to obtain Rearrange tensors.
  • the dimension of the tensor to be processed and the dimension of the rearranged tensor may be different.
  • the operation domain may also be used to indicate the rearrangement strategy.
  • the operation code can also be used to indicate the rearrangement strategy.
  • the method may further include: storing the tensor to be processed.
  • parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction may include:
  • An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include tensor reordering instructions.
  • the method may further include:
• the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction has an overlapping area with a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
  • the tensor rearrangement instruction processing method provided by the embodiment of the present disclosure can realize the rearrangement processing of tensor data through one tensor rearrangement instruction, and the related art realizes the rearrangement processing of tensor data through multiple instructions Compared with the process of, the rearrangement of tensor data has high processing efficiency, fast processing speed, and wide application range.
  • a tensor rearrangement instruction processing device comprising:
• the control module is configured to parse the received tensor rearrangement instruction, obtain an operation code and an operation domain of the tensor rearrangement instruction, determine, according to the operation code and the operation domain, the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction, and determine the rearrangement strategy required for the rearrangement processing;
• the processing module is configured to perform rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearrangement tensor, and store the rearrangement tensor into the target address,
  • the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on the tensor data is rearrangement processing, and the operation field includes the to-be-processed tensor address and the target address.
  • the operation domain further includes at least one of an input shape of a tensor to be processed and an output shape of a rearranged tensor,
  • the processing module is further configured to perform rearrangement processing on the to-be-processed tensor according to at least one of the input shape and the output shape and the rearrangement strategy to obtain the rearrangement tensor.
  • Clause D3 The apparatus according to Clause D1, the dimension of the to-be-processed tensor is different from the dimension of the rearrangement tensor.
  • Clause D4 The apparatus according to Clause D1, the operation field is further used to indicate a rearrangement strategy.
  • Clause D5 The apparatus according to Clause D1, the operation code is further used to indicate the rearrangement strategy.
  • the device further includes a storage module for storing the to-be-processed tensor,
  • control module includes:
  • An instruction processing sub-module which is used to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction;
  • a queue storage sub-module which is used to store an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed include the tensor reordering instructions
  • control module also includes:
• the dependency processing sub-module is used to, when there is a dependency relationship between a first instruction to be executed among the plurality of instructions to be executed and a zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and, after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule and send it to the processing module,
  • the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • a machine learning computing device comprising:
• One or more tensor rearrangement instruction processing devices as described in any one of Clauses D1 to D5, used to obtain the to-be-processed tensor and control information from other processing devices, perform the specified machine learning operation, and transmit the execution result to other processing devices through the I/O interface;
  • the machine learning operation device includes a plurality of the tensor rearrangement instruction processing devices
  • the plurality of the tensor rearrangement instruction processing devices may be connected and transmit data through a specific structure
  • a plurality of the tensor rearrangement instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the tensor rearrangement instruction processing devices Sharing the same control system or having its own control system; a plurality of the tensor rearrangement instruction processing devices share memory or have their own memories; the interconnection method of the plurality of tensor rearrangement instruction processing devices is any interconnection topology.
  • a combined processing device comprising:
  • the machine learning computing device as described in Clause D7, a general interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes a storage device respectively connected to the machine learning computing device and the other processing device, and used for storing data of the machine learning computing device and the other processing device.
  • Clause D9 A machine learning chip, the machine learning chip including the machine learning computing device as described in Clause D7 or the combined processing device as described in Clause D8.
  • Clause D10 An electronic device, the electronic device comprising:
  • a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause D9;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • a tensor rearrangement instruction processing method is applied to a tensor rearrangement instruction processing apparatus.
  • the method includes:
  • the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on the tensor data is rearrangement processing, and the operation field includes the to-be-processed tensor address and the target address.
  • the operation domain further includes at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor,
  • performing rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearrangement tensor includes:
  • performing rearrangement processing on the to-be-processed tensor according to at least one of the input shape and the output shape and the rearrangement strategy to obtain the rearrangement tensor.
  • the dimension of the to-be-processed tensor is different from the dimension of the rearrangement tensor.
  • Clause D15 The method according to Clause D12, the operation field is used to indicate a rearrangement strategy.
  • Clause D16 The method according to Clause D12, the operation code is further used to indicate the rearrangement strategy.
  • the method also includes storing the to-be-processed tensor,
  • parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction includes:
  • storing an instruction queue, the instruction queue including a plurality of instructions to be executed that are sequentially arranged according to an execution order, the plurality of instructions to be executed including the tensor rearrangement instruction;
  • the method further includes:
  • when it is determined that the first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, caching the first instruction to be executed, and after the execution of the zeroth instruction to be executed is completed, controlling execution of the first instruction to be executed,
  • the dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
  • FIG. 6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
  • the device is used to perform machine learning calculations.
  • the device includes a control module 11-6 and a processing module 12-6.
  • the processing module 12-6 includes a data transfer submodule 121-6 and an accumulation submodule 122-6.
  • the control module 11-6 is used to obtain calculation instructions and obtain input data required to execute the calculation instructions.
  • the data transfer submodule 121-6 is used to process the input data according to the calculation instruction to obtain a plurality of intermediate results, and send the plurality of intermediate results to the accumulation submodule 122-6 in sequence.
  • the accumulation submodule 122-6 is used to perform a cyclic accumulation operation on a plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • The cyclic accumulation operation may be as follows: an accumulation result is obtained by adding up the intermediate result received in the "current operation cycle", and when a further intermediate result is received in a "later operation cycle", that intermediate result is added to the accumulation result to obtain a new accumulation result.
  • The "later operation cycle" may be the first, second, third or any other operation cycle after the "current operation cycle".
  • How many cycles after the "current operation cycle" the "later operation cycle" falls may be set according to the computing power of the device and other factors, which is not limited in this disclosure.
  • the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure.
  • the data processing device provided by the embodiment of the present disclosure includes a control module and a processing module.
  • the processing module includes a data transfer submodule and an accumulation submodule.
  • the control module is used to obtain calculation instructions and obtain input data required to execute the calculation instructions.
  • the data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, and send the multiple intermediate results to the accumulation sub-module in sequence.
  • the accumulation submodule is used to perform a cyclic accumulation operation on multiple intermediate results to obtain the calculation result of the calculation instruction.
  • The data processing device provided by the embodiments of the present disclosure reduces the amount of data access and computation by cyclically accumulating a plurality of intermediate results, keeps the calculation accuracy lossless, and can effectively increase the data processing speed.
  • the loop accumulation process of the accumulation sub-module can be set according to the actual needs of the device, such as the computing power. Examples of loop accumulation processes in Mode 1 and Mode 2 are given below. It should be noted that those skilled in the art can set the loop accumulation process according to actual needs, which is not limited in the present disclosure.
  • In Mode 1, the accumulation submodule 122-6 performs a cyclic accumulation operation on multiple intermediate results, which may include: in a first operation cycle in which an intermediate result is received, adding the received intermediate result to the first intermediate data of the current operation cycle to obtain a first accumulation result, and storing the first accumulation result as the first intermediate data of the next operation cycle; and in a second operation cycle in which no intermediate result is received, determining the first intermediate data of the current operation cycle as the calculation result.
  • The value of the first intermediate data in the initial operation cycle is zero.
  • the "first operation cycle when the intermediate result is received" described in the first way may be any operation cycle when the accumulation submodule receives the intermediate result, and "the second operation cycle when the intermediate result is not received” It may be an operation cycle when the accumulation submodule does not receive the intermediate result.
  • the first calculation cycle of receiving the intermediate result describes the process of repeated execution of the accumulation submodule, and the “second calculation cycle of not receiving the intermediate result” is the process of finally determining the calculation result of the accumulation submodule.
  • The accumulation sub-module may cyclically execute a plurality of "first operation cycles in which an intermediate result is received" and then execute one "second operation cycle in which no intermediate result is received", thereby completing the operation on the plurality of intermediate results.
  • An example in which the accumulation sub-module performs loop accumulation on multiple intermediate results in Mode 1 is given below, assuming that the intermediate results "1", "2" and "3" are received in the first, second and third operation cycles respectively.
  • The first operation cycle, the second operation cycle and the third operation cycle are equivalent to the "first operation cycle in which an intermediate result is received" in Mode 1 above, and the fourth operation cycle is equivalent to the "second operation cycle in which no intermediate result is received" in Mode 1 above.
  • In the first operation cycle, the accumulation submodule receives the intermediate result "1", and adds the intermediate result "1" to the first intermediate data "0" of the first operation cycle to obtain the first accumulation result "0+1". Then, the first accumulation result "0+1" is stored as the first intermediate data "0+1" of the second operation cycle (that is, the next operation cycle).
  • In the second operation cycle, the accumulation submodule receives the intermediate result "2", and adds the intermediate result "2" to the first intermediate data "0+1" of the second operation cycle to obtain the first accumulation result "0+1+2" of the second operation cycle. Then, the first accumulation result "0+1+2" of the second operation cycle is stored as the first intermediate data "0+1+2" of the third operation cycle (that is, the next operation cycle).
  • In the third operation cycle, the accumulation submodule receives the intermediate result "3", and adds the intermediate result "3" to the first intermediate data "0+1+2" of the third operation cycle to obtain the first accumulation result "0+1+2+3" of the third operation cycle. Then, the first accumulation result "0+1+2+3" of the third operation cycle is stored as the first intermediate data "0+1+2+3" of the fourth operation cycle (that is, the next operation cycle).
  • In the fourth operation cycle, the accumulation submodule does not receive an intermediate result, and determines the first intermediate data "0+1+2+3" of the fourth operation cycle as the calculation result.
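  • A minimal sketch of Mode 1 as traced above, modeling the accumulation submodule as a plain loop over per-cycle inputs; using None to mark an operation cycle in which no intermediate result is received is an assumption of the sketch.

```python
def accumulate_mode1(cycles):
    """Mode 1: keep one running value and add every received intermediate result to it."""
    first_intermediate = 0  # value of the first intermediate data in the initial cycle
    for intermediate in cycles:
        if intermediate is None:
            # operation cycle without an intermediate result: the stored value is the result
            return first_intermediate
        # first accumulation result, stored as the next cycle's first intermediate data
        first_intermediate += intermediate
    return first_intermediate


# Reproduces the worked example: intermediate results 1, 2, 3, then an empty cycle -> 6
assert accumulate_mode1([1, 2, 3, None]) == 6
```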
  • In Mode 2, the accumulation submodule 122-6 performs a cyclic accumulation operation on a plurality of intermediate results, which may include: in a third operation cycle in which an intermediate result is received, adding the received intermediate result to the third intermediate data of the current operation cycle to obtain a second accumulation result, storing the second intermediate data of the current operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle; and in a fourth operation cycle in which no intermediate result is received, adding the second intermediate data and the third intermediate data of the fourth operation cycle to obtain the calculation result.
  • The value of the second intermediate data and the third intermediate data in the initial operation cycle is zero.
  • the "third operation cycle of receiving the intermediate result” described in the second way may be any operation cycle of the intermediate result received by the accumulation submodule, and "the fourth operation cycle of not receiving the intermediate result” It may be an operation cycle when the accumulation submodule does not receive the intermediate result.
  • the third operation cycle of receiving the intermediate result describes the process of repeated execution of the accumulation sub-module, and the “fourth operation cycle of not receiving the intermediate result” is the process of the final determination of the calculation result of the accumulation sub-module.
  • The accumulation sub-module may cyclically execute a plurality of "third operation cycles in which an intermediate result is received" and then execute one "fourth operation cycle in which no intermediate result is received", thereby completing the operation on the plurality of intermediate results.
  • An example in which the accumulation sub-module performs loop accumulation on multiple intermediate results in Mode 2 is given below, assuming that the intermediate results "1", "2", "3" and "4" are received in the first, second, third and fourth operation cycles respectively.
  • The first operation cycle, the second operation cycle, the third operation cycle and the fourth operation cycle are equivalent to the "third operation cycle in which an intermediate result is received" in Mode 2 above, and the fifth operation cycle is equivalent to the "fourth operation cycle in which no intermediate result is received" in Mode 2 above.
  • In the first operation cycle, the accumulation submodule receives the intermediate result "1", and adds the intermediate result "1" to the third intermediate data "0" of the first operation cycle to obtain the second accumulation result "0+1" of the first operation cycle. Then, the second intermediate data "0" of the first operation cycle is stored as the third intermediate data of the second operation cycle (that is, the next operation cycle), and the second accumulation result "0+1" of the first operation cycle is stored as the second intermediate data of the second operation cycle.
  • In the second operation cycle, the accumulation submodule receives the intermediate result "2", and adds the intermediate result "2" to the third intermediate data "0" of the second operation cycle to obtain the second accumulation result "0+2" of the second operation cycle. Then, the second intermediate data "0+1" of the second operation cycle is stored as the third intermediate data of the third operation cycle (that is, the next operation cycle), and the second accumulation result "0+2" of the second operation cycle is stored as the second intermediate data of the third operation cycle.
  • In the third operation cycle, the accumulation submodule receives the intermediate result "3", and adds the intermediate result "3" to the third intermediate data "0+1" of the third operation cycle to obtain the second accumulation result "0+1+3" of the third operation cycle. Then, the second intermediate data "0+2" of the third operation cycle is stored as the third intermediate data of the fourth operation cycle (that is, the next operation cycle), and the second accumulation result "0+1+3" of the third operation cycle is stored as the second intermediate data of the fourth operation cycle.
  • In the fourth operation cycle, the accumulation submodule receives the intermediate result "4", and adds the intermediate result "4" to the third intermediate data "0+2" of the fourth operation cycle to obtain the second accumulation result "0+2+4" of the fourth operation cycle. Then, the second intermediate data "0+1+3" of the fourth operation cycle is stored as the third intermediate data of the fifth operation cycle (that is, the next operation cycle), and the second accumulation result "0+2+4" of the fourth operation cycle is stored as the second intermediate data of the fifth operation cycle.
  • In the fifth operation cycle, the accumulation submodule does not receive an intermediate result, and adds the second intermediate data "0+2+4" of the fifth operation cycle and the third intermediate data "0+1+3" of the fifth operation cycle to obtain the second accumulation result "0+1+2+3+4" of the fifth operation cycle.
  • The second accumulation result "0+1+2+3+4" of the fifth operation cycle is determined as the calculation result.
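  • A corresponding sketch of Mode 2, which keeps two alternating registers and combines them once no further intermediate result arrives; the None convention and the list-based modeling are again assumptions of the sketch.

```python
def accumulate_mode2(cycles):
    """Mode 2: two registers are updated in a staggered way and added at the end."""
    second_intermediate = 0  # second intermediate data of the current cycle
    third_intermediate = 0   # third intermediate data of the current cycle
    for intermediate in cycles:
        if intermediate is None:
            # cycle without an intermediate result: add the two registers
            return second_intermediate + third_intermediate
        second_accumulation = third_intermediate + intermediate
        third_intermediate = second_intermediate    # becomes next cycle's third intermediate data
        second_intermediate = second_accumulation   # becomes next cycle's second intermediate data
    return second_intermediate + third_intermediate


# Reproduces the worked example: intermediate results 1, 2, 3, 4, then an empty cycle -> 10
assert accumulate_mode2([1, 2, 3, 4, None]) == 10
```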
  • the machine learning calculation may include artificial neural network operations
  • the input data may include input neuron data and weight data
  • the calculation result is output neuron data.
  • the data type of the input data may include at least one of exponential type and dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.
  • The data transfer submodule 121-6 is used to process the input data according to the calculation instruction to obtain multiple intermediate results, which may include: the data transfer submodule is used to perform a shift operation on the weight data or the input neuron data according to the calculation instruction to obtain the intermediate results.
  • The exponential input data may include exponent bits; the value of the exponential input data is the result of raising the specified value (the base) to the power stored in the exponent bits.
  • The dynamic fixed-point input data may include decimal-point bits and integer bits.
  • The data stored in the decimal-point bits marks the position of the decimal point of the dynamic fixed-point input data within the data stored in the integer bits, so as to distinguish the integer part from the fractional part of the data in the integer bits.
  • The specified value corresponding to the exponential input data is the same as the radix of the input data. For example, if the specified value is 2, the input data needs to be binary data. In this way, it is ensured that the input data can be shifted.
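  • To make the two encodings concrete, the following sketch decodes both formats based on the numeric examples that appear later in this section; the function names, the sign-magnitude reading of the exponent bits and the left-counted decimal point position are assumptions drawn from those examples rather than definitions fixed by the source.

```python
def decode_exponential(bits, base=2):
    """Decode exponential data: a sign bit followed by the exponent magnitude bits."""
    sign = -1 if bits[0] == "1" else 1
    exponent = sign * int(bits[1:], 2)
    return base ** exponent


def decode_dynamic_fixed_point(integer_bits, point_bits):
    """Decode dynamic fixed-point data: the decimal-point field stores the position of
    the decimal point, counted from the left, inside the integer bits."""
    position = int(point_bits, 2)
    int_part = int(integer_bits[:position], 2) if position > 0 else 0
    frac_bits = integer_bits[position:]
    frac_part = int(frac_bits, 2) / (2 ** len(frac_bits)) if frac_bits else 0
    return int_part + frac_part


assert decode_exponential("01010") == 1024                       # 2 ** 10
assert decode_exponential("10001") == 0.5                        # 2 ** -1
assert decode_dynamic_fixed_point("11001000", "0100") == 12.5    # 1100.1000 in binary
assert decode_dynamic_fixed_point("0110001000", "0110") == 24.5  # 011000.1000 in binary
```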
  • the input neuron data may be exponential data, while the weight data is dynamic fixed-point data.
  • the input neuron data may be dynamic fixed-point data, and the weight data is exponential data.
  • a person skilled in the art may set the types of input neuron data and weight data according to actual needs, which is not limited in the present disclosure.
  • The shift operation performed on the weight data or the input neuron data according to the calculation instruction may be as follows: when it is determined, according to the calculation instruction, that the operation to be performed on the weight data and the input neuron data is a multiplication, the multiplication of the weight data and the input neuron data can be realized by shifting the input neuron data or the weight data.
  • The shift operation may determine the number of bits to move and the direction of movement based on the exponential-type operand among the weight data and the input neuron data, move the decimal point position of the dynamic fixed-point operand by that number of bits in that direction, change the value stored in its decimal-point bits to reflect the new decimal point position, and thereby determine the calculation result.
  • That is, the value stored in the exponent bits of the exponential-type operand is added to the value stored in the decimal-point bits of the dynamic fixed-point operand, and the addition result replaces the data originally stored in the decimal-point bits of the dynamic fixed-point data, which yields the result of multiplying the weight data by the input neuron data.
  • The radix of the input data may be binary, decimal, hexadecimal, or the like, which is not limited in this disclosure.
  • FIG. 6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure.
  • As an example of the data transfer submodule operating on exponential weight data and dynamic fixed-point input neuron data, assume that the exponential weight data is binary "00001" (the decimal value corresponding to the weight data is 2^1 = 2).
  • Assume that the dynamic fixed-point input neuron data is binary "11001000, 0100" (the decimal value corresponding to the input neuron data is 12.5), in which the first 8 bits are the integer bits and the last 4 bits are the decimal-point bits.
  • the control module obtains the above two input data and calculation instructions.
  • When the processing module determines, according to the calculation instruction, that the exponential weight data "00001" needs to be multiplied by the dynamic fixed-point input neuron data "11001000, 0100", it can determine from the exponential weight data "00001" that the shift operation to be performed on the input neuron data is "shift the decimal point position to the right by 1".
  • That is, multiplying the exponential weight data "00001" by the dynamic fixed-point input neuron data "11001000, 0100" gives the result "11001000, 0101" (the decimal number corresponding to the calculation result is 25).
  • The "," in the dynamic fixed-point input neuron data "11001000, 0100" only separates the integer bits from the decimal-point bits, and the "," may not be present in actual use.
  • the “,” in the input data of the dynamic fixed-point type below is the same as here, and will not be explained later.
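  • The worked example above amounts to adding the exponent to the stored decimal point position, as the following sketch shows; the function name, the 4-bit width of the decimal-point field and the absence of any overflow handling are assumptions of the sketch.

```python
def multiply_by_exponential(integer_bits, point_bits, exp_bits, point_width=4):
    """Multiply dynamic fixed-point data by exponential data via a decimal-point shift:
    the signed exponent is added to the stored decimal point position, while the
    integer bits stay unchanged."""
    sign = -1 if exp_bits[0] == "1" else 1
    exponent = sign * int(exp_bits[1:], 2)
    new_position = int(point_bits, 2) + exponent
    return integer_bits, format(new_position, f"0{point_width}b")


# Reproduces the FIG. 6-2 example: 12.5 ("11001000", "0100") times 2 ("00001") = 25.
# "11001000" with the decimal point after the 5th bit reads 11001.000 in binary, i.e. 25.
assert multiply_by_exponential("11001000", "0100", "00001") == ("11001000", "0101")
```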
  • the device may further include a first type conversion module.
  • the first type conversion module is used to convert the received data to be processed into first data with a specified value as the base, and generate exponential input data according to the exponent of the first data. Among them, the exponent bit of the exponential input data is used to store the exponent.
  • The exponent of the first data obtained by converting the data to be processed received by the first type conversion module needs to be an integer, so as to ensure that the input data can be shifted.
  • the number of bits occupied by the exponent bits can be set according to actual needs, for example, 5 bits, which is not limited in this disclosure.
  • the exponential input data may further include a designated value bit, which is used to mark the designated value of the input data.
  • the exponent bit also includes a sign bit, which is used to indicate whether the data stored in the exponent bit is positive or negative.
  • For example, assume the received data to be processed is 1024, the specified value is set to 2, and the input data is binary. The first type conversion module may convert the data to be processed "1024" into the first data "2^10" with 2 (the specified value) as the base.
  • The exponential binary input data "01010" is then generated from the exponent "10" of the first data "2^10".
  • As another example, assume the received data to be processed is 0.5, the specified value is set to 2, and the input data is binary.
  • The first type conversion module may convert the data to be processed "0.5" into the first data "2^-1" with 2 (the specified value) as the base.
  • The exponential binary input data "10001" is then generated from the exponent "-1" of the first data "2^-1".
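  • A minimal sketch of the first type conversion under the assumptions used in the two examples above (base 2, a 5-bit exponent field whose leading bit is the sign); the function name and the error handling are not from the source.

```python
import math


def to_exponential(value, base=2, width=5):
    """Convert an exact power of `base` into the sign-magnitude exponent encoding."""
    exponent = math.log(value, base)
    if abs(exponent - round(exponent)) > 1e-9:
        # the text requires an integer exponent so that the data can be shifted
        raise ValueError("exponent must be an integer")
    exponent = round(exponent)
    sign_bit = "1" if exponent < 0 else "0"
    return sign_bit + format(abs(exponent), f"0{width - 1}b")


assert to_exponential(1024) == "01010"   # 1024 = 2 ** 10
assert to_exponential(0.5) == "10001"    # 0.5  = 2 ** -1
```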
  • the device may further include a second type conversion module.
  • The second type conversion module is used to convert the received data to be processed into second data representing the value of the integer part of the data to be processed and third data representing the value of its fractional part, and to generate dynamic fixed-point input data according to the second data, the third data, and the position of the decimal point of the data to be processed.
  • The integer bits of the dynamic fixed-point input data are used to store the second data and the third data, and the data stored in the decimal-point bits of the dynamic fixed-point input data is used to mark the position of the decimal point of the data to be processed within the data stored in the integer bits.
  • the data to be processed received by the second type conversion module may be a decimal. For example, 123.4 (decimal), etc.
  • Those skilled in the art can set the total number of bits occupied by the input data of the dynamic fixed-point type and the number of bits occupied by the integer and decimal points according to actual needs, which is not limited in the present disclosure.
  • For example, assume the received data to be processed is 24.5 (decimal). The second type conversion module may convert the integer part "24" of the data to be processed into the binary second data "11000", and convert the fractional part "0.5" of the data to be processed into the binary third data "0.1000". The integer bits of the dynamic fixed-point input data can then be stored as "0110001000". Since the decimal point is located after the sixth place of the "0110001000" stored in the integer bits, the position of the decimal point can be represented by "0110". The dynamic fixed-point input data finally generated by the second type conversion module from the data to be processed "24.5" is therefore "0110001000, 0110".
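  • A minimal sketch of the second type conversion, with the bit widths of the 24.5 example above taken as assumed defaults (the text explicitly leaves the widths to the implementer); non-negative inputs are assumed.

```python
def to_dynamic_fixed_point(value, int_width=6, frac_width=4, point_width=4):
    """Convert a non-negative decimal value into (integer bits, decimal-point bits)."""
    int_part = int(value)
    frac_part = value - int_part
    second_data = format(int_part, f"0{int_width}b")                             # integer part
    third_data = format(round(frac_part * 2 ** frac_width), f"0{frac_width}b")   # fractional part
    point_bits = format(int_width, f"0{point_width}b")  # decimal point sits after int_width bits
    return second_data + third_data, point_bits


assert to_dynamic_fixed_point(24.5) == ("0110001000", "0110")
```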
  • the device may further include a storage module 13-6.
  • The storage module 13-6 is used to store the input data.
  • the storage module may include one or more of a memory, a cache, and a register
  • the cache may include a temporary storage cache.
  • The input data can be stored in the memory, cache, and/or register of the storage module as needed, which is not limited in the present disclosure.
  • the device may further include a direct memory access module, which is used to read or store data from the storage module.
  • control module 11-6 may include an instruction storage submodule 111-6, an instruction processing submodule 112-6, and a queue storage submodule 113-6.
  • The instruction storage submodule 111-6 is used to store the calculation instruction.
  • The instruction processing sub-module 112-6 is used to parse the calculation instruction to obtain a plurality of operation instructions of the calculation instruction.
  • The queue storage sub-module 113-6 is used to store an instruction queue.
  • The instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to the execution order, and the plurality of instructions to be executed may include the plurality of operation instructions.
  • The plurality of instructions to be executed may also include other calculation instructions related to the calculation instruction.
  • the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
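  • One way to realize such an ordering is to sort by priority first and reception time second, as sketched below; the class name, the "smaller value means higher priority" convention and the integer timestamps are assumptions of the sketch.

```python
from dataclasses import dataclass


@dataclass
class PendingInstruction:
    name: str
    priority: int     # smaller value = higher priority (assumption of the sketch)
    received_at: int  # reception time / arrival order


def build_instruction_queue(instructions):
    """Order the instructions to be executed by priority, then by reception time."""
    return sorted(instructions, key=lambda ins: (ins.priority, ins.received_at))


queue = build_instruction_queue([
    PendingInstruction("load", priority=1, received_at=2),
    PendingInstruction("calc", priority=0, received_at=3),
    PendingInstruction("store", priority=1, received_at=1),
])
assert [ins.name for ins in queue] == ["calc", "store", "load"]
```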
  • control module 11-6 may further include a dependency processing sub-module 114-6.
  • The dependency processing sub-module 114-6 is configured to, when it is determined that there is an association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module 111-6, and after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module 111-6 and send it to the processing module 12-6.
  • the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.
  • The first instruction to be executed being associated with the zeroth instruction to be executed before it means that the first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have an overlapping area. Conversely, no association between the first instruction to be executed and the zeroth instruction to be executed may mean that the first storage address interval and the zeroth storage address interval have no overlapping area.
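  • The cache-or-send decision of the dependency processing sub-module can then be sketched as a combination of the overlap test and the completion status of the zeroth instruction; the dictionary-based instruction description is an assumption of the sketch.

```python
def must_cache(first_instr, zeroth_instr, zeroth_done):
    """Cache the first instruction only while the zeroth instruction is unfinished
    and the two instructions' data address intervals (half-open) overlap."""
    overlaps = (first_instr["addr_start"] < zeroth_instr["addr_end"] and
                zeroth_instr["addr_start"] < first_instr["addr_end"])
    return overlaps and not zeroth_done


zeroth = {"addr_start": 0x100, "addr_end": 0x200}
first = {"addr_start": 0x180, "addr_end": 0x280}
assert must_cache(first, zeroth, zeroth_done=False) is True   # overlap, still running: cache
assert must_cache(first, zeroth, zeroth_done=True) is False   # finished: send to processing module
```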
  • the processing module 12-6 may include a master processing sub-module 124 and multiple slave processing sub-modules 125.
  • Each slave processing sub-module 125 may include a data transmission sub-module and an accumulation sub-module (not shown in the figure).
  • The control module 11-6 is also used to parse the calculation instruction to obtain a plurality of operation instructions, and send the input data and the plurality of operation instructions to the main processing sub-module 124.
  • The main processing sub-module 124 is used to perform pre-processing on the input data and to transmit data and operation instructions with the plurality of slave processing sub-modules 125.
  • The slave processing sub-modules 125 are used to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing sub-module 124 to obtain multiple intermediate results, and to transmit the multiple intermediate results to the main processing sub-module 124.
  • the intermediate operation may be arithmetic, logic, and other operations on the data.
  • When the input data includes input neuron data and weight data of the different data types described above, and the intermediate operation to be performed according to the operation instruction is determined to be multiplying the input neuron data by the weight data, the input neuron data or the weight data can be shifted to obtain the intermediate result.
  • the main processing sub-module 124 is also used to perform subsequent processing on a plurality of intermediate results to obtain calculation results, and store the calculation results in the target address.
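  • A toy sketch of this master/slave flow, in which each slave submodule produces one intermediate result and the master accumulates them; using an element-wise multiply-and-sum as the intermediate operation is an assumption standing in for whatever the operation instructions actually specify.

```python
def run_master_slave(input_chunks, weight_chunks):
    """Each (input chunk, weight chunk) pair is handled by one slave submodule in
    parallel on real hardware; here the per-slave work is simply run in a loop."""
    intermediate_results = [
        sum(x * w for x, w in zip(xs, ws))   # one slave submodule's intermediate result
        for xs, ws in zip(input_chunks, weight_chunks)
    ]
    calculation_result = 0
    for result in intermediate_results:      # master's cyclic accumulation
        calculation_result += result
    return calculation_result


assert run_master_slave([[1, 2], [3, 4]], [[1, 1], [2, 2]]) == (1 + 2) + (6 + 8)
```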
  • The architecture of the processing module may be an "H"-type architecture, an array-type architecture, a tree-type architecture, or the like, which is not limited in this disclosure.
  • The processing module 12-6 may further include one or more branch processing sub-modules 126, and the branch processing sub-module 126 is used to forward data and/or operation instructions between the main processing sub-module 124 and the slave processing sub-modules 125.
  • the main processing sub-module 124 is connected to one or more branch processing sub-modules 126.
  • The main processing sub-module, the branch processing sub-modules and the slave processing sub-modules in the processing module are connected in an "H"-type architecture, and data and/or operation instructions are forwarded through the branch processing sub-modules, thereby saving the resources of the main processing sub-module and in turn increasing the processing speed of instructions.
  • FIGS. 6-5b show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure.
  • multiple slave processing sub-modules 125 are distributed in an array.
  • Each slave processing sub-module 125 is connected to the other adjacent slave processing sub-modules 125, and the master processing sub-module 124 is connected to k slave processing sub-modules 125 of the plurality of slave processing sub-modules 125, where the k slave processing sub-modules 125 are: the n slave processing submodules 125 in the first row, the n slave processing submodules 125 in the m-th row, and the m slave processing submodules 125 in the first column.
  • The k slave processing submodules include only the n slave processing submodules in the first row, the n slave processing submodules in the m-th row, and the m slave processing submodules in the first column; that is, the k slave processing submodules are the slave processing submodules, among the multiple slave processing submodules, that are directly connected to the master processing submodule.
  • k slave processing sub-modules are used for forwarding data and instructions between the master processing sub-module and multiple slave processing sub-modules. In this way, multiple slave processing sub-modules are distributed in an array, which can increase the speed of sending data and/or operation instructions from the master processing sub-module to the slave processing sub-modules, thereby increasing the processing speed of the instructions.
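  • The set of slave submodules wired directly to the master in an m x n array can be sketched as follows; the zero-based (row, column) coordinate convention is an assumption of the sketch.

```python
def directly_connected_slaves(m, n):
    """Return the coordinates of the slave submodules connected to the master:
    the first row, the m-th row and the first column of an m x n array."""
    connected = set()
    connected.update((0, col) for col in range(n))      # first row
    connected.update((m - 1, col) for col in range(n))  # m-th row
    connected.update((row, 0) for row in range(m))      # first column
    return connected


k_set = directly_connected_slaves(4, 3)
assert (0, 0) in k_set and (3, 2) in k_set    # border modules wired to the master
assert (1, 1) not in k_set                    # interior modules are reached via neighbours
```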
  • the processing module may further include a tree-shaped submodule 127.
  • the tree-shaped submodule 127 includes a root port 401 and multiple branch ports 402.
  • the root port 401 is connected to the main processing submodule 124, and the plurality of branch ports 402 are respectively connected to the plurality of slave processing submodules 125.
  • the tree-shaped submodule 127 has a transceiver function for forwarding data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125.
  • the processing modules are connected in a tree structure through the role of the tree-shaped submodules, and the forwarding function of the tree-shaped submodules can be used to increase the speed of sending data and/or operation instructions from the main processing submodule to the slave processing submodules, thereby increasing The processing speed of the instruction.
  • The tree-shaped submodule 127 may be an optional component of the device, and may include at least one layer of nodes.
  • Each node is a wiring structure with a forwarding function, and the node itself does not have a computing function.
  • the lowermost node is connected to the slave processing sub-module to forward data and/or operation instructions between the master processing sub-module 124 and the slave processing sub-module 125.
  • the device does not require a tree-shaped submodule.
  • the tree-shaped submodule 127 may include multiple nodes of an n-ary tree structure, and multiple nodes of the n-ary tree structure may have multiple layers.
  • FIGS. 6-5d show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure.
  • the n-ary tree structure may be a binary tree structure
  • the tree-shaped submodule 127 includes 2-layer nodes 01.
  • the lowermost node 01 is connected to the slave processing submodule 125 to forward data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125.
  • The n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2.
  • a person skilled in the art may set n in the n-ary tree structure and the number of nodes in the n-ary tree structure as needed, and the disclosure does not limit this.
  • FIG. 6-6 show a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Figs. 6-6, this method is applied to the above data processing device, and the data processing device is used to perform machine learning calculations. The method includes steps S51-6 to S53-6.
  • step S51-6 the calculation instruction is acquired, and the input data required to execute the calculation instruction is acquired.
  • step S52-6 the input data is processed according to the calculation instruction to obtain multiple intermediate results, and the multiple intermediate results are issued in sequence.
  • step S53-6 a cyclic accumulation operation is performed on a plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • In Mode 1, performing a cyclic accumulation operation on multiple intermediate results may include: in a first operation cycle in which an intermediate result is received, adding the received intermediate result to the first intermediate data of the current operation cycle to obtain a first accumulation result, and storing the first accumulation result as the first intermediate data of the next operation cycle; and in a second operation cycle in which no intermediate result is received, determining the first intermediate data of the current operation cycle as the calculation result.
  • The value of the first intermediate data in the initial operation cycle is zero.
  • In Mode 2, performing a cyclic accumulation operation on multiple intermediate results may include: in a third operation cycle in which an intermediate result is received, adding the received intermediate result to the third intermediate data of the current operation cycle to obtain a second accumulation result, storing the second intermediate data of the current operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle; and in a fourth operation cycle in which no intermediate result is received, adding the second intermediate data and the third intermediate data of the fourth operation cycle to obtain the calculation result.
  • The value of the second intermediate data and the third intermediate data in the initial operation cycle is zero.
  • the machine learning calculation may include: artificial neural network operation, and the input data may include: input neuron data and weight data; the calculation result is output neuron data.
  • the data type of the input data includes at least one of exponential type and dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.
  • processing the input data according to the calculation instruction to obtain multiple intermediate results may include: performing shift operation on the weight data or input neuron data according to the calculation instruction to obtain the intermediate result.
  • The exponential input data includes exponent bits; the value of the exponential input data is the result of raising the specified value (the base) to the power stored in the exponent bits.
  • The dynamic fixed-point input data includes decimal-point bits and integer bits.
  • The data stored in the decimal-point bits marks the position of the decimal point of the dynamic fixed-point input data within the data stored in the integer bits, so as to distinguish the integer part from the fractional part of the data in the integer bits.
  • The specified value corresponding to the exponential input data is the same as the radix of the input data.
  • Obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include: parsing the calculation instruction to obtain multiple operation instructions.
  • the method may further include:
  • the method may include: storing input data.
  • obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include:
  • the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed includes a plurality of arithmetic instructions;
  • acquiring the calculation instruction and acquiring multiple input data required to execute the calculation instruction may further include:
  • When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, the first to-be-executed instruction is cached, and after the execution of the zeroth to-be-executed instruction is completed, execution of the first to-be-executed instruction is controlled.
  • The first to-be-executed instruction being associated with the zeroth to-be-executed instruction before it means that the first storage address interval storing the data required by the first to-be-executed instruction and the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area.
  • The data processing method provided by the embodiments of the present disclosure reduces the amount of data access and computation by cyclically accumulating a plurality of intermediate results, keeps the calculation accuracy lossless, and can effectively increase the data processing speed.
  • a data processing device for performing machine learning calculations.
  • the device includes a control module and a processing module.
  • the processing module includes a data transfer submodule and an accumulation submodule:
  • the control module is used to obtain a calculation instruction and obtain input data required to execute the calculation instruction
  • the data transfer sub-module is configured to process the input data according to the calculation instruction to obtain multiple intermediate results, and sequentially send the multiple intermediate results to the accumulation sub-module;
  • the accumulation submodule is used to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:
  • the value of the first intermediate data in the initial calculation cycle is zero.
  • the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:
  • the value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.
  • Clause E4 The device according to any one of Clauses E1 to E3, the machine learning calculation includes artificial neural network operation, the input data includes input neuron data and weight data, and the calculation result is output neuron data.
  • Clause E5 The device according to Clause E4, the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, and the data types of the input neuron data and the weight data are different,
  • the data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, including:
  • the data transfer sub-module is used to perform shift operation on the weight data or the input neuron data according to the calculation instruction to obtain an intermediate result
  • the exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as the exponent represent the value of the exponential input data,
  • the dynamic fixed-point input data includes decimal-point bits and integer bits,
  • the data stored in the decimal-point bits is used to mark the position of the decimal point of the dynamic fixed-point input data within the data stored in the integer bits, so as to distinguish the integer part from the fractional part of the data in the integer bits,
  • the specified value corresponding to the exponential input data is the same as the radix of the input data.
  • the processing module includes a master processing submodule and a plurality of slave processing submodules, the master processing submodule includes the data transfer submodule and the accumulation submodule,
  • the control module is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the input data and the plurality of operation instructions to the main processing sub-module;
  • the master processing sub-module is used to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing sub-modules;
  • the plurality of slave processing sub-modules are configured to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing sub-module to obtain a plurality of intermediate results, and to transmit the plurality of intermediate results to the main processing sub-module;
  • the main processing sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • the device also includes a storage module for storing the input data
  • control module includes:
  • an instruction processing sub-module, which is used to parse the calculation instruction to obtain a plurality of operation instructions of the calculation instruction;
  • a queue storage sub-module which is used to store an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the plurality of arithmetic instructions;
  • control module also includes:
  • a dependency processing sub-module, which is used to, when there is an association between the first instruction to be executed among the multiple instructions to be executed and the zeroth instruction to be executed before the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage submodule and send it to the processing module,
  • association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
  • a first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
  • a machine learning computing device comprising:
  • one or more data processing devices as described in any one of Clauses E1 to E7, used to obtain data to be calculated and control information from other processing devices, perform the specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;
  • the data processing devices may be connected and transmit data through a specific structure
  • a plurality of the data processing apparatuses are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; the plurality of data processing apparatuses share the same control system or have their own control systems; the plurality of data processing apparatuses share memory or have their own memories; and the interconnection topology of the plurality of data processing apparatuses may be arbitrary.
  • a combined processing device comprising:
  • the machine learning computing device as described in Clause E8, a general interconnection interface, and other processing devices;
  • the machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
  • the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
  • Clause E10 A machine learning chip, the machine learning chip including the machine learning computing device as described in Clause E8 or the combined processing device as described in Clause E9.
  • An electronic device comprising:
  • a board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause E10;
  • the machine learning chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the machine learning chip and an external device
  • the control device is used for monitoring the state of the machine learning chip.
  • a data processing method is applied to a data processing device, the device is used to perform machine learning calculations, the method includes:
  • Clause E14 According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:
  • the value of the first intermediate data in the initial calculation cycle is zero.
  • Clause E15 According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:
  • the value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.
  • the machine learning calculation includes: artificial neural network operation, and the input data includes: input neuron data and weight data; the calculation result is output neuron data .
  • the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, the data type of the input neuron data and the weight data is different,
  • processing the input data according to the calculation instruction to obtain multiple intermediate results includes:
  • the exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as the exponent represent the value of the exponential input data,
  • the dynamic fixed-point input data includes decimal-point bits and integer bits,
  • the data stored in the decimal-point bits is used to mark the position of the decimal point of the dynamic fixed-point input data within the data stored in the integer bits, so as to distinguish the integer part from the fractional part of the data in the integer bits,
  • the specified value corresponding to the exponential input data is the same as the radix of the input data.
  • Clause E18 According to the method described in Clause E13, obtain a calculation instruction, and obtain input data required to execute the calculation instruction, including:
  • the method further includes:
  • the method includes: storing the input data
  • obtaining the calculation instruction, and obtaining the input data required to execute the calculation instruction include:
  • the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged in order of execution, and the plurality of instructions to be executed include the plurality of arithmetic instructions;
  • obtaining a calculation instruction, and obtaining a plurality of input data required to execute the calculation instruction further includes:

Abstract

The present application relates to a computing method and apparatus, and a related product. A board comprises: a storage device, an interface apparatus, a control device, and a machine learning chip. The machine learning chip is connected to the storage device, the control device, and the interface apparatus, separately. The storage device is used for storing data. The interface apparatus is used for performing data transmission between the machine learning chip and an external device. The control device is used for monitoring the state of the machine learning chip. The computing method and apparatus, and the related product provided in embodiments of the present application are widely applied, and have high processing efficiency and fast processing speed of an instruction.

Description

运算方法、装置及相关产品Calculation method, device and related products 技术领域Technical field
本公开涉及计算机技术领域,尤其涉及一种运算方法、装置及相关产品。The present disclosure relates to the field of computer technology, and in particular, to an arithmetic method, device, and related products.
背景技术Background technique
随着科技的不断发展,机器学习,尤其是神经网络算法的使用越来越广泛。其在图像识别、语音识别、自然语言处理等领域中都得到了良好的应用。但由于神经网络算法的复杂度越来越高,所涉及的数据运算种类和数量不断增大。相关技术中,在对向量、标量、矩阵、张量、资源锁放等进行运算、查找、累加等处理的效率低、速度慢。With the continuous development of science and technology, machine learning, especially the use of neural network algorithms is becoming more and more widely used. It has been well applied in the fields of image recognition, speech recognition, natural language processing and so on. However, due to the increasing complexity of neural network algorithms, the types and number of data operations involved are increasing. In the related art, it is inefficient and slow in processing, searching, and accumulating vectors, scalars, matrices, tensors, and resource locks.
发明内容Summary of the invention
有鉴于此,本公开提出了一种运算方法、装置及相关产品。In view of this, the present disclosure proposes an arithmetic method, device and related products.
为提高对向量进行查找运算的效率和速度,根据本公开的一方面,提供了一种向量查找指令处理装置,所述装置包括:In order to improve the efficiency and speed of performing search operations on vectors, according to an aspect of the present disclosure, a vector search instruction processing device is provided, and the device includes:
控制模块,用于对接收到的向量查找指令进行解析,获得所述向量查找指令的操作码和操作域,并根据所述操作码和所述操作域确定执行所述向量查找指令所需的待查找向量、查找条件和目标地址;The control module is used to parse the received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine the standby required to execute the vector search instruction according to the operation code and the operation domain Search vector, search condition and target address;
运算模块,用于依次确定表示所述待查找向量的多个待查数是否满足所述查找条件,并将满足所述查找条件的待查数确定为目标数,将所述目标数的存储地址作为查找结果存入所述目标地址,The operation module is used to sequentially determine whether a plurality of check numbers representing the search vector satisfy the search condition, determine the check number satisfying the search condition as the target number, and store the storage address of the target number Store in the target address as a search result,
其中,所述操作码用于指示所述向量查找指令对向量数据所进行的运算为查找运算,所述操作域包括所述待查找向量地址和所述目标地址。Wherein, the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.
根据本公开的另一方面,提供了一种机器学习运算装置,所述装置包括:According to another aspect of the present disclosure, a machine learning computing device is provided, the device including:
一个或多个上述向量查找指令处理装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的机器学习运算,将执行结果通过I/O接口传递给其他处理装置;One or more of the above vector search instruction processing devices are used to obtain data and control information to be calculated from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;
当所述机器学习运算装置包含多个所述向量查找指令处理装置时,所述多个所述向量查找指令处理装置间可以通过特定的结构进行连接并传输数据;When the machine learning operation device includes a plurality of the vector search instruction processing devices, the plurality of vector search instruction processing devices may be connected and transmit data through a specific structure;
其中,多个所述向量查找指令处理装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算;多个所述向量查找指令处理装置共享同一控制系统或拥有各自的控制系统;多个所述向量查找指令处理装置共享内存或者拥有各自的内存;多个所述向量查找指令处理装置的互联方式是任意互联拓扑。Among them, a plurality of the vector search instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the vector search instruction processing devices share the same control system Or have their own control systems; a plurality of the vector search instruction processing devices share memory or have their own memory; the interconnection method of the plurality of vector search instruction processing devices is an arbitrary interconnection topology.
根据本公开的另一方面,提供了一种组合处理装置,所述装置包括:According to another aspect of the present disclosure, a combined processing device is provided, the device comprising:
上述机器学习运算装置、通用互联接口和其他处理装置;The above-mentioned machine learning computing device, universal interconnection interface and other processing devices;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作。The machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
根据本公开的另一方面,提供了一种机器学习芯片,所述机器学习芯片包括上述机器学习络运算装置或上述组合处理装置。According to another aspect of the present disclosure, a machine learning chip is provided. The machine learning chip includes the above machine learning network operation device or the above combination processing device.
根据本公开的另一方面,提供了一种机器学习芯片封装结构,该机器学习芯片封装结构包括上述机器学习芯片。According to another aspect of the present disclosure, there is provided a machine learning chip packaging structure including the above machine learning chip.
根据本公开的另一方面,提供了一种板卡,该板卡包括上述机器学习芯片封装结构。According to another aspect of the present disclosure, there is provided a board card including the above machine learning chip packaging structure.
根据本公开的另一方面,提供了一种电子设备,所述电子设备包括上述机器学习芯片或上述板卡。According to another aspect of the present disclosure, there is provided an electronic device including the aforementioned machine learning chip or the aforementioned board.
本公开实施例所提供的向量查找指令处理方法、装置及相关产品,该装置包括控制模块和运算模块。控制模块用于对接收到的向量查找指令进行解析,获得向量查找指令的操作码和操作域,并根据操作码和操作域确定执行向量查找指令所需的待查找向量、查找条件和目标地址。运算模块用于依次确定表示待查找向量的多个待查数是否满足查找条件,并将满足查找条件的待查数确定为目标数,将目标数的存储地址作为查找结果存入目标地址。本公开实施例所提供的向量查找指令处理方法、装置及相关产品的适用范围广,对向量查找指令的处理效率高、处理速度快。The vector search instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and an arithmetic module. The control module is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, search condition, and target address required to execute the vector search instruction according to the operation code and operation domain. The operation module is used to sequentially determine whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number satisfying the search condition as the target number, and store the target number's storage address as the search result in the target address. The vector search instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of application, and have high processing efficiency and fast processing speed for the vector search instruction.
In order to improve the efficiency and speed of performing a search operation on a scalar, according to an aspect of the present disclosure, there is provided a scalar search instruction processing device, the device comprising:
a control module configured to parse a received scalar search instruction, obtain the operation code and the operation domain of the scalar search instruction, and determine, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified rank, and the target address required to execute the scalar search instruction;
an operation module configured to sequentially determine whether the values of a plurality of numbers to be checked that represent the scalar to be searched are equal to the specified value, determine the number to be checked whose value is equal to the specified value and whose rank is the specified rank as the target number, and store the storage address of the target number into the target address as the search result,
wherein the operation code is used to indicate that the operation performed by the scalar search instruction on scalar data is a search operation, and the operation domain includes the address of the scalar to be searched and the target address.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above scalar search instruction processing devices, configured to obtain data to be operated on and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the scalar search instruction processing devices, the plurality of scalar search instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of scalar search instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of scalar search instruction processing devices share the same control system or have their own control systems; the plurality of scalar search instruction processing devices share memory or have their own memories; and the interconnection of the plurality of scalar search instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a scalar search instruction processing method, which is applied to a scalar search instruction processing device, the method comprising:
parsing a received scalar search instruction, obtaining the operation code and the operation domain of the scalar search instruction, and determining, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified rank, and the target address required to execute the scalar search instruction;
sequentially determining whether the values of a plurality of numbers to be checked that represent the scalar to be searched are equal to the specified value, determining the number to be checked whose value is equal to the specified value and whose rank is the specified rank as the target number, and storing the storage address of the target number into the target address as the search result,
wherein the operation code is used to indicate that the operation performed by the scalar search instruction on scalar data is a search operation, and the operation domain includes the address of the scalar to be searched and the target address.
In the scalar search instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and an operation module. The control module is configured to parse the received scalar search instruction, obtain the operation code and the operation domain of the scalar search instruction, and determine, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified rank, and the target address required to execute the scalar search instruction. The operation module is configured to sequentially determine whether the values of the plurality of numbers to be checked that represent the scalar to be searched are equal to the specified value, determine the number to be checked whose value is equal to the specified value and whose rank is the specified rank as the target number, and store the storage address of the target number into the target address as the search result. The scalar search instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable and process scalar search instructions with high efficiency and at high speed.
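The scalar search semantics can likewise be sketched as a small software model, assuming a 1-based specified rank and a modeled 4-byte element layout; scalar_search and the base address are hypothetical names used only for this illustration and do not appear in the disclosed instruction set.

```python
def scalar_search(numbers, specified_value, specified_rank, base_address=0x2000, element_size=4):
    """Model of the scalar search operation: return the (modeled) storage address
    of the specified_rank-th number equal to specified_value, or None if absent."""
    matches = 0
    for index, number in enumerate(numbers):
        if number == specified_value:        # value equals the specified value
            matches += 1
            if matches == specified_rank:    # and its rank is the specified rank (1-based)
                return base_address + index * element_size
    return None

# Example: the 2nd occurrence of the value 6 is at index 4, i.e. 0x2000 + 4 * 4.
print(scalar_search([6, 1, 3, 2, 6, 6], specified_value=6, specified_rank=2))  # 8208
```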
In order to improve the efficiency and speed of locking and releasing resources, according to a first aspect of the present disclosure, there is provided a resource lock-and-release instruction processing device, the device comprising:
a control module configured to parse a received resource lock-and-release instruction, obtain the operation code and the operation domain of the resource lock-and-release instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock-and-release instruction, and determine the lock-and-release policy required for the lock-and-release processing of the resource;
a processing module configured to lock or release the resource to be processed according to the lock-and-release policy to obtain the processed resource,
wherein the operation code is used to indicate that the processing performed by the resource lock-and-release instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above resource lock-and-release instruction processing devices, configured to obtain data to be operated on and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the resource lock-and-release instruction processing devices, the plurality of resource lock-and-release instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of resource lock-and-release instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of resource lock-and-release instruction processing devices share the same control system or have their own control systems; the plurality of resource lock-and-release instruction processing devices share memory or have their own memories; and the interconnection of the plurality of resource lock-and-release instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a resource lock-and-release instruction processing method, which is applied to a resource lock-and-release instruction processing device, the method comprising:
parsing a received resource lock-and-release instruction, obtaining the operation code and the operation domain of the resource lock-and-release instruction, determining, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock-and-release instruction, and determining the lock-and-release policy required for the lock-and-release processing of the resource;
locking or releasing the resource to be processed according to the lock-and-release policy to obtain the processed resource,
wherein the operation code is used to indicate that the processing performed by the resource lock-and-release instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
In the resource lock-and-release instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module. The control module is configured to parse the received resource lock-and-release instruction, obtain the operation code and the operation domain of the resource lock-and-release instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock-and-release instruction, and determine the lock-and-release policy required for the lock-and-release processing of the resource. The processing module is configured to lock or release the resource to be processed according to the lock-and-release policy to obtain the processed resource. The resource lock-and-release instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable, and lock and release resources according to the resource lock-and-release instruction with high efficiency and at high speed.
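The locking and releasing processing can be pictured with the following toy model, which is only a sketch under assumptions: it treats the lock-and-release policy as a plain "lock" or "release" action keyed by the resource identifier, whereas the disclosure leaves the concrete policy to the operation domain. ResourceLockModel and its methods are hypothetical names introduced for this illustration.

```python
class ResourceLockModel:
    """Toy model of the lock-and-release processing: each resource identifier
    maps to a locked/unlocked state, and the policy selects the action."""

    def __init__(self):
        self.locked = {}  # resource identifier -> True (locked) / False (released)

    def process(self, resource_id, policy):
        """Lock or release the resource to be processed and return its new state."""
        if policy == "lock":
            if self.locked.get(resource_id, False):
                raise RuntimeError(f"resource {resource_id} is already locked")
            self.locked[resource_id] = True
        elif policy == "release":
            self.locked[resource_id] = False
        else:
            raise ValueError(f"unknown lock-and-release policy: {policy}")
        return self.locked[resource_id]

# Example: lock resource 7, then release it.
model = ResourceLockModel()
print(model.process(7, "lock"))     # True  -> resource 7 is locked
print(model.process(7, "release"))  # False -> resource 7 is released
```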
In order to improve the efficiency and speed of rearranging tensors, according to an aspect of the present disclosure, there is provided a tensor rearrangement instruction processing device, the device comprising:
a control module configured to parse a received tensor rearrangement instruction, obtain the operation code and the operation domain of the tensor rearrangement instruction, determine, according to the operation code and the operation domain, the tensor to be processed and the target address required to execute the tensor rearrangement instruction, and determine the rearrangement policy required for the rearrangement processing;
a processing module configured to rearrange the tensor to be processed according to the rearrangement policy to obtain a rearranged tensor, and store the rearranged tensor into the target address,
wherein the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on tensor data is rearrangement processing, and the operation domain includes the address of the tensor to be processed and the target address.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above tensor rearrangement instruction processing devices, configured to obtain the tensor to be processed and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the tensor rearrangement instruction processing devices, the plurality of tensor rearrangement instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of tensor rearrangement instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of tensor rearrangement instruction processing devices share the same control system or have their own control systems; the plurality of tensor rearrangement instruction processing devices share memory or have their own memories; and the interconnection of the plurality of tensor rearrangement instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a tensor rearrangement instruction processing method, which is applied to a tensor rearrangement instruction processing device, the method comprising:
parsing a received tensor rearrangement instruction, obtaining the operation code and the operation domain of the tensor rearrangement instruction, determining, according to the operation code and the operation domain, the tensor to be processed and the target address required to execute the tensor rearrangement instruction, and determining the rearrangement policy required for the rearrangement processing;
rearranging the tensor to be processed according to the rearrangement policy to obtain a rearranged tensor, and storing the rearranged tensor into the target address,
wherein the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on tensor data is rearrangement processing, and the operation domain includes the address of the tensor to be processed and the target address.
In the tensor rearrangement instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module. The control module is configured to parse the received tensor rearrangement instruction, obtain the operation code and the operation domain of the tensor rearrangement instruction, determine, according to the operation code and the operation domain, the tensor to be processed and the target address required to execute the tensor rearrangement instruction, and determine the rearrangement policy required for the rearrangement processing. The processing module is configured to rearrange the tensor to be processed according to the rearrangement policy to obtain the rearranged tensor, and store the rearranged tensor into the target address. With the tensor rearrangement instruction processing method, device, and related products provided by the embodiments of the present disclosure, the rearrangement of tensor data can be implemented by a single tensor rearrangement instruction; compared with the related art, in which the rearrangement of tensor data is implemented by multiple instructions, the rearrangement of tensor data is more efficient, faster, and applicable to a wider range of scenarios.
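One possible rearrangement policy is a permutation of tensor dimensions. The sketch below models only that case and is not the disclosed instruction itself; rearrange_tensor and the axis_order parameter are hypothetical names, and NumPy is used purely to stand in for the processing module.

```python
import numpy as np

def rearrange_tensor(tensor, axis_order):
    """Model of one possible rearrangement policy: permute the dimensions of the
    tensor to be processed; the result would then be stored at the target address."""
    return np.transpose(tensor, axes=axis_order)

# Example: a 2x3x4 tensor rearranged with the policy (2, 0, 1) becomes 4x2x3.
tensor = np.arange(24).reshape(2, 3, 4)
print(rearrange_tensor(tensor, (2, 0, 1)).shape)  # (4, 2, 3)
```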
In order to solve the problem that guaranteeing calculation accuracy and reducing the amount of data access and the amount of calculation cannot be satisfied at the same time, according to an aspect of the present disclosure, there is provided a data processing device for performing machine learning calculations, the device comprising a control module and a processing module, wherein the processing module comprises a data transfer sub-module and an accumulation sub-module:
the control module is configured to obtain a calculation instruction and obtain the input data required to execute the calculation instruction;
the data transfer sub-module is configured to process the input data according to the calculation instruction to obtain a plurality of intermediate results, and send the plurality of intermediate results to the accumulation sub-module in sequence;
the accumulation sub-module is configured to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above data processing devices, configured to obtain input data and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the data processing devices, the plurality of data processing devices may be connected through a specific structure and transmit data;
wherein the plurality of data processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of data processing devices share the same control system or have their own control systems; the plurality of data processing devices share memory or have their own memories; and the interconnection of the plurality of data processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a data processing method, which is applied to a data processing device for performing machine learning calculations, the method comprising:
obtaining a calculation instruction, and obtaining the input data required to execute the calculation instruction;
processing the input data according to the calculation instruction to obtain a plurality of intermediate results, and sending out the plurality of intermediate results in sequence;
performing a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
In the data processing device, method, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module, and the processing module includes a data transfer sub-module and an accumulation sub-module. The control module is configured to obtain a calculation instruction and obtain the input data required to execute the calculation instruction. The data transfer sub-module is configured to process the input data according to the calculation instruction to obtain a plurality of intermediate results, and send the plurality of intermediate results to the accumulation sub-module in sequence. The accumulation sub-module is configured to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction. By cyclically accumulating the plurality of intermediate results, the data processing device, method, and related products provided by the embodiments of the present disclosure reduce the amount of data access and the amount of calculation while keeping the calculation accuracy lossless, and can effectively increase the data processing speed.
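The interplay of the data transfer sub-module and the accumulation sub-module can be illustrated with a dot product split into blocks: each block yields one intermediate result, and the intermediate results are folded into a single accumulator as they arrive instead of being stored. This is a software analogy under assumptions; cyclic_accumulate, partial_products, and the block size are hypothetical and not part of the disclosed device.

```python
def partial_products(a, b, block=2):
    """Model of the data transfer sub-module for a dot product: yield the partial
    sum of each block as one intermediate result, one at a time."""
    for start in range(0, len(a), block):
        yield sum(x * y for x, y in zip(a[start:start + block], b[start:start + block]))

def cyclic_accumulate(intermediate_results):
    """Model of the accumulation sub-module: fold the intermediate results into a
    single running accumulator as they arrive, instead of materialising them all."""
    accumulator = 0
    for partial in intermediate_results:
        accumulator += partial   # one cyclic accumulation step per intermediate result
    return accumulator

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(cyclic_accumulate(partial_products(a, b)))  # 70, identical to sum(x * y for x, y in zip(a, b))
```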
In order to improve the efficiency and speed of performing symmetric processing on a matrix, according to an aspect of the present disclosure, there is provided a matrix symmetric instruction processing device, the device comprising:
a control module configured to parse a received matrix symmetric instruction, obtain the operation code and the operation domain of the matrix symmetric instruction, determine, according to the operation code and the operation domain, the matrix to be processed and the target address required to execute the matrix symmetric instruction, and determine the symmetry policy required for the symmetric processing;
a processing module configured to perform symmetric processing on the matrix to be processed according to the symmetry policy to obtain a symmetrized matrix, and store the symmetrized matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix symmetric instruction on matrix data is symmetric processing, and the operation domain includes the address of the matrix to be processed and the target address.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above matrix symmetric instruction processing devices, configured to obtain the matrix to be processed and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the matrix symmetric instruction processing devices, the plurality of matrix symmetric instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of matrix symmetric instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of matrix symmetric instruction processing devices share the same control system or have their own control systems; the plurality of matrix symmetric instruction processing devices share memory or have their own memories; and the interconnection of the plurality of matrix symmetric instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a matrix symmetric instruction processing method, which is applied to a matrix symmetric instruction processing device, the method comprising:
parsing a received matrix symmetric instruction, obtaining the operation code and the operation domain of the matrix symmetric instruction, determining, according to the operation code and the operation domain, the matrix to be processed and the target address required to execute the matrix symmetric instruction, and determining the symmetry policy required for the symmetric processing;
performing symmetric processing on the matrix to be processed according to the symmetry policy to obtain a symmetrized matrix, and storing the symmetrized matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix symmetric instruction on matrix data is symmetric processing, and the operation domain includes the address of the matrix to be processed and the target address.
In the matrix symmetric instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module. The control module is configured to parse the received matrix symmetric instruction, obtain the operation code and the operation domain of the matrix symmetric instruction, determine, according to the operation code and the operation domain, the matrix to be processed and the target address required to execute the matrix symmetric instruction, and determine the symmetry policy required for the symmetric processing. The processing module is configured to perform symmetric processing on the matrix to be processed according to the symmetry policy to obtain the symmetrized matrix, and store the symmetrized matrix into the target address. The matrix symmetric instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable, and perform symmetric processing on a matrix according to the matrix symmetric instruction with high efficiency and at high speed.
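One plausible symmetry policy is to mirror one triangle of a square matrix across its main diagonal. The sketch below models only that policy, which is an assumption made for illustration; symmetrize and the policy names "upper" and "lower" are hypothetical, and NumPy stands in for the processing module.

```python
import numpy as np

def symmetrize(matrix, policy="upper"):
    """Model of one possible symmetry policy: mirror one triangle of a square
    matrix across the main diagonal to obtain the symmetrized matrix."""
    m = np.array(matrix)
    if policy == "upper":    # keep the upper triangle, overwrite the lower one
        return np.triu(m) + np.triu(m, 1).T
    if policy == "lower":    # keep the lower triangle, overwrite the upper one
        return np.tril(m) + np.tril(m, -1).T
    raise ValueError(f"unknown symmetry policy: {policy}")

matrix = [[1, 2, 3],
          [0, 5, 6],
          [0, 0, 9]]
print(symmetrize(matrix, "upper"))
# [[1 2 3]
#  [2 5 6]
#  [3 6 9]]
```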
In order to improve the efficiency and speed of mirroring a matrix, according to an aspect of the present disclosure, there is provided a matrix mirroring instruction processing device, the device comprising:
a control module configured to parse a received matrix mirroring instruction, obtain the operation code and the operation domain of the matrix mirroring instruction, determine, according to the operation code and the operation domain, the matrix to be mirrored and the target address required to execute the matrix mirroring instruction, and determine the mirroring policy required for the mirroring processing;
a processing module configured to mirror the matrix to be mirrored according to the mirroring policy to obtain a mirrored matrix, and store the mirrored matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix mirroring instruction on matrix data is mirroring processing, and the operation domain includes the address of the matrix to be mirrored and the target address.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above matrix mirroring instruction processing devices, configured to obtain the matrix to be mirrored and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the matrix mirroring instruction processing devices, the plurality of matrix mirroring instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of matrix mirroring instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of matrix mirroring instruction processing devices share the same control system or have their own control systems; the plurality of matrix mirroring instruction processing devices share memory or have their own memories; and the interconnection of the plurality of matrix mirroring instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a matrix mirroring instruction processing method, which is applied to a matrix mirroring instruction processing device, the method comprising:
parsing a received matrix mirroring instruction, obtaining the operation code and the operation domain of the matrix mirroring instruction, determining, according to the operation code and the operation domain, the matrix to be mirrored and the target address required to execute the matrix mirroring instruction, and determining the mirroring policy required for the mirroring processing;
mirroring the matrix to be mirrored according to the mirroring policy to obtain a mirrored matrix, and storing the mirrored matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirroring processing, and the operation domain includes the address of the matrix to be mirrored and the target address.
In the matrix mirroring instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module. The control module is configured to parse the received matrix mirroring instruction, obtain the operation code and the operation domain of the matrix mirroring instruction, and determine, according to the operation code and the operation domain, the matrix to be mirrored and the target address required to execute the matrix mirroring instruction, as well as the mirroring policy required for the mirroring processing. The processing module is configured to mirror the matrix to be mirrored according to the mirroring policy to obtain the mirrored matrix, and store the mirrored matrix into the target address. The matrix mirroring instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable, and mirror a matrix according to the matrix mirroring instruction with high efficiency and at high speed.
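The mirroring processing can be modeled as a horizontal (left-right) or vertical (up-down) flip of the matrix to be mirrored. This is an illustrative sketch rather than the hardware implementation; mirror_matrix and the policy names are hypothetical, and NumPy stands in for the processing module.

```python
import numpy as np

def mirror_matrix(matrix, policy):
    """Model of the mirroring processing: flip the matrix to be mirrored
    left-right or up-down according to the mirroring policy."""
    m = np.array(matrix)
    if policy == "horizontal":   # mirror around the vertical axis (left-right flip)
        return np.fliplr(m)
    if policy == "vertical":     # mirror around the horizontal axis (up-down flip)
        return np.flipud(m)
    raise ValueError(f"unknown mirroring policy: {policy}")

matrix = [[1, 2],
          [3, 4]]
print(mirror_matrix(matrix, "horizontal"))  # [[2 1] [4 3]]
print(mirror_matrix(matrix, "vertical"))    # [[3 4] [1 2]]
```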
In order to improve the efficiency and speed of rotating a matrix, according to an aspect of the present disclosure, there is provided a matrix rotation instruction processing device, the device comprising:
a control module configured to parse a received matrix rotation instruction, obtain the operation code and the operation domain of the matrix rotation instruction, determine, according to the operation code and the operation domain, the matrix to be rotated and the target address required to execute the matrix rotation instruction, and determine the rotation angle by which the matrix to be rotated is to be rotated;
a processing module configured to rotate the matrix to be rotated according to the rotation angle to obtain a rotated matrix, and store the rotated matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix rotation instruction on matrix data is rotation processing, and the operation domain includes the address of the matrix to be rotated and the target address.
According to another aspect of the present disclosure, there is provided a machine learning computing device, the device comprising:
one or more of the above matrix rotation instruction processing devices, configured to obtain the matrix to be rotated and control information from other processing devices, perform the specified machine learning operation, and pass the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes a plurality of the matrix rotation instruction processing devices, the plurality of matrix rotation instruction processing devices may be connected through a specific structure and transmit data;
wherein the plurality of matrix rotation instruction processing devices are interconnected and transmit data via a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of matrix rotation instruction processing devices share the same control system or have their own control systems; the plurality of matrix rotation instruction processing devices share memory or have their own memories; and the interconnection of the plurality of matrix rotation instruction processing devices may be any interconnection topology.
According to another aspect of the present disclosure, there is provided a combined processing device, the device comprising:
the above machine learning computing device, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computing operation specified by a user.
According to another aspect of the present disclosure, there is provided a machine learning chip, the machine learning chip comprising the above machine learning computing device or the above combined processing device.
According to another aspect of the present disclosure, there is provided a machine learning chip package structure, the machine learning chip package structure comprising the above machine learning chip.
According to another aspect of the present disclosure, there is provided a board card, the board card comprising the above machine learning chip package structure.
According to another aspect of the present disclosure, there is provided an electronic device, the electronic device comprising the above machine learning chip or the above board card.
According to another aspect of the present disclosure, there is provided a matrix rotation instruction processing method, which is applied to a matrix rotation instruction processing device, the method comprising:
parsing a received matrix rotation instruction, obtaining the operation code and the operation domain of the matrix rotation instruction, determining, according to the operation code and the operation domain, the matrix to be rotated and the target address required to execute the matrix rotation instruction, and determining the rotation angle by which the matrix to be rotated is to be rotated;
rotating the matrix to be rotated according to the rotation angle to obtain a rotated matrix, and storing the rotated matrix into the target address,
wherein the operation code is used to indicate that the processing performed by the matrix rotation instruction on matrix data is rotation processing, and the operation domain includes the address of the matrix to be rotated and the target address.
In the matrix rotation instruction processing method, device, and related products provided by the embodiments of the present disclosure, the device includes a control module and a processing module. The control module is configured to parse the received matrix rotation instruction, obtain the operation code and the operation domain of the matrix rotation instruction, determine, according to the operation code and the operation domain, the matrix to be rotated and the target address required to execute the matrix rotation instruction, and determine the rotation angle by which the matrix to be rotated is to be rotated. The processing module is configured to rotate the matrix to be rotated according to the rotation angle to obtain the rotated matrix, and store the rotated matrix into the target address. The matrix rotation instruction processing method, device, and related products provided by the embodiments of the present disclosure are widely applicable, and process the matrix to be rotated according to the matrix rotation instruction with high efficiency and at high speed.
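The rotation processing can be modeled for rotation angles that are multiples of 90 degrees. The sketch below assumes counter-clockwise rotation purely for illustration; rotate_matrix is a hypothetical name, the angle restriction is an assumption of the sketch, and NumPy stands in for the processing module.

```python
import numpy as np

def rotate_matrix(matrix, angle):
    """Model of the rotation processing: rotate the matrix to be rotated by the
    given rotation angle, assumed here to be a multiple of 90 degrees
    (counter-clockwise)."""
    if angle % 90 != 0:
        raise ValueError("this sketch only models rotation angles that are multiples of 90 degrees")
    return np.rot90(np.array(matrix), k=(angle // 90) % 4)

matrix = [[1, 2],
          [3, 4]]
print(rotate_matrix(matrix, 90))   # [[2 4] [1 3]]
print(rotate_matrix(matrix, 180))  # [[4 3] [2 1]]
```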
In some embodiments, the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle includes an aircraft, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; and the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
The drawings, which are included in and constitute a part of the specification, illustrate, together with the specification, exemplary embodiments, features, and aspects of the present disclosure, and serve to explain the principles of the present disclosure.
FIG. 1 shows a schematic diagram of a processor for an instruction processing method according to an embodiment of the present disclosure.
FIG. 2-1 shows a block diagram of a vector search instruction processing device according to an embodiment of the present disclosure.
FIG. 2-2 shows a block diagram of a vector search instruction processing device according to an embodiment of the present disclosure.
FIG. 2-3a to FIG. 2-3c show schematic diagrams of application scenarios of a vector search instruction processing device according to an embodiment of the present disclosure.
FIG. 2-4 shows a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure.
FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.
FIG. 3-2 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.
FIG. 3-3a to FIG. 3-3c show schematic diagrams of application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure.
FIG. 3-4 shows a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure.
FIG. 4-1 shows a block diagram of a resource lock-and-release instruction processing device according to an embodiment of the present disclosure.
FIG. 4-2 shows a block diagram of a resource lock-and-release instruction processing device according to an embodiment of the present disclosure.
FIG. 4-3a to FIG. 4-3b show schematic diagrams of application scenarios of a resource lock-and-release instruction processing device according to an embodiment of the present disclosure.
FIG. 4-4 shows a flowchart of a resource lock-and-release instruction processing method according to an embodiment of the present disclosure.
FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing device according to an embodiment of the present disclosure.
FIG. 5-2 shows a block diagram of a tensor rearrangement instruction processing device according to an embodiment of the present disclosure.
FIG. 5-3 shows a schematic diagram of an application scenario of a tensor rearrangement instruction processing device according to an embodiment of the present disclosure.
FIG. 5-4 shows a flowchart of a tensor rearrangement instruction processing method according to an embodiment of the present disclosure.
FIG. 6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
FIG. 6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure.
FIG. 6-3 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
FIG. 6-4 shows a block diagram of a data processing device according to an embodiment of the present disclosure.
FIG. 6-5a to FIG. 6-5d show block diagrams of a processing module in a data processing device according to an embodiment of the present disclosure.
FIG. 6-6 shows a flowchart of a data processing method according to an embodiment of the present disclosure.
FIG. 7-1 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
FIG. 7-2 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
FIG. 7-3 shows a schematic diagram of an application scenario of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.
FIG. 7-4 shows a flowchart of a matrix symmetric instruction processing method according to an embodiment of the present disclosure.
FIG. 8-1 shows a block diagram of a matrix mirroring instruction processing device according to an embodiment of the present disclosure.
FIG. 8-2 shows a block diagram of a matrix mirroring instruction processing device according to an embodiment of the present disclosure.
FIG. 8-3 shows a schematic diagram of an application scenario of a matrix mirroring instruction processing device according to an embodiment of the present disclosure.
FIG. 8-4 shows a flowchart of a matrix mirroring instruction processing method according to an embodiment of the present disclosure.
FIG. 9-1 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
FIG. 9-2 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
FIG. 9-3 shows a schematic diagram of an application scenario of a matrix rotation instruction processing device according to an embodiment of the present disclosure.
FIG. 9-4 shows a flowchart of a matrix rotation instruction processing method according to an embodiment of the present disclosure.
FIG. 10a and FIG. 10b show block diagrams of a combined processing device according to an embodiment of the present disclosure.
FIG. 11 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
具体实施方式detailed description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "zeroth", "first", "second", "third", "fourth", and the like in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, to mean "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
The present disclosure provides instruction processing methods and apparatuses corresponding to different operations or processes, as well as computer devices and storage media corresponding to each instruction processing method and apparatus. The instruction processing methods and apparatuses corresponding to different operations or processes include: a vector search instruction processing method and apparatus, a scalar search instruction processing method and apparatus, a resource lock-release instruction processing method and apparatus, a tensor rearrangement instruction processing method and apparatus, a data processing method and apparatus, a matrix symmetry instruction processing method and apparatus, a matrix mirroring instruction processing method and apparatus, and a matrix rotation instruction processing method and apparatus. The instruction processing method and instruction processing apparatus described below may be any one of the instruction processing methods and apparatuses listed above.
The instruction processing method according to the embodiments of the present disclosure may be applied to a processor. The processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and the like, where machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not limit the specific type of the processor.
In a possible implementation, the processor mentioned in the present disclosure may include a plurality of processing units, and each processing unit may independently run the various tasks assigned to it, such as a convolution operation task, a pooling task, or a fully connected task. The present disclosure does not limit the processing units or the tasks run by the processing units.
FIG. 1 shows a schematic diagram of a processor to which the instruction processing method according to an embodiment of the present disclosure is applied. As shown in FIG. 1, the processor 100 includes a plurality of processing units 101 and a storage unit 102. The plurality of processing units 101 are used to execute instruction sequences, and the storage unit 102 is used to store data and may include a random access memory (RAM) and a register file. The plurality of processing units 101 in the processor 100 may share part of the storage space, for example share part of the RAM storage space and the register file, and may also have their own storage spaces at the same time.
FIG. 2-1 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIG. 2-1, the apparatus includes a control module 11-2 and an operation module 12-2.
The control module 11-2 is configured to parse a received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction. The operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes the address of the vector to be searched and the target address.
The operation module 12-2 is configured to sequentially determine whether a plurality of numbers to be checked, which represent the vector to be searched, satisfy the search condition, determine a number to be checked that satisfies the search condition as the target number, and store the storage address of the target number into the target address as the search result.
In this embodiment, the vector to be searched may consist of a plurality of numbers to be checked. For example, if the decimal representation of a vector m to be searched is (5, 6, 4), the numbers to be checked of the vector m are "5", "6", and "4". The vector to be searched may also be represented by a binary, hexadecimal, or other character string. For example, the binary representation of the vector m to be searched may be "101110100", where "101", "110", and "100" are the numbers to be checked of the vector m and correspond to the numbers 5, 6, and 4 when the vector m is converted to decimal. The control module may obtain the vector to be searched from the address of the vector to be searched, which may be the first address at which the vector to be searched is stored. The control module may obtain the vector search instruction and the vector to be searched through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), that specifies the operation to be performed as defined in a computer program; it is the instruction serial number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, where such data includes parameter data, the vector to be searched, and the corresponding operation method, or the addresses at which the parameter data, the vector to be searched, and the corresponding operation method are stored. A vector search instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the vector to be searched and the target address.
It should be understood that those skilled in the art may set the instruction format of the vector search instruction, as well as the operation code and operation domain it contains, as needed, which is not limited by the present disclosure.
In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and the numbers of control modules and operation modules may be set according to actual needs, which is not limited by the present disclosure.
The vector search instruction processing apparatus provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to parse a received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction. The operation module is configured to sequentially determine whether the plurality of numbers to be checked representing the vector to be searched satisfy the search condition, determine a number to be checked that satisfies the search condition as the target number, and store the storage address of the target number into the target address as the search result. The vector search instruction processing apparatus provided by the embodiments of the present disclosure has a wide application range, and processes vector search instructions efficiently and quickly.
In a possible implementation, a number to be checked that satisfies the search condition may include at least one of the following:
a number to be checked whose value is a specified multiple of a specified value and whose rank is a specified rank;
a number to be checked whose value lies within a specified value interval;
a number to be checked whose value is a specified multiple of a specified value.
The specified rank may include at least one of the following: the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; the rank of the number to be checked is the m-th from the end among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1. Here, m and n are less than or equal to the number of numbers to be checked in the vector to be searched.
In this implementation, the forward and reverse specified ranks can be distinguished by setting different expressions for "the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value" and "the rank of the number to be checked is the m-th from the end among the numbers to be checked whose values are the specified multiple of the specified value". For example, the expression of the former in the vector search instruction may be set to "0n", and the expression of the latter may be set to "m0". Those skilled in the art may set the expression of the specified rank according to actual needs, which is not limited by the present disclosure.
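As a concrete illustration of the encoding just described, the following minimal Python sketch decodes such a rank field; the two-character layout ("0n" for the n-th match from the front, "m0" for the m-th match from the end) follows the example above, while the function name and the single-digit restriction are assumptions of this sketch rather than part of the disclosed instruction set.

```python
# Hypothetical helper: decode a two-character rank field such as "01" or "10".
def decode_rank(field: str):
    """Return ("forward", n) or ("reverse", m) for a rank field written as "0n" or "m0"."""
    m, n = int(field[0]), int(field[1])
    if m == 0 and n >= 1:
        return ("forward", n)   # the n-th matching number, counted from the front
    if n == 0 and m >= 1:
        return ("reverse", m)   # the m-th matching number, counted from the back
    raise ValueError(f"unsupported rank field: {field}")

assert decode_rank("01") == ("forward", 1)   # first match
assert decode_rank("10") == ("reverse", 1)   # last match
```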
In a possible implementation, the specified value may be 0, 1, 2, 3, or another value. The specified multiple may be 1 (that is, the value is the same as the specified value), 2, 3, or another multiple.
For example, the target number found by a vector search instruction may be the first 1 among the numbers to be checked of the vector to be searched, the last 1, the first number to be checked that is 3 times 2, the last number to be checked that is 3 times 2, a number to be checked less than 5, a number to be checked greater than 9, and so on.
In a possible implementation, the operation domain may further include an input length. The control module 11-2 is further configured to obtain the vector to be searched from the address of the vector to be searched according to the input length.
In this implementation, the length of the vector to be searched obtained from the address of the vector to be searched according to the input length needs to be equal to, or less than, the input length.
In a possible implementation, when the operation domain does not include an input length, the vector to be searched may be obtained according to a preset default input length, or all the data at the address of the vector to be searched may be taken as the vector to be searched.
In a possible implementation, the operation domain may further include the width of the numbers to be checked. The operation module 12-2 is further configured to determine the plurality of numbers to be checked from the vector to be searched according to this width.
In this implementation, the width of the numbers to be checked may represent the width, in the character string of the vector to be searched, that corresponds to each number to be checked. When the operation domain includes this width, a plurality of groups of characters whose width equals the width of the numbers to be checked can be determined from the character string representing the vector to be searched, and each group of characters represents one number to be checked. For example, if the width is 3 and the vector m to be searched (represented as (5, 6, 4) when converted to decimal) is "101110100", the numbers to be checked of the vector m are "101", "110", and "100", which correspond to the numbers 5, 6, and 4 when the vector m is converted to decimal. If the width is 1, the numbers to be checked of the vector m are "1", "0", "1", "1", "1", "0", "1", "0", and "0"; if the width is 2, 4, or another width other than 3, the obtained numbers to be checked of the vector m are merely numbers formed from the character string and bear no relationship to the numbers 5, 6, and 4 to which the vector m corresponds in decimal.
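The width-based splitting described above can be modelled in a few lines of Python. This is only a behavioural sketch of the example (the function name is illustrative), and it assumes that the bit-string length is an exact multiple of the width.

```python
# Split the bit string of a vector to be searched into numbers to be checked of a given width.
def split_numbers(bits: str, width: int):
    if width <= 0 or len(bits) % width != 0:
        raise ValueError("bit-string length must be a multiple of the width")
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]

print(split_numbers("101110100", 3))  # [5, 6, 4]
print(split_numbers("101110100", 1))  # [1, 0, 1, 1, 1, 0, 1, 0, 0]
```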
In a possible implementation, the operation domain may further include the search condition. The control module 11-2 is further configured to determine the search condition according to the operation domain.
In this implementation, when the operation domain includes the search condition, the search condition can be obtained directly from the operation domain.
In a possible implementation, the control module 11-2 is further configured to determine the search condition according to the operation code, where the operation code is also used to indicate the search condition of the vector search instruction.
In this implementation, different operation codes can be set to represent different search conditions. The width of the numbers to be checked can also be determined according to the operation code or a default width.
For example, the operation code "Find_vfirst" may be set to find the first 1 among the numbers to be checked of the vector to be searched (the width of the numbers to be checked is greater than 1; the number to be checked that satisfies the search condition is the first one among the numbers to be checked whose value is one times the specified value). The operation code "Find_vlast" may be set to find the last 1 among the numbers to be checked of the vector to be searched (the width of the numbers to be checked is greater than 1; the number to be checked that satisfies the search condition is the first from the end among the numbers to be checked whose value is one times the specified value). When the operation code is "Find_vfirst" or "Find_vlast", the width of the numbers to be checked can be further determined according to the operation code, or the default width can be used as the width of the numbers to be checked, so as to obtain the numbers to be checked of the vector to be searched with that width.
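One way to picture how an operation code can imply the search condition is the table-driven sketch below; the tuple layout (specified value, specified multiple, rank) and the fallback to a condition carried in the operation domain are assumptions of this sketch, not a definitive encoding.

```python
# Hypothetical mapping from special operation codes to the search condition they imply.
OPCODE_CONDITIONS = {
    # opcode: (specified value, specified multiple, rank)
    "Find_vfirst": (1, 1, ("forward", 1)),   # first number equal to 1
    "Find_vlast":  (1, 1, ("reverse", 1)),   # last number equal to 1
}

def condition_for(opcode, operand_condition=None):
    # A generic "Find" carries its condition in the operation domain; the special
    # opcodes above imply it, so the operand condition is ignored for them.
    return OPCODE_CONDITIONS.get(opcode, operand_condition)
```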
FIG. 2-2 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 2-2, the operation module 12-2 may include at least one comparator 121-2, which is configured to compare the plurality of numbers to be checked with the search condition and obtain a comparison result, so that whether a number to be checked satisfies the search condition can be determined according to the comparison result.
For example, suppose the vector search instruction looks for the first number to be checked whose value is 1 (that is, whose value is one times the specified value 1). The comparator may compare the values of the numbers to be checked of the vector to be searched with the specified value "1" in turn, so as to determine whether the value of each number to be checked is equal to the specified value "1", determine as the target number the first number to be checked among those whose value is equal to the specified value "1", and store the storage address of the target number into the target address as the search result. The number of comparators may be set according to the amount of data to be compared and the required comparison speed and efficiency, which is not limited by the present disclosure.
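The comparator-based search for this example can be modelled in software as below; the dictionary memory model and the function name are illustrative, so this is a behavioural sketch rather than the hardware comparator itself.

```python
# Compare each number to be checked with the specified value in turn and store the
# storage address of the first match into the target address as the search result.
def find_first_equal(numbers, number_addresses, specified_value, memory, target_address):
    for value, addr in zip(numbers, number_addresses):
        if value == specified_value:       # comparator: number to be checked vs. specified value
            memory[target_address] = addr  # the match's storage address is the search result
            return addr
    return None                            # no number satisfies the search condition
```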
In a possible implementation, as shown in FIG. 2-2, the apparatus may further include a storage module 13-2 configured to store the vector to be searched.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratchpad cache. The vector to be searched may be stored in the memory, cache, and/or register of the storage module as needed, which is not limited by the present disclosure.
In a possible implementation, the apparatus may further include a direct memory access module, which is used to read data from or store data into the storage module.
In a possible implementation, as shown in FIG. 2-2, the control module 11-2 may include an instruction storage submodule 111-2, an instruction processing submodule 112-2, and a queue storage submodule 113-2.
The instruction storage submodule 111-2 is used to store the vector search instruction.
The instruction processing submodule 112-2 is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction.
The queue storage submodule 113-2 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the vector search instruction as well as other calculation instructions related to the vector search instruction.
In this implementation, the execution order of the plurality of instructions to be executed can be arranged according to their reception time, priority level, and so on, to obtain the instruction queue, so that the plurality of instructions to be executed are executed in turn according to the instruction queue.
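A possible software model of this ordering policy is sketched below, assuming that a smaller priority value means earlier execution and that ties are broken by reception time; both assumptions, and all field names, are illustrative only.

```python
# Order instructions to be executed by (priority, reception time) with a binary heap.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class PendingInstruction:
    priority: int                       # smaller value = executed earlier (assumption)
    recv_time: int                      # tie-breaker: earlier reception first
    text: str = field(compare=False)    # the instruction itself is not compared

queue = []
heapq.heappush(queue, PendingInstruction(1, 2, "@Find_vfirst#101#28#201#4"))
heapq.heappush(queue, PendingInstruction(0, 5, "@Find#100#28#200#4#1#01"))
print(heapq.heappop(queue).text)        # the higher-priority instruction is issued first
```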
In a possible implementation, as shown in FIG. 2-2, the control module 11-2 may further include a dependency processing submodule 114-2.
The dependency processing submodule 114-2 is configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule 111-2, and after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage submodule 111-2 and send it to the operation module 12-2. The first instruction to be executed and the zeroth instruction to be executed are both instructions among the plurality of instructions to be executed.
The association between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes: the first storage address interval storing the data required by the first instruction to be executed overlaps the zeroth storage address interval storing the data required by the zeroth instruction to be executed. Conversely, the absence of an association between the first instruction to be executed and the zeroth instruction to be executed may mean that the first storage address interval and the zeroth storage address interval have no overlapping area.
In this way, according to the dependencies between the instructions to be executed, a later instruction to be executed is executed only after the earlier instruction to be executed has finished executing, which guarantees the accuracy of the operation result.
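The overlap test that defines this association can be written directly; the sketch below assumes each operand region is described as a half-open [start, end) address interval, which is an assumption of this sketch rather than a detail fixed by the disclosure.

```python
# A later instruction must wait when any of its operand address intervals overlaps
# any operand address interval of an earlier, still-executing instruction.
def intervals_overlap(first, zeroth):
    (a_start, a_end), (b_start, b_end) = first, zeroth
    return a_start < b_end and b_start < a_end

def must_wait(first_intervals, zeroth_intervals):
    """True if the first instruction to be executed depends on the zeroth one."""
    return any(intervals_overlap(f, z) for f in first_intervals for z in zeroth_intervals)
```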
In a possible implementation, the instruction format of the vector search instruction may be as shown in Table 1 below, in which the operation code and the positions of the operation code and operation domain can be set. Table 2 gives the general vector search instruction (Find), with which any number in the vector to be searched can be found; it also gives examples of vector search instructions and defines the operation codes and operation domains required by two special types of vector search instructions (Find_vfirst and Find_vlast). Searching the vector to be searched with a special type of vector search instruction can simplify instruction processing and save search time.
Table 1 Instruction format
(The body of Table 1 is provided only as an image, Figure PCTCN2019120893-appb-000001, in the original publication.)
When the search condition contains only a specified value interval, a number to be checked that satisfies the search condition is a number to be checked whose value lies within the specified value interval.
When the search condition contains a specified value and a specified rank, a number to be checked that satisfies the search condition is a number to be checked whose value is equal to the specified value and whose rank is the specified rank.
When the search condition contains a specified value and a specified multiple, a number to be checked that satisfies the search condition is a number to be checked whose value is the specified multiple of the specified value.
When the search condition contains a specified value, a specified multiple, and a specified rank, a number to be checked that satisfies the search condition is a number to be checked whose value is the specified multiple of the specified value and whose rank is the specified rank.
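A per-number predicate covering these combinations can be sketched as follows; the parameter names and the convention that an absent part of the condition is passed as None are assumptions of this sketch. The rank part of a condition applies across matches rather than to a single number, so it is not included here.

```python
# Does a single number to be checked satisfy the value-related parts of the search condition?
def number_matches(value, specified_value=None, multiple=1, interval=None):
    if interval is not None:                      # specified value interval, if any
        low, high = interval
        if not (low <= value <= high):
            return False
    if specified_value is not None and value != specified_value * multiple:
        return False                              # specified value and specified multiple
    return True
```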
Table 2 Examples of vector search instructions
(The body of Table 2 is provided only as an image, Figure PCTCN2019120893-appb-000002, in the original publication.)
For the vector search instruction whose operation code is "Find_vfirst", the number to be checked that satisfies the search condition is the first one among the numbers to be checked whose value is one times the specified value 1 (that is, whose value is equal to the specified value 1). The width of the numbers to be checked is greater than 1.
For the vector search instruction whose operation code is "Find_vlast", the number to be checked that satisfies the search condition is the first from the end among the numbers to be checked whose value is one times the specified value 1 (that is, whose value is equal to the specified value 1). The width of the numbers to be checked is greater than 1.
It should be understood that those skilled in the art may set the operation code of the vector search instruction and the positions of the operation code and operation domain in the instruction format as needed, which is not limited by the present disclosure.
In a possible implementation, the apparatus may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
It should be noted that although the vector search instruction processing apparatus is described above by taking the above embodiments as examples, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or the actual application scenario, as long as the technical solution of the present disclosure is followed.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "searching a vector to be searched with the vector search instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the vector search instruction processing apparatus. Those skilled in the art should understand that the following application example is only intended to facilitate understanding of the embodiments of the present disclosure and should not be regarded as a limitation on the embodiments of the present disclosure.
FIGS. 2-3a to 2-3c show schematic diagrams of application scenarios of a vector search instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIGS. 2-3a to 2-3c, the vector search instruction processing apparatus processes vector search instructions as follows.
First, assume that the vector a to be searched is "0101 1011 0001 0101 1100 0001 1001", where every four binary digits represent one decimal number of the vector a; that is, in decimal, the vector a to be searched is (5, 11, 1, 5, 12, 1, 9). To facilitate distinguishing the different vector search instructions, it is assumed that the storage address of the vector a to be searched differs between the vector search instructions.
The vector search instructions to be processed by the apparatus include the following (a parsing sketch of these instructions is given below):
Vector search instruction 1: @Find#100#28#200#4#1#01
Vector search instruction 2: @Find_vfirst#101#28#201#4
Vector search instruction 3: @Find_vlast#102#28#202#4
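The '#'-separated field order assumed in the sketch below (opcode, vector address, input length, target address, width, then optionally specified value and rank for the generic Find form) is inferred from Example 1 that follows, and the function name is illustrative only.

```python
# Parse a textual vector search instruction of the form used in this application example.
def parse_vector_find(text: str):
    fields = text.lstrip("@").split("#")
    parsed = {
        "opcode": fields[0],
        "src_addr": int(fields[1]),     # address of the vector to be searched
        "length": int(fields[2]),       # input length
        "target_addr": int(fields[3]),  # target address for the search result
        "width": int(fields[4]),        # width of the numbers to be checked
    }
    if parsed["opcode"] == "Find":      # the generic form carries the condition explicitly
        parsed["value"] = int(fields[5])
        parsed["rank"] = fields[6]
    return parsed

print(parse_vector_find("@Find#100#28#200#4#1#01"))
print(parse_vector_find("@Find_vfirst#101#28#201#4"))
```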
Example 1
As shown in FIG. 2-3a, upon receiving vector search instruction 1, the control module 11-2 parses vector search instruction 1 and obtains its operation code Find, and determines from the operation domain that, for vector search instruction 1, the address of the vector to be searched is "100", the input length is "28", the target address is "200", the specified rank is "the first among the numbers to be checked whose value is equal to the specified value" (since the specified-multiple position in vector search instruction 1 is empty, the default multiple of one is used), the specified value is "1", and the width of the numbers to be checked is "4". The control module 11-2 then obtains, from the address 100 of the vector to be searched, the above vector a to be searched, "0101 1011 0001 0101 1100 0001 1001", with an input length of 28.
The operation module 12-2, according to the width "4" of the numbers to be checked, obtains the numbers to be checked one by one from the vector a to be searched, determines in turn whether the value of each number to be checked is equal to the specified value "1", determines as the target number the first number to be checked among those whose value is equal to the specified value "1", and stores the storage address of the target number into the target address 200 as the search result.
In this example, the operation module 12-2 first obtains from the vector a to be searched the first number to be checked, "0101", with a width of 4, and determines whether its value is equal to the specified value "1". Since the value of "0101" is not 1, the operation module 12-2 continues to obtain the next number to be checked, "1011", and determines whether its value is equal to the specified value "1". Since the value of "1011" is not 1, the operation module 12-2 continues to obtain the next number to be checked, "0001", and determines whether its value is equal to the specified value "1". Since the value of "0001" is equal to 1 and its rank is the specified rank (that is, it is the first among the numbers to be checked equal to the specified value), the number "0001" is determined as the target number, and the storage address 500 of the number "0001" is stored into the target address 200 as the search result.
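The following Python walk-through reproduces Example 1 in software: the width-4 numbers of vector a are compared with the specified value 1 and the storage address of the first match is written to target address 200. Only the address 500 of the matching number is given by the example; the other per-number addresses and the dictionary memory model are assumptions of this sketch.

```python
# Software walk-through of Example 1.
a_bits = "0101101100010101110000011001"          # vector a, input length 28
width = 4
numbers = [int(a_bits[i:i + width], 2) for i in range(0, len(a_bits), width)]  # [5, 11, 1, 5, 12, 1, 9]
addresses = [498, 499, 500, 501, 502, 503, 504]  # illustrative; the third number sits at 500

memory = {}
for value, addr in zip(numbers, addresses):
    if value == 1:                               # first number equal to the specified value 1
        memory[200] = addr                       # target address 200 receives address 500
        break
print(memory)                                    # {200: 500}
```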
Example 2
As shown in FIG. 2-3b, upon receiving vector search instruction 2, the control module 11-2 parses vector search instruction 2 and obtains its operation code Find_vfirst, and determines from the operation domain that, for vector search instruction 2, the address of the vector to be searched is "101", the input length is "28", the target address is "201", and the width of the numbers to be checked is "4". In addition, it determines from the operation code Find_vfirst that the specified value of vector search instruction 2 is "1" and that the specified rank is "the first among the numbers to be checked whose value is equal to the specified value". The control module 11-2 then obtains, from the address 101 of the vector to be searched, the above vector a to be searched, "0101 1011 0001 0101 1100 0001 1001", with an input length of 28.
The operation module 12-2, according to the width "4" of the numbers to be checked, obtains the numbers to be checked one by one from the vector a to be searched, determines in turn whether the value of each number to be checked is equal to the specified value "1", determines as the target number the first number to be checked among those whose value is equal to the specified value "1", and stores the storage address of the target number into the target address 201 as the search result.
In this example, the operation module 12-2 first obtains from the vector a to be searched the first number to be checked, "0101", with a width of 4, and determines whether its value is equal to the specified value "1". Since the value of "0101" is not 1, the operation module 12-2 continues to obtain the next number to be checked, "1011", and determines whether its value is equal to the specified value "1". Since the value of "1011" is not 1, the operation module 12-2 continues to obtain the next number to be checked, "0001", and determines whether its value is equal to the specified value "1". Since the value of "0001" is equal to 1 and its rank is the specified rank (that is, it is the first among the numbers to be checked equal to the specified value), the number "0001" is determined as the target number, and the storage address 501 of the number "0001" is stored into the target address 201 as the search result.
Example 3
As shown in FIG. 2-3c, upon receiving vector search instruction 3, the control module 11-2 parses vector search instruction 3 and obtains its operation code Find_vlast, and determines from the operation domain that, for vector search instruction 3, the address of the vector to be searched is "102", the input length is "28", the target address is "202", and the width of the numbers to be checked is "4". In addition, it determines from the operation code Find_vlast that the specified value of vector search instruction 3 is "1" and that the specified rank is "the first from the end among the numbers to be checked whose value is equal to the specified value". The control module 11-2 then obtains, from the address 102 of the vector to be searched, the above vector a to be searched, "0101 1011 0001 0101 1100 0001 1001", with an input length of 28.
The operation module 12-2, according to the width "4" of the numbers to be checked, obtains the numbers to be checked one by one from the vector a to be searched, determines in turn whether the value of each number to be checked is equal to the specified value "1", determines as the target number the first from the end among the numbers to be checked whose value is equal to the specified value "1", and stores the storage address of the target number into the target address 202 as the search result.
In this example, the operation module 12-2 first obtains from the vector a to be searched the last number to be checked, "1001", with a width of 4, and determines whether its value is equal to the specified value "1". Since the value of "1001" is not 1, the operation module 12-2 continues to obtain the next number to be checked, counting from the end, which is "0001", and determines whether its value is equal to the specified value "1". Since the value of "0001" is equal to 1 and its rank is the specified rank (that is, it is the first from the end among the numbers to be checked equal to the specified value), the number "0001" is determined as the target number, and the storage address 502 of the number "0001" is stored into the target address 202 as the search result.
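Example 3 can be modelled in the same way but scanning from the end of the vector; as before, only the matching number's address 502 comes from the example, while the other addresses and the memory model are assumptions of this sketch.

```python
# Software walk-through of Example 3: search from the end for the last number equal to 1.
numbers = [5, 11, 1, 5, 12, 1, 9]                # vector a in decimal
addresses = [497, 498, 499, 500, 501, 502, 503]  # illustrative; the sixth number sits at 502

memory = {}
for value, addr in zip(reversed(numbers), reversed(addresses)):
    if value == 1:                               # first match counted from the end
        memory[202] = addr                       # target address 202 receives address 502
        break
print(memory)                                    # {202: 502}
```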
In this way, the vector search instruction processing apparatus can process vector search instructions quickly and efficiently.
FIG. 2-4 shows a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 2-4, the method is applied to the above vector search instruction processing apparatus and includes step S51-2 and step S52-2.
In step S51-2, the received vector search instruction is parsed to obtain the operation code and operation domain of the vector search instruction, and the vector to be searched, the search condition, and the target address required to execute the vector search instruction are determined according to the operation code and the operation domain. The operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes the address of the vector to be searched and the target address.
In step S52-2, it is determined in turn whether the plurality of numbers to be checked representing the vector to be searched satisfy the search condition, the number to be checked that satisfies the search condition is determined as the target number, and the storage address of the target number is stored into the target address as the search result.
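An end-to-end sketch combining steps S51-2 and S52-2 for the generic Find form is shown below; the instruction layout, the dictionary memory model, and the restriction to the "first match from the front" rank are all assumptions of this sketch.

```python
# Parse a Find instruction (step S51-2), then search the vector and store the result (step S52-2).
def process_vector_find(text, memory, number_addresses):
    opcode, src, length, target, width, value, _rank = text.lstrip("@").split("#")
    src, length, target, width, value = map(int, (src, length, target, width, value))
    bits = memory[src][:length]                                  # fetch the vector to be searched
    numbers = [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]
    for v, addr in zip(numbers, number_addresses):               # test each number in turn
        if v == value:
            memory[target] = addr                                # store the match's address
            return addr
    return None

memory = {100: "0101101100010101110000011001"}
print(process_vector_find("@Find#100#28#200#4#1#01", memory, [498, 499, 500, 501, 502, 503, 504]))  # 500
```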
In a possible implementation, the operation domain may further include an input length. Determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction may include: obtaining the vector to be searched from the address of the vector to be searched according to the input length.
In a possible implementation, the operation domain may further include the width of the numbers to be checked. The method may further include: determining the plurality of numbers to be checked from the vector to be searched according to this width.
In a possible implementation, the operation domain may further include the search condition. Determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction may include: determining the search condition according to the operation domain.
In a possible implementation, determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction may include:
determining the search condition according to the operation code, where the operation code is also used to indicate the search condition of the vector search instruction.
In a possible implementation, determining in turn whether the plurality of numbers to be checked representing the vector to be searched satisfy the search condition may include:
comparing the plurality of numbers to be checked with the search condition using at least one comparator to obtain a comparison result, so that whether a number to be checked satisfies the search condition is determined according to the comparison result.
In a possible implementation, a number to be checked that satisfies the search condition may include at least one of the following:
a number to be checked whose value is a specified multiple of a specified value and whose rank is a specified rank;
a number to be checked whose value lies within a specified value interval;
a number to be checked whose value is a specified multiple of a specified value.
The specified rank may include at least one of the following: the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; the rank of the number to be checked is the m-th from the end among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1. Here, m and n are less than or equal to the number of numbers to be checked in the vector to be searched.
In a possible implementation, the method may further include: storing the vector to be searched.
In a possible implementation, parsing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction may include:
storing the vector search instruction;
parsing the vector search instruction to obtain the operation code and operation domain of the vector search instruction; and
storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the vector search instruction.
In a possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and after it is determined that the zeroth instruction to be executed has finished executing, controlling the execution of the first instruction to be executed,
where the association between the first instruction to be executed and the zeroth instruction to be executed that precedes it includes:
the first storage address interval storing the data required by the first instruction to be executed overlaps the zeroth storage address interval storing the data required by the zeroth instruction to be executed.
It should be noted that although the vector search instruction processing method is described above by taking the above embodiments as examples, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or the actual application scenario, as long as the technical solution of the present disclosure is followed.
The vector search instruction processing method provided by the embodiments of the present disclosure has a wide application range, and processes vector search instructions efficiently and quickly.
The foregoing can be better understood in light of the following clauses:
Clause A1. A vector search instruction processing apparatus, the apparatus comprising:
a control module, configured to parse a received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction; and
an operation module, configured to sequentially determine whether a plurality of numbers to be checked representing the vector to be searched satisfy the search condition, determine a number to be checked that satisfies the search condition as the target number, and store the storage address of the target number into the target address as the search result,
wherein the operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes the address of the vector to be searched and the target address.
Clause A2. The apparatus according to clause A1, wherein the operation domain further includes an input length, and
the control module is further configured to obtain the vector to be searched from the address of the vector to be searched according to the input length.
Clause A3. The apparatus according to clause A1, wherein the operation domain further includes a width of the numbers to be checked, and
the operation module is further configured to determine the plurality of numbers to be checked from the vector to be searched according to the width of the numbers to be checked.
Clause A4. The apparatus according to clause A1, wherein the operation domain further includes the search condition, and
the control module is further configured to determine the search condition according to the operation domain.
Clause A5. The apparatus according to clause A1, wherein
the control module is further configured to determine the search condition according to the operation code, and the operation code is also used to indicate the search condition of the vector search instruction.
Clause A6. The apparatus according to clause A1, wherein the operation module includes:
at least one comparator, configured to compare the plurality of numbers to be checked with the search condition and obtain a comparison result, so that whether a number to be checked satisfies the search condition is determined according to the comparison result.
Clause A7. The apparatus according to any one of clauses A1 to A6, wherein a number to be checked that satisfies the search condition includes at least one of the following:
a number to be checked whose value is a specified multiple of a specified value and whose rank is a specified rank;
a number to be checked whose value lies within a specified value interval; and
a number to be checked whose value is a specified multiple of a specified value,
wherein the specified rank includes at least one of the following:
the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; and
the rank of the number to be checked is the m-th from the end among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
wherein m and n are less than or equal to the number of numbers to be checked in the vector to be searched.
Clause A8. The apparatus according to clause A1, wherein
the apparatus further includes a storage module configured to store the vector to be searched,
wherein the control module includes:
an instruction storage submodule, configured to store the vector search instruction;
an instruction processing submodule, configured to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction; and
a queue storage submodule, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the vector search instruction,
wherein the control module further includes:
a dependency processing submodule, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage submodule and send it to the operation module,
wherein the association between the first instruction to be executed and the zeroth instruction to be executed that precedes the first instruction to be executed includes:
the first storage address interval storing the data required by the first instruction to be executed overlaps the zeroth storage address interval storing the data required by the zeroth instruction to be executed.
Clause A9. A machine learning operation apparatus, the apparatus comprising:
one or more vector search instruction processing apparatuses according to any one of clauses A1 to A8, configured to obtain data to be operated on and control information from other processing apparatuses, perform a specified machine learning operation, and transfer the execution result to the other processing apparatuses through an I/O interface,
wherein, when the machine learning operation apparatus includes a plurality of the vector search instruction processing apparatuses, the plurality of vector search instruction processing apparatuses can be connected to one another and transmit data through a specific structure,
wherein the plurality of vector search instruction processing apparatuses are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; the plurality of vector search instruction processing apparatuses share the same control system or have their own control systems; the plurality of vector search instruction processing apparatuses share memory or have their own memories; and the interconnection manner of the plurality of vector search instruction processing apparatuses is an arbitrary interconnection topology.
Clause A10. A combined processing device, comprising:
the machine learning operation device according to Clause A9, a universal interconnection interface, and another processing device;
the machine learning operation device interacts with the other processing device to jointly complete a computing operation specified by a user,
wherein the combined processing device further includes: a storage device connected to the machine learning operation device and the other processing device respectively, and configured to save data of the machine learning operation device and the other processing device.
Clause A11. A machine learning chip, comprising:
the machine learning operation device according to Clause A9 or the combined processing device according to Clause A10.
Clause A12. An electronic device, comprising:
the machine learning chip according to Clause A11.
Clause A13. A board card, comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause A11;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause A14. A vector search instruction processing method, applied to a vector search instruction processing device, the method comprising:
parsing a received vector search instruction to obtain an operation code and an operation domain of the vector search instruction, and determining, according to the operation code and the operation domain, a vector to be searched, a search condition, and a target address required to execute the vector search instruction;
sequentially determining whether a plurality of numbers to be checked representing the vector to be searched satisfy the search condition, determining a number to be checked that satisfies the search condition as a target number, and storing the storage address of the target number into the target address as the search result,
wherein the operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes the address of the vector to be searched and the target address.
Clause A15. The method according to Clause A14, wherein the operation domain further includes an input length,
and determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
obtaining the vector to be searched from the address of the vector to be searched according to the input length.
Clause A16. The method according to Clause A14, wherein the operation domain further includes a width of the numbers to be checked, and the method further includes:
determining the plurality of numbers to be checked from the vector to be searched according to the width of the numbers to be checked.
Clause A17. The method according to Clause A14, wherein the operation domain further includes the search condition,
and determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
determining the search condition according to the operation domain.
Clause A18. The method according to Clause A14, wherein determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
determining the search condition according to the operation code, the operation code also being used to indicate the search condition of the vector search instruction.
Clause A19. The method according to Clause A14, wherein sequentially determining whether the plurality of numbers to be checked representing the vector to be searched satisfy the search condition includes:
comparing the plurality of numbers to be checked with the search condition using at least one comparator to obtain comparison results, so that whether a number to be checked satisfies the search condition is determined according to the comparison results.
Clause A20. The method according to any one of Clauses A14 to A19, wherein a number to be checked that satisfies the search condition includes at least one of the following:
a number to be checked whose value is a specified multiple of a specified value and whose ordering is a specified ordering;
a number to be checked whose value lies within a specified value interval;
a number to be checked whose value is a specified multiple of a specified value,
wherein the specified ordering includes at least one of the following:
the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1;
the number to be checked is the m-th from the end among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
wherein m and n are less than or equal to the quantity of numbers to be checked in the vector to be searched.
Clause A21. The method according to Clause A14,
further comprising: storing the vector to be searched,
wherein parsing the received vector search instruction to obtain the operation code and the operation domain of the vector search instruction includes:
storing the vector search instruction;
parsing the vector search instruction to obtain the operation code and the operation domain of the vector search instruction;
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the vector search instruction,
wherein the method further includes:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
wherein the association between the first instruction to be executed and the preceding zeroth instruction to be executed includes:
a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 3-1, the device includes a control module 11-3 and an operation module 12-3.
The control module 11-3 is configured to parse a received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the scalar search instruction. The operation code is used to indicate that the operation performed by the scalar search instruction on data is a search operation, and the operation domain includes the address of the scalar to be searched and the target address.
The operation module 12-3 is configured to sequentially determine whether the values of a plurality of numbers to be checked representing the scalar to be searched are equal to the specified value, determine a number to be checked whose value equals the specified value and whose ordering is the specified ordering as the target number, and store the storage address of the target number into the target address as the search result.
In this embodiment, the scalar to be searched may be a binary, hexadecimal, or similar character string. For example, the binary representation of the scalar 87 to be searched is "01010111", and its numbers to be checked are "0", "1", "0", "1", "0", "1", "1", and "1". The control module may obtain the scalar to be searched from the address of the scalar to be searched, which may be, for example, the first address at which the scalar is stored. The control module may obtain the scalar search instruction and the scalar to be searched through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
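As a quick plain-Python illustration of the example above (this is only a sketch of the arithmetic, not the device's internal representation), the decimal scalar 87 expands to the 8-bit string whose individual characters are the numbers to be checked:

```python
# Sanity check of the "87 -> 01010111" example; variable names are illustrative.
scalar_87 = format(87, "08b")          # 8-bit binary string for the decimal scalar 87
assert scalar_87 == "01010111"
assert list(scalar_87) == ["0", "1", "0", "1", "0", "1", "1", "1"]
```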
In this embodiment, the operation code may be the part of an instruction, or a field (usually represented by a code), that specifies the operation to be performed; it is the instruction serial number that tells the executing device which instruction to execute. The operation domain may be the source of all data required to execute the corresponding instruction, including parameter data, the scalar to be searched, and the corresponding operation method, or the addresses at which the parameter data, the scalar to be searched, and the corresponding operation method are stored. A scalar search instruction must include an operation code and an operation domain, and the operation domain includes at least the address of the scalar to be searched and the target address.
It should be understood that those skilled in the art may set the instruction format of the scalar search instruction, and the operation code and operation domain it contains, as needed, and the present disclosure does not limit this.
In this embodiment, the device may include one or more control modules and one or more operation modules, and the number of control modules and operation modules may be set according to actual needs, which is not limited by the present disclosure.
The scalar search instruction processing device provided by the embodiments of the present disclosure includes a control module and an operation module. The control module is configured to parse a received scalar search instruction, obtain its operation code and operation domain, and determine, according to the operation code and operation domain, the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the instruction. The operation module is configured to sequentially determine whether the values of the numbers to be checked representing the scalar to be searched are equal to the specified value, determine the number to be checked whose value equals the specified value and whose ordering is the specified ordering as the target number, and store the storage address of the target number into the target address as the search result. The device has a wide range of application and processes scalar search instructions efficiently and quickly.
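To make the parse-then-search flow concrete, the following is a minimal Python sketch rather than the patented hardware. The "@opcode#field#..." text form and the field order (specified value, scalar address, input length, target address, specified ordering, width) are taken from the worked examples later in this section and should be read as illustrative assumptions; the function and variable names are hypothetical.

```python
def parse_find(instruction: str):
    """Split an instruction such as '@Find#1#100#12#200#01#4'
    into its opcode and the raw operation-field values."""
    opcode, *fields = instruction.lstrip("@").split("#")
    return opcode, fields

def find(scalar: str, width: int, value: str, order: str):
    """Cut the scalar string into chunks of `width` bits and return the
    index of the chunk whose value equals `value`, honoring the
    specified ordering ('first' or 'last'). Returns None if no match."""
    chunks = [scalar[i:i + width] for i in range(0, len(scalar), width)]
    matches = [i for i, c in enumerate(chunks) if int(c, 2) == int(value, 2)]
    if not matches:
        return None
    return matches[0] if order == "first" else matches[-1]

# Specified value 1, width 4, first match: the third chunk "0001" is the target.
assert parse_find("@Find#1#100#12#200#01#4")[0] == "Find"
assert find("010110110001", 4, "1", "first") == 2
```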
In a possible implementation, the specified ordering may include at least one of the following: the number to be checked is the n-th among the numbers to be checked that are equal to the specified value, where n is a positive integer greater than or equal to 1; or the number to be checked is the m-th from the end among the numbers to be checked that are equal to the specified value, where m is a positive integer greater than or equal to 1. Here m and n are less than or equal to the quantity of numbers to be checked in the scalar to be searched.
In this implementation, the forward and reverse orderings can be distinguished by giving different expressions to "the n-th among the numbers to be checked that are equal to the specified value" and "the m-th from the end among the numbers to be checked that are equal to the specified value". For example, the former may be expressed as "0n" in the scalar search instruction and the latter as "m0". Those skilled in the art may set the expression of the specified ordering according to actual needs, and the present disclosure does not limit this.
In a possible implementation, the specified value may be a value such as 0, 1, 2, or 3.
For example, if the scalar to be searched is a hexadecimal string, the specified value may be 0-9 or A-F. If the scalar to be searched is a binary string, the specified value may be 0 or 1. The target number found by the scalar search instruction may be, for example, the first 1 or the last 1 among the numbers to be checked of the scalar to be searched.
In a possible implementation, the operation domain may further include an input length. The control module 11-3 is further configured to obtain the scalar to be searched from the address of the scalar to be searched according to the input length.
In this implementation, the length of the scalar obtained from the address of the scalar to be searched according to the input length must be equal to, or less than, the input length.
In a possible implementation, when the operation domain does not include an input length, the scalar to be searched may be obtained according to a preset default input length, or all data at the address of the scalar to be searched may be taken as the scalar to be searched.
In a possible implementation, the operation domain may further include a width of the numbers to be checked. The operation module 12-3 is further configured to determine the plurality of numbers to be checked from the scalar to be searched according to this width.
In this implementation, when the operation domain includes the width of the numbers to be checked, a plurality of numbers to be checked of that width can be determined from the scalar to be searched.
In this implementation, when the operation domain does not include the width of the numbers to be checked (that is, the position corresponding to this width in the scalar search instruction is empty, or the width is absent), or when the width is 1, the numbers to be checked of the scalar to be searched are the individual characters of the character string. For example, if the scalar n to be searched is "01010111" and the width is 1, its numbers to be checked are "0", "1", "0", "1", "0", "1", "1", and "1".
In this implementation, when the operation domain includes the width of the numbers to be checked and the width is greater than 1, the numbers to be checked of the scalar to be searched are binary digit strings of that width, each string representing one number to be checked. For example, if the width is 3 and the scalar m to be searched is "101110100", its numbers to be checked are "101", "110", and "100".
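A short sketch of how the numbers to be checked could be derived in the two cases above (width 1 versus width greater than 1); the helper name is hypothetical and the slicing strategy is only one possible realization:

```python
def split_into_check_numbers(scalar: str, width: int = 1):
    """With width 1 every character is a number to be checked; with a
    larger width the scalar is cut into fixed-width binary digit strings."""
    return [scalar[i:i + width] for i in range(0, len(scalar), width)]

assert split_into_check_numbers("01010111") == list("01010111")
assert split_into_check_numbers("101110100", 3) == ["101", "110", "100"]
```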
In a possible implementation, the operation domain may further include the specified value and the specified ordering. The control module 11-3 is further configured to determine the specified value and the specified ordering according to the operation domain.
In this implementation, when the operation domain includes the specified value and the specified ordering, they can be obtained directly from the operation domain.
In a possible implementation, the control module 11-3 is further configured to determine the specified value and the specified ordering according to the operation code, where the operation code is also used to indicate the specified value and the specified ordering of the scalar search instruction.
In this implementation, different operation codes may be set to represent different specified values and specified orderings. In particular, the width of the numbers to be checked may also be determined according to the operation code or according to a default width.
For example, the operation code "Find_bfirst" may be defined to find the first 1 among the numbers to be checked of the scalar to be searched (the width of the numbers to be checked is 1, the specified value is 1, and the specified ordering is the first number to be checked that equals the specified value). The operation code "Find_blast" may be defined to find the last 1 among the numbers to be checked of the scalar to be searched (the width of the numbers to be checked is 1, the specified value is 1, and the specified ordering is the last number to be checked that equals the specified value). When the operation code is "Find_bfirst" or "Find_blast", the width of the numbers to be checked can further be determined to be 1 according to the operation code, and the numbers to be checked of width 1 can then be obtained from the scalar to be searched.
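The sketch below illustrates how such opcode-implied parameters could be resolved, falling back to the operation domain for the generic Find opcode. The dictionary encoding and the field names are assumptions made for illustration, not the patent's encoding:

```python
OPCODE_DEFAULTS = {
    # opcode: (width of numbers to be checked, specified value, specified ordering)
    "Find_bfirst": (1, "1", "first"),
    "Find_blast":  (1, "1", "last"),
}

def resolve_search_params(opcode: str, operands: dict):
    """Special opcodes carry width, value and ordering implicitly; the
    generic Find opcode reads them from the operation domain fields."""
    if opcode in OPCODE_DEFAULTS:
        return OPCODE_DEFAULTS[opcode]
    return operands["width"], operands["value"], operands["order"]

assert resolve_search_params("Find_bfirst", {}) == (1, "1", "first")
assert resolve_search_params("Find", {"width": 4, "value": "1", "order": "first"}) == (4, "1", "first")
```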
FIG. 3-2 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 3-2, the operation module 12-3 may include at least one comparator 121-3 configured to compare the values of the plurality of numbers to be checked with the specified value and obtain comparison results, so that whether the value of a number to be checked equals the specified value is determined according to the comparison results.
For example, taking the specified value "1" and the specified ordering "the first number to be checked that equals the specified value" as an example, the comparator may compare the values of the numbers to be checked of the scalar to be searched with the specified value "1" one by one to obtain comparison results. The operation module can then determine, according to the comparison results, whether the value of each number to be checked equals the specified value "1", determine the first number to be checked whose value equals "1" as the target number, and store the storage address of the target number into the target address as the search result. The number of comparators may be set according to the amount of data to be compared and the required comparison speed and efficiency, which is not limited by the present disclosure.
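As a hedged illustration of this sequential comparator behaviour, the sketch below scans the numbers to be checked one comparison at a time, stopping at the first match for a "first" ordering and scanning from the tail for a "last" ordering. The early-exit and reverse-scan strategies are implementation assumptions, not requirements of the disclosure:

```python
def compare_scan(check_numbers, value, order="first"):
    """Return the position of the target number to be checked, or None."""
    indices = range(len(check_numbers))
    if order == "last":
        indices = reversed(indices)
    for i in indices:
        if int(check_numbers[i], 2) == int(value, 2):  # one comparator step
            return i
    return None

assert compare_scan(["0101", "1011", "0001"], "1", "first") == 2
assert compare_scan(list("010110110001"), "1", "last") == 11
```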
In a possible implementation, as shown in FIG. 3-2, the device may further include a storage module 13-3 configured to store the scalar to be searched.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratchpad cache. The scalar to be searched may be stored in the memory, the cache, and/or the register of the storage module as needed, which is not limited by the present disclosure.
In a possible implementation, the device may further include a direct memory access module configured to read data from, or store data into, the storage module.
In a possible implementation, as shown in FIG. 3-2, the control module 11-3 may include an instruction storage submodule 111-3, an instruction processing submodule 112-3, and a queue storage submodule 113-3.
The instruction storage submodule 111-3 is configured to store the scalar search instruction.
The instruction processing submodule 112-3 is configured to parse the scalar search instruction to obtain its operation code and operation domain.
The queue storage submodule 113-3 is configured to store an instruction queue, which includes a plurality of instructions to be executed arranged in execution order; the plurality of instructions to be executed may include the scalar search instruction.
In this implementation, the execution order of the instructions to be executed may be arranged according to their reception time, priority level, and so on to obtain the instruction queue, so that the instructions to be executed are executed sequentially according to the instruction queue.
In a possible implementation, as shown in FIG. 3-2, the control module 11-3 may further include a dependency processing submodule 114-3.
The dependency processing submodule 114-3 is configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage submodule 111-3, and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage submodule 111-3 and send it to the operation module 12-3. The first instruction to be executed and the zeroth instruction to be executed are both instructions among the plurality of instructions to be executed.
The association between the first instruction to be executed and the preceding zeroth instruction to be executed includes: a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed. Conversely, the absence of an association between the first instruction to be executed and the zeroth instruction to be executed may mean that the first storage address interval and the zeroth storage address interval have no overlapping area.
In this way, based on the dependency between instructions to be executed, a later instruction is executed only after the earlier instruction it depends on has finished executing, which ensures the accuracy of the operation results.
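A minimal sketch of this dependency rule, assuming the data address intervals can be modelled as half-open (start, end) pairs; the helper names are hypothetical:

```python
def intervals_overlap(first, zeroth):
    """True when the two [start, end) storage address intervals overlap."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]

def must_wait(first_interval, pending_intervals):
    """The first instruction is held back if its interval overlaps any
    interval of an earlier, not yet finished instruction."""
    return any(intervals_overlap(first_interval, p) for p in pending_intervals)

assert intervals_overlap((100, 112), (108, 120))       # overlapping: must wait
assert not must_wait((200, 204), [(100, 112)])          # disjoint: may issue
```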
In a possible implementation, the instruction format of the scalar search instruction may be as shown in Table 3 below, where the operation code and its position may be set as needed. Table 4 gives a generic scalar search instruction (Find), with which an arbitrary number in the scalar to be searched can be found, together with examples of scalar search instructions and the operation codes and operation domains required by two special types of scalar search instruction (Find_bfirst and Find_blast). Using a special type of scalar search instruction to search the scalar to be searched can simplify instruction processing and save search time.
Table 3 Instruction format
(Table 3 appears as an image in the original publication.)
Table 4 Examples of scalar search instructions
(Table 4 appears as an image in the original publication.)
For the scalar search instruction with the operation code "Find_bfirst", the corresponding specified value is 1, the specified ordering is the first number to be checked that equals the specified value, and the width of the numbers to be checked is 1.
For the scalar search instruction with the operation code "Find_blast", the corresponding width of the numbers to be checked is 1, the specified value is 1, and the specified ordering is the last number to be checked that equals the specified value.
It should be understood that those skilled in the art may set the operation code of the scalar search instruction and the positions of the operation code and the operation domain in the instruction format as needed, and the present disclosure does not limit this.
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
It should be noted that although the scalar search instruction processing device is described above by taking the foregoing embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user may flexibly set each module according to personal preference and/or the actual application scenario, as long as the technical solution of the present disclosure is followed.
Application examples
In the following, an application example according to an embodiment of the present disclosure is given with "searching a scalar to be searched using the scalar search instruction processing device" as an exemplary application scenario, to facilitate understanding of the flow of the scalar search instruction processing device. Those skilled in the art should understand that the following application example is only intended to facilitate understanding of the embodiments of the present disclosure and should not be regarded as a limitation on them.
FIGS. 3-3a to 3-3c are schematic diagrams of application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in FIGS. 3-3a to 3-3c, the scalar search instruction processing device processes scalar search instructions as follows.
First, assume that the scalar a to be searched is "010110110001", a single number whose decimal value is 1457. To distinguish the different scalar search instructions, assume that the storage address of the scalar a to be searched differs between instructions.
The scalar search instructions to be processed by the device include:
Scalar search instruction 1: @Find#1#100#12#200#01#4
Scalar search instruction 4: @Find_bfirst#103#12#203
Scalar search instruction 5: @Find_blast#104#12#204
Example 1-3
As shown in FIG. 3-3a, upon receiving scalar search instruction 1, the control module 11-3 parses it, obtains the operation code Find, and determines from the operation domain that the specified value of scalar search instruction 1 is "1", the address of the scalar to be searched is "100", the input length is "12", the target address is "200", the specified ordering is "the first number to be checked that equals the specified value", and the width of the numbers to be checked is "4". The control module 11-3 then obtains the above scalar a to be searched, "010110110001", of input length 12 from the scalar address 100.
According to the width "4" of the numbers to be checked, the operation module 12-3 obtains the numbers to be checked one by one from the scalar a to be searched, determines in turn whether the value of each number to be checked equals the specified value "1", determines the first number to be checked whose value equals the specified value "1" as the target number, and stores the storage address of the target number into the target address 200 as the search result.
In this example, the operation module 12-3 first obtains the first number to be checked, "0101", of width 4 from the scalar a to be searched and determines whether its value equals the specified value "1". Since the value of "0101" is not 1, the operation module 12-3 obtains the next number to be checked, "1011", and determines whether its value equals the specified value "1". Since the value of "1011" is not 1, the operation module 12-3 obtains the next number to be checked, "0001", and determines whether its value equals the specified value "1". Since the value of "0001" equals 1 and its ordering is the specified ordering (that is, it is the first number to be checked that equals the specified value), "0001" is determined as the target number, and its storage address 500 is stored into the target address 200 as the search result.
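The arithmetic of Example 1-3 can be checked with a few lines of Python under the same illustrative conventions, using 0-based chunk indices rather than the example's storage addresses:

```python
# Example 1-3: scalar a, width 4, specified value 1, first match.
scalar_a = "010110110001"
chunks = [scalar_a[i:i + 4] for i in range(0, len(scalar_a), 4)]
target_index = next(i for i, c in enumerate(chunks) if int(c, 2) == 1)

assert chunks == ["0101", "1011", "0001"]
assert target_index == 2  # the third chunk, "0001", is the target number
```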
Example 2-3
As shown in FIG. 3-3b, upon receiving scalar search instruction 4, the control module 11-3 parses it, obtains the operation code Find_bfirst, and determines from the operation domain that the address of the scalar to be searched is "103", the input length is "12", and the target address is "203". Moreover, according to the operation code Find_bfirst, the specified value of scalar search instruction 4 is determined to be "1" and the specified ordering to be "the first number to be checked that equals the specified value". The control module 11-3 then obtains the above scalar a to be searched, "010110110001", of input length 12 from the scalar address 103.
The operation module 12-3 obtains the numbers to be checked one by one from the scalar a to be searched, determines in turn whether the value of each number to be checked equals the specified value "1", determines the first number to be checked whose value equals the specified value "1" as the target number, and stores the storage address of the target number into the target address 203 as the search result.
In this example, the operation module 12-3 first obtains the first number to be checked, "0", from the scalar a to be searched and determines whether its value equals the specified value "1". Since the value of "0" is not 1, the operation module 12-3 obtains the next number to be checked, "1", and determines whether its value equals the specified value "1". Since the value of "1" equals 1 and its ordering is the specified ordering (that is, it is the first number to be checked that equals the specified value), "1" is determined as the target number, and its storage address 503 is stored into the target address 203 as the search result.
Example 3-3
As shown in FIG. 3-3c, upon receiving scalar search instruction 5, the control module 11-3 parses it, obtains the operation code Find_blast, and determines from the operation domain that the address of the scalar to be searched is "104", the input length is "12", and the target address is "204". Moreover, according to the operation code Find_blast, the specified value of scalar search instruction 5 is determined to be "1" and the specified ordering to be "the last number to be checked that equals the specified value". The control module 11-3 then obtains the above scalar a to be searched, "010110110001", of input length 12 from the scalar address 104.
The operation module 12-3 obtains the numbers to be checked one by one from the scalar a to be searched, determines in turn whether the value of each number to be checked equals the specified value "1", determines the last number to be checked whose value equals the specified value "1" as the target number, and stores the storage address of the target number into the target address 204 as the search result.
In this example, the operation module 12-3 first obtains the number to be checked "1" (working from the end of the scalar a to be searched) and determines whether its value equals the specified value "1". Since the value of "1" equals 1 and its ordering is the specified ordering (that is, it is the last number to be checked that equals the specified value), "1" is determined as the target number, and its storage address 504 is stored into the target address 204 as the search result.
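The bit positions in Examples 2-3 and 3-3 can likewise be checked under an illustrative 0-based indexing convention (the examples' storage addresses 503 and 504 are not reproduced here):

```python
# With width 1 and specified value 1, Find_bfirst locates the first '1'
# bit and Find_blast the last '1' bit of the scalar a to be searched.
scalar_a = "010110110001"
first_one = scalar_a.index("1")
last_one = len(scalar_a) - 1 - scalar_a[::-1].index("1")

assert first_one == 1    # Example 2-3: the second bit is the first '1'
assert last_one == 11    # Example 3-3: the final bit is the last '1'
```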
In this way, the scalar search instruction processing device can process scalar search instructions quickly and efficiently.
FIG. 3-4 shows a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, the method is applied to the above scalar search instruction processing device and includes step S51-3 and step S52-3.
In step S51-3, a received scalar search instruction is parsed to obtain its operation code and operation domain, and the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the scalar search instruction are determined according to the operation code and the operation domain. The operation code is used to indicate that the operation performed by the scalar search instruction on scalar data is a search operation, and the operation domain includes the address of the scalar to be searched and the target address.
In step S52-3, it is sequentially determined whether the values of a plurality of numbers to be checked representing the scalar to be searched are equal to the specified value; the number to be checked whose value equals the specified value and whose ordering is the specified ordering is determined as the target number, and the storage address of the target number is stored into the target address as the search result.
In a possible implementation, the operation domain may further include an input length. Determining, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the scalar search instruction may include: obtaining the scalar to be searched from the address of the scalar to be searched according to the input length.
In a possible implementation, the operation domain may further include the specified value and the specified ordering. Determining, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the scalar search instruction may include: determining the specified value and the specified ordering according to the operation domain.
In a possible implementation, determining, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified ordering, and the target address required to execute the scalar search instruction may include:
determining the specified value and the specified ordering according to the operation code, the operation code also being used to indicate the specified value and the specified ordering of the scalar search instruction.
In a possible implementation, sequentially determining whether the values of the plurality of numbers to be checked representing the scalar to be searched are equal to the specified value may include:
comparing the values of the plurality of numbers to be checked with the specified value using at least one comparator to obtain comparison results, so that whether the value of a number to be checked equals the specified value is determined according to the comparison results.
In a possible implementation, the specified ordering may include at least one of the following:
the number to be checked is the n-th among the numbers to be checked that are equal to the specified value, where n is a positive integer greater than or equal to 1; or the number to be checked is the m-th from the end among the numbers to be checked that are equal to the specified value, where m is a positive integer greater than or equal to 1. Here m and n are less than or equal to the quantity of numbers to be checked in the scalar to be searched.
In a possible implementation, the method may further include: storing the scalar to be searched.
In a possible implementation, parsing the received scalar search instruction to obtain its operation code and operation domain may include:
storing the scalar search instruction;
parsing the scalar search instruction to obtain its operation code and operation domain;
storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed possibly including the scalar search instruction.
In a possible implementation, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, caching the first instruction to be executed, and, after determining that the zeroth instruction to be executed has finished executing, controlling execution of the first instruction to be executed,
wherein the association between the first instruction to be executed and the preceding zeroth instruction to be executed includes:
a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
It should be noted that although the scalar search instruction processing method is described above by taking the foregoing embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user may flexibly set each step according to personal preference and/or the actual application scenario, as long as the technical solution of the present disclosure is followed.
The scalar search instruction processing method provided by the embodiments of the present disclosure has a wide range of application and processes scalar search instructions efficiently and quickly.
The foregoing may be better understood according to the following clauses:
Clause B1. A scalar search instruction processing device, comprising:
a control module configured to parse a received scalar search instruction, obtain an operation code and an operation domain of the scalar search instruction, and determine, according to the operation code and the operation domain, a scalar to be searched, a specified value, a specified ordering, and a target address required to execute the scalar search instruction;
an operation module configured to sequentially determine whether the values of a plurality of numbers to be checked representing the scalar to be searched are equal to the specified value, determine a number to be checked whose value equals the specified value and whose ordering is the specified ordering as a target number, and store the storage address of the target number into the target address as the search result,
wherein the operation code is used to indicate that the operation performed by the scalar search instruction on scalar data is a search operation, and the operation domain includes the address of the scalar to be searched and the target address.
Clause B2. The device according to Clause B1, wherein the operation domain further includes an input length,
and the control module is further configured to obtain the scalar to be searched from the address of the scalar to be searched according to the input length.
Clause B3. The device according to Clause B1, wherein the operation domain further includes the specified value and the specified ordering,
and the control module is further configured to determine the specified value and the specified ordering according to the operation domain.
Clause B4. The device according to Clause B1,
wherein the control module is further configured to determine the specified value and the specified ordering according to the operation code, the operation code also being used to indicate the specified value and the specified ordering of the scalar search instruction.
Clause B5. The device according to Clause B1, wherein the operation module includes:
at least one comparator configured to compare the values of the plurality of numbers to be checked with the specified value and obtain comparison results, so that whether the value of a number to be checked equals the specified value is determined according to the comparison results.
Clause B6. The device according to any one of Clauses B1 to B5, wherein the specified ordering includes at least one of the following:
the number to be checked is the n-th among the numbers to be checked that are equal to the specified value, where n is a positive integer greater than or equal to 1;
the number to be checked is the m-th from the end among the numbers to be checked that are equal to the specified value, where m is a positive integer greater than or equal to 1,
wherein m and n are less than or equal to the quantity of numbers to be checked in the scalar to be searched.
Clause B7. The device according to Clause B1,
further comprising: a storage module configured to store the scalar to be searched,
wherein the control module includes:
an instruction storage submodule configured to store the scalar search instruction;
an instruction processing submodule configured to parse the scalar search instruction to obtain the operation code and the operation domain of the scalar search instruction;
a queue storage submodule configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in execution order, the plurality of instructions to be executed including the scalar search instruction,
wherein the control module further includes:
a dependency processing submodule configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes it, cache the first instruction to be executed in the instruction storage submodule, and, after the zeroth instruction to be executed has finished executing, fetch the first instruction to be executed from the instruction storage submodule and send it to the operation module,
wherein the association between the first instruction to be executed and the preceding zeroth instruction to be executed includes:
a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
条款B8、一种机器学习运算装置,所述装置包括:Clause B8. A machine learning computing device, the device comprising:
一个或多个如条款B1-条款B7任一项所述的标量查找指令处理装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的机器学习运算,将执行结果通过I/O接口传递给其他处理装置;One or more scalar search instruction processing devices as described in any one of Clause B1-Clause B7, used to obtain the data and control information to be calculated from other processing devices, and perform the specified machine learning operation, and pass the execution result /O interface is passed to other processing devices;
当所述机器学习运算装置包含多个所述标量查找指令处理装置时,所述多个所述标量查找指令处理装置间可以通过特定的结构进行连接并传输数据;When the machine learning computing device includes a plurality of the scalar search instruction processing devices, the plurality of scalar search instruction processing devices may be connected and transmit data through a specific structure;
wherein, the plurality of scalar search instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of scalar search instruction processing devices share the same control system or have their own control systems; the plurality of scalar search instruction processing devices share memory or have their own memories; and the plurality of scalar search instruction processing devices are interconnected in an arbitrary interconnection topology.
条款B9、一种组合处理装置,所述组合处理装置包括:Clause B9. A combined processing device, the combined processing device comprising:
如条款B8所述的机器学习运算装置、通用互联接口和其他处理装置;Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause B8;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作,The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
其中,所述组合处理装置还包括:存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
条款B10、一种机器学习芯片,所述机器学习芯片包括:Article B10. A machine learning chip, the machine learning chip includes:
如条款B8所述的机器学习运算装置或如条款B9所述的组合处理装置。The machine learning arithmetic device according to clause B8 or the combined processing device according to clause B9.
条款B11、一种电子设备,所述电子设备包括:Article B11. An electronic device, the electronic device comprising:
如条款B10所述的机器学习芯片。Machine learning chip as described in clause B10.
条款B12、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款B10所述的机器学习芯片;Clause B12, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause B10;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used to store data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款B13、一种标量查找指令处理方法,所述方法应用于标量查找指令处理装置,所述方法包括:Article B13. A scalar search instruction processing method. The method is applied to a scalar search instruction processing device. The method includes:
parsing the received scalar search instruction to obtain an operation code and an operation domain of the scalar search instruction, and determining, according to the operation code and the operation domain, the scalar to be searched, the specified value, the specified order and the target address required for executing the scalar search instruction;
sequentially determining whether the values of a plurality of numbers to be checked that represent the scalar to be searched are equal to the specified value, determining, as the target number, the number to be checked whose value is equal to the specified value and whose order is the specified order, and storing the storage address of the target number into the target address as the search result,
其中,所述操作码用于指示所述标量查找指令对数据所进行的运算为查找运算,所述操作域包括所述待查找标量地址和所述目标地址。Wherein, the operation code is used to indicate that the operation performed by the scalar search instruction on the data is a search operation, and the operation domain includes the scalar address to be searched and the target address.
条款B14、根据条款B13所述的方法,所述操作域还包括输入长度,Clause B14. The method according to Clause B13, the operation field further includes an input length,
其中,根据所述操作码和所述操作域确定执行所述标量查找指令所需的待查找标量、指定值、指定排序和目标地址,包括:Wherein, determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain include:
根据所述输入长度,从所述待查找标量地址中获取所述待查找标量。According to the input length, obtain the scalar to be searched from the scalar address to be searched.
条款B15、根据条款B13所述的方法,所述操作域还包括指定值和指定排序,Clause B15. According to the method described in Clause B13, the operation domain further includes a specified value and a specified order,
其中,根据所述操作码和所述操作域确定执行所述标量查找指令所需的待查找标量、指定值、指定排序和目标地址,包括:Wherein, determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain include:
根据所述操作域,确定所述指定值和所述指定排序。According to the operation domain, the specified value and the specified order are determined.
条款B16、根据条款B13所述的方法,根据所述操作码和所述操作域确定执行所述标量查找指令所需的待查找标量、指定值、指定排序和目标地址,包括:Clause B16. According to the method described in Clause B13, determine the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction based on the operation code and the operation domain, including:
根据所述操作码,确定所述指定值和所述指定排序,所述操作码还用于指示所述标量查找指令的指定值和指定排序。The specified value and the specified order are determined according to the operation code, and the operation code is also used to indicate the specified value and the specified order of the scalar search instruction.
条款B17、根据条款B13所述的方法,依次确定表示所述待查找标量的多个待查数的数值是否等于所述指定值,包括:Clause B17. According to the method described in Clause B13, sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, including:
利用至少一个比较器对所述多个待查数的数值和所述指定值进行比较,获得比较结果,以便于根据所述比较结果确定待查数的数值与所述指定值是否相等。At least one comparator is used to compare the values of the plurality of to-be-checked numbers with the specified value to obtain a comparison result, so as to determine whether the value of the to-be-checked number is equal to the specified value according to the comparison result.
条款B18、根据条款B13-条款B17任一项所述的方法,所述指定排序包括以下至少一种:Clause B18. The method according to any one of Clause B13-B17, the specified ranking includes at least one of the following:
所述待查数的排序为等于所述指定值的待查数中的第n个,所述n为大于或等于1的正整数;The order of the number to be checked is the nth of the number to be checked equal to the specified value, where n is a positive integer greater than or equal to 1;
所述待查数的排序为等于所述指定值的待查数中的倒数第m个,所述m为大于或等于1的正整数,The order of the number to be checked is the m-th to the last of the number to be checked which is equal to the specified value, the m is a positive integer greater than or equal to 1,
其中,m、n小于或等于所述待查找标量中待查数的数量。Wherein, m and n are less than or equal to the number of the number to be checked in the scalar to be searched.
条款B19、根据条款B13所述的方法,Clause B19, according to the method described in Clause B13,
所述方法还包括:存储所述待查找标量,The method further includes: storing the scalar to be found,
其中,对接收到的标量查找指令进行解析,获得所述标量查找指令的操作码和操作域,包括:Wherein, the received scalar search instruction is parsed to obtain the operation code and operation domain of the scalar search instruction, including:
存储所述标量查找指令;Store the scalar search instruction;
对所述标量查找指令进行解析,得到所述标量查找指令的操作码和操作域;Parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction;
存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述标量查找指令,Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the scalar search instruction,
其中,所述方法还包括:Wherein, the method further includes:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed is associated with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that the zeroth instruction to be executed has been executed,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系包括:Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
As neural network algorithms are used more and more widely in fields such as image recognition, speech recognition and natural language processing, their complexity keeps increasing, and the types and amount of data operations involved keep growing. When a neural network algorithm performs data operations, resources need to be locked and released frequently to ensure that the resources are used reasonably. In the related art, the speed and efficiency of locking and releasing resources are difficult to match the locking and releasing demands that arise during data operations; locking and releasing are slow and inefficient.
图4-1示出根据本公开一实施例的资源锁放指令处理装置的框图。如图4-1所示,该装置包括控制模块11-4和运算模块12-4。FIG. 4-1 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 4-1, the device includes a control module 11-4 and an arithmetic module 12-4.
The control module 11-4 is configured to parse a received resource lock/release instruction, obtain an operation code and an operation domain of the instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the instruction, and determine the lock/release strategy required for the lock/release processing. The operation code indicates that the processing performed by the instruction on a resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
运算模块12-4,用于根据锁放策略,对待处理资源进行锁定或释放处理,得到处理后的资源。The operation module 12-4 is used to lock or release the resource to be processed according to the lock-and-release strategy to obtain the processed resource.
在本实施例中,锁放策略可以指示对待处理资源进行的处理的方式,包括锁定待处理资源和释放待处理资源。控制模块可以根据待处理资源标识确定待处理资源。待处理资源标识可以是标识待处理资源的编号、名称等信息。控制模块可以通过数据输入输出单元获得资源锁放指令、待处理资源,该数据输入输出单元可以为一个或多个数据I/O接口或I/O引脚。In this embodiment, the lock and release strategy may indicate the manner of processing the resource to be processed, including locking the resource to be processed and releasing the resource to be processed. The control module may determine the resource to be processed according to the identifier of the resource to be processed. The identifier of the resource to be processed may be information such as a number and a name that identify the resource to be processed. The control module can obtain the resource lock instruction and the resource to be processed through the data input/output unit. The data input/output unit may be one or more data I/O interfaces or I/O pins.
In this embodiment, a resource lock/release instruction may include an operation code and an operation domain. The operation code may be the part of an instruction or a field (usually represented by a code) that specifies the operation to be performed, as defined in a computer program; it is an instruction sequence number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all the data or resources required to execute the corresponding instruction. All the data or resources required to execute the corresponding instruction include the resource to be processed, the corresponding lock/release strategy, and so on. For example, the operation domain may include at least the identifier of the resource to be processed.
应当理解的是,本领域技术人员可以根据需要对资源锁放指令的指令格式以及所包含的操作码和 操作域进行设置,本公开对此不作限制。It should be understood that, those skilled in the art can set the instruction format of the resource lock instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.
在本实施例中,该装置可以包括一个或多个控制模块,以及一个或多个处理模块,可以根据实际需要对控制模块和处理模块的数量进行设置,本公开对此不作限制。在装置包括一个控制模块时,该控制模块可以接收资源锁放指令,并控制一个或多个处理模块进行锁定或释放处理。在装置包括多个控制模块时,多个控制模块可以分别接收资源锁放指令,并控制对应的一个或多个处理模块进行锁定或释放处理。In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive a resource lock instruction and control one or more processing modules to perform lock or release processing. When the device includes multiple control modules, the multiple control modules may respectively receive resource lock and release instructions, and control the corresponding one or more processing modules to perform locking or releasing processing.
The resource lock/release instruction processing device provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to parse a received resource lock/release instruction, obtain an operation code and an operation domain of the instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the instruction, and determine the lock/release strategy required for the lock/release processing. The processing module is configured to lock or release the resource to be processed according to the lock/release strategy to obtain the processed resource. The resource lock/release instruction processing device provided by the embodiments of the present disclosure has a wide application range, and locks and releases resources according to the resource lock/release instruction with high processing efficiency and high processing speed.
In a possible implementation, the lock/release strategy may include at least one of locking the resource to be processed and releasing the resource to be processed. A resource to be processed cannot be assigned tasks after it has been locked, and can be assigned tasks after it has been released.
In this implementation, codes in the resource lock/release instruction can be set for different lock/release strategies. For example, in the resource lock/release instruction, "lock the resource to be processed" can be represented by the code PV0, and "release the resource to be processed" can be represented by the code PV1. Those skilled in the art can set the lock/release strategies and their codes according to actual needs, and the present disclosure does not limit this.
在一种可能的实现方式中,操作域还可以用于指示锁放策略。In a possible implementation, the operation domain can also be used to indicate the lock and release strategy.
在一种可能的实现方式中,操作码还可以用于指示锁放策略。In a possible implementation, the operation code can also be used to indicate the lock and release strategy.
在一种可能的实现方式中,可以预先设置默认锁放策略。在控制模块根据资源锁放指令的操作域和操作码均不能确定锁放策略时,可以将默认锁放策略确定为当前资源锁放指令的锁放策略。In a possible implementation manner, a default lock and release strategy may be preset. When the control module cannot determine the lock strategy according to the operation domain and operation code of the resource lock instruction, the default lock strategy can be determined as the lock strategy of the current resource lock instruction.
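As a rough sketch of the fallback described above (a strategy carried in the operation domain, otherwise one encoded in the operation code, otherwise the preset default), consider the following Python fragment; the dictionary field names, the chosen precedence and the default value are illustrative assumptions rather than part of the disclosed device.

```python
# Assumed preset default lock/release strategy.
DEFAULT_STRATEGY = "lock"

def resolve_strategy(parsed_instruction):
    # Prefer a strategy carried in the operation domain, then one encoded in
    # the opcode, and fall back to the preset default otherwise.
    return (parsed_instruction.get("domain_strategy")
            or parsed_instruction.get("opcode_strategy")
            or DEFAULT_STRATEGY)

print(resolve_strategy({"opcode_strategy": "release"}))  # release
print(resolve_strategy({}))                              # lock (default)
```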
在一种可能的实现方式中,待处理资源可以包括IPU资源、GPU资源、CPU资源和访存资源中的至少一种。In a possible implementation manner, the resources to be processed may include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
The IPU resource may be a storage resource of an IPU (Image Processing Unit). The GPU resource may be a storage resource of a GPU (Graphics Processing Unit). The CPU resource may be a storage resource of a CPU (Central Processing Unit). The memory access resource may be a storage resource, such as memory, of a device that the resource lock/release instruction processing device is able to access. Those skilled in the art can set the resources to be processed according to actual needs, and the present disclosure does not limit this.
图4-2示出根据本公开一实施例的资源锁放指令处理装置的框图。在一种可能的实现方式中,如图4-2所示,该装置还可以包括存储模块13-4。存储模块13-4用于存储待处理资源标识。4-2 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 4-2, the device may further include a storage module 13-4. The storage module 13-4 is used to store the resource identifier to be processed.
In this implementation, the storage module may include one or more of a memory, a cache and a register, and the cache may include a high-speed temporary storage (scratchpad) cache. The resource to be processed can be stored in the memory, cache and/or register of the storage module as needed, which is not limited in the present disclosure.
在一种可能的实现方式中,该装置还可以包括直接内存访问模块,用于从存储模块中读取或者存储数据。In a possible implementation, the device may further include a direct memory access module, which is used to read or store data from the storage module.
在一种可能的实现方式中,如图4-2所示,控制模块11-4可以包括指令存储子模块111-4、指令处理子模块112-4和队列存储子模块113-4。In a possible implementation, as shown in FIG. 4-2, the control module 11-4 may include an instruction storage submodule 111-4, an instruction processing submodule 112-4, and a queue storage submodule 113-4.
指令存储子模块111-4用于存储资源锁放指令。The instruction storage submodule 111-4 is used to store resource lock and release instructions.
指令处理子模块112-4用于对资源锁放指令进行解析,得到资源锁放指令的操作码和操作域。The instruction processing submodule 112-4 is used to parse the resource lock instruction and obtain the operation code and operation domain of the resource lock instruction.
The queue storage sub-module 113-4 is configured to store an instruction queue. The instruction queue includes a plurality of instructions to be executed arranged in execution order, and the plurality of instructions to be executed may include the resource lock/release instruction. The plurality of instructions to be executed may also include other computation instructions related to the resource lock/release instruction.
在该实现方式中,可以根据待执行指令的接收时间、优先级别等对多个待执行指令的执行顺序进行排列获得指令队列,以便于根据指令队列依次执行多个待执行指令。In this implementation manner, the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
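One way to picture the ordering of the instruction queue described above is the following Python sketch, which orders pending instructions by priority and then by reception time; the QueuedInstruction fields and the convention that a smaller priority value is more urgent are assumptions for illustration only.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class QueuedInstruction:
    priority: int                      # smaller value = more urgent (assumed)
    recv_time: int                     # reception order
    text: str = field(compare=False)   # the instruction itself

def build_queue(instructions):
    # Pop instructions in (priority, reception time) order.
    heap = list(instructions)
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap).text

pending = [
    QueuedInstruction(priority=1, recv_time=2, text="PV0 r1"),
    QueuedInstruction(priority=0, recv_time=3, text="compute A"),
    QueuedInstruction(priority=1, recv_time=1, text="PV1 r2"),
]
print(list(build_queue(pending)))  # ['compute A', 'PV1 r2', 'PV0 r1']
```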
在一种可能的实现方式中,如图4-2所示,控制模块11-4还可以包括依赖关系处理子模块114-4。In a possible implementation, as shown in FIG. 4-2, the control module 11-4 may further include a dependency processing sub-module 114-4.
The dependency processing sub-module 114-4 is configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module 111-4, and after the zeroth instruction to be executed has been executed, extract the first instruction to be executed from the instruction storage sub-module 111-4 and send it to the processing module 12-4. The first instruction to be executed and the zeroth instruction to be executed are instructions among the plurality of instructions to be executed.
The dependency relationship between the first instruction to be executed and the preceding zeroth instruction to be executed means that a first storage address interval storing the data required by the first instruction to be executed overlaps a zeroth storage address interval storing the data required by the zeroth instruction to be executed. Conversely, the first instruction to be executed has no dependency on the zeroth instruction to be executed when the first storage address interval and the zeroth storage address interval have no overlapping area.
In this way, according to the dependency relationships between the instructions to be executed, a later instruction to be executed is executed only after the earlier instruction to be executed has finished, which ensures the accuracy of the operation result.
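The overlap test that defines a dependency between two instructions can be sketched as follows; the half-open [start, end) representation of the storage address intervals is an assumption made for illustration.

```python
def has_dependency(first_interval, zeroth_interval):
    """Two instructions are treated as dependent when the storage address
    intervals of the data they require overlap. Intervals are half-open
    (start, end) pairs."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end

print(has_dependency((0x100, 0x200), (0x180, 0x280)))  # True: overlapping area
print(has_dependency((0x100, 0x200), (0x200, 0x300)))  # False: disjoint
```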
In a possible implementation, the instruction format of the resource lock/release instruction may be:
PV sign type
where PV is the operation code, and sign and type are the operation domain. PV indicates that the instruction is a resource lock/release instruction. sign is the identifier of the resource to be processed. type is the lock/release strategy: the type for "lock the resource to be processed" is PV0, and the type for "release the resource to be processed" is PV1.
In a possible implementation, the instruction format of the resource lock/release instruction may also be:
PVx sign
where PVx is the operation code and sign is the operation domain. PVx indicates that the instruction is a resource lock/release instruction. sign is the identifier of the resource to be processed. The x in PVx indicates the lock/release strategy: x is 0 for "lock the resource to be processed" and 1 for "release the resource to be processed".
应当理解的是,本领域技术人员可以根据需要对资源锁放策略指令的操作码、指令格式中操作码以及操作域的位置进行设置,本公开对此不作限制。It should be understood that, those skilled in the art can set the operation code of the resource lock policy instruction, the operation code in the instruction format, and the position of the operation domain according to needs, and this disclosure does not limit this.
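A minimal decoding sketch for the two example formats above might look as follows in Python, assuming a whitespace-separated textual encoding; the function name and the returned dictionary layout are illustrative assumptions and do not reflect the binary encoding of an actual instruction.

```python
def parse_lock_instruction(text):
    """Decode 'PV sign type' or 'PVx sign' into a resource id and a strategy."""
    fields = text.split()
    opcode = fields[0]
    if opcode == "PV":                      # format: PV sign type
        sign, type_field = fields[1], fields[2]
        strategy = "lock" if type_field == "PV0" else "release"
    elif opcode in ("PV0", "PV1"):          # format: PVx sign
        sign = fields[1]
        strategy = "lock" if opcode == "PV0" else "release"
    else:
        raise ValueError("not a resource lock/release instruction")
    return {"resource_id": sign, "strategy": strategy}

print(parse_lock_instruction("PV r1 PV0"))  # {'resource_id': 'r1', 'strategy': 'lock'}
print(parse_lock_instruction("PV1 r2"))     # {'resource_id': 'r2', 'strategy': 'release'}
```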
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU) and a neural-network processing unit (NPU).
需要说明的是,尽管以上述实施例作为示例介绍了资源锁放指令处理装置如上,但本领域技术人员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各模块,只要符合本公开的技术方案即可。It should be noted that although the above embodiment is taken as an example to introduce the resource lock instruction processing device as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
应用示例Application examples
以下结合“利用资源锁放指令处理装置对待处理资源进行锁放处理”作为一个示例性应用场景,给出根据本公开实施例的应用示例,以便于理解资源锁放指令处理装置的流程。本领域技术人员应理解,以下应用示例仅仅是出于便于理解本公开实施例的目的,不应视为对本公开实施例的限制。In the following, an application example according to an embodiment of the present disclosure is given in conjunction with “using a resource lock and release instruction processing device to perform lock and release processing on a resource to be processed” as an exemplary application scenario, so as to facilitate understanding of the flow of the resource lock and release instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.
图4-3a-图4-3b示出根据本公开一实施例的资源锁放指令处理装置的应用场景的示意图。如图4-3a-图4-3b所示,资源锁放指令处理装置对资源锁放指令进行处理的过程如下。4-3a-FIG. 4-3b illustrate schematic diagrams of application scenarios of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIGS. 4-3a to 4-3b, the resource lock instruction processing device processes the resource lock instruction as follows.
示例1-4Example 1-4
As shown in FIG. 4-3a, when the control module 11-4 receives a resource lock/release instruction 1 (for example, PV0 r1), it parses the instruction to obtain its operation code and operation domain. The operation code of the resource lock/release instruction 1 is PV0, which means locking the resource to be processed. According to the operation domain, the identifier of the resource to be processed is determined to be r1. The control module 11-4 can then determine the resource to be processed 1 according to the identifier r1.
The processing module 12-4 locks the resource to be processed 1 according to the lock/release strategy PV0 to obtain the processed resource 1'. The processed resource 1' is in a locked state and cannot be assigned tasks.
示例2-4Example 2-4
As shown in FIG. 4-3b, when the control module 11-4 receives a resource lock/release instruction 2 (for example, PV1 r2), it parses the instruction to obtain its operation code and operation domain. The operation code of the resource lock/release instruction 2 is PV1, and according to the operation code PV1 the lock/release strategy is determined to be releasing the resource to be processed. According to the operation domain, the identifier of the resource to be processed is determined to be r2. The control module 11-4 can then determine the resource to be processed 2 according to the identifier r2.
The processing module 12-4 releases the resource to be processed 2 according to the lock/release strategy PV1 to obtain the processed resource 2'. The processed resource 2' is in an idle state and can be assigned tasks.
以上处理过程详见上文相关描述。For details of the above process, please refer to the relevant description above.
这样,资源锁放指令处理装置可以快速、高效地根据资源锁放指令对资源进行锁放处理。In this way, the resource lock instruction processing device can quickly and efficiently perform lock processing on the resource according to the resource lock instruction.
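The flow of Examples 1-4 and 2-4 can be mimicked with a small Python model; the resource table and its locked flag are assumptions used only to make the lock/release effect visible.

```python
# Toy resource table: r1 starts idle, r2 starts locked.
resources = {"r1": {"locked": False}, "r2": {"locked": True}}

def execute(instruction):
    # PV0 locks the named resource, PV1 releases it.
    opcode, resource_id = instruction.split()
    resources[resource_id]["locked"] = (opcode == "PV0")

execute("PV0 r1")   # Example 1-4: r1 becomes locked and cannot be assigned tasks
execute("PV1 r2")   # Example 2-4: r2 becomes idle and can be assigned tasks
print(resources)    # {'r1': {'locked': True}, 'r2': {'locked': False}}
```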
图4-4示出根据本公开一实施例的资源锁放指令处理方法的流程图。如图4-4所示,该方法应用于上述资源锁放指令处理装置,该方法包括步骤S51-4和步骤S52-4。4-4 shows a flowchart of a resource lock instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, this method is applied to the above resource lock instruction processing device. The method includes step S51-4 and step S52-4.
在步骤S51-4中,对接收到的资源锁放指令进行解析,获得资源锁放指令的操作码和操作域,并根据操作码和操作域确定资源锁放指令所指示的待处理资源,以及确定进行资源锁放处理所需的锁放策略。其中,操作码用于指示资源锁放指令对资源所进行的处理为锁定或释放处理,操作域包括待处理资源标识。In step S51-4, the received resource lock instruction is parsed to obtain the operation code and operation domain of the resource lock instruction, and the resource to be processed indicated by the resource lock instruction is determined according to the operation code and operation domain, and Determine the lock strategy required for resource lock processing. The operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.
在步骤S52-4中,根据锁放策略,对待处理资源进行锁定或释放处理,得到处理后的资源。In step S52-4, according to the lock and release strategy, the resource to be processed is locked or released to obtain the processed resource.
在一种可能的实现方式中,操作域还可以用于指示锁放策略。In a possible implementation, the operation domain can also be used to indicate the lock and release strategy.
在一种可能的实现方式中,操作码还可以用于指示锁放策略。In a possible implementation, the operation code can also be used to indicate the lock and release strategy.
In a possible implementation, the lock/release strategy may include at least one of locking the resource to be processed and releasing the resource to be processed. A resource to be processed cannot be assigned tasks after it has been locked, and can be assigned tasks after it has been released.
在一种可能的实现方式中,待处理资源可以包括IPU资源、GPU资源、CPU资源和访存资源中的至少一种。In a possible implementation manner, the resources to be processed may include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
在一种可能的实现方式中,该方法还可以包括:存储待处理资源标识。In a possible implementation manner, the method may further include: storing the identifier of the resource to be processed.
在一种可能的实现方式中,对接收到的资源锁放指令进行解析,获得资源锁放指令的操作码和操作域,可以包括:In a possible implementation manner, parsing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction may include:
存储资源锁放指令;Storage resource lock instruction;
对资源锁放指令进行解析,得到资源锁放指令的操作码和操作域;Analyze the resource lock instruction to obtain the operation code and operation domain of the resource lock instruction;
存储指令队列,指令队列包括按照执行顺序依次排列的多个待执行指令,多个待执行指令可以包括资源锁放指令。An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include resource lock and release instructions.
在一种可能的实现方式中,该方法还可以包括:In a possible implementation manner, the method may further include:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that the zeroth instruction to be executed has been executed,
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。A first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
需要说明的是,尽管以上述实施例作为示例介绍了资源锁放指令处理方法如上,但本领域技术人员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各步骤,只要符合本公开的技术方案即可。It should be noted that although the above embodiment is taken as an example to introduce the resource lock instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
本公开实施例所提供的资源锁放指令处理方法的适用范围广,根据资源锁放指令对资源进行锁定和释放的处理效率高、处理速度快。The processing method of the resource lock and release instruction provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of locking and releasing resources according to the resource lock and release instruction is high and the processing speed is fast.
依据以下条款可更好地理解前述内容:The foregoing can be better understood based on the following terms:
条款C1、一种资源锁放指令处理装置,所述装置包括:Clause C1, a resource lock instruction processing device, the device includes:
a control module, configured to parse a received resource lock/release instruction, obtain an operation code and an operation domain of the resource lock/release instruction, determine, according to the operation code and the operation domain, the resource to be processed indicated by the resource lock/release instruction, and determine the lock/release strategy required for the resource lock/release processing;
处理模块,用于根据所述锁放策略,对所述待处理资源进行锁定或释放处理,得到处理后的资源,A processing module, configured to lock or release the resource to be processed according to the lock and release strategy to obtain the processed resource,
其中,所述操作码用于指示所述资源锁放指令对资源所进行的处理为锁定或释放处理,所述操作域包括所述待处理资源标识。Wherein, the operation code is used to indicate that the resource lock instruction performs processing on the resource as locking or releasing processing, and the operation domain includes the resource identifier to be processed.
条款C2、根据条款C1所述的装置,所述操作域还用于指示锁放策略。Clause C2. The device according to Clause C1, the operation domain is also used to indicate a lock and release strategy.
条款C3、根据条款C1所述的装置,所述操作码还用于指示所述锁放策略。Clause C3. The device according to Clause C1, the operation code is further used to indicate the lock and release strategy.
条款C4、根据条款C1所述的装置,所述锁放策略包括锁定所述待处理资源和释放所述待处理资源的至少一种,Clause C4. The device according to Clause C1, the lock and put strategy includes at least one of locking the resource to be processed and releasing the resource to be processed,
其中,所述待处理资源被锁定后不能被分配任务,所述待处理资源被释放后能够被分配任务。Wherein, the resources to be processed cannot be assigned tasks after being locked, and the resources to be processed can be assigned tasks after being released.
条款C5、根据条款C1所述的装置,所述待处理资源包括IPU资源、GPU资源、CPU资源和访存资源中的至少一种。Clause C5. The apparatus according to Clause C1, the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
条款C6、根据条款C1所述的装置,Clause C6. The device according to Clause C1,
所述装置还包括:存储模块,用于存储所述待处理资源标识,The device also includes a storage module for storing the to-be-processed resource identifier,
其中,所述控制模块,包括:Wherein, the control module includes:
指令存储子模块,用于存储所述资源锁放指令;An instruction storage submodule, used to store the resource lock instruction;
指令处理子模块,用于对所述资源锁放指令进行解析,得到所述资源锁放指令的操作码和操作域;An instruction processing submodule, used for parsing the resource lock instruction, and obtaining an operation code and an operation domain of the resource lock instruction;
队列存储子模块,用于存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述资源锁放指令,A queue storage sub-module, which is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the resource lock and release instruction
其中,所述控制模块,还包括:Wherein, the control module also includes:
a dependency processing sub-module, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and after the zeroth instruction to be executed has been executed, extract the first instruction to be executed from the instruction storage sub-module and send it to the processing module,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
条款C7、一种机器学习运算装置,所述装置包括:Clause C7. A machine learning computing device, the device comprising:
one or more resource lock/release instruction processing devices according to any one of Clause C1 to Clause C6, configured to obtain data to be operated on and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to other processing devices through an I/O interface;
当所述机器学习运算装置包含多个所述资源锁放指令处理装置时,所述多个所述资源锁放指令处 理装置间可以通过特定的结构进行连接并传输数据;When the machine learning operation device includes a plurality of the resource lock instruction processing devices, the plurality of resource lock instruction processing devices may be connected and transmit data through a specific structure;
wherein, the plurality of resource lock/release instruction processing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the plurality of resource lock/release instruction processing devices share the same control system or have their own control systems; the plurality of resource lock/release instruction processing devices share memory or have their own memories; and the plurality of resource lock/release instruction processing devices are interconnected in an arbitrary interconnection topology.
条款C8、一种组合处理装置,所述组合处理装置包括:Clause C8. A combined processing device, the combined processing device comprising:
如条款C7所述的机器学习运算装置、通用互联接口和其他处理装置;Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause C7;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作,The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
其中,所述组合处理装置还包括:存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
条款C9、一种机器学习芯片,所述机器学习芯片包括:Clause C9. A machine learning chip, the machine learning chip includes:
如条款C7所述的机器学习运算装置或如条款C8所述的组合处理装置。The machine learning arithmetic device according to clause C7 or the combined processing device according to clause C8.
条款C10、一种电子设备,所述电子设备包括:Clause C10. An electronic device, the electronic device comprising:
如条款C9所述的机器学习芯片。Machine learning chip as described in clause C9.
条款C11、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款C9所述的机器学习芯片;Clause C11, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause C9;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used to store data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款C12、一种资源锁放指令处理方法,所述方法应用于资源锁放指令处理装置,所述方法包括:Clause C12. A method for processing a resource lock instruction. The method is applied to a device for processing a resource lock instruction. The method includes:
对接收到的资源锁放指令进行解析,获得所述资源锁放指令的操作码和操作域,并根据所述操作码和所述操作域确定所述资源锁放指令所指示的待处理资源,以及确定进行资源锁放处理所需的锁放策略;Parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and the operation domain, And determine the lock strategy required for resource lock processing;
根据所述锁放策略,对所述待处理资源进行锁定或释放处理,得到处理后的资源,Lock or release the resources to be processed according to the lock and put strategy to obtain the processed resources,
其中,所述操作码用于指示所述资源锁放指令对资源所进行的处理为锁定或释放处理,所述操作域包括所述待处理资源标识。Wherein, the operation code is used to indicate that the resource lock instruction performs processing on the resource as locking or releasing processing, and the operation domain includes the resource identifier to be processed.
条款C13、根据条款C12所述的方法,所述操作域还用于指示锁放策略。Clause C13. According to the method of Clause C12, the operation field is also used to indicate a lock and release strategy.
条款C14、根据条款C12所述的方法,所述操作码还用于指示所述锁放策略。Clause C14. The method according to Clause C12, the operation code is also used to indicate the lock-and-release strategy.
Clause C15. The method according to Clause C12, wherein the lock/release strategy includes at least one of locking the resource to be processed and releasing the resource to be processed,
其中,所述待处理资源被锁定后不能被分配任务,所述待处理资源被释放后能够被分配任务。Wherein, the resources to be processed cannot be assigned tasks after being locked, and the resources to be processed can be assigned tasks after being released.
条款C16、根据条款C12所述的方法,所述待处理资源包括IPU资源、GPU资源、CPU资源和访存资源中的至少一种。Clause C16. The method according to Clause C12, the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.
条款C17、根据条款C12所述的方法,Clause C17, according to the method described in Clause C12,
所述方法还包括:存储所述待处理资源标识,The method further includes: storing the resource identifier to be processed,
其中,对接收到的资源锁放指令进行解析,获得所述资源锁放指令的操作码和操作域,包括:Wherein, analyzing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction includes:
存储所述资源锁放指令;Store the resource lock instruction;
对所述资源锁放指令进行解析,得到所述资源锁放指令的操作码和操作域;Parse the resource lock instruction to obtain the operation code and operation domain of the resource lock instruction;
存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述资源锁放指令,Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the resource lock and release instruction,
其中,所述方法还包括:Wherein, the method further includes:
when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed that precedes the first instruction to be executed, caching the first instruction to be executed, and controlling execution of the first instruction to be executed after it is determined that the zeroth instruction to be executed has been executed,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
As neural network algorithms are used more and more widely in fields such as image recognition, speech recognition and natural language processing, their complexity keeps increasing, and the types and amount of data operations involved keep growing. A tensor is a common data form in neural network algorithms and is composed of numbers and/or characters. Because tensors can have different numbers of dimensions, they meet the representation needs of various kinds of data in neural network algorithms: for example, a scalar can be represented by a 0-dimensional tensor, a vector by a 1-dimensional tensor, a matrix by a 2-dimensional tensor, a time series by a 3-dimensional tensor, an image by a 4-dimensional tensor, a video by a 5-dimensional tensor, and so on. Processing tensors in a neural network algorithm includes rearranging them. In the related art, multiple instructions are required to rearrange tensor data, which is inefficient and slow.
图5-1示出根据本公开一实施例的张量重排指令处理装置的框图。如图5-1所示,该装置包括控制模块11-5和处理模块12-5。FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-1, the device includes a control module 11-5 and a processing module 12-5.
控制模块11-5,用于对接收到的张量重排指令进行解析,获得张量重排指令的操作码和操作域,并根据操作码和操作域确定执行张量重排指令所需的待处理张量和目标地址,以及确定进行重排处理所需的重排策略。其中,操作码用于指示张量重排指令对张量数据所进行的处理为重排处理,操作域包括待处理张量地址和目标地址。The control module 11-5 is used to parse the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the required tensor rearrangement instruction according to the operation code and operation domain The tensor and target address to be processed, and the rearrangement strategy required for the rearrangement process. Among them, the operation code is used to instruct the processing performed by the tensor rearrangement instruction on the tensor data to be rearrangement processing, and the operation domain includes the to-be-processed tensor address and the target address.
处理模块12-5,用于根据重排策略对待处理张量进行重排处理,得到重排张量,并将重排张量存入目标地址中。The processing module 12-5 is configured to perform rearrangement processing on the tensor to be processed according to the rearrangement strategy to obtain the rearrangement tensor, and store the rearrangement tensor into the target address.
In this embodiment, a tensor can organize data in many forms; the most common tensors are in matrix form, and tensors can be of different orders. For example, a scalar can be regarded as a 0-dimensional tensor, a vector as a 1-dimensional tensor, and tensors with two or more dimensions are two-dimensional or multi-dimensional matrices. Tensor rearrangement refers to rearranging a tensor to obtain a rearranged tensor. The rearrangement may give priority to a single dimension or to several dimensions. Taking a 2-dimensional tensor as an example, the rearrangement may include one or more of rearrangement by row, rearrangement by column and rearrangement by block, where rearrangement by row means reading in and/or writing out the data of the tensor in a row-first manner, rearrangement by column means reading in and/or writing out the data in a column-first manner, and rearrangement by block means reading in and/or writing out the data in a block-first manner. The way a tensor is rearranged can be defined by a rearrangement strategy. The rearrangement strategy may indicate the parameters of the rearrangement, including whether the tensor is read in row-first, column-first or block-first, whether it is written out row-first, column-first or block-first, and, when the input or output is performed by blocks or by two or more dimensions, the size of the blocks or of those dimensions.
在本实施例中,可以为不同的重排策略设置不同的代码,以区别不同的重排策略。本领域技术人员可以根据实际需要对重排策略及重排策略的代码进行设置,本公开对此不作限制。In this embodiment, different codes can be set for different rearrangement strategies to distinguish different rearrangement strategies. A person skilled in the art can set the rearrangement strategy and the code of the rearrangement strategy according to actual needs, which is not limited in the present disclosure.
In this embodiment, the control module can obtain the tensor to be processed from the to-be-processed tensor address. The to-be-processed tensor address may be a physical address, such as the first address at which the tensor to be processed is stored, or a logical address or a linear address. The control module can store the rearranged tensor at the target address. The target address may be a physical address, such as the first address at which the rearranged tensor is stored, or a logical address or a linear address. The present disclosure does not limit the way in which the to-be-processed tensor address and the target address are represented. The control module may obtain the tensor rearrangement instruction and the tensor to be processed through a data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.
在本实施例中,对于一个张量重排指令可以包括操作码和操作域。其中操作码可以是预先配置的指令序列号,用来告知执行指令的装置具体需要执行哪一条指令。而操作域可以包括执行对应的指令所需的所有数据的来源,执行对应的指令所需的所有数据包括待处理张量、对应的重排策略,或者存储待处理张量、对应的重排策略的地址等等。比如,操作域可以包括待处理张量地址和目标地址。In this embodiment, for a tensor rearrangement instruction, an operation code and an operation field may be included. The operation code may be a pre-configured instruction sequence number, which is used to inform the device executing the instruction which instruction needs to be executed. The operation domain may include the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the tensor to be processed and the corresponding rearrangement strategy, or to store the tensor to be processed and the corresponding rearrangement strategy. Address and so on. For example, the operation domain may include a tensor address to be processed and a target address.
应当理解的是,本领域技术人员可以根据需要对张量重排指令的指令格式以及所包含的操作码和操作域进行设置,本公开对此不作限制。It should be understood that those skilled in the art can set the instruction format of the tensor rearrangement instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.
在本实施例中,该装置可以包括一个或多个控制模块,以及一个或多个处理模块,可以根据实际需要对控制模块和处理模块的数量进行设置,本公开对此不作限制。在装置包括一个控制模块时,该控制模块可以接收张量重排指令,并控制一个或多个处理模块进行重排处理。在装置包括多个控制模块时,多个控制模块可以分别接收张量重排指令,并控制对应的一个或多个处理模块进行重排处理。In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a tensor rearrangement instruction and control one or more processing modules to perform rearrangement processing. When the device includes multiple control modules, the multiple control modules may respectively receive tensor rearrangement instructions and control the corresponding one or more processing modules to perform rearrangement processing.
The tensor rearrangement instruction processing device provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to parse a received tensor rearrangement instruction, obtain an operation code and an operation domain of the instruction, determine, according to the operation code and the operation domain, the tensor to be processed and the target address required for executing the instruction, and determine the rearrangement strategy required for the rearrangement processing. The processing module is configured to rearrange the tensor to be processed according to the rearrangement strategy to obtain the rearranged tensor, and store the rearranged tensor at the target address. The rearrangement of tensor data can thus be achieved with a single tensor rearrangement instruction. Compared with the related art, in which multiple instructions are needed to rearrange tensor data, this rearranges tensor data with high processing efficiency and high processing speed, and has a wide application range.
In a possible implementation, the operation domain may further include at least one of the input shape of the tensor to be processed and the output shape of the rearranged tensor. The processing module 12-5 is further configured to rearrange the tensor to be processed according to at least one of the input shape and the output shape, together with the rearrangement strategy, to obtain the rearranged tensor.
In a possible implementation, the operation domain may further include the shape of the tensor to be processed and/or the shape of the rearranged tensor. The "shape" of a tensor can be expressed by the number of its dimensions and the number of numbers and/or characters in each dimension. For example, the shape of the tensor to be processed may represent its dimensions and the number of numbers and/or characters in each dimension, and the shape of the rearranged tensor may represent the dimensions of the rearranged tensor and the number of numbers and/or characters in each dimension.
For example, assume a to-be-processed tensor [(1,2),(3,4),(5,6),(7,8)] whose shape is (2,4), that is, the to-be-processed tensor is a two-dimensional tensor with 2 rows and 4 columns.

If the rearrangement strategy is row-first input and column-first output, and the output shape is (4,2), the rearrangement of the to-be-processed tensor may be: reading it in row-first order gives [1,3,5,7,2,4,6,8], and writing this out in column-first order gives the rearranged tensor [(1,3,5,7),(2,4,6,8)]. The shape of this rearranged tensor is (4,2), that is, the rearranged tensor is a two-dimensional tensor with 4 rows and 2 columns.

If the rearrangement strategy is column-first input and column-first output, and the output shape is (2,4), the rearrangement of the to-be-processed tensor may be: reading it in column-first order gives [1,2,3,4,5,6,7,8], and writing this out in column-first order gives the rearranged tensor [(1,2,3,4),(5,6,7,8)]. The shape of this rearranged tensor is (2,4), that is, the rearranged tensor is a two-dimensional tensor with 2 rows and 4 columns.

If the rearrangement strategy is row-first input and block-first output (assuming the block size is (2,2) and, when outputting by block, the blocks are written row by row), and the output shape is (2,4), the rearrangement of the to-be-processed tensor may be: reading it in row-first order gives [1,3,5,7,2,4,6,8], and writing this out block by block gives the rearranged tensor [(1,5,2,6),(3,7,4,8)]. The shape of this rearranged tensor is (2,4), that is, the rearranged tensor is a two-dimensional tensor with 2 rows and 4 columns.
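The row-first and column-first orders above correspond to row-major and column-major traversal of the tensor. The following Python sketch is purely illustrative; NumPy is used only as a convenient software model and is not part of the disclosed device. It reproduces the first example, i.e. row-first input and column-first output with output shape (4, 2):

```python
import numpy as np

def rearrange(tensor, in_order, out_order, out_shape):
    """Read `tensor` in `in_order` ('C' = row-first, 'F' = column-first),
    then write the elements back out with `out_shape` in `out_order`."""
    flat = np.asarray(tensor).flatten(order=in_order)
    return flat.reshape(out_shape, order=out_order)

# The 2x4 to-be-processed tensor whose rows are (1,3,5,7) and (2,4,6,8).
t = np.array([[1, 3, 5, 7],
              [2, 4, 6, 8]])

# Row-first input, column-first output, output shape (4, 2).
print(rearrange(t, 'C', 'F', (4, 2)))
# [[1 2]
#  [3 4]
#  [5 6]
#  [7 8]]   -> columns (1,3,5,7) and (2,4,6,8), a 4-row, 2-column tensor
```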
In a possible implementation, a default input shape of the to-be-processed tensor may be preset. When the operation domain does not contain the input shape of the to-be-processed tensor, the default input shape may be determined as the input shape of the to-be-processed tensor of the current tensor rearrangement instruction.

In a possible implementation, a default output shape of the rearranged tensor may be preset. When the operation domain does not contain the output shape of the rearranged tensor, the default output shape may be determined as the output shape of the rearranged tensor of the current tensor rearrangement instruction.
In a possible implementation, the dimensionality of the to-be-processed tensor and that of the rearranged tensor may be different.

In this implementation, the dimensionality of the to-be-processed tensor and that of the rearranged tensor may also be the same. The dimensionality of the to-be-processed tensor and of the rearranged tensor may be set according to actual needs, which is not limited in the present disclosure.
For example, assume a to-be-processed tensor with input shape (2,8) as follows:

[(1,9),(2,10),(3,11),(4,12),(5,13),(6,14),(7,15),(8,16)]

Assuming the output shape is (2,2,4) and the rearrangement strategy is column-first input with the output written along the three dimensions in order of priority, the rearrangement of the to-be-processed tensor may be: reading it in column-first order gives [1,9,2,10,3,11,4,12,5,13,6,14,7,15,8,16], and writing this out along the three dimensions in order of priority gives the rearranged tensor [[(1,2,3,4),(5,6,7,8)],[(9,10,11,12),(13,14,15,16)]].
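As a generic illustration of a rearrangement that changes the number of dimensions (not a reproduction of the specific numbers above, whose output ordering depends on the chosen rearrangement strategy), the following hypothetical sketch again uses NumPy only as a software model:

```python
import numpy as np

# A (2, 8) to-be-processed tensor.
t = np.arange(1, 17).reshape(2, 8)

# Column-first read of the two-dimensional input, then a (2, 2, 4) output:
# the rearranged tensor has three dimensions while the input has two.
rearranged = t.flatten(order='F').reshape((2, 2, 4), order='C')
print(rearranged.shape)  # (2, 2, 4)
```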
In a possible implementation, the operation domain may also be used to indicate the rearrangement strategy.

In a possible implementation, the operation code may also be used to indicate the rearrangement strategy.

In a possible implementation, a default rearrangement strategy may also be set. When the rearrangement strategy of the current tensor rearrangement instruction can be determined from neither the operation domain nor the operation code, the default rearrangement strategy may be determined as the rearrangement strategy of the current tensor rearrangement instruction.
Fig. 5-2 shows a block diagram of a tensor rearrangement instruction processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in Fig. 5-2, the device may further include a storage module 13-5. The storage module 13-5 is configured to store the to-be-processed tensor.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratchpad cache. The to-be-processed tensor may be stored in the memory, the cache, and/or the register of the storage module as needed, which is not limited in the present disclosure.

In a possible implementation, the device may further include a direct memory access module configured to read data from or store data into the storage module.
In a possible implementation, as shown in Fig. 5-2, the control module 11-5 may include an instruction storage submodule 111-5, an instruction processing submodule 112-5, and a queue storage submodule 113-5.

The instruction storage submodule 111-5 is configured to store the tensor rearrangement instruction.

The instruction processing submodule 112-5 is configured to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction.

The queue storage submodule 113-5 is configured to store an instruction queue. The instruction queue includes multiple to-be-executed instructions arranged in execution order, and the multiple to-be-executed instructions may include the tensor rearrangement instruction as well as other computation instructions related to it.

In this implementation, the execution order of the multiple to-be-executed instructions may be arranged according to their reception time, priority level, and the like to obtain the instruction queue, so that the multiple to-be-executed instructions are executed in sequence according to the instruction queue.
In a possible implementation, as shown in Fig. 5-2, the control module 11-5 may further include a dependency processing submodule 114-5.

When it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, the dependency processing submodule 114-5 may cache the first to-be-executed instruction in the instruction storage submodule 111-5 and, after the zeroth to-be-executed instruction has finished executing, extract the first to-be-executed instruction from the instruction storage submodule 111-5 and send it to the processing module 12-5. The first to-be-executed instruction and the zeroth to-be-executed instruction are both instructions among the multiple to-be-executed instructions.

The first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes: the first storage address interval storing the data required by the first to-be-executed instruction overlaps the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction. Conversely, the absence of a dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction may mean that the first storage address interval and the zeroth storage address interval have no overlapping region.

In this way, according to the dependency relationships between the to-be-executed instructions, a later to-be-executed instruction is executed only after the earlier to-be-executed instruction has finished executing, which ensures the accuracy of the computation result.
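As a concrete illustration of the overlap check described above, the following Python sketch (a minimal, hypothetical helper, not the disclosed hardware) decides whether two instructions depend on each other by comparing the storage address intervals of the data they access:

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """Return True if [start_a, end_a) and [start_b, end_b) share any address."""
    return start_a < end_b and start_b < end_a

def has_dependency(first_instr, zeroth_instr):
    """A later (first) instruction depends on an earlier (zeroth) instruction when
    the address interval of the data it needs overlaps that of the earlier one;
    in that case it must be cached until the earlier instruction completes."""
    return intervals_overlap(first_instr["addr_start"], first_instr["addr_end"],
                             zeroth_instr["addr_start"], zeroth_instr["addr_end"])

# Example: the first instruction accesses [200, 300), the zeroth accesses [250, 400).
first = {"addr_start": 200, "addr_end": 300}
zeroth = {"addr_start": 250, "addr_end": 400}
print(has_dependency(first, zeroth))  # True -> hold the first instruction back
```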
In a possible implementation, the instruction format of the tensor rearrangement instruction may be:

Tiling dst src type src_shape dst_shape

where Tiling is the operation code and dst, src, type, src_shape and dst_shape form the operation domain. Tiling indicates that the instruction is a tensor rearrangement instruction, dst is the target address, src is the address of the to-be-processed tensor, type is the rearrangement strategy, src_shape is the input shape, and dst_shape is the output shape.

In a possible implementation, the instruction format of the tensor rearrangement instruction may also be:

Tiling.type dst src src_shape dst_shape

where Tiling.type is the operation code and dst, src, src_shape and dst_shape form the operation domain. The Tiling part of Tiling.type indicates that the instruction is a tensor rearrangement instruction, and the type part of Tiling.type is the rearrangement strategy. dst is the target address, src is the address of the to-be-processed tensor, src_shape is the input shape, and dst_shape is the output shape.
It should be understood that those skilled in the art can set the operation code of the tensor rearrangement instruction and the positions of the operation code and operation domain within the instruction format as needed, which is not limited in the present disclosure.
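For illustration only, the following Python sketch (a hypothetical software model, not the disclosed hardware decoder) parses a textual tensor rearrangement instruction in either of the two formats above into its operation code and operation domain:

```python
def parse_tiling_instruction(text):
    """Parse 'Tiling dst src type src_shape dst_shape' or
    'Tiling.type dst src src_shape dst_shape' into opcode and operands."""
    fields = text.split()
    opcode = fields[0]
    if "." in opcode:                       # format: Tiling.type dst src src_shape dst_shape
        _, strategy = opcode.split(".", 1)
        dst, src, src_shape, dst_shape = fields[1:5]
    else:                                   # format: Tiling dst src type src_shape dst_shape
        dst, src, strategy, src_shape, dst_shape = fields[1:6]
    return {"opcode": "Tiling", "dst": dst, "src": src,
            "strategy": strategy, "src_shape": src_shape, "dst_shape": dst_shape}

# Corresponds to instruction 1 in the application example below.
print(parse_tiling_instruction("Tiling 200 100 type S1 S2"))
```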
In a possible implementation, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
It should be noted that, although the tensor rearrangement instruction processing device is introduced above by taking the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user can set each module flexibly according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
Application example
In the following, "using the tensor rearrangement instruction processing device to rearrange a to-be-processed tensor" is taken as an exemplary application scenario, and an application example according to an embodiment of the present disclosure is given to facilitate understanding of the flow of the tensor rearrangement instruction processing device. Those skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be regarded as limiting the embodiments of the present disclosure.
Fig. 5-3 shows a schematic diagram of an application scenario of the tensor rearrangement instruction processing device according to an embodiment of the present disclosure. As shown in Fig. 5-3, the tensor rearrangement instruction processing device processes a tensor rearrangement instruction as follows.
Example 1-5
When the control module 11-5 receives tensor rearrangement instruction 1 (for example: Tiling 200 100 type S1 S2), it parses tensor rearrangement instruction 1 to obtain its operation code and operation domain. The operation code of tensor rearrangement instruction 1 is Tiling, and from the operation domain it can be determined that: the rearrangement strategy is type, the address of the to-be-processed tensor is 100, the input shape is S1, the target address is 200, and the output shape is S2. The control module 11-5 then obtains the to-be-processed tensor a with input shape S1 from the to-be-processed tensor address 100.

The processing module 12-5 rearranges the to-be-processed tensor a according to the rearrangement strategy type, the input shape, and the output shape to obtain the rearranged tensor b, and stores the rearranged tensor b into the target address 200.

Besides the above Tiling 200 100 type S1 S2, tensor rearrangement instruction 1 may also take the form Tiling.type 200 100 S1 S2. The processing of tensor rearrangement instructions in different instruction formats is similar and is not repeated here.

For details of the above processing, refer to the relevant description above.

In this way, the tensor rearrangement instruction processing device can process tensor rearrangement instructions quickly and efficiently, completing the process of rearranging a tensor.
Fig. 5-4 shows a flowchart of a tensor rearrangement instruction processing method according to an embodiment of the present disclosure. As shown in Fig. 5-4, the method is applied to the above tensor rearrangement instruction processing device, and the method includes step S51-5 and step S52-5.

In step S51-5, the received tensor rearrangement instruction is parsed to obtain the operation code and operation domain of the tensor rearrangement instruction; the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction are determined according to the operation code and operation domain; and the rearrangement strategy required for the rearrangement processing is determined. The operation code is used to indicate that the processing performed on tensor data by the tensor rearrangement instruction is rearrangement processing, and the operation domain includes the address of the to-be-processed tensor and the target address.

In step S52-5, the to-be-processed tensor is rearranged according to the rearrangement strategy to obtain the rearranged tensor, and the rearranged tensor is stored into the target address.
In a possible implementation, the operation domain may further include at least one of the input shape of the to-be-processed tensor and the output shape of the rearranged tensor. Rearranging the to-be-processed tensor according to the rearrangement strategy to obtain the rearranged tensor may include: rearranging the to-be-processed tensor according to at least one of the input shape and the output shape, together with the rearrangement strategy, to obtain the rearranged tensor.

In a possible implementation, the dimensionality of the to-be-processed tensor and that of the rearranged tensor may be different.

In a possible implementation, the operation domain may also be used to indicate the rearrangement strategy.

In a possible implementation, the operation code may also be used to indicate the rearrangement strategy.

In a possible implementation, the method may further include: storing the to-be-processed tensor.
In a possible implementation, parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction may include:

storing the tensor rearrangement instruction;

parsing the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction; and

storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions possibly including the tensor rearrangement instruction.
In a possible implementation, the method may further include:

when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, caching the first to-be-executed instruction and, after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,

wherein the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes: the first storage address interval storing the data required by the first to-be-executed instruction overlaps the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
It should be noted that, although the tensor rearrangement instruction processing method is introduced above by taking the above embodiment as an example, those skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the user can set each step flexibly according to personal preference and/or the actual application scenario, as long as it conforms to the technical solution of the present disclosure.
With the tensor rearrangement instruction processing method provided by the embodiments of the present disclosure, the rearrangement of tensor data can be achieved with a single tensor rearrangement instruction. Compared with the related art, in which the rearrangement of tensor data is implemented by multiple instructions, rearranging tensor data in this way is more efficient and faster, and has a wide range of applications.
The foregoing can be better understood in accordance with the following clauses:
Clause D1. A tensor rearrangement instruction processing device, the device comprising:

a control module, configured to parse a received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, determine, according to the operation code and the operation domain, the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction, and determine the rearrangement strategy required for the rearrangement processing; and

a processing module, configured to rearrange the to-be-processed tensor according to the rearrangement strategy to obtain a rearranged tensor, and to store the rearranged tensor into the target address,

wherein the operation code is used to indicate that the processing performed on tensor data by the tensor rearrangement instruction is rearrangement processing, and the operation domain includes the address of the to-be-processed tensor and the target address.
Clause D2. The device according to Clause D1, wherein the operation domain further includes at least one of the input shape of the to-be-processed tensor and the output shape of the rearranged tensor, and

the processing module is further configured to rearrange the to-be-processed tensor according to at least one of the input shape and the output shape, together with the rearrangement strategy, to obtain the rearranged tensor.
Clause D3. The device according to Clause D1, wherein the dimensionality of the to-be-processed tensor is different from that of the rearranged tensor.

Clause D4. The device according to Clause D1, wherein the operation domain is further used to indicate the rearrangement strategy.

Clause D5. The device according to Clause D1, wherein the operation code is further used to indicate the rearrangement strategy.
Clause D6. The device according to Clause D1,

wherein the device further comprises: a storage module, configured to store the to-be-processed tensor,

wherein the control module comprises:

an instruction storage submodule, configured to store the tensor rearrangement instruction;

an instruction processing submodule, configured to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction; and

a queue storage submodule, configured to store an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the tensor rearrangement instruction,

and wherein the control module further comprises:

a dependency processing submodule, configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule, and, after the zeroth to-be-executed instruction has finished executing, extract the first to-be-executed instruction from the instruction storage submodule and send it to the processing module,

wherein the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes:

the first storage address interval storing the data required by the first to-be-executed instruction overlapping the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Clause D7. A machine learning operation device, the device comprising:

one or more tensor rearrangement instruction processing devices according to any one of Clauses D1 to D5, configured to obtain to-be-processed tensors and control information from other processing devices, perform specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;

when the machine learning operation device includes multiple tensor rearrangement instruction processing devices, the multiple tensor rearrangement instruction processing devices may be connected to and transmit data with one another through a specific structure;

wherein the multiple tensor rearrangement instruction processing devices are interconnected and transmit data through a peripheral component interconnect express (PCIE) bus to support larger-scale machine learning operations; the multiple tensor rearrangement instruction processing devices share the same control system or have their own control systems; the multiple tensor rearrangement instruction processing devices share memory or have their own memories; and the interconnection of the multiple tensor rearrangement instruction processing devices may follow any interconnection topology.
Clause D8. A combined processing device, the combined processing device comprising:

the machine learning operation device according to Clause D7, a universal interconnection interface, and another processing device;

the machine learning operation device interacting with the other processing device to jointly complete a computation operation specified by a user,

wherein the combined processing device further comprises: a storage device, connected to the machine learning operation device and the other processing device, respectively, and configured to store data of the machine learning operation device and the other processing device.
Clause D9. A machine learning chip, the machine learning chip comprising:

the machine learning operation device according to Clause D7 or the combined processing device according to Clause D8.

Clause D10. An electronic device, the electronic device comprising:

the machine learning chip according to Clause D9.
Clause D11. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to Clause D9;

wherein the machine learning chip is connected to the storage device, the control device, and the interface device, respectively;

the storage device is configured to store data;

the interface device is configured to implement data transmission between the machine learning chip and an external device; and

the control device is configured to monitor the state of the machine learning chip.
Clause D12. A tensor rearrangement instruction processing method, the method being applied to a tensor rearrangement instruction processing device, the method comprising:

parsing a received tensor rearrangement instruction, obtaining the operation code and operation domain of the tensor rearrangement instruction, determining, according to the operation code and the operation domain, the to-be-processed tensor and the target address required to execute the tensor rearrangement instruction, and determining the rearrangement strategy required for the rearrangement processing; and

rearranging the to-be-processed tensor according to the rearrangement strategy to obtain a rearranged tensor, and storing the rearranged tensor into the target address,

wherein the operation code is used to indicate that the processing performed on tensor data by the tensor rearrangement instruction is rearrangement processing, and the operation domain includes the address of the to-be-processed tensor and the target address.
Clause D13. The method according to Clause D12, wherein the operation domain further includes at least one of the input shape of the to-be-processed tensor and the output shape of the rearranged tensor,

wherein rearranging the to-be-processed tensor according to the rearrangement strategy to obtain the rearranged tensor comprises:

rearranging the to-be-processed tensor according to at least one of the input shape and the output shape, together with the rearrangement strategy, to obtain the rearranged tensor.

Clause D14. The method according to Clause D13, wherein the dimensionality of the to-be-processed tensor is different from that of the rearranged tensor.

Clause D15. The method according to Clause D12, wherein the operation domain is used to indicate the rearrangement strategy.

Clause D16. The method according to Clause D12, wherein the operation code is further used to indicate the rearrangement strategy.
Clause D17. The method according to Clause D12,

wherein the method further comprises: storing the to-be-processed tensor,

wherein parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction comprises:

storing the tensor rearrangement instruction;

parsing the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction; and

storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the tensor rearrangement instruction,

and wherein the method further comprises:

when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, caching the first to-be-executed instruction and, after determining that the zeroth to-be-executed instruction has finished executing, controlling execution of the first to-be-executed instruction,

wherein the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it comprises:

the first storage address interval storing the data required by the first to-be-executed instruction overlapping the zeroth storage address interval storing the data required by the zeroth to-be-executed instruction.
Fig. 6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure. The device is used to perform machine learning computations. As shown in Fig. 6-1, the device includes a control module 11-6 and a processing module 12-6. The processing module 12-6 includes a data transfer submodule 121-6 and an accumulation submodule 122-6.

The control module 11-6 is configured to obtain a computation instruction and obtain the input data required to execute the computation instruction. The data transfer submodule 121-6 is configured to process the input data according to the computation instruction to obtain multiple intermediate results and send the multiple intermediate results to the accumulation submodule 122-6 in sequence. The accumulation submodule 122-6 is configured to perform a cyclic accumulation operation on the multiple intermediate results to obtain the computation result of the computation instruction.
In this embodiment, the cyclic accumulation operation may work as follows: in the current operation cycle, an intermediate result is added to obtain an accumulation result; when an intermediate result is added in a later operation cycle, that intermediate result is added to the accumulation result to obtain a new accumulation result. The "later operation cycle" may be the first, second, third, or another operation cycle after the "current operation cycle"; how many cycles after the current operation cycle the "later operation cycle" falls may be set according to the computing capability of the device and other requirements, which is not limited in the present disclosure.
In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure.

The data processing device provided by the embodiments of the present disclosure includes a control module and a processing module, the processing module including a data transfer submodule and an accumulation submodule. The control module is configured to obtain a computation instruction and obtain the input data required to execute the computation instruction. The data transfer submodule is configured to process the input data according to the computation instruction to obtain multiple intermediate results and send the multiple intermediate results to the accumulation submodule in sequence. The accumulation submodule is configured to perform a cyclic accumulation operation on the multiple intermediate results to obtain the computation result of the computation instruction. By cyclically accumulating multiple intermediate results, the data processing device provided by the embodiments of the present disclosure reduces the amount of data access and computation while keeping the computation accuracy lossless, and can effectively increase the data processing speed.
In a possible implementation, the cyclic accumulation process of the accumulation submodule may be set according to actual needs such as the computing capability of the device. Examples of two cyclic accumulation processes, mode 1 and mode 2, are given below. It should be noted that those skilled in the art can set the cyclic accumulation process according to actual needs, which is not limited in the present disclosure.
In a possible implementation, for mode 1, the accumulation submodule 122-6 performing the cyclic accumulation operation on the multiple intermediate results may include:

in a first operation cycle in which an intermediate result is received, adding the intermediate result to the first intermediate data of the first operation cycle to obtain a first accumulation result;

storing the first accumulation result as the first intermediate data of the next operation cycle; and

in a second operation cycle in which no intermediate result is received, determining the first intermediate data of the second operation cycle as the computation result,

wherein the value of the first intermediate data of the initial operation cycle is zero.
In this implementation, the "first operation cycle in which an intermediate result is received" described in mode 1 may be any operation cycle in which the accumulation submodule receives an intermediate result, and the "second operation cycle in which no intermediate result is received" may be an operation cycle in which the accumulation submodule receives no intermediate result. The "first operation cycle in which an intermediate result is received" describes the process that the accumulation submodule executes repeatedly in a loop, while the "second operation cycle in which no intermediate result is received" is the process in which the accumulation submodule finally determines the computation result. The accumulation submodule may cyclically execute multiple "first operation cycles in which an intermediate result is received" and then execute one "second operation cycle in which no intermediate result is received" to complete the operation on the multiple intermediate results.
For example, assume the multiple intermediate results are 1, 2, and 3, respectively. The accumulation submodule cyclically accumulates the multiple intermediate results in mode 1 as follows. The first, second, and third operation cycles correspond to the "first operation cycle in which an intermediate result is received" in mode 1 above, and the fourth operation cycle corresponds to the "second operation cycle in which no intermediate result is received" in mode 1 above.

In the first operation cycle, the accumulation submodule receives the intermediate result "1" and adds it to the first intermediate data "0" of the first operation cycle to obtain the first accumulation result "0+1" of the first operation cycle. It then stores the first accumulation result "0+1" as the first intermediate data "0+1" of the second operation cycle (that is, the next operation cycle).

In the second operation cycle, the accumulation submodule receives the intermediate result "2" and adds it to the first intermediate data "0+1" of the second operation cycle to obtain the first accumulation result "0+1+2" of the second operation cycle. It then stores the first accumulation result "0+1+2" of the second operation cycle as the first intermediate data "0+1+2" of the third operation cycle (that is, the next operation cycle).

In the third operation cycle, the accumulation submodule receives the intermediate result "3" and adds it to the first intermediate data "0+1+2" of the third operation cycle to obtain the first accumulation result "0+1+2+3" of the third operation cycle. It then stores the first accumulation result "0+1+2+3" of the third operation cycle as the first intermediate data "0+1+2+3" of the fourth operation cycle (that is, the next operation cycle).

In the fourth operation cycle, the accumulation submodule receives no intermediate result and determines the first intermediate data "0+1+2+3" of the fourth operation cycle as the computation result.
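A minimal Python sketch of mode 1 (a software model for illustration only; the disclosed device performs this in hardware) is shown below:

```python
def accumulate_mode1(intermediate_results):
    """Mode 1: keep a single running value (the 'first intermediate data').
    Each cycle that receives an intermediate result adds it to the running
    value; the first cycle with no intermediate result yields the result."""
    first_intermediate_data = 0          # value in the initial operation cycle
    for result in intermediate_results:  # cycles in which an intermediate result arrives
        first_intermediate_data += result
    return first_intermediate_data       # cycle with no intermediate result: final result

print(accumulate_mode1([1, 2, 3]))  # 6, matching 0+1+2+3 in the example above
```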
In a possible implementation, for mode 2, the accumulation submodule 122-6 performing the cyclic accumulation operation on the multiple intermediate results may further include:

in a third operation cycle in which an intermediate result is received, adding the intermediate result to the third intermediate data of the third operation cycle to obtain a second accumulation result;

storing the second intermediate data of the third operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle; and

in a fourth operation cycle in which no intermediate result is received, adding the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle to obtain the computation result,

wherein the values of the second intermediate data and the third intermediate data of the initial operation cycle are zero.
In this implementation, the "third operation cycle in which an intermediate result is received" described in mode 2 may be any operation cycle in which the accumulation submodule receives an intermediate result, and the "fourth operation cycle in which no intermediate result is received" may be an operation cycle in which the accumulation submodule receives no intermediate result. The "third operation cycle in which an intermediate result is received" describes the process that the accumulation submodule executes repeatedly in a loop, while the "fourth operation cycle in which no intermediate result is received" is the process in which the accumulation submodule finally determines the computation result. The accumulation submodule may cyclically execute multiple "third operation cycles in which an intermediate result is received" and then execute one "fourth operation cycle in which no intermediate result is received" to complete the operation on the multiple intermediate results.
For example, assume the multiple intermediate results are 1, 2, 3, and 4, respectively. The accumulation submodule cyclically accumulates the multiple intermediate results in mode 2 as follows. The first, second, third, and fourth operation cycles correspond to the "third operation cycle in which an intermediate result is received" in mode 2 above, and the fifth operation cycle corresponds to the "fourth operation cycle in which no intermediate result is received" in mode 2 above.

In the first operation cycle, the accumulation submodule receives the intermediate result "1" and adds it to the third intermediate data "0" of the first operation cycle to obtain the second accumulation result "0+1" of the first operation cycle. It then stores the second intermediate data "0" of the first operation cycle as the third intermediate data of the second operation cycle (that is, the next operation cycle), and stores the second accumulation result "0+1" of the first operation cycle as the second intermediate data of the second operation cycle (that is, the next operation cycle).

In the second operation cycle, the accumulation submodule receives the intermediate result "2" and adds it to the third intermediate data "0" of the second operation cycle to obtain the second accumulation result "0+2" of the second operation cycle. It then stores the second intermediate data "0+1" of the second operation cycle as the third intermediate data of the third operation cycle (that is, the next operation cycle), and stores the second accumulation result "0+2" of the second operation cycle as the second intermediate data of the third operation cycle (that is, the next operation cycle).

In the third operation cycle, the accumulation submodule receives the intermediate result "3" and adds it to the third intermediate data "0+1" of the third operation cycle to obtain the second accumulation result "0+1+3" of the third operation cycle. It then stores the second intermediate data "0+2" of the third operation cycle as the third intermediate data of the fourth operation cycle (that is, the next operation cycle), and stores the second accumulation result "0+1+3" of the third operation cycle as the second intermediate data of the fourth operation cycle (that is, the next operation cycle).

In the fourth operation cycle, the accumulation submodule receives the intermediate result "4" and adds it to the third intermediate data "0+2" of the fourth operation cycle to obtain the second accumulation result "0+2+4" of the fourth operation cycle. It then stores the second intermediate data "0+1+3" of the fourth operation cycle as the third intermediate data of the fifth operation cycle (that is, the next operation cycle), and stores the second accumulation result "0+2+4" of the fourth operation cycle as the second intermediate data of the fifth operation cycle (that is, the next operation cycle).

In the fifth operation cycle, the accumulation submodule determines that no intermediate result has been received, and adds the second intermediate data "0+2+4" of the fifth operation cycle to the third intermediate data "0+1+3" of the fifth operation cycle to obtain the second accumulation result "0+1+2+3+4" of the fifth operation cycle. This second accumulation result "0+1+2+3+4" of the fifth operation cycle is determined as the computation result.
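A minimal Python sketch of mode 2 (again a software model for illustration only; the disclosed device performs this in hardware) keeps two running values whose roles are swapped every cycle, so that odd-numbered and even-numbered intermediate results accumulate into separate partial sums that are only added together at the end:

```python
def accumulate_mode2(intermediate_results):
    """Mode 2: keep two partial sums (the 'second' and 'third' intermediate data).
    Each cycle adds the incoming intermediate result to the third intermediate
    data and swaps the roles of the two values for the next cycle; the cycle
    with no intermediate result adds the two partial sums together."""
    second_data = 0  # second intermediate data of the initial operation cycle
    third_data = 0   # third intermediate data of the initial operation cycle
    for result in intermediate_results:          # cycles that receive an intermediate result
        second_data, third_data = third_data + result, second_data
    return second_data + third_data              # cycle with no intermediate result

print(accumulate_mode2([1, 2, 3, 4]))  # 10, matching (0+2+4) + (0+1+3) in the example above
```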
In a possible implementation, the machine learning computation may include an artificial neural network operation, the input data may include input neuron data and weight data, and the computation result is output neuron data.

In a possible implementation, the data type of the input data may include at least one of an exponent type and a dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.
The data transfer submodule 121-6 being configured to process the input data according to the computation instruction to obtain multiple intermediate results may include: the data transfer submodule being configured to perform a shift operation on the weight data or the input neuron data according to the computation instruction to obtain an intermediate result.

Exponent-type input data may include exponent bits; the value of the exponent-type input data is obtained by taking a specified value as the base and the data stored in the exponent bits as the exponent. Dynamic fixed-point input data may include decimal-point bits and integer bits; the data stored in the decimal-point bits is used to mark the position of the decimal point of the dynamic fixed-point input data within the data stored in the integer bits, so as to distinguish the integer part from the fractional part of the data in the integer bits. The specified value corresponding to the exponent-type input data is the same as the radix of the input data; for example, if the specified value is 2, the input data needs to be binary data. Only then can the shift operation on the input data be guaranteed.

In this implementation, the input neuron data may be exponent-type data while the weight data is dynamic fixed-point data, or the input neuron data may be dynamic fixed-point data while the weight data is exponent-type data. Those skilled in the art can set the types of the input neuron data and the weight data according to actual needs, which is not limited in the present disclosure.

In this implementation, performing the shift operation on the weight data or the input neuron data according to the computation instruction may be as follows: when it is determined according to the computation instruction that the operation to be performed on the weight data and the input neuron data is a multiplication, the multiplication between the weight data and the input neuron data can be achieved by shifting the input neuron data or the weight data. In the shift operation, the number of shift positions and the shift direction are determined according to whichever of the weight data and the input neuron data is the exponent-type data; the decimal point position of whichever of the weight data and the input neuron data is the dynamic fixed-point data is then moved by that number of positions in that direction, the movement of the decimal point being expressed by changing the value of the data stored in the decimal-point bits, thereby determining the computation result. That is, the value stored in the exponent bits of the exponent-type operand is added to the value stored in the decimal-point bits of the dynamic fixed-point operand to obtain an addition result, and the data stored in the decimal-point bits of the original dynamic fixed-point data is replaced with the addition result, which yields the result of multiplying the weight data by the input neuron data.

In this implementation, the radix of the input data may be binary, decimal, hexadecimal, and so on, which is not limited in the present disclosure.
For example, Fig. 6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure. As shown in Fig. 6-2, an example is given in which the data transfer path operates on exponent-type weight data and dynamic fixed-point input neuron data. Assume the exponent-type weight data is the binary value "00001" (the decimal number corresponding to this weight data is 2^1), and the dynamic fixed-point input neuron data is the binary value "11001000,0100" (the decimal number corresponding to this input neuron data is 12.5), where the first 8 bits are the integer bits and the last 4 bits are the decimal-point bits. The control module obtains the above two input data and the computation instruction. When the processing module determines, according to the computation instruction, that the operation to be performed on the exponent-type weight data "00001" and the dynamic fixed-point input neuron data "11001000,0100" is a multiplication, it can determine from the exponent-type weight data "00001" that the shift operation to be performed on the input neuron data is "move the decimal point position one place to the right". That is, the decimal-point data "0100" is added to the weight data "00001" to obtain the new data "0101" to be stored in the decimal-point bits, and the new data "0101" is stored into the decimal-point bits of the input neuron data, giving the computation result "11001000,0101" of multiplying the exponent-type weight data (binary "00001") by the dynamic fixed-point input neuron data (binary "11001000,0100") (the decimal number corresponding to this computation result is 25). The "," in the dynamic fixed-point input neuron data "11001000,0100" is only used to separate the integer bits from the decimal-point bits and need not be present in actual use; the "," in the dynamic fixed-point input data below has the same meaning and is not explained again.
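The following Python sketch is an illustrative software model of the shift-based multiplication above, with hypothetical helper names; the disclosed device operates on the bit fields directly in hardware. The exponent of the exponent-type operand is simply added to the decimal-point field of the dynamic fixed-point operand:

```python
def multiply_by_shift(exponent_bits, integer_bits, point_bits):
    """Multiply a dynamic fixed-point operand (integer_bits, point_bits) by an
    exponent-type operand (exponent_bits, value 2**e) by adding the exponent to
    the decimal-point field, i.e. moving the decimal point to the right.
    All fields are binary strings; point_bits marks where the decimal point sits
    inside integer_bits, counted from the left."""
    exponent = int(exponent_bits, 2)
    new_point = int(point_bits, 2) + exponent
    return integer_bits, format(new_point, "0{}b".format(len(point_bits)))

def fixed_point_value(integer_bits, point_bits):
    """Decode the dynamic fixed-point fields into an ordinary number."""
    point = int(point_bits, 2)
    return int(integer_bits, 2) / 2 ** (len(integer_bits) - point)

# Weight 2**1 (exponent type) times 12.5 ("1100.1000", decimal point after bit 4).
int_bits, pt_bits = multiply_by_shift("00001", "11001000", "0100")
print(int_bits, pt_bits)                     # 11001000 0101
print(fixed_point_value(int_bits, pt_bits))  # 25.0
```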
在一种可能的实现方式中,该装置还可以包括第一类型转换模块。第一类型转换模块用于将接收到的待处理数据转换为以指定值为底数的第一数据,并根据第一数据的指数,生成指数型的输入数据。其中,指数型的输入数据的指数位用于存储指数。In a possible implementation manner, the device may further include a first type conversion module. The first type conversion module is used to convert the received data to be processed into first data with a specified value as the base, and generate exponential input data according to the exponent of the first data. Among them, the exponent bit of the exponential input data is used to store the exponent.
在该实现方式中,第一类型转换模块所接收到的待处理数据所转换的第一数据的指数需是整数,以保证对输入数据能够进行移位运算。可以根据实际需要对指数位所占用的比特位数进行设置,例如,5比特,本公开对此不作限制。In this implementation manner, the exponent of the first data converted by the data to be processed received by the first type conversion module needs to be an integer to ensure that the input data can be shifted. The number of bits occupied by the exponent bits can be set according to actual needs, for example, 5 bits, which is not limited in this disclosure.
在一种可能的实现方式中,对于指数型的输入数据其还可以包括指定值位,用于标记该输入数据的指定值。In a possible implementation manner, the exponential input data may further include a designated value bit, which is used to mark the designated value of the input data.
在一种可能的实现方式中,指数位中还包括符号位,用于表示指数位所存储数据的正负。例如,可以设定指数型的输入数据占用5个比特,第1个比特为符号位,第2-5比特为指数位。可以设置在符号位所存储的数为0时,指数位所存储的数据为正数,在符号位所存储的数为1时,指数位所存储的数据为负数。In a possible implementation, the exponent bit also includes a sign bit, which is used to indicate whether the data stored in the exponent bit is positive or negative. For example, you can set the exponential input data to occupy 5 bits, the first bit is the sign bit, and the 2nd to 5th bits are the exponent bits. It can be set that when the number stored in the sign bit is 0, the data stored in the exponent bit is positive, and when the number stored in the sign bit is 1, the data stored in the exponent bit is negative.
For example, assume the received data to be processed is 1024, the specified value is set to 2, and the input data is binary. The first type conversion module may convert the data to be processed, "1024", into the first data 2^10 with 2 (the specified value) as the base. From the exponent "10" of the first data 2^10, the exponential, binary input data "01010" is generated. Now assume the received data to be processed is 0.5, the specified value is set to 2, and the input data is binary. The first type conversion module may convert "0.5" into the first data 2^-1 with 2 (the specified value) as the base. From the exponent "-1" of the first data 2^-1, the exponential, binary input data "10001" is generated.
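A sketch of this first type conversion is shown below; it assumes the specified value (base) is 2 and a 5-bit encoding whose first bit is the sign of the exponent, matching the example above. The function name and field widths are illustrative assumptions.

```python
import math

def to_exponential_format(value: float, exp_bits: int = 4) -> str:
    # Assumes the specified value (base) is 2.
    exponent = math.log2(value)
    if not exponent.is_integer():
        raise ValueError("value is not an integer power of 2")
    exponent = int(exponent)
    sign = "1" if exponent < 0 else "0"              # first bit: sign of the exponent
    return sign + format(abs(exponent), f"0{exp_bits}b")

assert to_exponential_format(1024) == "01010"        # 2**10
assert to_exponential_format(0.5) == "10001"         # 2**-1
```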
在一种可能的实现方式中,该装置还可以包括第二类型转换模块。第二类型转换模块用于对接收到的待处理数据进行转换,得到分别表征待处理数据的整数部分的数值的第二数据和表征小数部分的数值的第三数据,并根据第二数据、第三数据、以及待处理数据的小数点位置,生成动态定点型的输入数据。其中,动态定点型的输入数据的整数位用于存储第二数据和第三数据,动态定点型的输入数据的小数点位所存储的数据用于标记待处理数据的小数点在整数位所存储数据中的位置。In a possible implementation manner, the device may further include a second type conversion module. The second type conversion module is used to convert the received data to be processed to obtain second data respectively representing the value of the integer part of the data to be processed and third data representing the value of the decimal part, and according to the second data, the first Three data, and the position of the decimal point of the data to be processed, to generate dynamic fixed-point input data. Among them, the integer bits of the dynamic fixed-point input data are used to store the second data and the third data, and the data stored in the decimal point of the dynamic fixed-point input data are used to mark the decimal point of the data to be processed in the data stored in the integer bits s position.
在该实现方式中,第二类型转换模块所接收到的待处理数据可以是小数。例如,123.4(十进制)等。可以根据计算需要对动态定点型的输入数据所占用的总比特数、以及整数位和小数点位所占用的比特数进行设置。例如,可以设置动态定点型的输入数据占用12比特,其中,整数位占用8比特,小数点位占用4比特。本领域技术人员可以根据实际需要对动态定点型的输入数据占用的总比特数、以及整数位和小数点位所占用的比特数进行设置,本公开对此不作限制。In this implementation manner, the data to be processed received by the second type conversion module may be a decimal. For example, 123.4 (decimal), etc. You can set the total number of bits occupied by the input data of the dynamic fixed-point type, and the number of bits occupied by the integer and decimal points according to the calculation needs. For example, it can be set that the input data of the dynamic fixed-point type occupies 12 bits, in which the integer bit occupies 8 bits and the decimal point occupies 4 bits. Those skilled in the art can set the total number of bits occupied by the input data of the dynamic fixed-point type and the number of bits occupied by the integer and decimal points according to actual needs, which is not limited in the present disclosure.
For example, assume the received data to be processed is 24.5, the input data is binary, the integer bits occupy 10 bits, and the point-position field occupies 4 bits. The second type conversion module may convert the integer part "24" of the data to be processed into the binary second data "11000", and convert the fractional part "0.5" into the binary third data "0.1000". The integer bits of the dynamic fixed-point input data can then be determined to store "0110001000". Since the decimal point falls after the sixth bit of the stored "0110001000", its position can be represented by "0110". The dynamic fixed-point input data finally generated by the second type conversion module from the data to be processed "24.5" is therefore "0110001000,0110".
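The second type conversion in this example can be sketched as follows. The field widths (10 integer bits, 4 fractional bits, 4 point-position bits) and the function name are assumptions taken from the example above, not a definitive implementation.

```python
def to_dynamic_fixed_point(value: float, total_int_bits: int = 10, frac_bits: int = 4) -> str:
    int_part = int(value)
    frac_part = value - int_part
    second_data = format(int_part, "b")                        # e.g. 24 -> "11000"
    third_data = format(int(round(frac_part * 2**frac_bits)), f"0{frac_bits}b")  # 0.5 -> "1000"
    stored = (second_data + third_data).zfill(total_int_bits)  # data kept in the integer bits
    point_pos = total_int_bits - frac_bits                     # bits to the left of the decimal point
    return stored + "," + format(point_pos, "04b")

assert to_dynamic_fixed_point(24.5) == "0110001000,0110"
```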
图6-3示出根据本公开一实施例的数据处理装置的框图。在一种可能的实现方式中,如图6-3所示,该装置还可以包括存储模块13-6。存储模块13-6用于存储待查找向量。6-3 shows a block diagram of a data processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 6-3, the device may further include a storage module 13-6. The storage module 13-6 is used to store the vector to be found.
在该实现方式中,存储模块可以包括内存、缓存和寄存器中的一种或多种,缓存可以包括速暂存缓存。可以根据需要将待查找向量在存储模块中的内存、缓存和/或寄存器中,本公开对此不作限制。In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The vector to be searched can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in the present disclosure.
在一种可能的实现方式中,该装置还可以包括直接内存访问模块,用于从存储模块中读取或者存储数据。In a possible implementation, the device may further include a direct memory access module, which is used to read or store data from the storage module.
在一种可能的实现方式中,如图6-3所示,控制模块11-6可以包括指令存储子模块111-6、指令处理子模块112-6和队列存储子模块113-6。In a possible implementation, as shown in FIG. 6-3, the control module 11-6 may include an instruction storage submodule 111-6, an instruction processing submodule 112-6, and a queue storage submodule 113-6.
指令存储子模块111-6用于存储向量查找指令。The instruction storage submodule 111-6 is used to store vector search instructions.
指令处理子模块112-6用于对向量查找指令进行解析,得到向量查找指令的操作码和操作域。The instruction processing sub-module 112-6 is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction.
The queue storage sub-module 113-6 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed, arranged in order of execution; the plurality of instructions to be executed may include the vector search instruction as well as other calculation instructions related to the vector search instruction.
在该实现方式中,可以根据待执行指令的接收时间、优先级别等对多个待执行指令的执行顺序进行排列获得指令队列,以便于根据指令队列依次执行多个待执行指令。In this implementation manner, the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
在一种可能的实现方式中,如图6-3所示,控制模块11-6还可以包括依赖关系处理子模块114-6。In a possible implementation, as shown in FIG. 6-3, the control module 11-6 may further include a dependency processing sub-module 114-6.
The dependency processing sub-module 114-6 is configured to cache the first instruction to be executed in the instruction storage sub-module 111-6 when it is determined that an association relationship exists between the first instruction to be executed and a zeroth instruction to be executed that precedes it; after the zeroth instruction to be executed has finished executing, the first instruction to be executed is fetched from the instruction storage sub-module 111-6 and sent to the processing module 12-6. Both the first instruction to be executed and the zeroth instruction to be executed are instructions among the plurality of instructions to be executed.
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在关联关系包括:存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。反之,第一待执行指令与第零待执行指令之间没有关联关系可以是第一存储地址区间与第零存储地址区间没有重叠区域。The first to-be-executed instruction is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas. Conversely, there is no association between the first instruction to be executed and the zeroth instruction to be executed may be that there is no overlapping area between the first storage address interval and the zeroth storage address interval.
In this way, based on the dependency relationships between the instructions to be executed, a later instruction to be executed is only executed after the earlier instruction to be executed has finished, which ensures the accuracy of the calculation results.
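A minimal sketch of the dependency check described above is given below; it simply tests whether the two storage address intervals overlap. The interval representation and names are illustrative assumptions.

```python
def has_dependency(first_interval, zeroth_interval) -> bool:
    """Each interval is a (start, end) pair with start <= end; the first instruction
    to be executed depends on the zeroth one when the intervals overlap."""
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start <= z_end and z_start <= f_end

assert has_dependency((0x100, 0x1FF), (0x180, 0x2FF))        # overlapping: must wait
assert not has_dependency((0x100, 0x1FF), (0x200, 0x2FF))    # disjoint: may proceed
```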
图6-4示出根据本公开一实施例的数据处理装置的框图。在一种可能的实现方式中,如图6-4所示,处理模块12-6可以包括主处理子模块124和多个从处理子模块125。每个从处理子模块125可以包括数据传输子模块和累加子模块(图中未示出)。6-4 shows a block diagram of a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-4, the processing module 12-6 may include a master processing sub-module 124 and multiple slave processing sub-modules 125. Each slave processing sub-module 125 may include a data transmission sub-module and an accumulation sub-module (not shown in the figure).
控制模块11-6,还用于解析计算指令得到多个运算指令,并将输入数据和多个运算指令发送至主处理子模块124。The control module 11-6 is also used to parse the calculation instructions to obtain a plurality of calculation instructions, and send the input data and the plurality of calculation instructions to the main processing sub-module 124.
主处理子模块124,用于对输入数据执行前序处理,以及与多个从处理子模块125进行数据和运算指令的传输。The main processing sub-module 124 is used for performing pre-processing on input data and transmitting data and operation instructions with a plurality of sub-processing sub-modules 125.
从处理子模块125,用于根据从主处理子模块124传输的数据和运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给主处理子模块124。The sub-processing sub-module 125 is used to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing sub-module 124 to obtain multiple intermediate results, and transmit the multiple intermediate results to the main processing sub-module 124.
在该实现方式中,中间运算可以是对数据进行算术、逻辑等运算。其中,在输入数据包括输入神经元数据和权值数据,且输入神经元数据和权值数据分别对应不同的上述数据类型时,若根据运算指令确定所执行的中间运算为将输入神经元数据和权值数据相乘时,可以对输入神经元数据或权值数据进行移位运算,得到中间结果。In this implementation, the intermediate operation may be arithmetic, logic, and other operations on the data. Among them, when the input data includes input neuron data and weight data, and the input neuron data and weight data correspond to different types of the above data, if the intermediate operation performed according to the operation instruction is determined to be input neuron data and When the weight data is multiplied, the input neuron data or the weight data can be shifted to obtain an intermediate result.
主处理子模块124,还用于对多个中间结果执行后续处理,得到计算结果,并将计算结果存入目标地址中。The main processing sub-module 124 is also used to perform subsequent processing on a plurality of intermediate results to obtain calculation results, and store the calculation results in the target address.
需要说明的是,本领域技术人员可以根据实际需要对主处理子模块和多个从处理子模块之间的连接方式进行设置,以实现对处理模块的架构设置,例如,处理模块的架构可以是“H”型架构、阵列型架构、树型架构等,本公开对此不作限制。It should be noted that, those skilled in the art can set the connection mode between the main processing sub-module and multiple slave processing sub-modules according to actual needs, so as to realize the architecture setting of the processing module, for example, the architecture of the processing module may be The “H”-type architecture, the array-type architecture, the tree-type architecture, etc. are not limited in this disclosure.
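The master/slave flow described above can be sketched in software, assuming the operation instruction is modeled as a plain callable and the subsequent processing is a simple sum; this only illustrates the data flow, not the hardware implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def run(master_input, op, num_slaves=4):
    # Pre-processing: the master splits the input, one chunk per slave submodule.
    chunks = [master_input[i::num_slaves] for i in range(num_slaves)]
    # Intermediate operations: the slave submodules work on their chunks in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediates = list(pool.map(op, chunks))
    # Subsequent processing: the master combines the intermediate results.
    return sum(intermediates)

assert run([1, 2, 3, 4, 5, 6, 7, 8], sum) == 36
```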
图6-5a示出根据本公开一实施例的数据处理装置中处理模块的框图。在一种可能的实现方式中,如图6-5a所示,处理模块12-6还可以包括一个或多个分支处理子模块126,该分支处理子模块126用于转发主处理子模块124和从处理子模块125之间的数据和/或运算指令。其中,主处理子模块124与一个或多个分支处理子模块126连接。这样,处理模块中的主处理子模块、分支处理子模块和从处理子模块之间采用“H”型架构连接,通过分支处理子模块转发数据和/或运算指令,节省了对主处理子模块的资源占用,进而提高指令的处理速度。6-5a show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-5a, the processing module 12-6 may further include one or more branch processing sub-modules 126, and the branch processing sub-module 126 is used to forward the main processing sub-module 124 and The data and/or arithmetic instructions between the sub-modules 125 are processed. Among them, the main processing sub-module 124 is connected to one or more branch processing sub-modules 126. In this way, the main processing sub-module, the branch processing sub-module and the slave processing sub-module in the processing module are connected with an "H" type architecture, and the data and/or operation instructions are forwarded through the branch processing sub-module, saving the main processing sub-module Of resources, which in turn increases the processing speed of instructions.
图6-5b示出根据本公开一实施例的数据处理装置中处理模块的框图。在一种可能的实现方式中,如图6-5b所示,多个从处理子模块125呈阵列分布。6-5b show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIGS. 6-5b, multiple slave processing sub-modules 125 are distributed in an array.
每个从处理子模块125与相邻的其他从处理子模块125连接,主处理子模块124连接多个从处理子模块125中的k个从处理子模块125,k个从处理子模块125为:第1行的n个从处理子模块125、第m行的n个从处理子模块125以及第1列的m个从处理子模块125。Each slave processing sub-module 125 is connected to other adjacent slave processing sub-modules 125, and the master processing sub-module 124 is connected to the k slave processing sub-modules 125 of the plurality of slave processing sub-modules 125, and the k slave processing sub-modules 125 are : N slave processing submodules 125 in the first row, n slave processing submodules 125 in the mth row, and m slave processing submodules 125 in the first column.
其中,如图6-5b所示,k个从处理子模块仅包括第1行的n个从处理子模块、第m行的n个从处理子模块以及第1列的m个从处理子模块,即该k个从处理子模块为多个从处理子模块中直接与主处理子模块连接的从处理子模块。其中,k个从处理子模块,用于在主处理子模块以及多个从处理子模块之间的数据以及指令的转发。这样,多个从处理子模块呈阵列分布,可以提高主处理子模块向从处理子模块发送数据和/或运算指令速度,进而提高指令的处理速度。Among them, as shown in FIG. 6-5b, the k slave processing submodules include only n slave processing submodules in the first row, n slave processing submodules in the mth row, and m slave processing submodules in the first column That is, the k slave processing submodules are slave processing submodules directly connected to the master processing submodule among the multiple slave processing submodules. Among them, k slave processing sub-modules are used for forwarding data and instructions between the master processing sub-module and multiple slave processing sub-modules. In this way, multiple slave processing sub-modules are distributed in an array, which can increase the speed of sending data and/or operation instructions from the master processing sub-module to the slave processing sub-modules, thereby increasing the processing speed of the instructions.
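The set of k slave submodules wired directly to the master in this array layout can be enumerated as in the sketch below (positions are 1-indexed; names are illustrative assumptions).

```python
def directly_connected(m: int, n: int):
    """(row, col) positions of the slave submodules in row 1, row m and column 1."""
    first_row = {(1, c) for c in range(1, n + 1)}
    last_row = {(m, c) for c in range(1, n + 1)}
    first_col = {(r, 1) for r in range(1, m + 1)}
    return first_row | last_row | first_col

# For a 3 x 4 array the two shared corner positions are counted once, giving 9.
assert len(directly_connected(3, 4)) == 9
```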
图6-5c示出根据本公开一实施例的数据处理装置中处理模块的框图。在一种可能的实现方式中,如图6-5c所示,处理模块还可以包括树型子模块127。该树型子模块127包括一个根端口401和多个支端口402。根端口401与主处理子模块124连接,多个支端口402与多个从处理子模块125分别连接。其中,树型子模块127具有收发功能,用于转发主处理子模块124和从处理子模块125之间的数据和/或运算指令。这样,通过树型子模块的作用使得处理模块呈树型架构连接,并利用树型子模块的转发功能,可以提高主处理子模块向从处理子模块发送数据和/或运算指令速度,进而提高指令的处理速度。6-5c show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIGS. 6-5c, the processing module may further include a tree-shaped submodule 127. The tree-shaped submodule 127 includes a root port 401 and multiple branch ports 402. The root port 401 is connected to the main processing submodule 124, and the plurality of branch ports 402 are respectively connected to the plurality of slave processing submodules 125. Among them, the tree-shaped submodule 127 has a transceiver function for forwarding data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125. In this way, the processing modules are connected in a tree structure through the role of the tree-shaped submodules, and the forwarding function of the tree-shaped submodules can be used to increase the speed of sending data and/or operation instructions from the main processing submodule to the slave processing submodules, thereby increasing The processing speed of the instruction.
In a possible implementation, the tree-shaped submodule 127 may be an optional structure of the device and may include at least one layer of nodes. Each node is a wiring structure with a forwarding function; a node itself has no computing function. The lowest-layer nodes are connected to the slave processing submodules to forward data and/or operation instructions between the master processing submodule 124 and the slave processing submodules 125. In particular, if the tree-shaped submodule has zero layers of nodes, the device does not require the tree-shaped submodule.
在一种可能的实现方式中,树型子模块127可以包括n叉树结构的多个节点,n叉树结构的多个节点可以具有多个层。In a possible implementation, the tree-shaped submodule 127 may include multiple nodes of an n-ary tree structure, and multiple nodes of the n-ary tree structure may have multiple layers.
举例来说,图6-5d示出根据本公开一实施例的数据处理装置中处理模块的框图。如图6-5d所示,n叉树结构可以是二叉树结构,树型子模块127包括2层节点01。最下层节点01与从处理子模块125连接,以转发主处理子模块124和从处理子模块125之间的数据和/或运算指令。For example, FIGS. 6-5d show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. As shown in FIGS. 6-5d, the n-ary tree structure may be a binary tree structure, and the tree-shaped submodule 127 includes 2-layer nodes 01. The lowermost node 01 is connected to the slave processing submodule 125 to forward data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125.
在该实现方式中,n叉树结构还可以是三叉树结构等,n为大于或等于2的正整数。本领域技术人员可以根据需要对n叉树结构中的n以及n叉树结构中节点的层数进行设置,本公开对此不作限制。In this implementation, the n-ary tree structure may also be a tri-tree structure, etc., where n is a positive integer greater than or equal to 2. A person skilled in the art may set n in the n-ary tree structure and the number of nodes in the n-ary tree structure as needed, and the disclosure does not limit this.
需要说明的是,尽管以上述实施例作为示例介绍了数据处理装置如上,但本领域技术人员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各模块,只要符合本公开的技术方案即可。It should be noted that although the above-mentioned embodiment is taken as an example to introduce the data processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
图6-6示出根据本公开一实施例的数据处理方法的流程图。如图6-6所示,该方法应用于上述数据处理装置,数据处理装置用于执行机器学习计算。该方法包括步骤S51-6至步骤S53-6。6-6 show a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Figs. 6-6, this method is applied to the above data processing device, and the data processing device is used to perform machine learning calculations. The method includes steps S51-6 to S53-6.
在步骤S51-6中,获取计算指令,并获取执行计算指令所需的输入数据。In step S51-6, the calculation instruction is acquired, and the input data required to execute the calculation instruction is acquired.
在步骤S52-6中,根据计算指令对输入数据进行处理,得到多个中间结果,并将多个中间结果依次发出。In step S52-6, the input data is processed according to the calculation instruction to obtain multiple intermediate results, and the multiple intermediate results are issued in sequence.
在步骤S53-6中,对多个中间结果进行循环累加运算,得到计算指令的计算结果。In step S53-6, a cyclic accumulation operation is performed on a plurality of intermediate results to obtain the calculation result of the calculation instruction.
In a possible implementation, performing a cyclic accumulation operation on the plurality of intermediate results may include the following (see the sketch after this list):
在接收到中间结果的第一运算周期,将中间结果与第一运算周期的第一中间数据相加,得到第一累加结果;In the first calculation cycle of receiving the intermediate result, add the intermediate result and the first intermediate data of the first calculation cycle to obtain the first accumulation result;
将第一累加结果存储为下一个运算周期的第一中间数据;Store the first accumulation result as the first intermediate data of the next calculation cycle;
在未接收到中间结果的第二运算周期,将第二运算周期的第一中间数据确定为计算结果,Determining the first intermediate data of the second calculation cycle as the calculation result in the second calculation cycle where the intermediate result is not received,
其中,初始运算周期的第一中间数据的值为零。The value of the first intermediate data in the initial calculation cycle is zero.
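A minimal software sketch of this first accumulation scheme is given below; operation cycles are modeled as loop iterations and the names are illustrative assumptions.

```python
def accumulate(intermediate_results):
    running = 0                           # first intermediate data of the initial cycle is zero
    for result in intermediate_results:   # each cycle in which an intermediate result arrives
        running = running + result        # first accumulation result, kept for the next cycle
    return running                        # a cycle with no new result yields the calculation result

assert accumulate([1, 2, 3, 4]) == 10
```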
In another possible implementation, performing a cyclic accumulation operation on the plurality of intermediate results may include the following (see the sketch after this list):
在接收到中间结果的第三运算周期,将中间结果与第三运算周期的第三中间数据相加,得到第二累加结果;In the third calculation cycle of receiving the intermediate result, add the intermediate result to the third intermediate data of the third calculation cycle to obtain a second accumulation result;
将第三运算周期的第二中间数据存储为下一个运算周期的第三中间数据,并将第二累加结果存储为下一个运算周期的第二中间数据;Storing the second intermediate data of the third operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle;
在未接收到中间结果的第四运算周期,将第四运算周期的第二中间数据与第四运算周期的第三中间数据相加,得到计算结果,Add the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle in the fourth operation cycle where no intermediate result is received, to obtain the calculation result,
其中,初始运算周期的第二中间数据及第三中间数据的值为零。The value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.
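This second scheme keeps two partial sums that leapfrog each other, which models an accumulator whose result only becomes available one cycle later; the sketch below is an illustration under that assumption, with illustrative names.

```python
def accumulate_two_deep(intermediate_results):
    second, third = 0, 0                  # second and third intermediate data start at zero
    for result in intermediate_results:   # each cycle in which an intermediate result arrives
        # new second = result + current third; current second becomes the new third
        second, third = result + third, second
    return second + third                 # a cycle with no new result combines both partial sums

assert accumulate_two_deep([1, 2, 3, 4]) == 10
```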
在一种可能的实现方式中,机器学习计算可以包括:人工神经网络运算,输入数据可以包括:输入神经元数据和权值数据;计算结果为输出神经元数据。In a possible implementation, the machine learning calculation may include: artificial neural network operation, and the input data may include: input neuron data and weight data; the calculation result is output neuron data.
在一种可能的实现方式中,输入数据的数据类型包括指数型和动态定点型中的至少一项,输入神经元数据和权值数据的数据类型不同。In a possible implementation manner, the data type of the input data includes at least one of exponential type and dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.
其中,根据计算指令对输入数据进行处理,得到多个中间结果,可以包括:根据计算指令对权值数据或输入神经元数据进行移位运算,得到中间结果。Wherein, processing the input data according to the calculation instruction to obtain multiple intermediate results may include: performing shift operation on the weight data or input neuron data according to the calculation instruction to obtain the intermediate result.
其中,指数型的输入数据包括指数位,以指定值为底数、指数位存储的数据为指数进行计算所得到的数据表示指数型的输入数据的数值。动态定点型的输入数据包括小数点位和整数位,小数点位所存储数据用于标记动态定点型的输入数据的小数点在整数位所存储数据中的位置,以区分整数位的数据中的整数部分和小数部分。其中,指数型的输入数据所对应的指定值与输入数据的进位制相同。The exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as exponents represent the value of the exponential input data. The input data of the dynamic fixed-point type includes a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the input data of the dynamic fixed-point in the data stored in the integer, to distinguish the integer part of the data in the integer. decimal part. Among them, the specified value corresponding to the exponential input data is the same as the carry system of the input data.
在一种可能的实现方式中,获取计算指令,并获取执行计算指令所需的输入数据,可以包括:解 析计算指令得到多个运算指令。In a possible implementation manner, obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include: analyzing the calculation instruction to obtain multiple calculation instructions.
其中,该方法还可以包括:Among them, the method may further include:
对输入数据执行前序处理,以及进行数据和运算指令的传输;Perform pre-sequence processing on input data and transfer data and calculation instructions;
根据传输的数据和运算指令并行执行中间运算得到多个中间结果;Perform intermediate operations in parallel based on the transmitted data and operation instructions to obtain multiple intermediate results;
对多个中间结果执行后续处理,得到计算指令的计算结果。Perform subsequent processing on multiple intermediate results to obtain the calculation result of the calculation instruction.
在一种可能的实现方式中,该方法可以包括:存储输入数据。In a possible implementation, the method may include: storing input data.
在一种可能的实现方式中,获取计算指令,并获取执行计算指令所需的输入数据,可以包括:In a possible implementation manner, obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include:
存储计算指令;Store calculation instructions;
对计算指令进行解析,得到计算指令的多个运算指令;Analyze the calculation instructions to obtain multiple calculation instructions for the calculation instructions;
存储指令队列,指令队列包括按照执行顺序依次排列的多个待执行指令,多个待执行指令包括多个运算指令;An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed includes a plurality of arithmetic instructions;
在一种可能的实现方式中,获取计算指令,并获取执行计算指令所需的多个输入数据,还可以包括:In a possible implementation manner, acquiring the calculation instruction and acquiring multiple input data required to execute the calculation instruction may further include:
在确定多个待执行指令中的第一待执行指令与第一待执行指令之前的第零待执行指令存在关联关系时,缓存第一待执行指令,在确定第零待执行指令执行完毕后,控制进行第一待执行指令的执行。When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, the first to-be-executed instruction is cached, and after determining that the execution of the zeroth to-be-executed instruction is completed, Controlling the execution of the first instruction to be executed.
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在关联关系包括:存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first to-be-executed instruction is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas.
本公开实施例所提供的数据处理方法,通过对多个中间结果进行循环累加的方式降低了数据访存量和计算量,同时保证计算的精度无损,且能够有效提高数据处理速度。The data processing method provided by the embodiments of the present disclosure reduces the amount of data access and calculation by cyclically accumulating a plurality of intermediate results, and at the same time ensures that the accuracy of calculation is non-destructive and can effectively increase the data processing speed.
The foregoing can be better understood in light of the following clauses:
条款E1、一种数据处理装置,所述装置用于执行机器学习计算,所述装置包括控制模块和处理模块,所述处理模块包括数据传递子模块和累加子模块:Clause E1. A data processing device for performing machine learning calculations. The device includes a control module and a processing module. The processing module includes a data transfer submodule and an accumulation submodule:
所述控制模块用于获取计算指令,并获取执行所述计算指令所需的输入数据;The control module is used to obtain a calculation instruction and obtain input data required to execute the calculation instruction;
所述数据传递子模块用于根据所述计算指令对所述输入数据进行处理,得到多个中间结果,并将所述多个中间结果依次发送至所述累加子模块;The data transfer sub-module is configured to process the input data according to the calculation instruction to obtain multiple intermediate results, and sequentially send the multiple intermediate results to the accumulation sub-module;
所述累加子模块用于对所述多个中间结果进行循环累加运算,得到所述计算指令的计算结果。The accumulation submodule is used to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
条款E2、根据条款E1所述的装置,所述累加子模块对所述多个中间结果进行循环累加运算,包括:Clause E2. The apparatus according to Clause E1, the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:
在接收到中间结果的第一运算周期,将所述中间结果与第一运算周期的第一中间数据相加,得到第一累加结果;Add the intermediate result to the first intermediate data of the first operation cycle in the first operation cycle of receiving the intermediate result to obtain the first accumulation result;
将所述第一累加结果存储为下一个运算周期的第一中间数据;Storing the first accumulation result as first intermediate data of the next calculation cycle;
在未接收到中间结果的第二运算周期,将第二运算周期的第一中间数据确定为所述计算结果,Determining the first intermediate data of the second calculation cycle as the calculation result in the second calculation cycle in which the intermediate result is not received,
其中,初始运算周期的第一中间数据的值为零。The value of the first intermediate data in the initial calculation cycle is zero.
条款E3、根据条款E1所述的装置,所述累加子模块对所述多个中间结果进行循环累加运算,包括:Clause E3. The apparatus according to Clause E1, the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:
在接收到中间结果的第三运算周期,将所述中间结果与第三运算周期的第三中间数据相加,得到第二累加结果;In the third calculation cycle of receiving the intermediate result, add the intermediate result to the third intermediate data of the third calculation cycle to obtain a second accumulation result;
将第三运算周期的第二中间数据存储为下一个运算周期的第三中间数据,并将所述第二累加结果存储为下一个运算周期的第二中间数据;Storing the second intermediate data of the third operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle;
在未接收到中间结果的第四运算周期,将第四运算周期的第二中间数据与第四运算周期的第三中 间数据相加,得到所述计算结果,Add the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle in the fourth operation cycle where no intermediate result is received, to obtain the calculation result,
其中,初始运算周期的第二中间数据及第三中间数据的值为零。The value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.
条款E4、根据条款E1-条款E3任一项所述的装置,所述机器学习计算包括:人工神经网络运算,所述输入数据包括:输入神经元数据和权值数据;所述计算结果为输出神经元数据。Clause E4. The device according to any one of Clauses E1-E3, the machine learning calculation includes: artificial neural network operation, the input data includes: input neuron data and weight data; the calculation result is output Neuron data.
条款E5、根据条款E4所述的装置,所述输入数据的数据类型包括指数型和动态定点型中的至少一项,所述输入神经元数据和所述权值数据的数据类型不同,Clause E5. The device according to Clause E4, the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, and the data types of the input neuron data and the weight data are different,
其中,所述数据传递子模块用于根据所述计算指令对所述输入数据进行处理,得到多个中间结果,包括:Wherein, the data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, including:
所述数据传递子模块用于根据所述计算指令对权值数据或所述输入神经元数据进行移位运算,得到中间结果,The data transfer sub-module is used to perform shift operation on the weight data or the input neuron data according to the calculation instruction to obtain an intermediate result,
其中,所述指数型的输入数据包括指数位,以指定值为底数、指数位存储的数据为指数进行计算所得到的数据表示所述指数型的输入数据的数值,Wherein, the exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as the exponent represent the value of the exponential input data,
所述动态定点型的输入数据包括小数点位和整数位,所述小数点位所存储数据用于标记所述动态定点型的输入数据的小数点在所述整数位所存储数据中的位置,以区分所述整数位的数据中的整数部分和小数部分,The dynamic fixed-point input data includes a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the dynamic fixed-point input data in the data stored in the integer to distinguish Integer part and decimal part in the data of the integer position,
其中,所述指数型的输入数据所对应的指定值与所述输入数据的进位制相同。The specified value corresponding to the exponential input data is the same as the carry system of the input data.
条款E6、根据条款E1所述的装置,所述处理模块包括主处理子模块和多个从处理子模块,所述主处理子模块包括所述数据传递子模块和所述累加子模块,Clause E6. The apparatus according to Clause E1, the processing module includes a master processing submodule and a plurality of slave processing submodules, the master processing submodule includes the data transfer submodule and the accumulation submodule,
所述控制模块,还用于解析所述计算指令得到多个运算指令,并将所述输入数据以及所述多个运算指令发送至所述主处理子模块;The control module is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the input data and the plurality of operation instructions to the main processing sub-module;
所述主处理子模块,用于对所述输入数据执行前序处理,以及与所述多个从处理子模块进行数据和运算指令的传输;The master processing sub-module is used to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing sub-modules;
所述多个从处理子模块,用于根据从所述主处理子模块传输的数据和运算指令并行执行中间运算得到多个中间结果,并将所述多个中间结果传输给所述主处理子模块;The plurality of sub-processing sub-modules are configured to execute intermediate operations in parallel according to data and operation instructions transmitted from the main processing sub-module to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the main processing sub Module
所述主处理子模块,还用于对所述多个中间结果执行后续处理,得到所述计算指令的计算结果。The main processing sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
条款E7、根据条款E1所述的装置,Clause E7, the device according to Clause E1,
所述装置还包括:存储模块,用于存储所述输入数据;The device also includes a storage module for storing the input data;
其中,所述控制模块,包括:Wherein, the control module includes:
指令存储子模块,用于存储所述计算指令;An instruction storage sub-module for storing the calculation instruction;
指令处理子模块,用于对所述计算指令进行解析,得到所述计算指令的多个运算指令;An instruction processing sub-module, which is used to analyze the calculation instruction to obtain a plurality of calculation instructions of the calculation instruction;
队列存储子模块,用于存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述多个运算指令;A queue storage sub-module, which is used to store an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the plurality of arithmetic instructions;
其中,所述控制模块,还包括:Wherein, the control module also includes:
依赖关系处理子模块,用于在确定所述多个待执行指令中的第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系时,将所述第一待执行指令缓存在所述指令存储子模块中,在所述第零待执行指令执行完毕后,从所述指令存储子模块中提取所述第一待执行指令发送至所述处理模块,The dependency processing sub-module is used to determine the first pending instruction when there is an association between the first pending instruction among the multiple pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the processing module,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系包括:Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的 第零存储地址区间具有重叠的区域。A first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
条款E8、一种机器学习运算装置,所述装置包括:Clause E8. A machine learning computing device, the device comprising:
一个或多个如条款E1-条款E7任一项所述的数据处理装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的机器学习运算,将执行结果通过I/O接口传递给其他处理装置;One or more data processing devices as described in any one of Clause E1-Clause E7, used to obtain data to be calculated and control information from other processing devices, and perform specified machine learning operations, and pass the execution results through I/O The interface is passed to other processing devices;
当所述机器学习运算装置包含多个所述数据处理装置时,所述多个所述数据处理装置间可以通过特定的结构进行连接并传输数据;When the machine learning computing device includes a plurality of the data processing devices, the data processing devices may be connected and transmit data through a specific structure;
其中,多个所述数据处理装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算;多个所述数据处理装置共享同一控制系统或拥有各自的控制系统;多个所述数据处理装置共享内存或者拥有各自的内存;多个所述数据处理装置的互联方式是任意互联拓扑。Among them, a plurality of the data processing apparatuses interconnect and transmit data through a PCIE bus, a fast external device interconnection bus, to support larger-scale machine learning operations; a plurality of the data processing apparatuses share the same control system or have their own Control system; multiple data processing devices share memory or have their own memory; multiple data processing devices are interconnected in any interconnection topology.
条款E9、一种组合处理装置,所述组合处理装置包括:Clause E9. A combined processing device, the combined processing device comprising:
如条款E8所述的机器学习运算装置、通用互联接口和其他处理装置;Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause E8;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作,The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
其中,所述组合处理装置还包括:存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
条款E10、一种机器学习芯片,所述机器学习芯片包括:Clause E10. A machine learning chip, the machine learning chip includes:
如条款E8所述的机器学习运算装置或如条款E9所述的组合处理装置。The machine learning arithmetic device according to clause E8 or the combined processing device according to clause E9.
条款E11、一种电子设备,所述电子设备包括:Clause E11. An electronic device, the electronic device comprising:
如条款E10所述的机器学习芯片。Machine learning chip as described in clause E10.
条款E12、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款E10所述的机器学习芯片;Clause E12, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause E10;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used to store data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款E13、一种数据处理方法,所述方法应用于数据处理装置,所述装置用于执行机器学习计算,所述方法包括:Clause E13. A data processing method, the method is applied to a data processing device, the device is used to perform machine learning calculations, the method includes:
获取计算指令,并获取执行所述计算指令所需的输入数据;Obtaining calculation instructions, and obtaining input data required to execute the calculation instructions;
根据所述计算指令对所述输入数据进行处理,得到多个中间结果,并将所述多个中间结果依次发出;Processing the input data according to the calculation instruction to obtain multiple intermediate results, and sending the multiple intermediate results in sequence;
对所述多个中间结果进行循环累加运算,得到所述计算指令的计算结果。Performing a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
条款E14、根据条款E13所述的方法,对所述多个中间结果进行循环累加运算,包括:Clause E14. According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:
在接收到中间结果的第一运算周期,将所述中间结果与第一运算周期的第一中间数据相加,得到第一累加结果;Add the intermediate result to the first intermediate data of the first operation cycle in the first operation cycle of receiving the intermediate result to obtain the first accumulation result;
将所述第一累加结果存储为下一个运算周期的第一中间数据;Storing the first accumulation result as first intermediate data of the next calculation cycle;
在未接收到中间结果的第二运算周期,将第二运算周期的第一中间数据确定为所述计算结果,Determining the first intermediate data of the second calculation cycle as the calculation result in the second calculation cycle in which the intermediate result is not received,
其中,初始运算周期的第一中间数据的值为零。The value of the first intermediate data in the initial calculation cycle is zero.
条款E15、根据条款E13所述的方法,对所述多个中间结果进行循环累加运算,包括:Clause E15. According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:
在接收到中间结果的第三运算周期,将所述中间结果与第三运算周期的第三中间数据相加,得到第二累加结果;In the third calculation cycle of receiving the intermediate result, add the intermediate result to the third intermediate data of the third calculation cycle to obtain a second accumulation result;
将第三运算周期的第二中间数据存储为下一个运算周期的第三中间数据,并将所述第二累加结果存储为下一个运算周期的第二中间数据;Storing the second intermediate data of the third operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle;
在未接收到中间结果的第四运算周期,将第四运算周期的第二中间数据与第四运算周期的第三中间数据相加,得到所述计算结果,Adding the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle in the fourth operation cycle where the intermediate result is not received, to obtain the calculation result,
其中,初始运算周期的第二中间数据及第三中间数据的值为零。The value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.
条款E16、根据条款E13-条款E15所述的方法,所述机器学习计算包括:人工神经网络运算,所述输入数据包括:输入神经元数据和权值数据;所述计算结果为输出神经元数据。Clause E16. According to the method described in Clause E13- Clause E15, the machine learning calculation includes: artificial neural network operation, and the input data includes: input neuron data and weight data; the calculation result is output neuron data .
条款E17、根据条款E16所述的方法,所述输入数据的数据类型包括指数型和动态定点型中的至少一项,所述输入神经元数据和所述权值数据的数据类型不同,Clause E17. The method according to Clause E16, the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, the data type of the input neuron data and the weight data is different,
其中,根据所述计算指令对所述输入数据进行处理,得到多个中间结果,包括:Wherein, processing the input data according to the calculation instruction to obtain multiple intermediate results includes:
根据所述计算指令对权值数据或所述输入神经元数据进行移位运算,得到中间结果,Performing shift operation on the weight data or the input neuron data according to the calculation instruction to obtain an intermediate result,
其中,所述指数型的输入数据包括指数位,以指定值为底数、指数位存储的数据为指数进行计算所得到的数据表示所述指数型的输入数据的数值,Wherein, the exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as the exponent represent the value of the exponential input data,
所述动态定点型的输入数据包括小数点位和整数位,所述小数点位所存储数据用于标记所述动态定点型的输入数据的小数点在所述整数位所存储数据中的位置,以区分所述整数位的数据中的整数部分和小数部分,The dynamic fixed-point input data includes a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the dynamic fixed-point input data in the data stored in the integer to distinguish Integer part and decimal part in the data of the integer position,
其中,所述指数型的输入数据所对应的指定值与所述输入数据的进位制相同。The specified value corresponding to the exponential input data is the same as the carry system of the input data.
条款E18、根据条款E13所述的方法,获取计算指令,并获取执行所述计算指令所需的输入数据,包括:Clause E18. According to the method described in Clause E13, obtain a calculation instruction, and obtain input data required to execute the calculation instruction, including:
解析所述计算指令得到多个运算指令,Parse the calculation instruction to obtain multiple calculation instructions,
其中,所述方法还包括:Wherein, the method further includes:
对所述输入数据执行前序处理,以及进行数据和运算指令的传输;Perform pre-sequence processing on the input data and transfer data and operation instructions;
根据传输的数据和运算指令并行执行中间运算得到多个中间结果;Perform intermediate operations in parallel based on the transmitted data and operation instructions to obtain multiple intermediate results;
对所述多个中间结果执行后续处理,得到所述计算指令的计算结果。Perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
条款E19、根据条款E13所述的方法,Clause E19, according to the method described in Clause E13,
所述方法包括:存储所述输入数据;The method includes: storing the input data;
其中,获取计算指令,并获取执行所述计算指令所需的输入数据,包括:Wherein, obtaining the calculation instruction, and obtaining the input data required to execute the calculation instruction include:
存储所述计算指令;Store the calculation instruction;
对所述计算指令进行解析,得到所述计算指令的多个运算指令;Analyzing the calculation instruction to obtain a plurality of calculation instructions of the calculation instruction;
存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述多个运算指令;Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged in order of execution, and the plurality of instructions to be executed include the plurality of arithmetic instructions;
其中,获取计算指令,并获取执行所述计算指令所需的多个输入数据,还包括:Wherein, obtaining a calculation instruction, and obtaining a plurality of input data required to execute the calculation instruction, further includes:
在确定所述多个待执行指令中的第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系时,缓存所述第一待执行指令,在确定所述第零待执行指令执行完毕后,控制进行所述第一待执行指令的执行,When it is determined that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and determine the first After the execution of the zero to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在关联关系包括:Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
As neural network algorithms are used more and more widely in fields such as image recognition, speech recognition, and natural language processing, their complexity keeps growing, and the types and amount of data operations involved keep increasing. Among these, the matrix, composed of numbers and/or characters, is a common data form in neural network algorithms. Processing a matrix in a neural network algorithm includes symmetric processing such as axial symmetry and central symmetry. In the related art, symmetric processing of matrices is inefficient and slow.
图7-1示出根据本公开一实施例的矩阵对称指令处理装置的框图。如图7-1所示,该装置包括控制模块11-7和处理模块12-7。FIG. 7-1 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure. As shown in Figure 7-1, the device includes a control module 11-7 and a processing module 12-7.
控制模块11-7,用于对接收到的矩阵对称指令进行解析,获得矩阵对称指令的操作码和操作域,并根据操作码和操作域确定执行矩阵对称指令所需的待处理矩阵和目标地址,以及确定进行对称处理所需的对称策略。其中,操作码用于指示矩阵对称指令对矩阵数据所进行的处理为对称处理,操作域包括待处理矩阵地址和目标地址。The control module 11-7 is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required for executing the matrix symmetric instruction according to the operation code and the operation domain , And determine the symmetrical strategy required for symmetrical processing. The operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation domain includes the address of the matrix to be processed and the target address.
处理模块12-7,根据对称策略对待处理矩阵进行对称处理,得到对称后矩阵,并将对称后矩阵存入目标地址中。The processing module 12-7 performs symmetric processing on the processing matrix according to the symmetric strategy to obtain the symmetric matrix, and stores the symmetric matrix into the target address.
在本实施例中,待处理矩阵可以是由多个数字和/或字符按照阵列排布而成的数据集合。对称策略可以指示对待处理矩阵所需进行的对称处理,对称策略可以包括对称中心、对称轴等对待处理矩阵进行对称处理所需的参数,所得到的对称后矩阵与待处理矩阵之间围绕对称中心、对称轴等对称。例如,对称策略可以包括中心对称、轴对称等。其中,可以为不同的对称策略设置在矩阵对称指令中的代码,例如,在矩阵对称指令中对称策略“中心对称”可以用代码csymmetric表示,对称策略“轴对称”可以用代码asymmetry表示。本领域技术人员可以根据实际需要对对称策略以及对称策略的代码进行设置,本公开对此不作限制。In this embodiment, the matrix to be processed may be a data set formed by arranging multiple numbers and/or characters in an array. The symmetric strategy can indicate the symmetric processing required for the matrix to be processed. The symmetric strategy can include the parameters required for the symmetric processing of the matrix to be processed, such as the center of symmetry and the axis of symmetry. The obtained symmetry matrix and the matrix to be processed surround the center of symmetry , Symmetry axis and other symmetry. For example, the symmetry strategy may include center symmetry, axis symmetry, and so on. Among them, codes for matrix symmetric instructions can be set for different symmetric strategies. For example, in matrix symmetric instructions, the symmetric strategy "center symmetry" can be represented by the code csymmetric, and the symmetric strategy "axis symmetry" can be represented by the code asymmetry. A person skilled in the art may set the symmetric strategy and the code of the symmetric strategy according to actual needs, which is not limited in the present disclosure.
For example, assume the matrix to be processed is [[1,4,7],[5,8,3]]. If the symmetry strategy determined from the matrix symmetric instruction is "center-symmetric processing", the device performs center-symmetric processing on the matrix to be processed [[1,4,7],[5,8,3]] and obtains the post-symmetry matrix [[3,8,5],[7,4,1]]. If the symmetry strategy determined from the matrix symmetric instruction is "axis-symmetric processing", the device performs axis-symmetric processing on the matrix to be processed [[1,4,7],[5,8,3]] and obtains the post-symmetry matrix [[5,8,3],[1,4,7]].
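The two symmetry strategies in this example can be reproduced with a short sketch; here axis symmetry is taken to mirror the rows about a horizontal axis, an assumption consistent with the example above.

```python
def center_symmetric(matrix):
    # Reverse the row order and each row: symmetry about the matrix centre.
    return [row[::-1] for row in matrix[::-1]]

def axis_symmetric(matrix):
    # Reverse only the row order: symmetry about a horizontal axis.
    return matrix[::-1]

m = [[1, 4, 7], [5, 8, 3]]
assert center_symmetric(m) == [[3, 8, 5], [7, 4, 1]]
assert axis_symmetric(m) == [[5, 8, 3], [1, 4, 7]]
```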
在本实施例中,控制模块可以从待处理矩阵地址中获取待处理矩阵。待处理矩阵地址可以是存储待处理矩阵的首地址等物理地址,也可以是逻辑地址、线性地址。控制模块可以将待处理矩阵存储在目标地址中。目标地址可以是存储对称后矩阵的首地址等物理地址,也可以是逻辑地址、线性地址。本公开对待处理矩阵地址、目标地址的表示方式不作限制。控制模块可以通过数据输入输出单元获得矩阵对称指令、待处理矩阵,该数据输入输出单元可以为一个或多个数据I/O接口或I/O引脚。In this embodiment, the control module may obtain the matrix to be processed from the address of the matrix to be processed. The address of the matrix to be processed may be a physical address such as the first address storing the matrix to be processed, or may be a logical address or a linear address. The control module may store the matrix to be processed in the target address. The target address may be a physical address such as the first address of the symmetric matrix, or a logical address or a linear address. The present disclosure does not limit the way of expressing the processing matrix address and the target address. The control module can obtain the matrix symmetry instruction and the matrix to be processed through the data input and output unit. The data input and output unit can be one or more data I/O interfaces or I/O pins.
In this embodiment, a matrix symmetric instruction may include an operation code and an operation domain. The operation code may be the part of the instruction or field (usually represented by a code) specified in a computer program to perform an operation; it is an instruction sequence number used to inform the device executing the instruction which instruction it specifically needs to execute. The operation domain may be the source of all the data required to execute the corresponding instruction, which includes the matrix to be processed and the corresponding symmetry strategy, or the addresses storing the matrix to be processed and the corresponding symmetry strategy, and so on. For example, the operation domain may include the address of the matrix to be processed and the target address.
应当理解的是,本领域技术人员可以根据需要对矩阵对称指令的指令格式以及所包含的操作码和操作域进行设置,本公开对此不作限制。It should be understood that a person skilled in the art may set the instruction format of the matrix symmetric instruction, as well as the included operation codes and operation domains as required, which is not limited in this disclosure.
在本实施例中,该装置可以包括一个或多个控制模块,以及一个或多个处理模块,可以根据实际需要对控制模块和处理模块的数量进行设置,本公开对此不作限制。在装置包括一个控制模块时,该控制模块可以接收矩阵对称指令,并控制一个或多个处理模块进行对称处理。在装置包括多个控制模块时,多个控制模块可以分别接收矩阵对称指令,并控制对应的一个或多个处理模块进行对称处理。In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive matrix symmetric commands and control one or more processing modules to perform symmetric processing. When the device includes multiple control modules, the multiple control modules may respectively receive matrix symmetric instructions and control the corresponding one or more processing modules to perform symmetric processing.
本公开实施例所提供的矩阵对称指令处理装置,该装置包括控制模块和处理模块。控制模块用于对接收到的矩阵对称指令进行解析,获得矩阵对称指令的操作码和操作域,并根据操作码和操作域确定执行矩阵对称指令所需的待处理矩阵和目标地址,以及确定进行对称处理所需的对称策略。处理模块用于根据对称策略对待处理矩阵进行对称处理,得到对称后矩阵,并将对称后矩阵存入目标地址中。本公开实施例所提供的矩阵对称指令处理装置的适用范围广,对矩阵进行对称处理的处理效率高、处理速度快。The matrix symmetric instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain, and determine the progress Symmetry strategy required for symmetric processing. The processing module is used to perform symmetric processing on the processing matrix according to a symmetric strategy to obtain the symmetric matrix, and store the symmetric matrix into the target address. The matrix symmetric instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has a high processing efficiency and a fast processing speed for performing symmetric processing on the matrix.
In a possible implementation manner, the operation domain may further include the input shape of the matrix to be processed. The processing module 12-7 may further be configured to perform symmetry processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetrized matrix.
In this implementation manner, the input shape of the matrix to be processed facilitates the symmetry processing of the matrix, and the shape of the symmetrized matrix can also be determined according to the input shape of the matrix to be processed. The shape of a matrix may be expressed by the numbers of digits and/or characters of the matrix in its rows and columns. For example, if the matrix 1 to be processed is [[1,0,1],[0,1,0],[-1,0,-1]], its shape is 3×3, that is, matrix 1 to be processed has 3 rows and 3 columns and consists of 9 digits.
In a possible implementation manner, a default input shape of the matrix to be processed may be preset. When the operation domain does not contain the input shape of the matrix to be processed, the default input shape may be determined as the input shape of the matrix to be processed for the current matrix symmetry instruction.
In a possible implementation manner, the operation domain may further include the output shape of the symmetrized matrix. The processing module 12-7 may further be configured to perform symmetry processing on the matrix to be processed according to the output shape and the symmetry strategy to obtain the symmetrized matrix.
In this implementation manner, the output shape may be the shape of the symmetrized matrix. For example, if the symmetrized matrix 2 is [[1,0],[0,1],[-1,0]], its shape is 2×3, that is, the symmetrized matrix 2 has 2 rows and 3 columns and consists of 6 digits.
In a possible implementation manner, a default output shape of the symmetrized matrix may be preset. When the operation domain does not contain the output shape of the symmetrized matrix, the default output shape may be determined as the output shape of the symmetrized matrix for the current matrix symmetry instruction.
In a possible implementation manner, the operation domain may also be used to indicate the symmetry strategy.
In a possible implementation manner, the operation code may also be used to indicate the symmetry strategy.
In a possible implementation manner, the symmetry strategy may be determined according to the operation code or the operation domain of the matrix symmetry instruction. A default symmetry strategy for the matrix to be processed may also be preset. When the operation domain does not contain a symmetry strategy, the default symmetry strategy may be determined as the symmetry strategy for the current matrix symmetry instruction.
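The following is a minimal sketch of one way the symmetry strategy could be resolved from the operation code, the operation domain, or a preset default, in that order of priority. The decoded-instruction structure, the field names (opcode, operands, "type"), and the strategy names are illustrative assumptions, not definitions from this disclosure.

DEFAULT_STRATEGY = "axis_symmetry"  # assumed preset default strategy

def resolve_symmetry_strategy(opcode: str, operands: dict) -> str:
    """Determine the symmetry strategy from the opcode suffix, the operation
    domain, or a preset default, in that order."""
    # Case 1: the opcode carries the strategy as a suffix, e.g. "Rotate1_asymmetry".
    if "_" in opcode:
        return opcode.split("_", 1)[1]
    # Case 2: the operation domain carries an explicit "type" field.
    if "type" in operands:
        return operands["type"]
    # Case 3: fall back to the preset default strategy.
    return DEFAULT_STRATEGY

print(resolve_symmetry_strategy("Rotate1_asymmetry", {"dst": 200, "src": 100}))
print(resolve_symmetry_strategy("Rotate1", {"type": "center_symmetry", "dst": 200, "src": 100}))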
FIG. 7-2 shows a block diagram of a matrix symmetry instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 7-2, the device may further include a storage module 13-7. The storage module 13-7 is configured to store the matrix to be processed.
In this implementation manner, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a high-speed scratchpad cache. The matrix to be processed may be stored in the memory, cache, and/or register of the storage module as required, which is not limited in the present disclosure.
In a possible implementation manner, the device may further include a direct memory access module configured to read data from or store data into the storage module.
In a possible implementation manner, as shown in FIG. 7-2, the control module 11-7 may include an instruction storage sub-module 111-7, an instruction processing sub-module 112-7, and a queue storage sub-module 113-7.
The instruction storage sub-module 111-7 is configured to store the matrix symmetry instruction.
The instruction processing sub-module 112-7 is configured to parse the matrix symmetry instruction to obtain its operation code and operation domain.
The queue storage sub-module 113-7 is configured to store an instruction queue. The instruction queue includes multiple to-be-executed instructions arranged in execution order, and the multiple to-be-executed instructions may include the matrix symmetry instruction as well as other computation instructions related to the matrix symmetry instruction.
In this implementation manner, the execution order of the multiple to-be-executed instructions may be arranged according to their reception time, priority level, and the like to obtain the instruction queue, so that the to-be-executed instructions are executed sequentially according to the instruction queue.
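A minimal sketch of how such an instruction queue could be ordered is given below, assuming each pending instruction carries a priority level and a reception timestamp; the tuple layout and the rule "priority first, then reception time" are illustrative assumptions.

def build_instruction_queue(pending):
    """pending: list of (instruction, priority, recv_time) tuples; a lower
    priority value means more urgent. Returns the instructions in execution order."""
    ordered = sorted(pending, key=lambda entry: (entry[1], entry[2]))
    return [instruction for instruction, _priority, _recv_time in ordered]

queue = build_instruction_queue([
    ("Rotate1_asymmetry 200 100 S1 S2", 1, 5),
    ("LOAD 100", 0, 3),
    ("STORE 200", 1, 7),
])
print(queue)  # ['LOAD 100', 'Rotate1_asymmetry 200 100 S1 S2', 'STORE 200']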
In a possible implementation manner, as shown in FIG. 7-2, the control module 11-7 may further include a dependency processing sub-module 114-7.
The dependency processing sub-module 114-7 is configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage sub-module 111-7, and, after the zeroth to-be-executed instruction has been executed, fetch the first to-be-executed instruction from the instruction storage sub-module 111-7 and send it to the processing module 12-7. The first to-be-executed instruction and the zeroth to-be-executed instruction are both instructions among the multiple to-be-executed instructions.
The first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes: a first storage address interval storing data required by the first to-be-executed instruction overlaps a zeroth storage address interval storing data required by the zeroth to-be-executed instruction. Conversely, the absence of a dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction may mean that the first storage address interval and the zeroth storage address interval do not overlap.
In this way, according to the dependency relationships between to-be-executed instructions, a later to-be-executed instruction is executed only after the earlier to-be-executed instruction has finished executing, which ensures the accuracy of the computation results.
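The following is a minimal sketch of the dependency test described above: two instructions are treated as dependent when their storage address intervals overlap. The interval representation (inclusive start, exclusive end) is an illustrative assumption.

from typing import Tuple

Interval = Tuple[int, int]  # (start_address, end_address), end exclusive

def intervals_overlap(a: Interval, b: Interval) -> bool:
    """Return True if the two address intervals share at least one address."""
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(first_interval: Interval, zeroth_interval: Interval) -> bool:
    """The first instruction depends on the zeroth instruction when their
    data storage intervals overlap."""
    return intervals_overlap(first_interval, zeroth_interval)

# The zeroth instruction writes addresses [200, 236); the first reads [220, 260),
# so the first instruction must wait until the zeroth has finished.
print(has_dependency((220, 260), (200, 236)))  # True
print(has_dependency((300, 340), (200, 236)))  # False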
In a possible implementation manner, the instruction format of the matrix symmetry instruction may be:
Rotate1 type dst src src_shape dst_shape
where Rotate1 is the operation code, and type, dst, src, src_shape, and dst_shape are the operation domain. Rotate1 indicates that the instruction is a matrix symmetry instruction. type is the symmetry strategy, dst is the target address, src is the to-be-processed matrix address, src_shape is the input shape, and dst_shape is the output shape.
The instruction format of the matrix symmetry instruction may also be:
Rotate1_type dst src src_shape dst_shape
where Rotate1_type is the operation code, and dst, src, src_shape, and dst_shape are the operation domain. The Rotate1 part of Rotate1_type indicates that the instruction is a matrix symmetry instruction, and the type part of Rotate1_type is the symmetry strategy. dst is the target address, src is the to-be-processed matrix address, src_shape is the input shape, and dst_shape is the output shape.
In a possible implementation manner, the instruction format of a matrix symmetry instruction whose symmetry strategy is "axis symmetry" may be set as: Rotate1_asymmetry dst src src_shape dst_shape; the instruction format of a matrix symmetry instruction whose symmetry strategy is "center symmetry" may be set as: Rotate1_csymmetry dst src src_shape dst_shape.
It should be understood that a person skilled in the art may set the operation code of the matrix symmetry instruction and the positions of the operation code and operation domain in the instruction format as required, which is not limited in the present disclosure.
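As a minimal sketch, the two instruction formats shown above could be decoded as follows. The textual encoding and the returned field names are illustrative assumptions; a real device would decode binary instruction fields rather than strings.

def parse_rotate1(text: str) -> dict:
    fields = text.split()
    opcode = fields[0]
    if opcode.startswith("Rotate1_"):
        # Fused form: "Rotate1_<type> dst src src_shape dst_shape"
        strategy = opcode.split("_", 1)[1]
        dst, src, src_shape, dst_shape = fields[1:5]
    else:
        # Separate form: "Rotate1 <type> dst src src_shape dst_shape"
        strategy, dst, src, src_shape, dst_shape = fields[1:6]
    return {
        "opcode": "Rotate1",
        "strategy": strategy,
        "dst": dst,
        "src": src,
        "src_shape": src_shape,
        "dst_shape": dst_shape,
    }

print(parse_rotate1("Rotate1 asymmetry 200 100 S1 S2"))
print(parse_rotate1("Rotate1_asymmetry 200 100 S1 S2"))

Both calls yield the same decoded fields, which reflects that the two formats represent the same processing.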
In a possible implementation manner, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
It should be noted that, although the matrix symmetry instruction processing device is described above by taking the foregoing embodiments as examples, a person skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the modules may be set flexibly according to personal preference and/or actual application scenarios, as long as the technical solution of the present disclosure is followed.
Application example
In the following, "using the matrix symmetry instruction processing device to perform symmetry processing on a matrix to be processed" is taken as an exemplary application scenario to give an application example according to an embodiment of the present disclosure, so as to facilitate understanding of the flow of the matrix symmetry instruction processing device. A person skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be regarded as a limitation on the embodiments of the present disclosure.
FIG. 7-3 shows a schematic diagram of an application scenario of a matrix symmetry instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 7-3, the matrix symmetry instruction processing device processes a matrix symmetry instruction as follows.
Upon receiving the matrix symmetry instruction 1 (Rotate1_asymmetry 200 100 S1 S2), the control module 11-7 parses the instruction and obtains its operation code and operation domain. The operation code of the matrix symmetry instruction 1 is Rotate1_asymmetry. From the operation code it can be determined that the instruction is a matrix symmetry instruction and that the symmetry strategy is asymmetry, that is, axis symmetry. From the operation domain it can be determined that the to-be-processed matrix address is 100, the input shape is S1, the target address is 200, and the output shape is S2. The control module 11-7 then obtains the to-be-processed matrix 1 with input shape S1 from the to-be-processed matrix address 100.
The processing module 12-7 performs symmetry processing on the to-be-processed matrix 1 according to the symmetry strategy to obtain the symmetrized matrix 1', and stores the symmetrized matrix 1' at the target address 200.
The matrix symmetry instruction 1 may be the above Rotate1_asymmetry 200 100 S1 S2, or alternatively Rotate1 asymmetry 200 100 S1 S2. The two are instructions in different instruction formats that represent the same processing; the matrix symmetry instruction processing device handles both in a similar way, which is not repeated here.
For details of the above processing, refer to the relevant description above.
In this way, the matrix symmetry instruction processing device can quickly and efficiently perform symmetry processing on a matrix according to the matrix symmetry instruction.
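The following is a minimal end-to-end sketch of the example above. It assumes that "axis symmetry" folds the matrix about its horizontal axis (reversing the row order) and that "center symmetry" reflects every element through the matrix centre (reversing both the row and the column order); these semantics, the in-memory layout, and the function names are illustrative assumptions, not definitions from the disclosure.

def axis_symmetry(matrix):
    return matrix[::-1]                         # fold about the horizontal axis

def center_symmetry(matrix):
    return [row[::-1] for row in matrix[::-1]]  # reflect through the centre

memory = {100: [[1, 0, 1], [0, 1, 0], [-1, 0, -1]]}  # to-be-processed matrix 1 at address 100

def execute_rotate1(strategy, dst, src, memory):
    matrix = memory[src]                        # the control module fetches the matrix
    result = axis_symmetry(matrix) if strategy == "asymmetry" else center_symmetry(matrix)
    memory[dst] = result                        # the processing module stores matrix 1'
    return result

print(execute_rotate1("asymmetry", 200, 100, memory))  # [[-1, 0, -1], [0, 1, 0], [1, 0, 1]]
print(memory[200])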
FIG. 7-4 shows a flowchart of a matrix symmetry instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 7-4, the method is applied to the matrix symmetry instruction processing device described above, and includes step S51-7 and step S52-7.
In step S51-7, the received matrix symmetry instruction is parsed to obtain its operation code and operation domain, the matrix to be processed and the target address required to execute the matrix symmetry instruction are determined according to the operation code and the operation domain, and the symmetry strategy required for the symmetry processing is determined. The operation code is used to indicate that the processing performed by the matrix symmetry instruction on matrix data is symmetry processing, and the operation domain includes the to-be-processed matrix address and the target address.
In step S52-7, symmetry processing is performed on the matrix to be processed according to the symmetry strategy to obtain the symmetrized matrix, and the symmetrized matrix is stored at the target address.
In a possible implementation manner, the operation domain may further include the input shape of the matrix to be processed. Performing symmetry processing on the matrix to be processed according to the symmetry strategy to obtain the symmetrized matrix may include: performing symmetry processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetrized matrix.
In a possible implementation manner, the operation domain may further include the output shape of the symmetrized matrix. Performing symmetry processing on the matrix to be processed according to the symmetry strategy to obtain the symmetrized matrix may include: performing symmetry processing on the matrix to be processed according to the output shape and the symmetry strategy to obtain the symmetrized matrix.
In a possible implementation manner, the operation domain may also be used to indicate the symmetry strategy.
In a possible implementation manner, the operation code may also be used to indicate the symmetry strategy.
In a possible implementation manner, the method may further include: storing the matrix to be processed.
In a possible implementation manner, parsing the received matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction may include:
storing the matrix symmetry instruction;
parsing the matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction; and
storing an instruction queue, where the instruction queue includes multiple to-be-executed instructions arranged in execution order, and the multiple to-be-executed instructions may include the matrix symmetry instruction.
In a possible implementation manner, the method may further include:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and, after it is determined that the zeroth to-be-executed instruction has been executed, controlling the execution of the first to-be-executed instruction,
where the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes:
a first storage address interval storing data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
It should be noted that, although the matrix symmetry instruction processing method is described above by taking the foregoing embodiments as examples, a person skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the steps may be set flexibly according to personal preference and/or actual application scenarios, as long as the technical solution of the present disclosure is followed.
The matrix symmetry instruction processing method provided by the embodiments of the present disclosure has a wide application range and performs symmetry processing on matrices with high efficiency and high speed.
The foregoing may be better understood according to the following clauses:
Clause F1. A matrix symmetry instruction processing device, the device comprising:
a control module configured to parse a received matrix symmetry instruction, obtain the operation code and operation domain of the matrix symmetry instruction, determine, according to the operation code and the operation domain, the matrix to be processed and the target address required to execute the matrix symmetry instruction, and determine the symmetry strategy required for the symmetry processing; and
a processing module configured to perform symmetry processing on the matrix to be processed according to the symmetry strategy to obtain a symmetrized matrix, and to store the symmetrized matrix at the target address,
wherein the operation code is used to indicate that the processing performed by the matrix symmetry instruction on matrix data is symmetry processing, and the operation domain includes the to-be-processed matrix address and the target address.
Clause F2. The device according to clause F1, wherein the operation domain further includes the input shape of the matrix to be processed,
and wherein the processing module is further configured to perform symmetry processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetrized matrix.
Clause F3. The device according to clause F1, wherein the operation domain further includes the output shape of the symmetrized matrix,
and wherein the processing module is further configured to perform symmetry processing on the matrix to be processed according to the output shape and the symmetry strategy to obtain the symmetrized matrix.
Clause F4. The device according to clause F1, wherein the operation domain is further used to indicate the symmetry strategy.
Clause F5. The device according to clause F1, wherein the operation code is further used to indicate the symmetry strategy.
Clause F6. The device according to clause F1,
wherein the device further comprises a storage module configured to store the matrix to be processed,
wherein the control module comprises:
an instruction storage sub-module configured to store the matrix symmetry instruction;
an instruction processing sub-module configured to parse the matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction; and
a queue storage sub-module configured to store an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the matrix symmetry instruction,
and wherein the control module further comprises:
a dependency processing sub-module configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage sub-module, and, after the zeroth to-be-executed instruction has been executed, fetch the first to-be-executed instruction from the instruction storage sub-module and send it to the processing module,
wherein the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding the first to-be-executed instruction includes:
a first storage address interval storing data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
Clause F7. A machine learning computing device, the device comprising:
one or more matrix symmetry instruction processing devices according to any one of clauses F1 to F6, configured to obtain the matrix to be processed and control information from other processing devices, perform the specified machine learning operation, and transmit the execution result to other processing devices through an I/O interface;
wherein, when the machine learning computing device includes multiple matrix symmetry instruction processing devices, the multiple matrix symmetry instruction processing devices may be connected to and transmit data to one another through a specific structure;
wherein the multiple matrix symmetry instruction processing devices are interconnected and transmit data through a PCIE (peripheral component interconnect express) bus to support larger-scale machine learning operations; the multiple matrix symmetry instruction processing devices share one control system or have their own control systems; the multiple matrix symmetry instruction processing devices share memory or have their own memories; and the interconnection manner of the multiple matrix symmetry instruction processing devices is any interconnection topology.
Clause F8. A combined processing device, the combined processing device comprising:
the machine learning computing device according to clause F7, a universal interconnection interface, and other processing devices;
wherein the machine learning computing device interacts with the other processing devices to jointly complete a computation operation specified by a user,
and wherein the combined processing device further comprises a storage device connected to the machine learning computing device and the other processing devices respectively, and configured to store data of the machine learning computing device and the other processing devices.
Clause F9. A machine learning chip, the machine learning chip comprising:
the machine learning computing device according to clause F7 or the combined processing device according to clause F8.
Clause F10. An electronic device, the electronic device comprising:
the machine learning chip according to clause F9.
Clause F11. A board card, the board card comprising: a storage device, an interface device, a control device, and the machine learning chip according to clause F9;
wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is configured to store data;
the interface device is configured to implement data transmission between the machine learning chip and an external device; and
the control device is configured to monitor the state of the machine learning chip.
Clause F12. A matrix symmetry instruction processing method, applied to a matrix symmetry instruction processing device, the method comprising:
parsing a received matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction, determining, according to the operation code and the operation domain, the matrix to be processed and the target address required to execute the matrix symmetry instruction, and determining the symmetry strategy required for the symmetry processing; and
performing symmetry processing on the matrix to be processed according to the symmetry strategy to obtain a symmetrized matrix, and storing the symmetrized matrix at the target address,
wherein the operation code is used to indicate that the processing performed by the matrix symmetry instruction on matrix data is symmetry processing, and the operation domain includes the to-be-processed matrix address and the target address.
Clause F13. The method according to clause F12, wherein the operation domain further includes the input shape of the matrix to be processed,
and wherein performing symmetry processing on the matrix to be processed according to the symmetry strategy to obtain the symmetrized matrix includes:
performing symmetry processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetrized matrix.
Clause F14. The method according to clause F12, wherein the operation domain further includes the output shape of the symmetrized matrix,
and wherein performing symmetry processing on the matrix to be processed according to the symmetry strategy to obtain the symmetrized matrix includes:
performing symmetry processing on the matrix to be processed according to the output shape and the symmetry strategy to obtain the symmetrized matrix.
Clause F15. The method according to clause F12, wherein the operation domain is further used to indicate the symmetry strategy.
Clause F16. The method according to clause F12, wherein the operation code is further used to indicate the symmetry strategy.
Clause F17. The method according to clause F12,
wherein the method further comprises: storing the matrix to be processed,
wherein parsing the received matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction includes:
storing the matrix symmetry instruction;
parsing the matrix symmetry instruction to obtain the operation code and operation domain of the matrix symmetry instruction; and
storing an instruction queue, the instruction queue including multiple to-be-executed instructions arranged in execution order, the multiple to-be-executed instructions including the matrix symmetry instruction,
and wherein the method further comprises:
when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, caching the first to-be-executed instruction, and, after it is determined that the zeroth to-be-executed instruction has been executed, controlling the execution of the first to-be-executed instruction,
wherein the first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding the first to-be-executed instruction includes:
a first storage address interval storing data required by the first to-be-executed instruction overlapping a zeroth storage address interval storing data required by the zeroth to-be-executed instruction.
As neural network algorithms are used more and more widely in fields such as image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms keeps increasing, and the types and amount of data operations involved keep growing. A matrix is a common form of data in neural network algorithms and is composed of numbers and/or characters. Processing matrices in neural network algorithms includes mirroring the matrices. In the related art, mirroring a matrix is inefficient and slow.
FIG. 8-1 shows a block diagram of a matrix mirroring instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-1, the device includes a control module 11-8 and a processing module 12-8.
The control module 11-8 is configured to parse a received matrix mirroring instruction, obtain the operation code and operation domain of the matrix mirroring instruction, determine, according to the operation code and the operation domain, the matrix to be mirrored and the target address required to execute the matrix mirroring instruction, and determine the mirroring strategy required for the mirroring processing. The operation code is used to indicate that the processing performed by the matrix mirroring instruction on matrix data is mirroring processing, and the operation domain includes the to-be-mirrored matrix address and the target address.
The processing module 12-8 performs mirroring processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix, and stores the mirrored matrix at the target address.
In this embodiment, the matrix to be mirrored may be a data set formed by multiple numbers and/or characters arranged in an array. Mirroring is a transformation of a matrix in which the matrix to be mirrored is folded about a particular flip line (in a two-dimensional plane) or a particular flip plane (in three-dimensional space) to obtain the mirrored matrix. For example, if the matrix to be mirrored lies in a two-dimensional plane, the mirroring strategy may include at least one of folding the matrix along its horizontal direction and folding the matrix along its vertical direction. If the matrix to be mirrored lies in three-dimensional space, the mirroring strategy may include at least one of folding the matrix along its horizontal plane, folding the matrix along its vertical plane, and folding the matrix along the plane perpendicular to both the horizontal plane and the vertical plane. The mirroring strategy may include the parameters required for the mirroring processing, such as the flip line or flip plane, and a matrix mirroring instruction may apply mirroring processing to the matrix one or more times, which is not limited in the present disclosure.
For example, assume that the matrix to be mirrored is [[1,4,7],[2,5,8],[3,6,9]]. If the mirroring strategy determined from the matrix mirroring instruction is "horizontal mirroring", the device performs horizontal mirroring on the matrix to be mirrored and obtains the mirrored matrix [[3,6,9],[2,5,8],[1,4,7]]. If the mirroring strategy determined from the matrix mirroring instruction is "vertical mirroring", the device performs vertical mirroring on the matrix to be mirrored and obtains the mirrored matrix [[9,6,3],[8,5,2],[7,4,1]].
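The following is a minimal sketch of the horizontal-mirroring case above, assuming that horizontal mirroring folds the matrix along its horizontal direction, i.e. it reverses the order of the outer lists; the exact axis convention is an illustrative assumption rather than a definition from the disclosure. A vertical fold is shown only for contrast.

def mirror_horizontal(matrix):
    """Fold the matrix along its horizontal direction."""
    return matrix[::-1]

def mirror_vertical(matrix):
    """Fold the matrix along its vertical direction (element order reversed
    within each inner list); shown only for contrast."""
    return [row[::-1] for row in matrix]

m = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(mirror_horizontal(m))  # [[3, 6, 9], [2, 5, 8], [1, 4, 7]], as in the example above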
In this embodiment, the control module may obtain the matrix to be mirrored from the to-be-mirrored matrix address. The to-be-mirrored matrix address may be a physical address, such as the head address at which the matrix to be mirrored is stored, or may be a logical address or a linear address. The target address is where the mirrored matrix is to be stored, and may likewise be a physical address, such as the head address for storing the mirrored matrix, or a logical address or a linear address. The present disclosure does not limit the manner of representing the to-be-mirrored matrix address and the target address. The control module may obtain the matrix mirroring instruction and the matrix to be mirrored through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, a matrix mirroring instruction may include an operation code and an operation domain. The operation code may be the part of an instruction or field (usually represented by a code) that specifies the operation to be performed, and serves as an instruction serial number informing the device executing the instruction which instruction is to be executed. The operation domain may be the source of all data required to execute the corresponding instruction, including the matrix to be mirrored and the corresponding mirroring strategy, or the addresses at which the matrix to be mirrored and the corresponding mirroring strategy are stored, and the like. For example, the operation domain may include the to-be-mirrored matrix address and the target address.
It should be understood that a person skilled in the art may set the instruction format of the matrix mirroring instruction, as well as the operation code and operation domain it contains, as required, which is not limited in the present disclosure.
In this embodiment, the device may include one or more control modules and one or more processing modules, and the numbers of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes one control module, that control module may receive the matrix mirroring instruction and control one or more processing modules to perform the mirroring processing. When the device includes multiple control modules, the control modules may respectively receive matrix mirroring instructions and control the corresponding one or more processing modules to perform the mirroring processing.
The matrix mirroring instruction processing device provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to parse a received matrix mirroring instruction, obtain the operation code and operation domain of the instruction, determine, according to the operation code and operation domain, the matrix to be mirrored and the target address required to execute the instruction, and determine the mirroring strategy required for the mirroring processing. The processing module performs mirroring processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix, and stores the mirrored matrix at the target address. The matrix mirroring instruction processing device provided by the embodiments of the present disclosure has a wide application range and mirrors matrices according to the matrix mirroring instruction with high efficiency and high speed.
In a possible implementation manner, the operation domain may further include the input shape of the matrix to be mirrored.
The processing module 12-8 may further be configured to perform mirroring processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain the mirrored matrix.
In this implementation manner, the input shape of the matrix to be mirrored facilitates the mirroring processing of the matrix, and the shape of the mirrored matrix can also be determined according to the input shape of the matrix to be mirrored. The shape of a matrix may be expressed by the numbers of digits and/or characters of the matrix in its rows and columns. For example, if the matrix 1 to be mirrored is [[0,1,1],[0,1,-1]], its shape is 3×2, that is, the matrix 1 to be mirrored has 3 rows and 2 columns and consists of 6 digits.
In a possible implementation manner, a default input shape of the matrix to be mirrored may be preset. When the operation domain does not contain the input shape of the matrix to be mirrored, the default input shape may be determined as the input shape of the matrix to be mirrored for the current matrix mirroring instruction. This is not limited in the present disclosure.
In a possible implementation manner, the operation domain may further include the output shape of the mirrored matrix. The processing module 12-8 is further configured to perform mirroring processing on the matrix to be mirrored according to the output shape and the mirroring strategy to obtain the mirrored matrix.
In this implementation manner, the output shape may be the shape of the mirrored matrix. For example, if the mirrored matrix is [[1,0],[0,1],[-1,0]], its shape is 2×3, that is, the mirrored matrix has 2 rows and 3 columns and consists of 6 digits.
In a possible implementation manner, a default output shape of the mirrored matrix may be preset. When the operation domain does not contain the output shape of the mirrored matrix, the default output shape may be determined as the output shape of the mirrored matrix for the current matrix mirroring instruction. This is not limited in the present disclosure.
In a possible implementation manner, the operation domain may also be used to indicate the mirroring strategy.
In a possible implementation manner, the operation code may also be used to indicate the mirroring strategy.
In a possible implementation manner, the mirroring strategy may be determined according to the operation code or the operation domain of the matrix mirroring instruction. A default mirroring strategy for the matrix to be mirrored may also be preset. When the operation domain does not contain a mirroring strategy, the default mirroring strategy may be determined as the mirroring strategy for the current matrix mirroring instruction.
FIG. 8-2 shows a block diagram of a matrix mirroring instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 8-2, the matrix mirroring instruction processing device may further include a storage module 13-8 configured to store the matrix to be mirrored.
In this implementation manner, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a high-speed scratchpad cache. The matrix to be mirrored may be stored in the memory, cache, and/or register of the storage module as required, which is not limited in the present disclosure.
In a possible implementation manner, the device may further include a direct memory access module configured to read data from or store data into the storage module.
In a possible implementation manner, as shown in FIG. 8-2, the control module 11-8 may include an instruction storage sub-module 111-8, an instruction processing sub-module 112-8, and a queue storage sub-module 113-8.
The instruction storage sub-module 111-8 is configured to store the matrix mirroring instruction.
The instruction processing sub-module 112-8 is configured to parse the matrix mirroring instruction to obtain its operation code and operation domain.
The queue storage sub-module 113-8 is configured to store an instruction queue. The instruction queue includes multiple to-be-executed instructions arranged in execution order, and the multiple to-be-executed instructions may include the matrix mirroring instruction as well as other computation instructions related to the matrix mirroring instruction.
In this implementation manner, the execution order of the multiple to-be-executed instructions may be arranged according to their reception time, priority level, and the like to obtain the instruction queue, so that the to-be-executed instructions are executed sequentially according to the instruction queue.
In a possible implementation manner, as shown in FIG. 8-2, the control module 11-8 may further include a dependency processing sub-module 114-8.
The dependency processing sub-module 114-8 is configured to, when it is determined that a first to-be-executed instruction among the multiple to-be-executed instructions has a dependency relationship with a zeroth to-be-executed instruction preceding the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage sub-module 111-8, and, after the zeroth to-be-executed instruction has been executed, fetch the first to-be-executed instruction from the instruction storage sub-module 111-8 and send it to the processing module 12-8. The first to-be-executed instruction and the zeroth to-be-executed instruction are both instructions among the multiple to-be-executed instructions.
The first to-be-executed instruction having a dependency relationship with the zeroth to-be-executed instruction preceding it includes: a first storage address interval storing data required by the first to-be-executed instruction overlaps a zeroth storage address interval storing data required by the zeroth to-be-executed instruction. Conversely, the absence of a dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction may mean that the first storage address interval and the zeroth storage address interval do not overlap.
In this way, according to the dependency relationships between to-be-executed instructions, a later to-be-executed instruction is executed only after the earlier to-be-executed instruction has finished executing, which ensures the accuracy of the computation results.
In a possible implementation manner, the instruction format of the matrix mirroring instruction may be:
Rotate2 type dst src src_shape dst_shape
where Rotate2 is the operation code, and type, dst, src, src_shape, and dst_shape are the operation domain. Rotate2 indicates that the instruction is a matrix mirroring instruction. type is the mirroring strategy, dst is the target address, src is the to-be-mirrored matrix address, src_shape is the input shape, and dst_shape is the output shape.
In a possible implementation manner, the instruction format of the matrix mirroring instruction may also be:
Rotate2_type dst src src_shape dst_shape
where Rotate2_type is the operation code, and dst, src, src_shape, and dst_shape are the operation domain. The Rotate2 part of Rotate2_type indicates that the instruction is a matrix mirroring instruction, and the type part of Rotate2_type is the mirroring strategy. dst is the target address, src is the to-be-mirrored matrix address, src_shape is the input shape, and dst_shape is the output shape.
It should be understood that a person skilled in the art may set the operation code of the matrix mirroring instruction and the positions of the operation code and operation domain in the instruction format as required, which is not limited in the present disclosure.
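As a minimal sketch, the fused-opcode form above could be decoded and dispatched to a mirroring routine as follows. The concrete strategy names ("horizontal", "vertical"), the string encoding, and the handler implementations are illustrative assumptions consistent with the folding conventions assumed in the earlier mirroring sketch.

def decode_rotate2(text: str):
    fields = text.split()
    opcode = fields[0]
    if opcode.startswith("Rotate2_"):
        strategy = opcode.split("_", 1)[1]          # fused form carries the strategy
        dst, src, src_shape, dst_shape = fields[1:5]
    else:
        strategy, dst, src, src_shape, dst_shape = fields[1:6]
    return strategy, dst, src, src_shape, dst_shape

HANDLERS = {
    "horizontal": lambda m: m[::-1],                 # fold along the horizontal direction
    "vertical": lambda m: [row[::-1] for row in m],  # fold along the vertical direction
}

strategy, dst, src, *_ = decode_rotate2("Rotate2_horizontal 200 100 S1 S2")
print(strategy, dst, src)  # horizontal 200 100
print(HANDLERS[strategy]([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))  # [[3, 6, 9], [2, 5, 8], [1, 4, 7]]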
In a possible implementation manner, the device may be provided in one or more of a graphics processing unit (GPU), a central processing unit (CPU), and a neural-network processing unit (NPU).
It should be noted that, although the matrix mirroring instruction processing device is described above by taking the foregoing embodiments as examples, a person skilled in the art can understand that the present disclosure should not be limited thereto. In fact, the modules may be set flexibly according to personal preference and/or actual application scenarios, as long as the technical solution of the present disclosure is followed.
Application example
In the following, "using the matrix mirroring instruction processing device to perform mirroring processing on a matrix to be mirrored" is taken as an exemplary application scenario to give an application example according to an embodiment of the present disclosure, so as to facilitate understanding of the flow of the matrix mirroring instruction processing device. A person skilled in the art should understand that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be regarded as a limitation on the embodiments of the present disclosure.
FIG. 8-3 shows a schematic diagram of an application scenario of a matrix mirroring instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 8-3, the matrix mirroring instruction processing device processes a matrix mirroring instruction as follows.
Example 1-8
Upon receiving the matrix mirroring instruction 1 (Rotate2_type 200 100 S1 S2), the control module 11-8 parses the instruction and obtains its operation code and operation domain. The operation code of the matrix mirroring instruction 1 is Rotate2_type. From the operation code it can be determined that the instruction is a matrix mirroring instruction and that the mirroring strategy is type. From the operation domain it can be determined that the to-be-mirrored matrix address is 100, the input shape is S1, the target address is 200, and the output shape is S2. The control module 11-8 then obtains the to-be-mirrored matrix 1 with input shape S1 from the to-be-mirrored matrix address 100.
The processing module 12-8 performs mirroring processing on the to-be-mirrored matrix 1 according to the mirroring strategy to obtain the mirrored matrix 1', and stores the mirrored matrix 1' at the target address 200.
The matrix mirroring instruction 1 may be the above Rotate2_type 200 100 S1 S2, or alternatively Rotate2 type 200 100 S1 S2. The two are instructions in different instruction formats that represent the same processing; the matrix mirroring instruction processing device handles both in a similar way, which is not repeated here.
For details of the above processing, refer to the relevant description above.
In this way, the matrix mirroring instruction processing device can quickly and efficiently mirror a matrix according to the matrix mirroring instruction.
图8-4示出根据本公开一实施例的矩阵镜像指令处理方法的流程图。如图8-4所示,该方法应用于上述矩阵镜像指令处理装置,该方法包括步骤S51-8和步骤S52-8。8-4 shows a flowchart of a matrix mirroring instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 8-4, this method is applied to the above-mentioned matrix mirroring instruction processing device. The method includes step S51-8 and step S52-8.
在步骤S51-8中,对接收到的矩阵镜像指令进行解析,获得矩阵镜像指令的操作码和操作域,并根据操作码和操作域确定执行矩阵镜像指令所需的待镜像矩阵和目标地址,以及确定进行镜像处理所需的镜像策略。其中,操作码用于指示矩阵镜像指令对矩阵所进行的处理为镜像处理,操作域包括待镜像矩阵地址和目标地址。In step S51-8, the received matrix mirroring instruction is analyzed to obtain the operation code and operation domain of the matrix mirroring instruction, and the matrix to be mirrored and the target address required to execute the matrix mirroring instruction are determined according to the operation code and the operation domain And determine the mirroring strategy required for mirroring. The operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirror processing, and the operation domain includes the address of the matrix to be mirrored and the target address.
在步骤S52-8中,根据镜像策略对待镜像矩阵进行镜像处理,得到镜像后矩阵,并将镜像后矩阵存入目标地址中。In step S52-8, the mirroring matrix is mirrored according to the mirroring strategy to obtain the mirrored matrix, and the mirrored matrix is stored in the target address.
In a possible implementation, the operation domain may further include the input shape of the matrix to be mirrored. In this case, performing mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix may include: performing mirror processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain the mirrored matrix.
In a possible implementation, the operation domain may further include the output shape of the mirrored matrix. In this case, performing mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix may include: performing mirror processing on the matrix to be mirrored according to the output shape and the mirroring strategy to obtain the mirrored matrix.
在一种可能的实现方式中,操作域还可以用于指示镜像策略。In a possible implementation, the operation domain can also be used to indicate the mirroring strategy.
在一种可能的实现方式中,操作码还可以用于指示镜像策略。In a possible implementation, the operation code can also be used to indicate the mirroring strategy.
在一种可能的实现方式中,该方法还可以包括:存储待镜像矩阵。In a possible implementation manner, the method may further include: storing a matrix to be mirrored.
在一种可能的实现方式中,对接收到的矩阵镜像指令进行解析,获得矩阵镜像指令的操作码和操作域,可以包括:In a possible implementation manner, parsing the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction may include:
存储矩阵镜像指令;Storage matrix mirroring instruction;
对矩阵镜像指令进行解析,得到矩阵镜像指令的操作码和操作域;Analyze the matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction;
存储指令队列,指令队列包括按照执行顺序依次排列的多个待执行指令,多个待执行指令可以包括矩阵镜像指令。The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include matrix mirroring instructions.
在一种可能的实现方式中,该方法还可以包括:In a possible implementation manner, the method may further include:
在确定多个待执行指令中的第一待执行指令与第一待执行指令之前的第零待执行指令存在依赖关系时,缓存第一待执行指令,并在确定第零待执行指令执行完毕后,控制进行第一待执行指令的执行,When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions has a dependency relationship with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is completed , Control the execution of the first instruction to be executed,
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。A first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
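As an illustration of the overlap test described above, the minimal sketch below checks whether two storage address intervals overlap and therefore whether the first instruction to be executed must wait for the zeroth one. The function name and the half-open [start, end) interval representation are assumptions made for concreteness; they are not specified in the disclosure.

# Illustrative sketch of the dependency test described above: a later
# instruction depends on an earlier one when the storage address intervals
# of the data they use overlap. Intervals are assumed to be half-open
# [start, end) address ranges.
def intervals_overlap(first: tuple, zeroth: tuple) -> bool:
    first_start, first_end = first
    zeroth_start, zeroth_end = zeroth
    return first_start < zeroth_end and zeroth_start < first_end

# The later ("first") instruction must wait for the earlier ("zeroth") one
# only when the intervals overlap:
print(intervals_overlap((100, 200), (150, 250)))  # True  -> dependency, buffer it
print(intervals_overlap((100, 200), (300, 400)))  # False -> no dependency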
需要说明的是,尽管以上述实施例作为示例介绍了矩阵镜像指令处理方法如上,但本领域技术人员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各步骤,只要符合本公开的技术方案即可。It should be noted that although the above embodiment is taken as an example to introduce the processing method of the matrix mirroring instruction as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
The matrix mirroring instruction processing method provided by the embodiments of the present disclosure has a wide application range, and mirrors matrices with high processing efficiency and high processing speed.
The foregoing may be better understood in light of the following clauses:
条款G1、一种矩阵镜像指令处理装置,所述装置包括:Clause G1, a matrix mirroring instruction processing device, the device comprising:
a control module, configured to parse a received matrix mirroring instruction to obtain an operation code and an operation domain of the matrix mirroring instruction, determine, according to the operation code and the operation domain, a matrix to be mirrored and a target address required for executing the matrix mirroring instruction, and determine a mirroring strategy required for the mirror processing;
处理模块,根据所述镜像策略对所述待镜像矩阵进行镜像处理,得到镜像后矩阵,并将所述镜像后矩阵存入所述目标地址中,The processing module performs mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain a mirrored matrix, and stores the mirrored matrix in the target address,
其中,所述操作码用于指示所述矩阵镜像指令对矩阵数据所进行的处理为镜像处理,所述操作域包括所述待镜像矩阵地址和所述目标地址。Wherein, the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix data is mirror processing, and the operation domain includes the matrix address to be mirrored and the target address.
条款G2、根据条款G1所述的装置,所述操作域还包括待镜像矩阵的输入形状,Clause G2. The device according to Clause G1, the operation domain further includes an input shape of the matrix to be mirrored,
其中,所述处理模块,还用于根据所述输入形状以及所述镜像策略,对所述待镜像矩阵进行镜像处理,得到镜像后矩阵。Wherein, the processing module is further configured to perform mirror processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain a mirrored matrix.
条款G3、根据条款G1所述的装置,所述操作域还包括镜像后矩阵的输出形状,Clause G3. The device according to Clause G1, the operation domain further includes the output shape of the mirrored matrix,
其中,所述处理模块,还用于根据所述输出形状以及所述镜像策略,对所述待镜像矩阵进行镜像处理,得到镜像后矩阵。Wherein, the processing module is further configured to perform mirror processing on the matrix to be mirrored according to the output shape and the mirroring strategy to obtain a mirrored matrix.
条款G4、根据条款G1所述的装置,所述操作域还用于指示镜像策略。Clause G4. The apparatus according to Clause G1, the operation domain is further used to indicate a mirroring policy.
条款G5、根据条款G1所述的装置,所述操作码还用于指示所述镜像策略。Clause G5. The apparatus according to Clause G1, the operation code is further used to indicate the mirroring policy.
条款G6、根据条款G1所述的装置,Clause G6, the device according to Clause G1,
所述装置还包括:存储模块,用于存储所述待镜像矩阵,The device also includes a storage module for storing the matrix to be mirrored,
其中,所述控制模块,包括:Wherein, the control module includes:
指令存储子模块,用于存储所述矩阵镜像指令;An instruction storage sub-module for storing the matrix mirroring instruction;
指令处理子模块,用于对所述矩阵镜像指令进行解析,得到所述矩阵镜像指令的操作码和操作域;An instruction processing submodule, used for parsing the matrix mirroring instruction to obtain the operation code and the operation domain of the matrix mirroring instruction;
队列存储子模块,用于存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述矩阵镜像指令,A queue storage sub-module, which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix mirroring instruction,
其中,所述控制模块,还包括:Wherein, the control module also includes:
a dependency relationship processing submodule, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed preceding the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and, after the zeroth instruction to be executed has been executed, fetch the first instruction to be executed from the instruction storage submodule and send it to the processing module,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
条款G7、一种机器学习运算装置,所述装置包括:Clause G7. A machine learning computing device, the device comprising:
one or more matrix mirroring instruction processing devices according to any one of Clauses G1 to G6, configured to obtain a matrix to be mirrored and control information from other processing devices, perform a specified machine learning operation, and transfer the execution result to other processing devices through an I/O interface;
当所述机器学习运算装置包含多个所述矩阵镜像指令处理装置时,所述多个所述矩阵镜像指令处理装置间可以通过特定的结构进行连接并传输数据;When the machine learning operation device includes a plurality of the matrix mirroring instruction processing devices, the plurality of matrix mirroring instruction processing devices may be connected and transmit data through a specific structure;
其中,多个所述矩阵镜像指令处理装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算;多个所述矩阵镜像指令处理装置共享同一控制系统或拥有各自的控制系统;多个所述矩阵镜像指令处理装置共享内存或者拥有各自的内存;多个所述矩阵镜像指令处理装置的互联方式是任意互联拓扑。Among them, a plurality of the matrix mirroring instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the matrix mirroring instruction processing devices share the same control system Or have their own control systems; a plurality of the matrix mirroring instruction processing devices share memory or have their own memories; the interconnection method of the plurality of matrix mirroring instruction processing devices is an arbitrary interconnection topology.
条款G8、一种组合处理装置,所述组合处理装置包括:Clause G8. A combined processing device, the combined processing device comprising:
如条款G7所述的机器学习运算装置、通用互联接口和其他处理装置;Machine learning computing devices, general interconnection interfaces and other processing devices as described in Clause G7;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作,The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
其中,所述组合处理装置还包括:存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
条款G9、一种机器学习芯片,所述机器学习芯片包括:Clause G9. A machine learning chip, the machine learning chip includes:
如条款G7所述的机器学习运算装置或如条款G8所述的组合处理装置。The machine learning arithmetic device according to clause G7 or the combined processing device according to clause G8.
条款G10、一种电子设备,所述电子设备包括:Clause G10. An electronic device, the electronic device comprising:
如条款G9所述的机器学习芯片。Machine learning chip as described in clause G9.
条款G11、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款G9所述的机器学习芯片;Clause G11, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause G9;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used to store data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款G12、一种矩阵镜像指令处理方法,所述方法应用于矩阵镜像指令处理装置,所述方法包括:Clause G12. A method for processing a matrix mirroring instruction. The method is applied to a matrix mirroring instruction processing apparatus. The method includes:
对接收到的矩阵镜像指令进行解析,获得所述矩阵镜像指令的操作码和操作域,并根据所述操作码和所述操作域确定执行所述矩阵镜像指令所需的待镜像矩阵和目标地址,以及确定进行镜像处理所需的镜像策略;Parse the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction, and determine the matrix to be mirrored and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain , And determine the mirroring strategy required for mirroring;
根据所述镜像策略对所述待镜像矩阵进行镜像处理,得到镜像后矩阵,并将所述镜像后矩阵存入所述目标地址中,Mirroring the matrix to be mirrored according to the mirroring strategy to obtain a mirrored matrix, and storing the mirrored matrix in the target address,
其中,所述操作码用于指示所述矩阵镜像指令对矩阵所进行的处理为镜像处理,所述操作域包括所述待镜像矩阵地址和所述目标地址。Wherein, the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirror processing, and the operation domain includes the matrix address to be mirrored and the target address.
条款G13、根据条款G12所述的方法,所述操作域还包括待镜像矩阵的输入形状,Clause G13. According to the method of Clause G12, the operation domain further includes the input shape of the matrix to be mirrored,
其中,根据所述镜像策略对所述待镜像矩阵进行镜像处理,得到镜像后矩阵,包括:Wherein, performing mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix includes:
根据所述输入形状以及所述镜像策略,对所述待镜像矩阵进行镜像处理,获得所述镜像后矩阵。Perform mirror processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain the mirrored matrix.
条款G14、根据条款G12所述的方法,所述操作域还包括镜像后矩阵的输出形状,Clause G14. According to the method of Clause G12, the operation domain further includes the output shape of the mirrored matrix,
其中,根据所述镜像策略对所述待镜像矩阵进行镜像处理,得到镜像后矩阵,包括:Wherein, performing mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix includes:
根据所述输出形状以及所述镜像策略,对所述待镜像矩阵进行镜像处理,获得所述镜像后矩阵。According to the output shape and the mirroring strategy, perform mirror processing on the matrix to be mirrored to obtain the mirrored matrix.
条款G15、根据条款G12所述的方法,所述操作域还用于指示镜像策略。Clause G15. According to the method of Clause G12, the operation domain is also used to indicate a mirroring policy.
条款G16、根据条款G12所述的方法,所述操作码还用于指示所述镜像策略。Clause G16. The method according to Clause G12, the operation code is further used to indicate the mirroring policy.
条款G17、根据条款G12所述的方法,Clause G17, according to the method described in Clause G12,
所述方法还包括:存储所述待镜像矩阵,The method further includes: storing the matrix to be mirrored,
其中,对接收到的矩阵镜像指令进行解析,获得所述矩阵镜像指令的操作码和操作域,包括:Wherein, analyzing the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction includes:
存储所述矩阵镜像指令;Store the matrix mirroring instruction;
对所述矩阵镜像指令进行解析,得到所述矩阵镜像指令的操作码和操作域;Parse the matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction;
存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述矩阵镜像指令,Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix mirroring instruction,
其中,所述方法还包括:Wherein, the method further includes:
在确定所述多个待执行指令中的第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系时,缓存所述第一待执行指令,并在确定所述第零待执行指令执行完毕后,控制进行所述第一待执行指令的执行,When it is determined that the first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, the first instruction to be executed is cached, and the After the execution of the zeroth to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
Since neural network algorithms are used more and more widely in fields such as image recognition, speech recognition, and natural language processing, their complexity keeps increasing, and the types and amounts of data operations involved keep growing. A matrix is a relatively common data form in neural network algorithms and is composed of numbers and/or characters. Processing matrices in neural network algorithms includes rotating matrices. In the related art, rotating a matrix is inefficient and slow.
图9-1示出根据本公开一实施例的矩阵旋转指令处理装置的框图。如图9-1所示,该装置包括控制模块11-9和处理模块12-9。9-1 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 9-1, the device includes a control module 11-9 and a processing module 12-9.
控制模块11-9,用于对接收到的矩阵旋转指令进行解析,获得矩阵旋转指令的操作码和操作域,并根据操作码和操作域确定执行矩阵旋转指令所需的待旋转矩阵和目标地址,以及确定对待旋转矩阵进行旋转的旋转角度。其中,操作码用于指示矩阵旋转指令对矩阵数据所进行的处理为旋转处理,操作域包括待旋转矩阵地址和目标地址。The control module 11-9 is used to parse the received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and operation domain , And determine the rotation angle of the matrix to be rotated. The operation code is used to instruct the matrix rotation instruction to process the matrix data as rotation processing, and the operation domain includes the matrix address and the target address to be rotated.
处理模块12-9,根据旋转角度对待旋转矩阵进行旋转处理,得到旋转后矩阵,并将旋转后矩阵存入目标地址中。The processing module 12-9 performs rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix, and stores the rotated matrix in the target address.
在本实施例中,待旋转矩阵可以是由多个数字和/或字符按照阵列排列而成的数据集合。旋转角度可以是预先设定的任意大于0度或小于0度的角度。其中,可以设置在旋转角度大于0度时,对待旋转矩阵所进行的旋转为顺时针旋转;在旋转角度小于0度时,对待旋转矩阵所进行的旋转为逆时针旋转。例如,90°、180°、270°等。In this embodiment, the matrix to be rotated may be a data set formed by arranging multiple numbers and/or characters in an array. The rotation angle may be any preset angle greater than 0 degrees or less than 0 degrees. Wherein, when the rotation angle is greater than 0 degrees, the rotation performed by the matrix to be rotated is clockwise; when the rotation angle is less than 0 degrees, the rotation performed by the matrix to be rotated is counterclockwise. For example, 90°, 180°, 270°, etc.
举例来说,假定待旋转矩阵为[[1,4,7],[2,5,8],[3,6,9]]。若旋转角度为90°,那么装置将该待旋转矩阵顺时针旋转90°,得到进行旋转处理后的旋转后矩阵[[7,8,9],[4,5,6],[1,2,3]]。For example, suppose the matrix to be rotated is [[1,4,7],[2,5,8],[3,6,9]]. If the rotation angle is 90°, the device rotates the matrix to be rotated 90° clockwise to obtain the rotated matrix [[7,8,9],[4,5,6],[1,2 ,3]].
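As a purely arithmetic illustration of the 90° clockwise rotation in the example above, the sketch below represents a matrix as a list of columns, which is how the bracketed matrices in this section appear to be written (for instance, the 2×3 matrix given later in this section is listed as three columns of two elements). It reproduces the numerical result above, but it says nothing about the device's actual memory layout or implementation; the function name is illustrative only.

# Illustrative sketch only: a 90-degree clockwise rotation, with the matrix
# written as a list of columns (each column listed from the top row down).
def rotate_90_clockwise(cols):
    num_rows = len(cols[0])   # rows of the input matrix
    num_cols = len(cols)      # columns of the input matrix
    # Output column new_c of the rotated matrix is input row (num_rows-1-new_c),
    # read from the first input column to the last.
    return [[cols[c][num_rows - 1 - new_c] for c in range(num_cols)]
            for new_c in range(num_rows)]

# Reproduces the example above:
# [[1,4,7],[2,5,8],[3,6,9]] rotated 90 degrees clockwise -> [[7,8,9],[4,5,6],[1,2,3]]
print(rotate_90_clockwise([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))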
In this embodiment, the control module may obtain the matrix to be rotated from the to-be-rotated matrix address. The to-be-rotated matrix address may be a physical address, such as the first address at which the matrix to be rotated is stored, or it may be a logical address or a linear address. The rotated matrix may be stored at the target address, which may likewise be a physical address such as the first address at which the rotated matrix is stored, or a logical address or a linear address. The present disclosure does not limit how the to-be-rotated matrix address and the target address are represented. The control module may obtain the matrix rotation instruction and the matrix to be rotated through a data input/output unit, which may be one or more data I/O interfaces or I/O pins.
在本实施例中,对于一个矩阵旋转指令可以包括操作码和操作域。操作码可以是计算机程序中所规定的要执行操作的那一部分指令或字段(通常用代码表示),是指令序列号,用来告知执行指令的装置具体需要执行哪一条指令。而操作域可以是执行对应的指令所需的所有数据的来源,执行对应的指令所需的所有数据包括待旋转矩阵、对应的旋转角度,或者存储待旋转矩阵、对应的旋转角度的地址等等。比如,操作域可以包括待旋转矩阵地址和目标地址。In this embodiment, an operation code and an operation field may be included for a matrix rotation instruction. The operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain can be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the matrix to be rotated and the corresponding rotation angle, or the address to store the matrix to be rotated and the corresponding rotation angle, etc. . For example, the operation domain may include the matrix address to be rotated and the target address.
应当理解的是,本领域技术人员可以根据需要对矩阵旋转指令的指令格式以及所包含的操作码和操作域进行设置,本公开对此不作限制。It should be understood that, those skilled in the art can set the instruction format of the matrix rotation instruction, as well as the included operation codes and operation fields as required, and the disclosure does not limit this.
在本实施例中,该装置可以包括一个或多个控制模块,以及一个或多个处理模块,可以根据实际需要对控制模块和处理模块的数量进行设置,本公开对此不作限制。在装置包括一个控制模块时,该控制模块可以接收矩阵旋转指令,并控制一个或多个处理模块进行旋转处理。在装置包括多个控制模块时,多个控制模块可以分别接收矩阵旋转指令,并控制对应的一个或多个处理模块进行旋转处理。In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a matrix rotation instruction and control one or more processing modules to perform rotation processing. When the device includes multiple control modules, the multiple control modules may respectively receive matrix rotation instructions and control the corresponding one or more processing modules to perform rotation processing.
The matrix rotation instruction processing apparatus provided by the embodiments of the present disclosure includes a control module and a processing module. The control module is configured to parse a received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, determine, according to the operation code and the operation domain, the matrix to be rotated and the target address required for executing the matrix rotation instruction, and determine the rotation angle by which the matrix to be rotated is to be rotated. The processing module rotates the matrix to be rotated according to the rotation angle to obtain the rotated matrix, and stores the rotated matrix at the target address. The matrix rotation instruction processing apparatus provided by the embodiments of the present disclosure has a wide application range, and rotates matrices according to matrix rotation instructions with high processing efficiency and high processing speed.
在一种可能的实现方式中,操作域还可以包括待旋转矩阵的输入形状。处理模块12-9,还用于根据输入形状以及旋转角度,对待旋转矩阵进行旋转处理,获得旋转后矩阵。In a possible implementation manner, the operation domain may further include the input shape of the matrix to be rotated. The processing module 12-9 is also used to rotate the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.
In this implementation, the input shape of the matrix to be rotated facilitates the rotation processing, and the shape of the rotated matrix can also be determined from it. The shape of a matrix can be expressed by the number of numbers and/or characters in its rows and columns. For example, the matrix 1 to be rotated is [[1,0,1],[0,1,0],[-1,0,-1]], and its shape is 3×3; that is, the matrix 1 to be rotated has 3 rows and 3 columns and consists of 9 numbers.
在一种可能的实现方式中,可以预先设置待旋转矩阵的默认输入形状。在操作域中不包含待旋转矩阵的输入形状时,可以将待旋转矩阵的默认输入形状确定为当前矩阵旋转指令的待旋转矩阵的输入形状。输入形状至少可以包括待旋转矩阵的长度、待旋转矩阵的宽度,本公开对此不作限制。In a possible implementation, the default input shape of the matrix to be rotated can be preset. When the input shape of the matrix to be rotated is not included in the operation domain, the default input shape of the matrix to be rotated can be determined as the input shape of the matrix to be rotated of the current matrix rotation instruction. The input shape may include at least the length of the matrix to be rotated and the width of the matrix to be rotated, which is not limited in this disclosure.
在一种可能的实现方式中,操作域还可以包括旋转矩阵的输出形状,处理模块12-9,还用于根据输出形状以及旋转角度,对待旋转矩阵进行旋转处理,获得旋转后矩阵。In a possible implementation manner, the operation domain may further include an output shape of the rotation matrix, and the processing module 12-9 is further configured to perform rotation processing on the rotation matrix according to the output shape and the rotation angle to obtain the rotated matrix.
In this implementation, the output shape may be the shape of the rotated matrix. For example, if the rotated matrix is [[1,-1],[1,1],[0,0]], its shape is 2×3; that is, the rotated matrix has 2 rows and 3 columns and consists of 6 numbers.
在一种可能的实现方式中,可以预先设置旋转后矩阵的默认输出形状。在操作域中不包含旋转后矩阵的输出形状时,可以将旋转后矩阵的默认输出形状确定为当前矩阵旋转指令的旋转后矩阵的输出形状,输出形状至少可以包括旋转后矩阵的长度、旋转后矩阵的宽度,本公开对此不作限制。In a possible implementation, the default output shape of the rotated matrix can be preset. When the output shape of the rotated matrix is not included in the operation domain, the default output shape of the rotated matrix can be determined as the output shape of the rotated matrix of the current matrix rotation instruction. The output shape can include at least the length of the rotated matrix and The width of the matrix is not limited by this disclosure.
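As a small consistency illustration for the input and output shapes discussed above: a 90° or 270° rotation swaps the numbers of rows and columns, while a 180° rotation leaves the shape unchanged. The sketch below only expresses this geometric relationship between src_shape and dst_shape; the function name is illustrative and not part of the disclosure.

# Illustrative only: the output shape implied by the input shape and a
# quarter-turn rotation angle. A 90 or 270 degree turn swaps rows and
# columns; a 180 degree turn leaves the shape unchanged.
def rotated_shape(input_shape, angle_degrees):
    rows, cols = input_shape
    if angle_degrees % 180 == 0:
        return (rows, cols)
    return (cols, rows)

print(rotated_shape((2, 3), 90))   # (3, 2)
print(rotated_shape((2, 3), 180))  # (2, 3)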
在一种可能的实现方式中,操作域还可以用于指示旋转角度。In a possible implementation, the operation field can also be used to indicate the rotation angle.
在一种可能的实现方式中,操作码还可以用于指示旋转角度。In a possible implementation, the operation code can also be used to indicate the rotation angle.
在一种可能的实现方式中,可以根据矩阵旋转指令的操作码或操作域确定旋转角度。还可以预先设置待旋转矩阵的默认旋转角度。在操作域中不包含待旋转矩阵的旋转角度时,可以将待旋转矩阵的默认旋转角度确定为当前矩阵旋转指令的待旋转矩阵的旋转角度。In a possible implementation manner, the rotation angle may be determined according to the operation code or operation field of the matrix rotation instruction. The default rotation angle of the matrix to be rotated can also be preset. When the rotation angle of the matrix to be rotated is not included in the operation domain, the default rotation angle of the matrix to be rotated may be determined as the rotation angle of the matrix to be rotated of the current matrix rotation instruction.
图9-2示出根据本公开一实施例的矩阵旋转指令处理装置的框图。在一种可能的实现方式中,如图9-2所示,矩阵旋转指令处理装置还可以包括:存储模块13-9,用于存储待旋转矩阵。9-2 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 9-2, the matrix rotation instruction processing apparatus may further include: a storage module 13-9, configured to store the matrix to be rotated.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a high-speed scratchpad cache. The matrix to be rotated may be stored in the memory, cache, and/or register of the storage module as needed, which is not limited in this disclosure.
在一种可能的实现方式中,该装置还可以包括直接内存访问模块,用于从存储模块中读取或者存储数据。In a possible implementation, the device may further include a direct memory access module, which is used to read or store data from the storage module.
在一种可能的实现方式中,如图9-2所示,控制模块11-9可以包括指令存储子模块111-9、指令处理子模块112-9和队列存储子模块113-9。In a possible implementation, as shown in FIG. 9-2, the control module 11-9 may include an instruction storage sub-module 111-9, an instruction processing sub-module 112-9, and a queue storage sub-module 113-9.
指令存储子模块111-9用于存储矩阵旋转指令。The instruction storage submodule 111-9 is used to store matrix rotation instructions.
指令处理子模块112-9用于对矩阵旋转指令进行解析,得到矩阵旋转指令的操作码和操作域。The instruction processing sub-module 112-9 is used to parse the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction.
The queue storage submodule 113-9 is configured to store an instruction queue. The instruction queue includes a plurality of instructions to be executed arranged in order of execution; the plurality of instructions to be executed may include the matrix rotation instruction and may also include other computation instructions related to the matrix rotation instruction.
在该实现方式中,可以根据待执行指令的接收时间、优先级别等对多个待执行指令的执行顺序进行排列获得指令队列,以便于根据指令队列依次执行多个待执行指令。In this implementation manner, the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.
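As an illustration of arranging pending instructions into an instruction queue by the attributes named above (reception time and priority level), the sketch below sorts a hypothetical list of pending instructions. Treating a smaller priority value as more urgent is an assumption made only for this sketch; the text does not fix the convention.

# Illustrative only: order pending instructions by priority level, then by
# reception time, to form the instruction queue described above.
pending = [
    {"text": "instruction A", "priority": 1, "received_at": 0},
    {"text": "instruction B", "priority": 0, "received_at": 1},
    {"text": "instruction C", "priority": 0, "received_at": 2},
]
instruction_queue = sorted(pending, key=lambda ins: (ins["priority"], ins["received_at"]))
print([ins["text"] for ins in instruction_queue])  # ['instruction B', 'instruction C', 'instruction A']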
在一种可能的实现方式中,如图9-2所示,控制模块11-9还可以包括依赖关系处理子模块114-9。In a possible implementation, as shown in FIG. 9-2, the control module 11-9 may further include a dependency processing sub-module 114-9.
The dependency relationship processing submodule 114-9 is configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed preceding it, cache the first instruction to be executed in the instruction storage submodule 111-9, and, after the zeroth instruction to be executed has been executed, fetch the first instruction to be executed from the instruction storage submodule 111-9 and send it to the processing module 12-9. The first instruction to be executed and the zeroth instruction to be executed are both instructions among the plurality of instructions to be executed.
Here, the first instruction to be executed having a dependency relationship with the zeroth instruction to be executed preceding it means that the first storage address interval storing the data required by the first instruction to be executed overlaps with the zeroth storage address interval storing the data required by the zeroth instruction to be executed. Conversely, the absence of a dependency relationship between the first instruction to be executed and the zeroth instruction to be executed may mean that the first storage address interval and the zeroth storage address interval have no overlapping area.
通过这种方式,可以根据待执行指令之间的依赖关系,使得在先的待执行令执行完毕之后,再执行在后的待执行指令,保证运算结果的准确性。In this way, according to the dependency relationship between the instructions to be executed, after the execution of the first to-be-executed order is completed, the subsequent to-be-executed instruction is executed again to ensure the accuracy of the calculation result.
在一种可能的实现方式中,矩阵旋转指令的指令格式可以为:In a possible implementation manner, the instruction format of the matrix rotation instruction may be:
Rotate angle dst src src_shape dst_shapeRotate angle dst src src_shape dst_shape
其中,Rotate为操作码,angle、dst、src、src_shape、dst_shape为操作域。Rotate用于指示该指令为矩阵旋转指令。dst为目标地址。src为待旋转矩阵地址。angle为旋转角度。src_shape为输入形状。dst_shape为输出形状。Rotate is the operation code, and angle, dst, src, src_shape, and dst_shape are the operation domains. Rotate is used to indicate that the instruction is a matrix rotation instruction. dst is the target address. src is the address of the matrix to be rotated. angle is the rotation angle. src_shape is the input shape. dst_shape is the output shape.
在一种可能的实现方式中,矩阵旋转指令的指令格式可以为:In a possible implementation manner, the instruction format of the matrix rotation instruction may be:
Rotate_angle dst src src_shape dst_shapeRotate_angle dst src src_shape dst_shape
其中,Rotate_angle为操作码,dst、src、src_shape、dst_shape为操作域。Rotate_angle中的Rotate用于指示该指令为矩阵旋转指令,Rotate_angle中的angle为旋转角度。dst为目标地址。src为待旋转矩阵地址。src_shape为输入形状。dst_shape为输出形状。Among them, Rotate_angle is the operation code, dst, src, src_shape, dst_shape are the operation domain. Rotate in Rotate_angle is used to indicate that the instruction is a matrix rotation instruction, and angle in Rotate_angle is a rotation angle. dst is the target address. src is the address of the matrix to be rotated. src_shape is the input shape. dst_shape is the output shape.
In a possible implementation, the instruction format of a matrix rotation instruction that rotates clockwise by 90° may be set as: Rotate_90 dst src src_shape dst_shape; the instruction format of a matrix rotation instruction that rotates clockwise by 180° may be set as: Rotate_180 dst src src_shape dst_shape; and the instruction format of a matrix rotation instruction that rotates clockwise by 270° may be set as: Rotate_270 dst src src_shape dst_shape.
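The two textual formats above (the angle carried in the operation domain, as in Rotate angle dst src src_shape dst_shape, or encoded in the operation code, as in Rotate_90/Rotate_180/Rotate_270) can be illustrated with the sketch below. This is only a textual illustration under the assumption of whitespace-separated fields; the function name and returned field names are hypothetical, and a real device would decode binary instruction fields rather than strings.

# Illustrative only: recovers the rotation angle and operation domain from
# either textual format described above.
def parse_matrix_rotate_instruction(instruction: str) -> dict:
    fields = instruction.split()
    opcode = fields[0]
    if opcode.startswith("Rotate_"):
        angle = int(opcode[len("Rotate_"):])   # angle carried by the opcode
        dst, src, src_shape, dst_shape = fields[1:]
    else:
        angle = int(fields[1])                 # angle carried by the operation domain
        dst, src, src_shape, dst_shape = fields[2:]
    return {"angle": angle, "target_address": int(dst), "source_address": int(src),
            "input_shape": src_shape, "output_shape": dst_shape}

# Both forms describe the same processing:
print(parse_matrix_rotate_instruction("Rotate 90 200 100 S1 S2"))
print(parse_matrix_rotate_instruction("Rotate_90 200 100 S1 S2"))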
应当理解的是,本领域技术人员可以根据需要对矩阵旋转指令的操作码、指令格式中操作码以及操作域的位置进行设置,本公开对此不作限制。It should be understood that those skilled in the art can set the operation code of the matrix rotation instruction, the operation code in the instruction format, and the position of the operation field as needed, and the disclosure does not limit this.
在一种可能的实现方式中,该装置可以设置于图形处理器(Graphics Processing Unit,简称GPU)、中央处理器(Central Processing Unit,简称CPU)和嵌入式神经网络处理器(Neural-network Processing Unit,简称NPU)的一种或多种之中。In a possible implementation, the device may be set in a graphics processor (GPU), a central processing unit (CPU) and a neural-network processing unit (Neural-network Processing) , Referred to as NPU).
需要说明的是,尽管以上述实施例作为示例介绍了矩阵旋转指令处理装置如上,但本领域技术人员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各模块,只要符合本公开的技术方案即可。It should be noted that although the above embodiment is used as an example to introduce the matrix rotation instruction processing device as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
应用示例Application examples
以下结合“利用矩阵旋转指令处理装置对待旋转矩阵进行旋转”作为一个示例性应用场景,给出根据本公开实施例的应用示例,以便于理解矩阵旋转指令处理装置的流程。本领域技术人员应理解,以下应用示例仅仅是出于便于理解本公开实施例的目的,不应视为对本公开实施例的限制。In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "using a matrix rotation instruction processing device to rotate a matrix to be rotated" as an exemplary application scenario, so as to facilitate understanding of the flow of the matrix rotation instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.
图9-3示出根据本公开一实施例的矩阵旋转指令处理装置的应用场景的示意图。如图9-3所示,矩阵旋转指令处理装置对矩阵旋转指令进行处理的过程如下。9-3 shows a schematic diagram of an application scenario of a matrix rotation instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 9-3, the matrix rotation instruction processing device processes the matrix rotation instruction as follows.
When the control module 11-9 receives matrix rotation instruction 1 (Rotate 90 200 100 S1 S2), it parses matrix rotation instruction 1 and obtains its operation code and operation domain. The operation code of matrix rotation instruction 1 is Rotate. From the operation domain it can be determined that the rotation angle is 90 degrees, the address of the matrix to be rotated is 100, the input shape is S1, the target address is 200, and the output shape is S2. The control module 11-9 then obtains the matrix 1 to be rotated, whose input shape is S1, from the to-be-rotated matrix address 100.
The processing module 12-9 rotates the matrix 1 to be rotated according to the rotation angle to obtain the rotated matrix 1', and stores the rotated matrix 1' at target address 200.
Matrix rotation instruction 1 may be written not only as Rotate 90 200 100 S1 S2 as above, but also as Rotate_90 200 100 S1 S2; these are two different instruction formats that express the same processing, and the matrix rotation instruction processing apparatus handles them in a similar way, so the details are not repeated here.
上述处理过程详见上文相关描述。For details of the above process, please refer to the relevant description above.
这样,矩阵旋转指令处理装置可以快速、高效地根据矩阵旋转指令对矩阵进行旋转处理。In this way, the matrix rotation instruction processing device can quickly and efficiently rotate the matrix according to the matrix rotation instruction.
图9-4示出根据本公开一实施例的矩阵旋转指令处理方法的流程图。如图9-4所示,该方法应用于上述矩阵旋转指令处理装置,该方法包括步骤S51-9和步骤S52-9。9-4 shows a flowchart of a matrix rotation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 9-4, the method is applied to the above matrix rotation instruction processing device, and the method includes step S51-9 and step S52-9.
在步骤S51-9中,对接收到的矩阵旋转指令进行解析,获得矩阵旋转指令的操作码和操作域,并根据操作码和操作域确定执行矩阵旋转指令所需的待旋转矩阵和目标地址,以及确定对待旋转矩阵进行旋转的旋转角度。其中,操作码用于指示矩阵旋转指令对矩阵数据所进行的处理为旋转处理,操作域包括待旋转矩阵地址和目标地址。In step S51-9, the received matrix rotation instruction is analyzed to obtain the operation code and operation domain of the matrix rotation instruction, and the matrix to be rotated and the target address required to execute the matrix rotation instruction are determined according to the operation code and operation domain And determine the rotation angle of the matrix to be rotated. The operation code is used to instruct the matrix rotation instruction to process the matrix data as rotation processing, and the operation domain includes the matrix address and the target address to be rotated.
在步骤S52-9中,根据旋转角度对待旋转矩阵进行旋转处理,得到旋转后矩阵,并将旋转后矩阵存入目标地址中。In step S52-9, the rotation matrix is rotated according to the rotation angle to obtain the rotated matrix, and the rotated matrix is stored in the target address.
在一种可能的实现方式中,操作域还可以包括待旋转矩阵的输入形状。其中,根据旋转角度对待旋转矩阵进行旋转处理,得到旋转后矩阵,可以包括:根据输入形状以及旋转角度,对待旋转矩阵进行旋转处理,获得旋转后矩阵。In a possible implementation manner, the operation domain may further include the input shape of the matrix to be rotated. Wherein, performing the rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix may include: performing rotation processing on the rotation matrix according to the input shape and the rotation angle to obtain the rotated matrix.
在一种可能的实现方式中,操作域还可以包括旋转矩阵的输出形状。其中,根据旋转角度对待旋转矩阵进行旋转处理,得到旋转后矩阵,可以包括:根据输出形状以及旋转角度,对待旋转矩阵进行旋转处理,获得旋转后矩阵。In a possible implementation, the operation domain may also include the output shape of the rotation matrix. Wherein, performing the rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix may include: performing rotation processing on the rotation matrix according to the output shape and the rotation angle to obtain the rotated matrix.
在一种可能的实现方式中,操作域还可以用于指示旋转角度。In a possible implementation, the operation field can also be used to indicate the rotation angle.
在一种可能的实现方式中,操作码还可以用于指示旋转角度。In a possible implementation, the operation code can also be used to indicate the rotation angle.
在一种可能的实现方式中,该方法还可以包括:存储待旋转矩阵。In a possible implementation manner, the method may further include: storing the matrix to be rotated.
在一种可能的实现方式中,对接收到的矩阵旋转指令进行解析,获得矩阵旋转指令的操作码和操作域,可以包括:In a possible implementation manner, parsing the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction may include:
存储矩阵旋转指令;Storage matrix rotation instruction;
对矩阵旋转指令进行解析,得到矩阵旋转指令的操作码和操作域;Analyze the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;
存储指令队列,指令队列包括按照执行顺序依次排列的多个待执行指令,多个待执行指令可以包括矩阵旋转指令。An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to the execution order, and the plurality of instructions to be executed may include matrix rotation instructions.
在一种可能的实现方式中,该方法还可以包括:In a possible implementation manner, the method may further include:
在确定多个待执行指令中的第一待执行指令与第一待执行指令之前的第零待执行指令存在依赖关系时,缓存第一待执行指令,并在确定第零待执行指令执行完毕后,控制进行第一待执行指令的执行,When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions has a dependency relationship with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is completed , Control the execution of the first instruction to be executed,
其中,第一待执行指令与第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储第一待执行指令所需数据的第一存储地址区间与存储第零待执行指令所需数据的第零存储地址区间具有重叠的区域。A first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.
需要说明的是,尽管以上述实施例作为示例介绍了矩阵旋转指令处理方法如上,但本领域技术人 员能够理解,本公开应不限于此。事实上,用户完全可根据个人喜好和/或实际应用场景灵活设定各步骤,只要符合本公开的技术方案即可。It should be noted that although the above embodiment is used as an example to introduce the matrix rotation instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.
本公开实施例所提供的矩阵旋转指令处理方法的适用范围广,根据矩阵旋转指令对矩阵进行旋转处理的处理效率高、处理速度快。The matrix rotation instruction processing method provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of rotating the matrix according to the matrix rotation instruction is high and the processing speed is fast.
The foregoing may be better understood in light of the following clauses:
条款H1、一种矩阵旋转指令处理装置,所述装置包括:Clause H1, a matrix rotation instruction processing device, the device comprising:
a control module, configured to parse a received matrix rotation instruction to obtain an operation code and an operation domain of the matrix rotation instruction, determine, according to the operation code and the operation domain, a matrix to be rotated and a target address required for executing the matrix rotation instruction, and determine a rotation angle by which the matrix to be rotated is to be rotated;
处理模块,根据所述旋转角度对所述待旋转矩阵进行旋转处理,得到旋转后矩阵,并将所述旋转后矩阵存入所述目标地址中,The processing module performs rotation processing on the matrix to be rotated according to the rotation angle to obtain a matrix after rotation, and stores the matrix after rotation into the target address,
其中,所述操作码用于指示所述矩阵旋转指令对矩阵数据所进行的处理为旋转处理,所述操作域包括所述待旋转矩阵地址和所述目标地址。Wherein, the operation code is used to indicate that the processing performed by the matrix rotation instruction on the matrix data is rotation processing, and the operation domain includes the matrix address to be rotated and the target address.
条款H2、根据条款H1所述的装置,所述操作域还包括待旋转矩阵的输入形状,Clause H2. The device according to Clause H1, the operation domain further includes an input shape of the matrix to be rotated,
所述处理模块,还用于根据所述输入形状以及所述旋转角度,对所述待旋转矩阵进行旋转处理,获得所述旋转后矩阵。The processing module is further configured to perform rotation processing on the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.
条款H3、根据条款H1所述的装置,所述操作域还包括旋转矩阵的输出形状,Clause H3. The device according to Clause H1, the operation domain further includes an output shape of a rotation matrix,
所述处理模块,还用于根据所述输出形状以及所述旋转角度,对所述待旋转矩阵进行旋转处理,获得所述旋转后矩阵。The processing module is further configured to perform rotation processing on the matrix to be rotated according to the output shape and the rotation angle to obtain the rotated matrix.
条款H4、根据条款H1所述的装置,所述操作域还用于指示旋转角度。Clause H4. The device according to Clause H1, the operation field is also used to indicate a rotation angle.
条款H5、根据条款H1所述的装置,所述操作码还用于指示旋转角度。Clause H5. The device according to Clause H1, the operation code is also used to indicate a rotation angle.
条款H6、根据条款H1所述的装置,Clause H6, the device according to Clause H1,
所述装置还包括:存储模块,用于存储所述待旋转矩阵,The device also includes a storage module for storing the matrix to be rotated,
其中,所述控制模块,包括:Wherein, the control module includes:
指令存储子模块,用于存储所述矩阵旋转指令;An instruction storage sub-module for storing the matrix rotation instruction;
指令处理子模块,用于对所述矩阵旋转指令进行解析,得到所述矩阵旋转指令的操作码和操作域;An instruction processing sub-module, which is used to analyze the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;
队列存储子模块,用于存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述矩阵旋转指令,A queue storage submodule, which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the matrix rotation instruction,
其中,所述控制模块,还包括:Wherein, the control module also includes:
a dependency relationship processing submodule, configured to, when it is determined that a first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with a zeroth instruction to be executed preceding the first instruction to be executed, cache the first instruction to be executed in the instruction storage submodule, and, after the zeroth instruction to be executed has been executed, fetch the first instruction to be executed from the instruction storage submodule and send it to the processing module,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
条款H7、一种机器学习运算装置,所述装置包括:Clause H7. A machine learning computing device, the device comprising:
one or more matrix rotation instruction processing devices according to any one of Clauses H1 to H6, configured to obtain a matrix to be rotated and control information from other processing devices, perform a specified machine learning operation, and transfer the execution result to other processing devices through an I/O interface;
当所述机器学习运算装置包含多个所述矩阵旋转指令处理装置时,所述多个所述矩阵旋转指令处理装置间可以通过特定的结构进行连接并传输数据;When the machine learning operation device includes a plurality of matrix rotation instruction processing devices, the plurality of matrix rotation instruction processing devices may be connected and transmit data through a specific structure;
其中,多个所述矩阵旋转指令处理装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算;多个所述矩阵旋转指令处理装置共享同一控制系统或拥有各自的控制系统;多个所述矩阵旋转指令处理装置共享内存或者拥有各自的内存;多个所述矩阵旋转指令处理装置的互联方式是任意互联拓扑。Among them, a plurality of the matrix rotation instruction processing apparatuses interconnect and transmit data through a PCIE bus that is a fast external device interconnection bus to support larger-scale machine learning operations; Or have their own control systems; a plurality of the matrix rotation instruction processing devices share memory or have their own memories; the interconnection method of the plurality of matrix rotation instruction processing devices is an arbitrary interconnection topology.
条款H8、一种组合处理装置,所述组合处理装置包括:Clause H8. A combined processing device, the combined processing device comprising:
如条款H7所述的机器学习运算装置、通用互联接口和其他处理装置;Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause H7;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作,The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,
其中,所述组合处理装置还包括:存储装置,该存储装置分别与所述机器学习运算装置和所述其他处理装置连接,用于保存所述机器学习运算装置和所述其他处理装置的数据。Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
条款H9、一种机器学习芯片,所述机器学习芯片包括:Clause H9. A machine learning chip, the machine learning chip includes:
如条款H7所述的机器学习运算装置或如条款H8所述的组合处理装置。The machine learning arithmetic device according to clause H7 or the combined processing device according to clause H8.
条款H10、一种电子设备,所述电子设备包括:Clause H10. An electronic device, the electronic device comprising:
如条款H9所述的机器学习芯片。Machine learning chip as described in clause H9.
条款H11、一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及如条款H9所述的机器学习芯片;Clause H11, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause H9;
其中,所述机器学习芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;
所述存储器件,用于存储数据;The storage device is used to store data;
所述接口装置,用于实现所述机器学习芯片与外部设备之间的数据传输;The interface device is used to realize data transmission between the machine learning chip and an external device;
所述控制器件,用于对所述机器学习芯片的状态进行监控。The control device is used for monitoring the state of the machine learning chip.
条款H12、一种矩阵旋转指令处理方法,所述方法应用于矩阵旋转指令处理装置,所述方法包括:Clause H12. A method of processing matrix rotation instructions. The method is applied to a device for processing matrix rotation instructions. The method includes:
对接收到的矩阵旋转指令进行解析,获得所述矩阵旋转指令的操作码和操作域,并根据所述操作码和所述操作域确定执行所述矩阵旋转指令所需的待旋转矩阵和目标地址,以及确定对待旋转矩阵进行旋转的旋转角度;Parse the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and the operation domain , And determine the rotation angle of the matrix to be rotated;
根据所述旋转角度对所述待旋转矩阵进行旋转处理,得到旋转后矩阵,并将所述旋转后矩阵存入所述目标地址中,Rotating the matrix to be rotated according to the rotation angle to obtain a matrix after rotation, and storing the matrix after rotation into the target address,
其中,所述操作码用于指示所述矩阵旋转指令对矩阵数据所进行的处理为旋转处理,所述操作域包括所述待旋转矩阵地址和所述目标地址。Wherein, the operation code is used to indicate that the processing performed by the matrix rotation instruction on the matrix data is rotation processing, and the operation domain includes the matrix address to be rotated and the target address.
条款H13、根据条款H12所述的方法,所述操作域还包括待旋转矩阵的输入形状,Clause H13. The method according to Clause H12, the operation domain further includes an input shape of the matrix to be rotated,
其中,根据所述旋转角度对所述待旋转矩阵进行旋转处理,得到旋转后矩阵,包括:Wherein, rotating the matrix to be rotated according to the rotation angle to obtain a matrix after rotation includes:
根据所述输入形状以及所述旋转角度,对所述待旋转矩阵进行旋转处理,获得所述旋转后矩阵。Rotate the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.
条款H14、根据条款H12所述的方法,所述操作域还包括旋转矩阵的输出形状,Clause H14. The method according to Clause H12, the operation domain further includes an output shape of a rotation matrix,
其中,根据所述旋转角度对所述待旋转矩阵进行旋转处理,得到旋转后矩阵,包括:Wherein, rotating the matrix to be rotated according to the rotation angle to obtain a matrix after rotation includes:
根据所述输出形状以及所述旋转角度,对所述待旋转矩阵进行旋转处理,获得所述旋转后矩阵。Rotate the matrix to be rotated according to the output shape and the rotation angle to obtain the rotated matrix.
条款H15、根据条款H12所述的方法,所述操作域还用于指示旋转角度。Clause H15. The method according to Clause H12, the operation field is also used to indicate a rotation angle.
Clause H16. The method according to Clause H12, wherein the operation code is further used to indicate the rotation angle.
条款H17、根据条款H12所述的方法,Clause H17, according to the method described in Clause H12,
所述方法还包括:存储所述待旋转矩阵,The method further includes: storing the matrix to be rotated,
其中,对接收到的矩阵旋转指令进行解析,获得所述矩阵旋转指令的操作码和操作域,包括:Wherein, analyzing the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction includes:
存储所述矩阵旋转指令;Store the matrix rotation instruction;
对所述矩阵旋转指令进行解析,得到所述矩阵旋转指令的操作码和操作域;Parse the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;
存储指令队列,所述指令队列包括按照执行顺序依次排列的多个待执行指令,所述多个待执行指令包括所述矩阵旋转指令,An instruction queue is stored, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix rotation instruction,
其中,所述方法还包括:Wherein, the method further includes:
在确定所述多个待执行指令中的第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系时,缓存所述第一待执行指令,并在确定所述第零待执行指令执行完毕后,控制进行所述第一待执行指令的执行,When it is determined that the first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, the first instruction to be executed is cached, and the After the execution of the zeroth to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,
其中,所述第一待执行指令与所述第一待执行指令之前的第零待执行指令存在依赖关系包括:The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:
存储所述第一待执行指令所需数据的第一存储地址区间与存储所述第零待执行指令所需数据的第零存储地址区间具有重叠的区域。The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
An embodiment of the present disclosure further provides a computer-readable storage medium on which computer program instructions are stored, where the computer program instructions, when executed by a processor, implement the above method. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is configured to call the instructions stored in the memory to execute the above method.
The present disclosure provides a machine learning operation device, which may include one or more of the above instruction processing devices and is configured to acquire data to be operated on and control information from other processing devices and perform a specified machine learning operation. The machine learning operation device may obtain instructions from other machine learning operation devices or non-machine-learning operation devices, and transfer execution results through an I/O interface to peripheral devices (which may also be called other processing devices), such as cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one instruction processing device is included, the instruction processing devices may be linked and transmit data through a specific structure, for example, interconnected through a PCIE bus, to support larger-scale neural network operations. In this case, the devices may share the same control system or have their own independent control systems; they may share memory, or each accelerator may have its own memory. In addition, the interconnection may use any interconnection topology.
The machine learning operation device has high compatibility and can be connected to various types of servers through the PCIE interface.
FIG. 10a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in FIG. 10a, the combined processing device includes the above machine learning operation device, a universal interconnection interface, and other processing devices. The machine learning operation device interacts with the other processing devices to jointly complete an operation specified by the user.
The other processing devices include one or more types of general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning operation device and external data and control, including data transfer, and perform basic control of the machine learning operation device such as starting and stopping; the other processing devices may also cooperate with the machine learning operation device to complete operation tasks.
The universal interconnection interface is used to transmit data and control instructions between the machine learning operation device and the other processing devices. The machine learning operation device acquires required input data from the other processing devices and writes it to an on-chip storage device of the machine learning operation device; it may acquire control instructions from the other processing devices and write them to an on-chip control cache of the machine learning operation device; it may also read data from a storage module of the machine learning operation device and transmit the data to the other processing devices.
FIG. 10b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 10b, the combined processing device may further include a storage device, which is connected to the machine learning operation device and the other processing devices respectively. The storage device is used to store data of the machine learning operation device and the other processing devices, and is particularly suitable for data to be operated on that cannot be entirely stored in the internal storage of the machine learning operation device or the other processing devices.
The combined processing device can serve as a system-on-chip (SoC) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning operation device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
The present disclosure provides a board card. FIG. 11 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in FIG. 11, the board card includes the above machine learning chip package structure or the above machine learning chip. In addition to the machine learning chip 389, the board card may further include other supporting components, including but not limited to: a storage device 390, an interface device 391, and a control device 392.
The storage device 390 is connected to the machine learning chip 389 (or the machine learning chip inside the machine learning chip package structure) through a bus and is used to store data. The storage device 390 may include multiple groups of storage units 393. Each group of storage units 393 is connected to the machine learning chip 389 through a bus. It can be understood that each group of storage units 393 may be a DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
In one embodiment, the storage device 390 may include 4 groups of storage units 393. Each group of storage units 393 may include a plurality of DDR4 particles (chips). In one embodiment, the machine learning chip 389 may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of storage units 393, the theoretical bandwidth of data transmission can reach 25600 MB/s.
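As an illustrative check of the 25600 MB/s figure (this arithmetic is not part of the original text): a DDR4-3200 channel performs 3200 million transfers per second over the 64 data bits of each 72-bit controller, so per channel

$$3200\,\mathrm{MT/s} \times 64\,\mathrm{bit} \div 8\,\mathrm{bit/byte} = 25600\,\mathrm{MB/s}.$$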
In one embodiment, each group of storage units 393 includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 and is used to control data transmission and data storage of each storage unit 393.
The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip inside the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (for example, a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transferred from the server to the machine learning chip 389 through the standard PCIE interface to realize data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the present disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is still transmitted back to the external device (for example, a server) by the interface device.
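Similarly, as an illustrative check of the 16000 MB/s figure (this arithmetic is not part of the original text): a PCIE 3.0 lane transfers 8 GT/s with 128b/130b encoding, so sixteen lanes give approximately

$$16 \times 8\,\mathrm{GT/s} \times \tfrac{128}{130} \div 8\,\mathrm{bit/byte} \approx 15.8\,\mathrm{GB/s} \approx 16000\,\mathrm{MB/s}.$$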
The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a micro controller unit (MCU). Since the machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, it may drive multiple loads and may therefore be in different working states such as multi-load and light-load. Through the control device, the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the machine learning chip can be regulated.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an airplane, a ship, and/or a car. The household appliance may include a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present disclosure, it should be understood that the disclosed system and device may be implemented in other ways. For example, the system and device embodiments described above are merely illustrative. For example, the division of devices, apparatuses, and modules is only a logical function division, and there may be other divisions in actual implementation; for example, multiple modules may be combined or integrated into another system or device, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through certain interfaces, devices, apparatuses, or modules, and may be electrical or in other forms.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional modules in the embodiments of the present disclosure may be integrated into one processing unit, each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software program module.
If the integrated module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
A person of ordinary skill in the art may understand that all or some of the steps in the various methods of the above embodiments may be completed by a program instructing related hardware. The program may be stored in a computer-readable memory, and the memory may include a flash drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all described as a series of combinations of actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure, some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
It should be further noted that although the steps in the flowcharts of FIG. 2-4, FIG. 3-4, FIG. 4-4, FIG. 5-4, FIG. 6-6, FIG. 7-4, FIG. 8-4, and FIG. 9-4 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 2-4, FIG. 3-4, FIG. 4-4, FIG. 5-4, FIG. 6-6, FIG. 7-4, FIG. 8-4, and FIG. 9-4 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time but may be performed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (21)

  1. A vector search instruction processing device, characterized in that the device comprises:
    a control module, configured to parse a received vector search instruction, obtain an operation code and an operation domain of the vector search instruction, and determine, according to the operation code and the operation domain, a vector to be searched, a search condition, and a target address required to execute the vector search instruction; and
    an operation module, configured to sequentially determine whether a plurality of numbers to be checked that represent the vector to be searched satisfy the search condition, determine a number to be checked that satisfies the search condition as a target number, and store a storage address of the target number at the target address as a search result,
    wherein the operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes an address of the vector to be searched and the target address.
  2. The device according to claim 1, characterized in that the operation domain further includes an input length, and
    the control module is further configured to acquire the vector to be searched from the address of the vector to be searched according to the input length.
  3. The device according to claim 1, characterized in that the operation domain further includes a width of the numbers to be checked, and
    the operation module is further configured to determine the plurality of numbers to be checked from the vector to be searched according to the width of the numbers to be checked.
  4. The device according to claim 1, characterized in that the operation domain further includes the search condition, and
    the control module is further configured to determine the search condition according to the operation domain.
  5. The device according to claim 1, characterized in that
    the control module is further configured to determine the search condition according to the operation code, wherein the operation code is further used to indicate the search condition of the vector search instruction.
  6. The device according to claim 1, characterized in that the operation module comprises:
    at least one comparator, configured to compare the plurality of numbers to be checked with the search condition to obtain comparison results, so as to determine, according to the comparison results, whether a number to be checked satisfies the search condition.
  7. The device according to any one of claims 1 to 6, characterized in that a number to be checked that satisfies the search condition includes at least one of the following:
    a number to be checked whose value is a specified multiple of a specified value and whose rank is a specified rank;
    a number to be checked whose value falls within a specified value interval; and
    a number to be checked whose value is a specified multiple of a specified value,
    wherein the specified rank includes at least one of the following:
    the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; and
    the rank of the number to be checked is the m-th from last among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
    wherein m and n are less than or equal to the quantity of numbers to be checked in the vector to be searched.
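For illustration only, the search conditions of claim 7 can be modeled in software as follows; the claim covers a hardware operation module, and the function names, the use of list indices as stand-ins for storage addresses, and the closed value interval are assumptions of this sketch.

```python
# Illustrative model of the search conditions of claim 7 (not the patented hardware).
# Here a "number to be checked" is an element of the vector to be searched, and list
# indices stand in for the storage addresses returned as the search result.

def find_multiples(vector, base, k):
    """Indices of the numbers to be checked whose value equals the specified multiple k of base."""
    return [i for i, v in enumerate(vector) if v == base * k]

def find_in_interval(vector, low, high):
    """Indices of the numbers to be checked whose value lies in the interval [low, high]."""
    return [i for i, v in enumerate(vector) if low <= v <= high]

def find_ranked_multiple(vector, base, k, n=None, m=None):
    """The n-th match from the front, or the m-th match from the end, among the multiples."""
    matches = find_multiples(vector, base, k)
    if n is not None and n <= len(matches):
        return matches[n - 1]
    if m is not None and m <= len(matches):
        return matches[-m]
    return None

vec = [3, 6, 9, 4, 6, 12]
print(find_multiples(vec, 3, 2))              # values equal to 2 * 3  -> indices [1, 4]
print(find_in_interval(vec, 4, 9))            # values in [4, 9]       -> indices [1, 2, 3, 4]
print(find_ranked_multiple(vec, 3, 2, n=2))   # 2nd matching multiple  -> index 4
print(find_ranked_multiple(vec, 3, 2, m=1))   # last matching multiple -> index 4
```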
  8. The device according to claim 1, characterized in that
    the device further comprises: a storage module, configured to store the vector to be searched,
    wherein the control module comprises:
    an instruction storage sub-module, configured to store the vector search instruction;
    an instruction processing sub-module, configured to parse the vector search instruction to obtain the operation code and the operation domain of the vector search instruction; and
    a queue storage sub-module, configured to store an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in order of execution, the plurality of instructions to be executed including the vector search instruction,
    wherein the control module further comprises:
    a dependency processing sub-module, configured to: when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed preceding the first instruction to be executed, cache the first instruction to be executed in the instruction storage sub-module, and after execution of the zeroth instruction to be executed is completed, extract the first instruction to be executed from the instruction storage sub-module and send it to the operation module,
    wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed preceding the first instruction to be executed includes:
    a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
  9. A machine learning operation device, characterized in that the device comprises:
    one or more vector search instruction processing devices according to any one of claims 1 to 8, configured to acquire data to be operated on and control information from other processing devices, perform a specified machine learning operation, and transfer execution results to the other processing devices through an I/O interface;
    when the machine learning operation device includes a plurality of the vector search instruction processing devices, the plurality of vector search instruction processing devices may be connected and transmit data through a specific structure;
    wherein the plurality of vector search instruction processing devices are interconnected and transmit data through a peripheral component interconnect express (PCIE) bus to support larger-scale machine learning operations; the plurality of vector search instruction processing devices share the same control system or have their own control systems; the plurality of vector search instruction processing devices share memory or have their own memories; and the interconnection of the plurality of vector search instruction processing devices may use any interconnection topology.
  10. A combined processing device, characterized in that the combined processing device comprises:
    the machine learning operation device according to claim 9, a universal interconnection interface, and other processing devices;
    the machine learning operation device interacts with the other processing devices to jointly complete a calculation operation specified by a user,
    wherein the combined processing device further comprises: a storage device, which is connected to the machine learning operation device and the other processing devices respectively and is used to store data of the machine learning operation device and the other processing devices.
  11. A machine learning chip, characterized in that the machine learning chip comprises:
    the machine learning operation device according to claim 9 or the combined processing device according to claim 10.
  12. An electronic device, characterized in that the electronic device comprises:
    the machine learning chip according to claim 11.
  13. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the machine learning chip according to claim 11;
    wherein the machine learning chip is connected to the storage device, the control device, and the interface device respectively;
    the storage device is used to store data;
    the interface device is used to implement data transmission between the machine learning chip and an external device; and
    the control device is used to monitor the state of the machine learning chip.
  14. A vector search instruction processing method, characterized in that the method is applied to a vector search instruction processing device, and the method comprises:
    parsing a received vector search instruction, obtaining an operation code and an operation domain of the vector search instruction, and determining, according to the operation code and the operation domain, a vector to be searched, a search condition, and a target address required to execute the vector search instruction; and
    sequentially determining whether a plurality of numbers to be checked that represent the vector to be searched satisfy the search condition, determining a number to be checked that satisfies the search condition as a target number, and storing a storage address of the target number at the target address as a search result,
    wherein the operation code is used to indicate that the operation performed by the vector search instruction on vector data is a search operation, and the operation domain includes an address of the vector to be searched and the target address.
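For illustration only, the method of claim 14 can be modeled end to end in software as follows; the instruction encoding, the field and function names, and the dictionary memory model are assumptions of this sketch, not the claimed instruction format.

```python
# Illustrative end-to-end flow of claim 14 (a software model only): parse the
# instruction fields, test each number to be checked in order against the search
# condition, and store the addresses of the matches at the target address.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class VectorSearchInstruction:
    opcode: str          # indicates a search operation on vector data
    vector_addr: int     # operation domain: address of the vector to be searched
    target_addr: int     # operation domain: target address for the search result
    length: int          # optional operation-domain field: input length

def execute_vector_search(inst: VectorSearchInstruction,
                          memory: Dict[int, List[int]],
                          condition: Callable[[int], bool]) -> None:
    assert inst.opcode == "VSEARCH"                    # search operation
    vector = memory[inst.vector_addr][:inst.length]    # vector to be searched
    result = [inst.vector_addr + i                     # storage address of each target number
              for i, value in enumerate(vector) if condition(value)]
    memory[inst.target_addr] = result                  # search result

memory = {0x100: [5, 8, 10, 3, 10], 0x200: []}
inst = VectorSearchInstruction("VSEARCH", vector_addr=0x100, target_addr=0x200, length=5)
execute_vector_search(inst, memory, condition=lambda v: v == 10)
print(memory[0x200])   # prints [258, 260], i.e. addresses 0x102 and 0x104
```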
  15. The method according to claim 14, characterized in that the operation domain further includes an input length,
    wherein determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
    acquiring the vector to be searched from the address of the vector to be searched according to the input length.
  16. The method according to claim 14, characterized in that the operation domain further includes a width of the numbers to be checked, and the method further comprises:
    determining the plurality of numbers to be checked from the vector to be searched according to the width of the numbers to be checked.
  17. The method according to claim 14, characterized in that the operation domain further includes the search condition,
    wherein determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
    determining the search condition according to the operation domain.
  18. The method according to claim 14, characterized in that determining, according to the operation code and the operation domain, the vector to be searched, the search condition, and the target address required to execute the vector search instruction includes:
    determining the search condition according to the operation code, where the operation code is further used to indicate the search condition of the vector search instruction.
  19. The method according to claim 14, characterized in that sequentially determining whether the plurality of numbers to be checked that represent the vector to be searched satisfy the search condition includes:
    comparing the plurality of numbers to be checked with the search condition by using at least one comparator to obtain comparison results, so as to determine, according to the comparison results, whether a number to be checked satisfies the search condition.
  20. The method according to any one of claims 14 to 19, characterized in that a number to be checked that satisfies the search condition includes at least one of the following:
    a number to be checked whose value is a specified multiple of a specified value and whose rank is a specified rank;
    a number to be checked whose value falls within a specified value interval; and
    a number to be checked whose value is a specified multiple of a specified value,
    wherein the specified rank includes at least one of the following:
    the rank of the number to be checked is the n-th among the numbers to be checked whose values are the specified multiple of the specified value, where n is a positive integer greater than or equal to 1; and
    the rank of the number to be checked is the m-th from last among the numbers to be checked whose values are the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,
    wherein m and n are less than or equal to the quantity of numbers to be checked in the vector to be searched.
  21. The method according to claim 14, characterized in that
    the method further comprises: storing the vector to be searched,
    wherein parsing the received vector search instruction to obtain the operation code and the operation domain of the vector search instruction includes:
    storing the vector search instruction;
    parsing the vector search instruction to obtain the operation code and the operation domain of the vector search instruction; and
    storing an instruction queue, the instruction queue including a plurality of instructions to be executed arranged in order of execution, the plurality of instructions to be executed including the vector search instruction,
    wherein the method further comprises:
    when it is determined that a first instruction to be executed among the plurality of instructions to be executed has an association relationship with a zeroth instruction to be executed preceding the first instruction to be executed, caching the first instruction to be executed, and after it is determined that execution of the zeroth instruction to be executed is completed, controlling execution of the first instruction to be executed,
    wherein the first instruction to be executed having an association relationship with the zeroth instruction to be executed preceding the first instruction to be executed includes:
    a first storage address interval storing data required by the first instruction to be executed overlaps a zeroth storage address interval storing data required by the zeroth instruction to be executed.
PCT/CN2019/120893 2018-11-30 2019-11-26 Computing method and apparatus, and related product WO2020108471A1 (en)

Applications Claiming Priority (16)

Application Number Priority Date Filing Date Title
CN201811456735.XA CN111258641B (en) 2018-11-30 2018-11-30 Operation method, device and related product
CN201811456735.X 2018-11-30
CN201910001865.2A CN111400341B (en) 2019-01-02 2019-01-02 Scalar lookup instruction processing method and device and related product
CN201910001855.9A CN111399905B (en) 2019-01-02 2019-01-02 Operation method, device and related product
CN201910001865.2 2019-01-02
CN201910001855.9 2019-01-02
CN201910294130.3 2019-04-12
CN201910293748.8 2019-04-12
CN201910293190.3A CN111813376A (en) 2019-04-12 2019-04-12 Operation method, device and related product
CN201910293748.8A CN111813448A (en) 2019-04-12 2019-04-12 Operation method, device and related product
CN201910293770.2A CN111813537A (en) 2019-04-12 2019-04-12 Operation method, device and related product
CN201910293190.3 2019-04-12
CN201910293777.4 2019-04-12
CN201910293777.4A CN111813449A (en) 2019-04-12 2019-04-12 Operation method, device and related product
CN201910294130.3A CN111813450A (en) 2019-04-12 2019-04-12 Operation method, device and related product
CN201910293770.2 2019-04-12

Publications (1)

Publication Number Publication Date
WO2020108471A1 true WO2020108471A1 (en) 2020-06-04

Family

ID=70853863

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120893 WO2020108471A1 (en) 2018-11-30 2019-11-26 Computing method and apparatus, and related product

Country Status (1)

Country Link
WO (1) WO2020108471A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120011348A1 (en) * 2010-07-12 2012-01-12 International Business Machines Corporation Matrix Multiplication Operations Using Pair-Wise Load and Splat Operations
CN108388446A (en) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Computing module and method
CN108629411A (en) * 2018-05-07 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of convolution algorithm hardware realization apparatus and method


Similar Documents

Publication Publication Date Title
CN110096309B (en) Operation method, operation device, computer equipment and storage medium
US11675785B2 (en) Dynamic asynchronous traversals for distributed graph queries
CN110096310B (en) Operation method, operation device, computer equipment and storage medium
JP7074832B2 (en) Network-on-chip data processing methods and equipment
WO2023093623A1 (en) Computation graph optimization method, data processing method and related product
GB2568086A (en) Hardware implementation of convolution layer of deep neutral network
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
KR102539571B1 (en) Network-on-chip data processing method and device
WO2022247880A1 (en) Method for fusing operators of neural network, and related product
KR102539572B1 (en) Network-on-chip data processing method and device
Sun et al. Multi-node acceleration for large-scale GCNs
WO2021027972A1 (en) Data synchronization method and apparatus and related product
WO2021018313A1 (en) Data synchronization method and apparatus, and related product
WO2020108471A1 (en) Computing method and apparatus, and related product
KR102539573B1 (en) Network-on-chip data processing method and device
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
WO2021233187A1 (en) Method and device for allocating storage addresses for data in memory
WO2021027973A1 (en) Data synchronization method and device, and related products
KR102539574B1 (en) Network-on-chip data processing method and device
CN116185378A (en) Optimization method of calculation graph, data processing method and related products
WO2018188416A1 (en) Data search method and apparatus, and related devices
CN111047030A (en) Operation method, operation device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111124497B (en) Operation method, operation device, computer equipment and storage medium
CN112395008A (en) Operation method, operation device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19890411

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/07/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19890411

Country of ref document: EP

Kind code of ref document: A1