WO2020108471A1

WO2020108471A1 - Computing method and apparatus, and related product

Info

Publication number: WO2020108471A1
Application number: PCT/CN2019/120893
Authority: WO
Inventors: 张尧; 陈煜�; 刘少礼; 曾洪博; 于涌; 陶劲桦; 李震; 韩栋
Original assignee: 上海寒武纪信息科技有限公司
Priority date: 2018-11-30
Filing date: 2019-11-26
Publication date: 2020-06-04

Abstract

The present application relates to a computing method and apparatus, and a related product. A board comprises: a storage device, an interface apparatus, a control device, and a machine learning chip. The machine learning chip is connected to the storage device, the control device, and the interface apparatus, separately. The storage device is used for storing data. The interface apparatus is used for performing data transmission between the machine learning chip and an external device. The control device is used for monitoring the state of the machine learning chip. The computing method and apparatus, and the related product provided in embodiments of the present application are widely applied, and have high processing efficiency and fast processing speed of an instruction.

Description

Calculation method, device and related products

Technical field

The present disclosure relates to the field of computer technology, and in particular, to an arithmetic method, device, and related products.

Background technique

With the continuous development of science and technology, machine learning, especially the use of neural network algorithms is becoming more and more widely used. It has been well applied in the fields of image recognition, speech recognition, natural language processing and so on. However, due to the increasing complexity of neural network algorithms, the types and number of data operations involved are increasing. In the related art, it is inefficient and slow in processing, searching, and accumulating vectors, scalars, matrices, tensors, and resource locks.

Summary of the invention

In view of this, the present disclosure proposes an arithmetic method, device and related products.

In order to improve the efficiency and speed of performing search operations on vectors, according to an aspect of the present disclosure, a vector search instruction processing device is provided, and the device includes:

The control module is used to parse the received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine the standby required to execute the vector search instruction according to the operation code and the operation domain Search vector, search condition and target address;

The operation module is used to sequentially determine whether a plurality of check numbers representing the search vector satisfy the search condition, determine the check number satisfying the search condition as the target number, and store the storage address of the target number Store in the target address as a search result,

Wherein, the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.

According to another aspect of the present disclosure, a machine learning computing device is provided, the device including:

One or more of the above vector search instruction processing devices are used to obtain data and control information to be calculated from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of the vector search instruction processing devices, the plurality of vector search instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the vector search instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the vector search instruction processing devices share the same control system Or have their own control systems; a plurality of the vector search instruction processing devices share memory or have their own memory; the interconnection method of the plurality of vector search instruction processing devices is an arbitrary interconnection topology.

According to another aspect of the present disclosure, a combined processing device is provided, the device comprising:

The above-mentioned machine learning computing device, universal interconnection interface and other processing devices;

The machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.

According to another aspect of the present disclosure, a machine learning chip is provided. The machine learning chip includes the above machine learning network operation device or the above combination processing device.

According to another aspect of the present disclosure, there is provided a machine learning chip packaging structure including the above machine learning chip.

According to another aspect of the present disclosure, there is provided a board card including the above machine learning chip packaging structure.

According to another aspect of the present disclosure, there is provided an electronic device including the aforementioned machine learning chip or the aforementioned board.

The vector search instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and an arithmetic module. The control module is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, search condition, and target address required to execute the vector search instruction according to the operation code and operation domain. The operation module is used to sequentially determine whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number satisfying the search condition as the target number, and store the target number's storage address as the search result in the target address. The vector search instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of application, and have high processing efficiency and fast processing speed for the vector search instruction.

In order to improve the efficiency and speed of performing a search operation on a scalar, according to an aspect of the present disclosure, a scalar search instruction processing device is provided, and the device includes:

The control module is used to parse the received scalar search instruction, obtain the operation code and the operation domain of the scalar search instruction, and determine the standby required to execute the scalar search instruction according to the operation code and the operation domain Find scalar, specified value, specified sort and target address;

The arithmetic module is used to sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, and determine the to-be-checked numbers whose value is equal to the specified value and sorted into the specified sort are The target number, the storage address of the target number is stored in the target address as a search result,

The operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation field includes the scalar address to be searched and the target address.

One or more of the above scalar search instruction processing devices, used to obtain the data to be operated and control information from other processing devices, and perform the specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning computing device includes a plurality of the scalar search instruction processing devices, the plurality of scalar search instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the scalar search instruction processing devices interconnect and transmit data through a PCIE bus, a fast external device interconnection bus, to support larger-scale machine learning operations; a plurality of the scalar search instruction processing devices share the same control system Or have their own control systems; a plurality of the scalar search instruction processing devices share memory or have their own memories; the interconnection method of the plurality of scalar search instruction processing devices is an arbitrary interconnection topology.

According to another aspect of the present disclosure, a scalar search instruction processing method is provided. The method is applied to a scalar search instruction processing device. The method includes:

Parse the received scalar search instruction to obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched and the specified value required to execute the scalar search instruction according to the operation code and the operation domain 、Specify sorting and target address;

In turn, determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, and determine that the to-be-checked numbers whose value is equal to the specified value and sorted into the specified sort are the target number, The storage address of the target number is stored in the target address as a search result,

The scalar search instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and an arithmetic module. The control module is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched, the specified value, the specified order and the required scalar search instruction according to the operation code and the operation domain. target address. The arithmetic module is used to sequentially determine whether the values of the multiple check numbers representing the scalar to be searched are equal to the specified value, determine the check number that is equal to the specified value and sorted into the specified sort as the target number, and determine the storage address of the target number The target address is stored as the search result. The scalar search instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of application, and have high processing efficiency and fast processing speed for the scalar search instruction.

In order to improve the efficiency and speed of the process of locking and releasing resources, according to the first aspect of the present disclosure, there is provided a resource lock and release instruction processing device, the device comprising:

The control module is configured to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the instruction indicated by the resource lock instruction according to the operation code and the operation domain Resources to be processed, and determine the lock strategy required for resource lock processing;

A processing module, configured to lock or release the resource to be processed according to the lock and release strategy to obtain the processed resource,

Wherein, the operation code is used to indicate that the resource lock instruction performs processing on the resource as locking or releasing processing, and the operation domain includes the resource identifier to be processed.

One or more of the above resource lock instruction processing devices, used to obtain data to be calculated and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of the resource lock instruction processing devices, the plurality of resource lock instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the resource lock and put instruction processing apparatuses interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the resource lock and put instruction processing apparatuses share the same The control system may have its own control system; a plurality of the resource lock instruction processing devices share memory or have their own memories; the interconnection method of the plurality of resource lock instruction processing devices is any interconnected topology.

According to another aspect of the present disclosure, a method for processing a resource lock instruction is provided. The method is applied to a device for processing a resource lock instruction. The method includes:

Parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and the operation domain, And determine the lock strategy required for resource lock processing;

Lock or release the resources to be processed according to the lock and put strategy to obtain the processed resources,

The resource lock instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and an arithmetic module. The control module is used to analyze the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, and determine the resource to be processed Locking strategy required for lock handling. The processing module is used for locking or releasing the resources to be processed according to the lock and put strategy, to obtain the processed resources. The resource lock instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of applications, and the processing efficiency of locking and releasing resources according to the resource lock instruction is high and the processing speed is high.

In order to improve the efficiency and speed of the rearrangement processing of tensors, according to an aspect of the present disclosure, a tensor rearrangement instruction processing device is provided, and the device includes:

The control module is configured to parse the received tensor rearrangement instruction, obtain an operation code and an operation domain of the tensor rearrangement instruction, and determine to execute the tensor rearrangement according to the operation code and the operation domain The tensor and target address required for the reordering instruction, and the reordering strategy required for reordering;

The processing module performs rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearrangement tensor, and stores the rearrangement tensor into the target address,

Wherein, the operation code is used to indicate that the processing performed by the tensor rearrangement instruction on the tensor data is rearrangement processing, and the operation field includes the to-be-processed tensor address and the target address.

One or more of the above tensor reordering instruction processing devices, used to obtain to-be-processed tensors and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface ;

When the machine learning operation device includes a plurality of the tensor rearrangement instruction processing devices, the plurality of the tensor rearrangement instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the tensor rearrangement instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the tensor rearrangement instruction processing devices Sharing the same control system or having its own control system; a plurality of the tensor rearrangement instruction processing devices share memory or have their own memories; the interconnection method of the plurality of tensor rearrangement instruction processing devices is any interconnection topology.

According to another aspect of the present disclosure, a tensor rearrangement instruction processing method is provided. The method is applied to a tensor rearrangement instruction processing apparatus. The method includes:

Analyze the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the requirements for executing the tensor rearrangement instruction according to the operation code and the operation domain Tensors and target addresses to be processed, and the rearrangement strategy required for rearrangement processing;

Performing rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearrangement tensor, and storing the rearrangement tensor into the target address,

The tensor rearrangement instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module. The control module is used to analyze the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the pending tensor required to execute the tensor rearrangement instruction according to the operation code and operation domain And the target address, and determine the rearrangement strategy required for rearrangement processing. The processing module is used to rearrange the to-be-processed tensor according to the rearrangement strategy to obtain the rearranged tensor, and store the rearranged tensor into the target address. The tensor rearrangement instruction processing method, device and related products provided by the embodiments of the present disclosure can realize the rearrangement processing of tensor data through one tensor rearrangement instruction, and the related art realizes tensor through multiple instructions Compared with the process of data rearrangement, the rearrangement of tensor data has high processing efficiency, fast processing speed, and wide application range.

In order to solve the problem of ensuring calculation accuracy and reducing the amount of data access and the amount of calculation cannot be satisfied at the same time, according to an aspect of the present disclosure, a data processing device is provided, the device is used to perform machine learning calculations, and the device includes a control module And a processing module, the processing module includes a data transfer submodule and an accumulation submodule:

The control module is used to obtain a calculation instruction and obtain input data required to execute the calculation instruction;

The data transfer sub-module is configured to process the input data according to the calculation instruction to obtain multiple intermediate results, and sequentially send the multiple intermediate results to the accumulation sub-module;

The accumulation submodule is used to perform a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.

One or more of the above data processing devices are used to obtain input data and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning computing device includes a plurality of the data processing devices, the data processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the data processing apparatuses interconnect and transmit data through a PCIE bus, a fast external device interconnection bus, to support larger-scale machine learning operations; a plurality of the data processing apparatuses share the same control system or have their own Control system; multiple data processing devices share memory or have their own memory; multiple data processing devices are interconnected in any interconnection topology.

According to another aspect of the present disclosure, a data processing method is provided. The method is applied to a data processing device, and the device is used to perform machine learning calculations. The method includes:

Obtaining calculation instructions, and obtaining input data required to execute the calculation instructions;

Processing the input data according to the calculation instruction to obtain multiple intermediate results, and sending the multiple intermediate results in sequence;

Performing a cyclic accumulation operation on the plurality of intermediate results to obtain the calculation result of the calculation instruction.

The data processing device, method and related products provided by the embodiments of the present disclosure include: a control module and a processing module. The processing module includes a data transfer submodule and an accumulation submodule. The control module is used to obtain calculation instructions and obtain input data required to execute the calculation instructions. The data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, and send the multiple intermediate results to the accumulation sub-module in sequence. The accumulation submodule is used to perform a cyclic accumulation operation on multiple intermediate results to obtain the calculation result of the calculation instruction. The data processing device, method and related products provided by the embodiments of the present disclosure reduce the amount of data access and calculation by cyclically accumulating multiple intermediate results, while ensuring the accuracy of calculation is not damaged, and can effectively increase the speed of data processing .

In order to improve the efficiency and speed of performing symmetric processing on the matrix, according to an aspect of the present disclosure, a matrix symmetric instruction processing device is provided, and the device includes:

The control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain Symmetry strategy required for symmetric processing;

The processing module performs symmetric processing on the matrix to be processed according to the symmetric strategy to obtain a symmetric matrix, and stores the symmetric matrix into the target address,

Wherein, the operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation field includes the to-be-processed matrix address and the target address.

One or more of the above matrix symmetric instruction processing devices are used to obtain the to-be-processed matrix and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of matrix symmetric instruction processing devices, the plurality of matrix symmetric instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the matrix symmetric instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the matrix symmetric instruction processing devices share the same control system Or have their own control systems; a plurality of the matrix symmetric instruction processing devices share memory or have their own memories; the interconnection method of the plurality of matrix symmetric instruction processing devices is any interconnection topology.

According to another aspect of the present disclosure, a matrix symmetric instruction processing method is provided. The method is applied to a matrix symmetric instruction processing device. The method includes:

Parse the received matrix symmetric instruction to obtain the operation code and operation domain of the matrix symmetric instruction, and determine the to-be-processed matrix and the target address required to execute the matrix symmetric instruction according to the operation code and the operation domain , And determine the symmetrical strategy required for symmetrical processing;

Performing symmetric processing on the matrix to be processed according to the symmetric strategy to obtain a symmetric matrix, and storing the symmetric matrix into the target address,

The matrix symmetric instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module. The control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain, and determine the progress Symmetry strategy required for symmetric processing. The processing module is used to perform symmetric processing on the processing matrix according to a symmetric strategy to obtain the symmetric matrix, and store the symmetric matrix into the target address. The matrix symmetric instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of application, and the symmetric processing of the matrix according to the matrix symmetric instruction has high processing efficiency and fast processing speed.

In order to improve the efficiency and speed of the mirroring processing of the matrix, according to an aspect of the present disclosure, a matrix mirroring instruction processing device is provided. The device includes:

The control module is used to parse the received matrix mirroring instruction, obtain the operation code and the operation domain of the matrix mirroring instruction, and determine the standby required to execute the matrix mirroring instruction according to the operation code and the operation domain Mirroring matrix and target address, and determining the mirroring strategy required for mirroring processing;

The processing module performs mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain a mirrored matrix, and stores the mirrored matrix in the target address,

Wherein, the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix data is mirror processing, and the operation domain includes the matrix address to be mirrored and the target address.

One or more of the above matrix mirroring instruction processing devices are used to obtain the to-be-mirrored matrix and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of the matrix mirroring instruction processing devices, the plurality of matrix mirroring instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the matrix mirroring instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the matrix mirroring instruction processing devices share the same control system Or have their own control systems; a plurality of the matrix mirroring instruction processing devices share memory or have their own memories; the interconnection method of the plurality of matrix mirroring instruction processing devices is an arbitrary interconnection topology.

According to another aspect of the present disclosure, a matrix mirroring instruction processing method is provided. The method is applied to a matrix mirroring instruction processing apparatus. The method includes:

Parse the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction, and determine the matrix to be mirrored and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain , And determine the mirroring strategy required for mirroring;

Mirroring the matrix to be mirrored according to the mirroring strategy to obtain a mirrored matrix, and storing the mirrored matrix in the target address,

Wherein, the operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirror processing, and the operation domain includes the matrix address to be mirrored and the target address.

The matrix mirroring instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module. The control module is used to analyze the received matrix mirroring instruction, obtain the operation code and operation domain of the matrix mirroring instruction, and determine the matrix to be mirrored and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain. The processing module is used to perform mirror processing on the mirror matrix according to the mirror strategy to obtain the mirrored matrix, and store the mirrored matrix in the target address. The matrix mirroring instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of applications, and the mirroring processing of the matrix according to the matrix mirroring instruction has high processing efficiency and fast processing speed.

In order to improve the efficiency and speed of rotating the matrix, according to an aspect of the present disclosure, a matrix rotation instruction processing device is provided, and the device includes:

The control module is used to parse the received matrix rotation instruction, obtain the operation code and the operation domain of the matrix rotation instruction, and determine the standby required to execute the matrix rotation instruction according to the operation code and the operation domain Rotation matrix and target address, and determine the rotation angle of the rotation matrix to be rotated;

The processing module performs rotation processing on the matrix to be rotated according to the rotation angle to obtain a matrix after rotation, and stores the matrix after rotation into the target address,

Wherein, the operation code is used to indicate that the processing performed by the matrix rotation instruction on the matrix data is rotation processing, and the operation domain includes the matrix address to be rotated and the target address.

One or more of the above matrix rotation instruction processing devices are used to obtain the matrix to be rotated and control information from other processing devices, and perform specified machine learning operations, and pass the execution results to other processing devices through the I/O interface;

When the machine learning operation device includes a plurality of matrix rotation instruction processing devices, the plurality of matrix rotation instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the matrix rotation instruction processing apparatuses interconnect and transmit data through a PCIE bus that is a fast external device interconnection bus to support larger-scale machine learning operations; Or have their own control systems; a plurality of the matrix rotation instruction processing devices share memory or have their own memories; the interconnection method of the plurality of matrix rotation instruction processing devices is an arbitrary interconnection topology.

According to another aspect of the present disclosure, a matrix rotation instruction processing method is provided. The method is applied to a matrix rotation instruction processing apparatus. The method includes:

Parse the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and the operation domain , And determine the rotation angle of the matrix to be rotated;

Rotating the matrix to be rotated according to the rotation angle to obtain a matrix after rotation, and storing the matrix after rotation into the target address,

The matrix rotation instruction processing method, device and related products provided by the embodiments of the present disclosure include a control module and a processing module. The control module is used to parse the received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and operation domain, and determine the treatment The rotation angle of the rotation matrix. The processing module is used to perform rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix, and store the rotated matrix in the target address. The matrix rotation instruction processing method, device and related products provided by the embodiments of the present disclosure have a wide range of application, and the processing efficiency of the matrix to be rotated according to the matrix rotation instruction is high and the processing speed is fast.

In some embodiments, the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, Cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.

In some embodiments, the vehicle includes an airplane, ship, and/or vehicle; the household appliance includes a TV, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; and the medical Equipment includes MRI, B-mode ultrasound and/or electrocardiograph.

Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

BRIEF DESCRIPTION

The drawings included in the specification and forming a part of the specification together with the specification show exemplary embodiments, features, and aspects of the present disclosure, and are used to explain the principles of the present disclosure.

FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure.

FIG. 2-1 illustrates a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure.

2-2 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure.

FIGS. 2-3a-2-3c illustrate schematic diagrams of application scenarios of a vector search instruction processing apparatus according to an embodiment of the present disclosure.

2-4 show a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure.

FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.

3-2 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure.

3-3a-FIG. 3-3c are schematic diagrams illustrating application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure.

3-4 show a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure.

FIG. 4-1 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.

4-2 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.

4-3a-FIG. 4-3b illustrate schematic diagrams of application scenarios of a resource lock instruction processing apparatus according to an embodiment of the present disclosure.

4-4 shows a flowchart of a resource lock instruction processing method according to an embodiment of the present disclosure.

FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.

5-2 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.

5-3 shows a schematic diagram of an application scenario of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure.

5-4 shows a flowchart of a tensor reordering instruction processing method according to an embodiment of the present disclosure.

6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure.

6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure.

6-3 shows a block diagram of a data processing device according to an embodiment of the present disclosure.

6-4 shows a block diagram of a data processing device according to an embodiment of the present disclosure.

6-5a- 6-5d show block diagrams of processing modules in a data processing apparatus according to an embodiment of the present disclosure.

6-6 show a flowchart of a data processing method according to an embodiment of the present disclosure.

FIG. 7-1 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.

7-2 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.

7-3 shows a schematic diagram of an application scenario of a matrix symmetric instruction processing device according to an embodiment of the present disclosure.

7-4 shows a flowchart of a matrix symmetric instruction processing method according to an embodiment of the present disclosure.

8-1 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.

8-2 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.

8-3 shows a schematic diagram of an application scenario of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure.

8-4 shows a flowchart of a matrix mirroring instruction processing method according to an embodiment of the present disclosure.

9-1 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.

9-2 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure.

9-3 shows a schematic diagram of an application scenario of a matrix rotation instruction processing device according to an embodiment of the present disclosure.

9-4 shows a flowchart of a matrix rotation instruction processing method according to an embodiment of the present disclosure.

10a and 10b show block diagrams of a combined processing device according to an embodiment of the present disclosure.

FIG. 11 shows a schematic structural diagram of a board according to an embodiment of the present disclosure.

detailed description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely in conjunction with the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are part of the embodiments of the present disclosure, but not all of the embodiments. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative efforts fall within the protection scope of the present disclosure.

It should be understood that the terms "zeroth", "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects, not Used to describe a specific order. The terms "comprising" and "including" used in the specification and claims of this disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude one or more other features, wholes , Steps, operations, elements, components, and/or their existence or addition.

It should also be understood that the terminology used in the present specification of the present disclosure is for the purpose of describing particular embodiments only, and is not intended to limit the present disclosure. As used in this disclosure specification and claims, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" are intended to include the plural forms. It should also be further understood that the term "and/or" used in the present specification and claims refers to any and all possible combinations of one or more of the associated listed items and includes these combinations.

As used in this specification and claims, the term "if" may be interpreted as "when" or "once" or "in response to a determination" or "in response to a detection" depending on the context. Similarly, the phrase "if determined" or "if [described condition or event] is detected" can be interpreted in the context to mean "once determined" or "in response to a determination" or "once detected [described condition or event ]" or "In response to detection of [the described condition or event]".

The present disclosure provides instruction processing methods and apparatuses corresponding to different operations or processes, and computer equipment and storage media corresponding to each instruction processing method and device, and instruction processing methods corresponding to different operations or processes And devices include: vector search instruction processing method and device, scalar search instruction processing method and device, resource lock instruction processing method and device, tensor rearrangement instruction processing method and device, data processing method and device, matrix symmetric instruction processing method And device, matrix mirroring instruction processing method and device, and matrix rotation instruction processing method and device. The instruction processing method and instruction processing device described below may be any of the instruction processing methods and devices listed above.

The instruction processing method according to an embodiment of the present disclosure may be applied to a processor, which may be a general-purpose processor, such as a CPU (Central Processing Unit, central processing unit), or artificial intelligence processing for performing artificial intelligence operations Device (IPU). Artificial intelligence operations can include machine learning operations, brain-like operations, and so on. Among them, machine learning operations include neural network operations, k-means operations, support vector machine operations, etc. The artificial intelligence processor may include, for example, GPU (Graphics Processing Unit), NPU (Neural-Network Processing Unit, neural network processing unit), DSP (Digital Signal Processing, digital signal processing unit), field programmable gate array (Field-Programmable Gate Array, FPGA) One or a combination of chips. This disclosure does not limit the specific types of processors.

In a possible implementation manner, the processor mentioned in the present disclosure may include multiple processing units, and each processing unit may independently run various assigned tasks, such as: convolution operation tasks and pooling tasks Or fully connected tasks. The present disclosure does not limit the processing unit and the tasks performed by the processing unit.

FIG. 1 shows a schematic diagram of a processor of an instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the processor 100 includes a plurality of processing units 101 and a storage unit 102. The plurality of processing units 101 are used to execute an instruction sequence, and the storage unit 102 is used to store data, which may include a random access memory (RAM, Random Access Memory) And register file. The multiple processing units 101 in the processor 100 can share a part of the storage space, for example, share a part of the RAM storage space and the register file, and can also have their own storage spaces at the same time.

FIG. 2-1 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 2-1, the device includes a control module 11-2 and an arithmetic module 12-2.

The control module 11-2 is used to analyze the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the vector to be searched and the search condition required to execute the vector search instruction according to the operation code and operation domain And destination address. The operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address and the target address to be searched.

The operation module 12-2 is used to sequentially determine whether a plurality of check numbers representing the search vector satisfy the search condition, and determine the check number that meets the search condition as the target number, and store the storage address of the target number as the search result target address.

In this embodiment, the to-be-searched vector may be composed of multiple to-be-checked numbers. For example, if the decimal representation of the vector m to be searched is (5, 6, 4), then the multiple numbers to be searched for the vector m to be searched are "5", "6", and "4". The vector to be searched can also be represented by a string of binary, hexadecimal, etc. For example, the binary representation of the vector m to be searched is "101110100", where "101", "110" and "100" are the multiple numbers to be searched for the vector m to be searched, respectively when the vector m to be searched is converted to decimal Corresponding numbers 5, 6, and 4. The control module can obtain the vector to be found from the address of the vector to be found. The address of the vector to be searched may be the first address for storing the vector to be searched, and so on. The control module may obtain the vector search instruction and the vector to be searched through the data input/output unit. The data input/output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is the instruction serial number used to inform the device that executes the instruction which instruction needs to be executed. . The operation domain may be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include parameter data, vector to be searched, and corresponding operation method, or store parameter data, vector to be searched, and corresponding operation The address of the method, etc. For a vector search instruction, it must include an operation code and an operation field, where the operation field includes at least the vector address and the target address to be searched.

It should be understood that a person skilled in the art may set the instruction format of the vector search instruction, as well as the included operation codes and operation domains as needed, which is not limited in this disclosure.

In this embodiment, the device may include one or more control modules and one or more arithmetic modules, and the number of control modules and arithmetic modules may be set according to actual needs, which is not limited in the present disclosure.

The vector search instruction processing device provided by the embodiment of the present disclosure includes a control module and an operation module. The control module is used to parse the received vector search instruction, obtain the operation code and operation domain of the vector search instruction, and determine the to-be-searched vector, search condition, and target address required to execute the vector search instruction according to the operation code and operation domain. The operation module is used to sequentially determine whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition, determine the to-be-checked number satisfying the search condition as the target number, and store the target number's storage address as the search result in the target address. The vector search instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the vector search instruction.

In a possible implementation manner, the number of to-be-checked that meets the search condition may include at least one of the following:

The numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting;

Number to be checked if the value is within the specified value interval;

The numeric value is the number to be checked for the specified multiple of the specified value.

Among them, the specified sorting may include at least one of the following: the sorting of the number to be checked is the nth of the number to be checked whose value is the specified multiple of the specified value, n is a positive integer greater than or equal to 1; the sorting of the number to be checked is The numeric value is the m-th to last in the number to be checked in the specified multiple of the specified value, and m is a positive integer greater than or equal to 1. Among them, m and n are less than or equal to the number of to-be-checked numbers in the to-be-searched vector.

In this implementation, the order of the number to be checked is the nth of the number to be checked which is the specified multiple of the specified value, and the order of the number to be checked is the number which is the specified multiple of the specified value. The "mth to the last" in the count is set with different expressions, etc., to distinguish the specified order of the countdown and the positive number. You can set the expression in the specified sorting order to be "the number of the number to be checked is the nth of the number to be checked that is the specified multiple of the specified value" in the vector search instruction as "0n", and set the "waiting The sorting of the number of checks is that the m-th to the last of the numbers to be searched whose value is the specified multiple of the specified value is expressed in the vector search instruction as "m0". A person skilled in the art can set the expression of the specified order according to actual needs, and this disclosure does not limit this.

In a possible implementation, the specified value may be 0, 1, 2, 3, and so on. The specified multiple can be 1 times (that is, the value is the same as the specified value), 2 times, 3 times and other multiples.

For example, the target number found by the vector search instruction may be the first 1, the last 1, the first 2, 3 times the number to be searched, and the last 2 3 times the number to be checked, less than 5 to be checked, more than 9 to be checked, etc.

In a possible implementation, the operation domain may also include the input length. The control module 11-2 is also used to obtain the vector to be searched from the address of the vector to be searched according to the input length.

In this implementation, the length of the vector to be searched obtained from the address of the vector to be searched according to the input length needs to be equal to the input length, or needs to be less than the input length.

In a possible implementation manner, when the input length is not included in the operation domain, the vector to be searched may be obtained according to a preset default input length. It is also possible to obtain all data in the address of the vector to be searched as the vector to be searched.

In a possible implementation manner, the operation field may further include the width of the data to be checked. The operation module 12-2 is also used to determine a plurality of numbers to be checked from the vector to be looked up according to the width of the numbers to be checked.

In this implementation, the width of the number to be checked may represent the width corresponding to each number to be checked in the character string of the vector to be looked up. When the width of the data to be searched is included in the operation domain, a plurality of groups of character strings whose width is the width of the data to be searched can be determined from the character strings representing the vector to be searched, and each group of character strings corresponds to one data to be searched. For example, if the width of the number to be searched is 3, the vector to be searched m (expressed as (5,6,4) when converted to decimal) is "101110100", and the number of the searched vector m to be searched is "101" "110 "And "100", the multiple to-be-checked numbers "101", "110" and "100" are the corresponding numbers 5, 6, and 4 when the to-be-searched vector m is converted to decimal, respectively. If the width of the number to be checked is 1, the plurality of numbers to be searched for the vector m are "1", "0", "1", "1", "1", "0", "1", "0" And "0", or the width of the number to be searched is 2, 4 and other widths other than 3, the obtained number of the searched vector m is only a number composed of character strings, and converted to the searched vector m The corresponding numbers 5, 6, and 4 in decimal are irrelevant.

In a possible implementation, the operation domain may further include search conditions. The control module 11-2 is also used to determine search conditions according to the operation domain.

In this implementation manner, when the search condition is included in the operation domain, the search condition in the operation domain can be directly obtained.

In a possible implementation, the control module 11-2 is also used to determine the search condition according to the operation code. Among them, the opcode is also used to indicate the search condition of the vector search instruction.

In this implementation, different operation codes can be set to represent different search conditions. The width of the data to be checked can also be determined according to the operation code or the default width.

For example, you can set the opcode "Find_vfirst" to find the first 1 of the multiple numbers to be searched for the vector to be found (the width of the number to be checked is greater than 1, the number of the number to be checked that meets the search condition is: the value is the specified value The first to be checked out of the doubled to be checked out). The operation code "Find_vlast" is the last one of the multiple numbers to be searched to find the vector to be searched (the width of the number to be checked is greater than 1, the number of the number to be checked that meets the search condition is: the value is the number of the number to be checked that is double the specified value The penultimate number 1 to be checked). When the operation codes are "Find_vfirst" and "Find_vlast", the width of the number to be checked can be further determined according to the operation code, or the default width can be determined as the width of the number to be checked, and then multiple number of numbers to be checked with the width of the number to be checked can be obtained .

2-2 shows a block diagram of a vector search instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 2-2, the operation module 12-2 may include at least one comparator 121-2, configured to compare a plurality of to-be-checked numbers with search conditions to obtain a comparison result, In order to determine whether the number to be checked meets the search condition according to the comparison result.

For example, what the vector search instruction is looking for is: the first of the numbers to be checked whose value is 1 (that is, the value is twice the specified value 1) is an example. The comparator can sequentially search multiple The value of the number to be checked is compared with the specified value "1", and then it can be determined whether the value of the number to be checked is equal to the specified value "1", and the value is equal to the specified value "1" and sorted to be equal to the specified value "1" The first to-be-checked number in the to-be-checked number is determined as the target number, and the storage address of the target number is stored in the target address as the search result. The number of comparators can be set according to the size of the data amount to be compared, the processing speed, efficiency, etc. of the comparison, which is not limited in the present disclosure.

In a possible implementation manner, as shown in FIG. 2-2, the device may further include a storage module 13-2. The storage module 13-2 is used to store the vector to be searched.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The vector to be searched can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in the present disclosure.

In a possible implementation, the device may further include a direct memory access module, which is used to read or store data from the storage module.

In a possible implementation, as shown in FIG. 2-2, the control module 11-2 may include an instruction storage submodule 111-2, an instruction processing submodule 112-2, and a queue storage submodule 113-2.

The instruction storage submodule 111-2 is used to store vector search instructions.

The instruction processing sub-module 112-2 is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction.

The queue storage sub-module 113-2 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order. The plurality of instructions to be executed may include vector search instructions. The plurality of instructions to be executed may include other calculation instructions related to the vector search instruction.

In this implementation manner, the execution order of the plurality of instructions to be executed can be arranged according to the reception time and priority level of the instruction to be executed to obtain an instruction queue, so that the plurality of instructions to be executed can be sequentially executed according to the instruction queue.

In a possible implementation, as shown in FIG. 2-2, the control module 11-2 may further include a dependency processing sub-module 114-2.

The dependency processing sub-module 114-2 is configured to cache the first instruction to be executed when it is determined that there is an association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed In the instruction storage sub-module 112-2, after the execution of the zeroth to-be-executed instruction is completed, the first to-be-executed instruction is extracted from the instruction storage sub-module 112-2 and sent to the arithmetic module 12-2. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

The first to-be-executed instruction is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas. Conversely, there is no association between the first instruction to be executed and the zeroth instruction to be executed may be that there is no overlapping area between the first storage address interval and the zeroth storage address interval.

In this way, according to the dependency relationship between the instructions to be executed, after the execution of the first to-be-executed order is completed, the subsequent to-be-executed instruction is executed again to ensure the accuracy of the calculation result.

In a possible implementation manner, the instruction format of the vector search instruction may be as shown in Table 1 below, and the operation code and the position of the operation code may be set. In Table 2, a conventional vector search instruction (Find) is given, and any number in the vector to be searched can be found by using the conventional vector search instruction; and an example of the vector search instruction is given in Table 2 and defined Two special types of vector search instructions (Find_vfirst, Find_vlast) need to include the opcode and operation field. Using a special type of vector search instruction to search for the search vector can simplify the instruction processing process and save the search time.

Table 1 Command format

Among them, when the search condition includes only the specified value interval, the number of to-be-checked satisfying the search condition refers to: the value is the number of to-be-checked of the specified multiple of the specified value.

When the search condition includes the specified value and the specified order, the number of queries to satisfy the search condition means that the number is equal to the specified value and the order is the number of queries to the specified order.

When the search condition includes the specified value and the specified multiple, the number to be checked that meets the search condition refers to: the value is the number to be checked of the specified multiple of the specified value.

When the search condition includes a specified value, a specified multiple, and a specified sort, the number of queries to satisfy the search condition refers to: the value is the specified multiple of the specified value, and the sort is the number of the query to be specified.

Table 2 Examples of vector search instructions

Among them, the vector search instruction whose operation code is "Find_vfirst", the number of the searched to meet the search conditions is: the value is double the specified value 1 (that is, the value is equal to the specified value 1), and the sorted value is The first of the counts to be checked that is double the specified value 1. The width of the data to be checked is greater than 1.

The vector search instruction whose operation code is "Find_vlast" finds that the number of pending queries that meet the search conditions is: the value is double the specified value 1 (that is, the value is equal to the specified value 1) and the sorted value is the specified value The penultimate of the counts to be doubled. The width of the data to be checked is greater than 1.

It should be understood that a person skilled in the art may set the operation code of the vector search instruction, the operation code in the instruction format, and the position of the operation field according to needs, which is not limited in the present disclosure.

In a possible implementation, the device may be set in a graphics processor (GPU), a central processing unit (CPU) and a neural-network processing unit (Neural-network Processing) , Referred to as NPU).

It should be noted that although the above embodiment is taken as an example to introduce the vector search instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "using a vector search instruction processing device to search for a search vector to be searched" as an exemplary application scenario, so as to facilitate understanding of the flow of the vector search instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

FIGS. 2-3a-2-3c illustrate schematic diagrams of application scenarios of a vector search instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figures 2-3a-2-3c, the vector search instruction processing device processes the vector search instruction as follows.

First, assume that the to-be-searched vector a is "0101 1011 0001 0101 1100 0001 1001". In addition, every four binary numbers represent a number of the vector a to be searched in decimal, that is, the vector a to be searched in decimal is (5,11,1,5,12,1,9). To facilitate distinguishing between different vector search instructions, it is assumed that the storage address of the vector a to be searched in different vector search instructions is different.

The vector search instructions to be processed by the device include:

Vector search instruction 1: @Find#100#28#200#4#1#01

Vector search instruction 2: @Find_vfirst#101#28#201#4

Vector search instruction 3: @Find_vlast#102#28#202#4

Example 1

As shown in Figure 2-3a, the control module 11-2 parses the vector search instruction 1 when receiving the vector search instruction 1, obtains the operation code of the vector search instruction 1 as Find, and determines the vector search instruction 1 according to the operation domain , The vector address to be searched is "100", the input length is "28", the target address is "200", the specified sort is "the number of counts to be checked is equal to the specified value (because the specified multiple position in the vector search instruction 1 is empty , By default, the multiple is doubled), the first of the number to be checked", the specified value is "1", and the width of the number to be checked is "4". Further, the control module 11-2 obtains the above-mentioned to-be-searched vector a whose input length is 28 from the to-be-searched vector address 200 is "0101 1011 0001 0101 1100 0001 1001".

The arithmetic module 12-2 obtains a plurality of numbers to be checked sequentially from the vector to be searched according to the width of the numbers to be checked "4", and sequentially determines whether the values of the plurality of numbers to be checked are equal to the specified value "1", and sets the value The first to-be-checked number among the to-be-checked numbers equal to the specified value “1” and sorted to be equal to the specified value “1” is determined as the target number, and the storage address of the target number is stored in the target address 200 as a search result.

In this example, the arithmetic module 12-2 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched vector a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "1011" from the vector to be looked up a, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1". Since the value of the number to be checked "1011" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the vector to be looked up a, and judges whether the value of the number to be checked "0001" is equal to the specified value " 1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the first of the number to be checked equal to the specified value), the number "0001" to be checked is determined as For the target number, the storage address 500 of the number to be checked "0001" is stored in the target address 200 as the search result.

Example 2

As shown in Figure 2-3b, the control module 11-2 parses the vector search instruction 2 when receiving the vector search instruction 2, obtains the operation code of the vector search instruction 2 as Find_vfirst, and determines the vector search instruction 2 according to the operation domain The address of the vector to be searched is "101", the input length is "28", the target address is "201", and the width of the number to be searched is "4". In addition, it is determined according to the operation code Find_vfirst that the specified value of the vector search instruction 2 is "1" and the specified sort is "the sort of the number to be checked is the first of the number to be checked equal to the specified value". Furthermore, the control module 11-2 obtains the above-mentioned to-be-searched vector a whose input length is 28 from the to-be-searched vector address 201 is "0101 1011 0001 0101 1100 0001 1001".

The arithmetic module 12-2 obtains a plurality of numbers to be checked sequentially from the vector to be searched according to the width of the numbers to be checked "4", and sequentially determines whether the values of the plurality of numbers to be checked are equal to the specified value "1", and sets the value The first to-be-checked number among the to-be-checked numbers equal to the specified value "1" and sorted to be equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 201 as the search result.

In this example, the arithmetic module 12-2 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched vector a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "1011" from the vector to be looked up a, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1". Since the value of the number to be checked "1011" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the vector to be looked up a, and judges whether the value of the number to be checked "0001" is equal to the specified value " 1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the first of the number to be checked equal to the specified value), the number "0001" to be checked is determined as For the target number, the storage address 501 of the number to be checked "0001" is stored in the target address 201 as the search result.

Example 3

As shown in Figure 2-3c, the control module 11-2 parses the vector search instruction 3 when receiving the vector search instruction 3, obtains the operation code of the vector search instruction 3 as Find_vlast, and determines the vector search instruction 3 according to the operation domain The address of the vector to be searched is "102", the input length is "28", the target address is "202", and the width of the data to be searched is "4". In addition, it is determined according to the operation code Find_vlast that the specified value of the vector search instruction 3 is "1" and the specified sort is "the sort of the number to be checked is the penultimate number of the number to be checked equal to the specified value". Furthermore, the control module 11-2 obtains the above-mentioned to-be-searched vector a whose input length is 28 from the to-be-searched vector address 202 is "0101 1011 0001 0101 1100 0001 1001".

The arithmetic module 12-2 obtains a plurality of numbers to be checked sequentially from the vector to be searched according to the width of the numbers to be checked "4", and sequentially determines whether the values of the plurality of numbers to be checked are equal to the specified value "1", and sets the value The first to-be-checked number among the to-be-checked numbers equal to the specified value "1" and sorted to be equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 202 as the search result .

In this example, the arithmetic module 12-2 first obtains the first to-be-checked number "1001" with a width of 4 from the to-be-searched vector a, and determines whether the value of the to-be-checked number "1001" is equal to the specified value "1" . Since the value of the number to be checked "1001" is not 1, the arithmetic module 12-2 continues to obtain the next number to be checked "0001" from the vector to be looked up a, and judges whether the value of the number to be checked "0001" is equal to the specified value " 1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the penultimate of the number to be checked equal to the specified value), the number to be checked "0001" is determined As the target number, the storage address 502 of the number to be checked "0001" is stored in the target address 202 as a search result.

In this way, the vector search instruction processing device can process the vector search instruction quickly and efficiently.

2-4 show a flowchart of a vector search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 2-4, this method is applied to the above vector search instruction processing device. The method includes step S51-2 and step S52-2.

In step S51-2, the received vector search instruction is parsed to obtain the operation code and operation domain of the vector search instruction, and the vector to be searched, the search condition and the search vector required for executing the vector search instruction are determined according to the operation code and operation domain target address. The operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address and the target address to be searched.

In step S52-2, it is sequentially determined whether a plurality of numbers to be searched for the vector to be searched satisfy the search condition, and the number to be searched that meets the search condition is determined as the target number, and the storage address of the target number is stored as the search result in the target address.

In a possible implementation, the operation domain may also include the input length. Wherein, determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain may include: obtaining the vector to be searched from the vector address to be searched according to the input length.

In a possible implementation manner, the operation field may further include the width of the data to be checked. The method may further include: determining a plurality of to-be-checked numbers from the to-be-checked vector according to the width of the to-be-checked numbers.

In a possible implementation, the operation domain may further include search conditions. Wherein, determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain may include: determining the search condition according to the operation domain.

In a possible implementation manner, determining the to-be-searched vector, the search condition, and the target address required to execute the vector search instruction according to the operation code and the operation domain may include:

According to the operation code, the search condition is determined, and the operation code is also used to indicate the search condition of the vector search instruction.

In a possible implementation manner, sequentially determining whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition may include:

At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number meets the search condition according to the comparison result.

Number to be checked if the value is within the specified value interval;

In a possible implementation manner, the method may further include: storing the to-be-searched vector.

In a possible implementation manner, parsing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction may include:

Store vector search instruction;

Analyze the vector search instruction to obtain the operation code and operation domain of the vector search instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include vector search instructions.

In a possible implementation manner, the method may further include:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is completed , Control the execution of the first instruction to be executed,

The association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

A first storage address interval storing data required for the first instruction to be executed has an overlapping area with a zeroth storage address interval storing data required for the zeroth instruction to be executed.

It should be noted that although the above embodiment is taken as an example to introduce the vector search instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The vector search instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the vector search instruction.

The foregoing can be better understood based on the following terms:

Clause A1, a vector search instruction processing device, the device comprising:

Clause A2. The device according to Clause A1, the operation field further includes an input length,

The control module is further configured to obtain the vector to be searched from the address of the vector to be searched according to the input length.

Clause A3. The device according to Clause A1, the operation domain further includes the width of the data to be checked,

The calculation module is further configured to determine the plurality of to-be-checked numbers from the to-be-searched vector according to the width of the to-be-checked number.

Clause A4. The device according to Clause A1, the operation domain further includes a search condition,

The control module is also used to determine the search condition according to the operation domain.

Clause A5, the device according to Clause A1,

The control module is further used to determine the search condition according to the operation code, wherein the operation code is also used to indicate the search condition of the vector search instruction.

Clause A6. The device according to Clause A1, the arithmetic module includes:

Clause A7. The device according to any one of Clause A1-Clause A6, the number of to-be-checked satisfying the search condition includes at least one of the following:

Number to be checked if the value is within the specified value interval;

The value is the number to be checked for the specified multiple of the specified value,

Wherein, the designated order includes at least one of the following:

The sorting of the number to be checked is the nth of the number to be checked whose value is a specified multiple of the specified value, where n is a positive integer greater than or equal to 1;

The order of the numbers to be checked is the m-th to the last one of the numbers to be checked whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,

Wherein, m and n are less than or equal to the number of numbers to be checked in the vector to be looked up.

Clause A8. The device according to Clause A1,

The device further includes a storage module for storing the to-be-searched vector,

Wherein, the control module includes:

Instruction storage sub-module for storing the vector search instruction;

An instruction processing sub-module, which is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction;

A queue storage sub-module is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the vector search instruction,

Wherein, the control module also includes:

The dependency processing sub-module is used to determine the first pending instruction when there is an association between the first pending instruction among the multiple pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the arithmetic module,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.

Clause A9. A machine learning computing device, the device comprising:

One or more vector search instruction processing devices as described in any one of clauses A1 to A8, used to obtain data and control information to be calculated from other processing apparatuses, and perform specified machine learning operations, and pass the execution result through I /O interface is passed to other processing devices;

Clause A10. A combined processing device, the combined processing device comprising:

Machine learning computing device, general interconnection interface and other processing devices as described in clause A9;

The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,

Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.

Article A11. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause A9 or the combined processing device according to clause A10.

Article A12. An electronic device, the electronic device comprising:

Machine learning chip as described in clause A11.

Clause A13, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause A11;

Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;

The storage device is used to store data;

The interface device is used to realize data transmission between the machine learning chip and an external device;

The control device is used for monitoring the state of the machine learning chip.

Clause A14. A vector search instruction processing method. The method is applied to a vector search instruction processing device. The method includes:

Parse the received vector search instruction to obtain the operation code and operation domain of the vector search instruction, and determine the vector to be searched and the search condition required to execute the vector search instruction according to the operation code and the operation domain And destination address;

Sequentially determine whether a plurality of searched numbers representing the searched vector satisfy the search condition, determine the searched number satisfying the search condition as a target number, and store the storage address of the target number as a search result The target address,

Clause A15. The method according to Clause A14, the operation field further includes an input length,

Wherein, determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain include:

Obtain the vector to be searched from the address of the vector to be searched according to the input length.

Clause A16. The method according to Clause A14, the operation domain further includes a width to be checked, and the method further includes:

According to the width of the number to be checked, the plurality of numbers to be checked are determined from the vector to be looked up.

Clause A17. The method according to Clause A14, the operation domain further includes a search condition,

According to the operation domain, the search condition is determined.

Clause A18. According to the method described in Clause A14, according to the operation code and the operation domain, a vector to be searched, a search condition, and a target address required to execute the vector search instruction are determined, including:

The search condition is determined according to the operation code, and the operation code is also used to indicate the search condition of the vector search instruction.

Clause A19. According to the method described in Clause A14, sequentially determining whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition includes:

At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number satisfies the search condition according to the comparison result.

Clause A20. According to the method described in any one of Clause A14-Clause A19, the number of to-be-checked that meets the search condition includes at least one of the following:

Number to be checked if the value is within the specified value interval;

Wherein, the designated order includes at least one of the following:

Clause A21, according to the method described in Clause A14,

The method further includes: storing the to-be-searched vector,

Wherein, analyzing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction includes:

Store the vector search instruction;

Parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the vector search instruction,

Wherein, the method further includes:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and determine the After the execution of the zeroth to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,

FIG. 3-1 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in Figure 3-1, the device includes a control module 11-3 and an arithmetic module 12-3.

The control module 11-3 is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched and the specified value required to execute the scalar search instruction according to the operation code and operation domain , Specify sorting and target address. The operation code is used to indicate that the operation performed by the scalar search instruction on the data is a search operation, and the operation domain includes the scalar address and the target address to be searched.

The arithmetic module 12-3 is used to sequentially determine whether the values of the plurality of check numbers representing the scalar to be searched are equal to the specified value, and determine the check number with the value equal to the specified value and sorted to the specified sort as the target number, and determine the target The storage address of the number is stored in the target address as the search result.

In this embodiment, the scalar to be searched may be a character string in binary, hexadecimal, or the like. For example, the binary representation of the scalar 87 to be searched is "01010111", and the multiple queried numbers of the scalar 87 to be searched are "0", "1", "0", "1", "0", "1", "1" and "1". The control module can obtain the scalar to be found from the scalar address to be found. The scalar address to be searched may be the first address storing the scalar to be searched, and so on. The control module may obtain the scalar search instruction and the scalar to be searched through the data input and output unit, and the data input and output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is the instruction serial number used to inform the device that executes the instruction which instruction needs to be executed. . The operation domain can be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include parameter data, scalar to be searched, and corresponding operation method, or store parameter data, scalar to be searched, and corresponding operation The address of the method, etc. For a scalar search instruction, it must include an operation code and an operation field, where the operation field includes at least the scalar address to be searched and the target address.

It should be understood that those skilled in the art can set the instruction format of the scalar search instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

A scalar search instruction processing device provided by an embodiment of the present disclosure includes a control module and an arithmetic module. The control module is used to parse the received scalar search instruction, obtain the operation code and operation domain of the scalar search instruction, and determine the scalar to be searched, the specified value, the specified order and the required scalar search instruction according to the operation code and the operation domain. target address. The arithmetic module is used to sequentially determine whether the values of the multiple check numbers representing the scalar to be searched are equal to the specified value, determine the check number that is equal to the specified value and sorted into the specified sort as the target number, and determine the storage address of the target number The target address is stored as the search result. The scalar search instruction processing device provided by the embodiment of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the scalar search instruction.

In a possible implementation manner, the specified sorting may include at least one of the following: the sorting of the number to be checked is the nth of the number to be checked equal to the specified value, n is a positive integer greater than or equal to 1; the number to be checked The order of is the m-th to the last one of the number to be checked equal to the specified value, and m is a positive integer greater than or equal to 1. Among them, m and n are less than or equal to the number of the number to be checked in the scalar to be looked up.

In this implementation, you can sort the "number to be checked to be the nth of the number to be checked that is equal to the specified value" and "the order to be checked to be the mth number to the bottom of the number that is to be checked." "Set different expressions, etc., to distinguish the specified order of reciprocal and positive numbers. You can set the expression of "the number of the number to be checked to be the nth of the number to be checked equal to the specified value" in the specified sort in the scalar search instruction is "0n", set the "sort of the number to be checked" in the specified sort The "m-th to last" in the number to be checked equal to the specified value is expressed in the scalar search instruction as "m0". A person skilled in the art can set the expression of the specified order according to actual needs, and this disclosure does not limit this.

In a possible implementation, the specified value may be 0, 1, 2, 3, and so on.

For example, if the scalar to be searched is a hexadecimal string, the specified values can be 0-9, A-F. If the scalar to be searched is a binary string, the specified value can be 0 or 1. The target number found by the scalar search instruction may be the first one, the last one, etc. of the multiple queried numbers to be searched for.

In a possible implementation, the operation domain may also include the input length. The control module 11-3 is also used to obtain the scalar to be found from the scalar address to be found according to the input length.

In this implementation manner, the length of the scalar to be searched obtained from the scalar address to be searched according to the input length needs to be equal to the input length, or needs to be less than the input length.

In a possible implementation manner, when the input length is not included in the operation domain, the scalar to be searched can be obtained according to a preset default input length. It is also possible to obtain all data in the scalar address to be searched as the scalar to be searched.

In a possible implementation manner, the operation field may further include the width of the data to be checked. The operation module 12-3 is also used to determine a plurality of to-be-checked numbers from the to-be-searched scalars according to the to-be-checked width.

In this implementation manner, when the width of the data to be checked is included in the operation domain, a plurality of data to be searched whose width is the width of the data to be checked can be determined from the scalar to be searched.

In this implementation, the width of the data to be checked is not included in the operation field (it may mean that the position corresponding to the width of the data to be checked in the scalar search instruction is empty, or there is no width of the data to be checked), or the width of the data to be checked is At 1, the multiple queried numbers of the scalar to be searched are multiple characters in the character string. For example, when the scalar n to be searched is "01010111", and the width of the queuing to be searched is 1, the multiple queried to be searched for the scalar n is "0", "1", "0", "1", "0" "1", "1" and "1".

In this implementation manner, when the operation domain includes the width of the number to be checked and the width of the number to be checked is greater than 1, the plurality of numbers to be checked for the scalar to be searched are multiple binary digit strings having the width of the number to be checked, each of the number to be checked The string of binary digits in width represents a number to be checked. For example, if the width of the data to be searched is 3, the scalar m to be searched is "101110100", and the plurality of scalars to be searched m are "101", "110", and "100".

In a possible implementation manner, the operation domain may further include a specified value and a specified order. The control module 11-3 is also used to determine the specified value and specified order according to the operation domain.

In this implementation, when the specified value and specified order are included in the operation domain, the specified value and specified order in the operation domain can be directly obtained.

In a possible implementation, the control module 11-3 is also used to determine the specified value and the specified order according to the operation code. Among them, the opcode is also used to indicate the specified value and specified order of the scalar search instruction.

In this implementation, different operation codes can be set to represent different specified values and specified orders. In particular, the width of the data to be checked can also be determined according to the operation code or the default width.

For example, you can set the opcode "Find_bfirst" to find the first one of the multiple numbers to be searched for the scalar to be found (the width of the number to be checked is 1, the specified value is 1, and the specified sort is the sort of the number to be checked It is the first one of the number to be checked equal to the specified value). The operation code "Find_blast" is the last one of the multiple numbers to be found to find the scalar to be found (the width of the number to be checked is 1, the specified value is 1, the specified sort is the number of checked numbers, and the sort is equal to the specified value The penultimate number in the number). When the operation codes are "Find_bfirst" and "Find_blast", the width of the number-to-be-checked can be further determined to be 1 according to the operation code, and then a plurality of number-to-be-checked with a width of 1 can be obtained.

3-2 shows a block diagram of a scalar search instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 3-2, the operation module 12-3 may include at least one comparator 121-3, which is used to compare the values of multiple numbers to be checked and the specified values to obtain a comparison The result, in order to determine whether the value of the number to be checked is equal to the specified value according to the comparison result.

For example, taking the specified value as "1" and the specified order as "the number of items to be checked is the first of the number of items to be checked equal to the specified value" as an example, the comparator may sequentially select multiple The value of the check number is compared with the specified value "1" to obtain the comparison result. In turn, the arithmetic module can determine whether the value of the number to be checked is equal to the specified value "1" according to the comparison result, and the value is equal to the specified value "1" and sorted to be equal to the specified value. The number is determined as the target number, and the storage address of the target number is stored in the target address as the search result. The number of comparators can be set according to the size of the data amount to be compared, the processing speed, efficiency, etc. of the comparison, which is not limited in the present disclosure.

In a possible implementation manner, as shown in FIG. 3-2, the device may further include a storage module 13-3. The storage module 13-3 is used to store the scalar to be found.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The scalar to be searched can be stored in the memory, cache, and/or register in the storage module as needed, which is not limited in this disclosure.

In a possible implementation, as shown in FIG. 3-2, the control module 11-3 may include an instruction storage sub-module 111-3, an instruction processing sub-module 112-3, and a queue storage sub-module 113-3.

The instruction storage submodule 111-3 is used to store scalar search instructions.

The instruction processing submodule 112-3 is used to parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction.

The queue storage sub-module 113-3 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order. The plurality of instructions to be executed may include a scalar search instruction.

In a possible implementation, as shown in FIG. 3-2, the control module 11-3 may further include a dependency processing sub-module 114-3.

The dependency processing sub-module 114-3 is configured to cache the first instruction to be executed when it is determined that there is an association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed In the instruction storage submodule 112-3, after the execution of the zeroth to-be-executed instruction is completed, the first to-be-executed instruction is extracted from the instruction storage submodule 112-3 and sent to the arithmetic module 12-3. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

In this way, according to the dependency relationship between the instructions to be executed, after the execution of the first instruction to be executed is completed, the instruction to be executed later is executed to ensure the accuracy of the calculation result.

In a possible implementation manner, the instruction format of the scalar search instruction may be as shown in Table 3 below, and the operation code and the position of the operation code may be set. In Table 4, the conventional scalar search instruction (Find) is given, and any number in the scalar to be searched can be found by using the conventional scalar search instruction; and an example of the scalar search instruction is given in Table 4 and defined Two special types of scalar search instructions (Find_bfirst, Find_blast) need to include the operation code and operation field. Using a special type of scalar search instruction to search for the scalar to be searched can simplify the instruction processing process and save search time.

Table 3 Command format

Table 4 Examples of scalar search instructions

Among them, the scalar search instruction of the operation code "Find_bfirst", the corresponding specified value is 1, the specified sort is the number of the number to be checked is equal to the specified value of the number of the number to be checked, the width of the number to be checked is .

The scalar search instruction whose operation code is "Find_blast" corresponds to the width of the number to be checked is 1, the specified value is 1, and the specified order is the sort of the number to be checked is the penultimate number of the number to be checked equal to the specified value.

It should be understood that those skilled in the art can set the operation code of the scalar search instruction, the operation code in the instruction format, and the position of the operation field as needed, and the disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the scalar search instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "using a scalar search instruction processing device to search for a scalar to be searched" as an exemplary application scenario, so as to facilitate understanding of the flow of the scalar search instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

3-3a-FIG. 3-3c are schematic diagrams illustrating application scenarios of a scalar search instruction processing device according to an embodiment of the present disclosure. As shown in FIGS. 3-3a to 3-3c, the scalar search instruction processing device processes the scalar search instruction as follows.

First, assume that the scalar a to be searched is "010110110001". The number of the scalar a to be searched in decimal, that is, the scalar a to be searched in decimal is 1457. To facilitate distinguishing between different scalar search instructions, it is assumed that the storage addresses of the scalar a to be searched in different scalar search instructions are different.

The scalar search instructions to be processed by the device include:

Scalar search instruction 1: @Find#1#100#12#200#01#4

Scalar search instruction 4: @Find_bfirst#103#12#203

Scalar search instruction 5: @Find_blast#104#12#204

Example 1-3

As shown in Figure 3-3a, the control module 11-3 parses the scalar search instruction 1 when receiving the scalar search instruction 1, obtains the operation code of the scalar search instruction 1 as Find, and determines the scalar search instruction 1 according to the operation domain The specified value is "1", the scalar address to be searched is "100", the input length is "12", the target address is "200", and the specified sort is "the number of counts to be checked is equal to the specified value. "1st", the width of the data to be checked is "4". Further, the control module 11-3 obtains the above-mentioned to-be-searched scalar a whose length is 12 from the to-be-searched scalar address 200 is "010110110001".

The arithmetic module 12-3 sequentially obtains a plurality of numbers to be checked from the scalar a to be searched according to the width of the numbers to be checked to be "4", and sequentially determines whether the values of the plurality of to-be-checked numbers are equal to the specified value "1", and sets the value The first to-be-checked number among the to-be-checked numbers equal to the specified value “1” and sorted to be equal to the specified value “1” is determined as the target number, and the storage address of the target number is stored in the target address 200 as a search result.

In this example, the arithmetic module 12-3 first obtains the first to-be-checked number "0101" with a width of 4 from the to-be-searched scalar a, and judges whether the value of the to-be-checked number "0101" is equal to the specified value "1". Since the value of the number to be checked "0101" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "1011" from the scalar a to be searched, and judges whether the value of the number to be checked "1011" is equal to the specified value " 1". Since the value of the number to be checked "1011" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "0001" from the scalar a to be searched, and judges whether the value of the number to be checked "0001" is equal to the specified value " 1". Since the value of the number to be checked "0001" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the first of the number to be checked equal to the specified value), the number "0001" to be checked is determined as For the target number, the storage address 500 of the number to be checked "0001" is stored in the target address 200 as the search result.

Example 2-3

As shown in Fig. 3-3b, the control module 11-3 parses the scalar search instruction 4 when receiving the scalar search instruction 4, obtains the operation code of the scalar search instruction 4 as Find_bfirst, and determines the scalar search instruction 4 according to the operation domain The scalar address to be searched for is "103", the input length is "12", and the target address is "203". And, it is determined according to the operation code Find_bfirst that the specified value of the scalar search instruction 4 is "1", and the specified sort is "the sort of the number to be checked is the first of the number to be checked equal to the specified value". Furthermore, the control module 11-3 obtains the above-mentioned to-be-searched scalar a whose length is 12 from the to-be-searched scalar address 203, "010110110001".

The arithmetic module 12-3 sequentially obtains a plurality of numbers to be checked from the scalar to be searched a, and sequentially determines whether the values of the plurality of numbers to be checked are equal to the specified value "1", and the values are equal to the specified value "1", and is sorted as The first to-be-checked number among the to-be-checked numbers equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 203 as a search result.

In this example, the arithmetic module 12-3 first obtains the first to-be-checked number "0" from the to-be-searched scalar a, and judges whether the value of the to-be-checked number "0" is equal to the specified value "1". Since the value of the number to be checked "0" is not 1, the arithmetic module 12-3 continues to obtain the next number to be checked "1" from the scalar a to be searched, and judges whether the value of the number to be checked "1" is equal to the specified value " 1". Since the value of the number to be checked "1" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the first of the number to be checked equal to the specified value), the number "1" to be checked is determined as For the target number, the storage address 503 of the number to be checked "1" is stored in the target address 203 as the search result.

Example 3-3

As shown in FIG. 3-3c, when receiving the scalar search instruction 5, the control module 11-3 parses the scalar search instruction 5, obtains the operation code of the scalar search instruction 4 as Find_blast, and determines the scalar search instruction 5 according to the operation domain The scalar address to be searched for is "104", the input length is "12", and the target address is "204". And, it is determined according to the operation code Find_blast that the specified value of the scalar search instruction 5 is "1", and the specified sort is "the sort of the number to be checked is the penultimate of the number to be checked equal to the specified value". Furthermore, the control module 11-3 obtains the above-mentioned to-be-searched scalar a whose length is 12 from the to-be-searched scalar address 204 is "010110110001".

The arithmetic module 12-3 sequentially obtains a plurality of numbers to be checked from the scalar to be searched a, and sequentially determines whether the values of the plurality of numbers to be checked are equal to the specified value "1", and the values are equal to the specified value "1", and is sorted as The first to-be-checked number among the to-be-checked numbers equal to the specified value "1" is determined as the target number, and the storage address of the target number is stored in the target address 204 as a search result.

In this example, the arithmetic module 12-3 first obtains the first to-be-checked number "1" from the to-be-searched scalar a, and judges whether the value of the to-be-checked number "1" is equal to the specified value "1". Since the value of the number to be checked "1" is equal to 1, and its sorting is the specified sorting (that is, the sorting of the number to be checked is the penultimate of the number of counts that is equal to the specified value), the number "1" is determined As the target number, the storage address 504 of the number to be checked "1" is stored in the target address 204 as a search result.

In this way, the scalar search instruction processing device can process the scalar search instruction quickly and efficiently.

3-4 show a flowchart of a scalar search instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 3-4, the method is applied to the above scalar search instruction processing device, and the method includes step S51-3 and step S52-3.

In step S51-3, the received scalar search instruction is parsed to obtain the operation code and operation domain of the scalar search instruction, and according to the operation code and operation domain, the to-be-searched scalar required to execute the scalar search instruction, the specified value, Specify sorting and destination addresses. The operation code is used to indicate that the operation performed by the scalar search instruction on the scalar data is a search operation, and the operation domain includes the scalar address to be searched and the target address.

In step S52-3, it is sequentially determined whether the values of the plurality of check numbers representing the scalar to be searched are equal to the specified value, and the check number that is equal to the specified value and sorted into the specified sort is determined as the target number, and the target number is determined The storage address of is stored in the target address as the search result.

In a possible implementation, the operation domain may also include the input length. Wherein, determining the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction according to the operation code and the operation domain may include: obtaining the scalar to be searched from the scalar address to be searched according to the input length.

In a possible implementation manner, the operation domain may further include a specified value and a specified order. Wherein, determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain may include: determining the specified value and the specified order according to the operation domain.

In a possible implementation manner, determining the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction according to the operation code and the operation field may include:

According to the operation code, the specified value and specified order are determined. The operation code is also used to indicate the specified value and specified order of the scalar search instruction.

In a possible implementation manner, sequentially determining whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value may include:

At least one comparator is used to compare the values of a plurality of to-be-checked numbers with a specified value to obtain a comparison result, so as to determine whether the to-be-checked value is equal to the specified value according to the comparison result.

In a possible implementation, the specified ordering may include at least one of the following:

The order of the number to be checked is the nth of the number of the number to be checked equal to the specified value, n is a positive integer greater than or equal to 1; It is a positive integer greater than or equal to 1. Among them, m and n are less than or equal to the number of the number to be checked in the scalar to be looked up.

In a possible implementation, the method may further include: storing the scalar to be searched.

In a possible implementation manner, parsing the received scalar search instruction to obtain the operation code and operation domain of the scalar search instruction may include:

Store scalar search instructions;

Analyze the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include a scalar search instruction.

In a possible implementation manner, the method may further include:

It should be noted that although the above embodiment is taken as an example to introduce the scalar search instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The scalar search instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the scalar search instruction.

The foregoing can be better understood based on the following terms:

Clause B1, a scalar search instruction processing device, the device comprising:

Clause B2. The device according to Clause B1, the operation field further includes an input length,

The control module is further configured to obtain the scalar to be searched from the scalar to be searched address according to the input length.

Clause B3. The device according to Clause B1, the operation domain further includes a specified value and a specified order,

The control module is also used to determine the specified value and the specified order according to the operation domain.

Clause B4, the device according to Clause B1,

The control module is further configured to determine the specified value and the specified order according to the operation code, wherein the operation code is also used to indicate the specified value and the specified order of the scalar search instruction.

Clause B5. The device according to Clause B1, the arithmetic module includes:

At least one comparator is used to compare the values of the plurality of to-be-checked numbers with the specified value to obtain a comparison result, so as to determine whether the value of the to-be-checked number is equal to the specified value according to the comparison result.

Clause B6. The device according to any one of Clause B1-Clause B5, the designated order includes at least one of the following:

The order of the number to be checked is the nth of the number to be checked equal to the specified value, where n is a positive integer greater than or equal to 1;

The order of the number to be checked is the m-th to the last of the number to be checked which is equal to the specified value, the m is a positive integer greater than or equal to 1,

Wherein, m and n are less than or equal to the number of the number to be checked in the scalar to be searched.

Clause B7, the device according to Clause B1,

The device also includes a storage module for storing the scalar to be searched.

Wherein, the control module includes:

An instruction storage sub-module for storing the scalar search instruction;

Instruction processing sub-module, which is used to parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction;

A queue storage sub-module, which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the scalar search instruction,

Wherein, the control module also includes:

Clause B8. A machine learning computing device, the device comprising:

One or more scalar search instruction processing devices as described in any one of Clause B1-Clause B7, used to obtain the data and control information to be calculated from other processing devices, and perform the specified machine learning operation, and pass the execution result /O interface is passed to other processing devices;

Clause B9. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause B8;

Article B10. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause B8 or the combined processing device according to clause B9.

Article B11. An electronic device, the electronic device comprising:

Machine learning chip as described in clause B10.

Clause B12, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause B10;

The storage device is used to store data;

Article B13. A scalar search instruction processing method. The method is applied to a scalar search instruction processing device. The method includes:

In turn, determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, and determine that the to-be-checked numbers whose value is equal to the specified value and sorted into the specified sort are the target number, and then determine The storage address of the target number is stored in the target address as a search result,

Wherein, the operation code is used to indicate that the operation performed by the scalar search instruction on the data is a search operation, and the operation domain includes the scalar address to be searched and the target address.

Clause B14. The method according to Clause B13, the operation field further includes an input length,

Wherein, determining the scalar to be searched, the specified value, the specified order and the target address required to execute the scalar search instruction according to the operation code and the operation domain include:

According to the input length, obtain the scalar to be searched from the scalar address to be searched.

Clause B15. According to the method described in Clause B13, the operation domain further includes a specified value and a specified order,

According to the operation domain, the specified value and the specified order are determined.

Clause B16. According to the method described in Clause B13, determine the scalar to be searched, the specified value, the specified order, and the target address required to execute the scalar search instruction based on the operation code and the operation domain, including:

The specified value and the specified order are determined according to the operation code, and the operation code is also used to indicate the specified value and the specified order of the scalar search instruction.

Clause B17. According to the method described in Clause B13, sequentially determine whether the values of the plurality of to-be-checked numbers representing the to-be-searched scalar are equal to the specified value, including:

Clause B18. The method according to any one of Clause B13-B17, the specified ranking includes at least one of the following:

Clause B19, according to the method described in Clause B13,

The method further includes: storing the scalar to be found,

Wherein, the received scalar search instruction is parsed to obtain the operation code and operation domain of the scalar search instruction, including:

Store the scalar search instruction;

Parse the scalar search instruction to obtain the operation code and operation domain of the scalar search instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the scalar search instruction,

Wherein, the method further includes:

Since neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing. In the process of using the neural network algorithm for data operation, it is necessary to frequently lock and release resources to ensure the reasonable use of resources. In the related art, the speed and efficiency of the way of locking and releasing resources are difficult to match the resource lock requirements during data calculation, and the lock speed is slow and the efficiency is low.

FIG. 4-1 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 4-1, the device includes a control module 11-4 and an arithmetic module 12-4.

The control module 11-4 is used to parse the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, And determine the lock strategy required for resource lock processing. The operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.

The operation module 12-4 is used to lock or release the resource to be processed according to the lock-and-release strategy to obtain the processed resource.

In this embodiment, the lock and release strategy may indicate the manner of processing the resource to be processed, including locking the resource to be processed and releasing the resource to be processed. The control module may determine the resource to be processed according to the identifier of the resource to be processed. The identifier of the resource to be processed may be information such as a number and a name that identify the resource to be processed. The control module can obtain the resource lock instruction and the resource to be processed through the data input/output unit. The data input/output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, an operation code and an operation domain can be registered for a resource lock instruction. The operation code may be a part of an instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device executing the instruction which instruction needs to be executed. The operation domain may be the source of all data or resources required to execute the corresponding instruction. All data or resources required to execute the corresponding instruction include the resource to be processed, the corresponding lock and put strategy, and so on. For example, the operation domain may include at least a resource identifier to be processed.

It should be understood that, those skilled in the art can set the instruction format of the resource lock instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive a resource lock instruction and control one or more processing modules to perform lock or release processing. When the device includes multiple control modules, the multiple control modules may respectively receive resource lock and release instructions, and control the corresponding one or more processing modules to perform locking or releasing processing.

The resource lock instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to analyze the received resource lock instruction, obtain the operation code and operation domain of the resource lock instruction, and determine the resource to be processed indicated by the resource lock instruction according to the operation code and operation domain, and determine the resource to be processed Locking strategy required for lock handling. The processing module is used for locking or releasing the resources to be processed according to the lock and put strategy, to obtain the processed resources. The resource lock instruction processing device provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of locking and releasing resources according to the resource lock instruction is high and the processing speed is fast.

In a possible implementation manner, the lock-and-release strategy may include at least one of locking resources to be processed and releasing resources to be processed. Among them, the task to be processed cannot be assigned after the resource to be processed is locked, and the task can be assigned to after the resource to be processed is released.

In this implementation, the code in the resource lock instruction can be set for different lock and put strategies. For example, in the resource lock instruction, "lock the resource to be processed" can be represented by the code PV0, and "release the resource to be processed" can be Represented by code PV1. A person skilled in the art may set the lock and release strategy and the code of the lock and release strategy according to actual needs, and the disclosure does not limit this.

In a possible implementation, the operation domain can also be used to indicate the lock and release strategy.

In a possible implementation, the operation code can also be used to indicate the lock and release strategy.

In a possible implementation manner, a default lock and release strategy may be preset. When the control module cannot determine the lock strategy according to the operation domain and operation code of the resource lock instruction, the default lock strategy can be determined as the lock strategy of the current resource lock instruction.

In a possible implementation manner, the resources to be processed may include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.

The IPU resource may be a storage resource of an IPU (Image Processing Unit). The GPU resource may be a storage resource of a GPU (Graphics Processing Unit). CPU resources are storage resources that can be CPU (Central Processing Unit, central processing unit). The memory access resource may be a memory resource such as a memory of the device that can be accessed by the resource lock instruction processing device. Those skilled in the art can set the resources to be processed according to actual needs, and the present disclosure does not limit this.

4-2 shows a block diagram of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 4-2, the device may further include a storage module 13-4. The storage module 13-4 is used to store the resource identifier to be processed.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The resources to be processed can be stored in the memory, cache, and/or registers in the storage module according to needs, which is not limited in this disclosure.

In a possible implementation, as shown in FIG. 4-2, the control module 11-4 may include an instruction storage submodule 111-4, an instruction processing submodule 112-4, and a queue storage submodule 113-4.

The instruction storage submodule 111-4 is used to store resource lock and release instructions.

The instruction processing submodule 112-4 is used to parse the resource lock instruction and obtain the operation code and operation domain of the resource lock instruction.

The queue storage submodule 113-4 is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include resource lock and release instructions. The plurality of instructions to be executed may include other calculation instructions related to the resource lock instruction.

In a possible implementation, as shown in FIG. 4-2, the control module 11-4 may further include a dependency processing sub-module 114-4.

The dependency processing sub-module 114-4 is used to determine the dependency relationship processing sub-module 114- when the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction have a dependency relationship. 4 The first instruction to be executed can be cached in the instruction storage submodule 112-4, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule 112-4 and sent to the processing module 12- 4. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

The dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval to store data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas. Conversely, there is no dependency relationship between the first instruction to be executed and the zeroth instruction to be executed may be that there is no overlapping area between the first storage address interval and the zeroth storage address interval.

In a possible implementation manner, the instruction format of the resource lock instruction may be:

PV sign

Among them, PV is the operation code, and sign and type are the operation domains. PV is used to indicate that the instruction is a resource lock instruction. sign Pending resource identification. The type is the lock and put strategy, the type of "locking resources to be processed" is PV0, and the type of "releasing resources to be processed" is PV1.

In a possible implementation manner, the instruction format of the resource lock instruction may also be:

PVxsign

Among them, PVx is the operation code, and sign is the operation domain. PVx is used to indicate that the instruction is a resource lock instruction. sign is the identifier of the resource to be processed. The x in PVx can indicate the lock and put strategy, "lock pending resources" x is 0, and "release pending resources" x is 1.

It should be understood that, those skilled in the art can set the operation code of the resource lock policy instruction, the operation code in the instruction format, and the position of the operation domain according to needs, and this disclosure does not limit this.

It should be noted that although the above embodiment is taken as an example to introduce the resource lock instruction processing device as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with “using a resource lock and release instruction processing device to perform lock and release processing on a resource to be processed” as an exemplary application scenario, so as to facilitate understanding of the flow of the resource lock and release instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

4-3a-FIG. 4-3b illustrate schematic diagrams of application scenarios of a resource lock instruction processing apparatus according to an embodiment of the present disclosure. As shown in FIGS. 4-3a to 4-3b, the resource lock instruction processing device processes the resource lock instruction as follows.

Example 1-4

As shown in Figure 4-3a, when the control module 11-4 receives the resource lock instruction 1 (eg PV0r1), it parses the resource lock instruction 1 to obtain the operation code and operation domain of the resource lock instruction 1 . The operation code of the resource lock instruction 1 is PV0, that is, the resource to be processed is locked. And the identifier of the resource to be processed can be determined as r1 according to the operation domain. Furthermore, the control module 11-4 can determine the resource 1 to be processed according to the resource identifier r1 to be processed.

The processing module 12-4 locks the resource 1 to be processed according to the lock and release strategy PV0, and obtains the processed resource 1'. The processed resource 1'is in a locked state and cannot be assigned a task.

Example 2-4

As shown in Figure 4-3b, when the control module 11-4 receives the resource lock instruction 2 (eg PV1), it parses the resource lock instruction 2 to obtain the operation code and operation domain of the resource lock instruction 2 . The operation code of the resource lock and put instruction 1 is PV1, and according to the operation code PV1, it can be determined that the lock and put strategy is to release the resource to be processed. According to the operation domain, it can be determined that the resource identifier to be processed is r2. Furthermore, the control module 11-4 may determine the resource to be processed 2 according to the resource identifier to be processed r2.

The processing module 12-4 releases the resource 2 to be processed according to the lock and release strategy PV1 to obtain the processed resource 2'. The processed resource 2'is in an idle state and can be assigned tasks.

For details of the above process, please refer to the relevant description above.

In this way, the resource lock instruction processing device can quickly and efficiently perform lock processing on the resource according to the resource lock instruction.

4-4 shows a flowchart of a resource lock instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 4-4, this method is applied to the above resource lock instruction processing device. The method includes step S51-4 and step S52-4.

In step S51-4, the received resource lock instruction is parsed to obtain the operation code and operation domain of the resource lock instruction, and the resource to be processed indicated by the resource lock instruction is determined according to the operation code and operation domain, and Determine the lock strategy required for resource lock processing. The operation code is used to indicate that the processing performed by the resource lock instruction on the resource is locking or releasing processing, and the operation domain includes the identifier of the resource to be processed.

In step S52-4, according to the lock and release strategy, the resource to be processed is locked or released to obtain the processed resource.

In a possible implementation manner, the method may further include: storing the identifier of the resource to be processed.

In a possible implementation manner, parsing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction may include:

Storage resource lock instruction;

Analyze the resource lock instruction to obtain the operation code and operation domain of the resource lock instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include resource lock and release instructions.

In a possible implementation manner, the method may further include:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions has a dependency relationship with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and after determining that the execution of the zeroth to-be-executed instruction is completed , Control the execution of the first instruction to be executed,

The dependency relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

It should be noted that although the above embodiment is taken as an example to introduce the resource lock instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The processing method of the resource lock and release instruction provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of locking and releasing resources according to the resource lock and release instruction is high and the processing speed is fast.

The foregoing can be better understood based on the following terms:

Clause C1, a resource lock instruction processing device, the device includes:

Clause C2. The device according to Clause C1, the operation domain is also used to indicate a lock and release strategy.

Clause C3. The device according to Clause C1, the operation code is further used to indicate the lock and release strategy.

Clause C4. The device according to Clause C1, the lock and put strategy includes at least one of locking the resource to be processed and releasing the resource to be processed,

Wherein, the resources to be processed cannot be assigned tasks after being locked, and the resources to be processed can be assigned tasks after being released.

Clause C5. The apparatus according to Clause C1, the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.

Clause C6. The device according to Clause C1,

The device also includes a storage module for storing the to-be-processed resource identifier,

Wherein, the control module includes:

An instruction storage submodule, used to store the resource lock instruction;

An instruction processing submodule, used for parsing the resource lock instruction, and obtaining an operation code and an operation domain of the resource lock instruction;

A queue storage sub-module, which is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the resource lock and release instruction

Wherein, the control module also includes:

The dependency processing sub-module is used to determine the first pending instruction when there is a dependency relationship between the first pending instruction in the plurality of pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the processing module,

Clause C7. A machine learning computing device, the device comprising:

One or more resource lock instruction processing devices as described in any one of Clause C1 Clause C6, used to obtain data and control information to be calculated from other processing devices, and perform designated machine learning operations, and pass the execution result through I /O interface is passed to other processing devices;

Clause C8. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause C7;

Clause C9. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause C7 or the combined processing device according to clause C8.

Clause C10. An electronic device, the electronic device comprising:

Machine learning chip as described in clause C9.

Clause C11, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause C9;

The storage device is used to store data;

Clause C12. A method for processing a resource lock instruction. The method is applied to a device for processing a resource lock instruction. The method includes:

Clause C13. According to the method of Clause C12, the operation field is also used to indicate a lock and release strategy.

Clause C14. The method according to Clause C12, the operation code is also used to indicate the lock-and-release strategy.

Clause C15. The method according to Clause C12, the lock-and-release strategy includes at least one of locking the resource to be processed and releasing the resource to be processed,

Clause C16. The method according to Clause C12, the resources to be processed include at least one of IPU resources, GPU resources, CPU resources, and memory access resources.

Clause C17, according to the method described in Clause C12,

The method further includes: storing the resource identifier to be processed,

Wherein, analyzing the received resource lock instruction to obtain the operation code and operation domain of the resource lock instruction includes:

Store the resource lock instruction;

Parse the resource lock instruction to obtain the operation code and operation domain of the resource lock instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the resource lock and release instruction,

Wherein, the method further includes:

When it is determined that the first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, the first instruction to be executed is cached, and the After the execution of the zeroth to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,

Since neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing. Among them, tensor is a relatively common data form in neural network algorithm, which is composed of numbers and/or characters. Since tensors have different dimensions, the existence of tensors meets the representation needs of various types of data in neural network algorithms. For example, 0-dimensional tensors can be used to represent scalars, 1-dimensional tensors can be used to represent vectors, and 2-dimensional tensors can be used. Represents a matrix, represents a time series by a 3-dimensional tensor, represents an image by a 4-dimensional tensor, represents a video by a 5-dimensional tensor, and so on. The processing of tensors in the neural network algorithm includes the rearrangement of tensors. In the related art, multiple instructions are required to achieve the rearrangement of tensor data, which is inefficient and slow.

FIG. 5-1 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-1, the device includes a control module 11-5 and a processing module 12-5.

The control module 11-5 is used to parse the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the required tensor rearrangement instruction according to the operation code and operation domain The tensor and target address to be processed, and the rearrangement strategy required for the rearrangement process. Among them, the operation code is used to instruct the processing performed by the tensor rearrangement instruction on the tensor data to be rearrangement processing, and the operation domain includes the to-be-processed tensor address and the target address.

The processing module 12-5 is configured to perform rearrangement processing on the tensor to be processed according to the rearrangement strategy to obtain the rearrangement tensor, and store the rearrangement tensor into the target address.

In this embodiment, the tensor can contain multiple forms of data composition. The more common tensor is the matrix form. The tensor can be of different orders. For example, the scalar can be regarded as a 0-dimensional tensor, and the vector can be regarded as One-dimensional tensors, and tensors of more than two dimensions are two-dimensional or multi-dimensional matrices. Tensor rearrangement refers to the method of rearranging tensors to obtain rearranged tensors. Among them, the method of rearranging tensors can be based on a certain dimension as the priority for tensor rearrangement, or can be based on a few dimensions To prioritize the rearrangement of tensors, taking 2-dimensional tensors as an example, the rearrangement of 2-dimensional tensors may include one or more of rearrangement by row, column, block, etc. Pcs. Among them, rearrangement by row can refer to data in input and/or output tensors according to row first, rearrangement by column can refer to data in input and/or output tensors according to column first, rearrangement by block can Refers to data input and/or output tensors in a block-first manner. The method of rearranging tensors can be defined by a rearrangement strategy. The rearrangement strategy can indicate the relevant parameters for rearranging tensors, including input tensors according to rows, columns, or blocks. The tensor is output in the form of a block or a block, and the size of the block or more than two dimensions when input or output is performed in blocks or more than two dimensions.

In this embodiment, different codes can be set for different rearrangement strategies to distinguish different rearrangement strategies. A person skilled in the art can set the rearrangement strategy and the code of the rearrangement strategy according to actual needs, which is not limited in the present disclosure.

In this embodiment, the control module may obtain the to-be-processed tensor from the to-be-processed tensor address. The to-be-processed tensor address may be a physical address such as the first address storing the to-be-processed tensor, or may be a logical address or a linear address. The control module can store the rearrangement tensor in the target address. The target address may be a physical address such as the first address storing the rearrangement tensor, or a logical address or a linear address. The present disclosure does not limit the way in which tensor addresses and target addresses are treated. The control module may obtain a tensor rearrangement instruction and a tensor to be processed through a data input/output unit. The data input/output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, for a tensor rearrangement instruction, an operation code and an operation field may be included. The operation code may be a pre-configured instruction sequence number, which is used to inform the device executing the instruction which instruction needs to be executed. The operation domain may include the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the tensor to be processed and the corresponding rearrangement strategy, or to store the tensor to be processed and the corresponding rearrangement strategy. Address and so on. For example, the operation domain may include a tensor address to be processed and a target address.

It should be understood that those skilled in the art can set the instruction format of the tensor rearrangement instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a tensor rearrangement instruction and control one or more processing modules to perform rearrangement processing. When the device includes multiple control modules, the multiple control modules may respectively receive tensor rearrangement instructions and control the corresponding one or more processing modules to perform rearrangement processing.

A tensor rearrangement instruction processing device provided by an embodiment of the present disclosure includes a control module and a processing module. The control module is used to analyze the received tensor rearrangement instruction, obtain the operation code and operation domain of the tensor rearrangement instruction, and determine the pending tensor required to execute the tensor rearrangement instruction according to the operation code and operation domain And the target address, and determine the rearrangement strategy required for rearrangement processing. The processing module is used to rearrange the processing tensor according to the rearrangement strategy to obtain the rearrangement tensor, and store the rearrangement tensor into the target address. Tensor data can be rearranged by a single tensor rearrangement instruction. Compared with the process of implementing tensor data rearrangement by multiple instructions in the related art, rearrangement of tensor data is highly efficient 1. Fast processing speed and wide application range.

In a possible implementation, the operation domain may further include at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor. The processing module 12-5 is further configured to At least one, and a rearrangement strategy, perform a rearrangement process on the tensor to be processed to obtain a rearrangement tensor.

In a possible implementation, the operation domain may also include the shape of the tensor to be processed and/or the shape of the rearranged tensor. The "shape" of the tensor can be the dimensions of the tensor to be processed and different dimensions. It is represented by the number of digits and/or characters present. For example, the shape of the tensor to be processed may represent the dimension of the tensor to be processed and the number of numbers and/or characters present in different dimensions. The shape of the rearrangement tensor may be a dimension representing the rearrangement tensor and the number of numbers and/or characters present in different dimensions.

For example, assuming a certain tensor to be processed [(1,2),(3,4),(5,6),(7,8)], the shape of the tensor to be processed is (2,4) , Which means that the to-be-processed tensor is a two-dimensional tensor with 2 rows and 4 columns.

Assuming that if the rearrangement strategy is input by row first, output by column first, and the output shape is (4, 2), the rearrangement of the to-be-processed tensor can be: input by row first [1, 3, 5 ,7,2,4,6,8], and then output it by column priority to get the rearrangement tensor [(1,3,5,7),(2,4,6,8)], the rearrangement The shape of the quantity is (4, 2), that is, the rearrangement tensor is a two-dimensional tensor with 4 rows and 2 columns.

Assuming that if the rearrangement strategy is column-first input, column-first output, and the output shape is (2, 4), rearrangement of the to-be-processed tensor can be: column-first input gets [1,2,3 ,4,5,6,7,8], and then output by column priority to get the rearrangement tensor [(1,2,3,4),(5,6,7,8)]. The shape of the rearrangement tensor is (4, 2), that is, the rearrangement tensor is a two-dimensional tensor with 4 rows and 2 columns.

Assuming that the rearrangement strategy is to input by line and output by block (assuming that the size of the block is (2, 2), when output by block, the block is output by block line), and the output shape is (2, 4), The rearrangement processing of the to-be-processed tensor can be: first input by line to obtain [1,3,5,7,2,4,6,8], and then output by block priority of (1,2) to obtain rearranged The amount [(1,5,2,6),(3,7,4,8)]. The shape of the rearrangement tensor is (2, 4), that is, the rearrangement tensor is a two-dimensional tensor with 4 rows and 2 columns.

In a possible implementation, the default input shape of the tensor to be processed can be preset. When the input shape of the tensor to be processed is not included in the operation domain, the default input shape of the tensor to be processed can be determined as the input shape of the tensor to be processed of the current tensor rearrangement instruction.

In a possible implementation, the default output shape of the rearranged tensor may be preset. When the output shape of the rearranged tensor is not included in the operation domain, the default output shape of the rearranged tensor can be determined as the output shape of the rearranged tensor of the current tensor rearrangement instruction.

In a possible implementation, the dimension of the tensor to be processed and the dimension of the rearranged tensor may be different.

In this implementation, the dimensions of the to-be-processed tensor and the rearranged tensor may also be the same. The dimension of the tensor to be processed and the dimension of the rearranged tensor can be set according to actual needs, and this disclosure does not limit this.

For example, the input tensor with shape (2,8) is as follows:

[(1,9),(2,10),(3,11),(4,12),(4,13),(6,14),(7,15),(8,16)]

Assuming that the output shape is (2,2,4), and the rearrangement strategy is to prioritize input by column and output by three dimensions in turn, the rearrangement of the to-be-processed tensor can be: input by row first [1] [1,9,2,10,3,11,4,12,4,13,6,14,7,15,8,16], and then prioritize the output in three dimensions to get the rearrangement tensor [[( 1,2,3,4),(5,6,7,8)],[(9,10,11,12),(13,14,15,16)]].

In a possible implementation, the operation domain may also be used to indicate the rearrangement strategy.

In a possible implementation, the operation code can also be used to indicate the rearrangement strategy.

In a possible implementation, a default rearrangement strategy can also be set. When the rearrangement strategy of the current tensor rearrangement instruction cannot be determined according to the operation domain and the operation code, the default rearrangement strategy can be determined as the rearrangement strategy of the current tensor rearrangement instruction.

5-2 shows a block diagram of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 5-2, the device may further include a storage module 13-5. The storage module 13-5 is used to store the tensor to be rearranged.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The tensor to be rearranged can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in the present disclosure.

In a possible implementation, as shown in FIG. 5-2, the control module 11-5 may include an instruction storage submodule 111-5, an instruction processing submodule 112-5, and a queue storage submodule 113-5.

The instruction storage submodule 111-5 is used to store tensor rearrangement instructions.

The instruction processing sub-module 112-5 is used to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction.

The queue storage sub-module 113-5 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include tensor reordering instructions. The plurality of instructions to be executed may include other calculation instructions related to the tensor rearrangement instruction.

In a possible implementation, as shown in FIG. 5-2, the control module 11-5 may further include a dependency processing sub-module 114-5.

When it is determined that the first instruction to be executed among the plurality of instructions to be executed has a dependency relationship with the zeroth instruction to be executed before the first instruction to be executed, the dependency processing submodule 114-5 may cache the first instruction to be executed in the instruction In the storage submodule 112-5, after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule 112-5 and sent to the processing module 12-5. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

In a possible implementation, the instruction format of the tensor rearrangement instruction may be:

Tiling dst src type src_shape dst_shape

Among them, Tiling is the operation code, dst, src, type, src_shape, dst_shape are the operation domain. Tiling is used to indicate that the instruction is a tensor rearrangement instruction. dst is the target address. src is the address of the tensor to be processed. type is the rearrangement strategy. src_shape is the input shape. dst_shape is the output shape.

Tiling.type dst src src_shape dst_shape

Among them, Tiling.type is the operation code, dst, src, src_shape, dst_shape are the operation domain. Tiling in Tiling.type is used to indicate that the instruction is a tensor rearrangement instruction, and type in Tiling.type is a rearrangement strategy. dst is the target address. src is the address of the tensor to be processed. src_shape is the input shape. dst_shape is the output shape.

It should be understood that those skilled in the art can set the operation code of the tensor rearrangement instruction, the operation code in the instruction format, and the position of the operation domain according to needs, which is not limited in this disclosure.

It should be noted that although the above embodiment is used as an example to introduce the tensor rearrangement instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following uses "Tensor Rearrangement Instruction Processing Apparatus for Rearrangement of Tensors to be Processed" as an exemplary application scenario, and gives an application example according to an embodiment of the present disclosure to facilitate understanding of the flow of the tensor rearrangement instruction processing apparatus . Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

5-3 shows a schematic diagram of an application scenario of a tensor rearrangement instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 5-3, the tensor rearrangement instruction processing device processes the tensor rearrangement instruction as follows.

Example 1-5

When the control module 11-5 receives the tensor rearrangement instruction 1 (such as:

Tiling

200, 100, type, S1, and S2), it parses the tensor rearrangement instruction 1 to obtain the operation code and operation field of the tensor rearrangement instruction 1. The operation code of the tensor rearrangement instruction 1 is Tiling. And it can be determined according to the operation domain: the rearrangement strategy is type, the tensor address to be processed is 100, the input shape is S1, the target address is 200, and the output shape is S2. Furthermore, the control module 11-5 acquires the to-be-processed tensor a whose input shape is S1 from the to-be-processed tensor address 200.

The processing module 12-5 performs rearrangement processing on the to-be-processed tensor a according to the rearrangement strategy type, the input shape, and the output shape to obtain the rearrangement tensor b, and stores the rearrangement tensor b into the target address 200.

Among them, the tensor reordering instruction 1 can be Tiling 200, type 100, S1, S2, or

Tiling.type

200, 100, S1, S2. The processing procedure of the tensor reordering commands in different command formats is similar and will not be repeated here.

In this way, the tensor rearrangement instruction processing device can quickly and efficiently process the tensor rearrangement instruction to complete the process of rearranging the tensor.

5-4 shows a flowchart of a tensor reordering instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 5-4, this method is applied to the above-mentioned tensor rearrangement instruction processing apparatus. The method includes step S51-5 and step S52-5.

In step S51-5, the received tensor rearrangement instruction is parsed to obtain the operation code and operation domain of the tensor rearrangement instruction, and according to the operation code and operation domain to determine the execution of the tensor rearrangement instruction Processing tensors and target addresses, and determining the rearrangement strategy required for rearrangement processing. Among them, the operation code is used to instruct the processing performed by the tensor rearrangement instruction on the tensor data to be rearrangement processing, and the operation domain includes the to-be-processed tensor address and the target address.

In step S52-5, rearrangement processing is performed on the tensor to be processed according to the rearrangement strategy to obtain the rearrangement tensor, and the rearrangement tensor is stored in the target address.

In a possible implementation manner, the operation domain may further include at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor. Wherein, rearranging the to-be-processed tensor according to the rearrangement strategy to obtain the rearrangement tensor may include: rearranging the to-be-processed tensor according to at least one of the input shape and the output shape and the rearrangement strategy to obtain Rearrange tensors.

In a possible implementation, the method may further include: storing the tensor to be processed.

In a possible implementation manner, parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction may include:

Store tensor rearrangement instructions;

Analyze the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include tensor reordering instructions.

In a possible implementation manner, the method may further include:

The dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval to store data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas.

It should be noted that, although the above embodiment is taken as an example to introduce the tensor rearrangement instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The tensor rearrangement instruction processing method provided by the embodiment of the present disclosure can realize the rearrangement processing of tensor data through one tensor rearrangement instruction, and the related art realizes the rearrangement processing of tensor data through multiple instructions Compared with the process of, the rearrangement of tensor data has high processing efficiency, fast processing speed, and wide application range.

The foregoing can be better understood based on the following terms:

Clause D1, a tensor rearrangement instruction processing device, the device comprising:

Clause D2. The device according to Clause D1, the operation domain further includes at least one of an input shape of a tensor to be processed and an output shape of a rearranged tensor,

The processing module is further configured to perform rearrangement processing on the to-be-processed tensor according to at least one of the input shape and the output shape and the rearrangement strategy to obtain the rearrangement tensor.

Clause D3. The apparatus according to Clause D1, the dimension of the to-be-processed tensor is different from the dimension of the rearrangement tensor.

Clause D4. The apparatus according to Clause D1, the operation field is further used to indicate a rearrangement strategy.

Clause D5. The apparatus according to Clause D1, the operation code is further used to indicate the rearrangement strategy.

Clause D6, the device according to Clause D1,

The device further includes a storage module for storing the to-be-processed tensor,

Wherein, the control module includes:

An instruction storage sub-module for storing the tensor rearrangement instruction;

An instruction processing sub-module, which is used to parse the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction;

A queue storage sub-module, which is used to store an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed include the tensor reordering instructions

Wherein, the control module also includes:

Clause D7. A machine learning computing device, the device comprising:

One or more tensor rearrangement instruction processing devices as described in any one of clauses D1 to D5, used to obtain to-be-processed tensors and control information from other processing devices, and perform specified machine learning operations, which will be executed The result is transferred to other processing devices through the I/O interface;

Clause D8. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnect interfaces and other processing devices as described in clause D7;

Wherein, the combined processing device further includes a storage device respectively connected to the machine learning computing device and the other processing device, and used for storing data of the machine learning computing device and the other processing device.

Clause D9. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device described in Item D7 or the combined processing device described in Item D8.

Article D10. An electronic device, the electronic device comprising:

Machine learning chip as described in clause D9.

Clause D11, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause D9;

The storage device is used to store data;

Clause D12. A tensor rearrangement instruction processing method. The method is applied to a tensor rearrangement instruction processing apparatus. The method includes:

Clause D13. The method according to Clause D12, the operation domain further includes at least one of an input shape of the tensor to be processed and an output shape of the rearranged tensor,

Wherein, performing rearrangement processing on the to-be-processed tensor according to the rearrangement strategy to obtain a rearrangement tensor includes:

According to at least one of the input shape and the output shape, and the rearrangement strategy, perform rearrangement processing on the to-be-processed tensor to obtain the rearrangement tensor.

Clause D14. According to the method of Clause D13, the dimension of the to-be-processed tensor is different from the dimension of the rearrangement tensor.

Clause D15. The method according to Clause D12, the operation field is used to indicate a rearrangement strategy.

Clause D16. The method according to Clause D12, the operation code is further used to indicate the rearrangement strategy.

Clause D17, according to the method described in Clause D12,

The method also includes storing the to-be-processed tensor,

Wherein, parsing the received tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction includes:

Store the tensor reordering instruction;

Analyzing the tensor rearrangement instruction to obtain the operation code and operation domain of the tensor rearrangement instruction;

Storing an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to an execution order, the plurality of instructions to be executed including the tensor reordering instruction,

Wherein, the method further includes:

6-1 shows a block diagram of a data processing device according to an embodiment of the present disclosure. The device is used to perform machine learning calculations. As shown in Figure 6-1, the device includes a control module 11-6 and a processing module 12-6. The processing module 12-6 includes a data transfer submodule 121-6 and an accumulation submodule 122-6.

The control module 11-6 is used to obtain calculation instructions and obtain input data required to execute the calculation instructions. The data transfer submodule 121-6 is used to process the input data according to the calculation instruction to obtain a plurality of intermediate results, and send the plurality of intermediate results to the accumulation submodule 122-6 in sequence. The accumulation submodule 122-6 is used to perform a cyclic accumulation operation on a plurality of intermediate results to obtain the calculation result of the calculation instruction.

In this embodiment, the cyclic accumulation operation may be an accumulation result obtained by adding an intermediate result to the "current operation period", and when the intermediate result is added to the "later operation period", the intermediate result is added to the accumulation result Add to get a new accumulation result. "Later operation cycle" may be the first, second, third and other operation cycles after the "current operation cycle". The "after operation cycle" may be the "current operation cycle" according to the computing power of the device and other occasions. The following several calculation cycles are set, which is not limited in this disclosure.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure.

The data processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The processing module includes a data transfer submodule and an accumulation submodule. The control module is used to obtain calculation instructions and obtain input data required to execute the calculation instructions. The data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, and send the multiple intermediate results to the accumulation sub-module in sequence. The accumulation submodule is used to perform a cyclic accumulation operation on multiple intermediate results to obtain the calculation result of the calculation instruction. The data processing device provided by the embodiments of the present disclosure reduces the amount of data access and calculation by cyclically accumulating a plurality of intermediate results, and at the same time ensures the accuracy of calculation without loss, and can effectively increase the data processing speed.

In a possible implementation, the loop accumulation process of the accumulation sub-module can be set according to the actual needs of the device, such as the computing power. Examples of loop accumulation processes in Mode 1 and Mode 2 are given below. It should be noted that those skilled in the art can set the loop accumulation process according to actual needs, which is not limited in the present disclosure.

In a possible implementation manner, for manner 1, the accumulation submodule 122-6 performs a cyclic accumulation operation on multiple intermediate results, which may include:

In the first calculation cycle of receiving the intermediate result, add the intermediate result and the first intermediate data of the first calculation cycle to obtain the first accumulation result;

Store the first accumulation result as the first intermediate data of the next calculation cycle;

Determining the first intermediate data of the second calculation cycle as the calculation result in the second calculation cycle where the intermediate result is not received,

The value of the first intermediate data in the initial calculation cycle is zero.

In this implementation manner, the "first operation cycle when the intermediate result is received" described in the first way may be any operation cycle when the accumulation submodule receives the intermediate result, and "the second operation cycle when the intermediate result is not received" It may be an operation cycle when the accumulation submodule does not receive the intermediate result. "The first calculation cycle of receiving the intermediate result" describes the process of repeated execution of the accumulation submodule, and the "second calculation cycle of not receiving the intermediate result" is the process of finally determining the calculation result of the accumulation submodule. The accumulation sub-module can cyclically execute a plurality of "first operation cycles of receiving intermediate results" and execute a "second operation cycle of not receiving intermediate results", and has completed operations on a plurality of intermediate results.

For example, suppose that multiple intermediate results are 1, 2, and 3, respectively. The accumulation sub-module performs loop accumulation on a pair of multiple intermediate results in the following manner. Among them, the first operation cycle, the second operation cycle and the third operation cycle are equivalent to the "first operation cycle of receiving the intermediate result" in the above manner 1, and the fourth operation cycle is equivalent to the "not The second calculation cycle when the intermediate result is received".

In the first operation cycle, the accumulation submodule receives the intermediate result "1", and adds the intermediate result "1" to the first intermediate data "0" of the first operation cycle to obtain the first An accumulation result "0+1". Then, the first accumulation result "0+1" is stored as the first intermediate data "0+1" of the second calculation cycle (that is, the next calculation cycle).

In the second operation cycle, the accumulation submodule receives the intermediate result "2", and adds the intermediate result "2" to the first intermediate data "0+1" in the second operation cycle to obtain the second operation cycle The first accumulation result of "0+1+2". Then, the first accumulation result "0+1+2" of the second operation cycle is stored as the first intermediate data "0+1+2" of the third operation cycle (that is, the next operation cycle).

In the third operation cycle, the accumulation submodule receives the intermediate result "3", and adds the intermediate result "3" to the first intermediate data "0+1+2" in the third operation cycle to obtain the third The first accumulation result of the operation cycle is "0+1+2+3". Then, the first accumulation result "0+1+2+3" of the third operation cycle is stored as the first intermediate data "0+1+2+3" of the fourth operation cycle (that is, the next operation cycle) .

In the fourth calculation cycle, the accumulation submodule does not receive the intermediate result, and determines the first intermediate data "0+1+2+3" of the fourth calculation cycle as the calculation result.

In a possible implementation manner, for manner 2, the accumulation submodule 122-6 performs a cyclic accumulation operation on a plurality of intermediate results, and may further include:

In the third calculation cycle of receiving the intermediate result, add the intermediate result to the third intermediate data of the third calculation cycle to obtain a second accumulation result;

Storing the second intermediate data of the third operation cycle as the third intermediate data of the next operation cycle, and storing the second accumulation result as the second intermediate data of the next operation cycle;

In the fourth calculation cycle in which the intermediate result is not received, the second intermediate data in the fourth calculation cycle and the third intermediate data in the fourth calculation cycle are added to obtain a calculation result.

The value of the second intermediate data and the third intermediate data in the initial calculation cycle is zero.

In this implementation, the "third operation cycle of receiving the intermediate result" described in the second way may be any operation cycle of the intermediate result received by the accumulation submodule, and "the fourth operation cycle of not receiving the intermediate result" It may be an operation cycle when the accumulation submodule does not receive the intermediate result. "The third operation cycle of receiving the intermediate result" describes the process of repeated execution of the accumulation sub-module, and the "fourth operation cycle of not receiving the intermediate result" is the process of the final determination of the calculation result of the accumulation sub-module. The accumulation sub-module can cyclically execute multiple "third operation cycles of receiving intermediate results" and execute a "fourth operation cycle of not receiving intermediate results", and have completed operations on multiple intermediate results.

For example, suppose that multiple intermediate results are 1, 2, 3, and 4, respectively. The accumulation sub-module performs the cyclic accumulation of multiple intermediate results in the second way as follows. Among them, the first operation cycle, the second operation cycle, the third operation cycle and the fourth operation cycle are equivalent to the "third operation cycle of receiving the intermediate result" in the above manner 2, and the fifth operation cycle is equivalent to In the second way above, "the fourth calculation cycle without receiving the intermediate result".

In the first calculation cycle, the accumulation submodule receives the intermediate result "1", and adds the intermediate result "1" to the third intermediate data "0" of the first calculation cycle to obtain the first calculation cycle. Two accumulation results "0+1". Then, the second intermediate data "0" of the first operation cycle is stored as the third intermediate data of the second operation cycle (that is, the next operation cycle), and the second accumulation result of the first operation cycle "0" "+1" is stored as the second intermediate data of the second calculation cycle (that is, the next calculation cycle).

In the second operation cycle, the accumulation submodule receives the intermediate result "2", and adds the intermediate result "2" to the third intermediate data "0" in the second operation cycle to obtain the second operation cycle. Two accumulation results "0+2". Then the second intermediate data "0+1" of the second operation cycle is stored as the third intermediate data of the third operation cycle (that is, the next operation cycle), and the second accumulation result of the second operation cycle "0+2" is stored as the second intermediate data of the third calculation cycle (that is, the next calculation cycle).

In the third operation cycle, the accumulation submodule receives the intermediate result "3", and adds the intermediate result "3" to the third intermediate data "0+1" of the third operation cycle to obtain the third operation cycle The second accumulation result of "0+1+3". Then, the second intermediate data "0+2" of the third operation cycle is stored as the third intermediate data of the fourth operation cycle (that is, the next operation cycle), and the second accumulation result of the third operation cycle "0+1+3" is stored as the second intermediate data of the fourth calculation cycle (that is, the next calculation cycle).

In the fourth operation cycle, the accumulation submodule receives the intermediate result "4", and adds the intermediate result "4" to the third intermediate data "0+2" of the fourth operation cycle to obtain the fourth operation cycle The second accumulation result of "0+2+4". Then, the second intermediate data "0+1+3" of the fourth operation cycle is stored as the third intermediate data of the fifth operation cycle (that is, the next operation cycle), and the second intermediate data of the fourth operation cycle The accumulation result "0+2+4" is stored as the second intermediate data of the fifth calculation cycle (that is, the next calculation cycle).

In the fifth operation cycle, the accumulation submodule determines that the intermediate result is not received, and the second intermediate number "0+2+4" of the fifth operation cycle and the third intermediate data "0+" of the fifth operation cycle Add 1+3" to get the second accumulation result "0+1+2+3+4" in the fifth calculation cycle. The second accumulation result "0+1+2+3+4" of the fifth calculation cycle is determined as the calculation result.

In a possible implementation, the machine learning calculation may include artificial neural network operations, the input data may include input neuron data and weight data, and the calculation result is output neuron data.

In a possible implementation manner, the data type of the input data may include at least one of exponential type and dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.

Among them, the data transfer submodule 121-6 is used to process the input data according to the calculation instruction to obtain multiple intermediate results, which may include: the data transfer submodule is used to shift the weight data or the input neuron data according to the calculation instruction Operation, get intermediate results.

The exponential input data may include exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as exponents represent the value of the exponential input data. The input data of the dynamic fixed-point type can include a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the input data of the dynamic fixed-point in the data stored in the integer, to distinguish the integer part of the data of the integer And the decimal part. Among them, the specified value corresponding to the exponential input data is the same as the carry system of the input data. For example, assuming that the specified value is 2, the input data needs to be binary data. In this way, we can ensure that the input data is shifted.

In this implementation, the input neuron data may be exponential data, while the weight data is dynamic fixed-point data. Or the input neuron data may be dynamic fixed-point data, and the weight data is exponential data. A person skilled in the art may set the types of input neuron data and weight data according to actual needs, which is not limited in the present disclosure.

In this implementation, the shift operation on the weight data or input neuron data according to the calculation instruction may be: when it is determined that the operation performed on the weight data and the input neuron data is a multiplication operation according to the calculation instruction, it may be Through the operation method of shifting the input neuron data or weight data, the purpose of multiplying the weight data and the input neuron data is realized. Among them, the shift operation may be based on the weight data and the exponential data in the input neuron data to determine the number of movements and the direction of movement, and then the weight data and the decimal point position of the dynamic fixed-point data in the input neuron data Move according to the number of digits and direction of movement, and change the value of the data stored in the decimal point to indicate the direction and number of digits of the decimal point, and then determine the calculation result. That is, the values stored in the exponent bits in the weight data and the exponential data in the input neuron data are added to the values in the weight data and the decimal point storage data in the dynamic fixed-point data in the input neuron data to obtain For the addition result, replace the data stored in the decimal point of the original dynamic fixed-point data with the addition result, and then you can get the calculation result of the weight data multiplied by the input neuron data.

In this implementation, the carry system of the input data may be binary, decimal, hexadecimal, etc., which is not limited in this disclosure.

For example, FIG. 6-2 shows a schematic diagram of an application scenario of a data processing device according to an embodiment of the present disclosure. As shown in Figure 6-2, an example of the data transmission channel operating on exponential weight data and dynamic fixed-point input neuron data assumes that the exponential weight data is binary "00001" The decimal number corresponding to the value data is 2 ¹ ). The dynamic fixed-point input neuron data is binary "11001000, 1000" (the decimal number corresponding to the input neuron data is 12.5), in which the first 8 digits are integer digits and the last 4 digits are decimal digits. The control module obtains the above two input data and calculation instructions. When the processing module determines that the calculation of the exponential weight data "00001" and the dynamic fixed-point input neuron data "11001000, 1000" needs to be multiplied according to the calculation instruction, it can be based on the exponential weight data ""00001" determines that the shift operation that needs to be performed on the input neuron data is "the decimal point position is shifted to the right by 1". That is, add the decimal point data "0100" and the weight data "00001" to obtain the new data "0101" that needs to be stored in the new decimal point, and store the new data "0101" to the decimal point of the input neuron data Bit, the exponential weight data is binary "00001" and the dynamic fixed-point input neuron data is binary "11001000, 0100" multiplication result "11001000, 0101" (the decimal number corresponding to the calculation result For 25). Among them, the "," in the dynamic fixed-point input neuron data "11001000, 0100" is to distinguish the integer and decimal points, and the "," may not be set in actual use. The "," in the input data of the dynamic fixed-point type below is the same as here, and will not be explained later.

In a possible implementation manner, the device may further include a first type conversion module. The first type conversion module is used to convert the received data to be processed into first data with a specified value as the base, and generate exponential input data according to the exponent of the first data. Among them, the exponent bit of the exponential input data is used to store the exponent.

In this implementation manner, the exponent of the first data converted by the data to be processed received by the first type conversion module needs to be an integer to ensure that the input data can be shifted. The number of bits occupied by the exponent bits can be set according to actual needs, for example, 5 bits, which is not limited in this disclosure.

In a possible implementation manner, the exponential input data may further include a designated value bit, which is used to mark the designated value of the input data.

In a possible implementation, the exponent bit also includes a sign bit, which is used to indicate whether the data stored in the exponent bit is positive or negative. For example, you can set the exponential input data to occupy 5 bits, the first bit is the sign bit, and the 2nd to 5th bits are the exponent bits. It can be set that when the number stored in the sign bit is 0, the data stored in the exponent bit is positive, and when the number stored in the sign bit is 1, the data stored in the exponent bit is negative.

For example, assume that the received data to be processed is 1024, the specified value is set to 2, and the input data is a binary number. The first type conversion module may convert the data to be processed "1024" into the first data "2 ¹⁰ "with base 2 (specified value). An exponential, binary input data "01010" is generated based on the index "10" of the first data "2 ¹⁰ ". The received data to be processed is 0.5, the specified value is set to 2, and the input data is a binary number. The first type conversion module may convert the data to be processed "0.5" into the first data "2-1" with 2 (specified value) as the base. The exponential binary input data "10001" is generated based on the index "-1" of the first data "2-1".

In a possible implementation manner, the device may further include a second type conversion module. The second type conversion module is used to convert the received data to be processed to obtain second data respectively representing the value of the integer part of the data to be processed and third data representing the value of the decimal part, and according to the second data, the first Three data, and the position of the decimal point of the data to be processed, to generate dynamic fixed-point input data. Among them, the integer bits of the dynamic fixed-point input data are used to store the second data and the third data, and the data stored in the decimal point of the dynamic fixed-point input data are used to mark the decimal point of the data to be processed in the data stored in the integer bits s position.

In this implementation manner, the data to be processed received by the second type conversion module may be a decimal. For example, 123.4 (decimal), etc. You can set the total number of bits occupied by the input data of the dynamic fixed-point type, and the number of bits occupied by the integer and decimal points according to the calculation needs. For example, it can be set that the input data of the dynamic fixed-point type occupies 12 bits, in which the integer bit occupies 8 bits and the decimal point occupies 4 bits. Those skilled in the art can set the total number of bits occupied by the input data of the dynamic fixed-point type and the number of bits occupied by the integer and decimal points according to actual needs, which is not limited in the present disclosure.

For example, assume that the received data to be processed is 24.5, the input data is a binary number, the integer bits occupy 10 bits, and the decimal point bits occupy 4 bits. The second type conversion module may convert the integer part "24" of the data to be processed into binary second data "11000", and convert the fractional part "0.5" of the data to be processed into binary third data "0.1000". It can be determined that the integer position of the input data of the dynamic fixed-point type is stored as "0110001000". Since the position of the decimal point is after the sixth place of "0110001000" stored in the integer position, the position of the decimal point can be represented by "0110". Then, finally, the input data of the dynamic fixed-point type generated by the second type conversion module according to the data to be processed "24.5" is "0110001000, 0110".

6-3 shows a block diagram of a data processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 6-3, the device may further include a storage module 13-6. The storage module 13-6 is used to store the vector to be found.

In a possible implementation, as shown in FIG. 6-3, the control module 11-6 may include an instruction storage submodule 111-6, an instruction processing submodule 112-6, and a queue storage submodule 113-6.

The instruction storage submodule 111-6 is used to store vector search instructions.

The instruction processing sub-module 112-6 is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction.

The queue storage sub-module 113-6 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include vector search instructions. The plurality of instructions to be executed may include other calculation instructions related to the vector search instruction.

In a possible implementation, as shown in FIG. 6-3, the control module 11-6 may further include a dependency processing sub-module 114-6.

The dependency processing sub-module 114-6 is configured to cache the first instruction to be executed when it is determined that there is an association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed In the instruction storage sub-module 112-6, after the execution of the zeroth to-be-executed instruction is completed, the first to-be-executed instruction is extracted from the instruction storage sub-module 112-6 and sent to the processing module 12-6. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

6-4 shows a block diagram of a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-4, the processing module 12-6 may include a master processing sub-module 124 and multiple slave processing sub-modules 125. Each slave processing sub-module 125 may include a data transmission sub-module and an accumulation sub-module (not shown in the figure).

The control module 11-6 is also used to parse the calculation instructions to obtain a plurality of calculation instructions, and send the input data and the plurality of calculation instructions to the main processing sub-module 124.

The main processing sub-module 124 is used for performing pre-processing on input data and transmitting data and operation instructions with a plurality of sub-processing sub-modules 125.

The sub-processing sub-module 125 is used to execute intermediate operations in parallel according to the data and operation instructions transmitted from the main processing sub-module 124 to obtain multiple intermediate results, and transmit the multiple intermediate results to the main processing sub-module 124.

In this implementation, the intermediate operation may be arithmetic, logic, and other operations on the data. Among them, when the input data includes input neuron data and weight data, and the input neuron data and weight data correspond to different types of the above data, if the intermediate operation performed according to the operation instruction is determined to be input neuron data and When the weight data is multiplied, the input neuron data or the weight data can be shifted to obtain an intermediate result.

The main processing sub-module 124 is also used to perform subsequent processing on a plurality of intermediate results to obtain calculation results, and store the calculation results in the target address.

It should be noted that, those skilled in the art can set the connection mode between the main processing sub-module and multiple slave processing sub-modules according to actual needs, so as to realize the architecture setting of the processing module, for example, the architecture of the processing module may be The “H”-type architecture, the array-type architecture, the tree-type architecture, etc. are not limited in this disclosure.

6-5a show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIG. 6-5a, the processing module 12-6 may further include one or more branch processing sub-modules 126, and the branch processing sub-module 126 is used to forward the main processing sub-module 124 and The data and/or arithmetic instructions between the sub-modules 125 are processed. Among them, the main processing sub-module 124 is connected to one or more branch processing sub-modules 126. In this way, the main processing sub-module, the branch processing sub-module and the slave processing sub-module in the processing module are connected with an "H" type architecture, and the data and/or operation instructions are forwarded through the branch processing sub-module, saving the main processing sub-module Of resources, which in turn increases the processing speed of instructions.

6-5b show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation, as shown in FIGS. 6-5b, multiple slave processing sub-modules 125 are distributed in an array.

Each slave processing sub-module 125 is connected to other adjacent slave processing sub-modules 125, and the master processing sub-module 124 is connected to the k slave processing sub-modules 125 of the plurality of slave processing sub-modules 125, and the k slave processing sub-modules 125 are : N slave processing submodules 125 in the first row, n slave processing submodules 125 in the mth row, and m slave processing submodules 125 in the first column.

Among them, as shown in FIG. 6-5b, the k slave processing submodules include only n slave processing submodules in the first row, n slave processing submodules in the mth row, and m slave processing submodules in the first column That is, the k slave processing submodules are slave processing submodules directly connected to the master processing submodule among the multiple slave processing submodules. Among them, k slave processing sub-modules are used for forwarding data and instructions between the master processing sub-module and multiple slave processing sub-modules. In this way, multiple slave processing sub-modules are distributed in an array, which can increase the speed of sending data and/or operation instructions from the master processing sub-module to the slave processing sub-modules, thereby increasing the processing speed of the instructions.

6-5c show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIGS. 6-5c, the processing module may further include a tree-shaped submodule 127. The tree-shaped submodule 127 includes a root port 401 and multiple branch ports 402. The root port 401 is connected to the main processing submodule 124, and the plurality of branch ports 402 are respectively connected to the plurality of slave processing submodules 125. Among them, the tree-shaped submodule 127 has a transceiver function for forwarding data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125. In this way, the processing modules are connected in a tree structure through the role of the tree-shaped submodules, and the forwarding function of the tree-shaped submodules can be used to increase the speed of sending data and/or operation instructions from the main processing submodule to the slave processing submodules, thereby increasing The processing speed of the instruction.

In a possible implementation, the tree-shaped submodule 127 may be an optional result of the device, which may include at least one layer of nodes. The node has a line structure with a forwarding function, and the node itself does not have a computing function. The lowermost node is connected to the slave processing sub-module to forward data and/or operation instructions between the master processing sub-module 124 and the slave processing sub-module 125. In particular, if the tree-shaped submodule has zero-level nodes, the device does not require a tree-shaped submodule.

In a possible implementation, the tree-shaped submodule 127 may include multiple nodes of an n-ary tree structure, and multiple nodes of the n-ary tree structure may have multiple layers.

For example, FIGS. 6-5d show a block diagram of a processing module in a data processing device according to an embodiment of the present disclosure. As shown in FIGS. 6-5d, the n-ary tree structure may be a binary tree structure, and the tree-shaped submodule 127 includes 2-layer nodes 01. The lowermost node 01 is connected to the slave processing submodule 125 to forward data and/or operation instructions between the master processing submodule 124 and the slave processing submodule 125.

In this implementation, the n-ary tree structure may also be a tri-tree structure, etc., where n is a positive integer greater than or equal to 2. A person skilled in the art may set n in the n-ary tree structure and the number of nodes in the n-ary tree structure as needed, and the disclosure does not limit this.

It should be noted that although the above-mentioned embodiment is taken as an example to introduce the data processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

6-6 show a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Figs. 6-6, this method is applied to the above data processing device, and the data processing device is used to perform machine learning calculations. The method includes steps S51-6 to S53-6.

In step S51-6, the calculation instruction is acquired, and the input data required to execute the calculation instruction is acquired.

In step S52-6, the input data is processed according to the calculation instruction to obtain multiple intermediate results, and the multiple intermediate results are issued in sequence.

In step S53-6, a cyclic accumulation operation is performed on a plurality of intermediate results to obtain the calculation result of the calculation instruction.

In a possible implementation manner, performing a cyclic accumulation operation on multiple intermediate results may include:

Add the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle in the fourth operation cycle where no intermediate result is received, to obtain the calculation result,

In a possible implementation, the machine learning calculation may include: artificial neural network operation, and the input data may include: input neuron data and weight data; the calculation result is output neuron data.

In a possible implementation manner, the data type of the input data includes at least one of exponential type and dynamic fixed-point type, and the data types of the input neuron data and the weight data are different.

Wherein, processing the input data according to the calculation instruction to obtain multiple intermediate results may include: performing shift operation on the weight data or input neuron data according to the calculation instruction to obtain the intermediate result.

The exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as exponents represent the value of the exponential input data. The input data of the dynamic fixed-point type includes a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the input data of the dynamic fixed-point in the data stored in the integer, to distinguish the integer part of the data in the integer. decimal part. Among them, the specified value corresponding to the exponential input data is the same as the carry system of the input data.

In a possible implementation manner, obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include: analyzing the calculation instruction to obtain multiple calculation instructions.

Among them, the method may further include:

Perform pre-sequence processing on input data and transfer data and calculation instructions;

Perform intermediate operations in parallel based on the transmitted data and operation instructions to obtain multiple intermediate results;

Perform subsequent processing on multiple intermediate results to obtain the calculation result of the calculation instruction.

In a possible implementation, the method may include: storing input data.

In a possible implementation manner, obtaining the calculation instruction and obtaining the input data required to execute the calculation instruction may include:

Store calculation instructions;

Analyze the calculation instructions to obtain multiple calculation instructions for the calculation instructions;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed includes a plurality of arithmetic instructions;

In a possible implementation manner, acquiring the calculation instruction and acquiring multiple input data required to execute the calculation instruction may further include:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, the first to-be-executed instruction is cached, and after determining that the execution of the zeroth to-be-executed instruction is completed, Controlling the execution of the first instruction to be executed.

The first to-be-executed instruction is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction includes: a first storage address interval storing data required by the first to-be-executed instruction and data required to store the zeroth to-be-executed instruction The zeroth storage address interval has overlapping areas.

The data processing method provided by the embodiments of the present disclosure reduces the amount of data access and calculation by cyclically accumulating a plurality of intermediate results, and at the same time ensures that the accuracy of calculation is non-destructive and can effectively increase the data processing speed.

The foregoing can be better understood based on the following terms:

Clause E1. A data processing device for performing machine learning calculations. The device includes a control module and a processing module. The processing module includes a data transfer submodule and an accumulation submodule:

Clause E2. The apparatus according to Clause E1, the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:

Add the intermediate result to the first intermediate data of the first operation cycle in the first operation cycle of receiving the intermediate result to obtain the first accumulation result;

Storing the first accumulation result as first intermediate data of the next calculation cycle;

Determining the first intermediate data of the second calculation cycle as the calculation result in the second calculation cycle in which the intermediate result is not received,

Clause E3. The apparatus according to Clause E1, the accumulation submodule performs a cyclic accumulation operation on the plurality of intermediate results, including:

Clause E4. The device according to any one of Clauses E1-E3, the machine learning calculation includes: artificial neural network operation, the input data includes: input neuron data and weight data; the calculation result is output Neuron data.

Clause E5. The device according to Clause E4, the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, and the data types of the input neuron data and the weight data are different,

Wherein, the data transfer sub-module is used to process the input data according to the calculation instruction to obtain multiple intermediate results, including:

The data transfer sub-module is used to perform shift operation on the weight data or the input neuron data according to the calculation instruction to obtain an intermediate result,

Wherein, the exponential input data includes exponent bits, and the data obtained by calculating with the specified value as the base and the data stored in the exponent bits as the exponent represent the value of the exponential input data,

The dynamic fixed-point input data includes a decimal point and an integer. The data stored in the decimal point is used to mark the position of the decimal point of the dynamic fixed-point input data in the data stored in the integer to distinguish Integer part and decimal part in the data of the integer position,

The specified value corresponding to the exponential input data is the same as the carry system of the input data.

Clause E6. The apparatus according to Clause E1, the processing module includes a master processing submodule and a plurality of slave processing submodules, the master processing submodule includes the data transfer submodule and the accumulation submodule,

The control module is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the input data and the plurality of operation instructions to the main processing sub-module;

The master processing sub-module is used to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing sub-modules;

The plurality of sub-processing sub-modules are configured to execute intermediate operations in parallel according to data and operation instructions transmitted from the main processing sub-module to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the main processing sub Module

The main processing sub-module is also used to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.

Clause E7, the device according to Clause E1,

The device also includes a storage module for storing the input data;

Wherein, the control module includes:

An instruction storage sub-module for storing the calculation instruction;

An instruction processing sub-module, which is used to analyze the calculation instruction to obtain a plurality of calculation instructions of the calculation instruction;

A queue storage sub-module, which is used to store an instruction queue, the instruction queue including a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the plurality of arithmetic instructions;

Wherein, the control module also includes:

The dependency processing sub-module is used to determine the first pending instruction when there is an association between the first pending instruction among the multiple pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the processing module,

Clause E8. A machine learning computing device, the device comprising:

One or more data processing devices as described in any one of Clause E1-Clause E7, used to obtain data to be calculated and control information from other processing devices, and perform specified machine learning operations, and pass the execution results through I/O The interface is passed to other processing devices;

Clause E9. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause E8;

Clause E10. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause E8 or the combined processing device according to clause E9.

Clause E11. An electronic device, the electronic device comprising:

Machine learning chip as described in clause E10.

Clause E12, a board card, the board card includes: a storage device, an interface device and a control device, and a machine learning chip as described in Clause E10;

The storage device is used to store data;

Clause E13. A data processing method, the method is applied to a data processing device, the device is used to perform machine learning calculations, the method includes:

Clause E14. According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:

Clause E15. According to the method described in Clause E13, performing a cyclic accumulation operation on the plurality of intermediate results, including:

Adding the second intermediate data of the fourth operation cycle to the third intermediate data of the fourth operation cycle in the fourth operation cycle where the intermediate result is not received, to obtain the calculation result,

Clause E16. According to the method described in Clause E13- Clause E15, the machine learning calculation includes: artificial neural network operation, and the input data includes: input neuron data and weight data; the calculation result is output neuron data .

Clause E17. The method according to Clause E16, the data type of the input data includes at least one of an exponential type and a dynamic fixed-point type, the data type of the input neuron data and the weight data is different,

Wherein, processing the input data according to the calculation instruction to obtain multiple intermediate results includes:

Performing shift operation on the weight data or the input neuron data according to the calculation instruction to obtain an intermediate result,

Clause E18. According to the method described in Clause E13, obtain a calculation instruction, and obtain input data required to execute the calculation instruction, including:

Parse the calculation instruction to obtain multiple calculation instructions,

Wherein, the method further includes:

Perform pre-sequence processing on the input data and transfer data and operation instructions;

Perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.

Clause E19, according to the method described in Clause E13,

The method includes: storing the input data;

Wherein, obtaining the calculation instruction, and obtaining the input data required to execute the calculation instruction include:

Store the calculation instruction;

Analyzing the calculation instruction to obtain a plurality of calculation instructions of the calculation instruction;

Storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged in order of execution, and the plurality of instructions to be executed include the plurality of arithmetic instructions;

Wherein, obtaining a calculation instruction, and obtaining a plurality of input data required to execute the calculation instruction, further includes:

When it is determined that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and determine the first After the execution of the zero to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,

Since neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing. Among them, the matrix is a relatively common data form in neural network algorithm, which is composed of numbers and/or characters. The processing of the matrix in the neural network algorithm includes symmetric processing such as axisymmetric and central symmetry. In the related art, it is necessary to perform symmetric processing on the matrix with low efficiency and slow speed.

FIG. 7-1 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure. As shown in Figure 7-1, the device includes a control module 11-7 and a processing module 12-7.

The control module 11-7 is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required for executing the matrix symmetric instruction according to the operation code and the operation domain , And determine the symmetrical strategy required for symmetrical processing. The operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation domain includes the address of the matrix to be processed and the target address.

The processing module 12-7 performs symmetric processing on the processing matrix according to the symmetric strategy to obtain the symmetric matrix, and stores the symmetric matrix into the target address.

In this embodiment, the matrix to be processed may be a data set formed by arranging multiple numbers and/or characters in an array. The symmetric strategy can indicate the symmetric processing required for the matrix to be processed. The symmetric strategy can include the parameters required for the symmetric processing of the matrix to be processed, such as the center of symmetry and the axis of symmetry. The obtained symmetry matrix and the matrix to be processed surround the center of symmetry , Symmetry axis and other symmetry. For example, the symmetry strategy may include center symmetry, axis symmetry, and so on. Among them, codes for matrix symmetric instructions can be set for different symmetric strategies. For example, in matrix symmetric instructions, the symmetric strategy "center symmetry" can be represented by the code csymmetric, and the symmetric strategy "axis symmetry" can be represented by the code asymmetry. A person skilled in the art may set the symmetric strategy and the code of the symmetric strategy according to actual needs, which is not limited in the present disclosure.

For example, suppose the matrix to be processed is [[1,4,7],[5,8,3]]. If the symmetry strategy is determined to be "central symmetry processing" according to the matrix symmetry instruction, then the device performs central symmetry processing on the processing matrix [[1,4,7],[5,8,3]] to obtain the post-symmetric matrix [[ 3,8,5],[7,4,1]]. If the symmetry strategy is determined to be "axisymmetric processing" according to the matrix symmetry instruction, the device can obtain the post-symmetric matrix [[1,4,7],[5,8,3]] after center-symmetric processing. 5,8,3],[1,4,7]].

In this embodiment, the control module may obtain the matrix to be processed from the address of the matrix to be processed. The address of the matrix to be processed may be a physical address such as the first address storing the matrix to be processed, or may be a logical address or a linear address. The control module may store the matrix to be processed in the target address. The target address may be a physical address such as the first address of the symmetric matrix, or a logical address or a linear address. The present disclosure does not limit the way of expressing the processing matrix address and the target address. The control module can obtain the matrix symmetry instruction and the matrix to be processed through the data input and output unit. The data input and output unit can be one or more data I/O interfaces or I/O pins.

In this embodiment, for a matrix alignment instruction, an operation code and an operation field may be included. The operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain can be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the matrix to be processed, the corresponding symmetric strategy, or the address to store the matrix to be processed, the corresponding symmetric strategy, etc. . For example, the operation domain may include a to-be-processed matrix address and a target address.

It should be understood that a person skilled in the art may set the instruction format of the matrix symmetric instruction, as well as the included operation codes and operation domains as required, which is not limited in this disclosure.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive matrix symmetric commands and control one or more processing modules to perform symmetric processing. When the device includes multiple control modules, the multiple control modules may respectively receive matrix symmetric instructions and control the corresponding one or more processing modules to perform symmetric processing.

The matrix symmetric instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to parse the received matrix symmetric instruction, obtain the operation code and operation domain of the matrix symmetric instruction, and determine the matrix to be processed and the target address required to execute the matrix symmetric instruction according to the operation code and operation domain, and determine the progress Symmetry strategy required for symmetric processing. The processing module is used to perform symmetric processing on the processing matrix according to a symmetric strategy to obtain the symmetric matrix, and store the symmetric matrix into the target address. The matrix symmetric instruction processing device provided by the embodiments of the present disclosure has a wide application range, and has a high processing efficiency and a fast processing speed for performing symmetric processing on the matrix.

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be processed. Among them, the processing module 12-7 can also be used to perform symmetric processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetric matrix.

In this implementation manner, it is convenient to perform symmetric processing on the matrix according to the input shape of the matrix to be processed, and the shape of the symmetric matrix can also be determined according to the input shape of the matrix to be processed. The shape of the matrix can be represented by the number and/or characters of the matrix to be processed in rows and columns. For example, the matrix 1 to be processed is [[1,0,1],[0,1,0],[-1,0,-1]], and the shape of the matrix 1 to be processed is 3×3, that is, the Matrix 1 to be processed consists of 3 rows and 3 columns and is composed of 9 digits.

In a possible implementation manner, the default input shape of the matrix to be processed may be preset. When the input shape of the matrix to be processed is not included in the operation domain, the default input shape of the matrix to be processed may be determined as the input shape of the matrix to be processed of the current matrix symmetric instruction.

In a possible implementation, the output shape of the matrix after the symmetry may also be included in the operation domain. Among them, the processing module 11-7 can also be used to perform symmetric processing on the matrix to be processed according to the output shape and the symmetric strategy to obtain the symmetric matrix.

In this implementation, the output shape may be the shape of a symmetric matrix. For example, the matrix 2 after symmetry is [[1,0],[0,1],[-1,0]], the shape of the matrix after symmetry is 2×3, that is, the matrix 2 after symmetry is 2 rows, 3 columns, composed of 6 numbers.

In a possible implementation, the default output shape of the symmetric matrix can be preset. When the output shape of the symmetric matrix is not included in the operation domain, the default output shape of the symmetric matrix can be determined as the output shape of the symmetric matrix after the symmetric instruction of the current matrix.

In a possible implementation, the operation domain can also be used to indicate a symmetric strategy.

In a possible implementation, the operation code can also be used to indicate a symmetric strategy.

In a possible implementation, the symmetric strategy may be determined according to the operation code or operation domain of the matrix symmetric instruction. The default symmetry strategy of the matrix to be symmetric can also be preset. When the symmetric strategy of the matrix to be symmetric is not included in the operation domain, the default symmetric strategy of the matrix to be symmetric can be determined as the symmetric strategy of the symmetric instruction of the current matrix.

7-2 shows a block diagram of a matrix symmetric instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 7-2, the device may further include a storage module 13-7. The storage module 13-7 is used to store the matrix to be processed.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The matrix to be processed can be stored in the memory, cache, and/or register in the storage module as needed, which is not limited in this disclosure.

In a possible implementation, as shown in FIG. 7-2, the control module 11-7 may include an instruction storage sub-module 111-7, an instruction processing sub-module 112-7, and a queue storage sub-module 113-7.

The instruction storage sub-module 111-7 is used to store matrix symmetric instructions.

The instruction processing sub-module 112-7 is used to parse the matrix symmetric instruction to obtain the operation code and operation domain of the matrix symmetric instruction.

The queue storage sub-module 113-7 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order. The plurality of instructions to be executed may include matrix symmetric instructions. The plurality of instructions to be executed may include other calculation instructions related to matrix symmetric instructions.

In a possible implementation, as shown in FIG. 7-2, the control module 11-7 may further include a dependency processing sub-module 114-7.

The dependency processing sub-module 114-7 is configured to determine the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction among the plurality of to-be-executed instructions. 7 The first to-be-executed instruction may be cached in the instruction storage sub-module 112-7. After the execution of the zeroth to-be-executed instruction is completed, the first to-be-executed instruction is extracted from the instruction storage sub-module 112-7 and sent to the processing module 12- 7. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

In a possible implementation manner, the instruction format of the matrix symmetric instruction may be:

Rotate1 type dst src src_shape dst_shape

Rotate1 is the operation code, and type, dst, src, src_shape, and dst_shape are the operation domains. Rotate1 is used to indicate that the instruction is a matrix symmetric instruction. type is a symmetric strategy. dst is the target address. src is the address of the matrix to be processed. src_shape is the input shape. dst_shape is the output shape.

Rotate1_type dst src src_shape dst_shape

Rotate1_type is the operation code, and dst, src, src_shape, and dst_shape are the operation domains. Rotate1 in Rotate1_type is used to indicate that the instruction is a matrix symmetric instruction. The type in Rotate1_type is a symmetric strategy. dst is the target address. src is the address of the matrix to be processed. src_shape is the input shape. dst_shape is the output shape.

In a possible implementation, the instruction format of the matrix symmetry instruction whose symmetry strategy is "center symmetry" can be set as: Rotate1_asymmetry dst src_src_shape dst_shape. The matrix symmetry instruction format that can set the symmetry strategy to "axis symmetry" is: Rotate1_csymmetry dst src src_shape dst_shape.

It should be understood that those skilled in the art can set the operation code of the matrix symmetric instruction, the operation code in the instruction format, and the position of the operation field according to needs, which is not limited in this disclosure.

It should be noted that, although the above embodiment is taken as an example to introduce the matrix symmetric instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure in conjunction with "using a matrix symmetric instruction processing device to perform symmetric processing on a matrix to be processed" as an exemplary application scenario, so as to facilitate understanding of the flow of the matrix symmetric instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

7-3 shows a schematic diagram of an application scenario of a matrix symmetric instruction processing device according to an embodiment of the present disclosure. As shown in Figure 7-3, the matrix symmetric instruction processing device processes matrix symmetric instructions as follows.

When receiving the matrix symmetry instruction 1 (

Rotate1_asymmetry

200, 100, S1, S2), the control module 11-7 parses the matrix symmetry instruction 1, and obtains the operation code and operation domain of the matrix symmetry instruction 1. The opcode of this matrix symmetric instruction 1 is Rotate1_asymmetry. According to the operation code, it can be determined that the instruction is a matrix symmetry instruction, and the symmetry strategy is asymmetry, that is, the symmetry strategy is axisymmetric. According to the operation domain, it can be determined that the to-be-processed matrix address is 100, the input shape is S1, the target address is 200, and the output shape is S2. Furthermore, the control module 11-7 acquires the to-be-processed matrix 1 whose input shape is S1 from the to-be-processed matrix address 100.

The processing module 12-7 performs symmetric processing on the processing matrix 1 according to a symmetric strategy to obtain a symmetric matrix 1', and stores the symmetric matrix 1'in the target address 200.

Among them, the matrix symmetric instruction 1 can be not only the

above Rotate1_asymmetry

200, 100, S1, S2, but also Rotate1, asymmetry, 200, 100, S1, S2. The two are instructions with different instruction formats and represent the same processing process. The process is similar and will not be repeated here.

In this way, the matrix symmetric instruction processing device can quickly and efficiently perform symmetric processing on the matrix according to the matrix symmetric instruction.

7-4 shows a flowchart of a matrix symmetric instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 7-4, the method is applied to the above matrix symmetric instruction processing device. The method includes step S51-7 and step S52-7.

In step S51-7, the received matrix symmetric instruction is parsed to obtain the operation code and operation domain of the matrix symmetric instruction, and the matrix to be processed and the target address required to execute the matrix symmetric instruction are determined according to the operation code and the operation domain, And determine the symmetrical strategy required for symmetrical processing. The operation code is used to indicate that the processing performed by the matrix symmetric instruction on the matrix data is symmetric processing, and the operation domain includes the address of the matrix to be processed and the target address.

In step S52-7, the matrix to be processed is symmetrically processed according to a symmetric strategy to obtain a matrix after symmetry, and the matrix after symmetry is stored in the target address,

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be processed. Wherein, performing symmetric processing on the processing matrix according to a symmetric strategy to obtain a symmetric matrix may include: performing symmetric processing on the processing matrix according to the input shape and the symmetric strategy to obtain the symmetric matrix.

In a possible implementation, the operation domain may also include the output shape of the symmetric matrix. Wherein, performing symmetric processing on the processing matrix according to the symmetric strategy to obtain the symmetric matrix may include: performing symmetric processing on the processing matrix according to the output shape and the symmetric strategy to obtain the symmetric matrix.

In a possible implementation manner, the method may further include: storing a matrix to be processed.

In a possible implementation manner, parsing the received matrix symmetric instruction to obtain the operation code and operation domain of the matrix symmetric instruction may include:

Storage matrix symmetric instructions;

Analyze the matrix symmetric instructions to obtain the opcode and operation domain of the matrix symmetric instructions;

An instruction queue is stored. The instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order. The plurality of instructions to be executed may include matrix symmetric instructions.

In a possible implementation manner, the method may further include:

It should be noted that although the above embodiment is taken as an example to introduce the matrix symmetric instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The method for processing matrix symmetric instructions provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of the matrix for symmetric processing is high and the processing speed is fast.

The foregoing can be better understood based on the following terms:

Clause F1, a matrix symmetric instruction processing device, the device comprising:

The control module is used to parse the received matrix symmetric instruction, obtain the operation code and the operation domain of the matrix symmetric instruction, and determine the required standby for executing the matrix symmetric instruction according to the operation code and the operation domain Processing matrix and target address, and determining the symmetric strategy required for symmetric processing;

Clause F2. The device according to Clause F1, the operation domain further includes an input shape of a matrix to be processed,

Wherein, the processing module is further configured to perform symmetric processing on the matrix to be processed according to the input shape and the symmetry strategy to obtain the symmetrical matrix.

Clause F3. The device according to Clause F1, the operation domain further includes the output shape of the symmetric matrix,

Wherein, the processing module is further configured to perform symmetric processing on the matrix to be processed according to the output shape and the symmetric strategy to obtain the symmetric matrix.

Clause F4. The apparatus according to Clause F1, the operation domain is also used to indicate a symmetric strategy.

Clause F5. The apparatus according to Clause F1, the operation code is further used to indicate the symmetric strategy.

Clause F6, the device according to Clause F1,

The device also includes a storage module for storing the matrix to be processed,

Wherein, the control module includes:

An instruction storage sub-module for storing the matrix symmetric instructions;

An instruction processing submodule, used for parsing the matrix symmetric instruction to obtain the operation code and the operation domain of the matrix symmetric instruction;

A queue storage sub-module, which is used to store an instruction queue, and the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the matrix symmetric instructions,

Wherein, the control module also includes:

Clause F7. A machine learning computing device, the device comprising:

One or more matrix symmetric instruction processing devices as described in any one of clauses F1-F6, used to obtain the to-be-processed matrix and control information from other processing devices, and perform specified machine learning operations, and pass the execution result through I /O interface is passed to other processing devices;

When the machine learning computing device includes a plurality of matrix symmetric instruction processing devices, the plurality of matrix symmetric instruction processing devices may be connected and transmit data through a specific structure;

Clause F8. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause F7;

Clause F9. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause F7 or the combined processing device according to clause F8.

Clause F10. An electronic device, the electronic device comprising:

Machine learning chip as described in clause F9.

Clause F11. A board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause F9;

The storage device is used to store data;

Article F12. A method for processing matrix symmetric instructions. The method is applied to a device for processing matrix symmetric instructions. The method includes:

Clause F13. The method according to Clause F12, the operation domain further includes an input shape of the matrix to be processed,

Wherein, performing symmetric processing on the matrix to be processed according to the symmetric strategy to obtain a symmetric matrix includes:

According to the input shape and the symmetry strategy, the matrix to be processed is symmetrically processed to obtain the symmetrical matrix.

Clause F14. According to the method of Clause F12, the operation domain further includes the output shape of the symmetric matrix,

According to the output shape and the symmetry strategy, the matrix to be processed is symmetrically processed to obtain the symmetrical matrix.

Clause F15. According to the method of Clause F12, the operation domain is also used to indicate a symmetric strategy.

Clause F16. The method according to Clause F12, the opcode is also used to indicate the symmetric strategy.

Clause F17, according to the method described in Clause F12,

The method further includes: storing the matrix to be processed,

Wherein, parsing the received matrix symmetric instruction to obtain the operation code and operation domain of the matrix symmetric instruction includes:

Store the matrix symmetric instruction;

Analyzing the matrix symmetric instruction to obtain the operation code and the operation domain of the matrix symmetric instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix symmetric instructions

Wherein, the method further includes:

Since neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing. Among them, the matrix is a relatively common data form in neural network algorithm, which is composed of numbers and/or characters. The processing of the matrix in the neural network algorithm includes mirroring the matrix. In the related art, the efficiency of mirroring the matrix needs to be low and the speed is slow.

8-1 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 8-1, the device includes a control module 11-8 and a processing module 12-8.

The control module 11-8 is used to parse the received matrix mirroring instruction, obtain the operation code and operation domain of the matrix mirroring instruction, and determine the matrix to be mirrored and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain , And determine the mirroring strategy required for mirroring. The operation code is used to instruct the processing performed by the matrix mirroring instruction on the matrix data to be mirroring processing, and the operation domain includes the matrix address and the target address to be mirrored.

The processing module 12-8 performs mirror processing on the mirror matrix according to the mirror strategy to obtain the mirrored matrix, and stores the mirrored matrix in the target address.

In this embodiment, the matrix to be mirrored may be a data set formed by arranging multiple numbers and/or characters in an array. The mirror image processing is to perform a transformation process on the matrix, and the matrix to be mirror image is folded along a specific flip line (in a two-dimensional plane) or a specific flip plane (in a three-dimensional space) to obtain a mirror-processed matrix. For example, if the matrix to be mirrored is in a two-dimensional plane, the mirroring strategy may include at least one of folding the matrix to be mirrored along the horizontal direction of the matrix to be mirrored and folding the matrix to be mirrored along the vertical direction of the matrix to be mirrored . If the matrix to be mirrored is in three-dimensional space, the mirroring strategy may include folding the matrix to be mirrored along the horizontal plane of the matrix to be mirrored, folding the matrix to be mirrored along the vertical plane of the matrix to be mirrored, and being perpendicular to the vertical plane along the horizontal plane At least one of the planes to be mirrored into the plane. The mirroring strategy may include parameters required for mirroring processing such as flipping straight lines and flipping planes required for mirroring of the mirroring matrix. The matrix mirroring instruction may perform one or more mirroring processing on the matrix, which is not limited in the present disclosure.

For example, suppose the matrix to be mirrored is [[1,4,7],[2,5,8],[3,6,9]]. If the mirroring strategy is determined to be "horizontal mirroring" according to the matrix mirroring instruction, the device can obtain the mirrored matrix [[3,6,9],[2,5,8],[1, 4,7]]. If the symmetry strategy is determined to be "vertical mirroring" according to the matrix mirroring instruction, the device can obtain the post-symmetric matrix [[9,6,3],[8,5,2],[7 ,4,1]].

In this embodiment, the control module may obtain the matrix to be mirrored from the address of the matrix to be mirrored. The address of the matrix to be mirrored may be a physical address such as the first address storing the matrix to be mirrored, or may be a logical address or a linear address. The control module may store the matrix to be mirrored in the target address. The target address may be a physical address such as the first address of the matrix after storing the mirror, or a logical address or a linear address. The present disclosure does not limit the way of expressing the mirror matrix address and the target address. . The control module may obtain the matrix mirroring instruction and the matrix to be mirrored through the data input/output unit, and the data input/output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, for a matrix mirroring instruction, an operation code and an operation field may be included. The operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain can be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the matrix to be mirrored, the corresponding mirroring strategy, or the address to store the matrix to be mirrored, the corresponding mirroring strategy, etc. . For example, the operation domain may include the matrix address to be mirrored and the target address.

It should be understood that, those skilled in the art can set the instruction format of the matrix mirroring instruction, as well as the included operation codes and operation domains as needed, and this disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module can receive the matrix mirroring instruction and control one or more processing modules to perform mirroring processing. When the device includes multiple control modules, the multiple control modules may receive matrix mirroring instructions respectively and control the corresponding one or more processing modules to perform mirroring processing.

The matrix mirroring instruction processing device provided by the embodiment of the present disclosure includes a control module and a processing module. The control module is used to analyze the received matrix mirroring instruction, obtain the operation code and operation domain of the matrix mirroring instruction, and determine the matrix to be mirrored and the target address required to execute the matrix mirroring instruction according to the operation code and the operation domain, and determine the progress The mirroring strategy required for mirroring. The processing module performs mirror processing on the mirror matrix according to the mirror strategy to obtain the mirrored matrix, and stores the mirrored matrix in the target address. The matrix mirroring instruction processing device provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of mirroring the matrix according to the matrix mirroring instruction is high and the processing speed is fast.

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be mirrored.

The processing module 12-8 can also be used to perform mirror processing on the mirror matrix according to the input shape and mirror strategy to obtain the mirrored matrix.

In this implementation manner, it is convenient to perform symmetric processing on the matrix according to the input shape of the matrix to be mirrored, and the shape of the symmetric matrix can also be determined according to the input shape of the matrix to be mirrored. The shape of the matrix can be represented by the number of numbers and/or characters on the rows and columns of the matrix to be mirrored. For example, the matrix 1 to be mirrored is [[0,1,1],[0,1,-1]], and the shape of the matrix 1 to be mirrored is 3×2, that is, the matrix 1 to be processed is 3 rows and 2 Column, consisting of 6 numbers.

In a possible implementation, the default input shape of the matrix to be mirrored may be preset. When the input shape of the matrix to be mirrored is not included in the operation domain, the default input shape of the matrix to be mirrored may be determined as the input shape of the matrix to be mirrored of the current matrix mirroring instruction. This disclosure does not limit this.

In a possible implementation, the operation domain may further include the output shape of the mirrored matrix. Among them, the processing module 12-8 is also used to perform mirror processing on the mirror matrix according to the output shape and the mirror strategy to obtain the mirrored matrix.

In this implementation, the output shape may be the shape of the mirrored matrix. For example, the mirrored matrix is [[1,0],[0,1],[-1,0]], the shape of the mirrored matrix is 2×3, that is, the mirrored matrix is 2 rows and 3 columns , Consisting of 6 numbers.

In a possible implementation, the default output shape of the mirrored matrix can be preset. When the output shape of the mirrored matrix is not included in the operation domain, the default output shape of the mirrored matrix can be determined as the output shape of the mirrored matrix of the current matrix mirroring instruction. This disclosure does not limit this.

In a possible implementation, the operation domain can also be used to indicate the mirroring strategy.

In a possible implementation, the operation code can also be used to indicate the mirroring strategy.

In a possible implementation, the mirroring strategy may be determined according to the operation code or operation domain of the matrix mirroring instruction. The default mirroring strategy of the matrix to be mirrored can also be preset. When the mirroring strategy of the matrix to be mirrored is not included in the operation domain, the default mirroring strategy of the matrix to be mirrored may be determined as the mirroring strategy of the matrix to be mirrored of the current matrix mirroring instruction.

8-2 shows a block diagram of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 8-2, the matrix mirroring instruction processing apparatus may further include: a storage module 13-8, configured to store the matrix to be mirrored.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The matrix to be mirrored can be stored in the memory, cache, and/or register in the storage module according to needs, which is not limited in this disclosure.

In a possible implementation, as shown in FIG. 8-2, the control module 11-8 may include an instruction storage submodule 111-8, an instruction processing submodule 112-8, and a queue storage submodule 113-8.

The instruction storage submodule 111-8 is used to store matrix mirroring instructions.

The instruction processing sub-module 112-8 is used to parse the matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction.

The queue storage sub-module 113-8 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include matrix mirroring instructions. The plurality of instructions to be executed may include other calculation instructions related to the matrix mirroring instruction.

In a possible implementation manner, as shown in FIG. 8-2, the control module 11-8 may further include a dependency relationship processing sub-module 114-8.

The dependency processing sub-module 114-8 is used to determine the dependency relationship between the first to-be-executed instruction and the zero-th to-be-executed instruction before the first to-be-executed instruction. 8 The first instruction to be executed can be cached in the instruction storage submodule 112-8, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule 112-8 and sent to the processing module 12- 8. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

In a possible implementation manner, the instruction format of the matrix mirroring instruction may be:

Rotate2 type dst src src_shape dst_shape

Rotate2 is the operation code, and type, dst, src, src_shape, and dst_shape are the operation domains. Rotate2 is used to indicate that the instruction is a matrix mirroring instruction. type is the mirroring strategy. dst is the target address. src is the address of the matrix to be mirrored. src_shape is the input shape. dst_shape is the output shape.

Rotate2_type dst src src_shape dst_shape

Among them, Rotate2_type is the operation code, dst, src, src_shape, dst_shape is the operation domain. Rotate2 in Rotate2_type is used to indicate that the instruction is a matrix mirroring instruction, and type in Rotate2_type is a mirroring strategy. dst is the target address. src is the address of the matrix to be mirrored. src_shape is the input shape. dst_shape is the output shape.

It should be understood that, those skilled in the art can set the operation code of the matrix mirroring instruction, the operation code in the instruction format, and the position of the operation field according to needs, which is not limited in the present disclosure.

It should be noted that although the above embodiment is used as an example to introduce the matrix mirroring instruction processing apparatus as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

The following describes an application example according to an embodiment of the present disclosure in conjunction with "mirror processing of a matrix mirror using a matrix mirroring instruction processing device" as an exemplary application scenario, so as to facilitate understanding of the flow of the matrix mirroring instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

8-3 shows a schematic diagram of an application scenario of a matrix mirroring instruction processing apparatus according to an embodiment of the present disclosure. As shown in Figure 8-3, the matrix mirroring instruction processing device processes the matrix mirroring instruction as follows.

Example 1-8

The control module 11-8 receives the matrix mirroring instruction 1 (Rotate2_type 200 200 S1 S2), analyzes the matrix mirroring instruction, and obtains the operation code and operation domain of the matrix mirroring instruction 1. The operation code of the matrix mirroring instruction 1 is Rotate2_type. According to the operation code, it can be determined that the instruction is a matrix mirroring processing instruction, and the mirroring strategy is type. According to the operation domain, it can be determined that the matrix address to be mirrored is 100, the input shape is S1, the target address is 200, and the output shape is S2. Furthermore, the control module 11-8 obtains the matrix 1 to be mirrored whose input shape is S1 from the matrix address 100 to be mirrored.

The processing module 12-8 performs mirror processing on the mirror matrix 1 according to the mirror strategy to obtain the mirrored matrix 1', and stores the mirror matrix 1'in the target address 200.

Among them, matrix mirroring instruction 1 can be

Rotate2_type

200, 100, S1, S2, or Rotate2,

type

200, 100, S1, S2. The two are different command formats and represent the same processing process. The processing is similar and will not be repeated here.

In this way, the matrix mirroring instruction processing device can quickly and efficiently mirror the matrix according to the matrix mirroring instruction.

8-4 shows a flowchart of a matrix mirroring instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 8-4, this method is applied to the above-mentioned matrix mirroring instruction processing device. The method includes step S51-8 and step S52-8.

In step S51-8, the received matrix mirroring instruction is analyzed to obtain the operation code and operation domain of the matrix mirroring instruction, and the matrix to be mirrored and the target address required to execute the matrix mirroring instruction are determined according to the operation code and the operation domain And determine the mirroring strategy required for mirroring. The operation code is used to indicate that the processing performed by the matrix mirroring instruction on the matrix is mirror processing, and the operation domain includes the address of the matrix to be mirrored and the target address.

In step S52-8, the mirroring matrix is mirrored according to the mirroring strategy to obtain the mirrored matrix, and the mirrored matrix is stored in the target address.

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be mirrored. Wherein, mirroring the mirror matrix according to the mirroring strategy to obtain the mirrored matrix may include: mirroring the mirroring matrix according to the output shape and mirroring strategy to obtain the mirrored matrix.

In a possible implementation manner, the operation domain may further include the output shape of the mirrored matrix, where the mirroring matrix is mirrored according to the mirroring strategy to obtain the mirrored matrix, which may include: according to the output shape and mirroring strategy, the mirror The matrix is mirrored to obtain the mirrored matrix.

In a possible implementation manner, the method may further include: storing a matrix to be mirrored.

In a possible implementation manner, parsing the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction may include:

Storage matrix mirroring instruction;

Analyze the matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction;

The instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include matrix mirroring instructions.

In a possible implementation manner, the method may further include:

It should be noted that although the above embodiment is taken as an example to introduce the processing method of the matrix mirroring instruction as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The matrix mirroring instruction processing method provided by the embodiments of the present disclosure has a wide application range, and has high processing efficiency and fast processing speed for the mirroring of matrix mechanical energy.

The foregoing can be better understood based on the following terms:

Clause G1, a matrix mirroring instruction processing device, the device comprising:

Clause G2. The device according to Clause G1, the operation domain further includes an input shape of the matrix to be mirrored,

Wherein, the processing module is further configured to perform mirror processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain a mirrored matrix.

Clause G3. The device according to Clause G1, the operation domain further includes the output shape of the mirrored matrix,

Wherein, the processing module is further configured to perform mirror processing on the matrix to be mirrored according to the output shape and the mirroring strategy to obtain a mirrored matrix.

Clause G4. The apparatus according to Clause G1, the operation domain is further used to indicate a mirroring policy.

Clause G5. The apparatus according to Clause G1, the operation code is further used to indicate the mirroring policy.

Clause G6, the device according to Clause G1,

The device also includes a storage module for storing the matrix to be mirrored,

Wherein, the control module includes:

An instruction storage sub-module for storing the matrix mirroring instruction;

An instruction processing submodule, used for parsing the matrix mirroring instruction to obtain the operation code and the operation domain of the matrix mirroring instruction;

A queue storage sub-module, which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix mirroring instruction,

Wherein, the control module also includes:

Clause G7. A machine learning computing device, the device comprising:

One or more matrix mirroring instruction processing devices as described in any one of Clause G1-Clause G6, used to obtain the matrix and control information to be mirrored from other processing devices, and perform the specified machine learning operation, and pass the execution result through I /O interface is passed to other processing devices;

Clause G8. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in Clause G7;

Clause G9. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause G7 or the combined processing device according to clause G8.

Clause G10. An electronic device, the electronic device comprising:

Machine learning chip as described in clause G9.

Clause G11, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause G9;

The storage device is used to store data;

Clause G12. A method for processing a matrix mirroring instruction. The method is applied to a matrix mirroring instruction processing apparatus. The method includes:

Clause G13. According to the method of Clause G12, the operation domain further includes the input shape of the matrix to be mirrored,

Wherein, performing mirror processing on the matrix to be mirrored according to the mirroring strategy to obtain the mirrored matrix includes:

Perform mirror processing on the matrix to be mirrored according to the input shape and the mirroring strategy to obtain the mirrored matrix.

Clause G14. According to the method of Clause G12, the operation domain further includes the output shape of the mirrored matrix,

According to the output shape and the mirroring strategy, perform mirror processing on the matrix to be mirrored to obtain the mirrored matrix.

Clause G15. According to the method of Clause G12, the operation domain is also used to indicate a mirroring policy.

Clause G16. The method according to Clause G12, the operation code is further used to indicate the mirroring policy.

Clause G17, according to the method described in Clause G12,

The method further includes: storing the matrix to be mirrored,

Wherein, analyzing the received matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction includes:

Store the matrix mirroring instruction;

Parse the matrix mirroring instruction to obtain the operation code and operation domain of the matrix mirroring instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix mirroring instruction,

Wherein, the method further includes:

Since neural network algorithms are more and more widely used in the fields of image recognition, speech recognition, and natural language processing, the complexity of neural network algorithms is getting higher and higher, and the types and number of data operations involved are increasing. Among them, the matrix is a relatively common data form in neural network algorithm, which is composed of numbers and/or characters. The processing of the matrix in the neural network algorithm includes rotating the matrix. In the related art, the efficiency of rotating the matrix needs to be low and the speed is slow.

9-1 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure. As shown in Figure 9-1, the device includes a control module 11-9 and a processing module 12-9.

The control module 11-9 is used to parse the received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and operation domain , And determine the rotation angle of the matrix to be rotated. The operation code is used to instruct the matrix rotation instruction to process the matrix data as rotation processing, and the operation domain includes the matrix address and the target address to be rotated.

The processing module 12-9 performs rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix, and stores the rotated matrix in the target address.

In this embodiment, the matrix to be rotated may be a data set formed by arranging multiple numbers and/or characters in an array. The rotation angle may be any preset angle greater than 0 degrees or less than 0 degrees. Wherein, when the rotation angle is greater than 0 degrees, the rotation performed by the matrix to be rotated is clockwise; when the rotation angle is less than 0 degrees, the rotation performed by the matrix to be rotated is counterclockwise. For example, 90°, 180°, 270°, etc.

For example, suppose the matrix to be rotated is [[1,4,7],[2,5,8],[3,6,9]]. If the rotation angle is 90°, the device rotates the matrix to be rotated 90° clockwise to obtain the rotated matrix [[7,8,9],[4,5,6],[1,2 ,3]].

In this embodiment, the control module may obtain the matrix to be rotated from the address of the matrix to be rotated. The address of the matrix to be rotated may be a physical address such as the first address storing the matrix to be rotated, or a logical address or a linear address. The control module may store the matrix to be rotated in the target address. The target address may be a physical address such as the first address of the rotated matrix, or a logical address or a linear address. The present disclosure does not limit the manner of expressing the address of the rotation matrix and the target address. . The control module may obtain the matrix rotation instruction and the matrix to be rotated through the data input and output unit. The data input and output unit may be one or more data I/O interfaces or I/O pins.

In this embodiment, an operation code and an operation field may be included for a matrix rotation instruction. The operation code may be the part of the instruction or field (usually represented by code) specified in the computer program to perform the operation, and is an instruction sequence number used to inform the device that executes the instruction which instruction needs to be executed. The operation domain can be the source of all data required to execute the corresponding instruction. All data required to execute the corresponding instruction include the matrix to be rotated and the corresponding rotation angle, or the address to store the matrix to be rotated and the corresponding rotation angle, etc. . For example, the operation domain may include the matrix address to be rotated and the target address.

It should be understood that, those skilled in the art can set the instruction format of the matrix rotation instruction, as well as the included operation codes and operation fields as required, and the disclosure does not limit this.

In this embodiment, the device may include one or more control modules and one or more processing modules, and the number of control modules and processing modules may be set according to actual needs, which is not limited in the present disclosure. When the device includes a control module, the control module may receive a matrix rotation instruction and control one or more processing modules to perform rotation processing. When the device includes multiple control modules, the multiple control modules may respectively receive matrix rotation instructions and control the corresponding one or more processing modules to perform rotation processing.

A matrix rotation instruction processing device provided by an embodiment of the present disclosure includes a control module and a processing module. The control module is used to parse the received matrix rotation instruction, obtain the operation code and operation domain of the matrix rotation instruction, and determine the matrix to be rotated and the target address required to execute the matrix rotation instruction according to the operation code and operation domain, and determine the treatment The rotation angle of the rotation matrix. The rotation matrix is rotated according to the rotation angle to obtain the rotated matrix, and the rotated matrix is stored in the target address. The matrix rotation instruction processing device provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of rotating the matrix according to the matrix rotation instruction is high and the processing speed is fast.

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be rotated. The processing module 12-9 is also used to rotate the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.

In this implementation manner, the matrix may be rotated according to the input shape of the matrix to be rotated, or the shape of the rotated matrix may be determined according to the input shape of the matrix to be rotated. The shape of the matrix can be expressed by the number of numbers and/or characters on the rows and columns of the matrix to be rotated. For example, the matrix 1 to be rotated is [[1,0,1],[0,1,0],[-1,0,-1]], and the shape of the matrix 1 to be rotated is 3×3, that is, the The rotation matrix 1 to be processed consists of 3 rows and 3 columns, and is composed of 9 numbers.

In a possible implementation, the default input shape of the matrix to be rotated can be preset. When the input shape of the matrix to be rotated is not included in the operation domain, the default input shape of the matrix to be rotated can be determined as the input shape of the matrix to be rotated of the current matrix rotation instruction. The input shape may include at least the length of the matrix to be rotated and the width of the matrix to be rotated, which is not limited in this disclosure.

In a possible implementation manner, the operation domain may further include an output shape of the rotation matrix, and the processing module 12-9 is further configured to perform rotation processing on the rotation matrix according to the output shape and the rotation angle to obtain the rotated matrix.

In this implementation, the output shape may be the shape of the rotated matrix. For example, the matrix after rotation is [[1,-1],[1,1],[0,0]], the shape of the matrix after rotation is 2×3, that is, the symmetric matrix 2 is 2 rows, 3 Column, consisting of 6 numbers.

In a possible implementation, the default output shape of the rotated matrix can be preset. When the output shape of the rotated matrix is not included in the operation domain, the default output shape of the rotated matrix can be determined as the output shape of the rotated matrix of the current matrix rotation instruction. The output shape can include at least the length of the rotated matrix and The width of the matrix is not limited by this disclosure.

In a possible implementation, the operation field can also be used to indicate the rotation angle.

In a possible implementation, the operation code can also be used to indicate the rotation angle.

In a possible implementation manner, the rotation angle may be determined according to the operation code or operation field of the matrix rotation instruction. The default rotation angle of the matrix to be rotated can also be preset. When the rotation angle of the matrix to be rotated is not included in the operation domain, the default rotation angle of the matrix to be rotated may be determined as the rotation angle of the matrix to be rotated of the current matrix rotation instruction.

9-2 shows a block diagram of a matrix rotation instruction processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 9-2, the matrix rotation instruction processing apparatus may further include: a storage module 13-9, configured to store the matrix to be rotated.

In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a temporary storage cache. The matrix to be rotated can be stored in the memory, cache, and/or register in the storage module as needed, which is not limited in this disclosure.

In a possible implementation, as shown in FIG. 9-2, the control module 11-9 may include an instruction storage sub-module 111-9, an instruction processing sub-module 112-9, and a queue storage sub-module 113-9.

The instruction storage submodule 111-9 is used to store matrix rotation instructions.

The instruction processing sub-module 112-9 is used to parse the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction.

The queue storage sub-module 113-9 is used to store an instruction queue. The instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed may include matrix rotation instructions. The plurality of instructions to be executed may include other calculation instructions related to the matrix rotation instruction.

In a possible implementation, as shown in FIG. 9-2, the control module 11-9 may further include a dependency processing sub-module 114-9.

The dependency processing sub-module 114-9 is configured to determine the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction among the plurality of to-be-executed instructions. 9 The first instruction to be executed can be cached in the instruction storage submodule 112-9, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule 112-9 and sent to the processing module 12- 9. Wherein, the first instruction to be executed and the zeroth instruction to be executed are instructions among a plurality of instructions to be executed.

In a possible implementation manner, the instruction format of the matrix rotation instruction may be:

Rotate angle dst src src_shape dst_shape

Rotate is the operation code, and angle, dst, src, src_shape, and dst_shape are the operation domains. Rotate is used to indicate that the instruction is a matrix rotation instruction. dst is the target address. src is the address of the matrix to be rotated. angle is the rotation angle. src_shape is the input shape. dst_shape is the output shape.

Rotate_angle dst src src_shape dst_shape

Among them, Rotate_angle is the operation code, dst, src, src_shape, dst_shape are the operation domain. Rotate in Rotate_angle is used to indicate that the instruction is a matrix rotation instruction, and angle in Rotate_angle is a rotation angle. dst is the target address. src is the address of the matrix to be rotated. src_shape is the input shape. dst_shape is the output shape.

In a possible implementation, the instruction format of the matrix rotation instruction rotated clockwise by 90° can be set as: Rotate_90 dst src src_shape dst_shape. The format of the matrix rotation command that can be rotated clockwise by 180° can be set as: Rotate_180 dst src src_shape dst_shape. The format of the matrix rotation command that can be rotated clockwise by 270° is set to Rotate_270 dst src src_shape dst_shape.

It should be understood that those skilled in the art can set the operation code of the matrix rotation instruction, the operation code in the instruction format, and the position of the operation field as needed, and the disclosure does not limit this.

It should be noted that although the above embodiment is used as an example to introduce the matrix rotation instruction processing device as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various modules flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

Application examples

In the following, an application example according to an embodiment of the present disclosure is given in conjunction with "using a matrix rotation instruction processing device to rotate a matrix to be rotated" as an exemplary application scenario, so as to facilitate understanding of the flow of the matrix rotation instruction processing device. Those skilled in the art should understand that the following application examples are only for the purpose of facilitating understanding of the embodiments of the present disclosure, and should not be considered as limitations to the embodiments of the present disclosure.

9-3 shows a schematic diagram of an application scenario of a matrix rotation instruction processing device according to an embodiment of the present disclosure. As shown in FIG. 9-3, the matrix rotation instruction processing device processes the matrix rotation instruction as follows.

When the control module 11-9 receives the matrix rotation instruction 1 (Rotate 90, 200, 100, S1, S2), it parses the matrix rotation instruction 1, and obtains the operation code and operation field of the matrix rotation instruction 1. The operation code of the matrix rotation instruction 1 is Rotate. And it can be determined according to the operation domain: the rotation angle is 90 degrees, the matrix address to be rotated is 100, the input shape is S1, the target address is 200, and the output shape is S2. Further, the control module 11-9 obtains the matrix 1 to be rotated whose input shape is S1 from the matrix address 100 to be rotated.

The processing module 12-9 performs rotation processing on the rotation matrix 1 according to the rotation angle to obtain the rotated matrix 1', and stores the rotated matrix 1'in the target address 200.

Among them, the matrix rotation instruction 1 can be not only the above-mentioned Rotate 90, 200, 100, S1, S2, but also Rotate_90, 200, 100, S1, S2. The two are instructions with different instruction formats and represent the same processing process. The process is similar and will not be repeated here.

In this way, the matrix rotation instruction processing device can quickly and efficiently rotate the matrix according to the matrix rotation instruction.

9-4 shows a flowchart of a matrix rotation instruction processing method according to an embodiment of the present disclosure. As shown in FIG. 9-4, the method is applied to the above matrix rotation instruction processing device, and the method includes step S51-9 and step S52-9.

In step S51-9, the received matrix rotation instruction is analyzed to obtain the operation code and operation domain of the matrix rotation instruction, and the matrix to be rotated and the target address required to execute the matrix rotation instruction are determined according to the operation code and operation domain And determine the rotation angle of the matrix to be rotated. The operation code is used to instruct the matrix rotation instruction to process the matrix data as rotation processing, and the operation domain includes the matrix address and the target address to be rotated.

In step S52-9, the rotation matrix is rotated according to the rotation angle to obtain the rotated matrix, and the rotated matrix is stored in the target address.

In a possible implementation manner, the operation domain may further include the input shape of the matrix to be rotated. Wherein, performing the rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix may include: performing rotation processing on the rotation matrix according to the input shape and the rotation angle to obtain the rotated matrix.

In a possible implementation, the operation domain may also include the output shape of the rotation matrix. Wherein, performing the rotation processing on the rotation matrix according to the rotation angle to obtain the rotated matrix may include: performing rotation processing on the rotation matrix according to the output shape and the rotation angle to obtain the rotated matrix.

In a possible implementation manner, the method may further include: storing the matrix to be rotated.

In a possible implementation manner, parsing the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction may include:

Storage matrix rotation instruction;

Analyze the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;

An instruction queue is stored, and the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to the execution order, and the plurality of instructions to be executed may include matrix rotation instructions.

In a possible implementation manner, the method may further include:

It should be noted that although the above embodiment is used as an example to introduce the matrix rotation instruction processing method as above, those skilled in the art can understand that the present disclosure should not be limited to this. In fact, the user can set various steps flexibly according to personal preferences and/or actual application scenarios, as long as the technical solutions of the present disclosure are met.

The matrix rotation instruction processing method provided by the embodiments of the present disclosure has a wide application range, and the processing efficiency of rotating the matrix according to the matrix rotation instruction is high and the processing speed is fast.

The foregoing can be better understood based on the following terms:

Clause H1, a matrix rotation instruction processing device, the device comprising:

Clause H2. The device according to Clause H1, the operation domain further includes an input shape of the matrix to be rotated,

The processing module is further configured to perform rotation processing on the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.

Clause H3. The device according to Clause H1, the operation domain further includes an output shape of a rotation matrix,

The processing module is further configured to perform rotation processing on the matrix to be rotated according to the output shape and the rotation angle to obtain the rotated matrix.

Clause H4. The device according to Clause H1, the operation field is also used to indicate a rotation angle.

Clause H5. The device according to Clause H1, the operation code is also used to indicate a rotation angle.

Clause H6, the device according to Clause H1,

The device also includes a storage module for storing the matrix to be rotated,

Wherein, the control module includes:

An instruction storage sub-module for storing the matrix rotation instruction;

An instruction processing sub-module, which is used to analyze the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;

A queue storage submodule, which is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, and the plurality of instructions to be executed include the matrix rotation instruction,

Wherein, the control module also includes:

Clause H7. A machine learning computing device, the device comprising:

One or more matrix rotation instruction processing devices as described in any one of Clause H1-Clause H6, used to obtain the matrix to be rotated and control information from other processing devices, and perform a specified machine learning operation, and pass the execution result through I /O interface is passed to other processing devices;

Clause H8. A combined processing device, the combined processing device comprising:

Machine learning computing devices, general interconnection interfaces and other processing devices as described in clause H7;

Clause H9. A machine learning chip, the machine learning chip includes:

The machine learning arithmetic device according to clause H7 or the combined processing device according to clause H8.

Clause H10. An electronic device, the electronic device comprising:

Machine learning chip as described in clause H9.

Clause H11, a board card including: a storage device, an interface device and a control device, and a machine learning chip as described in Clause H9;

The storage device is used to store data;

Clause H12. A method of processing matrix rotation instructions. The method is applied to a device for processing matrix rotation instructions. The method includes:

Clause H13. The method according to Clause H12, the operation domain further includes an input shape of the matrix to be rotated,

Wherein, rotating the matrix to be rotated according to the rotation angle to obtain a matrix after rotation includes:

Rotate the matrix to be rotated according to the input shape and the rotation angle to obtain the rotated matrix.

Clause H14. The method according to Clause H12, the operation domain further includes an output shape of a rotation matrix,

Rotate the matrix to be rotated according to the output shape and the rotation angle to obtain the rotated matrix.

Clause H15. The method according to Clause H12, the operation field is also used to indicate a rotation angle.

Clause H16. The method according to Clause H14, the operation code is also used to indicate a rotation angle.

Clause H17, according to the method described in Clause H12,

The method further includes: storing the matrix to be rotated,

Wherein, analyzing the received matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction includes:

Store the matrix rotation instruction;

Parse the matrix rotation instruction to obtain the operation code and operation domain of the matrix rotation instruction;

An instruction queue is stored, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the matrix rotation instruction,

Wherein, the method further includes:

An embodiment of the present disclosure also proposes a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented. The computer-readable storage medium may be a non-volatile computer-readable storage medium.

An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing processor executable instructions; wherein the processor is configured to call the instructions stored in the memory to perform the above method.

The present disclosure provides a machine learning computing device. The machine learning computing device may include one or more of the above-mentioned instruction processing devices for acquiring data to be operated and control information from other processing devices to perform specified machine learning operations. The machine learning computing device can obtain instructions from other machine learning computing devices or non-machine learning computing devices, and transfer the execution result to peripheral devices (also called other processing devices) through the I/O interface. Peripheral equipment such as camera, monitor, mouse, keyboard, network card, wifi interface, server. When more than one instruction processing device is included, the instruction processing device can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support a larger-scale neural network operation. At this time, you can share the same control system or have separate control systems; you can share memory, or each accelerator has its own memory. In addition, the interconnection method can be any interconnection topology.

The machine learning computing device has high compatibility, and can be connected with various types of servers through the PCIE interface.

10a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in FIG. 10a, the combined processing device includes the above-mentioned machine learning computing device, a universal interconnection interface, and other processing devices. The machine learning computing device interacts with other processing devices to complete the operation specified by the user.

Other processing devices include one or more processor types of general-purpose/dedicated processors such as central processing unit CPU, graphics processor GPU, neural network processor. The number of processors included in other processing devices is not limited. Other processing devices serve as an interface between the machine learning computing device and external data and control, including data handling, to complete the basic control of starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete the computing task.

Universal interconnection interface, used to transfer data and control instructions between machine learning computing devices and other processing devices. The machine learning computing device obtains the required input data from other processing devices and writes them into the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them into the control cache of the machine learning computing device; The data in the storage module of the machine learning computing device can be read and transmitted to other processing devices.

FIG. 10b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in FIG. 10b, the combined processing device may further include a storage device, and the storage device is respectively connected to the machine learning operation device and other processing devices. The storage device is used to store data stored in the machine learning computing device and other processing devices, and is particularly suitable for data that cannot be saved in the internal storage of the machine learning computing device or other processing devices.

The combined processing device can be used as an SOC on-chip system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to some components of the device. Some components such as camera, monitor, mouse, keyboard, network card, wifi interface.

The present disclosure provides a machine learning chip including the above machine learning arithmetic device or combination processing device.

The present disclosure provides a machine learning chip packaging structure including the above machine learning chip.

The present disclosure provides a board card. FIG. 11 shows a schematic diagram of a board card according to an embodiment of the present disclosure. As shown in FIG. 11, the board includes the above machine learning chip packaging structure or the above machine learning chip. In addition to the machine learning chip 389, the board may also include other supporting components, including but not limited to: a storage device 390, an interface device 391, and a control device 392.

The storage device 390 and the machine learning chip 389 (or the machine learning chip in the machine learning chip packaging structure) are connected via a bus, and are used to store data. The storage device 390 may include multiple sets of storage units 393. Each group of storage units 393 and the machine learning chip 389 are connected by a bus. It can be understood that each group of storage units 393 may be DDR SDRAM (English: Double Data Rate SDRAM, double rate synchronous dynamic random access memory).

DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.

In one embodiment, the memory device 390 may include 4 sets of memory cells 393. Each group of memory cells 393 may include multiple DDR4 particles (chips). In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers. Among the 72-bit DDR4 controllers, 64 bits are used for data transmission, and 8 bits are used for ECC check. It can be understood that when DDR4-3200 particles are used in each group of memory cells 393, the theoretical bandwidth of data transmission can reach 25600MB/s.

In one embodiment, each group of storage units 393 includes multiple double-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the machine learning chip 389 for controlling the data transmission and data storage of each storage unit 393.

The interface device 391 is electrically connected to the machine learning chip 389 (or the machine learning chip in the machine learning chip packaging structure). The interface device 391 is used to realize data transmission between the machine learning chip 389 and an external device (such as a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface. For example, the data to be processed is transferred from the server to the machine learning chip 289 through a standard PCIE interface to realize data transfer. Preferably, when using PCIE 3.0 and X 16 interface transmission, the theoretical bandwidth can reach 16000MB/s. In another embodiment, the interface device 391 may also be other interfaces. The present disclosure does not limit the specific expressions of the other interfaces described above, and the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is still transmitted back to the external device (such as a server) by the interface device.

The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (Micro Controller Unit, MCU). For example, the machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the machine learning chip 389 can be in different working states such as multiple loads and light loads. The control device can realize the regulation of the working state of multiple processing chips, multiple processes and/or multiple processing circuits in the machine learning chip.

The present disclosure provides an electronic device including the aforementioned machine learning chip or board.

Electronic equipment can include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, cameras, cameras, projectors, watches, Headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.

Vehicles may include airplanes, ships, and/or vehicles. Household appliances may include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.

It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that this application is not limited by the described action sequence, Because according to the present application, certain steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the involved actions and modules are not necessarily required by the present disclosure.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to related descriptions in other embodiments.

In the embodiments provided by the present disclosure, it should be understood that the disclosed system and device may be implemented in other ways. For example, the system and device embodiments described above are only schematic. For example, the division of devices, devices, and modules is only a logical function division. In actual implementation, there may be other divisions, for example, multiple modules may be combined Or it can be integrated into another system or device, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices, devices, or modules, and may be in electrical or other forms.

A module described as a separate component may or may not be physically separated, and a component displayed as a module may or may not be a physical unit, that is, it may be located in one place or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional module in each embodiment of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one module. The above-mentioned integrated modules can be implemented in the form of hardware or software program modules.

If the integrated module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on such an understanding, the technical solution of the present disclosure may be essentially or part of the contribution to the existing technology or all or part of the technical solution may be embodied in the form of a software product, the computer software product is stored in a memory, Several instructions are included to enable a computer device (which may be a personal computer, server, network device, etc.) to perform all or part of the steps of the methods described in various embodiments of the present disclosure. The aforementioned memory includes: U disk, Read-Only Memory (ROM, Read-Only Memory), Random Access Memory (RAM, Random Access Memory), mobile hard disk, magnetic disk or optical disk and other media that can store program codes.

A person of ordinary skill in the art may understand that all or part of the steps in the various methods of the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include: a flash disk , Read-Only Memory (English: Read-Only Memory, abbreviation: ROM), Random Access Device (English: Random Access Memory, abbreviation: RAM), magnetic disk or optical disk, etc.

It should be noted that the foregoing method embodiments are described as a series of action combinations for the sake of simple description, but those skilled in the art should be aware that the present disclosure is not limited by the sequence of actions described Because according to the present disclosure, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the involved actions and modules are not necessarily required by the present disclosure.

It should be further noted that although the flow charts in Figure 2-4, Figure 3-4, Figure 4-4, Figure 5-4, Figure 6-6, Figure 7-4, Figure 8-4, and Figure 9-4 The individual steps are displayed in the order indicated by the arrows, but these steps are not necessarily executed in the order indicated by the arrows. Unless clearly stated in this article, the execution of these steps is not strictly limited in order, and these steps can be executed in other orders. Moreover, at least some of the steps in FIGS. 2-4, 3-4, 4-4, 5-4, 6-6, 7-4, 8-4, and 9-4 may include multiple sub-steps Steps or multiple phases. These sub-steps or phases are not necessarily executed at the same time, but can be executed at different times. The execution order of these sub-steps or phases is not necessarily sequential, but can be performed At least part of the sub-steps or stages of the steps or other steps are executed in turn or alternately.

The embodiments of the present disclosure have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the illustrated embodiments. The selection of terms used herein is intended to best explain the principles, practical applications or technical improvements of the technologies in the embodiments, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A vector search instruction processing device, characterized in that the device includes:

The control module is used to parse the received vector search instruction, obtain the operation code and the operation domain of the vector search instruction, and determine the standby required to execute the vector search instruction according to the operation code and the operation domain Search vector, search condition and target address;

The operation module is used to sequentially determine whether a plurality of check numbers representing the search vector satisfy the search condition, determine the check number satisfying the search condition as the target number, and store the storage address of the target number Store in the target address as a search result,

Wherein, the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.
The device according to claim 1, wherein the operation domain further includes an input length,

The control module is further configured to obtain the vector to be searched from the address of the vector to be searched according to the input length.
The device according to claim 1, wherein the operation domain further includes a width of the data to be checked,

The calculation module is further configured to determine the plurality of to-be-checked numbers from the to-be-searched vector according to the width of the to-be-checked number.
The apparatus according to claim 1, wherein the operation domain further includes a search condition,

The control module is also used to determine the search condition according to the operation domain.
The device according to claim 1, characterized in that

The control module is further configured to determine the search condition according to the operation code, wherein the operation code is also used to indicate the search condition of the vector search instruction.
The device according to claim 1, wherein the arithmetic module comprises:

At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number meets the search condition according to the comparison result.
The device according to any one of claims 1 to 6, wherein the number of to-be-checked satisfying the search condition includes at least one of the following:

The numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting;

Number to be checked if the value is within the specified value interval;

The value is the number to be checked for the specified multiple of the specified value,

Wherein, the designated order includes at least one of the following:

The sorting of the number to be checked is the nth of the number to be checked whose value is a specified multiple of the specified value, where n is a positive integer greater than or equal to 1;

The order of the numbers to be checked is the m-th to the last one of the numbers to be checked whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,

Wherein, m and n are less than or equal to the number of numbers to be checked in the vector to be looked up.
The device according to claim 1, characterized in that

The device further includes a storage module for storing the to-be-searched vector,

Wherein, the control module includes:

Instruction storage sub-module for storing the vector search instruction;

An instruction processing sub-module, which is used to parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction;

A queue storage sub-module is used to store an instruction queue, the instruction queue includes a plurality of instructions to be executed, which are sequentially arranged according to the execution order, and the plurality of instructions to be executed include the vector search instruction,

Wherein, the control module also includes:

The dependency processing sub-module is used to determine the first pending instruction when there is an association between the first pending instruction among the multiple pending instructions and the zeroth pending instruction before the first pending instruction The execution instruction is cached in the instruction storage submodule, and after the execution of the zeroth instruction to be executed is completed, the first instruction to be executed is extracted from the instruction storage submodule and sent to the arithmetic module,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.
A machine learning computing device, characterized in that the device includes:

One or more vector search instruction processing devices according to any one of claims 1-8, used to obtain data to be operated and control information from other processing devices, and perform a specified machine learning operation, and pass the execution result through I /O interface is passed to other processing devices;

When the machine learning operation device includes a plurality of the vector search instruction processing devices, the plurality of vector search instruction processing devices may be connected and transmit data through a specific structure;

Among them, a plurality of the vector search instruction processing devices interconnect and transmit data through a fast external device interconnect bus PCIE bus to support larger-scale machine learning operations; a plurality of the vector search instruction processing devices share the same control system Or have their own control systems; a plurality of the vector search instruction processing devices share memory or have their own memory; the interconnection method of the plurality of vector search instruction processing devices is an arbitrary interconnection topology.
A combined processing device, characterized in that the combined processing device includes:

The machine learning computing device, general interconnection interface and other processing devices as claimed in claim 9;

The machine learning computing device interacts with the other processing device to jointly complete the calculation operation specified by the user,

Wherein, the combined processing device further includes: a storage device, which is respectively connected to the machine learning computing device and the other processing device, and is used for storing data of the machine learning computing device and the other processing device.
A machine learning chip, characterized in that the machine learning chip includes:

The machine learning arithmetic device according to claim 9 or the combined processing device according to claim 10.
An electronic device, characterized in that the electronic device includes:

The machine learning chip according to claim 11.
A board card, characterized in that the board card comprises: a storage device, an interface device and a control device, and the machine learning chip according to claim 11;

Wherein, the machine learning chip is respectively connected to the storage device, the control device and the interface device;

The storage device is used to store data;

The interface device is used to realize data transmission between the machine learning chip and an external device;

The control device is used for monitoring the state of the machine learning chip.
A vector search instruction processing method, characterized in that the method is applied to a vector search instruction processing device, and the method includes:

Parse the received vector search instruction to obtain the operation code and operation domain of the vector search instruction, and determine the vector to be searched and the search condition required to execute the vector search instruction according to the operation code and the operation domain And destination address;

Sequentially determine whether a plurality of searched numbers representing the searched vector satisfy the search condition, determine the searched number satisfying the search condition as a target number, and store the storage address of the target number as a search result The target address,

Wherein, the operation code is used to indicate that the operation performed by the vector search instruction on the vector data is a search operation, and the operation domain includes the vector address to be searched and the target address.
The method according to claim 14, wherein the operation domain further includes an input length,

Wherein, determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain include:

Obtain the vector to be searched from the address of the vector to be searched according to the input length.
The method according to claim 14, wherein the operation domain further includes a width of the data to be checked, and the method further includes:

According to the width of the number to be checked, the plurality of numbers to be checked are determined from the vector to be looked up.
The method according to claim 14, wherein the operation domain further includes a search condition,

Wherein, determining the vector to be searched, the search condition and the target address required to execute the vector search instruction according to the operation code and the operation domain include:

According to the operation domain, the search condition is determined.
The method according to claim 14, wherein determining the vector to be searched, the search condition, and the target address required to execute the vector search instruction according to the operation code and the operation field includes:

The search condition is determined according to the operation code, and the operation code is also used to indicate the search condition of the vector search instruction.
The method according to claim 14, characterized in that sequentially determining whether a plurality of to-be-checked numbers representing the to-be-searched vector satisfy the search condition includes:

At least one comparator is used to compare the plurality of to-be-checked numbers with the search condition to obtain a comparison result, so as to determine whether the to-be-checked number satisfies the search condition according to the comparison result.
The method according to any one of claims 14-19, wherein the number of to-be-checked satisfying the search condition includes at least one of the following:

The numeric value is the specified multiple of the specified value, and the sorting is the number to be checked of the specified sorting;

Number to be checked if the value is within the specified value interval;

The value is the number to be checked for the specified multiple of the specified value,

Wherein, the designated order includes at least one of the following:

The sorting of the number to be checked is the nth of the number to be checked whose value is a specified multiple of the specified value, where n is a positive integer greater than or equal to 1;

The order of the numbers to be checked is the m-th to the last one of the numbers to be checked whose value is the specified multiple of the specified value, where m is a positive integer greater than or equal to 1,

Wherein, m and n are less than or equal to the number of numbers to be checked in the vector to be looked up.
The method according to claim 14, characterized in that

The method further includes: storing the to-be-searched vector,

Wherein, analyzing the received vector search instruction to obtain the operation code and operation domain of the vector search instruction includes:

Store the vector search instruction;

Parse the vector search instruction to obtain the operation code and operation domain of the vector search instruction;

Store an instruction queue, the instruction queue includes a plurality of instructions to be executed in order according to the execution order, the plurality of instructions to be executed includes the vector search instruction,

Wherein, the method further includes:

When it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions is associated with the zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction, and determine the After the execution of the zeroth to-be-executed instruction is completed, control to execute the execution of the first to-be-executed instruction,

Wherein, the association relationship between the first instruction to be executed and the zeroth instruction to be executed before the first instruction to be executed includes:

The first storage address interval storing the data required by the first instruction to be executed and the zeroth storage address interval storing the data required by the zeroth instruction to be executed have overlapping areas.