CN112396186A - Execution method, device and related product


Info

Publication number
CN112396186A
Authority
CN
China
Prior art keywords
data
execution
executed
address
instruction
Prior art date
Legal status
Granted
Application number
CN201910740813.7A
Other languages
Chinese (zh)
Other versions
CN112396186B (en)
Inventor
Inventor not disclosed
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910740813.7A priority Critical patent/CN112396186B/en
Priority claimed from CN201910740813.7A external-priority patent/CN112396186B/en
Priority to PCT/CN2020/088248 priority patent/WO2020233387A1/en
Publication of CN112396186A publication Critical patent/CN112396186A/en
Application granted granted Critical
Publication of CN112396186B publication Critical patent/CN112396186B/en

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00  Machine learning

Abstract

The disclosure relates to an execution method, an execution device and a related product. The machine learning execution device comprises one or more instruction processing devices and is used for acquiring data to be executed and control information from other processing devices, performing the specified machine learning execution, and transmitting the execution result to the other processing devices through an I/O interface. When the machine learning execution device comprises a plurality of instruction processing devices, the instruction processing devices can be connected through a specific structure and transmit data: the instruction processing devices are interconnected through a Peripheral Component Interconnect Express (PCIE) bus and transmit data; the plurality of instruction processing devices either share the same control system or have their own control systems, and either share a memory or have their own memories; and the interconnection mode of the plurality of instruction processing devices is an arbitrary interconnection topology. The execution method, the execution device and the related products provided by the embodiments of the disclosure have a wide application range, high processing efficiency and high processing speed for collection instructions.

Description

Execution method, device and related product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an execution method, an execution device, and a related product.
Background
With the continuous development of science and technology, machine learning, and especially neural network algorithms, is more and more widely used, and has been successfully applied in fields such as image recognition, speech recognition and natural language processing. However, as neural network algorithms become more complex, the types and amounts of data execution involved continue to increase. In the related art, collection of data is performed with low efficiency and at low speed.
Disclosure of Invention
In view of the above, the present disclosure provides an execution method, an execution device, and a related product, so as to improve efficiency and speed of performing collection on data.
According to a first aspect of the present disclosure, there is provided a collection instruction processing apparatus, the apparatus including:
the control module is used for analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address which are required by executing the collection instruction according to the operation code and the operation domain;
an execution module, configured to determine selected data from data to be operated according to the index data, and store the selected data and the number of the selected data as an execution result of the collection instruction in the target address,
wherein the operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a to-be-executed data address, an index data address and the target address.
According to a second aspect of the present disclosure, there is provided a machine learning execution apparatus, the apparatus including:
one or more collection instruction processing apparatuses according to the first aspect described above, configured to acquire data to be executed and control information from another processing apparatus, execute a designated machine learning execution, and transmit an execution result to the other processing apparatus through an I/O interface;
when the machine learning execution device comprises a plurality of collection instruction processing devices, the collection instruction processing devices can be connected through a specific structure and transmit data;
the collection instruction processing devices are interconnected through a PCIE bus of a fast peripheral equipment interconnection bus and transmit data so as to support the execution of larger-scale machine learning; the collection instruction processing devices share the same control system or own respective control systems; the collection instruction processing devices share a memory or own respective memories; the interconnection mode of the plurality of collection instruction processing devices is any interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning execution device, the universal interconnection interface and the other processing devices of the second aspect;
and the machine learning execution device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip including the machine learning execution apparatus of the second aspect or the combined processing apparatus of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided a collection instruction processing method applied to a collection instruction processing apparatus, the method including:
analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required by executing the collection instruction according to the operation code and the operation domain;
determining selected data from the data to be executed according to the index data, and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address,
wherein the operation code is used to indicate that the execution of the data by the collection instruction is collection execution, and the operation domain includes a to-be-executed data address, an index data address, and the target address.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The device comprises a control module and an execution module. The control module is used for analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required for executing the collection instruction according to the operation code and the operation domain. The execution module is used for determining selected data from the data to be executed according to the index data, and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address.
the method and the device for processing the collection instruction and the related products provided by the embodiment of the disclosure have the advantages of wide application range, high processing efficiency of the collection instruction and high processing speed.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure;
FIGS. 2a to 2e illustrate block diagrams of a gather instruction processing apparatus according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of an application scenario of a gather instruction processing apparatus according to an embodiment of the present disclosure;
FIGS. 4a, 4b show block diagrams of a combined processing device according to an embodiment of the present disclosure;
fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure;
FIG. 6 shows a flow diagram of a gather instruction processing method according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a block diagram of a collection instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and an execution module 12.
The control module 11 is configured to analyze the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and obtain at least one index data, at least one to-be-executed data, and a target address required for executing the collection instruction according to the operation code and the operation domain. The operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and a target address.
The execution module 12 is configured to determine selected data from data to be executed according to the index data, and store the selected data and the number of the selected data as an execution result of the collection instruction into the target address, where the operation code is used to indicate that execution of the collection instruction on the data is collection execution, and the operation domain includes an address of the data to be executed, an address of the index data, and the target address.
In this embodiment, the control module may obtain at least one to-be-executed data and at least one index data from the to-be-executed data address and the index data address, respectively. The control module may obtain the collection instruction, the at least one data to be executed, and the at least one index data through a data input output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code may be a part of an instruction or a field (usually indicated by a code) specified in the computer program to perform an operation, and serves as an instruction sequence number used to inform the device executing the instruction which operation specifically needs to be performed. The operation domain may be the source of all data required for executing the corresponding instruction, including parameter data, the data to be executed, the corresponding execution method, or the addresses storing the parameter data, the data to be executed and the corresponding execution method, and so on. A gather instruction must include an operation code and an operation domain, where the operation domain includes at least the to-be-executed data address, the index data address and the target address.
It should be understood that the instruction format of the gather instruction and the included opcode and operation fields may be set as desired by one skilled in the art, and are not limited by the present disclosure.
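For illustration only, the following is a minimal Python sketch (not the patent's actual hardware implementation; all names are assumptions) of how a parsed collection instruction and its operation domain might be represented, and how the operands could be fetched from a flat memory model:

from dataclasses import dataclass

@dataclass
class CollectInstruction:
    opcode: str          # indicates collection execution, e.g. "collect"
    src_data_addr: int   # operation domain: to-be-executed data address
    index_addr: int      # operation domain: index data address
    dst_addr: int        # operation domain: target address for the execution result
    size: int = 0        # optional read-in amount

def fetch_operands(memory, instr, count):
    # Read `count` to-be-executed data and index data from a flat memory (a list).
    data = memory[instr.src_data_addr : instr.src_data_addr + count]
    index = memory[instr.index_addr : instr.index_addr + count]
    return data, index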
Optionally, as shown in fig. 2, the execution module includes one or more comparators, one or more selectors, and a counter. The one or more comparators are used for comparing the index data with a preset condition and determining whether the index data meet the preset condition; the one or more selectors are configured to, when the index data meet the preset condition, take the data to be executed corresponding to the index data meeting the preset condition as the selected data; and the counter is used for determining the number of the selected data.
Alternatively, the preset condition may be that the index data is not zero.
In this implementation, when the index data is not zero, the number of the index data that is not zero and the data to be executed corresponding to the index data that is not zero are sequentially stored to the first address and the second address of the target address. The preset condition may be that the index data is not a designated value, and the designated value may be 1 or the like. The preset conditions can be set by those skilled in the art according to actual needs, and the disclosure does not limit the preset conditions.
In this implementation, a preset condition or index data may be set as needed to store data required in the data to be executed to the target address. For example, to collect the data to be executed according to different collection requirements, different preset conditions may be set, or different index data may be set to implement different collections of the data to be executed.
Alternatively, the data to be executed may be tensor data, the index data may be tensor data corresponding to the data to be executed, and the number of index data may be greater than or equal to the number of data to be executed. At this time, a preset mapping relationship exists between the index data and the data to be executed, so that the execution module can select the selected data from the data to be executed according to the index data and the preset mapping relationship. Specifically, the number of the index data may be equal to the number of the data to be executed, the index data and the data to be executed are set in a one-to-one correspondence, and at this time, when the index data meets a preset condition, the data to be executed corresponding to the index data may be used as the selected data.
The tensor data may be neural network data, such as neuron data or weight data of a neural network. Tensor data refers to data of zero or more dimensions, and may have multiple dimensions. Specifically, 0-dimensional tensor data is scalar data, 1-dimensional tensor data is vector data, and 2-dimensional tensor data may be matrix data. In other embodiments, the index data and the data to be executed may also be scalar data or the like, which is not limited herein.
Further optionally, the value of the index data is a bit value. For example, the index data has a value of 0 or 1. Optionally, when the value of the index data is 0, the data to be executed corresponding to the index data is discarded, and when the value of the index data is 1, the data to be executed corresponding to the index data is used as the selected data.
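As a concrete illustration of the comparator/selector/counter behaviour described above, the following Python sketch (an assumption-laden software model, not the hardware itself) keeps each to-be-executed datum whose index value satisfies the preset condition and counts the selected data:

def gather_select(data, index, predicate=lambda i: i != 0):
    # comparator: test each index datum against the preset condition (default: nonzero)
    # selector:   keep the to-be-executed datum whose index satisfies the condition
    # counter:    report how many data were selected
    selected = [d for d, i in zip(data, index) if predicate(i)]
    return selected, len(selected)

# gather_select([1, 5, 6, 7, 3], [1, 8, 0, 6, 9]) -> ([1, 5, 7, 3], 4)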
Optionally, the target address in the instruction operation domain may be a start address, the start address is divided into a first address and a second address, the first address is used for storing the number of the selected data, the second address is used for storing the selected data, and the execution module may determine, according to the start address and the size of the execution result, the size of a storage space required by the execution result, and store the execution result into the determined storage space.
Optionally, the target address comprises a first address for storing the number of the selected data and a second address for storing the selected data; wherein the size of the address space pointed to by the first address is smaller than or equal to the size of the address space pointed to by the second address. For example, the target address may comprise a plurality of row addresses, wherein a first row address points to a storage space for storing the selected number of data and other row addresses point to storage spaces for storing the selected data.
Alternatively, the source address and the target address of the data to be executed in the operation domain may each be one or more, with one piece of data to be executed per source address. The target address includes a first address for storing the number of the selected data and a second address for storing the selected data, wherein the size of the address space to which the first address points is smaller than or equal to the size of the address space to which the second address points. The data to be operated on can be one or more. In addition, in this embodiment, the apparatus may include one or more control modules and one or more execution modules, and the number of the control modules and the number of the execution modules may be set according to actual needs, which is not limited in this disclosure.
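One possible layout of the execution result at the target address is sketched below (a software model under the assumption that the second address immediately follows the first address; the disclosure does not fix this layout):

def store_result(memory, dst_addr, selected, count):
    # first address: the number of the selected data
    memory[dst_addr] = count
    # second address: the selected data themselves
    memory[dst_addr + 1 : dst_addr + 1 + count] = selected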
In other alternative embodiments, the execution result may include only the selected data, and accordingly, the target address is used to point to the address of the memory space of the selected data. Further optionally, the value of the index data is a bit value. For example, the index data has a value of 0 or 1. Optionally, when the value of the index data is 0, the data to be executed corresponding to the index data is discarded, and when the value of the index data is 1, the data to be executed corresponding to the index data is used as the selected data.
The collecting instruction processing device provided by the embodiment of the disclosure comprises a control module and an execution module. The control module is used for analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required by executing the collection instruction according to the operation code and the operation domain. The execution module is used for determining selected data from data to be executed according to the index data and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address. The execution module comprises one or more comparators for comparing the index data with a preset condition and determining whether the index data meets the preset condition; and one or more selectors, configured to, when the index data satisfies the preset condition, take data to be executed corresponding to the index data satisfying the preset condition as the selected data.
In a possible implementation manner, the apparatus further includes a storage module, configured to store the at least one index data, the at least one to-be-computed data, and the preset condition.
The collection instruction processing device provided by the embodiment of the disclosure has the advantages of wide application range, high processing efficiency of collection instructions and high processing speed.
FIG. 2a shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2a, the execution module 12 may include a master execution sub-module 121 including a counter and at least one slave execution sub-module 122 including a comparator and a selector. Furthermore, the execution module may further include a data access circuit, the data access circuit may obtain data to be operated from the storage module, and the data access circuit may further store an execution result in the storage module. Alternatively, the data access circuit may be a direct memory access module.
The control module 11 is further configured to parse the collection instruction to obtain at least one execution instruction, and send the at least one index data, the at least one to-be-executed data, and the at least one execution instruction to the main execution sub-module 121.
The one or more comparators of the slave execution sub-module 122 are configured to compare the index data with a preset condition, and determine whether the index data satisfies the preset condition. And the selector of the slave execution submodule is used for taking the data to be executed corresponding to the index data meeting the preset condition as the selected data when the index data meets the preset condition, and sending the selected data to the master execution submodule.
A main execution sub-module 121, a counter of which is used to determine the number of the selected data, and store the number of the selected data and the selected data in the target address.
Optionally, the target address may include a first address and a second address, where the first address points to a memory space for storing the number of the selected data, the second address points to a memory space for storing the selected data, and the memory space pointed to by the first address is smaller than the memory space pointed to by the second address.
FIG. 2b shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2b, the execution module 12 may further include one or more branch execution sub-modules 123, where the branch execution sub-module 123 is configured to forward data and/or execution instructions between the master execution sub-module 121 and the slave execution sub-module 122. The main execution sub-module 121 is connected to one or more branch execution sub-modules 123.
FIG. 2c shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2c, at least one slave execution submodule 122 is distributed in an array.
Each slave execution submodule 122 is connected with the other adjacent slave execution submodules 122, and the master execution submodule 121 is connected with k slave execution submodules 122 among the plurality of slave execution submodules 122, where the plurality of slave execution submodules 122 are distributed in an array of m rows and n columns and the k slave execution submodules 122 are: the n slave execution sub-modules 122 of the 1st row, the n slave execution sub-modules 122 of the m-th row, and the m slave execution sub-modules 122 of the 1st column.
As shown in fig. 2c, the k slave execution sub-modules only include the n slave execution sub-modules in the 1 st row, the n slave execution sub-modules in the m th row, and the m slave execution sub-modules in the 1 st column, that is, the k slave execution sub-modules are slave execution sub-modules directly connected to the master execution sub-module from among the plurality of slave execution sub-modules. The k slave execution sub-modules are used for forwarding data and instructions between the master execution sub-module and the plurality of slave execution sub-modules.
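The following sketch (Python, illustrative only) enumerates which slave execution submodules in an m x n array are the k submodules directly connected to the master execution submodule, i.e. row 1, row m and column 1:

def directly_connected_slaves(m, n):
    # 1-based (row, column) indices of the k directly connected slave submodules
    k = set()
    k.update((1, c) for c in range(1, n + 1))   # the n submodules of the 1st row
    k.update((m, c) for c in range(1, n + 1))   # the n submodules of the m-th row
    k.update((r, 1) for r in range(1, m + 1))   # the m submodules of the 1st column
    return sorted(k)

# For a 3 x 4 array, len(directly_connected_slaves(3, 4)) == 9 directly connected submodules.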
FIG. 2d shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 2d, the execution module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master execution submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave execution submodules 122, respectively.
The tree sub-module 124 has a transceiving function, and is used for forwarding data and/or execution instructions between the master execution sub-module 121 and the slave execution sub-module 122.
In one possible implementation, the tree submodule 124 may be an optional structure of the apparatus, and may include at least one level of nodes. The nodes are wiring structures with a forwarding function, and the nodes themselves have no execution function. The lowest-level nodes are connected to the slave execution submodules to forward data and/or execution instructions between the master execution submodule 121 and the slave execution submodules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, fig. 2e shows a block diagram of a gather instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 2e, the n-ary tree structure may be a binary tree structure, with the tree sub-modules comprising 2 levels of nodes 01. The lowest node 01 is connected with the slave execution submodule 122 to forward data and/or execution instructions between the master execution submodule 121 and the slave execution submodule 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
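As a rough illustration only (a toy calculation, not part of the disclosure), the number of node levels an n-ary tree needs so that its lowest level can serve a given number of slave execution submodules can be estimated as follows:

def tree_levels(num_slaves, n=2):
    # 0 levels means the tree submodule is not needed at all
    levels, capacity = 0, 1
    while capacity < num_slaves:
        levels += 1
        capacity *= n
    return levels

# tree_levels(4) -> 2, matching a binary tree with 2 levels of nodes as in fig. 2e
# (assuming four slave execution submodules in that figure).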
In one possible implementation, the main execution submodule 121 may include one or more comparators for performing comparison execution in the preceding process and/or the subsequent process.
In this implementation, the number of comparators and selectors in the slave execution submodule may be set according to the size of the data amount required for comparison execution, the processing speed of comparison execution, efficiency, and other requirements, which is not limited by the present disclosure. For example, taking a preset condition that "the index data is not 0" as an example, the comparator may compare the index data with 0 to obtain a comparison result. And the slave execution submodule is enabled to determine the data to be executed corresponding to the index data which is not 0 as the selected data according to the comparison result, and the selected data is transmitted to the master execution submodule. The main execution submodule may determine a number of selected data, and store the number of selected data and the selected data sequentially in a first address and a second address of a target address.
In one possible implementation, the operation domain may also include a read-in amount or a memory address of the read-in amount. The control module 11 is further configured to obtain a read amount, and obtain at least one to-be-executed data according to the read amount. The data volume of at least one piece of data to be executed is smaller than or equal to the read-in volume, and the read-in volume is smaller than or equal to the data volume of at least one piece of index data.
In this implementation, the read-in amount may be a data amount of the acquired at least one to-be-executed data, and may be a size of the acquired to-be-executed data. When the operation field directly contains a specific numerical value of the read amount, the numerical value may be determined as the read amount. When the memory address of the read amount is included in the operation field, the read amount can be acquired from the memory address.
In one possible implementation, when the read-in amount is not included in the operation domain, at least one to-be-executed data may be acquired according to a preset default read-in amount. The acquired data volume of the at least one piece of data to be executed is smaller than or equal to a default read-in volume, and the default read-in volume is smaller than or equal to the data volume of the at least one piece of index data.
In this implementation, the data size of the at least one data to be executed, the data size of the at least one index data, and the data size of the target address capable of data storage may be the same, and may all be equal to the read-in size or the default read-in size.
Therefore, the execution module can sequentially store the data to be executed corresponding to the index data meeting the preset condition into the target address, and the problems of insufficient target address, target address waste and the like are avoided.
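A small sketch of how the read-in amount could be resolved is given below (Python, with assumed field names; the disclosure only requires that the amount may be given directly, given by a storage address, or defaulted):

DEFAULT_READ_IN = 8  # assumed default read-in amount; the disclosure does not fix a value

def resolve_read_amount(operation_domain, memory):
    size = operation_domain.get("size")
    if size is None:
        return DEFAULT_READ_IN            # operation domain carries no read-in amount
    if operation_domain.get("size_is_address", False):
        return memory[size]               # read-in amount stored at a memory address
    return size                           # read-in amount given directly as a value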
In one possible implementation, as shown in fig. 2 a-2 e, the apparatus may further include a storage module 13. The storage module 13 is configured to store at least one index data, at least one to-be-executed data, and a preset condition.
In this implementation, the storage module may include one or more of a memory, a cache, and a register, and the cache may include a scratch pad cache. The at least one index data, the at least one data to be executed, and the preset condition may be stored in a memory, a cache, and/or a register of the storage module as needed, which is not limited by the present disclosure.
In a possible implementation manner, the apparatus may further include a direct memory access module for reading data from or storing data into the storage module.
In one possible implementation, as shown in fig. 2 a-2 e, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
The instruction storage submodule 111 is used to store a gather instruction.
The instruction processing sub-module 112 is configured to parse the collection instruction to obtain an operation code and an operation domain of the collection instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes a plurality of collection instructions arranged in sequence according to an execution order.
In this implementation, the instruction queue may be obtained by arranging the execution order of the plurality of gather instructions according to the receiving time, priority level, and the like of the gather instructions, so as to sequentially execute the plurality of gather instructions according to the instruction queue.
In one possible implementation, as shown in fig. 2 a-2 e, the execution module 12 may include a dependency processing sub-module 122.
The dependency relationship processing submodule 122 is configured to, when it is determined that the first collection instruction has an association relationship with a zeroth collection instruction before the first collection instruction, cache the first collection instruction in the instruction storage submodule 111, and after the zeroth collection instruction is executed, extract the first collection instruction from the instruction storage submodule 111 and send the first collection instruction to the execution module 12. The first gather instruction and the zeroth gather instruction are instructions among the plurality of gather instructions.
Wherein, the association relationship between the first collection instruction and the zeroth collection instruction before the first collection instruction comprises: the first memory address interval storing data required by the first gather instruction has an overlapping region with the zeroth memory address interval storing data required by the zeroth gather instruction. Conversely, the no association relationship between the first gather instruction and the zeroth gather instruction may be that the first memory address interval and the zeroth memory address interval have no overlapping area.
By this method, according to the dependency relationship among the collection instructions, the latter collection instruction is executed only after the former collection instruction has been executed, which ensures the accuracy of the execution result.
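The association (dependency) test between two collection instructions reduces to an interval-overlap check, sketched below in Python (intervals are assumed to be half-open (start, end) pairs):

def has_dependency(first_interval, zeroth_interval):
    # Dependent if the memory address interval of the data required by the first
    # instruction overlaps the interval required by the zeroth instruction.
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end

# When has_dependency(...) is True, the first collection instruction is cached and
# only issued after the zeroth collection instruction has finished executing.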
In one possible implementation, the instruction format of the gather instruction may be:
collect dst,src0,src1,size
wherein, collect is the operation code of the collection instruction, dst, src0, src1, and size are the operation domain of the collection instruction. dst is a target address, wherein dst comprises two parts of a first address and a second address, src0 is a data address to be executed, src1 is an index data address, and size is a read-in amount. The execution module may obtain the index data of the size and the data to be executed from the storage module according to the analyzed instruction, and when the index data meets a preset condition, the data to be executed corresponding to the index data meeting the preset condition is used as the selected data. The execution module may further determine the number of selected data and store the selected data and the number thereof in the memory space pointed to by the target address. Alternatively, the value of the index data may also be a bit value represented by 0 or 1.
When there are multiple data to be operated on, src0 may include one to-be-operated data address src00 or multiple to-be-operated data addresses src00, src01, src02 …, src0n, where one to-be-operated data address may include one piece of data to be operated on or one group of data to be operated on, which is not limited in this disclosure. When there are multiple data to be operated on, the index data may be one or more; the index data may be the same number of data of the same data type as the data to be operated on, in which case src1 may include multiple index data addresses src10, src11, src12 …, src1n. The index data may also be bit data, in which case src1 is a bit string whose number of bits is equal to the number of data to be operated on, which is not limited by this disclosure.
When the data to be operated is multiple, the instruction format may include multiple data addresses to be operated and one or more index data addresses, and the instruction format of the collection instruction may be as follows, taking two data to be operated as an example:
collect,dst,src00,src01,src1,size
the instruction format of the gather instruction may also be:
collect,dst,src00,src01,src10,src11,size
It should be understood that the instruction format of the gather instruction, and the positions of the operation code and the operation domain within it, may be set as desired by those skilled in the art, and are not limited by the present disclosure.
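Purely as an illustration of the textual formats above (the real encoding is hardware-specific and not fixed by the disclosure), a gather instruction string could be split into its opcode and operation-domain fields like this:

def parse_collect(text):
    # Accepts both "collect dst,src0,src1,size" and "collect,dst,src00,src01,src1,size"
    fields = [f for f in text.replace(",", " ").split()]
    opcode, dst, *srcs, size = fields
    return {"opcode": opcode, "dst": dst, "srcs": srcs, "size": int(size)}

# parse_collect("collect 500,100,200,5")
# -> {'opcode': 'collect', 'dst': '500', 'srcs': ['100', '200'], 'size': 5}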
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the collection instruction processing apparatus is described above by taking the above-described embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.
Application example
An application example according to an embodiment of the present disclosure is given below in conjunction with "data collection by a collection instruction processing apparatus" as one exemplary application scenario to facilitate understanding of the flow of the collection instruction processing apparatus. It is to be understood by those skilled in the art that the following application examples are for the purpose of facilitating understanding of the embodiments of the present disclosure only and are not to be construed as limiting the embodiments of the present disclosure.
Fig. 3 shows a schematic diagram of an application scenario of a collection instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3, the collection instruction processing means processes the collection instruction as follows:
the control module 11 parses the acquired collection instruction 1 (for example, the collection instruction 1 is @ select #500#100#200#5), and obtains the operation code and the operation domain of the collection instruction 1. The operation code of the collection instruction 1 is select, the target address is 500, the target address includes a first address and a second address, the data address to be executed is 100, the index data address is 200, and the read-in amount is 5. The control module 11 acquires a plurality of data to be executed and a plurality of index data with a read-in amount of 5 from the data address to be executed 100 and the index data address 200, respectively.
It is assumed that the obtained plurality of data to be executed includes 1, 5, 6, 7, 3. The plurality of index data includes 1, 8, 0, 6, and 9. The preset condition is that the index data is not 0.
The comparator of the execution module 12 determines whether each of the plurality of index data is 0, and the data to be executed corresponding to the index data that are not 0 are sequentially stored into the target address 500. Specifically, the execution module 12 determines whether the index data "1, 8, 0, 6, 9" are 0; since the third index datum is 0, the selected data satisfying the preset condition are "1, 5, 7, 3", which are sequentially stored into the second address of the target address 500. The counter of the execution module 12 determines, according to the selected data stored into the second address, that the number of the selected data is 4, and stores this number into the first address of the target address 500. The working process of the above modules can refer to the above related description. As shown in fig. 3, the first address may be the memory space pointed to by the first row of the target address, and the second address may be the memory space pointed to by the other rows of the target address.
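The numerical example above can be reproduced with a few lines of Python (a sketch only; the variable names are assumptions):

data_to_execute = [1, 5, 6, 7, 3]   # read from to-be-executed data address 100, read-in amount 5
index_data      = [1, 8, 0, 6, 9]   # read from index data address 200

selected = [d for d, i in zip(data_to_execute, index_data) if i != 0]  # preset condition: index != 0
count = len(selected)

assert selected == [1, 5, 7, 3] and count == 4
# count (4) is stored at the first address of target address 500,
# and the selected data [1, 5, 7, 3] at the second address.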
In a possible implementation manner, the obtained index data may be a bit value 11011, each bit of the bit value corresponds to one to-be-executed data, and the preset condition is that the value of the bit of the index data corresponding to the to-be-executed data is not 0.
The comparator of the execution module 12 determines whether each bit of the index data is 0, and the selector of the execution module may sequentially store the data to be executed corresponding to the bits of the index data that are not 0 into the target address 500. Specifically, the comparator of the execution module 12 determines whether the bits "1, 1, 0, 1, 1" of the index data are 0; since the third bit of the index data is 0, the selector may sequentially store the selected data "1, 5, 7, 3" among the plurality of data to be executed, as the selected data satisfying the preset condition, into the second address of the target address 500, and the counter of the execution module 12 determines the number of the selected data to be 4 according to the selected data stored into the second address, and stores the number of the selected data into the first address of the target address 500. As shown in fig. 3, the first address may be the memory space pointed to by the first row of the target address, and the second address may be the memory space pointed to by the other rows of the target address.
Thus, the collection instruction processing device can process the collection instruction efficiently and quickly.
The present disclosure provides a machine learning execution device, which may include one or more of the above-described collection instruction processing devices, and is configured to acquire data to be executed and control information from other processing devices, and execute a designated machine learning execution. The machine learning execution device may obtain the collection instruction from other machine learning execution devices or non-machine learning execution devices, and transmit the execution result to the peripheral device (also referred to as other processing devices) through the I/O interface. Peripheral devices such as cameras, displays, mice, keyboards, network cards, wifi interfaces, servers. When more than one collection instruction processing device is included, the collection instruction processing devices can be linked and transmit data through a specific structure, for example, a PCIE bus for interconnection and data transmission, so as to support larger-scale implementation of the neural network. At this time, the same control system may be shared, or there may be separate control systems; the memory may be shared or there may be separate memories for each accelerator. In addition, the interconnection mode can be any interconnection topology.
The machine learning execution device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 4a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 4a, the combined processing device includes the machine learning execution device, the universal interconnection interface, and other processing devices. And the machine learning execution device interacts with other processing devices to jointly complete the operation specified by the user.
Other processing devices include one or more types of general-purpose/special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning execution device and external data and control, performing data transfer and completing basic control such as starting and stopping of the machine learning execution device; the other processing devices can also cooperate with the machine learning execution device to complete execution tasks.
The universal interconnection interface is used for transmitting data and control instructions between the machine learning execution device and the other processing devices. The machine learning execution device acquires the required input data from the other processing devices and writes the input data into a storage device on the machine learning execution device; it can obtain control instructions from the other processing devices and write them into an on-chip control cache of the machine learning execution device; and it can also read the data in the storage module of the machine learning execution device and transmit the data to the other processing devices.
Fig. 4b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 4b, the combined processing device may further include a storage device, and the storage device is connected to the machine learning execution device and the other processing device, respectively. The storage device is used for storing data stored in the machine learning execution device and the other processing devices, and is particularly suitable for storing all data which needs to be executed in the internal storage of the machine learning execution device or the other processing devices.
The combined processing device can be used as an SOC (system-on-chip) of equipment such as a mobile phone, a robot, an unmanned aerial vehicle and video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card or a WiFi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning execution device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 5 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 5, the board includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. The board may include, in addition to the machine learning chip 389, other kits including, but not limited to: memory device 390, interface device 391 and control device 392.
The memory device 390 is coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure) via a bus, and is used for storing data. The memory device 390 may include multiple groups of memory cells 393. Each group of memory cells 393 is coupled to the machine learning chip 389 via a bus. It is understood that each group of memory cells 393 may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393. Each group of memory cells 393 may include a plurality of DDR4 chips (granules). In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in which 64 bits are used for data transmission and 8 bits are used for ECC checking. It is appreciated that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer may reach 25600 MB/s.
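The 25600 MB/s figure follows directly from the DDR4-3200 transfer rate and the 64-bit data path, as this short check shows (simple arithmetic, not part of the disclosure):

transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 MT/s
data_width_bytes = 64 // 8            # 64-bit data path (the 8 ECC bits are excluded)
bandwidth_mb_s = transfers_per_second * data_width_bytes / 10**6
print(bandwidth_mb_s)                 # 25600.0 MB/s per group of memory cells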
In one embodiment, each group 393 of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the machine learning chip 389 for controlling data transfer and data storage of each memory unit 393.
The interface device 391 is electrically coupled to the machine learning chip 389 (or the machine learning chip within the machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface; the data to be processed is transmitted by the server to the machine learning chip 389 through the standard PCIE interface, so as to implement data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is still transmitted back by the interface device to the external device (e.g., a server).
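Similarly, the 16000 MB/s figure for a PCIE 3.0 x16 link is the nominal raw rate (8 GT/s per lane, one bit per transfer, 16 lanes; 128b/130b encoding overhead is ignored in this rough check):

lanes = 16
transfers_per_second_per_lane = 8 * 10**9   # PCIE 3.0: 8 GT/s per lane
bandwidth_mb_s = lanes * transfers_per_second_per_lane / 8 / 10**6   # bits -> bytes -> MB
print(bandwidth_mb_s)                       # 16000.0 MB/s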
The control device 392 is electrically connected to the machine learning chip 389. The control device 392 is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the machine learning chip 389 can be in different working states such as multi-load and light-load. The control device can regulate and control the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
FIG. 6 shows a flow diagram of a gather instruction processing method according to an embodiment of the present disclosure. As shown in fig. 6, the method is applied to the above-described collection instruction processing apparatus, and includes step S51 and step S52.
In step S51, the acquired collection instruction is parsed to obtain an operation code and an operation domain of the collection instruction, and at least one index data, at least one to-be-executed data, and a target address required for executing the collection instruction are obtained according to the operation code and the operation domain. The operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and a target address.
In step S52, selected data is determined from the data to be executed according to the index data, and the selected data and the number of the selected data are stored as the execution result of the collection instruction in the target address. In one possible implementation, the method may further include: and analyzing the collection instruction to obtain a plurality of execution instructions.
Wherein, the step S52 may include:
one or more comparators compare the index data with preset conditions to determine whether the index data meets the preset conditions;
and when the index data meet the preset condition, one or more selectors take the data to be executed corresponding to the index data meeting the preset condition as the selected data and store the selected data in a target address.
In one possible implementation, step S52 may include: the number of selected data is determined using a counter and stored at the target address.
In one possible implementation, the target address includes a first address for storing the number of the selected data and a second address for storing the selected data;
wherein the size of the address space pointed to by the first address is smaller than or equal to the size of the address space pointed to by the second address.
In a possible implementation manner, the operation domain may further include a read amount or a storage address of the read amount, and the step S51 may include: and acquiring the read-in amount, and acquiring a plurality of data to be executed according to the read-in amount. The data volume of at least one piece of data to be executed is smaller than or equal to the read-in volume, and the read-in volume is smaller than or equal to the data volume of the plurality of index data.
In one possible implementation, the method may further include: storing at least one index data, at least one data to be executed, and preset conditions.
In one possible implementation manner, the data to be executed is tensor data, the index data is tensor data corresponding to the data to be executed, and the number of the index data is greater than or equal to the number of the data to be executed.
In one possible implementation, the value of the index data is a bit value.
In one possible implementation, the execution module includes a master execution submodule including a counter and at least one slave execution submodule including a comparator and a selector;
the control module analyzes the acquired collection instruction to obtain at least one execution instruction, and sends the at least one index data, the at least one to-be-executed data and the at least one execution instruction to the slave execution submodule;
a comparator in the slave execution submodule compares the index data with a preset condition to determine whether the index data meets the preset condition; when the index data meet the preset conditions, a selector in the slave execution sub-module takes the data to be executed corresponding to the index data meeting the preset conditions as the selected data and sends the selected data to the master execution sub-module;
and the counter of the main execution submodule determines the number of the selected data according to the selected data transmitted by at least one slave execution submodule, and stores the number of the selected data and the selected data into the target address.
In one possible implementation, step S51 may include:
storing the collection instruction;
analyzing the collection instruction to obtain an operation code and an operation domain of the collection instruction;
and storing an instruction queue, wherein the instruction queue comprises a plurality of collecting instructions which are sequentially arranged according to an execution sequence.
In one possible implementation, the method may further include:
caching the first collection instruction when the first collection instruction is determined to have an association relation with a zeroth collection instruction before the first collection instruction, executing the first collection instruction after the zeroth collection instruction is executed,
wherein, the association relationship between the first collection instruction and the zeroth collection instruction before the first collection instruction comprises:
the first memory address interval storing data required by the first gather instruction has an overlapping region with the zeroth memory address interval storing data required by the zeroth gather instruction.
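A hedged sketch of the overlap test implied by this association relationship (the half-open interval representation is an assumption):

def has_association(first_interval, zeroth_interval):
    # True when the first storage address interval overlaps the zeroth one,
    # in which case the first collection instruction is cached and executed
    # only after the zeroth collection instruction has finished.
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end

print(has_association((0x100, 0x200), (0x180, 0x280)))  # True  -> cache and wait
print(has_association((0x100, 0x200), (0x200, 0x300)))  # False -> no dependency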
In one possible implementation, the preset condition may include that the index data is not zero.
It should be noted that, although the collection instruction processing method is described above by taking the above-described embodiments as examples, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly configure each step according to personal preference and/or the actual application scenario, as long as the result conforms to the technical solution of the present disclosure.
The collection instruction processing method provided by the embodiment of the disclosure has the advantages of wide application range, high collection instruction processing efficiency and high processing speed.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present disclosure, it should be understood that the disclosed system and apparatus may be implemented in other ways. For example, the above-described embodiments of systems and apparatuses are merely illustrative: the division into devices, apparatuses and modules is merely a logical division, and an actual implementation may use another division; for example, a plurality of modules may be combined or integrated into another system or apparatus, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, apparatuses or modules, and may be in electrical or other forms.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present disclosure may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a form of hardware or a form of a software program module.
The integrated modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing may be better understood in light of the following clauses:
clause 1: a gather instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address which are required by executing the collection instruction according to the operation code and the operation domain;
the execution module is used for determining selected data from the data to be executed according to the index data and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address;
the operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and the target address.
Clause 2: the apparatus of clause 1, the execution module comprising:
one or more comparators, configured to compare the index data with a preset condition, and determine whether the index data satisfies the preset condition;
one or more selectors, configured to, when the index data satisfies the preset condition, take data to be executed corresponding to the index data satisfying the preset condition as the selected data; and a counter for determining the number of said selected data.
Clause 3: the apparatus according to clause 1 or 2, the data to be executed being tensor data, the index data being tensor data corresponding to the data to be executed, the number of the index data being greater than or equal to the number of the data to be executed.
Clause 4: The apparatus of clause 3, wherein the value of the index data is a bit value.
Clause 5: the apparatus of clause 1 or 2, the target address comprising a first address for storing the number of the selected data and a second address for storing the selected data;
wherein the size of the address space pointed to by the first address is smaller than or equal to the size of the address space pointed to by the second address.
Clause 6: the apparatus of any of clauses 1-5, the execution module comprising a master execution submodule including the counter and at least one slave execution submodule including the comparator and selector;
the control module is further configured to parse the compiled collection instruction to obtain at least one execution instruction, and send the at least one index data, the at least one to-be-executed data, and the at least one execution instruction to the slave execution submodule;
the one or more comparators of the slave execution submodule are used for comparing the index data with preset conditions to determine whether the index data meet the preset conditions, and the selector is used for taking the data to be executed corresponding to the index data meeting the preset conditions as the selected data and sending the selected data to the master execution submodule when the index data meet the preset conditions;
and the counter of the main execution submodule is used for determining the number of the selected data and storing the number of the selected data and the selected data into the first address and the second address of the target address.
Clause 7: the apparatus according to any of clauses 1-6, the operation domain further comprising a read-in amount or a storage address of the read-in amount,
wherein, the control module is further configured to obtain the read-in amount and obtain the at least one to-be-executed data according to the read-in amount,
wherein, the data volume of the at least one data to be executed is less than or equal to the read-in volume, and the read-in volume is less than or equal to the data volume of the at least one index data.
Clause 8: the apparatus of any of clauses 1-7, further comprising:
and the storage module is used for storing at least one of the index data, the data to be executed and the preset condition.
Clause 9: the apparatus of any of clauses 1-8, wherein the control module comprises:
the instruction storage submodule stores the collection instruction;
the instruction processing submodule analyzes the collection instruction to obtain an operation code and an operation domain of the collection instruction; and the queue storage submodule stores an instruction queue, and the instruction queue comprises a plurality of collection instructions which are sequentially arranged according to an execution sequence.
Clause 10: the apparatus of any of clauses 1-9, wherein the control module further comprises:
the dependency relationship processing submodule is used for caching a first to-be-executed instruction in the instruction storage submodule when the fact that the incidence relationship exists between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction is determined, extracting the first to-be-executed instruction from the instruction storage submodule after the zeroth to-be-executed instruction is executed, and sending the first to-be-executed instruction to the execution module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 11: the apparatus of any of clauses 1-10, wherein the predetermined condition comprises index data being non-zero.
Clause 12: a gather instruction processing method applied to a gather instruction processing apparatus, the method comprising:
analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required by executing the collection instruction according to the operation code and the operation domain;
determining selected data from the data to be executed according to the index data, and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address;
the operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and the target address.
Clause 13: the method of clause 12, wherein determining selected data from the data to be executed according to the index data and storing the selected data and the number of selected data as the execution result of the gather instruction in the target address, comprises:
one or more comparators compare the index data with preset conditions to determine whether the index data meets the preset conditions;
when the index data meet the preset conditions, one or more selectors take the data to be executed corresponding to the index data meeting the preset conditions as the selected data and store the selected data into the target address;
a counter determines the number of the selected data and stores the number of the selected data in the target address.
Clause 14: the method according to clause 12 or 13, the data to be executed is tensor data, the index data is tensor data corresponding to the data to be executed, and the number of the index data is greater than or equal to the number of the data to be executed.
Clause 15: The method of clause 14, wherein the value of the index data is a bit value.
Clause 16: the method of clause 13 or 14, wherein the target address comprises a first address for storing the number of the selected data and a second address for storing the selected data;
wherein the size of the address space pointed to by the first address is smaller than or equal to the size of the address space pointed to by the second address.
Clause 17: the method of clause 13, wherein the execution module comprises a master execution submodule comprising the counter and at least one slave execution submodule comprising the comparator and selector;
the control module parses the compiled collection instruction to obtain at least one execution instruction, and sends the at least one index data, the at least one to-be-executed data and the at least one execution instruction to the slave execution submodule;
the comparator of the slave execution submodule compares the index data with a preset condition to determine whether the index data meets the preset condition;
when the index data meet the preset conditions, the selector of the slave execution sub-module takes the data to be executed corresponding to the index data meeting the preset conditions as the selected data and sends the selected data to the master execution sub-module;
and the counter of the main execution submodule determines the number of the selected data and stores the number of the selected data and the selected data into the first address and the second address of the target address.
Clause 18: the method according to clause 12, the operation domain further comprising a read-in amount or a storage address of the read-in amount,
wherein the control module acquires the read-in amount and acquires the at least one data to be executed according to the read-in amount,
wherein, the data volume of the at least one data to be executed is less than or equal to the read-in volume, and the read-in volume is less than or equal to the data volume of the at least one index data.
Clause 19: the method of clause 12, further comprising:
the storage module stores at least one of the index data, the data to be executed, and the preset condition.
Clause 20: according to the method of clause 12, the parsing the acquired collection instruction to obtain the operation code and the operation domain of the collection instruction includes:
the instruction storage submodule stores the collection instruction;
the instruction processing submodule analyzes the collection instruction to obtain an operation code and an operation domain of the collection instruction;
the queue storage submodule stores an instruction queue, and the instruction queue comprises a plurality of collection instructions which are sequentially arranged according to an execution sequence.
Clause 21: the method of any of clauses 12-20, further comprising:
when determining that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, the dependency relationship processing submodule caches the first to-be-executed instruction in the instruction storage submodule, and after the zeroth to-be-executed instruction is executed, extracts the first to-be-executed instruction from the instruction storage submodule and sends the first to-be-executed instruction to the execution module,
wherein the association relationship between the first to-be-executed instruction and a zeroth to-be-executed instruction before the first to-be-executed instruction comprises:
and a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area.
Clause 22: the method of any of clauses 12 to 21, wherein the predetermined condition comprises index data being non-zero.
Clause 22: a computer-readable medium, in which a computer program is stored, which, when being executed by one or more processing means, carries out the method steps of clauses 12-21.
The foregoing detailed description of the embodiments of the present application illustrates the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and core concept of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as a limitation on the present application.

Claims (10)

1. A gather instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required by the collection instruction according to the operation code and the operation domain;
the execution module is used for determining selected data from the data to be executed according to the index data and storing the selected data and the number of the selected data as the execution result of the collection instruction into the target address;
the operation code is used for indicating that the operation of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and the target address.
2. The apparatus of claim 1, wherein the execution module comprises:
one or more comparators, configured to compare the index data with a preset condition, and determine whether the index data satisfies the preset condition;
one or more selectors, configured to, when the index data satisfies the preset condition, take data to be executed corresponding to the index data satisfying the preset condition as the selected data; and
a counter for determining the number of the selected data.
3. The apparatus according to claim 1 or 2, wherein the data to be executed is tensor data, the index data is tensor data corresponding to the data to be executed, and the number of the index data is greater than or equal to the number of the data to be executed.
4. The apparatus of claim 3, wherein the value of the index data is a bit value.
5. The apparatus according to claim 1 or 2, wherein the target address includes a first address for storing the number of the selected data and a second address for storing the selected data;
wherein the size of the address space pointed to by the first address is smaller than or equal to the size of the address space pointed to by the second address.
6. The apparatus of claim 1, wherein the execution module comprises a master execution submodule comprising a counter and at least one slave execution submodule comprising a comparator and a selector;
the control module is further configured to parse the collection instruction to obtain at least one execution instruction, and send the at least one index data, the at least one to-be-executed data, and the at least one execution instruction to the slave execution submodule;
the one or more comparators of the slave execution submodule are used for comparing the index data with preset conditions and determining whether the index data meet the preset conditions, and the selector of the slave execution submodule is used for taking the data to be executed corresponding to the index data meeting the preset conditions as the selected data and sending the selected data to the master execution submodule when the index data meet the preset conditions;
the counter of the main execution submodule is used for determining the number of the selected data according to the selected data transmitted by at least one slave execution submodule and storing the number of the selected data and the selected data into the target address.
7. The device according to claim 1, characterized in that the operation domain further comprises a read-in quantity or a storage address of the read-in quantity,
wherein, the control module is further configured to obtain the read-in amount and obtain the at least one to-be-executed data according to the read-in amount,
wherein, the data volume of the at least one data to be executed is less than or equal to the read-in volume, and the read-in volume is less than or equal to the data volume of the at least one index data.
8. The apparatus according to any one of claims 1 to 7, wherein the predetermined condition comprises that the index data is not zero.
9. A gather instruction processing method, the method comprising:
analyzing the acquired collection instruction to obtain an operation code and an operation domain of the collection instruction, and acquiring at least one index data, at least one to-be-executed data and a target address required by the collection instruction according to the operation code and the operation domain;
determining selected data from the data to be executed according to the index data, and storing the selected data and the number of the selected data as an execution result of the collection instruction into the target address;
the operation code is used for indicating that the execution of the collection instruction on the data is collection execution, and the operation domain comprises a data address to be executed, an index data address and the target address.
10. A computer-readable medium, in which a computer program is stored which, when being executed by one or more processing means, carries out the steps of the method as claimed in claim 9.
CN201910740813.7A 2019-05-17 2019-08-12 Execution method, execution device and related product CN112396186B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910740813.7A CN112396186B (en) 2019-08-12 Execution method, execution device and related product
PCT/CN2020/088248 WO2020233387A1 (en) 2019-05-17 2020-04-30 Command processing method and apparatus, and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910740813.7A CN112396186B (en) 2019-08-12 Execution method, execution device and related product

Publications (2)

Publication Number Publication Date
CN112396186A true CN112396186A (en) 2021-02-23
CN112396186B CN112396186B (en) 2024-05-03


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6360220B1 (en) * 1998-08-04 2002-03-19 Microsoft Corporation Lock-free methods and systems for accessing and storing information in an indexed computer data structure having modifiable entries
CN101131719A (en) * 2006-08-23 2008-02-27 北京同方微电子有限公司 Micro-processor kernel used for cryptography arithmetic
US20100138378A1 (en) * 2005-07-08 2010-06-03 Brainlike Surveillance Research, Inc. System and Method for Auto-Adaptive Network
CN109032670A (en) * 2018-08-08 2018-12-18 上海寒武纪信息科技有限公司 Processing with Neural Network device and its method for executing vector duplicate instructions
CN109492241A (en) * 2018-08-10 2019-03-19 北京中科寒武纪科技有限公司 Conversion method, device, computer equipment and storage medium
CN109657782A (en) * 2018-12-14 2019-04-19 北京中科寒武纪科技有限公司 Operation method, device and Related product
CN109726822A (en) * 2018-12-14 2019-05-07 北京中科寒武纪科技有限公司 Operation method, device and Related product


Similar Documents

Publication Publication Date Title
CN111079909B (en) Operation method, system and related product
CN112396186B (en) Execution method, execution device and related product
CN112395003A (en) Operation method, device and related product
CN112396186A (en) Execution method, device and related product
CN111353595A (en) Operation method, device and related product
CN112394985A (en) Execution method, device and related product
CN111813449A (en) Operation method, device and related product
CN111124497B (en) Operation method, operation device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111400341B (en) Scalar lookup instruction processing method and device and related product
CN111078125B (en) Operation method, device and related product
CN111290789B (en) Operation method, operation device, computer equipment and storage medium
CN111078285B (en) Operation method, system and related product
CN111079914B (en) Operation method, system and related product
CN111079915B (en) Operation method, device and related product
CN111079911B (en) Operation method, system and related product
CN112346781A (en) Instruction processing method and device and related product
CN112346707A (en) Instruction processing method and device and related product
CN111966401A (en) Instruction processing method and device and related product
CN111966402A (en) Instruction processing method and device and related product
CN111966325A (en) Instruction processing method and device and related product
CN111966403A (en) Instruction processing method and device and related product
CN111813537A (en) Operation method, device and related product
CN112394999A (en) Operation method, device and related product
CN111399905A (en) Operation method, device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination