CN112395008A

CN112395008A - Operation method, operation device, computer equipment and storage medium

Info

Publication number: CN112395008A
Application number: CN201910745264.2A
Authority: CN
Inventors: Not disclosed (the inventor requested non-publication of the name)
Original assignee: Shanghai Cambricon Information Technology Co Ltd
Current assignee: Shanghai Cambricon Information Technology Co Ltd
Priority date: 2019-08-13
Filing date: 2019-08-13
Publication date: 2021-02-23

Abstract

The present disclosure relates to an operation method, an operation device, a computer device, and a storage medium. A combined processing device comprises a machine learning arithmetic device, a universal interconnection interface, and other processing devices; the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user. The combined processing device further comprises a storage device, which is connected to the machine learning arithmetic device and to the other processing devices, respectively, and is used for storing the data of the machine learning arithmetic device and the other processing devices. The operation method, the operation device, the computer equipment, and the storage medium provided by the embodiments of the disclosure have a wide application range, high operation processing efficiency, and high processing speed.

Description

Operation method, operation device, computer equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a most-valued pooling index instruction, a computer device, and a storage medium.

Background

With the continuous development of science and technology, machine learning, and especially neural network algorithms, is more and more widely used, and has been applied successfully in fields such as image recognition, voice recognition, and natural language processing. However, as the complexity of neural network algorithms grows, the types and the number of data operations involved keep increasing. In the related art, the efficiency and the speed of performing the most-valued pooling index operation on data are low.

Disclosure of Invention

In view of the above, the present disclosure provides a method, an apparatus, a computer device, and a storage medium for processing a most-valued pooling index instruction, so as to improve the efficiency and speed of performing a most-valued pooling index operation on data.

According to a first aspect of the present disclosure, there is provided a most-valued pooling index instruction processing apparatus, the apparatus comprising:

the control module is used for analyzing the obtained most-valued pooling index instruction to obtain an operation domain of the most-valued pooling index instruction, and acquiring data to be operated, a pooling core and a target address required by executing the most-valued pooling index instruction according to the operation domain;

and the operation module is used for performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result, and storing the operation result into the target address, wherein the operation result comprises the index of the region where the most value is located in each region of the data to be operated corresponding to the pooling core.

According to a second aspect of the present disclosure, there is provided a machine learning arithmetic device, the device including:

one or more of the most-valued pooling index instruction processing devices of the first aspect described above, configured to obtain data to be operated and control information from other processing devices, execute a specified machine learning operation, and transmit an execution result to the other processing devices through an I/O interface;

when the machine learning arithmetic device comprises a plurality of the most valued pooling index instruction processing devices, the plurality of the most valued pooling index instruction processing devices can be connected through a specific structure and transmit data;

wherein the plurality of most-valued pooling index instruction processing devices are interconnected through a PCIE (Peripheral Component Interconnect Express) bus and transmit data so as to support larger-scale machine learning operations; the plurality of most-valued pooling index instruction processing devices share the same control system or have their own respective control systems; the plurality of most-valued pooling index instruction processing devices share a memory or have their own respective memories; and the interconnection mode of the plurality of most-valued pooling index instruction processing devices is any interconnection topology.

According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:

the machine learning arithmetic device, the universal interconnect interface, and the other processing device according to the second aspect;

and the machine learning operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user.

According to a fourth aspect of the present disclosure, there is provided a machine learning chip including the machine learning arithmetic device of the second aspect or the combined processing device of the third aspect.

According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.

According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip package structure of the fifth aspect.

According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board card of the sixth aspect.

According to an eighth aspect of the present disclosure, there is provided a most-valued pooling index instruction processing method applied to a most-valued pooling index instruction processing apparatus, the method including:

analyzing the obtained most-valued pooling index instruction to obtain an operation code and an operation domain of the most-valued pooling index instruction, and obtaining data to be operated, a pooling core and a target address which are required by executing the most-valued pooling index instruction according to the operation domain;

and performing the most valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result, and storing the operation result into the target address, wherein the operation result comprises the index of the region where the most value in the region data of the data to be operated corresponding to the pooling core is located.

According to a ninth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described most-valued pooling index instruction processing method.

In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

In some embodiments, the vehicle comprises an aircraft, a ship, and/or a motor vehicle; the household appliance comprises a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood; the medical device comprises a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus, and/or an electrocardiograph.

The device comprises a control module and an operation module. The control module is used for parsing the obtained most-valued pooling index instruction to obtain an operation domain of the most-valued pooling index instruction, and obtaining, according to the operation domain, the data to be operated, the pooling core, and the target address required for executing the most-valued pooling index instruction. The operation module is used for performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result and storing the operation result into the target address, wherein the operation result comprises the index of the region where the most value is located in each region of the data to be operated corresponding to the pooling core. The method and the device for processing the most-valued pooling index instruction and the related products provided by the embodiments of the disclosure have a wide application range, and both process the most-valued pooling index instruction and perform the most-valued pooling index operation with high efficiency and high speed.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

Fig. 1 illustrates a block diagram of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure.

Figs. 2a and 2b are schematic diagrams illustrating a most-valued pooling index operation according to an embodiment of the disclosure.

Figs. 2c-2e illustrate schematic diagrams of indexing approaches for a pooling core according to an embodiment of the present disclosure.

Figs. 3a-3f illustrate block diagrams of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure.

Figs. 4a-4c are diagrams illustrating a most-valued pooling index operation according to an embodiment of the disclosure.

Fig. 5 is a schematic diagram illustrating an application scenario of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure.

Figs. 6a and 6b show block diagrams of a combined processing device according to an embodiment of the present disclosure.

Fig. 7 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.

FIG. 8 illustrates a flow diagram of a method of most-valued pooling index instruction processing according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.

It should be understood that the terms "zero," "first," "second," and the like in the claims, the description, and the drawings of the present disclosure are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.

As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if determined" or "if [ a described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ a described condition or event ]" or "in response to detecting [ a described condition or event ]".

Due to the wide use of neural network algorithms and the continuous improvement of the computing power of computer hardware, the types and the number of data operations involved in practical applications keep increasing. The most-valued pooling index operation (maxpool_index, minpool_index, maxabspool_index, minabspool_index) is an operation that obtains the index corresponding to the most value among all the data in a local region. Because programming languages are of many kinds, and no most-valued pooling index instruction that can be widely applied across programming languages is available at the present stage, in the related art technicians need to customize one or more instructions for each programming language environment in order to realize the most-valued pooling index operation, so the efficiency and the speed of the operation are low. The present disclosure provides a method and an apparatus for processing a most-valued pooling index instruction, a computer device, and a storage medium, which can realize the most-valued pooling index operation with only one instruction and can significantly improve the efficiency and speed of performing the most-valued pooling index operation.

Fig. 1 illustrates a block diagram of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and an operation module 12.

The control module 11 is configured to parse the obtained most-valued pooling index instruction to obtain an operation code and an operation domain of the most-valued pooling index instruction, and to obtain, according to the operation domain, the data to be operated, the pooling core, and the target address required for executing the most-valued pooling index instruction. The operation code indicates that the operation performed on the data by the most-valued pooling index instruction is the most-valued pooling index operation, and the operation domain comprises the address of the data to be operated, the pooling core, and the target address.

The operation module 12 is configured to perform the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result, and to store the operation result in the target address, where the operation result includes the index of the region where the most value is located in each region of the data to be operated corresponding to the pooling core.

In this embodiment, the control module may obtain the data to be operated from the data address to be operated. The control module may obtain instructions and data through a data input output unit, which may be one or more data I/O interfaces or I/O pins.

In this embodiment, the operation code is the part of an instruction or field (usually represented by a code) that specifies the operation to be performed in a computer program; it is an instruction sequence number used to inform the device executing the instruction which instruction specifically needs to be executed. The operation domain is the source of the data required for executing the corresponding instruction, and that data includes parameters such as the data to be operated on, the pooling core, and the corresponding operation method. A most-valued pooling index instruction must include an operation code and an operation domain, where the operation domain includes at least the address of the data to be operated on, the pooling core, and the target address.

It should be understood that the instruction format of the most-valued pooling index instruction, and the opcode and operation field included therein may be set as desired by those skilled in the art, and the disclosure is not limited thereto.
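As an illustration of the split between operation code and operation domain, the following is a minimal Python sketch. The textual layout `opcode src_addr kernel_h kernel_w dst_addr`, the class name `PoolIndexInstruction`, and the function `parse_instruction` are assumptions made for this example only; the disclosure does not fix an instruction encoding.

```python
from dataclasses import dataclass

# The four operation names appear in the description; treating them as
# opcode strings is an assumption of this sketch.
OPCODES = {"maxpool_index", "minpool_index", "maxabspool_index", "minabspool_index"}


@dataclass(frozen=True)
class PoolIndexInstruction:
    opcode: str     # which most-valued pooling index operation to perform
    src_addr: int   # address of the data to be operated on
    kernel: tuple   # pooling core size as (height, width)
    dst_addr: int   # target address for the operation result


def parse_instruction(text: str) -> PoolIndexInstruction:
    """Split a hypothetical textual instruction into opcode and operation domain."""
    fields = text.split()
    if len(fields) != 5:
        raise ValueError("expected: opcode src_addr kernel_h kernel_w dst_addr")
    opcode, src, kh, kw, dst = fields
    if opcode not in OPCODES:
        raise ValueError(f"unknown opcode: {opcode}")
    # int(x, 0) accepts both decimal and 0x-prefixed hexadecimal addresses.
    return PoolIndexInstruction(opcode, int(src, 0), (int(kh), int(kw)), int(dst, 0))


inst = parse_instruction("maxpool_index 0x1000 2 2 0x2000")
```

Here the first field plays the role of the operation code, and the remaining fields together play the role of the operation domain.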

In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and the number of the control modules and the number of the operation modules may be set according to actual needs, which is not limited in this disclosure. When the device comprises a control module, the control module can receive the most-valued pooling index instruction and control one or more operation modules to perform the most-valued pooling index operation. When the device comprises a plurality of control modules, the control modules can respectively receive the most-valued pooling index instruction and control one or more corresponding operation modules to perform the most-valued pooling index operation.

The most-valued pooling index instruction processing device provided by the embodiments of the disclosure comprises a control module and an operation module. The control module is used for parsing the obtained most-valued pooling index instruction to obtain an operation code and an operation domain of the most-valued pooling index instruction, and obtaining, according to the operation code and the operation domain, the data to be operated, the pooling core, and the target address required for executing the most-valued pooling index instruction. The operation module is used for performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result and storing the operation result into the target address. The device has a wide application range, and both processes the most-valued pooling index instruction and performs the most-valued pooling index operation with high efficiency and high speed.

The most value is the maximum value, the minimum value, the maximum absolute value, or the minimum absolute value. The maximum value is the largest number obtained by a signed comparison of the data to be operated covered by the pooling core; for example, if the data to be operated corresponding to the pooling core comprises the 4 numbers 0, 2, 3, and -4, the maximum value is 3. The minimum value is the smallest number obtained by a signed comparison of the same data; in the above example of 0, 2, 3, and -4, the minimum value is -4. The maximum absolute value is the number whose absolute value is largest after absolute values are taken of the data covered by the pooling core; in the above example of 0, 2, 3, and -4, it is -4. The minimum absolute value is the number whose absolute value is smallest after absolute values are taken; in the above example of 0, 2, 3, and -4, it is 0.
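The four kinds of most value can be checked against the example numbers with a short Python sketch. The function name `most_value` and the mode strings are assumptions of this example; following the text, the absolute-value modes return the original signed element rather than its magnitude.

```python
def most_value(values, mode):
    """Return the element of `values` selected by one of the four modes."""
    if mode == "max":
        return max(values)            # signed comparison
    if mode == "min":
        return min(values)            # signed comparison
    if mode == "maxabs":
        return max(values, key=abs)   # compare magnitudes, keep the signed element
    if mode == "minabs":
        return min(values, key=abs)
    raise ValueError(f"unknown mode: {mode}")


region = [0, 2, 3, -4]
results = {m: most_value(region, m) for m in ("max", "min", "maxabs", "minabs")}
# results == {"max": 3, "min": -4, "maxabs": -4, "minabs": 0}
```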

Figs. 2a and 2b are schematic diagrams illustrating a most-valued pooling index operation according to an embodiment of the disclosure. Performing the most-valued pooling index operation on the data to be processed according to the pooling core means comparing the data values of the data to be processed in the region where the pooling core is located to obtain the most value in that region, and taking the pooling core index corresponding to the position of the most value as the operation result. The pooling core then moves by a default stride or a designated stride along the width direction or the height direction, and the above operations are repeated until all the most-valued pooling index operations are completed. The default stride may equal the height of the pooling core in the height direction and the width of the pooling core in the width direction. The pooling core has a pre-specified indexing scheme, as shown in figs. 2c-2e and described in detail below.

For example, assume that the size of the data to be processed is 9 × 9 and the size of the pooling core is 2 × 2, and that the numbers 0, 1, 2, 3 in the pooling core are the indices of the pooling core, incremented in a row-first manner. First, the pooling core is located at the top left corner of the data to be processed, and the region of the data to be processed corresponding to the pooling core is the data at the top left corner of the data to be processed. The data in this region are compared to obtain a comparison result; according to the comparison result, the most value of the data to be processed in the region is determined to be at the lower left position of the region, and the pooling core index corresponding to that position is taken as the result, namely 2. As shown in fig. 2b, the pooling core is then moved in the width direction by the default stride, i.e., the pooling core width, and the above most-valued pooling index operation is repeated until all the most-valued pooling index operations are completed.
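The sliding-window procedure described above can be sketched in Python for the maximum-value case with row-first indices. The function name `maxpool_index`, the 4 × 4 sample data, the choice to skip windows that would extend past the edge of the data, and the rule that ties resolve to the smallest index are all assumptions of this sketch; the description does not specify them.

```python
def maxpool_index(data, kh, kw, stride_h=None, stride_w=None):
    """For each window, return the row-first pooling core index of the maximum."""
    stride_h = kh if stride_h is None else stride_h   # default stride: core height
    stride_w = kw if stride_w is None else stride_w   # default stride: core width
    rows, cols = len(data), len(data[0])
    result = []
    for top in range(0, rows - kh + 1, stride_h):
        result_row = []
        for left in range(0, cols - kw + 1, stride_w):
            best_idx, best_val = 0, data[top][left]
            for r in range(kh):
                for c in range(kw):
                    v = data[top + r][left + c]
                    if v > best_val:   # strict '>' keeps the first maximum on ties
                        best_idx, best_val = r * kw + c, v
            result_row.append(best_idx)
        result.append(result_row)
    return result


data = [
    [1, 0, 5, 6],
    [9, 2, 7, 8],
    [3, 4, 0, 1],
    [2, 1, 1, 0],
]
indices = maxpool_index(data, 2, 2)
# indices == [[2, 3], [1, 1]]
```

In the top left window the maximum, 9, sits at the lower left position, so the result is 2, matching the worked example. The other three operations would differ only in the comparison used inside the window.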

Figs. 2c-2e illustrate schematic diagrams of indexing approaches for a pooling core according to an embodiment of the present disclosure. The pooling core may be indexed in various ways. Taking a pooling core of size kw × kh as an example: as shown in fig. 2c, the indices may start from 0 and increase sequentially in row-first order, that is, they increase along a row and, after the end of the row is reached, continue increasing from the head of the next row; as shown in fig. 2d, the indices may start from 0 and increase sequentially in column-first order, that is, they increase along a column and, after the end of the column is reached, continue increasing from the head of the next column; alternatively, as shown in fig. 2e, a look-up table may be provided, and the pooling core is indexed by looking up the table.

It should be understood that the index mode of the pooled core in the most-valued pooled index instruction may be set as desired by one skilled in the art, and the disclosure is not limited thereto.
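The three indexing approaches can be compared side by side in a short Python sketch for a pooling core with kh = 2 rows and kw = 3 columns. The function names and the contents of the look-up table are hypothetical and chosen only for illustration.

```python
def row_first_index(r, c, kh, kw):
    """Index grows along a row, then continues at the head of the next row."""
    return r * kw + c


def column_first_index(r, c, kh, kw):
    """Index grows along a column, then continues at the head of the next column."""
    return c * kh + r


def lookup_index(r, c, table):
    """Index is read from a user-supplied table of the same shape as the core."""
    return table[r][c]


kh, kw = 2, 3
row_first = [[row_first_index(r, c, kh, kw) for c in range(kw)] for r in range(kh)]
col_first = [[column_first_index(r, c, kh, kw) for c in range(kw)] for r in range(kh)]
table = [[5, 4, 3], [2, 1, 0]]   # hypothetical look-up table
looked_up = [[lookup_index(r, c, table) for c in range(kw)] for r in range(kh)]
# row_first == [[0, 1, 2], [3, 4, 5]]
# col_first == [[0, 2, 4], [1, 3, 5]]
```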

Fig. 3a illustrates a block diagram of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3a, the operation module 12 may include one or more comparators 120 and one or more operation result determination modules 129. The comparator 120 is configured to perform a comparison operation on the data in each region of the data to be operated corresponding to the pooling core to obtain a comparison result; the operation result determination module 129 is configured to determine the most value of the data in each region of the data to be operated according to the comparison result, and to use the index of the region where the most value is located as the operation result.

In this implementation, the number of comparators may be set according to the data size of the comparison operation to be performed, the processing speed of the comparison operation, the efficiency, and the like, which is not limited by the present disclosure.
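One way to picture the cooperation of comparators and an operation result determination module is a pairwise reduction over one window, sketched below in Python for the maximum-value case. The tournament arrangement, the names `compare` and `determine_result`, and the tie rule are assumptions of this sketch; the disclosure does not state how the comparators are wired.

```python
def compare(a, b):
    """One comparator: a and b are (value, index) pairs; keep the larger value.

    '>=' favours the left operand, so ties resolve to the smaller index.
    """
    return a if a[0] >= b[0] else b


def determine_result(values):
    """Reduce a non-empty flattened window pairwise; return the index of the maximum."""
    level = list(zip(values, range(len(values))))
    while len(level) > 1:
        nxt = [compare(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # an unpaired element passes to the next round
        level = nxt
    return level[0][1]


window = [1, 0, 9, 2]   # a 2 x 2 window flattened in row-first order
index = determine_result(window)
# index == 2
```

With more comparators, each round of the reduction could run its comparisons at the same time, which is one reason the number of comparators might be chosen according to data size and speed requirements.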

Fig. 3b illustrates a block diagram of a most valued pooled index instruction processing apparatus in accordance with various embodiments of the present disclosure. In one possible implementation, as shown in fig. 3b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The main operation sub-module 121 includes one or more comparators and one or more operation result determination modules.

In a possible implementation manner, the main operation sub-module 121 is configured to perform a comparison operation on data to be operated in a region corresponding to the pooled kernel by using a comparator to obtain a comparison result; determining the most value of the data in each area of the data to be operated according to the comparison result by using an operation result determining module, and taking the index of the area where the most value is located as an operation result; and stores the operation result in the target address.

In a possible implementation manner, the control module 11 may be further configured to analyze the obtained calculation instruction to obtain an operation domain and an operation code of the calculation instruction, and obtain data to be calculated, which is required for executing the calculation instruction, according to the operation domain and the operation code. The operation module 12 may also be configured to perform an operation on the data to be operated according to the calculation instruction to obtain a calculation result of the calculation instruction. The operation module may further include a plurality of operators for performing operations corresponding to operation types of the calculation instructions.

In this implementation, the calculation instruction may be other instructions for performing arithmetic operations, logical operations, and the like on data such as scalars, vectors, matrices, tensors, and the like, and those skilled in the art may set the calculation instruction according to actual needs, which is not limited by the present disclosure.

In this implementation, the arithmetic unit may include an adder, a divider, a multiplier, a comparator, and the like, which are capable of performing arithmetic operations, logical operations, and the like on data. The type and number of the arithmetic units may be set according to the requirements of the size of the data amount of the arithmetic operation to be performed, the operation type, the processing speed and efficiency of the arithmetic operation on the data, and the like, which is not limited by the present disclosure.

In one possible implementation, as shown in fig. 3b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The slave operation submodule 122 includes one or more comparators and one or more operation result determination modules.

In a possible implementation manner, the control module 11 is further configured to parse the most-valued pooling index instruction to obtain a plurality of operation instructions, and send the data to be operated and the plurality of operation instructions to the main operation sub-module 121.

The main operation sub-module 121 is configured to receive the data to be operated, the pooling core, and the target address, which are acquired by the control module and are required for executing the most-valued pooling index instruction, and to allocate and transmit them to the slave operation sub-modules.

The slave operation submodule 122 is configured to receive data to be operated, a pooling core and a target address which are distributed and transmitted by the master operation submodule and are required to execute the most valued pooling index instruction, and perform comparison operation on the data to be operated in the area corresponding to the pooling core by using the comparator to obtain a comparison result; determining the most value of the data in each area of the data to be operated according to the comparison result by using an operation result determining module, and taking the index of the area where the most value is located as an operation result; and stores the operation result in the target address.
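The division of labour between the main operation sub-module and the slave operation sub-modules can be sketched in Python as follows. Splitting the data into bands of rows aligned to the pooling core height, the names `slave_pool_index` and `master_dispatch`, and the dictionary standing in for the target address space are all assumptions of this sketch; the loop runs the slaves one after another, whereas hardware slaves would work concurrently.

```python
def slave_pool_index(band, kh, kw):
    """Slave: row-first index of the maximum in each non-overlapping kh x kw window."""
    out = []
    for top in range(0, len(band) - kh + 1, kh):
        row = []
        for left in range(0, len(band[0]) - kw + 1, kw):
            cells = [(band[top + r][left + c], -(r * kw + c))
                     for r in range(kh) for c in range(kw)]
            row.append(-max(cells)[1])   # negated index: ties pick the smallest index
        out.append(row)
    return out


def master_dispatch(data, kh, kw, num_slaves):
    """Master: hand one kernel-aligned band of rows to each slave, then gather."""
    window_rows = len(data) // kh
    per_slave = -(-window_rows // num_slaves)   # ceiling division
    memory = {}                                 # stands in for the target address space
    for s in range(num_slaves):
        first = s * per_slave
        band = data[first * kh:(first + per_slave) * kh]
        for i, row in enumerate(slave_pool_index(band, kh, kw)):
            memory[first + i] = row             # each slave writes at its own offset
    return [memory[i] for i in range(window_rows)]


data = [
    [1, 0, 5, 6],
    [9, 2, 7, 8],
    [3, 4, 0, 1],
    [2, 1, 1, 0],
]
gathered = master_dispatch(data, 2, 2, num_slaves=2)
# gathered == [[2, 3], [1, 1]]
```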

In this implementation, when the calculation instruction is an operation performed on scalar or vector data, the apparatus may control the main operation sub-module to perform an operation corresponding to the calculation instruction by using an operator in the main operation sub-module. When the calculation instruction is to perform an operation on data with a dimension greater than or equal to 2, such as a matrix, a tensor, and the like, the device may control the slave operation submodule to perform an operation corresponding to the calculation instruction by using an operator therein.

It should be noted that, a person skilled in the art may set the connection manner between the master operation submodule and the plurality of slave operation submodules according to actual needs to implement the configuration setting of the operation module, for example, the configuration of the operation module may be an "H" configuration, an array configuration, a tree configuration, and the like, which is not limited in the present disclosure.

Fig. 3c shows a block diagram of a most valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 3c, the operation module 12 may further include one or more branch operation sub-modules 123, where the branch operation sub-module 123 is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. The main operation sub-module 121 is connected to one or more branch operation sub-modules 123. Therefore, the main operation sub-module, the branch operation sub-module and the slave operation sub-module in the operation module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch operation sub-module, so that the resource occupation of the main operation sub-module is saved, and the instruction processing speed is further improved.

Fig. 3d illustrates a block diagram of a most valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 3d, a plurality of slave operation sub-modules 122 are distributed in an array.

Each slave operation sub-module 122 is connected to the other adjacent slave operation sub-modules 122, and the master operation sub-module 121 is connected to k slave operation sub-modules 122 of the plurality of slave operation sub-modules 122, the k slave operation sub-modules 122 being: the n slave operation sub-modules 122 of row 1, the n slave operation sub-modules 122 of row m, and the m slave operation sub-modules 122 of column 1.

As shown in fig. 3d, the k slave operation sub-modules include only the n slave operation sub-modules in row 1, the n slave operation sub-modules in row m, and the m slave operation sub-modules in column 1; that is, the k slave operation sub-modules are those slave operation sub-modules, among the plurality of slave operation sub-modules, that are directly connected to the master operation sub-module. The k slave operation sub-modules are used for forwarding data and instructions between the master operation sub-module and the plurality of slave operation sub-modules. Because the plurality of slave operation sub-modules are distributed in an array, the speed at which the master operation sub-module sends data and/or operation instructions to the slave operation sub-modules can be increased, which in turn increases the instruction processing speed.
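The set of slave operation sub-modules wired directly to the master in an m × n array can be enumerated with a small Python sketch. The function name `directly_connected` and the 1-based (row, column) coordinates are assumptions of this example. Note that the sketch counts distinct positions, so the two corners shared between column 1 and rows 1 and m appear once each; the description does not say how it counts k.

```python
def directly_connected(m, n):
    """Positions (row, col), 1-based, of slaves wired directly to the master."""
    first_row = {(1, c) for c in range(1, n + 1)}
    last_row = {(m, c) for c in range(1, n + 1)}
    first_col = {(r, 1) for r in range(1, m + 1)}
    return first_row | last_row | first_col


k_set = directly_connected(3, 4)
# For a 3 x 4 array this yields 9 distinct positions; interior slaves such as
# (2, 2) are reached only through forwarding by their neighbours.
```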

Fig. 3e illustrates a block diagram of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3e, the operation module may further include a tree sub-module 124. The tree sub-module 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation sub-module 121, and the plurality of branch ports 402 are connected to the plurality of slave operation sub-modules 122, respectively. The tree sub-module 124 has a transceiving function and is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. In this way, the operation module is connected in a tree structure through the tree sub-module, and the forwarding function of the tree sub-module can increase the speed at which the master operation sub-module sends data and/or operation instructions to the slave operation sub-modules, thereby increasing the instruction processing speed.

In one possible implementation, the tree submodule 124 may be an optional structure of the apparatus and may include at least one level of nodes. The nodes are wiring structures with a forwarding function and have no operation function. The lowest-level nodes are connected to the slave operation sub-modules to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.

In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.

For example, fig. 3f illustrates a block diagram of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 3f, the n-ary tree structure may be a binary tree structure, and the tree submodule includes two levels of nodes 01. The lowest-level nodes 01 are connected to the slave operation submodules 122 to forward data and/or operation instructions between the master operation submodule 121 and the slave operation submodules 122.

In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. Those skilled in the art can set the value of n and the number of levels of nodes in the n-ary tree structure as needed, which is not limited by the present disclosure.
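As a non-limiting illustration, the forwarding behaviour of such a tree can be sketched as below. The class and function names are hypothetical, and the sketch assumes a single top-level node attached to the root port, n children per node, and n slave operation submodules under each lowest-level node; the disclosure does not fix these details.

```python
from typing import List, Tuple


class Slave:
    """Hypothetical stand-in for a slave operation submodule 122."""

    def __init__(self) -> None:
        self.received: List[object] = []


class Node:
    """Forwarding-only node: it relays data and performs no operation."""

    def __init__(self, children: list) -> None:
        self.children = children


def build_tree(n: int, levels: int) -> Tuple[Node, List[Slave]]:
    """Build `levels` levels of nodes; each lowest-level node feeds n slaves."""
    if n < 2 or levels < 1:
        raise ValueError("need n >= 2 and at least one level of nodes")
    slaves: List[Slave] = []

    def make(level: int) -> Node:
        if level == levels:  # lowest level: children are slave submodules
            kids = [Slave() for _ in range(n)]
            slaves.extend(kids)
        else:                # inner level: children are further nodes
            kids = [make(level + 1) for _ in range(n)]
        return Node(kids)

    return make(1), slaves


def forward(target, payload) -> None:
    """Relay payload from the root side down to every slave submodule."""
    if isinstance(target, Slave):
        target.received.append(payload)
    else:
        for child in target.children:
            forward(child, payload)
```

Under these assumptions, a binary tree with two levels of nodes reaches four slave operation submodules.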

In one possible implementation, the operation domain may further include an input height and an input width.

The control module is further used for acquiring, from the address of the data to be operated on, the data to be operated on corresponding to the input width and the input height.

In this implementation, the input height and input width may define the amount and size of the data to be operated on that is obtained. The input height and the input width included in the operation domain may be specific numerical values, or may be storage addresses at which the input height and the input width are stored. When specific values of the input height and the input width are directly included in the operation domain, those values are used as the input height and input width, respectively. When the storage addresses of the input height and the input width are included in the operation domain, the input height and the input width may be read from those storage addresses, respectively.

In one possible implementation manner, when the input height and the input width are not included in the operation domain, the data to be operated may be acquired according to a default input height and a default input width which are set in advance.

In this way, the amount and size of the data to be operated on can be limited, which ensures the accuracy of the operation result and ensures that the apparatus can execute the most-valued pooling index instruction.
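As a non-limiting illustration, the three cases described above (a value given directly, a value given by storage address, and a value not given) can be sketched as follows. The tagged-tuple encoding of an operand and the dictionary used as memory are assumptions made for this sketch only.

```python
from typing import Dict, Optional, Tuple

# An operand is ("imm", value), ("addr", address), or None when absent.
Operand = Optional[Tuple[str, int]]


def resolve(operand: Operand, memory: Dict[int, int], default: int) -> int:
    """Return the effective value of one operation-domain field."""
    if operand is None:
        return default            # field not given: use the preset default
    kind, payload = operand
    if kind == "imm":
        return payload            # the field carries the value itself
    if kind == "addr":
        return memory[payload]    # the field carries a storage address
    raise ValueError(f"unknown operand kind: {kind!r}")
```

The same resolution applies to any field that may be given either as a value or as a storage address, such as the input height or the input width.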

In one possible implementation, the operation domain may further include an input channel number.

The control module is further used for acquiring, from the address of the data to be operated on, the data to be operated on corresponding to the number of input channels.

In this implementation, the number of input channels may define the number of channels of the data to be operated on that is obtained. The number of input channels included in the operation domain may be a specific numerical value, or may be a storage address at which the number of input channels is stored. When the specific value of the number of input channels is directly included in the operation domain, that value is used as the number of input channels. When the storage address of the number of input channels is included in the operation domain, the number of input channels can be read from that storage address.

In a possible implementation manner, when the number of input channels is not included in the operation domain, the data to be operated may be acquired according to a preset default number of input channels.

In this way, the number of input channels of the data to be operated on can be limited, which ensures the accuracy of the operation result and ensures that the apparatus can execute the most-valued pooling index instruction.
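As a non-limiting illustration, acquiring data of a given number of input channels, input height, and input width from an address can be sketched as below. The flat word-addressed memory and the channel-major layout (channel, then row, then column) are assumptions of this sketch; the disclosure does not specify the storage layout.

```python
from typing import List, Sequence


def fetch_operand(memory: Sequence[float], address: int,
                  channels: int, height: int, width: int) -> List[List[List[float]]]:
    """Read channels * height * width values starting at `address`.

    Assumes a flat, word-addressed memory and channel-major layout.
    """
    count = channels * height * width
    if address < 0 or address + count > len(memory):
        raise IndexError("operand extends past the end of memory")
    flat = memory[address:address + count]
    # Slice the flat run into channels, then rows of `width` values.
    return [
        [list(flat[(c * height + h) * width:(c * height + h + 1) * width])
         for h in range(height)]
        for c in range(channels)
    ]
```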

In one possible implementation, the operation domain may further include a pooling kernel height and a pooling kernel width.

The operation module 12 is further configured to perform the most-valued pooling index operation on the data to be operated on according to the pooling kernel height and the pooling kernel width.

In one possible implementation, the operation domain may further include a first stride. The operation module 12 may be further configured to move the pooling kernel in the width direction according to the first stride.

In one possible implementation, the operation domain may further include a second stride. The operation module 12 may be further configured to move the pooling kernel in the height direction according to the second stride.

In this implementation, the stride of the most-valued pooling index operation is the distance by which the pooling kernel moves each time during the operation. The first stride may be the distance by which the pooling kernel moves each time in the width direction, and the second stride may be the distance by which the pooling kernel moves each time in the height direction. Here, the height direction is the direction of the input height and of the pooling kernel height, and the width direction is the direction of the input width and of the pooling kernel width.

In the present disclosure, the parameters required for the most-valued pooling index operation, such as the pooling kernel height, the pooling kernel width, the first stride, and the second stride, are described only by taking a two-dimensional pooling kernel as an example.

In a possible implementation manner, when the first stride and the second stride are not given in the operation domain of the most-valued pooling index instruction, the operation module may use the pooling kernel width and the pooling kernel height as the strides of the corresponding dimensions, respectively, so that the most-valued pooling index operation can proceed normally. For example, the operation module 12 may be further configured to move the pooling kernel over the data to be operated on in a non-overlapping manner, and compare the plurality of data to be operated on in the area corresponding to the pooling kernel to obtain an operation result.
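As a non-limiting illustration, a single-channel most-valued pooling index operation with stride defaults can be sketched as below. The function name, the four mode strings, and the choice to report each result as a row-major index into the input are assumptions of this sketch; the disclosure does not fix the index convention. Only full pooling windows are evaluated, and ties resolve to the first position encountered.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Comparison keys for the four variants: maximum, minimum,
# maximum absolute value, and minimum absolute value.
_KEYS: Dict[str, Callable[[float], float]] = {
    "max": lambda v: v,
    "min": lambda v: -v,
    "maxabs": lambda v: abs(v),
    "minabs": lambda v: -abs(v),
}


def pool_index(data: Sequence[Sequence[float]], kernel_h: int, kernel_w: int,
               stride_x: Optional[int] = None, stride_y: Optional[int] = None,
               mode: str = "max") -> List[List[int]]:
    """Return, per window, the row-major index into `data` of the extreme value."""
    # Strides not given: fall back to the kernel size (non-overlapping windows).
    stride_x = kernel_w if stride_x is None else stride_x
    stride_y = kernel_h if stride_y is None else stride_y
    key = _KEYS[mode]
    height, width = len(data), len(data[0])
    result = []
    for top in range(0, height - kernel_h + 1, stride_y):
        row = []
        for left in range(0, width - kernel_w + 1, stride_x):
            cells = [(r, c) for r in range(top, top + kernel_h)
                     for c in range(left, left + kernel_w)]
            r, c = max(cells, key=lambda rc: key(data[rc[0]][rc[1]]))
            row.append(r * width + c)
        result.append(row)
    return result
```

Here `stride_x` plays the role of the first stride (width direction) and `stride_y` the role of the second stride (height direction).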

In one possible implementation, when the pooling kernel height and the pooling kernel width are not included in the operation domain, a preset default pooling kernel height and a preset default pooling kernel width may be used, so that the control module and the operation module can execute the most-valued pooling index instruction. For example, if the default pooling kernel height is 3 and the default pooling kernel width is 2, the operation module 12 may be further configured to move the pooling kernel over the data to be operated on by 2 units at a time in the width direction or by 3 units at a time in the height direction, and compare the plurality of data to be operated on in the area corresponding to the pooling kernel to obtain the operation result.

In a possible implementation manner, the operation module 12 is further configured to, when the size of the data to be operated on is a non-integer multiple of the size of the pooling kernel, perform the most-valued pooling index operation on the portion of the data to be operated on whose size is an integer multiple of the size of the pooling kernel. The size of the data to be operated on being a non-integer multiple of the size of the pooling kernel may include at least one of the following: the input width of the data to be operated on is a non-integer multiple of the pooling kernel width, and the input height of the data to be operated on is a non-integer multiple of the pooling kernel height. For example, the input width and input height are 5 and 4, respectively, the pooling kernel width and height are both 2, and the first stride and the second stride are both 2. In this example, all quantities use the same unit, which may be a byte, a pixel, or the like, and is not limited. Since the pooling kernel width equals the first stride, and the input width of 5 is not an integer multiple of the pooling kernel width, the most-valued pooling index operation is performed on the portion of the data whose size is an integer multiple of the pooling kernel size, that is, on the first 4 data of each row in the width direction.
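As a non-limiting illustration, the extent covered by full pooling windows along one dimension can be computed as below; the function name is hypothetical. For the example above (input width 5, pooling kernel width 2, first stride 2) it yields 4.

```python
def covered_extent(input_size: int, kernel_size: int, stride: int) -> int:
    """Length along one axis that is covered by full pooling windows."""
    if input_size < kernel_size:
        return 0
    # Number of window positions that fit entirely inside the input.
    windows = (input_size - kernel_size) // stride + 1
    # The last full window starts at (windows - 1) * stride.
    return (windows - 1) * stride + kernel_size
```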

In this implementation, the remaining data, that is, the part of the data to be operated on that lies beyond the integer multiple of the pooling kernel size and is smaller than the pooling kernel, may be processed in various ways.

As shown in fig. 4a, the most-valued pooling index operation may be skipped for the remaining data whose size is smaller than the size of the pooling kernel. That is, in the above example, the last data of each row in the width direction may be left unprocessed.

As shown in fig. 4a, the most-valued pooling index operation may instead be performed directly on the remaining data, even though its size is smaller than the size of the pooling kernel. That is, in the above example, although only the last data of each row remains in the width direction and is narrower than the pooling kernel, the most-valued pooling index operation is performed on that last data to obtain the operation result.

As shown in fig. 4b, the remaining data whose size is smaller than the size of the pooling kernel may first be padded, and the most-valued pooling index operation is then performed to obtain the operation result. That is, in the above example, padding may be applied in the width direction by appending 1 default value, for example 0, to each row. The padded input width is then 6, which is an integer multiple of the pooling kernel width, and the most-valued pooling index operation is performed to obtain the operation result.

As shown in fig. 4c, for the remaining data whose size is smaller than the size of the pooling kernel, the pooling kernel may be moved in the reverse direction, so that the area covered by the pooling kernel after the reverse move is equal in size to the pooling kernel and contains the remaining data; the most-valued pooling index operation is then performed on the data in that area to obtain the operation result. That is, in the above example, when the pooling kernel covers the 1st and 2nd data, or the 3rd and 4th data, of a row, the covered area is a full pooling kernel and the pooling operation can be performed. When the pooling kernel reaches the 5th data, less than a full pooling kernel remains, so the pooling kernel is moved back by one position to cover the 4th and 5th data, which is a full pooling kernel, and the pooling operation is performed to obtain the operation result.
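As a non-limiting illustration, the four ways of handling the remaining data can be compared by listing the span that the pooling kernel covers at each position along one dimension. The function name and the strategy strings are hypothetical; spans are 0-based and half-open, so the 4th and 5th data of a row correspond to the span (3, 5).

```python
from typing import List, Tuple


def plan_windows(size: int, kernel: int, stride: int,
                 strategy: str) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) spans of the pooling kernel on one axis.

    strategy: "skip"   - ignore the remaining data
              "direct" - pool the shorter remaining span as it is
              "pad"    - extend the remaining span to a full kernel (caller pads)
              "shift"  - move the kernel back so it ends at the last element
    """
    spans = [(s, s + kernel) for s in range(0, size - kernel + 1, stride)]
    next_start = spans[-1][0] + stride if spans else 0
    if next_start >= size or strategy == "skip":
        return spans                      # nothing remains, or it is ignored
    if strategy == "direct":
        spans.append((next_start, size))
    elif strategy == "pad":
        spans.append((next_start, next_start + kernel))
    elif strategy == "shift":
        spans.append((max(size - kernel, 0), size))
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return spans
```

With the "pad" strategy the last span extends past the input, so the caller is expected to supply the default value (for example 0) for the positions beyond the input width.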

In one possible implementation, as shown in figs. 3a-3f, the apparatus may further include a storage module 13. The storage module 13 is used for storing the data to be operated on.

In this implementation, the storage module may include one or more of a cache and a register, and the cache may include an NRAM (Neuron Random Access Memory). The cache may be used to store data to be operated on and operation results, and the register may be used to store data to be operated on, scalar data, parameters, and the like.

In one possible implementation, the cache may include a neuron cache. The neuron buffer, i.e., the neuron random access memory, may be configured to store neuron data in data to be operated on, where the neuron data may include neuron vector data.

In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.

In one possible implementation, as shown in figs. 3a-3f, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.

The instruction storage submodule 111 is used for storing the most valued pooling index instruction.

The instruction processing sub-module 112 is configured to parse the most valued pooling index instruction to obtain an operation code and an operation domain of the most valued pooling index instruction.

The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes multiple instructions to be executed that are sequentially arranged according to an execution order, and the multiple instructions to be executed may include a most-valued pooling index instruction.

In this implementation manner, the execution order of the multiple instructions to be executed may be arranged according to the receiving time, the priority level, and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed are sequentially executed according to the instruction queue.
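As a non-limiting illustration, an instruction queue ordered by priority level and receiving time can be sketched as below. The record fields, the convention that a larger number means a higher priority, and the rule that priority is compared before receiving time are assumptions of this sketch.

```python
from typing import List, NamedTuple


class Pending(NamedTuple):
    """Hypothetical record for one instruction awaiting execution."""
    text: str
    received_at: int   # arrival time stamp; smaller means earlier
    priority: int      # larger means more urgent (an assumption)


def build_queue(pending: List[Pending]) -> List[Pending]:
    """Order by priority first, then by receiving time."""
    return sorted(pending, key=lambda p: (-p.priority, p.received_at))
```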

In one possible implementation, as shown in figs. 3a-3f, the control module 11 may further include a dependency processing sub-module 114.

The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the operation module 12.

The first to-be-executed instruction is associated with the zeroth to-be-executed instruction that precedes it when the first storage address interval, which stores the data required by the first to-be-executed instruction, and the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction, have an overlapping area. Conversely, the first to-be-executed instruction is not associated with the zeroth to-be-executed instruction when the first storage address interval and the zeroth storage address interval have no overlapping area.

In this way, according to the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction that precedes it, the first to-be-executed instruction is executed only after the zeroth to-be-executed instruction has finished, which ensures the accuracy of the operation result.
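As a non-limiting illustration, the association test and the resulting issue condition can be sketched as below. Representing each storage address interval as a half-open pair (start, end) is an assumption of this sketch, and the function names are hypothetical.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end) storage address interval


def is_associated(first: Interval, zeroth: Interval) -> bool:
    """True when the two storage address intervals share at least one address."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def can_issue(first: Interval, zeroth: Interval, zeroth_done: bool) -> bool:
    """The first instruction waits only while an associated zeroth one is unfinished."""
    return zeroth_done or not is_associated(first, zeroth)
```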

In one possible implementation, the instruction format of the most valued pooling index instruction may be:

maxabspool_index dst src0 srcChannel srcHeight srcWidth

where maxabspool_index is the operation code of the maximum absolute value pooling index instruction, and dst, src0, srcChannel, srcHeight and srcWidth are the operation domains of the maximum absolute value pooling index instruction. Here dst is the target address, src0 is the address of the data to be operated on, srcChannel is the number of input channels, srcHeight is the input height, and srcWidth is the input width. That is, the data to be operated on is obtained from src0 with srcChannel input channels, an input height of srcHeight, and an input width of srcWidth. The pooling kernel size takes a default value. The operation result of the maximum absolute value pooling index operation is stored at the address dst.

minpool_index dst src0 srcChannel srcHeight srcWidth kernelHeight kernelWidth

where minpool_index is the operation code of the minimum value pooling index instruction, and dst, src0, srcChannel, srcHeight, srcWidth, kernelHeight and kernelWidth are the operation domains of the minimum value pooling index instruction. Here dst is the target address, src0 is the address of the data to be operated on, srcChannel is the number of input channels, srcHeight is the input height, srcWidth is the input width, kernelHeight is the pooling kernel height, and kernelWidth is the pooling kernel width. That is, the data to be operated on is obtained from src0 with srcChannel input channels, an input height of srcHeight, and an input width of srcWidth. The pooling kernel has a height of kernelHeight and a width of kernelWidth. The stride of each movement of the pooling kernel takes a default value; for example, the stride of each movement in the width direction is kernelWidth, and the stride of each movement in the height direction is kernelHeight. The operation result of the minimum value pooling index operation is stored at the address dst.

maxpool_index dst src0 srcChannel srcHeight srcWidth kernelHeight kernelWidth strideX strideY

where maxpool_index is the operation code of the maximum value pooling index instruction, and dst, src0, srcChannel, srcHeight, srcWidth, kernelHeight, kernelWidth, strideX and strideY are the operation domains of the maximum value pooling index instruction. Here dst is the target address, src0 is the address of the data to be operated on, srcChannel is the number of input channels, srcHeight is the input height, srcWidth is the input width, kernelHeight is the pooling kernel height, kernelWidth is the pooling kernel width, strideX is the first stride, by which the pooling kernel moves in the width direction, and strideY is the second stride, by which the pooling kernel moves in the height direction. That is, the data to be operated on is obtained from src0 with srcChannel input channels, an input height of srcHeight, and an input width of srcWidth. The pooling kernel has a height of kernelHeight and a width of kernelWidth. The stride of each movement of the pooling kernel is strideX in the width direction and strideY in the height direction. The operation result of the maximum value pooling index operation is stored at the address dst.

minabspool_index dst src0 srcChannel srcHeight srcWidth kernelHeight kernelWidth strideX strideY

where minabspool_index is the operation code of the minimum absolute value pooling index instruction, and dst, src0, srcChannel, srcHeight, srcWidth, kernelHeight, kernelWidth, strideX and strideY are the operation domains of the minimum absolute value pooling index instruction. Here dst is the target address, src0 is the address of the data to be operated on, srcChannel is the number of input channels, srcHeight is the input height, srcWidth is the input width, kernelHeight is the pooling kernel height, kernelWidth is the pooling kernel width, strideX is the first stride, by which the pooling kernel moves in the width direction, and strideY is the second stride, by which the pooling kernel moves in the height direction. That is, the data to be operated on is obtained from src0 with srcChannel input channels, an input height of srcHeight, and an input width of srcWidth. The pooling kernel has a height of kernelHeight and a width of kernelWidth. The stride of each movement of the pooling kernel is strideX in the width direction and strideY in the height direction. The operation result of the minimum absolute value pooling index operation is stored at the address dst.

It should be understood that those skilled in the art may set the operation code of the most-valued pooling index instruction, and the positions of the operation code and the operation domain in the instruction format, as needed, which is not limited by the present disclosure.
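As a non-limiting illustration, splitting an instruction in the formats above into its operation code and operation domain can be sketched as below. The whitespace-separated textual encoding, decimal operands, and the rule that only the trailing kernel and stride fields may be omitted are assumptions of this sketch; the actual instruction encoding is not specified here.

```python
from typing import Dict, Union

# Operation-domain fields in positional order.
FIELDS = ("dst", "src0", "srcChannel", "srcHeight", "srcWidth",
          "kernelHeight", "kernelWidth", "strideX", "strideY")
OPCODES = {"maxpool_index", "minpool_index",
           "maxabspool_index", "minabspool_index"}


def parse(text: str) -> Dict[str, Union[str, int]]:
    """Split an instruction into its opcode and named operation-domain fields."""
    tokens = text.split()
    if not tokens or tokens[0] not in OPCODES:
        raise ValueError("not a most-valued pooling index instruction")
    operands = tokens[1:]
    # 5 fields: sizes only; 7: plus kernel size; 9: plus both strides.
    if len(operands) not in (5, 7, 9):
        raise ValueError("unexpected number of operation-domain fields")
    parsed: Dict[str, Union[str, int]] = {"opcode": tokens[0]}
    parsed.update(zip(FIELDS, map(int, operands)))
    return parsed
```

Fields that are absent from the parsed result would then take their preset default values.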

In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).

It should be noted that, although the most-valued pooling index instruction processing apparatus has been described above by taking the above-described embodiment as an example, those skilled in the art will understand that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.

Application example

An application example according to an embodiment of the present disclosure is given below, taking "performing a most-valued pooling index operation with a most-valued pooling index instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the most-valued pooling index instruction processing apparatus. It is understood by those skilled in the art that the following application example is only for the purpose of facilitating understanding of the embodiments of the present disclosure and should not be construed as limiting the embodiments of the present disclosure.

Fig. 5 is a schematic diagram illustrating an application scenario of a most-valued pooling index instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the procedure by which the most-valued pooling index instruction processing apparatus processes the most-valued pooling index instruction is as follows:

The control module 11 parses the obtained most-valued pooling index instruction 1 (for example, the minimum pooling index instruction 1 is minpool_index 500 100 5 64 32 2 1 2 1), and obtains the operation code and the operation domain of the minimum pooling index instruction 1. The operation code of the minimum pooling index instruction 1 is minpool_index, the target address is 500, the address of the data to be operated on is 100, the number of input channels is 5, the input height is 64, the input width is 32, the pooling kernel height is 2, the pooling kernel width is 1, the first stride is 2, and the second stride is 1. The control module 11 obtains 64 × 32 × 5 data to be operated on from the address 100 of the data to be operated on.

The operation module 12 performs minimum pooling index operation on the data to be operated with the size of 64 × 32 on 5 input channels by using pooling cores, respectively, to obtain an operation result, and stores the operation result in the target address 500.
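As a non-limiting illustration, the amount of data read and the number of index results in this example can be estimated as below. The sketch assumes that only full pooling windows are evaluated, with no padding, and that one index is produced per window and per input channel; the disclosure does not state the size of the result.

```python
def output_extent(input_size: int, kernel_size: int, stride: int) -> int:
    """Number of full pooling windows along one axis (no padding assumed)."""
    return (input_size - kernel_size) // stride + 1


channels, height, width = 5, 64, 32
kernel_h, kernel_w, stride_x, stride_y = 2, 1, 2, 1

operand_count = channels * height * width          # values read from address 100
out_h = output_extent(height, kernel_h, stride_y)  # height uses the second stride
out_w = output_extent(width, kernel_w, stride_x)   # width uses the first stride
index_count = channels * out_h * out_w             # indices written to address 500
```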

The working process of the above modules can refer to the above related description.

Therefore, the most-valued pooling index instruction can be efficiently and quickly processed, and the efficiency and the speed of the operation of the most-valued pooling index are obviously improved.

The present disclosure provides a machine learning arithmetic device, which may include one or more of the above-mentioned most-valued pooling index instruction processing devices, and is configured to acquire data to be operated on and control information from other processing devices and to execute a designated machine learning operation. The machine learning arithmetic device may obtain the most-valued pooling index instruction from another machine learning arithmetic device or a non-machine-learning arithmetic device, and transmit the execution result to a peripheral device (which may also be referred to as another processing device) through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, wifi interfaces, and servers. When more than one most-valued pooling index instruction processing device is included, the most-valued pooling index instruction processing devices can be linked and transmit data through a specific structure, for example, interconnected through a PCIE bus, so as to support larger-scale neural network operations. In this case, the devices may share one control system or have independent control systems, and may share memory or have a separate memory for each accelerator. In addition, the interconnection mode can be any interconnection topology.

The machine learning arithmetic device has higher compatibility and can be connected with various types of servers through PCIE interfaces.

Fig. 6a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 6a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface, and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.

Other processing devices include one or more types of general-purpose/special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and perform basic control of the machine learning arithmetic device such as starting and stopping; the other processing devices can also cooperate with the machine learning arithmetic device to complete computing tasks.

The universal interconnection interface is used for transmitting data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes the input data into a storage device on the chip of the machine learning arithmetic device; it can obtain control instructions from the other processing devices and write them into a control cache on the chip of the machine learning arithmetic device; it can also read the data in its storage module and transmit the data to the other processing devices.

Fig. 6b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 6b, the combined processing device may further include a storage device, and the storage device is connected to the machine learning arithmetic device and the other processing devices, respectively. The storage device is used for storing data of the machine learning arithmetic device and the other processing devices, and is particularly suitable for data to be operated on that cannot be held entirely in the internal storage of the machine learning arithmetic device or the other processing devices.

The combined processing device can be used as the SOC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles and video monitoring equipment, which effectively reduces the core area of the control part, increases the processing speed, and reduces the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a wifi interface.

The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.

The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.

Fig. 7 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 7, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other supporting components, including but not limited to: a memory device 390, an interface device 391 and a control device 392.

The memory device 390 is coupled to a machine learning chip 389 (or a machine learning chip within a machine learning chip package structure) via a bus for storing data. Memory device 390 may include multiple sets of memory cells 393. Each group 393 of memory cells is connected to the machine learning chip 389 through a bus. It is understood that each group 393 may be a DDR SDRAM (Double Data Rate SDRAM).

DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on both the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.

In one embodiment, memory device 390 may include 4 groups of memory cells 393. Each group of memory cells 393 may include a plurality of DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, where in each 72-bit DDR4 controller 64 bits are used for data transmission and 8 bits are used for ECC check. It is understood that when DDR4-3200 chips are used in each group of memory cells 393, the theoretical bandwidth of data transfer can reach 25600 MB/s.
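As a non-limiting illustration, the 25600 MB/s figure follows from the DDR4-3200 transfer rate and the 64 data bits of one controller, as the following check shows. The function name is hypothetical; the 1600 MHz clock is the I/O clock that corresponds to 3200 million transfers per second when data is transferred on both clock edges, and the 8 ECC bits are excluded because they carry no payload.

```python
def ddr_bandwidth_mb_s(io_clock_mhz: float, data_bits: int) -> float:
    """Theoretical peak bandwidth in MB/s of one double-data-rate channel."""
    transfers_per_us = io_clock_mhz * 2     # one transfer on each clock edge
    return transfers_per_us * data_bits / 8  # bits per transfer -> bytes


per_controller = ddr_bandwidth_mb_s(1600, 64)  # DDR4-3200, 64 data bits
```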

In one embodiment, each group of memory cells 393 comprises a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the machine learning chip 389 for controlling the data transfer and data storage of each memory cell 393.

The interface device 391 is electrically connected to the machine learning chip 389 (or to a machine learning chip within a machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIe interface: the data to be processed is transmitted by the server to the machine learning chip 389 through the standard PCIe interface, so as to implement data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device 391 may also be another interface; the disclosure does not limit the specific form of the other interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip 389 is transmitted back to the external device (e.g., a server) by the interface device 391.

The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be electrically connected through an SPI interface. The control device 392 may include a microcontroller unit (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may therefore drive multiple loads. Accordingly, the machine learning chip 389 can be in different operating states such as heavy load and light load. The control device 392 can regulate the operating states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the machine learning chip 389.

The present disclosure provides an electronic device, which includes the above machine learning chip or board card.

The electronic device may include a data processing apparatus, a computer device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.

The vehicle may include an aircraft, a ship, and/or a motor vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance instrument, a B-mode ultrasound instrument, and/or an electrocardiograph.

FIG. 8 illustrates a flow diagram of a most-valued pooling index instruction processing method according to an embodiment of the present disclosure. The method can be applied to, for example, computer equipment comprising a memory and a processor, wherein the memory is used for storing data used in the course of executing the method, and the processor is used for executing the relevant processing and operation steps, such as steps S51 and S52. As shown in fig. 8, the method is applied to the above-described most-valued pooling index instruction processing apparatus, and includes step S51 and step S52.

In step S51, the obtained most-valued pooling index instruction is parsed to obtain an operation code and an operation domain of the most-valued pooling index instruction, and the data to be operated, the pooling core, and the target address required for executing the most-valued pooling index instruction are obtained according to the operation domain.

In step S52, a most-valued pooling index operation is performed on the data to be operated according to the pooling core to obtain an operation result, and the operation result is stored in the target address, where the operation result includes, for each region of the data to be operated that corresponds to the pooling core, the index of the area where the most value of that region's data is located.

In a possible implementation manner, performing a most-valued pooling index operation on data to be operated according to a pooling core to obtain an operation result may include:

comparing the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result;

and determining the most value of the data in each area of the data to be operated according to the comparison result, and taking the index of the area where the most value is located as the operation result.

In one possible implementation, the most value is the maximum value, the minimum value, the maximum absolute value, or the minimum absolute value.
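The comparison and index selection described above can be sketched in software as follows. This is only an illustration under stated assumptions, not the hardware behaviour of the disclosed apparatus: the pooling core is assumed to move without overlap, the index is assumed to be the position of the most value inside its region counted row by row, and the names `SCORES` and `most_valued_pooling_index` are hypothetical.

```python
# Hypothetical scoring functions: the entry with the highest score is the "most value".
SCORES = {
    "max": lambda v: v,
    "min": lambda v: -v,
    "abs_max": lambda v: abs(v),
    "abs_min": lambda v: -abs(v),
}


def most_valued_pooling_index(data, kernel_h, kernel_w, mode="max"):
    """For each non-overlapping kernel_h x kernel_w region, return the
    row-by-row index (inside the region) of its most value."""
    score = SCORES[mode]
    result = []
    for top in range(0, len(data) - kernel_h + 1, kernel_h):
        out_row = []
        for left in range(0, len(data[0]) - kernel_w + 1, kernel_w):
            # Flatten the region row by row so that list positions are the indices.
            region = [data[top + r][left + c]
                      for r in range(kernel_h) for c in range(kernel_w)]
            # Compare all entries; ties resolve to the lowest index.
            out_row.append(max(range(len(region)), key=lambda i: score(region[i])))
        result.append(out_row)
    return result


data = [[1, 5, -7, 2],
        [3, 4, 0, 6]]
for mode in SCORES:
    print(mode, most_valued_pooling_index(data, 2, 2, mode))
```

Expressing all four kinds of most value through one scoring function keeps a single comparison loop; a column-by-column index or a lookup table would only change how `region` is flattened.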

In one possible implementation, the operation module comprises a main operation submodule and a plurality of slave operation submodules, the main operation submodule comprising a comparator and an operation result determining module,

wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result, and storing the operation result into the target address, includes:

performing, by the comparator, a comparison operation on the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result; determining, by the operation result determining module, the most value of the data in each area of the data to be operated according to the comparison result; taking the index of the area where the most value is located as the operation result; and storing the operation result into the target address.

In one possible implementation, the operation module comprises a main operation submodule and a plurality of slave operation submodules, each slave operation submodule comprising a comparator and an operation result determining module,

wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result includes:

performing, by the plurality of comparators, a comparison operation on a plurality of data to be operated in the area corresponding to the pooling core to obtain an operation result, and storing the operation result into the target address;

receiving, by the main operation submodule, the data to be operated, the pooling core and the target address required for executing the most-valued pooling index instruction, and distributing and transmitting to each slave operation submodule the data to be operated, the pooling core and the target address that it requires;

receiving, by the slave operation submodule, the data to be operated, the pooling core and the target address that are distributed and transmitted by the main operation submodule and required for executing the most-valued pooling index instruction; performing, using one or more comparators, a comparison operation on the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result; determining, using the operation result determining module, the most value of the data in each area of the data to be operated according to the comparison result; taking the index of the area where the most value is located as the operation result; and storing the operation result into the target address.
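The division of work between the main operation submodule and the slave operation submodules can be illustrated with a small sequential sketch. It rests on assumptions not stated in the text: regions are dealt out round-robin, the most value is the maximum, and a plain list stands in for the target address. The names `argmost`, `distribute` and `run` are hypothetical.

```python
def argmost(region):
    """Slave-side work: index of the maximum entry (lowest index wins ties)."""
    return max(range(len(region)), key=lambda i: region[i])


def distribute(regions, num_slaves):
    """Main-side work: deal regions to slaves round-robin, keeping each
    region's output position so results land in the right slot."""
    batches = [[] for _ in range(num_slaves)]
    for pos, region in enumerate(regions):
        batches[pos % num_slaves].append((pos, region))
    return batches


def run(regions, num_slaves):
    results = [None] * len(regions)  # stands in for the target address
    for batch in distribute(regions, num_slaves):  # one batch per slave
        for pos, region in batch:
            results[pos] = argmost(region)
    return results


print(run([[1, 5, 3, 4], [-7, 2, 0, 6], [9, 8, 7, 6]], num_slaves=2))
```

In hardware the batches would be processed in parallel; the loop here runs them one after another only to keep the sketch deterministic.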

In one possible implementation, the operation domain may further include an input height and an input width. Obtaining, according to the operation code and the operation domain, the data to be operated, the pooling core, and the target address required for executing the most-valued pooling index instruction may include:

and acquiring the data to be operated corresponding to the input width and the input height from the data address to be operated.

In one possible implementation, the operation domain may further include an input channel number. The obtaining, according to the operation code and the operation domain, the data to be operated, the pooling core, and the target address required for executing the most valued pooling index instruction may include:

and acquiring the data to be operated corresponding to the number of the input channels from the data address to be operated.

In one possible implementation, the operation domain may further include a first stride. Performing the most-valued pooling index operation on the data to be operated according to the pooling core may include: moving the pooling core in the width direction according to the first stride.

In one possible implementation, the operation domain may further include a second stride. Performing the most-valued pooling index operation on the data to be operated according to the pooling core may include: moving the pooling core in the height direction according to the second stride.
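How the two strides determine the positions visited by the pooling core can be sketched as follows. This is a minimal illustration that assumes only full windows are visited and that positions are scanned row by row; `window_origins` and its parameter names are hypothetical.

```python
def window_origins(input_h, input_w, kernel_h, kernel_w, stride_w, stride_h):
    """List the (top, left) corner of every full pooling-core window.

    stride_w plays the role of the first stride (width direction) and
    stride_h the role of the second stride (height direction).
    """
    tops = range(0, input_h - kernel_h + 1, stride_h)
    lefts = range(0, input_w - kernel_w + 1, stride_w)
    return [(top, left) for top in tops for left in lefts]


# A 2x2 core on a 4x5 input, moving 1 column and 2 rows at a time.
print(window_origins(4, 5, 2, 2, stride_w=1, stride_h=2))
```

When each stride equals the corresponding core dimension, the windows tile the input without overlapping.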

and moving the pooling cores in a non-overlapping manner on the data to be operated, and comparing a plurality of data to be operated in the area corresponding to the pooling cores to obtain an operation result.

In a possible implementation manner, performing a most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result may include: when the size of the data to be operated is a non-integral multiple of the size of the pooling core, performing the most-valued pooling index operation on the portion of the data to be operated whose size is an integral multiple of the size of the pooling core.

The size of the data to be operated being a non-integral multiple of the size of the pooling core includes at least one of the following: when the first stride is not included or the first stride is equal to the width of the pooling core, the input width of the data to be operated is a non-integral multiple of the width of the pooling core; when the first stride is included, the difference between the input width of the data to be operated and the width of the pooling core is a non-integral multiple of the first stride; when the second stride is not included or the second stride is equal to the height of the pooling core, the input height of the data to be operated is a non-integral multiple of the height of the pooling core; when the second stride is included, the difference between the input height of the data to be operated and the height of the pooling core is a non-integral multiple of the second stride.
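The four conditions above reduce to one test applied once per direction. The sketch below transcribes them directly; a stride of `None` stands for "the stride is not included", and the function names are hypothetical.

```python
from typing import Optional


def leaves_remainder(input_size: int, kernel_size: int,
                     stride: Optional[int] = None) -> bool:
    """True if the pooling core cannot cover this dimension exactly."""
    if stride is None or stride == kernel_size:
        # No stride given, or stride equal to the core size: plain divisibility.
        return input_size % kernel_size != 0
    # Stride given: the span after the first window must be a multiple of it.
    return (input_size - kernel_size) % stride != 0


def is_non_integral_multiple(input_h, input_w, kernel_h, kernel_w,
                             stride_w=None, stride_h=None) -> bool:
    """At least one of the width or height conditions holds."""
    return (leaves_remainder(input_w, kernel_w, stride_w)
            or leaves_remainder(input_h, kernel_h, stride_h))


print(is_non_integral_multiple(4, 5, 2, 2))  # width 5 is not a multiple of 2
```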

In a possible implementation manner, performing a most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result may further include: when the size of the remaining data in the data to be operated is smaller than the size of the pooling core, not performing the most-valued pooling index operation on the remaining data.

In a possible implementation manner, performing a most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result may further include: when the size of the remaining data in the data to be operated is smaller than the size of the pooling core, performing the most-valued pooling index operation on the remaining data directly, or performing the most-valued pooling index operation after padding the remaining data, to obtain an operation result.

In a possible implementation manner, performing a most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result may further include: when the size of the remaining data in the data to be operated is smaller than the size of the pooling core, moving the position of the pooling core in the reverse direction, so that the size of the data in the area corresponding to the pooling core after the reverse movement is equal to the size of the pooling core and the data in that area includes the remaining data, and performing the most-valued pooling index operation on the data in the area corresponding to the moved pooling core to obtain an operation result.
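The reverse-movement option can be illustrated in one dimension. The sketch below assumes that "moving in the reverse direction" means placing one extra window flush with the end of the data, so that it has the full core size and contains the remaining data; `origins_with_reverse_move` is a hypothetical name.

```python
def origins_with_reverse_move(input_size, kernel_size, stride):
    """Window start positions along one dimension, plus one backed-up window
    if the regular windows leave data uncovered at the end."""
    origins = list(range(0, input_size - kernel_size + 1, stride))
    covered = origins[-1] + kernel_size if origins else 0
    if covered < input_size and input_size >= kernel_size:
        # Back the core up so that it ends exactly at the last element.
        origins.append(input_size - kernel_size)
    return origins


# Width 5, core width 2, stride 2: windows at 0 and 2 leave element 4 uncovered,
# so one more window is placed at 3 and overlaps the previous one.
print(origins_with_reverse_move(5, 2, 2))
```

The extra window re-reads some data already covered by the previous window, which is the cost of producing a result for the remaining data without padding.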

In one possible implementation, the method may further include: storing the data to be operated and the operation result by using a storage module of the apparatus. The storage module may include at least one of a register and a cache, where the cache is used to store the data to be operated and the operation result, and the cache may include at least one neuron cache (NRAM); the register is used for storing scalar data in the data to be operated; the neuron cache is used for storing neuron data in the data to be operated, and the neuron data may comprise neuron vector data.

In a possible implementation manner, analyzing the obtained most valued pooling index instruction to obtain an operation code and an operation domain of the most valued pooling index instruction may include:

storing a most valued pooling index instruction;

analyzing the most valued pooling index instruction to obtain an operation code and an operation domain of the most valued pooling index instruction;

storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include the most-valued pooling index instruction.

In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and executing the first to-be-executed instruction after the zeroth to-be-executed instruction has been executed,

wherein the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction preceding it includes at least one of the following:

a first storage address interval storing the data required by the first to-be-executed instruction and a zeroth storage address interval storing the data required by the zeroth to-be-executed instruction have an overlapping area;

the first to-be-executed instruction comprises a most-valued pooling index instruction.
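The address-interval condition can be checked with a standard interval-overlap test. The sketch below is illustrative only: it assumes each interval is a half-open pair `(start, end)` of addresses, and the function names are hypothetical.

```python
def intervals_overlap(first, zeroth):
    """True if two half-open address intervals [start, end) share an address."""
    return first[0] < zeroth[1] and zeroth[0] < first[1]


def must_wait(first_interval, zeroth_interval):
    """The first instruction is cached until the zeroth one finishes when
    their storage address intervals overlap."""
    return intervals_overlap(first_interval, zeroth_interval)


print(must_wait((0, 16), (8, 24)))  # addresses 8..15 are shared
print(must_wait((0, 8), (8, 16)))   # adjacent intervals share nothing
```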

It should be noted that, although the above-mentioned embodiment is taken as an example to describe the most-valued pooling index instruction processing method, those skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.

The most-valued pooling index instruction processing method provided by the embodiments of the disclosure has a wide application range, processes the most-valued pooling index instruction with high efficiency and high speed, and performs the most-valued pooling index operation with high efficiency and high speed.

The present disclosure also provides a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-mentioned most-valued pooling index instruction processing method.

It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in this specification are all alternative embodiments and that the acts and modules involved are not necessarily essential to the disclosure.

It should be further noted that, although the steps in the flowchart of fig. 6 are shown in sequence as indicated by the arrows, the steps are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in fig. 6 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time, but may be performed at different times, and the sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.

It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.

In addition, unless otherwise specified, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.

If the integrated unit/module is implemented in hardware, the hardware may be digital circuits, analog circuits, etc. Physical implementations of hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the storage module may be any suitable magnetic storage medium or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and the like.

The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned memory includes: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

In the foregoing embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features are described, but any combination should be considered within the scope of the present specification as long as it contains no contradiction.

The foregoing may be better understood in light of the following clauses:

Clause A1, a most-valued pooling index instruction processing apparatus, comprising:

the control module is used for analyzing the obtained most-valued pooling index instruction to obtain an operation code and an operation domain of the most-valued pooling index instruction, and obtaining data to be operated, a pooling core and a target address which are required by executing the most-valued pooling index instruction according to the operation domain;

Clause A2, the apparatus according to clause A1, wherein the operation module comprises:

the comparator is used for carrying out comparison operation on the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result;

and the operation result determining module is used for determining the most value of the data in each area of the data to be operated according to the comparison result and taking the index of the area where the most value is located as the operation result.

Clause A3, the apparatus according to clause A1 or clause A2, wherein

the most value is a maximum value, a minimum value, a maximum absolute value or a minimum absolute value.

Clause A4, the apparatus of clause A1, wherein

the index is sequentially increased by rows, sequentially increased by columns, or looked up according to a lookup table.

Clause A5, the apparatus of clause A3, wherein

the operation module comprises a main operation submodule and a plurality of slave operation submodules, the main operation submodule comprising the comparator and the operation result determining module,

the main operation submodule is configured to perform, using the comparator, a comparison operation on the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result, to determine, using the operation result determining module, the most value of the data in each area of the data to be operated according to the comparison result, to take the index of the area where the most value is located as the operation result, and to store the operation result into the target address.

Clause A6, the apparatus of clause A3, wherein the operation module comprises a main operation submodule and a plurality of slave operation submodules, each slave operation submodule comprising one or more of the comparators and the operation result determining module,

the main operation submodule is configured to receive the data to be operated, the pooling core and the target address that are acquired by the control module and required for executing the most-valued pooling index instruction, and to distribute and transmit to each slave operation submodule the data to be operated, the pooling core and the target address that it requires;

the slave operation submodule is configured to receive the data to be operated, the pooling core and the target address that are distributed and transmitted by the main operation submodule and required for executing the most-valued pooling index instruction, to perform, using the one or more comparators, a comparison operation on the data in each area of the data to be operated corresponding to the pooling core to obtain a comparison result, to determine, using the operation result determining module, the most value of the data in each area of the data to be operated according to the comparison result, to take the index of the area where the most value is located as the operation result, and to store the operation result into the target address.

Clause A7, the apparatus of clause A3, wherein the operation domain further comprises an input height and an input width,

the control module is further configured to obtain data to be operated corresponding to the input width and the input height from the data address to be operated.

Clause A8, the apparatus of clause A3, wherein the operation domain further comprises an input channel number,

the control module is further configured to obtain the data to be operated corresponding to the input channel number from the data address to be operated.

Clause A9, the apparatus of clause A3, wherein the operation domain further comprises a pooling core height and a pooling core width,

the operation module is further configured to perform a most valued pooling index operation on the data to be operated according to the pooling core height and the pooling core width.

Clause A10, the apparatus of clause A3, wherein the operation domain further comprises a first stride,

wherein the operation module is further configured to move the pooling core in the input width direction according to the first stride.

Clause A11, the apparatus of clause A3, wherein the operation domain further comprises a second stride,

the operation module is further configured to move the pooling core in the height direction according to the second stride.

Clause A12, the apparatus according to clause A3, wherein the operation module is further configured to move the pooling core over the data to be operated, and compare a plurality of data to be operated in the area corresponding to the pooling core to obtain the operation result.

Clause A13, the apparatus according to clause A3, wherein the operation module is further configured to, when the size of the data to be operated is a non-integral multiple of the size of the pooling core, perform a most-valued pooling index operation on the portion of the data to be operated whose size is an integral multiple of the size of the pooling core to obtain an operation result,

wherein the size of the data to be operated being a non-integral multiple of the size of the pooling core includes at least one of the following: when the first stride is not included or the first stride is equal to the width of the pooling core, the input width of the data to be operated is a non-integral multiple of the width of the pooling core; when the first stride is included, the difference between the input width of the data to be operated and the width of the pooling core is a non-integral multiple of the first stride; when the second stride is not included or the second stride is equal to the height of the pooling core, the input height of the data to be operated is a non-integral multiple of the height of the pooling core; when the second stride is included, the difference between the input height of the data to be operated and the height of the pooling core is a non-integral multiple of the second stride.

Clause A14, the apparatus according to clause A13, wherein the operation module is further configured to, when the size of the remaining data in the data to be operated is smaller than the size of the pooling core, perform a most-valued pooling index operation on the remaining data directly, or perform a most-valued pooling index operation after padding the remaining data, to obtain an operation result.

Clause A15, the apparatus according to clause A13, wherein the operation module is further configured to, when the size of the remaining data in the data to be operated is smaller than the size of the pooling core, move the position of the pooling core in the reverse direction, so that the size of the data in the area corresponding to the pooling core after the reverse movement is equal to the size of the pooling core and the data in that area includes the remaining data, and perform a most-valued pooling index operation on the data in the area corresponding to the moved pooling core to obtain an operation result.

Clause A16, the apparatus of clause A3, further comprising:

a storage module, configured to store the data to be operated and the operation result.

Clause A17, the apparatus of clause A3, wherein the control module comprises:

the instruction storage submodule is used for storing the most value pooling index instruction;

the instruction processing submodule is used for analyzing the most valued pooling index instruction to obtain an operation code and an operation domain of the most valued pooling index instruction;

and the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the most value pooling index instruction.

Clause A18, the apparatus of clause A17, wherein the control module further comprises:

a dependency relationship processing submodule, configured to cache a first to-be-executed instruction in the instruction storage submodule when it is determined that an association relationship exists between the first to-be-executed instruction and a zeroth to-be-executed instruction preceding it, and, after the zeroth to-be-executed instruction has been executed, to extract the first to-be-executed instruction from the instruction storage submodule and send it to the operation module,

wherein the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction preceding it comprises at least one of the following:

the first arithmetic unit required for executing the first to-be-executed instruction and the zeroth arithmetic unit required for executing the zeroth to-be-executed instruction are entirely or partially the same.

Clause A19, a machine learning computing device, the device comprising:

one or more most-valued pooling index instruction processing apparatuses as described in any of clauses A1 to A18, configured to obtain data to be operated and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;

Clause A20, a combined processing device, comprising:

the machine learning computing device of clause A19, a universal interconnect interface, and other processing devices;

the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,

wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.

Clause A21, a machine learning chip, the machine learning chip comprising:

the machine learning computing device of clause A19 or the combined processing device of clause A20.

Clause A22, an electronic device, comprising:

the machine learning chip of clause A21.

Clause A23, a board card, comprising: a memory device, an interface device, a control device, and a machine learning chip as described in clause A19;

wherein the machine learning chip is connected with the memory device, the control device and the interface device respectively;

the memory device is used for storing data;

the interface device is used for realizing data transmission between the machine learning chip and external equipment;

and the control device is used for monitoring the state of the machine learning chip.

Clause a24, a most-valued pooling index instruction processing method, which is applied to a most-valued pooling index instruction processing apparatus including a control module and an operation module, the method comprising:

Clause a25, according to the method of clause a24, performing a most valued pooling indexing operation on the data to be operated according to the pooling core, and obtaining an operation result, including:

Clause a26, the method of clause a24 or clause a25, the most value being a maximum value, a minimum value, a maximum absolute value, or a minimum absolute value.

Clause a27, the method of clause a24, the index being sequentially incremented by row, sequentially incremented by column, or looked up according to a lookup table.

Clause a28, the method of clause a26, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result comprises:

Clause a29, the method of clause a26, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result comprises:

Clause a30, the method of clause a26, the operation domain further comprising an input height and an input width,

wherein acquiring, according to the operation code and the operation domain, the data to be operated, the pooling core and the target address required for executing the most-valued pooling index instruction comprises:

Clause a31, the method of clause a26, the operation domain further comprising a number of input channels,

Clause a32, the method of clause a26, the operation domain further comprising a pooling core height and a pooling core width,

and performing the most-valued pooling index operation on the data to be operated according to the pooling core height and the pooling core width.

Clause a33, the method of clause a26, the operation domain further comprising a first stride,

wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core comprises:

moving the pooling core in an input width direction according to the first stride.

Clause a34, the method of clause a26, the operation domain further comprising a second stride,

moving the pooling core in a height direction according to the second stride.
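How the pooling core height and width (clause a32) and the two strides (clauses a33 and a34) determine the regions visited can be sketched as follows. The function `window_origins` and its parameter names are hypothetical, and the sketch assumes that only regions lying fully inside the input are visited.

```python
def window_origins(in_h, in_w, core_h, core_w, stride_w, stride_h):
    """List the (top, left) corner of every region covered by the pooling core.

    stride_w plays the role of the first stride (width direction) and
    stride_h the role of the second stride (height direction).
    """
    origins = []
    for top in range(0, in_h - core_h + 1, stride_h):
        for left in range(0, in_w - core_w + 1, stride_w):
            origins.append((top, left))
    return origins
```

With a 4 x 5 input, a 2 x 2 pooling core and both strides equal to 2, the core visits the regions starting at (0, 0), (0, 2), (2, 0) and (2, 2); the last input column is not covered under this assumption.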

Clause a35, the method of clause a26, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result comprises:

and moving the pooling core on the data to be operated, and comparing a plurality of data to be operated in the area corresponding to the pooling core to obtain the operation result.
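The move-and-compare procedure of clause a35 can be sketched in Python for the case where the most value is the maximum. The function `max_pool_index` is a hypothetical name; the row-by-row index inside each region and the first-occurrence tie rule are assumptions, not details stated in the clause.

```python
def max_pool_index(data, core_h, core_w, stride_h, stride_w):
    """Return, for each region covered by the pooling core, the index of its maximum.

    data is a 2-D list (rows of equal length). The index of a position inside
    a region is counted row by row, starting from 0.
    """
    in_h, in_w = len(data), len(data[0])
    result = []
    for top in range(0, in_h - core_h + 1, stride_h):
        out_row = []
        for left in range(0, in_w - core_w + 1, stride_w):
            best_idx, best_val = 0, data[top][left]
            # Compare every value in the region against the current best.
            for r in range(core_h):
                for c in range(core_w):
                    v = data[top + r][left + c]
                    if v > best_val:
                        best_idx, best_val = r * core_w + c, v
            out_row.append(best_idx)
        result.append(out_row)
    return result
```

For the 4 x 4 input `[[1, 5, 2, 0], [3, 4, 9, 8], [7, 6, 1, 1], [0, 2, 3, 4]]` with a 2 x 2 pooling core and strides of 2, the four regions have their maxima 5, 9, 7 and 4 at indexes 1, 2, 0 and 3, so the result is `[[1, 2], [0, 3]]`.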

Clause a36, the method of clause a26, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result comprises:

when the size of the data to be operated is not an integral multiple of the size of the pooling core, performing the most-valued pooling index operation on the portion of the data to be operated whose size is an integral multiple of the size of the pooling core, to obtain an operation result,

Clause a37, the method of clause a36, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result further comprises:

when the size of the residual data in the data to be operated is smaller than the size of the pooling core, performing the most-valued pooling index operation on the residual data directly, or performing the most-valued pooling index operation after padding the residual data, to obtain an operation result.
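The two residual-handling options of clauses a36 and a37 can be illustrated with a one-dimensional Python sketch. The function `pool_index_1d`, the `residual` selector and the default padding value of 0 are assumptions for illustration; the clauses do not state which value is used for padding. The stride is assumed equal to the pooling core size and the most value is taken to be the maximum.

```python
def pool_index_1d(data, core, residual="direct", pad_value=0):
    """1-D maximum pooling index with two ways of handling residual data.

    residual "direct": pool the shorter leftover window as it is.
    residual "pad":    extend the leftover window to the core size with pad_value first.
    """
    full = len(data) // core * core  # length of the integral-multiple portion
    out = []
    for start in range(0, full, core):
        window = data[start:start + core]
        out.append(window.index(max(window)))
    rest = data[full:]
    if rest:
        if residual == "pad":
            rest = rest + [pad_value] * (core - len(rest))
        out.append(rest.index(max(rest)))
    return out
```

The choice can change the result: for `[-3, -1, -2, -5, -4]` with a core size of 2, pooling the residual `[-4]` directly gives index 0, while padding it with 0 gives `[-4, 0]` and index 1.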

Clause a38, the method of clause a36, wherein performing the most-valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result further comprises:

when the size of the residual data in the data to be operated is smaller than the size of the pooling core, moving the pooling core in the reverse direction so that the size of the data in the region corresponding to the pooling core after the reverse movement is equal to the size of the pooling core and the data in that region includes the residual data, and performing the most-valued pooling index operation on the data in the region corresponding to the pooling core after the reverse movement to obtain an operation result.
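The reverse movement of clause a38 can be sketched in one dimension as follows. The function `pool_index_reverse_1d` is a hypothetical name; the stride is assumed equal to the pooling core size and the most value is taken to be the maximum.

```python
def pool_index_reverse_1d(data, core):
    """1-D maximum pooling index where the last window is moved back to fit.

    When residual data remain, the final window is shifted in the reverse
    direction so that it has the full core size and ends at the last element,
    which means it overlaps the previous window.
    """
    full = len(data) // core * core
    starts = list(range(0, full, core))
    if full < len(data) and len(data) >= core:
        starts.append(len(data) - core)  # reverse-moved window covering the residual
    out = []
    for start in starts:
        window = data[start:start + core]
        out.append(window.index(max(window)))
    return out
```

For `[4, 1, 7, 2, 5]` with a core size of 2, the windows are `[4, 1]`, `[7, 2]` and the reverse-moved window `[2, 5]`, giving indexes `[0, 0, 1]`.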

Clause a39, the method of clause a26, the method further comprising:

and storing the data to be operated and the operation result by utilizing a storage module of the device.

Clause a40, the method of clause a26, wherein parsing the obtained most-valued pooling index instruction to obtain the operation code and the operation domain of the most-valued pooling index instruction comprises:

storing the most valued pooling index instruction;

and storing an instruction queue, wherein the instruction queue comprises a plurality of to-be-executed instructions arranged in execution order, and the plurality of to-be-executed instructions include the most-valued pooling index instruction.

Clause a41, the method of clause a40, the method further comprising:

when determining that the first to-be-executed instruction in the plurality of to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, caching the first to-be-executed instruction, and after determining that the zeroth to-be-executed instruction is completely executed, controlling to execute the first to-be-executed instruction,
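The caching behaviour of clauses a40 and a41 can be sketched with a small Python simulation. The criterion used here for two instructions being "associated", namely overlapping storage address intervals, is an assumption made for illustration; the function names `overlaps` and `run_queue` and the log format are likewise hypothetical.

```python
from collections import deque

def overlaps(a, b):
    """True if half-open address intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def run_queue(instructions):
    """Simulate an in-order instruction queue with dependency caching.

    instructions: list of (name, (start, end)) in execution order.
    Returns a log of events as strings.
    """
    queue = deque(instructions)
    in_flight = []  # issued but not yet completed
    log = []
    while queue:
        name, interval = queue.popleft()
        blockers = [n for n, iv in in_flight if overlaps(iv, interval)]
        if blockers:
            # The instruction is associated with an earlier one: cache it
            # until every earlier associated instruction has completed.
            log.append("cache " + name)
            for n in blockers:
                log.append("complete " + n)
            in_flight = [(n, iv) for n, iv in in_flight if n not in blockers]
        log.append("issue " + name)
        in_flight.append((name, interval))
    for n, _ in in_flight:
        log.append("complete " + n)
    return log
```

In this sketch an instruction whose address interval overlaps that of an earlier, unfinished instruction is held back, while an independent instruction is issued immediately.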

Clause a42, a non-transitory computer readable storage medium having computer program instructions stored thereon that, when executed by a processor, implement the method of any of clauses a24 to a41.

The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in view of the above, the content of the present specification should not be construed as a limitation to the present application.

Claims

1. An apparatus for processing a most-valued pooling index instruction, the apparatus comprising:

the control module is used for analyzing the obtained most-valued pooling index instruction to obtain an operation domain of the most-valued pooling index instruction, and acquiring data to be operated, a pooling core and a target address which are required by executing the most-valued pooling index instruction according to the operation domain;

and the operation module is used for carrying out the most valued pooling index operation on the data to be operated according to the pooling core, acquiring an operation result and storing the operation result into the target address, wherein the operation result comprises the index of the region where the most value is located in the data of each region of the data to be operated corresponding to the pooling core.
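The division of work between the control module and the operation module can be pictured with the following Python sketch. The dictionary-based instruction encoding, the field names (`data_address`, `core_height`, `core_width`, `target_address`), the use of a dictionary as memory, and the assumption that the stride equals the pooling core size are all hypothetical and are not taken from the claim. The most value is taken to be the maximum.

```python
def control_module(instruction, memory):
    """Parse the instruction and fetch what the operation needs."""
    domain = instruction["operation_domain"]
    data = memory[domain["data_address"]]
    core = (domain["core_height"], domain["core_width"])
    return data, core, domain["target_address"]

def operation_module(data, core, target_address, memory):
    """Compute the index of the maximum in each region and store the result."""
    core_h, core_w = core
    result = []
    for top in range(0, len(data) - core_h + 1, core_h):
        row = []
        for left in range(0, len(data[0]) - core_w + 1, core_w):
            # Flatten the region row by row, then locate its maximum.
            region = [data[top + r][left + c]
                      for r in range(core_h) for c in range(core_w)]
            row.append(region.index(max(region)))
        result.append(row)
    memory[target_address] = result

# Usage with a hypothetical instruction and memory layout.
memory = {0x100: [[1, 5, 2, 0], [3, 4, 9, 8]]}
instruction = {
    "operation_code": "most_value_pool_index",
    "operation_domain": {
        "data_address": 0x100,
        "core_height": 2,
        "core_width": 2,
        "target_address": 0x200,
    },
}
data, core, target = control_module(instruction, memory)
operation_module(data, core, target, memory)
```

After execution, `memory[0x200]` holds `[[1, 2]]`: the maxima 5 and 9 sit at indexes 1 and 2 of their respective regions.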

2. The apparatus of claim 1, wherein the operation module comprises:

3. The device according to claim 1 or 2,

4. A machine learning arithmetic device, the device comprising:

one or more most-valued pooling index instruction processing devices as claimed in any one of claims 1 to 3, configured to obtain data to be operated and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to the other processing devices via an I/O interface;

wherein, when the machine learning arithmetic device comprises a plurality of the most-valued pooling index instruction processing devices, the plurality of most-valued pooling index instruction processing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus so as to support larger-scale machine learning operations; the plurality of most-valued pooling index instruction processing devices share the same control system or have respective control systems; the plurality of most-valued pooling index instruction processing devices share a memory or have respective memories; and the interconnection mode of the plurality of most-valued pooling index instruction processing devices is any interconnection topology.

5. A combined processing apparatus, characterized in that the combined processing apparatus comprises:

the machine learning arithmetic device of claim 4, a universal interconnection interface, and other processing devices;

6. A machine learning chip, the machine learning chip comprising:

the machine learning arithmetic device according to claim 4 or the combined processing device according to claim 5.

7. An electronic device, characterized in that the electronic device comprises:

the machine learning chip of claim 6.

8. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the machine learning chip according to claim 6;

the storage device is used for storing data;

9. A method for processing a most valued pooling index instruction, which is applied to a most valued pooling index instruction processing apparatus, the method comprising:

and performing the most valued pooling index operation on the data to be operated according to the pooling core to obtain an operation result, and storing the operation result into the target address, wherein the operation result comprises the index of the region where the most value is located in each region data of the data to be operated corresponding to the pooling core.

10. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of claim 9.