CN112395002B - Operation method, device, computer equipment and storage medium - Google Patents

Operation method, device, computer equipment and storage medium

Info

Publication number
CN112395002B
CN112395002B (application CN201910747969.8A)
Authority
CN
China
Prior art keywords
pooling
data
instruction
machine learning
processed
Prior art date
Legal status
Active
Application number
CN201910747969.8A
Other languages
Chinese (zh)
Other versions
CN112395002A (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN201910747969.8A priority Critical patent/CN112395002B/en
Priority to PCT/CN2020/088248 priority patent/WO2020233387A1/en
Publication of CN112395002A publication Critical patent/CN112395002A/en
Application granted granted Critical
Publication of CN112395002B publication Critical patent/CN112395002B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present disclosure relates to an operation method, an apparatus, a computer device, and a storage medium. The combined processing device comprises a machine learning operation device, a universal interconnection interface, and other processing devices; the machine learning operation device interacts with the other processing devices to jointly complete a calculation operation specified by the user. The combined processing device further comprises a storage device, connected to the machine learning operation device and the other processing devices respectively, for storing the data of the machine learning operation device and the other processing devices. The operation method, apparatus, computer device, and storage medium provided by the embodiments of the present disclosure have a wide application range and high efficiency and speed of operation processing.

Description

Operation method, operation device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an inverse pooling instruction processing method and apparatus, a computer device, and a storage medium.
Background
With the continuous development of science and technology, machine learning, and neural network algorithms in particular, is used more and more widely, and has been applied successfully in fields such as image recognition, speech recognition, and natural language processing. However, as the complexity of neural network algorithms keeps increasing, so do the types and number of data operations involved. In the related art, the efficiency and speed of performing the inverse pooling operation on data are low.
Disclosure of Invention
In view of the above, the present disclosure provides an inverse pooling instruction processing method, an inverse pooling instruction processing apparatus, a computer device, and a storage medium, so as to improve efficiency and speed of inverse pooling operation on data.
According to a first aspect of the present disclosure, there is provided an anti-pooling instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and acquiring data to be processed, an input index, a pooling core and a target address which are required by executing the anti-pooling instruction according to the operation code and the operation domain;
and the operation module is used for performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result and storing the operation result into the target address.
According to a second aspect of the present disclosure, there is provided a machine learning operation apparatus including:
one or more anti-pooling instruction processing devices of the first aspect, configured to obtain data to be processed and control information from another processing device, perform a specified machine learning operation, and transmit an execution result to the other processing device through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the anti-pooling instruction processing devices, the plurality of the anti-pooling instruction processing devices can be connected through a specific structure and transmit data;
the plurality of anti-pooling instruction processing devices may be interconnected through a PCIE (peripheral component interconnect express) bus and transmit data between one another, so as to support larger-scale machine learning operations; the plurality of anti-pooling instruction processing devices may share the same control system or have their own control systems; they may share a memory or have their own memories; and their interconnection may follow any interconnection topology.
According to a third aspect of the present disclosure, there is provided a combined processing apparatus, the apparatus comprising:
the machine learning operation device according to the second aspect, a universal interconnection interface, and other processing devices;
and the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
According to a fourth aspect of the present disclosure, there is provided a machine learning chip including the machine learning operation device of the second aspect or the combined processing device of the third aspect.
According to a fifth aspect of the present disclosure, there is provided a machine learning chip package structure, which includes the machine learning chip of the fourth aspect.
According to a sixth aspect of the present disclosure, a board card is provided, which includes the machine learning chip packaging structure of the fifth aspect.
According to a seventh aspect of the present disclosure, there is provided an electronic device, which includes the machine learning chip of the fourth aspect or the board of the sixth aspect.
According to an eighth aspect of the present disclosure, there is provided an inverse-pooling instruction processing method applied to an inverse-pooling instruction processing apparatus, the method including:
analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and obtaining data to be processed, an input index, a pooling core and a target address which are required by executing the anti-pooling instruction according to the operation domain;
and performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result into the target address.
According to a ninth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described anti-pooling instruction processing method.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, a headset, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
The device comprises a control module and an operation module. The control module is configured to analyze the obtained inverse pooling instruction to obtain the operation code and the operation domain of the instruction, and to acquire the data to be processed, the input index, the pooling core, and the target address required for executing the instruction according to the operation code and the operation domain. The operation module is configured to perform the inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and to store the operation result at the target address. The inverse pooling instruction processing method and apparatus and the related products provided by the embodiments of the present disclosure have a wide application range and process inverse pooling instructions, and hence the inverse pooling operation itself, with high efficiency and speed.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a block diagram of an inverse pooled instruction processing apparatus according to an embodiment of the present disclosure.
FIGS. 2a-2c are schematic diagrams illustrating inverse pooling operations according to an embodiment of the present disclosure.
FIGS. 2d-2f are schematic diagrams illustrating the indexing of pooling cores according to an embodiment of the present disclosure.
FIGS. 3a-3f illustrate block diagrams of an anti-pooling instruction processing apparatus according to an embodiment of the present disclosure.
FIG. 4a is a schematic diagram illustrating the pooling core overlap movement of one embodiment.
FIG. 4b is a schematic diagram illustrating the spaced movement of the pooling cores of an embodiment.
Fig. 5 shows a schematic diagram of an application scenario of an inverse pooled instruction processing device according to an embodiment of the present disclosure.
Fig. 6a, 6b show block diagrams of a combined processing device according to an embodiment of the disclosure.
Fig. 7 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure.
FIG. 8 illustrates a flow diagram of an anti-pooling instruction processing method according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be understood that the terms "zero," "first," "second," and the like in the claims, the description, and the drawings of the present disclosure are used for distinguishing between different objects and not for describing a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Due to the wide use of neural network algorithms, the computing power of computer hardware continues to improve, and the types and number of data operations involved in practical applications keep growing. Inverse pooling (unpooling) is an operation that upsamples the data to be processed according to an index. Because programming languages come in many kinds, and because at the present stage there is no inverse pooling instruction that can be applied widely across programming languages, technicians must define multiple custom instructions for each programming-language environment in order to implement the inverse pooling operation, which makes the operation inefficient and slow. The present disclosure provides an inverse pooling instruction processing method, apparatus, computer device, and storage medium that can implement the inverse pooling operation with only one instruction, significantly improving the efficiency and speed of the operation.
Fig. 1 shows a block diagram of an inverse pooled instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 1, the apparatus includes a control module 11 and an operation module 12.
And the control module 11 is configured to analyze the obtained inverse pooling instruction to obtain an operation code and an operation domain of the inverse pooling instruction, and obtain to-be-processed data, an input index, a pooling core, and a target address required for executing the inverse pooling instruction according to the operation code and the operation domain.
And the operation module 12 is configured to perform inverse pooling operation on the to-be-processed data according to the pooling core and the input index, obtain an operation result, and store the operation result in the target address.
In this embodiment, the control module may obtain the data to be processed from the address of the data to be processed. The control module may obtain instructions and data through a data input output unit, which may be one or more data I/O interfaces or I/O pins.
In this embodiment, the operation code (opcode) may be the part of an instruction or field (usually indicated by a code) specified in a computer program to perform an operation; it is the instruction sequence number that tells the device executing the instruction which instruction to execute. The operation domain may be the source of the data required for executing the corresponding instruction, including parameters such as the data to be processed, the input index, the pooling core, and the corresponding operation method. An inverse pooling instruction must include an opcode and an operation domain, where the operation domain contains at least the address of the data to be processed, the input index, the pooling core, and the target address.
It should be understood that the instruction format of the anti-pooling instruction and the contained opcode and operation domain may be set as desired by those skilled in the art, and the disclosure is not limited thereto.
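As a concrete illustration of the opcode/operation-domain split described above, the following sketch models an inverse pooling instruction as a simple record. The disclosure deliberately leaves the instruction format open, so every field name and the textual opcode here are illustrative assumptions, not the patent's actual encoding.

```python
from dataclasses import dataclass

@dataclass
class UnpoolInstruction:
    """Hypothetical inverse-pooling instruction: one opcode plus an operation
    domain holding the operands named in the text."""
    opcode: str     # identifies which instruction to execute, e.g. "UNPOOL"
    src_addr: int   # address of the data to be processed
    index_addr: int # address of the input index (or a scalar index value)
    kernel_h: int   # pooling-core height (kh)
    kernel_w: int   # pooling-core width (kw)
    stride_h: int   # second step: movement along the height direction
    stride_w: int   # first step: movement along the width direction
    dst_addr: int   # target address for the operation result

def parse(fields: dict) -> UnpoolInstruction:
    """Control-module step: split a raw instruction into opcode + operation domain."""
    return UnpoolInstruction(
        opcode=fields["opcode"],
        src_addr=fields["src"], index_addr=fields["idx"],
        kernel_h=fields["kh"], kernel_w=fields["kw"],
        stride_h=fields["sh"], stride_w=fields["sw"],
        dst_addr=fields["dst"],
    )

inst = parse({"opcode": "UNPOOL", "src": 0x100, "idx": 0x200,
              "kh": 2, "kw": 2, "sh": 2, "sw": 2, "dst": 0x300})
```

Once parsed this way, the control module can hand the operation domain to the operation module, which only needs the operands, not the raw instruction word.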
In this embodiment, the apparatus may include one or more control modules and one or more operation modules, and the number of the control modules and the number of the operation modules may be set according to actual needs, which is not limited in this disclosure. When the device comprises a control module, the control module can receive the anti-pooling instruction and control one or more operation modules to perform anti-pooling operation. When the device comprises a plurality of control modules, the control modules can respectively receive the anti-pooling instructions and control the corresponding operation module or modules to perform anti-pooling operation.
The inverse pooling instruction processing device provided by the embodiments of the present disclosure comprises a control module and an operation module. The control module is configured to analyze the acquired inverse pooling instruction to obtain its operation code and operation domain, and to acquire the data to be processed, the input index, the pooling core, and the target address required for executing the instruction according to the operation code and the operation domain. The operation module is configured to perform the inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result and store it at the target address. The device has a wide application range and processes inverse pooling instructions, and hence the inverse pooling operation itself, with high efficiency and speed.
FIGS. 2a-2c show schematic diagrams of the inverse pooling operation of an embodiment of the present disclosure. The inverse pooling operation is performed on the data to be processed according to the pooling core and the input index, where the pooling core has a fixed indexing scheme. The index corresponding to each position of the region where the pooling core is located is compared with the input index: if they are equal, the data to be processed is taken as the operation result at that position; otherwise, the operation result at that position is a preset default value, which may be zero.
In one possible implementation, the input index is a single datum; that is, all of the data to be processed correspond to one input index. As shown in fig. 2a, assume the pooling core size is 2 × 2, with indices starting from 0 and increasing sequentially by row, as shown for the 2 × 2 pooling core. The first stride and the second stride are both 2, and the input index is 1. The operation result is 4 × 4 in size, and the default value is 0. The pooling core is first placed at the upper-left corner of the output data. The index value at that position is 0; comparing it with the input index gives a different result, so the operation result at that position is the preset default value, namely 0. The second position of the region where the pooling core is located, whose index is 1, is then compared with the input index; the comparison result is the same, so the data to be processed is written to that position. The third and fourth positions of the region are compared in the same way. After all positions covered by the pooling core have been compared, the pooling core is moved along the width direction by the first stride, i.e. by 2 units, and the operation above is repeated: because the input index is the single value 1, the data to be processed is written to the position whose pooling-core index is 1, and the default value 0 is written to the other positions.
After that, the pooling core is moved along the height direction by the second stride and the operation is repeated from the start of that height position; as before, the pooling-core indices are compared with the input index, namely 1, and the final result is obtained from the comparison results. This continues until the entire inverse pooling operation is finished.
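The single-shared-index walkthrough above (fig. 2a) can be sketched in plain Python as follows. The function name, the parameter names, and the use of nested lists to stand in for device memory are all illustrative assumptions, not the patent's implementation.

```python
def unpool_single_index(data, input_index, kh, kw, sh, sw, out_h, out_w, default=0):
    """Inverse pooling with one shared input index: each input value is written
    to the position of its kernel window whose row-priority index
    (index = ih * kw + iw) equals input_index; every other position in the
    window keeps the default value."""
    out = [[default] * out_w for _ in range(out_h)]
    for r, row in enumerate(data):
        for c, value in enumerate(row):
            top, left = r * sh, c * sw          # window placed by the two strides
            for ih in range(kh):
                for iw in range(kw):
                    if ih * kw + iw == input_index:
                        out[top + ih][left + iw] = value
    return out

# The fig. 2a setup: 2x2 pooling core, both strides 2, shared index 1, 4x4 output.
result = unpool_single_index([[5, 6], [7, 8]], input_index=1,
                             kh=2, kw=2, sh=2, sw=2, out_h=4, out_w=4)
```

With the shared index 1 and a row-priority 2 × 2 core, every input value lands in the top-right cell of its 2 × 2 output window, and all other output cells hold the default 0.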
In one possible implementation, the input index is a set of data of the same size as the data to be processed; that is, the data to be processed and the input indices correspond one to one. As shown in fig. 2c, assume the pooling core size is 2 × 2, with indices starting from 0 and increasing sequentially by row. The first stride and the second stride are both 2, the data to be processed is 2 × 2 in size, and the input indices are likewise 2 × 2, corresponding one to one with the data. The operation result is 4 × 4 in size, and the default value is 0. The pooling core is first placed at the upper-left corner of the output data. The index value at that position is 0, while the corresponding input index is 2; the comparison result is different, so the operation result at that position is the preset default value, namely 0. The second position of the region, whose index is 1, is compared with the input index; the result is still different, so the operation result is 0. The third position, whose index is 2, is the same as the input index, so the operation result at that position is the data to be processed.
After all operations for this pooling-core position are finished, the pooling core is moved along the width direction by the first stride, i.e. by 2 units, and the operation is repeated. Because the input indices correspond one to one with the data to be processed, and the input index for this region is 1, the data to be processed is written to the position whose pooling-core index is 1, and the default value 0 is written to the other positions. After that, the pooling core is moved along the height direction by the second stride and the operation is repeated from the start of that height position; the pooling-core indices are compared with the input index, here 0, and the result is obtained from the comparisons. The core is then moved along the width direction again, the region is compared with its input index, here 2, and the inverse pooling operation is repeated in this way until all operations are finished.
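The one-to-one index case above (fig. 2c) can be sketched the same way; because each value carries its own kernel index, the window scan reduces to a direct write. Names and the row-priority indexing assumption are again illustrative.

```python
def unpool_per_element(data, indices, kw, sh, sw, out_h, out_w, default=0):
    """Inverse pooling with one input index per value: each value of `data` is
    written to the kernel position named by its own index, assuming
    row-priority kernel indexing (index = ih * kw + iw)."""
    out = [[default] * out_w for _ in range(out_h)]
    for r, row in enumerate(data):
        for c, value in enumerate(row):
            ih, iw = divmod(indices[r][c], kw)    # invert the row-priority index
            out[r * sh + ih][c * sw + iw] = value # write into this value's window
    return out

# 2x2 data, a 2x2 index map, both strides 2, 4x4 output filled with default 0.
result = unpool_per_element([[5, 6], [7, 8]], [[2, 1], [0, 3]],
                            kw=2, sh=2, sw=2, out_h=4, out_w=4)
```

This per-element form is the same upsampling that frameworks perform when reversing a max-pooling layer with recorded argmax indices.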
In one possible implementation, the input index is a set of data, with part of the data to be processed sharing each input index. As shown in fig. 2b, data in the same dimension of the data to be processed correspond to the same input index; for example, along the width direction, data at the same width share one input index. The implementation process is similar to the above, except that input indices within the same dimension are the same while input indices across dimensions differ. The partial data may also be grouped in other ways: for example, if the input data is three-dimensional, comprising input height, input width, and input channel, the data to be processed in the same input channel may use the same input index; or the input data may be partitioned in advance into groups, each group using the same input index.
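The partial-correspondence mode might be sketched as below, under the assumption (one reading of fig. 2b) that all values in the same row of the data to be processed share a single input index; the grouping rule, function name, and parameters are hypothetical.

```python
def unpool_shared_by_row(data, row_indices, kw, sh, sw, out_h, out_w, default=0):
    """Inverse pooling where every value in row r of `data` shares the input
    index row_indices[r]; row-priority kernel indexing is assumed."""
    out = [[default] * out_w for _ in range(out_h)]
    for r, row in enumerate(data):
        ih, iw = divmod(row_indices[r], kw)       # one kernel position per data row
        for c, value in enumerate(row):
            out[r * sh + ih][c * sw + iw] = value
    return out

result = unpool_shared_by_row([[5, 6], [7, 8]], row_indices=[1, 2],
                              kw=2, sh=2, sw=2, out_h=4, out_w=4)
```

The same skeleton covers the other groupings mentioned in the text (per-channel or per-group shared indices) by changing only how the shared index is looked up.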
FIGS. 2d-2f are schematic diagrams illustrating the indexing of pooling cores according to an embodiment of the present disclosure. The inverse pooling operation is performed on the data to be processed according to the pooling core and the input index, where the pooling core has a fixed indexing scheme.
In one possible implementation, the pooling core may be indexed in a row-priority, sequentially increasing manner, i.e. starting from a fixed value and increasing sequentially row by row. As shown in fig. 2d, starting from 0, one way to find the index value index of the (iw, ih) position is index = ih × kw + iw.
In one possible implementation, the pooling core may be indexed in a column-priority, sequentially increasing manner. As shown in fig. 2e, starting from 0 and increasing sequentially with column priority, one way to find the index value index of the (iw, ih) position is index = iw × kh + ih.
In one possible implementation, the pooling core may be indexed according to a lookup table: as shown in fig. 2f, a lookup table is set and the pooling core is indexed through it. For example, for the location of c, looking up the table gives the index 10.
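The three indexing schemes can be written directly from the formulas above. The lookup-table contents below are invented for illustration; only the formulas themselves come from the text.

```python
def index_row_major(ih, iw, kw):
    return ih * kw + iw   # fig. 2d: row priority, index = ih * kw + iw

def index_col_major(ih, iw, kh):
    return iw * kh + ih   # fig. 2e: column priority, index = iw * kh + ih

# Fig. 2f style: indices come from an arbitrary table; these entries are made up.
LOOKUP = {(0, 0): 7, (0, 1): 3, (1, 0): 10, (1, 1): 5}

def index_lookup(ih, iw):
    return LOOKUP[(ih, iw)]
```

For a 2 × 2 pooling core, the same position (iw = 0, ih = 1) gets index 2 under row priority but index 1 under column priority, which is why the instruction must fix one indexing scheme in advance.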
It should be understood that the index of the pooled cores in the anti-pooling instruction may be set as desired by one skilled in the art, and the disclosure is not limited thereto.
FIG. 3a shows a block diagram of an inverse pooling instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3a, the operation module 12 may include one or more comparators 120. The comparator 120 is configured to compare the indices of the region corresponding to the pooling core with the input index to obtain a comparison result, and to obtain the operation result according to the comparison result.
In this implementation, the number of comparators may be set according to the data size of the comparison operation to be performed, the processing speed of the comparison operation, the efficiency, and other requirements, which is not limited by the present disclosure.
Fig. 3b illustrates a block diagram of an inverse pooled instruction processing apparatus in accordance with multiple embodiments of the present disclosure. In one possible implementation, as shown in fig. 3b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The main operation submodule 121 includes one or more comparators.
In a possible implementation manner, the main operation sub-module 121 is configured to perform a comparison operation on the index in the area corresponding to the pooled core and the input indexes corresponding to the pooled cores by using the comparator to obtain a comparison result, obtain an operation result according to the comparison result, and store the operation result in the target address.
In a possible implementation manner, the control module 11 may be further configured to analyze the obtained calculation instruction to obtain an operation domain and an operation code of the calculation instruction, and obtain to-be-processed data required for executing the calculation instruction according to the operation domain. The operation module 12 may be further configured to perform an operation on the data to be processed according to the calculation instruction, so as to obtain a calculation result of the calculation instruction. The operation module can further comprise a plurality of operators for executing operations corresponding to the operation types of the calculation instructions.
In this implementation manner, the calculation instruction may be other instructions for performing arithmetic operations, logic operations, and the like on data such as scalars, vectors, matrices, tensors, and the like, and a person skilled in the art may set the calculation instruction according to actual needs, which is not limited by the present disclosure.
In this implementation, the arithmetic unit may include an adder, a divider, a multiplier, a comparator, and the like capable of performing arithmetic operations, logical operations, and the like on data. The type and number of the arithmetic units may be set according to the requirements of the size of the data amount of the arithmetic operation to be performed, the type of the arithmetic operation, the processing speed and efficiency of the arithmetic operation on the data, and the like, which is not limited by the present disclosure.
In one possible implementation, as shown in fig. 3b, the operation module 12 may include a master operation sub-module 121 and a plurality of slave operation sub-modules 122. The slave operand module 122 includes one or more comparators.
In a possible implementation manner, the control module 11 is further configured to analyze the calculation instruction to obtain a plurality of operation instructions, and send the data to be processed and the plurality of operation instructions to the main operation sub-module 121.
The main operation sub-module 121 is configured to receive the to-be-processed data, the input index, the pooling kernel, and the target address that the control module obtained for executing the inverse pooling instruction, and to allocate and transmit the corresponding portions of them to the slave operation sub-modules.
The slave operation submodule 122 is configured to receive the to-be-processed data, the input index, the pooling kernel, and the target address allocated and transmitted by the master operation submodule for executing the inverse pooling instruction, use the comparator to compare the index in the region corresponding to the pooling kernel with the input index corresponding to the pooling kernel to obtain a comparison result, obtain an operation result according to the comparison result, and store the operation result at the target address.
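The compare-and-scatter step that a slave operation sub-module performs can be illustrated with a small software sketch. This is a hypothetical model, not the hardware implementation: for each pooled value, the input index recorded during forward pooling selects the position inside a kernel-sized output region that receives the value, and the remaining positions are zero.

```python
import numpy as np

def unpool_region(value, index, kernel_h, kernel_w):
    """Scatter one pooled value back into a kernel-sized region.

    `index` is the flat position inside the pooling window that held
    the maximum during forward pooling; inverse pooling writes the
    value back there and fills the rest of the region with zeros.
    """
    region = np.zeros((kernel_h, kernel_w))
    region[index // kernel_w, index % kernel_w] = value
    return region

# Value 7 was taken from flat position 3 (bottom-right) of a 2x2 window.
restored = unpool_region(7.0, 3, 2, 2)
print(restored)
# [[0. 0.]
#  [0. 7.]]
```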
In this implementation, when the computation instruction is an operation performed on scalar or vector data, the apparatus may control the main operation sub-module to perform an operation corresponding to the computation instruction by using an operator therein. When the calculation instruction is to perform an operation on data having a dimension greater than or equal to 2, such as a matrix, a tensor, or the like, the device may control the slave operation submodule to perform an operation corresponding to the calculation instruction by using an operator therein.
It should be noted that, a person skilled in the art may set the connection manner between the master operation submodule and the plurality of slave operation submodules according to actual needs to implement the configuration setting of the operation module, for example, the configuration of the operation module may be an "H" configuration, an array configuration, a tree configuration, and the like, which is not limited in the present disclosure.
Fig. 3c shows a block diagram of an inverse pooled instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3c, the operation module 12 may further include one or more branch operation sub-modules 123, and the branch operation sub-module 123 is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. The main operation sub-module 121 is connected to one or more branch operation sub-modules 123. Therefore, the main operation sub-module, the branch operation sub-module and the slave operation sub-module in the operation module are connected by adopting an H-shaped structure, and data and/or operation instructions are forwarded by the branch operation sub-module, so that the resource occupation of the main operation sub-module is saved, and the instruction processing speed is further improved.
Fig. 3d shows a block diagram of an inverse pooled instruction processing device according to an embodiment of the present disclosure. In one possible implementation, as shown in FIG. 3d, a plurality of slave operation sub-modules 122 are distributed in an array.
Each slave operation sub-module 122 is connected with other adjacent slave operation sub-modules 122, the master operation sub-module 121 is connected with k slave operation sub-modules 122 in the plurality of slave operation sub-modules 122, and the k slave operation sub-modules 122 are: n slave operator modules 122 in row 1, n slave operator modules 122 in row m, and m slave operator modules 122 in column 1.
As shown in fig. 3d, the k slave operator modules include only n slave operator modules in the 1 st row, n slave operator modules in the m th row, and m slave operator modules in the 1 st column, that is, the k slave operator modules are slave operator modules directly connected to the master operator module among the plurality of slave operator modules. The k slave operation submodules are used for forwarding data and instructions between the master operation submodules and the plurality of slave operation submodules. Therefore, the plurality of slave operation sub-modules are distributed in an array, the speed of sending data and/or operation instructions to the slave operation sub-modules by the master operation sub-module can be increased, and the instruction processing speed is further increased.
Fig. 3e shows a block diagram of an inverse pooled instruction processing apparatus according to an embodiment of the present disclosure. In one possible implementation, as shown in fig. 3e, the operation module may further include a tree sub-module 124. The tree submodule 124 includes a root port 401 and a plurality of branch ports 402. The root port 401 is connected to the master operation submodule 121, and the plurality of branch ports 402 are connected to the plurality of slave operation submodules 122. The tree sub-module 124 has a transceiving function, and is configured to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-module 122. Therefore, the operation modules are connected in a tree structure under the action of the tree sub-modules, and the forwarding function of the tree sub-modules is utilized, so that the speed of sending data and/or operation instructions from the main operation sub-module to the slave operation sub-module can be increased, and the instruction processing speed is increased.
In one possible implementation, the tree submodule 124 may be an optional component of the apparatus, and may include at least one level of nodes. The nodes are connection structures with a forwarding function and have no operation function of their own. The nodes at the lowest level are connected with the slave operation submodules to forward data and/or operation instructions between the master operation submodule 121 and the slave operation submodules 122. In particular, if the tree submodule has zero levels of nodes, the apparatus does not require the tree submodule.
In one possible implementation, the tree submodule 124 may include a plurality of nodes of an n-ary tree structure, and the plurality of nodes of the n-ary tree structure may have a plurality of layers.
For example, fig. 3f shows a block diagram of an inverse pooled instruction processing device according to an embodiment of the present disclosure. As shown in fig. 3f, the n-ary tree structure may be a binary tree structure, with the tree submodule including 2 levels of nodes 01. The lowest-level nodes 01 are connected with the slave operation sub-modules 122 to forward data and/or operation instructions between the master operation sub-module 121 and the slave operation sub-modules 122.
In this implementation, the n-ary tree structure may also be a ternary tree structure or the like, where n is a positive integer greater than or equal to 2. The number of n in the n-ary tree structure and the number of layers of nodes in the n-ary tree structure may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
In one possible implementation, the operation domain may further include an output height and an output width.
The control module is further configured to write the operation result into the destination address, where the height of the operation result is the output height, and the width of the operation result is the output width.
In this implementation, the output width and the output height may define the data amount and size of the obtained operation result. The output width and the output height included in the operation field may be specific values, or may be storage addresses for storing the output width and the output height. When specific values of the output width and the output height are directly included in the operation field, the specific values are determined as the corresponding output width and output height, respectively. When the memory addresses of the output width and the output height are included in the operation domain, the output height and the output width can be obtained from the memory addresses of the output width and the output height, respectively.
In a possible implementation manner, when the output width and the output height are not included in the operation domain, the data to be processed may be obtained according to a preset default output height and a default output width, or may be obtained according to other operation domains such as an input height and an input width.
By the mode, the data size and the size of the operation result can be limited, the accuracy of the operation result is ensured, and the device can execute the anti-pooling instruction.
In one possible implementation, the operation domain may further include an output height and an output width.
The control module is further configured to obtain the input height according to the output height and the input width according to the output width, and to obtain the to-be-processed data corresponding to the input height and the input width from the to-be-processed data address.
In this implementation, the input height may be obtained from the output height, and the input width from the output width. One possible relationship between them is:

output height = (input height - 1) × second stride + pooling kernel height

output width = (input width - 1) × first stride + pooling kernel width
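These two formulas, and their inversion used to recover the input size from the output size, can be sketched as follows. This is a plain illustration of the arithmetic; the function and parameter names are ours, not the device's:

```python
def unpool_output_size(in_h, in_w, kernel_h, kernel_w, stride_x, stride_y):
    # output height = (input height - 1) * second stride + pooling kernel height
    out_h = (in_h - 1) * stride_y + kernel_h
    # output width = (input width - 1) * first stride + pooling kernel width
    out_w = (in_w - 1) * stride_x + kernel_w
    return out_h, out_w

def unpool_input_size(out_h, out_w, kernel_h, kernel_w, stride_x, stride_y):
    # Inverting the formulas above to derive the input size.
    in_h = (out_h - kernel_h) // stride_y + 1
    in_w = (out_w - kernel_w) // stride_x + 1
    return in_h, in_w

# Round trip with a 2x1 kernel, first stride 1, second stride 2:
assert unpool_output_size(32, 32, 2, 1, 1, 2) == (64, 32)
assert unpool_input_size(64, 32, 2, 1, 1, 2) == (32, 32)
```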
In a possible implementation manner, when the output width and the output height are not included in the operation domain, the data to be processed may be obtained according to a preset default output height and a default output width, or may be obtained according to other operation domains such as an input height and an input width.
By the mode, the data size and the size of the data to be processed can be limited, the accuracy of an operation result is ensured, and the device can execute the anti-pooling instruction.
In one possible implementation, the operation field may further include an input height and an input width.
The control module is further used for acquiring the data to be processed corresponding to the input width and the input height from the data address to be processed.
In this implementation, the input height and input width may define the data volume and size of the obtained data to be processed. The input height and the input width included in the operation field may be specific numerical values, or may be storage addresses storing the input height and the input width. When a specific numerical value of the input height and the input width is directly included in the operation field, the specific numerical value is determined as the corresponding input height and input width, respectively. When the storage addresses of the input height and the input width are included in the operation field, the input height and the input width may be obtained from the storage addresses of the input height and the input width, respectively.
In a possible implementation manner, when the input height and the input width are not included in the operation domain, the data to be processed may be obtained according to a preset default input height and a default input width, or may be obtained according to other operation domains such as an output height and an output width.
By the mode, the data size and the size of the data to be processed can be limited, the accuracy of an operation result is ensured, and the device can execute the anti-pooling instruction.
In one possible implementation, the operation domain may further include an input channel number.
The control module is further configured to obtain the data to be processed corresponding to the number of the input channels from the address of the data to be processed.
In this implementation, the number of input channels may define the number of channels of the obtained data to be processed, and the number of output channels is the same as the number of input channels. The number of input channels included in the operation domain may be a specific numerical value, or may be a storage address storing the number of input channels. When the specific numerical value of the number of input channels is directly included in the operation domain, that value is used as the number of input channels. When the storage address of the number of input channels is included in the operation domain, the number of input channels can be obtained from that storage address.
In a possible implementation manner, when the number of input channels is not included in the operation domain, the data to be processed may be acquired according to a preset default number of input channels.
By the mode, the number of input channels of the data to be processed can be limited, the accuracy of an operation result is ensured, and the device can execute the anti-pooling instruction.
In one possible implementation, the operation domain may further include a pooled core height and a pooled core width.
The operation module 12 is further configured to perform an inverse pooling operation according to the pooling core height and the pooling core width.
In one possible implementation manner, when the pooled core height and the pooled core width are not included in the operation domain, a preset default pooled core height and a preset default pooled core width may be obtained, so that the control module and the operation module may execute the anti-pooled instruction.
In one possible implementation, the operation domain may further include a first stride. The operation module 12 may be further configured to move the pooling core in the width direction according to the first step.
In one possible implementation, the operation domain may further include a second stride. The operation module 12 may be further configured to move the pooling kernel in the height direction according to the second step.
In one possible implementation, the operation domain may further include a first stride and a second stride. The operation module 12 may be further configured to move the pooling core in the width direction according to the first step and move the pooling core in the height direction according to the second step.
In this implementation, the stride of the inverse pooling operation is the magnitude of each movement of the pooling kernel when performing the inverse pooling operation. The first stride may be the distance by which the pooling kernel moves in the width direction, and the second stride the distance by which it moves in the height direction.
In the present disclosure, parameters such as the height, the width, the first stride, and the second stride of the pooling kernel required for performing the inverse pooling operation are described by taking the pooling kernel as a two-dimensional example, and if the pooling kernel is multidimensional, the parameters of the pooling kernel include the size and stride of each dimension.
In a possible implementation manner, when the first stride and the second stride are not given in the operation domain of the inverse pooling instruction, the operation module may use the height and the width of the pooling kernel as the strides of the corresponding dimensions, respectively, to ensure normal operation of the inverse pooling operation.
In a possible implementation manner, the operation module 12 is further configured to accumulate operation results at the overlap when the pooling kernel moves with overlap. The pooling kernel moves with overlap when at least one of the following holds: the operation domain contains the first stride and the first stride is less than the pooling kernel width; the operation domain contains the second stride and the second stride is less than the pooling kernel height. Specifically, when the operation domain contains only the first stride and not the second stride, overlapping movement means that the first stride is less than the pooling kernel width; when it contains only the second stride and not the first stride, overlapping movement means that the second stride is less than the pooling kernel height; when it contains both strides, the pooling kernel moves with overlap when at least one of the two conditions (first stride less than the pooling kernel width, second stride less than the pooling kernel height) holds.
Fig. 4a shows a case of overlapping movement of the pooling kernel, where the pooling kernel has a size of 3 × 3 and the first and second strides are both 2; the shaded portion shown is the overlap region. The operation results in the overlapping regions are accumulated. For example, when the pooling kernel is at the upper left corner, the operation result at overlap position A is 1; after the pooling kernel moves, the operation result at the same position is 2; the results at A are then accumulated, i.e. 1 + 2 = 3.
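A minimal software sketch of overlapping movement (our own model, assuming max-unpooling semantics): when the stride is smaller than the kernel size, successive kernel regions overlap in the output, and values scattered into the same output cell are summed.

```python
import numpy as np

def unpool_overlap(values, indices, k, stride):
    """Inverse pooling with a square k x k kernel where results
    falling in overlapping regions are accumulated (summed)."""
    in_h, in_w = values.shape
    out = np.zeros(((in_h - 1) * stride + k, (in_w - 1) * stride + k))
    for i in range(in_h):
        for j in range(in_w):
            idx = indices[i, j]
            # Region base is (i*stride, j*stride); idx selects the cell.
            out[i * stride + idx // k, j * stride + idx % k] += values[i, j]
    return out

# 3x3 kernel, stride 2: the two kernel regions overlap in output column 2.
values = np.array([[1.0, 2.0]])
indices = np.array([[2, 0]])   # both values land in column 2 of the output
out = unpool_overlap(values, indices, 3, 2)
assert out[0, 2] == 3.0        # 1 + 2 accumulated at the overlap
```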
In a possible implementation, the operation module 12 is further configured to write a default value in the gap when the pooling kernel moves with gaps. The pooling kernel moves with gaps when at least one of the following holds: the operation domain contains the first stride and the first stride is greater than the pooling kernel width; the operation domain contains the second stride and the second stride is greater than the pooling kernel height. Specifically, when the operation domain contains only the first stride and not the second stride, movement with gaps means that the first stride is greater than the pooling kernel width; when it contains only the second stride and not the first stride, movement with gaps means that the second stride is greater than the pooling kernel height; when it contains both strides, the pooling kernel moves with gaps when at least one of the two conditions (first stride greater than the pooling kernel width, second stride greater than the pooling kernel height) holds.
In one possible implementation, the default value is 0.
Fig. 4b shows a case where the pooling kernel moves with gaps, where the pooling kernel has a size of 2 × 2 and the first and second strides are both 3; the shaded area shown is the gap region.

In one implementation, the data in the gap region may be left unprocessed.

In another implementation, the operation result in the gap region may be set to a default value, which may be zero.
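A matching sketch for movement with gaps (again our own illustrative model): when the stride exceeds the kernel size, output cells between kernel placements are never written and keep the default value.

```python
import numpy as np

def unpool_spaced(values, indices, k, stride, default=0.0):
    """Inverse pooling with a square k x k kernel where stride > k:
    cells not written by any scattered value keep `default`, which
    includes the gap regions between kernel placements."""
    in_h, in_w = values.shape
    out = np.full(((in_h - 1) * stride + k, (in_w - 1) * stride + k), default)
    for i in range(in_h):
        for j in range(in_w):
            idx = indices[i, j]
            out[i * stride + idx // k, j * stride + idx % k] = values[i, j]
    return out

# 2x2 kernel, stride 3: output column 2 lies in the gap between regions.
out = unpool_spaced(np.array([[4.0, 6.0]]), np.array([[0, 0]]), 2, 3)
assert out[0, 2] == 0.0   # gap filled with the default value
assert out[0, 3] == 6.0   # second kernel region starts at column 3
```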
In one possible implementation, as shown in fig. 3 a-3 f, the apparatus may further include a storage module 13. The storage module 13 is used for storing the data to be processed and the pooling core.
In this implementation, the storage module may be one or more of a cache and a register, and the cache may include a temporary cache and may further include at least one NRAM (Neuron Random Access Memory). The cache may be used to store data to be processed and operation results, and the register may be used to store data to be processed, scalar data, parameters, and the like.
In one possible implementation, the cache may include a neuron cache. The neuron buffer, i.e., the neuron random access memory, may be configured to store neuron data in the data to be processed, where the neuron data may include neuron vector data.
In a possible implementation manner, the apparatus may further include a direct memory access module for reading or storing data from the storage module.
In one possible implementation, as shown in fig. 3 a-3 f, the control module 11 may include an instruction storage sub-module 111, an instruction processing sub-module 112, and a queue storage sub-module 113.
The instruction storage submodule 111 is used for storing the anti-pooling instructions.
The instruction processing sub-module 112 is configured to parse the anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction.
The queue storage submodule 113 is configured to store an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include an inverse pooling instruction.
In this implementation manner, the execution order of the multiple instructions to be executed may be arranged according to the receiving time, the priority level, and the like of the instructions to be executed to obtain an instruction queue, so that the multiple instructions to be executed are sequentially executed according to the instruction queue.
In one possible implementation, as shown in fig. 3 a-3 f, the control module 11 may further include a dependency processing sub-module 114.
The dependency relationship processing submodule 114 is configured to, when it is determined that a first to-be-executed instruction in the multiple to-be-executed instructions is associated with a zeroth to-be-executed instruction before the first to-be-executed instruction, cache the first to-be-executed instruction in the instruction storage submodule 111, and after the zeroth to-be-executed instruction is executed, extract the first to-be-executed instruction from the instruction storage submodule 111 and send the first to-be-executed instruction to the operation module 12.
The zeroth to-be-executed instruction preceding the first to-be-executed instruction is determined to be associated with the first to-be-executed instruction when the first storage address interval, which stores the data required by the first to-be-executed instruction, overlaps the zeroth storage address interval, which stores the data required by the zeroth to-be-executed instruction. Conversely, there is no association between the first to-be-executed instruction and the preceding zeroth to-be-executed instruction when the first storage address interval and the zeroth storage address interval have no overlapping area.
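The overlap test between the two storage address intervals can be written as a small predicate. This is an illustrative sketch with our own names; the patent only specifies the overlap criterion itself:

```python
def intervals_overlap(start_a, length_a, start_b, length_b):
    """Half-open address intervals [start, start+length) overlap
    iff each interval begins before the other one ends."""
    return start_a < start_b + length_b and start_b < start_a + length_a

def has_dependency(first_interval, zeroth_interval):
    # The first instruction must wait for the zeroth instruction iff
    # their storage address intervals share at least one address.
    return intervals_overlap(*first_interval, *zeroth_interval)

assert has_dependency((100, 50), (120, 10))       # [100,150) meets [120,130)
assert not has_dependency((100, 50), (150, 10))   # [100,150), [150,160) disjoint
```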
By the method, according to the dependency relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction, the subsequent first to-be-executed instruction is executed after the execution of the previous zeroth to-be-executed instruction is finished, and the accuracy of the operation result is ensured.
In one possible implementation, the instruction format of the anti-pooling instruction may be:
unpool dst src0 srcChannel srcHeight srcWidth dstHeight dstWidth kernelHeight kernelWidth
Here, unpool is the operation code of the anti-pooling instruction, and dst, src0, srcChannel, srcHeight, srcWidth, dstHeight, dstWidth, kernelHeight, and kernelWidth are its operation domain: dst is the target address, src0 is the data address to be processed, srcChannel is the number of input channels, srcHeight is the input height, srcWidth is the input width, dstHeight is the output height, dstWidth is the output width, kernelHeight is the pooling kernel height, and kernelWidth is the pooling kernel width. That is, the data to be processed obtained from src0 has srcChannel input channels, input height srcHeight, and input width srcWidth. The pooling kernel has height kernelHeight and width kernelWidth. The stride of each movement of the pooling kernel takes the default value, i.e. kernelWidth in the width direction and kernelHeight in the height direction. The output has srcChannel output channels, output height dstHeight, and output width dstWidth. The operation result after inverse pooling is stored at address dst.
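Decoding the fields of this textual format can be sketched as follows. This is purely illustrative: the real control module decodes a binary instruction encoding, and the Python names below are ours.

```python
from collections import namedtuple

# Field order follows the unpool instruction format described above.
UnpoolInstr = namedtuple("UnpoolInstr", [
    "dst", "src0", "srcChannel", "srcHeight", "srcWidth",
    "dstHeight", "dstWidth", "kernelHeight", "kernelWidth"])

def parse_unpool(text):
    opcode, *fields = text.split()
    if opcode != "unpool":
        raise ValueError("not an unpool instruction")
    return UnpoolInstr(*(int(f) for f in fields))

# 2x1 kernel with default strides (2 in height, 1 in width):
# output height = (32 - 1) * 2 + 2 = 64, output width = (32 - 1) * 1 + 1 = 32.
instr = parse_unpool("unpool 500 100 5 32 32 64 32 2 1")
assert instr.dst == 500 and instr.src0 == 100
assert (instr.dstHeight, instr.dstWidth) == (64, 32)
```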
In one possible implementation, the instruction format of the anti-pooling instruction may be:
unpool dst src0 srcChannel dstHeight dstWidth kernelHeight kernelWidth strideX strideY index
Here, unpool is the operation code of the anti-pooling instruction, and dst, src0, srcChannel, dstHeight, dstWidth, kernelHeight, kernelWidth, strideX, strideY, and index are its operation domain: dst is the target address, src0 is the data address to be processed, srcChannel is the number of input channels, dstHeight is the output height, dstWidth is the output width, kernelHeight is the pooling kernel height, kernelWidth is the pooling kernel width, strideX is the first stride by which the pooling kernel moves in the width direction, strideY is the second stride by which it moves in the height direction, and index is the input index. That is, the size of the data to be processed obtained from src0 is derived from the output size: the number of input channels is srcChannel, the input height is srcHeight = (dstHeight - kernelHeight)/strideY + 1, and the input width is srcWidth = (dstWidth - kernelWidth)/strideX + 1. The pooling kernel has height kernelHeight and width kernelWidth, and moves strideX per step in the width direction and strideY per step in the height direction. The output has srcChannel output channels, output height dstHeight, and output width dstWidth. The operation result after inverse pooling is stored at address dst.
It should be understood that the operation code, and the positions of the operation code and the operation domain in the instruction format of the anti-pooling instruction, may be set by those skilled in the art as needed, and the disclosure is not limited thereto.
In one possible implementation manner, the apparatus may be disposed in one or more of a Graphics Processing Unit (GPU), a Central Processing Unit (CPU), and an embedded Neural Network Processor (NPU).
It should be noted that, although the anti-pooling instruction processing apparatus is described above by taking the above-described embodiment as an example, those skilled in the art will appreciate that the present disclosure should not be limited thereto. In fact, the user can flexibly set each module according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.
Application example
An application example according to an embodiment of the present disclosure is given below, taking "performing an inverse pooling operation using an inverse pooling instruction processing apparatus" as an exemplary application scenario, to facilitate understanding of the flow of the inverse pooling instruction processing apparatus. It is understood by those skilled in the art that the following application example is merely for facilitating understanding of the embodiments of the present disclosure and should not be construed as limiting them.
Fig. 5 shows a schematic diagram of an application scenario of an inverse pooling instruction processing apparatus according to an embodiment of the present disclosure. As shown in fig. 5, the inverse pooling instruction processing apparatus processes the inverse pooling instruction as follows:
the control module 11 analyzes the obtained inverse pooling instruction 1 (for example, the inverse pooling instruction 1 is unprool 500100564322112), and obtains an operation code and an operation domain of the inverse pooling instruction 1. The opcode of the anti-pooling instruction 1 is unprool, the target address is 500, the address of the data to be processed is 100, the number of input channels is 5, the output height is 64, the output width is 32, the height of the pooling core is 2, the width of the pooling core is 1, the first stride is 1, and the second stride is 2. The control module 11 obtains the scale of the data to be processed according to the operation code, and one way is calculated by the following formula:
output height = (input height - 1) × second stride + pooling kernel height, output width = (input width - 1) × first stride + pooling kernel width
The input height is 32 and the input width is 32, so the control module 11 obtains 32 × 32 × 5 data to be processed from the data to be processed address 100.
The operation module 12 performs inverse pooling operation on the data to be processed with the scale of 32 × 32 on 5 input channels by using the pooling core, respectively, to obtain an operation result, and stores the operation result in the target address 500.
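The size derivation in this application example can be replayed numerically (a sketch of our own, using the operation domain values stated above):

```python
# Operation domain of inverse pooling instruction 1:
# target 500, source 100, 5 channels, output 64x32,
# pooling kernel 2x1, first stride 1, second stride 2.
dst, src0, channels = 500, 100, 5
out_h, out_w = 64, 32
kernel_h, kernel_w = 2, 1
stride_x, stride_y = 1, 2

# Invert: output = (input - 1) * stride + kernel size.
in_h = (out_h - kernel_h) // stride_y + 1
in_w = (out_w - kernel_w) // stride_x + 1
assert (in_h, in_w, channels) == (32, 32, 5)
# The control module therefore reads 32 x 32 x 5 elements from address 100.
```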
The working process of the above modules can refer to the above related description.
Therefore, the anti-pooling instruction can be efficiently and quickly processed, and the efficiency and the speed of anti-pooling operation are remarkably improved.
The present disclosure provides a machine learning arithmetic device, which may include one or more of the above-described inverse pooling instruction processing devices and is configured to acquire data to be processed and control information from other processing devices and perform specified machine learning operations. The machine learning arithmetic device may obtain the inverse pooling instruction from other machine learning arithmetic devices or non-machine-learning arithmetic devices, and transmit the execution result to peripheral equipment (also called other processing devices) through an I/O interface. Peripheral devices include, for example, cameras, displays, mice, keyboards, network cards, Wi-Fi interfaces, and servers. When more than one inverse pooling instruction processing device is included, these devices may be linked and transmit data through a specific structure, for example interconnected over a PCIE bus, to support larger-scale neural network operations. In this case they may share one control system or have separate control systems, and may share memory or have a separate memory for each accelerator. In addition, any interconnection topology may be used.
The machine learning arithmetic device has high compatibility and can be connected with various types of servers through PCIE interfaces.
Fig. 6a shows a block diagram of a combined processing device according to an embodiment of the present disclosure. As shown in fig. 6a, the combined processing device includes the machine learning arithmetic device, the universal interconnection interface and other processing devices. The machine learning arithmetic device interacts with other processing devices to jointly complete the operation designated by the user.
The other processing devices include one or more types of general-purpose/special-purpose processors, such as Central Processing Units (CPUs), Graphics Processing Units (GPUs), neural network processors, and the like. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning arithmetic device and external data and control, including data transfer, and complete basic control such as starting and stopping of the machine learning arithmetic device; the other processing devices may also cooperate with the machine learning arithmetic device to complete computing tasks.
The universal interconnection interface is used to transmit data and control instructions between the machine learning arithmetic device and the other processing devices. The machine learning arithmetic device obtains the required input data from the other processing devices and writes it into a storage device on the machine learning arithmetic device chip; it can obtain control instructions from the other processing devices and write them into a control cache on the chip; it can also read the data in its storage module and transmit that data to the other processing devices.
Fig. 6b shows a block diagram of a combined processing device according to an embodiment of the present disclosure. In a possible implementation manner, as shown in fig. 6b, the combination processing device may further include a storage device, and the storage device is connected to the machine learning operation device and the other processing device respectively. The storage device is used to store data stored in the machine learning arithmetic device and the other processing devices, and is particularly suitable for data which is required to be calculated and cannot be stored in the internal storage of the machine learning arithmetic device or the other processing devices.
The combined processing device can be used as a system-on-chip (SoC) for devices such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
The present disclosure provides a machine learning chip, which includes the above machine learning arithmetic device or combined processing device.
The present disclosure provides a machine learning chip package structure, which includes the above machine learning chip.
Fig. 7 shows a schematic structural diagram of a board card according to an embodiment of the present disclosure. As shown in fig. 7, the board card includes the above-mentioned machine learning chip package structure or the above-mentioned machine learning chip. In addition to the machine learning chip 389, the board card may include other components, including but not limited to: a memory device 390, an interface device 391, and a control device 392.
The memory device 390 is coupled to a machine learning chip 389 (or a machine learning chip within a machine learning chip package structure) via a bus for storing data. Memory device 390 may include multiple sets of memory cells 393. Each group of memory cells 393 is coupled to a machine learning chip 389 via a bus. It is understood that each group 393 may be a DDR SDRAM (Double Data Rate SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency: data is transferred on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM.
In one embodiment, the memory device 390 may include 4 groups of memory cells 393. Each group of memory cells 393 may include multiple DDR4 chips. In one embodiment, the machine learning chip 389 may include four 72-bit DDR4 controllers, in which 64 bits are used for data transmission and 8 bits for ECC checking.
In one embodiment, each group 393 of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling DDR is provided in the machine learning chip 389 for controlling data transfer and data storage of each memory unit 393.
The interface device 391 is electrically coupled to the machine learning chip 389 (or a machine learning chip within a machine learning chip package structure). The interface device 391 is used to implement data transmission between the machine learning chip 389 and an external device (e.g., a server or a computer). For example, in one embodiment, the interface device 391 may be a standard PCIE interface, and the data to be processed is transmitted by the server to the machine learning chip 389 through the standard PCIE interface to implement the data transfer. In another embodiment, the interface device 391 may also be another interface; the disclosure does not limit the specific form of that interface, as long as the interface device can implement the transfer function. In addition, the calculation result of the machine learning chip is transmitted back to the external device (e.g., the server) by the interface device.
The control device 392 is electrically connected to the machine learning chip 389 and is used to monitor the state of the machine learning chip 389. Specifically, the machine learning chip 389 and the control device 392 may be connected through an SPI interface. The control device 392 may include a single-chip microcomputer (MCU). The machine learning chip 389 may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can thus drive multiple loads; as a result, the machine learning chip 389 can be in different operating states such as heavy load and light load. The control device can regulate the working states of the processing chips, processing cores, and/or processing circuits in the machine learning chip.
The present disclosure provides an electronic device, which includes the above machine learning chip or board card.
The electronic device may include a data processing apparatus, a computer device, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a tachograph, a navigator, a sensor, a camera, a server, a cloud server, a camcorder, a projector, a watch, an earphone, a mobile storage device, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle may include an aircraft, a ship, and/or a vehicle. The household appliances may include televisions, air conditioners, microwave ovens, refrigerators, electric rice cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods. The medical device may include a nuclear magnetic resonance apparatus, a B-mode ultrasound apparatus and/or an electrocardiograph.
FIG. 8 illustrates a flowchart of an inverse pooling instruction processing method according to an embodiment of the present disclosure. The method can be applied to a computer device or the like that includes a memory and a processor, where the memory is used to store data used during execution of the method, and the processor is used to execute the relevant processing and operation steps, such as step S51 and step S52 described below. As shown in fig. 8, the method is applied to the above-described inverse pooling instruction processing apparatus and includes step S51 and step S52.
In step S51, the obtained inverse pooling instruction is analyzed to obtain an operation code and an operation domain of the inverse pooling instruction, and the to-be-processed data, the input index, the pooling core, and the target address required for executing the inverse pooling instruction are obtained according to the operation domain. The operation code is used for indicating that the operation of the anti-pooling instruction on the data is anti-pooling operation, and the operation domain comprises a data address to be processed, an input index, a target address and a pooling core.
In step S52, inverse pooling operation is performed on the data to be processed according to the pooling core and the input index to obtain an operation result, and the operation result is stored in the target address.
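The core computation of steps S51 and S52 amounts to scattering each pooled value back to the position recorded by its input index. The following pure-Python sketch illustrates this for max unpooling; all names are hypothetical, and a non-overlapping kernel whose indexes increase row by row within each region is assumed:

```python
def max_unpool_2d(data, indices, kernel_h, kernel_w):
    """Inverse of max pooling: each pooled value is written back to the
    position inside its kernel-sized region given by its input index
    (row-major within the region); all other positions are zero."""
    in_h, in_w = len(data), len(data[0])
    out_h, out_w = in_h * kernel_h, in_w * kernel_w
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(in_h):
        for j in range(in_w):
            idx = indices[i][j]              # index within the k_h*k_w region
            di, dj = divmod(idx, kernel_w)   # row-major offset in the region
            out[i * kernel_h + di][j * kernel_w + dj] = data[i][j]
    return out

# A 2x2 pooled input unpooled with a 2x2 kernel into a 4x4 output.
data = [[5.0, 7.0],
        [3.0, 9.0]]
indices = [[0, 3],
           [2, 1]]
result = max_unpool_2d(data, indices, 2, 2)
```

In the device, the comparator performs the equivalent of the index match (comparing each region position's index against the input index) rather than a direct scatter, but the resulting output is the same.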
In one possible implementation, the data to be processed corresponds to one input index.
In one possible implementation, the data to be processed corresponds to the input indexes one to one.
In one possible implementation, a part of the data to be processed corresponds to one input index.
In a possible implementation manner, performing inverse pooling operation on the to-be-processed data according to the pooling core and the input index to obtain an operation result includes: performing, with the comparator, a comparison operation between the indexes in the area corresponding to the pooling core and the corresponding input indexes to obtain a comparison result, and obtaining the operation result according to the comparison result.
In one possible implementation, the indexes in the area corresponding to the pooling core increase sequentially by row, increase sequentially by column, or are determined according to a lookup table.
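The three index orderings can be sketched as follows (the helper name is hypothetical; the lookup-table case is represented by a caller-supplied table):

```python
def region_index_positions(kernel_h, kernel_w, order="row", table=None):
    """Positions of a kernel_h x kernel_w region, listed in the order in
    which the region's indexes 0, 1, 2, ... increase."""
    if order == "row":    # increase along rows: left-to-right, then next row
        return [(i, j) for i in range(kernel_h) for j in range(kernel_w)]
    if order == "col":    # increase along columns: top-to-bottom, then next column
        return [(i, j) for j in range(kernel_w) for i in range(kernel_h)]
    if order == "table":  # arbitrary order supplied via a lookup table
        return list(table)
    raise ValueError(order)
```

For a 2 × 2 region, row order enumerates (0,0), (0,1), (1,0), (1,1), while column order enumerates (0,0), (1,0), (0,1), (1,1); the input index stored during pooling must use the same convention as the unpooling hardware.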
In one possible implementation, the operation module comprises a master operation submodule and a plurality of slave operation submodules, the master operation submodules comprise the comparators,
wherein, the performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result into the target address includes:
and comparing the indexes in the area corresponding to the pooling core with the input indexes corresponding to the input indexes by using the comparator to obtain a comparison result, obtaining an operation result according to the comparison result, and storing the operation result into the target address.
In one possible implementation, the operation module comprises a master operation submodule and a plurality of slave operation submodules, the slave operation submodules comprise the comparator,
wherein the performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result in the target address, includes:
and comparing and operating a plurality of data to be processed in the area corresponding to the pooling core by using the plurality of comparators to obtain a comparison result, obtaining an operation result, and storing the operation result into the target address.
Receiving the data to be processed, the input index, the pooling core and the target address which are acquired by a control module and are required for executing the anti-pooling instruction, and allocating and transmitting the data to be processed, the input index, the pooling core and the target address which are required for executing the anti-pooling instruction respectively to a slave operation sub-module;
and receiving the data to be processed, the input index, the pooling core and the target address which are distributed and transmitted by the main operation sub-module and are required by executing the anti-pooling instruction, comparing the index in the area corresponding to the pooling core with the corresponding input index by using the comparator to obtain a comparison result, obtaining an operation result according to the comparison result, and storing the operation result into the target address.
In a possible implementation manner, the operation domain further includes an input height and an input width, where the obtaining, according to the operation domain, the to-be-processed data, the input index, the pooling core, and the target address required for executing the inverse pooling instruction includes: acquiring the data to be processed corresponding to the input width and the input height from the to-be-processed data address.
In a possible implementation manner, the operation domain further includes an output height and an output width, where the method further includes: writing the operation result into the target address, the height of the operation result being the output height and the width of the operation result being the output width.
In a possible implementation manner, the operation domain further includes an output height and an output width, where the obtaining, according to the operation domain, the to-be-processed data, the input index, the pooling core, and the target address required for executing the inverse pooling instruction includes: obtaining an input height and an input width according to the output height and the output width respectively, and acquiring the data to be processed corresponding to the input width and the input height from the to-be-processed data address.
In a possible implementation manner, the operation domain further includes an input channel number, where the obtaining, according to the operation domain, the to-be-processed data, the input index, the pooling core, and the target address required for executing the inverse pooling instruction includes: acquiring the data to be processed corresponding to the number of input channels from the to-be-processed data address.
In a possible implementation manner, the operation domain further includes a pooling kernel height and a pooling kernel width, where the performing inverse pooling operation on the data to be processed according to the pooling core and the input index includes: performing inverse pooling operation on the data to be processed according to the pooling kernel height and the pooling kernel width.
In one possible implementation, the operation domain may further include a first stride. Performing inverse pooling operation on the data to be processed according to the pooling core and the input index may include: moving the pooling core in the width direction according to the first stride.
In one possible implementation, the operation domain may further include a second stride. Performing inverse pooling operation on the data to be processed according to the pooling core and the input index may include: moving the pooling core in the height direction according to the second stride.
In one possible implementation, the operation domain may further include a first stride and a second stride. Performing inverse pooling operation on the data to be processed according to the pooling core and the input index may include: moving the pooling core in the width direction according to the first stride and moving the pooling core in the height direction according to the second stride.
In a possible implementation manner, performing inverse pooling operation on the to-be-processed data according to the pooled kernel and the input index further includes:
when the pooling core moves with overlap, the operation results are accumulated at the overlapping positions,
wherein the pooling core moves with overlap in at least one of the following cases:
when the operation domain contains the first stride, the first stride is less than the pooled core width;
when the operation domain includes the second stride, the second stride is less than the pooled core height.
In a possible implementation manner, performing inverse pooling operation on the to-be-processed data according to the pooled kernel and the input index further includes:
when the pooling core moves with overlap, the operation results are accumulated at the overlapping positions,
wherein, when the operation domain includes the first stride and the second stride, the pooling core moves with overlap if the first stride is less than the pooling kernel width, or the second stride is less than the pooling kernel height, or both.
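The overlap rule can be illustrated in one dimension: with a stride smaller than the kernel width, adjacent kernel windows share output positions, and values written there are accumulated. A sketch with hypothetical names:

```python
def unpool_1d_overlap(data, indices, kernel, stride):
    """1-D sketch of unpooling with stride < kernel: each input value is
    written at offset indices[i] inside its kernel window; where windows
    overlap, the written results are accumulated (summed)."""
    out_len = (len(data) - 1) * stride + kernel
    out = [0.0] * out_len
    for i, v in enumerate(data):
        out[i * stride + indices[i]] += v   # += accumulates at overlaps
    return out

# kernel 3, stride 2: windows [0..2] and [2..4] overlap at position 2,
# so both values land there and are summed.
result = unpool_1d_overlap([1.0, 10.0], [2, 0], kernel=3, stride=2)
```

The two-dimensional case applies the same accumulation independently along the width (first stride) and height (second stride).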
In a possible implementation manner, performing inverse pooling operation on the to-be-processed data according to the pooled kernel and the input index further includes:
when the pooling core moves with a gap, a default value is written into the operation result at the gap positions,
wherein the pooling core moves with a gap in at least one of the following cases:
when the operation domain includes the first stride, the first stride is greater than the pooled kernel width;
when the operation domain includes the second stride, the second stride is greater than the pooled core height.
In a possible implementation manner, performing inverse pooling operation on the to-be-processed data according to the pooled kernel and the input index further includes:
when the pooling core moves with a gap, a default value is written into the operation result at the gap positions,
wherein, when the operation domain includes the first stride and the second stride, the pooling core moves with a gap if the first stride is greater than the pooling kernel width, or the second stride is greater than the pooling kernel height, or both.
In a possible implementation manner, the default value is zero.
In one possible implementation, the method may further include: and storing the data to be processed and the operation result by using a storage module of the device.
In a possible implementation manner, parsing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction may include:
storing an inverse pooling instruction;
analyzing the anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction;
the method includes storing an instruction queue, where the instruction queue includes a plurality of instructions to be executed that are sequentially arranged according to an execution order, and the plurality of instructions to be executed may include anti-pooling instructions.
In one possible implementation, the method may further include: when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction before it, caching the first to-be-executed instruction, and executing the first to-be-executed instruction after the zeroth to-be-executed instruction has been executed,
wherein the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before it includes at least one of the following:
a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area;
a first arithmetic unit used for executing the first to-be-executed instruction is completely or partially the same as a zeroth arithmetic unit used for executing the zeroth to-be-executed instruction.
It should be noted that, although the above-mentioned embodiment is taken as an example to describe the anti-pooling instruction processing method, those skilled in the art can understand that the disclosure should not be limited thereto. In fact, the user can flexibly set each step according to personal preference and/or actual application scene, as long as the technical scheme of the disclosure is met.
The anti-pooling instruction processing method provided by the embodiment of the disclosure has the advantages of wide application range, high processing efficiency and high processing speed of anti-pooling instructions, and high efficiency and high speed of anti-pooling operation.
The present disclosure also provides a non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the above-mentioned inverse pooling instruction processing method.
It should be noted that for simplicity of description, the above-described method embodiments are shown as a series of combinations of acts, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, although the steps in the flowchart of fig. 8 are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in fig. 8 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of performing these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
In addition, unless otherwise specified, each functional unit/module in the embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The integrated units/modules may be implemented in the form of hardware or software program modules.
If the integrated unit/module is implemented in hardware, the hardware may be a digital circuit, an analog circuit, or the like. Physical implementations of the hardware structures include, but are not limited to, transistors, memristors, and the like. Unless otherwise specified, the storage module may be any suitable magnetic or magneto-optical storage medium, such as Resistive Random Access Memory (RRAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Enhanced Dynamic Random Access Memory (EDRAM), High-Bandwidth Memory (HBM), Hybrid Memory Cube (HMC), and the like.
The integrated units/modules, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The foregoing may be better understood in light of the following clauses:
clause A1, an anti-pooling instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and acquiring data to be processed, an input index, a pooling core and a target address which are required by executing the anti-pooling instruction according to the operation domain;
and the operation module is used for performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result and storing the operation result into the target address.
Clause A2, the apparatus of clause A1, including:
the data to be processed corresponds to one of the input indexes.
Clause A3, the apparatus of clause A1, including:
and the data to be processed corresponds to the input indexes one by one.
Clause A4, the apparatus of clause A1, comprising:
and part of the data to be processed corresponds to one input index.
Clause A5, the apparatus of any one of clauses A1-A4, the computing module comprising:
and the comparator is used for carrying out comparison operation on the indexes in the area corresponding to the pooling core and the corresponding input indexes to obtain a comparison result, and obtaining an operation result according to the comparison result.
Clause A6, the apparatus of clause A5, the computing module comprising:
and the indexes in the area corresponding to the pooling core are sequentially increased in rows, sequentially increased in columns or searched according to a lookup table.
Clause A7, the apparatus of clause A5 or clause A6, the calculation module comprising a master calculation sub-module and a plurality of slave calculation sub-modules, the master calculation sub-module comprising the comparator,
the main operation sub-module is used for comparing, by using the comparator, the indexes in the area corresponding to the pooling core with the corresponding input indexes to obtain a comparison result, obtaining an operation result according to the comparison result, and storing the operation result into the target address.
Clause A8, the apparatus of clause A5 or clause A6, the calculation module comprising a master calculation sub-module and a plurality of slave calculation sub-modules, the slave calculation sub-modules comprising the comparator,
the main operation sub-module is configured to receive the to-be-processed data, the input index, the pooling core, and the target address, which are acquired by the control module and are required for executing the inverse pooling instruction, and allocate and transmit the to-be-processed data, the input index, the pooling core, and the target address, which are required for executing the inverse pooling instruction, to the slave operation sub-module;
the slave operation submodule is configured to receive the to-be-processed data, the input index, the pooling core, and the target address, which are distributed and transmitted by the master operation submodule and are required to execute the inverse pooling instruction, perform a comparison operation on the index in the region corresponding to the pooling core and the input index corresponding to the pooling core by using the comparator to obtain a comparison result, obtain an operation result according to the comparison result, and store the operation result in the target address.
Clause A9, the apparatus of any one of clauses A1-A8, the operational field further comprising an input height and an input width,
the control module is further configured to obtain the to-be-processed data corresponding to the input width and the input height from the to-be-processed data address.
Clause a10, the apparatus of any one of clauses A1 to clause A8, the operational domain further comprising an output height and an output width,
the control module is further configured to write the operation result into the destination address, where the height of the operation result is the output height, and the width of the operation result is the output width.
Clause a11, the apparatus of clause a10, the operation domain further comprising an output height and an output width,
the control module is further configured to obtain an input height and an input width according to the output height and the output width respectively, and obtain the data to be processed corresponding to the input width and the input height from the to-be-processed data address.
Clause a12, the apparatus of any one of clauses A1 to clause A8, the operational field further comprising inputting a number of channels,
the control module is further configured to obtain the to-be-processed data corresponding to the number of the input channels from the to-be-processed data address.
Clause a13, the apparatus of any one of clauses A1-A8, the operational domain further comprising a pooled kernel height and a pooled kernel width,
the operation module is further configured to perform inverse pooling operation on the data to be processed according to the pooling core height and the pooling core width.
Clause a14, the apparatus of any one of clauses A1-A8, the operational domain further comprising a first stride and/or a second stride,
wherein the operation module is further configured to move the pooling kernel in the width direction according to the first stride and/or move the pooling kernel in the height direction according to the second stride.
Clause a15, the apparatus of clause a14, the calculation module further configured to accumulate the operation results at the overlapping positions when the pooling core moves with overlap,
wherein the pooling core moves with overlap in at least one of the following cases:
when the operation domain contains the first stride, the first stride is less than the pooled core width;
when the operation domain includes the second stride, the second stride is less than the pooled core height.
Clause a16, the apparatus of clause a14, the calculation module further configured to write a default value into the operation result at the gap positions when the pooling core moves with a gap,
wherein the pooling core moves with a gap in at least one of the following cases:
when the operation domain includes the first stride, the first stride is greater than the pooled kernel width;
when the operation domain includes the second stride, the second stride is greater than the pooled core height.
Clause A17, the apparatus of clause A16, wherein the default value is zero.
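Clauses A14 to A17 describe the pooling core moving by strides, accumulating results where windows overlap and leaving a default value of zero in any gaps. As a hedged illustration only — the function name and the one-dimensional simplification are this example's own assumptions, not the patented apparatus — the behaviour can be sketched as:

```python
def unpool_1d_sketch(values, indexes, kernel_w, stride, out_w):
    """Hypothetical one-dimensional sketch of clauses A14-A17: each
    pooled value is written back into its kernel window at the position
    named by its input index."""
    out = [0.0] * out_w                # default value is zero (clause A17)
    for i, (v, idx) in enumerate(zip(values, indexes)):
        assert 0 <= idx < kernel_w     # index addresses a position inside the window
        pos = i * stride + idx         # window origin plus in-window index
        out[pos] += v                  # accumulate where windows overlap (clause A15)
    return out
```

With stride 1 and kernel width 2 the windows overlap and collided values add up; with stride 3 the gaps between windows keep the default zero.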
Clause A18, the apparatus of clause A1, further comprising:
and the storage module is used for storing the data to be processed and the operation result.
Clause A19, the apparatus of clause A1, the control module comprising:
the instruction storage submodule is used for storing the anti-pooling instruction;
the instruction processing submodule is used for analyzing the anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction;
and the queue storage submodule is used for storing an instruction queue, the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the anti-pooling instruction.
Clause A20, the apparatus of clause A19, the control module further comprising:
the dependency relationship processing submodule is used for caching a first to-be-executed instruction in the instruction storage submodule when it is determined that the first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, and, after the zeroth to-be-executed instruction has finished executing, extracting the first to-be-executed instruction from the instruction storage submodule and sending it to the operation module,
wherein the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction comprises at least one of the following:
a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area;
and a first arithmetic unit used for executing the first to-be-executed instruction is completely or partially the same as a zeroth arithmetic unit used for executing the zeroth to-be-executed instruction.
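The association test of clause A20 can be sketched as a simple predicate. This is an illustrative reading, not the patented circuit; the function and parameter names are invented for the example, and address intervals are treated as half-open:

```python
def has_dependency(first_addrs, zeroth_addrs, first_units, zeroth_units):
    """Clause A20 association test (illustrative): the first instruction
    must wait for the zeroth one when their storage address intervals
    overlap, or when the arithmetic units they use are wholly or
    partially the same."""
    f_lo, f_hi = first_addrs
    z_lo, z_hi = zeroth_addrs
    address_overlap = f_lo < z_hi and z_lo < f_hi            # intervals intersect
    unit_overlap = bool(set(first_units) & set(zeroth_units))  # shared arithmetic units
    return address_overlap or unit_overlap
```

When the predicate holds, the first instruction is cached and only issued after the zeroth one completes, as the clause describes.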
Clause A21, a machine learning arithmetic device, the device comprising:
one or more anti-pooling instruction processing devices according to any one of clauses A1 to A20, configured to obtain data to be processed and control information from other processing devices, perform a specified machine learning operation, and transmit the execution result to the other processing devices through an I/O interface;
when the machine learning arithmetic device comprises a plurality of the anti-pooling instruction processing devices, the plurality of the anti-pooling instruction processing devices can be connected through a specific structure and transmit data;
wherein the plurality of anti-pooling instruction processing devices are interconnected through a PCIE (peripheral component interconnect express) bus and transmit data, so as to support larger-scale machine learning operations; the plurality of anti-pooling instruction processing devices share the same control system or have their own respective control systems; the plurality of anti-pooling instruction processing devices share a memory or have their own respective memories; and the interconnection mode of the plurality of anti-pooling instruction processing devices may be any interconnection topology.
Clause A22, a combined processing apparatus, comprising:
the machine learning arithmetic device of clause A21, a universal interconnection interface, and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning calculation device and the other processing device, respectively, for storing data of the machine learning calculation device and the other processing device.
Clause A23, a machine learning chip, the machine learning chip comprising:
the machine learning arithmetic device of clause A21 or the combined processing apparatus of clause A22.
Clause A24, an electronic device, comprising:
the machine learning chip of clause A23.
Clause A25, a board card, comprising: a storage device, an interface device, a control device, and the machine learning chip of clause A23;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
Clause A26, an anti-pooling instruction processing method applied to an anti-pooling instruction processing apparatus, the method comprising:
analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and obtaining data to be processed, an input index, a pooling core and a target address required for executing the anti-pooling instruction according to the operation code and the operation domain;
and performing an inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result into the target address.
Clause A27, the method of clause A26, wherein the data to be processed corresponds to one input index.
Clause A28, the method of clause A26, wherein the data to be processed corresponds to the input indexes one to one.
Clause A29, the method of clause A26, wherein a part of the data to be processed corresponds to one input index.
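Clauses A27 to A29 distinguish how the data to be processed map onto input indexes. A minimal sketch of the one-to-one case of clause A28, in which a forward max pooling records one in-window index per pooled datum and the inverse pooling scatters each datum back through that index — the function names and the square non-overlapping windows are this example's own simplifications:

```python
def max_pool_with_index(x, k):
    """Forward k x k max pooling over a list-of-lists matrix, recording
    for each pooled datum the flat in-window index of the maximum
    (row-wise increasing indexes, as in clause A31)."""
    h, w = len(x), len(x[0])
    out, idx = [], []
    for i in range(0, h, k):
        out_row, idx_row = [], []
        for j in range(0, w, k):
            window = [x[i + r][j + c] for r in range(k) for c in range(k)]
            best = max(range(k * k), key=lambda n: window[n])
            out_row.append(window[best])
            idx_row.append(best)
        out.append(out_row)
        idx.append(idx_row)
    return out, idx

def unpool_with_index(y, idx, k):
    """Inverse pooling for the one-to-one case (clause A28): each datum
    is written back at the position named by its input index; every
    other position keeps the default value zero."""
    h, w = len(y), len(y[0])
    out = [[0.0] * (w * k) for _ in range(h * k)]
    for i in range(h):
        for j in range(w):
            r, c = divmod(idx[i][j], k)     # in-window row and column
            out[i * k + r][j * k + c] = y[i][j]
    return out
```

Pooling a 2x2 input to one datum keeps only the maximum and its index; the inverse operation restores that value at its original position and zeros elsewhere.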
Clause A30, the method of any one of clauses A26 to A29, wherein the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result includes:
performing, by using a comparator, a comparison operation between the indexes in the area corresponding to the pooling core and the corresponding input index to obtain a comparison result, and obtaining the operation result according to the comparison result.
Clause A31, the method of clause A30, wherein the indexes in the area corresponding to the pooling core are sequentially increased by row, sequentially increased by column, or determined according to a lookup table.
Clause A32, the method of clause A30 or A31, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module comprising the comparator,
wherein the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result into the target address includes:
performing, by using the comparator, a comparison operation between the indexes in the area corresponding to the pooling core and the corresponding input indexes to obtain a comparison result, obtaining an operation result according to the comparison result, and storing the operation result into the target address.
Clause A33, the method of clause A30 or A31, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, the slave operation sub-modules comprising the comparators,
wherein the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, and storing the operation result into the target address includes:
receiving, by the master operation sub-module, the data to be processed, the input index, the pooling core and the target address required for executing the anti-pooling instruction acquired by the control module, and distributing and transmitting them to the slave operation sub-modules;
and performing, by the slave operation sub-modules using the comparators, comparison operations between the indexes in the areas corresponding to the pooling core and the corresponding input indexes to obtain comparison results, obtaining operation results according to the comparison results, and storing the operation results into the target address.
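Clauses A30 to A33 derive the operation result from a comparator: each index in the pooling-core area is compared against the input index, and the datum is written where the comparison succeeds. An illustrative reading of that comparison-driven write, using row-wise increasing in-window indexes (clause A31) — the function name and parameters are invented for the example:

```python
def unpool_window_by_comparison(datum, input_index, kh, kw):
    """Illustrative sketch of clauses A30-A33: a comparator compares each
    index in the kh x kw pooling-core area (row-wise increasing) with the
    input index; the datum is written where they match, zero elsewhere."""
    return [[datum if r * kw + c == input_index else 0.0  # comparator: index == input index
             for c in range(kw)]
            for r in range(kh)]
```

In the master/slave arrangement of clause A33, each slave sub-module would run this comparison over its share of the data and write its window of results to the target address.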
Clause A34, the method of any one of clauses A26 to A33, the operation domain further comprising an input height and an input width,
wherein the obtaining, according to the operation domain, the data to be processed, the input index, the pooling core and the target address required for executing the anti-pooling instruction includes:
acquiring the data to be processed corresponding to the input width and the input height from the to-be-processed data address.
Clause A35, the method of any one of clauses A26 to A33, the operation domain further comprising an output height and an output width,
wherein the obtaining, according to the operation domain, the data to be processed, the input index, the pooling core and the target address required for executing the anti-pooling instruction includes:
writing the operation result into the target address, wherein the height of the operation result is the output height and the width of the operation result is the output width.
Clause A36, the method of clause A35, wherein the obtaining, according to the operation domain, the data to be processed, the input index, the pooling core and the target address required for executing the anti-pooling instruction includes:
obtaining an input height and an input width according to the output height and the output width, respectively, and acquiring the data to be processed corresponding to the input width and the input height from the to-be-processed data address.
Clause A37, the method of any one of clauses A26 to A33, the operation domain further comprising a number of input channels,
wherein the obtaining, according to the operation domain, the data to be processed, the input index, the pooling core and the target address required for executing the anti-pooling instruction includes:
acquiring the data to be processed corresponding to the number of input channels from the to-be-processed data address.
Clause A38, the method of any one of clauses A26 to A33, the operation domain further comprising a pooling core height and a pooling core width,
wherein the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index includes:
performing the inverse pooling operation on the data to be processed according to the pooling core height and the pooling core width.
Clause A39, the method of any one of clauses A26 to A33, the operation domain further comprising a first stride and/or a second stride,
wherein the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index includes:
moving the pooling core in the width direction according to the first stride and/or moving the pooling core in the height direction according to the second stride.
Clause A40, the method of clause A39, the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index further comprising:
accumulating the operation results at the overlapping positions when the pooling core moves with overlap,
wherein the pooling core moves with overlap in at least one of the following cases:
when the operation domain contains the first stride, the first stride is less than the pooling core width;
when the operation domain contains the second stride, the second stride is less than the pooling core height.
Clause A41, the method of clause A39, the performing the inverse pooling operation on the data to be processed according to the pooling core and the input index further comprising:
writing a default value at the spacing when the pooling core moves with spacing,
wherein the pooling core moves with spacing in at least one of the following cases:
when the operation domain contains the first stride, the first stride is greater than the pooling core width;
when the operation domain contains the second stride, the second stride is greater than the pooling core height.
Clause A42, the method of clause A41, wherein the default value is zero.
Clause A43, the method of clause A26, further comprising:
storing the data to be processed and the operation result by using a storage module of the apparatus.
Clause A44, the method of clause A26, wherein the analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction includes:
storing the anti-pooling instructions;
analyzing the anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction;
and storing an instruction queue, wherein the instruction queue comprises a plurality of instructions to be executed which are sequentially arranged according to an execution sequence, and the plurality of instructions to be executed comprise the anti-pooling instruction.
Clause A45, the method of clause A26, further comprising:
when it is determined that a first to-be-executed instruction among the plurality of to-be-executed instructions has an association relationship with a zeroth to-be-executed instruction preceding it, caching the first to-be-executed instruction, and after determining that execution of the zeroth to-be-executed instruction is completed, controlling execution of the first to-be-executed instruction,
wherein the association relationship between the first to-be-executed instruction and the zeroth to-be-executed instruction before the first to-be-executed instruction comprises at least one of the following:
a first storage address interval for storing the data required by the first instruction to be executed and a zeroth storage address interval for storing the data required by the zeroth instruction to be executed have an overlapped area;
and a first arithmetic unit used for executing the first to-be-executed instruction is completely or partially the same as a zeroth arithmetic unit used for executing the zeroth to-be-executed instruction.
Clause A46, a non-transitory computer readable storage medium having computer program instructions stored thereon that, when executed by a processor, implement the method of any one of clauses A26 to A45.
The foregoing detailed description of the embodiments of the present application illustrates the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and core concept of the present application. Meanwhile, a person skilled in the art may change the specific implementation and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. An anti-pooling instruction processing apparatus, the apparatus comprising:
the control module is used for analyzing the acquired anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and acquiring data to be processed, an input index, a pooling core and a target address which are required by the execution of the anti-pooling instruction according to the operation code and the operation domain;
the operation module is used for performing inverse pooling operation on the data to be processed according to the input index and the pooling core to obtain an operation result and storing the operation result into the target address,
the operation module comprises:
and the comparator is used for carrying out comparison operation on the index in the area corresponding to the pooling core and the input index to obtain a comparison result and obtaining an operation result according to the comparison result.
2. The apparatus of claim 1, wherein the operation module comprises a master operation sub-module and a plurality of slave operation sub-modules, the master operation sub-module comprising the comparator,
and the main operation sub-module is used for comparing the index in the area corresponding to the pooling core with the input index by using the comparator to obtain a comparison result, obtaining an operation result according to the comparison result and storing the operation result into the target address.
3. A machine learning arithmetic device, the device comprising:
one or more anti-pooling instruction processing devices as claimed in claim 1 or 2, configured to obtain data to be processed and control information from other processing devices, perform specified machine learning operations, and transmit the execution results to the other processing devices through an I/O interface;
when the machine learning operation device comprises a plurality of the anti-pooling instruction processing devices, the plurality of anti-pooling instruction processing devices can be connected through a specific structure and transmit data;
wherein the plurality of anti-pooling instruction processing devices are interconnected through a PCIE (peripheral component interconnect express) bus and transmit data, so as to support larger-scale machine learning operations; the plurality of anti-pooling instruction processing devices share the same control system or have their own respective control systems; the plurality of anti-pooling instruction processing devices share a memory or have their own respective memories; and the interconnection mode of the plurality of anti-pooling instruction processing devices may be any interconnection topology.
4. A combined processing apparatus, characterized in that the combined processing apparatus comprises:
the machine learning arithmetic device of claim 3, a universal interconnection interface, and other processing devices;
the machine learning arithmetic device interacts with the other processing devices to jointly complete the calculation operation designated by the user,
wherein the combination processing apparatus further comprises: and a storage device connected to the machine learning arithmetic device and the other processing device, respectively, for storing data of the machine learning arithmetic device and the other processing device.
5. A machine learning chip, the machine learning chip comprising:
the machine learning arithmetic device according to claim 3 or the combined processing apparatus according to claim 4.
6. An electronic device, characterized in that the electronic device comprises:
the machine learning chip of claim 5.
7. A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and a machine learning chip according to claim 5;
wherein the machine learning chip is connected with the storage device, the control device and the interface device respectively;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the machine learning chip and external equipment;
and the control device is used for monitoring the state of the machine learning chip.
8. An anti-pooling instruction processing method applied to an anti-pooling instruction processing apparatus, the method comprising:
analyzing the obtained anti-pooling instruction to obtain an operation code and an operation domain of the anti-pooling instruction, and obtaining data to be processed, an input index, a pooling core and a target address required for executing the anti-pooling instruction according to the operation code and the operation domain;
performing inverse pooling operation on the data to be processed according to the pooling core and the input index to obtain an operation result, storing the operation result into the target address,
wherein, the performing inverse pooling operation on the data to be processed according to the pooling kernel and the input index to obtain an operation result comprises:
and comparing the index in the area corresponding to the pooling core with the input index by using a comparator to obtain a comparison result, and obtaining an operation result according to the comparison result.
9. A non-transitory computer readable storage medium having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of claim 8.
CN201910747969.8A 2019-05-17 2019-08-14 Operation method, device, computer equipment and storage medium Active CN112395002B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910747969.8A CN112395002B (en) 2019-08-14 2019-08-14 Operation method, device, computer equipment and storage medium
PCT/CN2020/088248 WO2020233387A1 (en) 2019-05-17 2020-04-30 Command processing method and apparatus, and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910747969.8A CN112395002B (en) 2019-08-14 2019-08-14 Operation method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112395002A CN112395002A (en) 2021-02-23
CN112395002B true CN112395002B (en) 2023-04-18

Family

ID=74601334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747969.8A Active CN112395002B (en) 2019-05-17 2019-08-14 Operation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112395002B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010134134A1 (en) * 2009-05-22 2010-11-25 Hitachi, Ltd. Storage system comprising plurality of processor units
CN108009627A (en) * 2016-10-27 2018-05-08 Google LLC Neural network instruction set architecture
EP3346423A1 (en) * 2017-01-04 2018-07-11 STMicroelectronics Srl Deep convolutional network heterogeneous architecture system and device
WO2018184224A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Methods and systems for boosting deep neural networks for deep learning
CN110119807A (en) * 2018-10-12 2019-08-13 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Three-dimensional spatial join count exploiting CPU optimized STR R-tree; Ryuya Mitsuhashi et al.; 2016 IEEE International Conference on Big Data (Big Data); IEEE; 2017-02-06; full text *
Principle and Implementation of a WMLScript Interpreter; Shang Haizhong et al.; Computer Engineering; 2001-12-20 (No. 12); full text *
Design and Implementation of a Microprocessor with Convolutional Neural Network Extension Instructions; Ma Ke; China Master's Theses Full-text Database; 2018-12-15 (No. 12); full text *

Also Published As

Publication number Publication date
CN112395002A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN110096309B (en) Operation method, operation device, computer equipment and storage medium
CN110096310B (en) Operation method, operation device, computer equipment and storage medium
CN110119807B (en) Operation method, operation device, computer equipment and storage medium
CN111047005A (en) Operation method, operation device, computer equipment and storage medium
CN111353124A (en) Operation method, operation device, computer equipment and storage medium
CN112395002B (en) Operation method, device, computer equipment and storage medium
CN112395008A (en) Operation method, operation device, computer equipment and storage medium
CN112395009A (en) Operation method, operation device, computer equipment and storage medium
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN112396170B (en) Operation method, device, computer equipment and storage medium
CN109558565B (en) Operation method, device and related product
CN111061507A (en) Operation method, operation device, computer equipment and storage medium
CN109543835B (en) Operation method, device and related product
CN109542837B (en) Operation method, device and related product
CN112396169B (en) Operation method, device, computer equipment and storage medium
CN111026440B (en) Operation method, operation device, computer equipment and storage medium
CN111338694B (en) Operation method, device, computer equipment and storage medium
CN111353125B (en) Operation method, operation device, computer equipment and storage medium
CN111124497B (en) Operation method, operation device, computer equipment and storage medium
CN111339060B (en) Operation method, device, computer equipment and storage medium
CN112395007A (en) Operation method, operation device, computer equipment and storage medium
CN111290788B (en) Operation method, operation device, computer equipment and storage medium
CN112395001A (en) Operation method, operation device, computer equipment and storage medium
CN112395006A (en) Operation method, operation device, computer equipment and storage medium
CN111290789B (en) Operation method, operation device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant