CN116243975A - Operation method and related device - Google Patents

Operation method and related device

Info

Publication number
CN116243975A
CN116243975A (publication of application CN202111495663.1A)
Authority
CN
China
Prior art keywords: data, operator, target, processor, machine learning
Prior art date
Legal status: Pending
Application number
CN202111495663.1A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202111495663.1A
Publication of CN116243975A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003: Arrangements for executing specific machine instructions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks


Abstract

The embodiment of the application discloses an operation method and a related device. The operation method is applied to a board card comprising an operation device. The board card comprises a storage device, an interface device, a control device, and a neural network chip, and the neural network chip is connected to the storage device, the interface device, and the control device respectively. The storage device is used for storing data; the interface device is used for realizing data transmission between the neural network chip and external equipment; and the control device is used for monitoring the state of the neural network chip. By adopting the embodiment of the application, the success rate of operation can be improved.

Description

Operation method and related device
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to an operation method and a related device.
Background
Operators are the basic computing units of a deep learning framework (such as PyTorch, TensorFlow, etc.), and a rich operator library provides the framework with a powerful basic toolset for quickly building models. However, existing operator dispatch mechanisms, when run on a hardware backend (e.g., a machine learning processor (machine learning unit, MLU)), require the hardware to register an operator in advance and to implement and adapt the operator's specific functionality before the operator can be invoked. If an operator has not been registered, or its registration fails, calls to that operator fail, which hinders rapid integration and verification of the whole network.
Disclosure of Invention
The embodiment of the application provides an operation method and a related device, which can improve the success rate of operation.
In a first aspect, an embodiment of the present application provides an operation method, where the method includes:
acquiring a first calling instruction, wherein the first calling instruction is used for indicating a machine learning processor to call a target operator to operate on first data, and the first data is processable by the machine learning processor;
converting the first data into second data processable by the target processor in response to failure of the machine learning processor to invoke the target operator;
sending a second call instruction to the target processor, wherein the second call instruction is used for instructing the target processor to call the target operator to operate on the second data;
receiving third data obtained by operation of the target processor;
the third data is converted into fourth data processable by the machine learning processor.
In a possible implementation manner, the first call instruction includes an operator identifier of the target operator, and after the acquiring the first call instruction, the method further includes:
searching, based on the operator identifiers of the registered operators in a preset operator library, whether an operator identifier identical to the operator identifier of the target operator exists; and
if no operator identifier identical to the operator identifier of the target operator is found, determining that the machine learning processor fails to call the target operator.
In one possible implementation, after the determining that the machine learning processor fails to invoke the target operator, the method further includes:
determining an importance value and/or a computation amount of the target operator;
calculating a registration evaluation value of the target operator based on the importance value and/or the computation amount of the target operator;
if the registration evaluation value is smaller than a preset threshold value, executing the step of converting the first data into second data processable by the target processor; or
if the registration evaluation value is greater than or equal to the preset threshold value, registering the target operator in the machine learning processor, so that the machine learning processor stores the operator identifier and the operation logic of the target operator when the target operator is successfully registered.
In one possible implementation manner, the acquiring the first call instruction includes:
acquiring an operation instruction of data to be processed;
determining the operation logic of the data to be processed;
searching a target operator based on the operation logic;
acquiring first data which can be processed by a machine learning processor based on the data to be processed;
a first call instruction is generated based on the target operator and the first data.
In a possible implementation manner, after the searching for the target operator based on the operation logic, the method further includes:
acquiring reference data which can be processed by the target processor based on the data to be processed;
generating a third call instruction based on the target operator and the reference data;
and sending the third calling instruction to the target processor, wherein the third calling instruction is used for indicating the target processor to call the target operator to operate on the reference data.
In one possible implementation, before the converting the first data into the second data that can be processed by the target processor, the method further includes:
determining a reference processor that has registered the target operator;
if the number of the reference processors is greater than 1, determining the operation priority of the reference processors;
and taking the reference processor corresponding to the maximum value of the operation priority as the target processor.
In a second aspect, embodiments of the present application provide another operation method, where the method includes:
receiving a second calling instruction, wherein the second calling instruction is used for indicating a target processor to call a target operator to operate on second data, and the second data is obtained by converting first data;
invoking the target operator to operate the second data to obtain third data;
the third data is sent to a machine learning processor.
In one possible implementation, the second call instruction includes a target parameter declared by the target operator; the target operator comprises at least two functions, and parameters declared by each function are different;
and the calling the target operator to operate on the second data to obtain third data includes:
determining an objective function of the objective operator based on the objective parameter;
and operating the second data based on the objective function to obtain third data.
In a third aspect, embodiments of the present application provide an arithmetic device comprising units for performing the method of any one of the first or second aspects of embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a machine learning processor for performing the method of any one of the first aspects of the embodiments of the present application, the machine learning processor comprising: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
The controller unit is used for acquiring input data and calculation instructions;
the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
the main processing circuit is used for executing preamble processing on the input data and transmitting data and operation instructions with the plurality of auxiliary processing circuits;
the slave processing circuits are used for executing intermediate operation in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In a fifth aspect, embodiments of the present application provide a neural network chip, where the neural network chip is included in the computing device described in the third aspect of the embodiments of the present application, or the machine learning processor described in the fourth aspect of the embodiments of the present application.
In a sixth aspect, embodiments of the present application provide a board, where the board card includes the neural network chip according to the fifth aspect of embodiments of the present application.
In a seventh aspect, an embodiment of the present application provides an electronic device, where the electronic device includes an operation device according to the third aspect of the embodiment of the present application, or a machine learning processor according to the fourth aspect of the embodiment of the present application, or a neural network chip according to the fifth aspect of the embodiment of the present application, or a board card according to the sixth aspect of the embodiment of the present application.
In an eighth aspect, embodiments of the present application provide an electronic device comprising a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, the computer program comprising instructions for some or all of the steps as described in the first or second aspects of embodiments of the present application.
In a ninth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program that causes a computer to perform some or all of the steps as described in the first or second aspects of the embodiments of the present application.
After the first call instruction is acquired, if the machine learning processor fails to call the target operator, the first data is converted into second data which can be processed by the target processor. And then sending a second calling instruction to the target processor so that the target processor calls a target operator to operate on the second data. And then receiving the third data obtained by the operation of the target processor, and converting the third data into fourth data which can be processed by the machine learning processor. Therefore, after the machine learning processor fails to call the target operator, the target processor can be used for operation, and the success rate of operation is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained based on these drawings without inventive effort for a person skilled in the art. Wherein:
fig. 1A is a schematic structural diagram of a machine learning processor according to an embodiment of the present application;
fig. 1B is a schematic structural diagram of an operation unit of a machine learning processor according to an embodiment of the present application;
fig. 2 is a schematic flow chart of an operation method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an arithmetic device according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a board card according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will appreciate, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should also be understood that the term "and/or" is merely one association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
A machine learning processor (machine learning unit, MLU) as used herein is first described. Referring to fig. 1A, there is provided a machine learning processor for performing machine learning calculations, the machine learning processor comprising: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected with the arithmetic unit 12.
The controller unit 11 is used for acquiring input data and calculation instructions. In one possible implementation, the input data and the calculation instructions may be obtained through a data input/output unit, where the data input/output unit may specifically be one or more data I/O interfaces or I/O pins.
The above-described calculation instructions include, but are not limited to, forward operation instructions, backward training instructions, or other neural network calculation instructions such as convolution calculation instructions; the present embodiments do not limit the specific form of the calculation instructions.
The controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit.
A master processing circuit 101 for performing preamble processing on the input data and transmitting data and operation instructions with the plurality of slave processing circuits;
A plurality of slave processing circuits 102, configured to execute intermediate operations in parallel according to the data and the operation instruction transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
In one possible implementation, the arithmetic unit 12 may include a branch processing circuit 103 as shown in fig. 1B; the specific connection structure is shown in fig. 1B, wherein:
the master processing circuit 101 is connected to the branch processing circuit(s) 103, and the branch processing circuit 103 is connected to the one or more slave processing circuits 102;
the branch processing circuit 103 is used for forwarding data or instructions between the master processing circuit 101 and the slave processing circuits 102.
According to this technical scheme, the machine learning processor is configured as a one-master multi-slave structure. For a calculation instruction of a forward operation, the machine learning processor can split the data according to that instruction, so that the computation-intensive part is operated on in parallel by the plurality of slave processing circuits, which increases operation speed, saves operation time, and in turn reduces power consumption.
Optionally, the machine learning calculation may specifically include an artificial neural network operation, and the input data may specifically include input neuron data and weight data. The calculation result may specifically be the result of the artificial neural network operation, i.e., output neuron data.
The operation in the neural network may be one layer of operation in the neural network. In a multi-layer neural network, the implementation process is as follows: in the forward operation, after execution of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons calculated by the operation unit as the input neurons of the next layer (or performs some operations on those output neurons and then uses them as the input neurons of the next layer), and the weights are likewise replaced by the weights of the next layer. In the backward operation, when the backward operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradients calculated by the operation unit as the output neuron gradients of the next layer (or performs some operations on those input neuron gradients and then uses them as the output neuron gradients of the next layer), and likewise replaces the weights with the weights of the next layer.
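The layer-to-layer hand-off in the forward operation can be sketched in Python as follows (an illustrative simplification; the function and its arguments are assumptions, not the patent's implementation):

    def forward(layers, input_neurons):
        # layers: list of (weight, op) pairs, one per network layer;
        # op(activations, weight) is that layer's operation.
        activations = input_neurons
        for weight, op in layers:
            # The output neurons of this layer become the input neurons
            # of the next layer, and the weight is swapped for the next
            # layer's weight on the following iteration.
            activations = op(activations, weight)
        return activations

    # Example with two "layers" of plain scaling:
    print(forward([(2.0, lambda x, w: x * w), (3.0, lambda x, w: x * w)], 1.0))  # 6.0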
The machine learning computation may also include support vector machine (SVM) operations, k-nearest neighbor (k-NN) operations, k-means operations, principal component analysis operations, and the like. For convenience of description, a specific scheme of machine learning calculation is described below taking an artificial neural network operation as an example.
For the artificial neural network operation, if it has multiple layers of operation, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and output layer of the whole neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the network's forward operation are the input neurons, and the neurons in the upper layer are the output neurons. Taking a convolutional neural network as an example, let the convolutional neural network have L layers, with k = 1, 2, …, L-1. For layer k and layer k+1, layer k is referred to as the input layer, whose neurons are the input neurons, and layer k+1 is referred to as the output layer, whose neurons are the output neurons. That is, every layer except the topmost layer can serve as an input layer, and the next layer is the corresponding output layer.
Optionally, the machine learning processor may further include: a storage unit 10 and a direct memory access unit 50. The storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used for storing the calculation instructions, and the register is used for storing the input data and scalars. The cache is a scratch-pad cache. The direct memory access unit 50 is used for reading data from, or storing data to, the storage unit 10.
Optionally, the controller unit comprises: an instruction storage unit 110, an instruction processing unit 111, and a store queue unit 113;
an instruction storage unit 110, configured to store a calculation instruction associated with the artificial neural network operation;
the instruction processing unit 111 is configured to parse the calculation instruction to obtain a plurality of operation instructions;
a store queue unit 113 for storing an instruction queue, the instruction queue comprising: a plurality of arithmetic instructions or calculation instructions to be executed in the order of the queue.
For example, in one possible embodiment, the main processing circuit may also include a controller unit, which may include a main instruction processing unit, specifically for decoding instructions into micro-instructions. In another possible embodiment, the slave processing circuit may likewise include a further controller unit comprising a slave instruction processing unit, specifically for receiving and processing micro-instructions. A micro-instruction may be regarded as the next level of an instruction: it may be obtained by splitting or decoding the instruction, and may be further decoded into control signals for the various components, units, or processing circuits.
In one possible implementation, the structure of the operation instruction may be as shown in Table 1.

TABLE 1

Operation code | Register or immediate | Register/immediate | …

The ellipsis in the row above indicates that multiple registers or immediates may be included.
In another possible implementation, the computing instructions may include: one or more operation domains and an operation code. The operational instructions may include neural network operational instructions. Taking a neural network operation instruction as an example, as shown in table 2, a register number 0, a register number 1, a register number 2, a register number 3, and a register number 4 may be operation domains. Wherein each of register number 0, register number 1, register number 2, register number 3, register number 4 may be a number of one or more registers.
TABLE 2
(Table 2 is reproduced as an image in the original publication; per the surrounding text, it shows an operation code followed by the operation domains register number 0 through register number 4.)
In practical applications, the register may be an off-chip memory or an on-chip memory, and may be used to store n-dimensional data, where n is an integer greater than or equal to 1. For example, n=1 is 1-dimensional data, i.e., a vector; n=2 is 2-dimensional data, i.e., a matrix; and n=3 or more is a multi-dimensional tensor (tensor).
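In PyTorch terms (used here only as a familiar illustration; the patent does not tie n to any particular framework), the dimensionality n maps onto tensor shapes as follows:

    import torch

    vector = torch.zeros(8)          # n = 1: 1-dimensional data, a vector
    matrix = torch.zeros(4, 8)       # n = 2: 2-dimensional data, a matrix
    tensor3 = torch.zeros(2, 4, 8)   # n = 3: a multi-dimensional tensor
    print(vector.dim(), matrix.dim(), tensor3.dim())  # 1 2 3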
Optionally, the controller unit may further include:
The dependency relationship processing unit 108 is configured to, when there are a plurality of operation instructions, determine whether a first operation instruction has an association relationship with a zeroth operation instruction that precedes it. If the first operation instruction has an association relationship with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after execution of the zeroth operation instruction is completed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit.
the determining whether the association relationship exists between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that the first operation instruction and the zeroth operation instruction have an association relationship, and if they have no overlapping area, determining that the first operation instruction and the zeroth operation instruction have no association relationship.
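A minimal sketch of this overlap test (the interval representation and names are assumptions made for illustration):

    def has_association(first_interval, zeroth_interval):
        # Each interval is (start_address, end_address) for the data the
        # instruction requires; overlap means an association relationship.
        f_start, f_end = first_interval
        z_start, z_end = zeroth_interval
        return f_start < z_end and z_start < f_end

    # Overlapping address intervals: the first instruction must be cached
    # until the zeroth instruction finishes executing.
    print(has_association((0x1000, 0x1100), (0x10C0, 0x1200)))  # True
    print(has_association((0x1000, 0x1100), (0x1200, 0x1300)))  # False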
It can be seen that, compared with a general-purpose processor, the machine learning processor provided by the embodiment of the application can improve the running efficiency and the information processing efficiency, and can reduce the running time and the power consumption.
Operators are the basic computing units of a deep learning framework (such as PyTorch, TensorFlow, etc.), and a rich operator library provides the framework with a powerful basic toolset for quickly building models. However, existing operator dispatch mechanisms, when run on a hardware backend (e.g., a machine learning processor (machine learning unit, MLU)), require the hardware to register an operator in advance and to implement and adapt the operator's specific functionality before the operator can be invoked. If an operator has not been registered, or its registration fails, calls to that operator fail, which hinders rapid integration and verification of the whole network.
In order to solve the above problems, the embodiments of the present application provide an operation method, which can improve the success rate of operation.
Referring to fig. 2, fig. 2 is a flowchart of an operation method according to an embodiment of the present application, and the method may include the following steps S201 to S206, in which:
step S201: the machine learning processor obtains a first call instruction.
The first call instruction is used for instructing the machine learning processor to call the target operator to operate on the first data. The first call instruction may carry an operator identifier of the target operator. The operator identifier identifies the identity information of the operator; it may be a tag, such as an operator name, that identifies the operator's arithmetic logic or a specific function of the operator, or it may be other information that can identify the operator's identity, which is not limited.
The target operator is used for operating on the first data. The target operator may be an addition operator, a subtraction operator, a convolution operator, etc., or may be a pooling operator, an activation operator, a custom operator, etc., without limitation. The target operator may include at least one function by which the operational logic of the target operator, and the declared parameters, may be obtained. The parameters may include an incoming parameter or an outgoing parameter.
Illustratively, operator 1 is at::Tensor add(const at::Tensor&, const at::Tensor&, const at::Scalar&). The operator identifier may be "operator 1" or "add". The incoming parameters are const at::Tensor&, const at::Tensor&, and const at::Scalar&. The arithmetic logic is the addition over these three incoming parameters.
The first data is data that can be processed by the machine learning processor. The first data may be data obtained by the machine learning processor from the read external device, may be data in the machine learning processor, or may be intermediate data calculated by the machine learning processor, or the like.
In one possible implementation, the machine learning processor obtaining the first call instruction may include the steps of:
acquiring an operation instruction of data to be processed; determining the operation logic of the data to be processed; searching for a target operator based on the operation logic; acquiring, based on the data to be processed, first data which can be processed by a machine learning processor; and generating a first call instruction based on the target operator and the first data.
In the embodiment of the present application, the data to be processed may be one-dimensional data or multidimensional data, which is not limited. The operation instruction is used for instructing the machine learning processor to process the data to be processed. The operation instruction may include one or more operation fields and an operation code. Wherein the operation code can be used for indicating the function of the calculation instruction, and the calculation instruction performs different operations through the operation code. The operation field may be used to indicate data information of a calculation instruction, including an immediate of the calculation instruction or a register number storing a data block in which the calculation instruction is executed. The specific structure of the operation instruction may refer to table 1 and table 2, and will not be described herein.
In one possible implementation manner, the user may input an operation instruction of the data to be processed through the electronic device, and the machine learning processor may acquire the operation instruction of the data to be processed by reading information of the electronic device, or may acquire the operation instruction of the data to be processed from the instruction storage unit, or the like, which is not limited.
After acquiring the operation instruction of the data to be processed, the instruction processing unit of the machine learning processor may parse the operation instruction to determine the operation logic of the data to be processed, where the operation logic may be addition, subtraction, convolution, pooling, activation, etc. The target operator may then be looked up based on the operation logic. For example, if the operation logic is pooling, the target operator is the corresponding pooling operator; if the operation logic is activation, the target operator is the corresponding activation operator, and so on. In the embodiment of the present application, the machine learning processor may obtain the processable first data from the data to be processed. The first call instruction generated based on the target operator and the first data may be used to instruct the machine learning processor to call the target operator to operate on the first data.
It can be seen that after the machine learning processor acquires the operation instruction of the data to be processed, the machine learning processor determines the operation logic of the data to be processed, and then generates the first call instruction based on the target operator searched by the operation logic and the first data acquired by the data to be processed. The first data is data which can be processed by the machine learning processor. Thus, the processing efficiency of the machine learning processor can be improved.
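These steps can be sketched as follows (a hedged Python illustration; OPERATOR_TABLE, parse_logic, and to_mlu_processable are assumed stand-ins that do not appear in the patent):

    OPERATOR_TABLE = {"pooling": "max_pool", "activation": "relu"}

    def parse_logic(operation_instruction):
        # Stand-in for the instruction processing unit parsing the
        # operation code to recover the operation logic.
        return operation_instruction["opcode"]

    def to_mlu_processable(data_to_be_processed):
        # Stand-in for extracting/converting the portion of the data
        # that the machine learning processor can process.
        return data_to_be_processed

    def build_first_call(operation_instruction, data_to_be_processed):
        logic = parse_logic(operation_instruction)        # e.g. "pooling"
        target_operator = OPERATOR_TABLE[logic]           # operator found by logic
        first_data = to_mlu_processable(data_to_be_processed)
        return {"operator": target_operator, "data": first_data}

    print(build_first_call({"opcode": "pooling"}, [1, 2, 3]))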
It should be noted that the data to be processed may include data that the machine learning processor can process, as well as data that the machine learning processor cannot process or is not good at processing. If the data to be processed includes data that the machine learning processor cannot process, or is not good at processing, that data needs to be converted into reference data that the target processor can process. In addition, data that the machine learning processor can process may also be assigned to the target processor for operation, so that cross verification can be realized and operation accuracy improved.
In one possible implementation, the machine learning processor may further include the following steps after searching for the target operator based on the arithmetic logic:
acquiring reference data which can be processed by a target processor based on data to be processed; generating a third call instruction based on the target operator and the reference data; and sending a third calling instruction to the target processor, wherein the third calling instruction is used for instructing the target processor to call a target operator to operate on the reference data.
In the embodiment of the application, the target operator can complete registration in the target processor in advance, so that the target processor can call the target operator to operate. The target processor may be a central processing unit (central processing unit, CPU), a graphics processor (graphics processing unit, GPU), or the like.
The specific implementation of converting the data in the data to be processed into reference data that the target processor can process may refer to the method, described hereinafter, for converting machine-learning-processor data into central-processing-unit data, and is not described here. After the machine learning processor generates the third call instruction based on the target operator and the reference data, it sends the instruction to the target processor. After the target processor receives the third call instruction, it calls the target operator to operate on the reference data.
It should be noted that the target operator in the third call instruction may be the same as or different from the target operator in the first call instruction. The operator called by the machine learning processor is an operator registered with the machine learning processor, and the operator called by the target processor is an operator registered with the target processor. The type of the registered operator is not limited, and can be determined based on the hardware attributes of the electronic device configuration and the type of the operator. For example, if certain operators are not decisive for the overall network delivery task, those operators may be considered to have low importance values and may be registered only in the target processor rather than in the machine learning processor. For another example, an operator with a large computation amount can be registered in both the machine learning processor and the target processor.
It can be seen that after the machine learning processor acquires the operation instruction of the data to be processed, it can also acquire, based on the data to be processed, reference data that the target processor can process, and generate the third call instruction, so that the target processor calls the target operator to operate on the reference data. This can improve the processing efficiency of the target processor and thus the overall operation efficiency.
If the target operator has completed registration in both the machine learning processor and the target processor in advance, both processors call the target operator to operate. After the operations finish, cross verification can be realized by comparing information such as the operation time, the computing resources occupied by the operation, or the operation accuracy of the two, thereby improving operation efficiency.
In a possible implementation manner, the first call instruction includes an operator identifier of the target operator, and the machine learning processor may further include the following steps after performing step S201:
searching, based on the operator identifiers of the registered operators in a preset operator library, whether an operator identifier identical to the operator identifier of the target operator exists; if no identical operator identifier is found, determining that the machine learning processor fails to call the target operator.
In the embodiment of the present application, the operator identifier of the target operator may be carried in the first call instruction. The description of the operator identifier may refer to the foregoing and is not repeated here. The machine learning processor stores a preset operator library that holds the already-registered operators, each of which may carry an operator identifier. Thus, after the operator identifier of the target operator is obtained, it is possible to search, based on the operator identifiers of the registered operators in the preset operator library, whether an identical operator identifier exists. If no operator identifier identical to that of the target operator is found, it can be determined that the target operator is an operator not registered with the machine learning processor. An unregistered target operator cannot run on the machine learning processor, so when the machine learning processor calls the unregistered target operator, the call fails.
It can be seen that judging, based on the operator identifier, whether the target operator is an operator registered with the machine learning processor determines whether the machine learning processor's call will fail, and this can improve search efficiency and accuracy.
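A minimal sketch of the identifier lookup (the dictionary-backed library is an assumption made for illustration):

    # Preset operator library: operator identifier -> registered operator.
    PRESET_OPERATOR_LIBRARY = {"add": "mlu_add_kernel", "conv2d": "mlu_conv_kernel"}

    def mlu_call_fails(target_operator_id):
        # The call fails when no registered operator carries an identifier
        # identical to that of the target operator.
        return target_operator_id not in PRESET_OPERATOR_LIBRARY

    print(mlu_call_fails("add"))     # False: registered, call can proceed
    print(mlu_call_fails("pool3d"))  # True: not registered, call fails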
In one possible implementation, after determining that the machine learning processor fails to invoke the target operator, the machine learning processor may further include the steps of:
determining an importance value and/or a computation amount of the target operator; calculating a registration evaluation value of the target operator based on the importance value and/or the computation amount of the target operator; if the registration evaluation value is smaller than a preset threshold value, executing the step of converting the first data into second data that the target processor can process; or, if the registration evaluation value is greater than or equal to the preset threshold value, registering the target operator in the machine learning processor, so that the machine learning processor stores the operator identifier and the operation logic of the target operator when the target operator is successfully registered.
In the embodiment of the present application, the importance value of the target operator describes the effect of the target operator on the operation of the data to be processed; the larger the importance value, the greater the influence on the operation result. The computation amount of the target operator refers to the computing resources consumed to operate on the first data. The importance values of the target operators can be determined based on the relevance and operation order of the target operators corresponding to the data to be processed, and the computation amount of a target operator may depend on the specific network delivery task and the computational capability of the processor invoking it. When performing a network delivery task, a machine learning processor typically needs to invoke numerous operators. For example, some operators have a decisive effect on the overall network delivery task because they affect the operations of other operators, so their importance is highest. Some operators bear most of the operations, so they can be determined to have a large computation amount; similarly, operators that consume most of the processor's computing resources can be determined to have a large computation amount.
The registration evaluation value may be used to evaluate whether the target operator needs to be registered in the machine learning processor. The embodiment of the application does not limit how the registration evaluation value is calculated: an operation unit of the electronic device may compute a weighted combination of the importance value and/or the computation amount of the target operator, or the product of the two, or the maximum or minimum of the two, and so on.
The preset threshold is a preconfigured parameter and may be set according to historical experience. For example, if the preset threshold is 60 and the calculated registration evaluation value is 50, the registration evaluation value is smaller than the preset threshold, so the machine learning processor executes step S202 without registering the target operator. If the calculated registration evaluation value is 70, the registration evaluation value is greater than the preset threshold, so the target operator needs to be registered in the machine learning processor, enabling it to run on the machine learning processor.
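By way of a hedged sketch (the weights and threshold below are illustrative choices, not values prescribed by the patent), the evaluation could look like:

    def registration_evaluation(importance, computation, w_imp=0.5, w_comp=0.5):
        # One of the permitted schemes: a weighted combination of the
        # importance value and the computation amount.
        return w_imp * importance + w_comp * computation

    PRESET_THRESHOLD = 60
    value = registration_evaluation(importance=40, computation=60)  # -> 50
    if value < PRESET_THRESHOLD:
        print("do not register; convert the first data for the target processor")
    else:
        print("register the target operator in the machine learning processor")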
The embodiment of the present application does not limit the specific implementation manner of registering the target operator in the processor, and takes the machine learning processor as an example for illustration, in a possible implementation manner, the method may include the following steps: acquiring a registration configuration file of a target operator; generating a class corresponding to the target operator based on the registration configuration file; and adding the class corresponding to the target operator into a registry of the machine learning processor.
Wherein the registration profile includes operator identification of the target operator, arithmetic logic, incoming parameters, outgoing parameters, processors supported by the target operator, and the like. The class to which the target operator corresponds may be understood as a function that invokes the target operator. In this example, the class corresponding to the target operator is generated through the registration configuration file of the target operator, and then the class corresponding to the target operator is added to the registry to complete the registration of the target operator, so that the success rate of the registration can be improved.
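A sketch of this registration flow (class generation via Python's type() is an illustrative choice, and the field names are assumptions):

    MLU_REGISTRY = {}

    def register_operator(registration_config):
        # registration_config stands in for the registration configuration
        # file: operator identifier, operation logic, parameters, etc.
        operator_class = type(
            registration_config["id"].capitalize() + "Op",  # generated class
            (),
            {"identifier": registration_config["id"],
             "logic": staticmethod(registration_config["logic"])},
        )
        # Adding the generated class to the registry completes registration.
        MLU_REGISTRY[registration_config["id"]] = operator_class

    register_operator({"id": "add", "logic": lambda a, b: a + b})
    print(MLU_REGISTRY["add"].logic(1, 2))  # 3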
It can be seen that calculating the registration evaluation value from the importance value and/or the computation amount of the target operator gives the evaluation higher accuracy, and whether the machine learning processor needs to register the target operator is judged from the registration evaluation value. When the registration evaluation value is smaller than the preset threshold, it is determined that the machine learning processor is not suited to registering the target operator, and the first data is converted into second data that the target processor can process, so that the target processor can call the target operator to process the second data. When the registration evaluation value is greater than or equal to the preset threshold, it is determined that the machine learning processor is suited to registering the target operator, and the machine learning processor registers it. When the registration succeeds, the operator identifier and operation logic of the target operator are stored, so that the machine learning processor can call the target operator to operate, improving efficiency.
Step S202: in response to the machine learning processor failing to invoke the target operator, the machine learning processor then converts the first data into second data that is processable by the target processor.
In an embodiment of the present application, the reason why the machine learning processor fails to call the target operator may be that the target operator is not registered in the operator library of the machine learning processor. The method for determining whether the target operator is registered in the operator library of the machine learning processor may be described above, and will not be described herein.
In the embodiment of the application, the first data is data that the machine learning processor can process. When the machine learning processor fails to call the target operator, the first data can be converted into second data that the target processor can process, so that the target processor can call the target operator to operate on the second data.
Taking the target processor as a central processing unit for illustration, the machine learning processor inputs the first data into a preset first conversion function, and the resulting data is called second data that the central processing unit can process. It should be noted that the first data may be a single data element or a data block corresponding to a plurality of data. Data transmission is generally performed in data-block format: if the first data is a single data element, it may first be stored in data-block format and then converted based on the first conversion function; if the first data is a data block corresponding to a plurality of data, it may be converted directly based on the first conversion function.
It should be noted that, the above conversion method is only one way to convert the data that can be processed by the machine learning processor into the data that can be processed by the central processing unit, and may also be implemented by another conversion method, which is not limited in this embodiment of the present application.
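In PyTorch terms, such a pair of conversion functions might look like the following (a sketch under the assumption that an "mlu" device backend, such as the Cambricon PyTorch backend, is installed; the patent does not disclose the concrete functions):

    import torch

    def first_conversion(first_data):
        # MLU-processable first data -> CPU-processable second data.
        return first_data.cpu()

    def second_conversion(third_data):
        # CPU-computed third data -> fourth data processable by the
        # machine learning processor (requires an available "mlu" device).
        return third_data.to("mlu")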
In one possible implementation, before performing step S202, the machine learning processor may further include the steps of:
determining a reference processor in which the target operator has been registered; if the number of reference processors is greater than 1, determining the operation priority of the reference processors; and taking the reference processor corresponding to the maximum operation priority as the target processor.
Because operators differ in functional requirements and usage situations, the same operator can also support calculation on different hardware devices, such as a central processing unit and a graphics processor. In the embodiment of the application, the reference processors supported by the target operator can be determined from the supported-processor information in the registration configuration information of the target operator, thereby determining the reference processors in which the target operator is registered. Alternatively, whether a given processor is a reference processor may be determined by checking whether its registry contains the operator identifier of the target operator, and so on.
In the embodiment of the present application, the operation priority of a reference processor defines its operation order: the higher the operation priority, the earlier the operation order. The operation priority can be determined from information such as operation time, computing resources occupied by the operation, or operation accuracy. For example, the shorter a reference processor's operation time, the fewer computing resources its operation occupies, and the higher its operation accuracy, the higher its operation priority can be considered. The reference processor corresponding to the maximum operation priority is taken as the target processor.
It can be seen that when the number of the reference processors is greater than 1, the reference processor with the highest operation priority is taken as the target processor, which is helpful to reduce the operation time and improve the operation efficiency.
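A minimal selection sketch (the numeric priorities are illustrative assumptions derived, say, from operation time, occupied computing resources, and operation accuracy):

    def pick_target_processor(reference_processors):
        # reference_processors: list of (name, operation_priority) pairs for
        # the processors in which the target operator is registered.
        if len(reference_processors) == 1:
            return reference_processors[0][0]
        # Take the reference processor with the maximum operation priority.
        return max(reference_processors, key=lambda p: p[1])[0]

    print(pick_target_processor([("cpu", 0.7), ("gpu", 0.9)]))  # gpu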
Step S203: the machine learning processor sends a second call instruction to the target processor.
Accordingly, the target processor receives the second call instruction. The second calling instruction is used for indicating the target processor to call the target operator to operate on the second data.
Step S204: and the target processor calls a target operator to operate on the second data to obtain third data.
The target operator may be an operator registered in advance in the target processor, so the target processor can call it to perform the operation. The target operator may comprise two or more different functions implementing the same operational function, each function declaring different parameters. Illustratively, operator 1 is at::Tensor add(const at::Tensor&, const at::Tensor&, const at::Scalar&); operator 2 is at::Tensor& add_(at::Tensor&, const at::Tensor&, const at::Scalar&); and operator 3 is at::Tensor& add_out(const at::Tensor&, const at::Tensor&, const at::Scalar&, at::Tensor&). All three operators implement the same addition, but their function declarations differ. Operator 1 is the most common form: all of its parameters are incoming parameters and cannot be modified. Operator 2 is an in-place operator whose first parameter is both an incoming and an outgoing parameter. Operator 3 is an operator with an out parameter, whose last parameter is both an incoming and an outgoing parameter.
In one possible implementation, the second call instruction includes a target parameter declared by the target operator; the target operator comprises at least two functions, and parameters declared by each function are different; step S204 may include the steps of:
determining an objective function of the objective operator based on the objective parameter; and operating the second data based on the objective function to obtain third data.
In an embodiment of the present application, when the machine learning processor fails to call the target operator, it calls a default (e.g., fallback) function registered with the machine learning processor, and then transmits the target parameters of the target operator to the target processor through the default function, so that the target processor can perform the operation in its place. The target parameters are transmitted in stack form, and before invoking the target operator the target processor needs to parse the stack corresponding to the target parameters of the target operator to obtain them.
As previously mentioned, the target operator may comprise at least two different functions implementing the same operational function, each function declaring different parameters. Before the target operator is called, the target function of the target operator is determined based on the target parameters declared by the target operator, and the second data is operated on based on the target function to obtain the third data. Thus, the efficiency and accuracy of the operation can be improved.
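A hedged sketch of selecting the target function from the declared target parameters carried in the (stack-form) second call instruction; the variant table and stack layout are assumptions, not PyTorch's actual dispatcher:

    def add_functional(self_t, other, alpha):
        # Functional variant: all parameters are incoming; returns new data.
        return self_t + alpha * other

    # Variants of the same operator, keyed by their declared parameters;
    # in-place and out variants would be registered the same way.
    ADD_VARIANTS = {("self", "other", "alpha"): add_functional}

    def call_target_operator(stack):
        # stack: parsed target parameters of the second call instruction.
        target_function = ADD_VARIANTS[stack["declared"]]  # pick by declaration
        return target_function(*stack["args"])             # third data

    third_data = call_target_operator(
        {"declared": ("self", "other", "alpha"), "args": (2.0, 3.0, 1.0)})
    print(third_data)  # 5.0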
Step S205: the target processor sends third data to the machine learning processor.
The third data is the result obtained when the target processor calls the target operator to operate on the second data. After computing the third data, the target processor sends it to the machine learning processor; correspondingly, the machine learning processor receives the third data computed by the target processor.
Step S206: the machine learning processor converts the third data into fourth data that the machine learning processor can process.
The third data is obtained by the operation of the target processor and conforms to the data format of the target processor, so the third data is converted into fourth data that the machine learning processor can process, which makes it convenient for the machine learning processor to further process the fourth data subsequently.
Taking the target processor as a central processing unit for illustration, the machine learning processor inputs the third data obtained by the central processing unit's operation into a preset second conversion function, and the resulting data is called fourth data that the machine learning processor can process. It should be noted that the third data may be a single data element or a data block corresponding to a plurality of data. Data transmission is usually performed in data-block format: if the third data is a single data element, it may first be stored in data-block format and then converted based on the second conversion function; if the third data is a data block corresponding to a plurality of data, it may be converted directly based on the second conversion function.
The second conversion function may be understood as the inverse of the first conversion function. For example, if the first conversion function is input_element.cpu(), the second conversion function is input_element.to("mlu"); for another example, if the first conversion function is args[i].toTensor().cpu(), the second conversion function is args[i].toTensor().to("mlu").
It should be noted that, the above conversion method is only one way to convert the data that can be processed by the central processing unit into the data that can be processed by the machine learning processor, and may also be implemented by another conversion method, which is not limited in this embodiment of the present application.
In the method shown in fig. 2, it can be seen that after the machine learning processor acquires the first call instruction, if the machine learning processor fails to call the target operator, the machine learning processor converts the first data into the second data that can be processed by the target processor. And then sending a second calling instruction to the target processor so that the target processor calls a target operator to operate on the second data. And then receiving the third data obtained by the operation of the target processor, and converting the third data into fourth data which can be processed by the machine learning processor. Therefore, after the machine learning processor fails to call the target operator, the target processor can be used for operation, and the success rate of operation is improved.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an arithmetic device according to an embodiment of the present application. As shown in fig. 3, the arithmetic device 300 includes a processing unit 301 and a communication unit 302.
When the computing device is a machine learning processor, a detailed description of the respective units is as follows:
the processing unit 301 is configured to obtain a first call instruction, where the first call instruction is configured to instruct a machine learning processor to call a target operator to perform an operation on first data, where the first data is data processable by the machine learning processor; converting the first data into second data processable by the target processor in response to failure of the machine learning processor to invoke the target operator; the communication unit 302 is configured to send a second call instruction to the target processor, where the second call instruction is configured to instruct the target processor to call the target operator to perform an operation on the second data; receiving third data obtained by operation of the target processor; the processing unit 301 is further configured to convert the third data into fourth data processable by the machine learning processor.
In a possible implementation manner, the first call instruction includes an operator identifier of the target operator, and the processing unit 301 is further configured to search, based on the operator identifiers of the registered operators in a preset operator library, whether an operator identifier identical to the operator identifier of the target operator exists; if no identical operator identifier is found, it is determined that the machine learning processor fails to call the target operator.
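As a hypothetical sketch of this lookup, the preset operator library can be modeled as the set of operator identifiers of its registered operators; the helper name and sample identifiers are invented.

```python
# Hypothetical sketch: the preset operator library modeled as a set of
# operator identifiers of registered operators.

def call_fails(target_op_id, registered_op_ids):
    # The call is deemed to fail when no registered operator carries the
    # same operator identifier as the target operator.
    return target_op_id not in registered_op_ids

registered = {"conv2d", "matmul", "relu"}   # invented identifiers
assert call_fails("roi_align", registered)        # not registered -> call fails
assert not call_fails("matmul", registered)       # registered -> call proceeds
```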
In a possible implementation manner, the processing unit 301 is further configured to determine an importance value and/or an operand of the target operator; calculating a registration evaluation value of the target operator based on the importance value and/or the operand of the target operator; if the registration evaluation value is smaller than a preset threshold value, executing the step of converting the first data into second data which can be processed by the target processor; or if the registration evaluation value is greater than or equal to a preset threshold value, registering the target operator in the machine learning processor, so that the machine learning processor stores the operator identification and the operation logic of the target operator when the target operator is successfully registered.
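One way the registration evaluation value might be computed is a weighted combination of the importance value and the operand. The weights and threshold below are pure assumptions; the disclosure only states that the evaluation value is derived from the importance value and/or the operand of the target operator.

```python
# Hedged sketch: weights and threshold are assumptions for illustration.

def registration_evaluation(importance, operand_count,
                            w_importance=0.7, w_operand=0.3):
    return w_importance * importance + w_operand * operand_count

PRESET_THRESHOLD = 10.0

def registration_decision(importance, operand_count):
    value = registration_evaluation(importance, operand_count)
    if value >= PRESET_THRESHOLD:
        return "register the target operator on the machine learning processor"
    return "convert the first data and fall back to the target processor"
```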
In one possible implementation manner, the processing unit 301 is specifically configured to obtain an operation instruction of data to be processed; determining the operation logic of the data to be processed; searching a target operator based on the operation logic; acquiring first data which can be processed by a machine learning processor based on the data to be processed; a first call instruction is generated based on the target operator and the first data.
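A minimal sketch of how the first call instruction might be assembled from an operation instruction is given below; the request layout, the operator index, and the helper names are assumptions.

```python
# Illustrative only: assembling a first call instruction from an operation
# instruction on the data to be processed.

from dataclasses import dataclass

@dataclass
class FirstCallInstruction:
    operator_id: str
    first_data: object  # data already in machine-learning-processor form

def build_first_call(request, operator_index, to_mlu):
    logic = request["operation"]              # determine the operation logic
    operator_id = operator_index[logic]       # look up the target operator
    first_data = to_mlu(request["payload"])   # obtain MLU-processable first data
    return FirstCallInstruction(operator_id, first_data)
```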
In a possible implementation manner, the processing unit 301 is further configured to obtain reference data processable by the target processor based on the data to be processed; generating a third call instruction based on the target operator and the reference data; the communication unit 302 is further configured to send the third call instruction to the target processor, where the third call instruction is used to instruct the target processor to call the target operator to perform an operation on the reference data.
In a possible implementation, the processing unit 301 is further configured to determine a reference processor that has registered the target operator; if the number of the reference processors is greater than 1, determining the operation priority of the reference processors; and taking the reference processor corresponding to the maximum value of the operation priority as the target processor.
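Selecting the target processor by operation priority could be sketched as follows, assuming each reference processor that has registered the target operator carries a numeric operation priority (higher means preferred); the names and values are invented.

```python
# Sketch under assumed numeric operation priorities.

def pick_target_processor(reference_processors, operation_priority):
    if len(reference_processors) == 1:
        return reference_processors[0]
    # More than one candidate: take the one with the maximum priority.
    return max(reference_processors, key=operation_priority.get)

candidates = ["cpu", "gpu"]
priority = {"cpu": 1, "gpu": 2}
assert pick_target_processor(candidates, priority) == "gpu"
```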
When the computing device is a target processor, a detailed description of each unit is as follows:
the communication unit 302 is configured to receive a second call instruction, where the second call instruction is configured to instruct the target processor to call the target operator to perform an operation on second data, where the second data is obtained by converting the first data; the processing unit 301 is configured to invoke the target operator to perform an operation on the second data, so as to obtain third data; the communication unit 302 is further configured to send the third data to a machine learning processor.
In one possible implementation, the second call instruction includes a target parameter declared by the target operator; the target operator comprises at least two functions, and parameters declared by each function are different; the processing unit 301 is specifically configured to determine an objective function of the objective operator based on the objective parameter; and operating the second data based on the objective function to obtain third data.
It should be noted that, for the implementation of each unit, reference may also be made to the corresponding description of the method embodiment shown in fig. 2.
The embodiment of the application provides a neural network chip, which comprises a machine learning processor shown in fig. 1 and an arithmetic device shown in fig. 3.
Referring to fig. 4, in addition to the neural network chip, the board may further include other supporting components, which include, but are not limited to: a memory device, an interface device, and a control device; the memory device is connected with the neural network chip through a bus and is used for storing data.
The memory device may include multiple sets of memory cells. Each group of storage units is connected with the neural network chip through a bus. It is understood that each set of the memory cells may be double data rate synchronous dynamic random access memory (DDR SDRAM).
DDR can double the speed of SDRAM without increasing the clock frequency, because it allows data to be read out on both the rising and falling edges of the clock pulse; DDR is thus twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 sets of the memory cells, and each set may include a plurality of DDR4 memory granules (chips). In one possible implementation, the neural network chip may include 4 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for transmitting data and 8 bits are used for ECC checking. It is understood that when DDR4-3200 granules are employed in each set of memory cells, the theoretical bandwidth of data transfer can reach 25600 MB/s.
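The quoted figure follows directly from the DDR4-3200 transfer rate and the 64-bit data path; a quick check:

```python
# Quick check of the quoted figure: DDR4-3200 performs 3200 mega-transfers
# per second over a 64-bit (8-byte) data path.
transfers_per_second = 3200e6
bytes_per_transfer = 64 / 8
print(transfers_per_second * bytes_per_transfer / 1e6)  # 25600.0 MB/s
```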
In one possible implementation, each set of the memory cells includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the neural network chip, and is used for controlling the data transmission and data storage of each memory cell.
The interface device is electrically connected with the neural network chip in the neural network chip packaging structure. The interface device is used for realizing data transmission between the neural network chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transferred from the server to the neural network chip through the standard PCIe interface, thereby realizing data transfer. Preferably, when a PCIe 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the calculation result of the neural network chip is transmitted back to the external device (e.g., a server) by the interface device.
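The 16000 MB/s figure is consistent with a PCIe 3.0 x16 link (8 GT/s per lane with 128b/130b encoding); a quick check:

```python
# Quick check of the quoted figure: PCIe 3.0 runs at 8 GT/s per lane with
# 128b/130b encoding, so a x16 link carries roughly 16000 MB/s.
per_lane_mb_s = 8e9 * (128 / 130) / 8 / 1e6   # about 984.6 MB/s per lane
print(16 * per_lane_mb_s)                      # about 15753.8 MB/s
```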
The control device is electrically connected with the neural network chip and is used for monitoring the state of the neural network chip. Specifically, the neural network chip and the control device may be electrically connected through an SPI interface. The control device may comprise a single-chip microcomputer (micro controller unit, MCU). The neural network chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and can drive a plurality of loads; therefore, the neural network chip can be in different working states such as heavy load and light load. The control device can regulate and control the working states of the plurality of processing chips, processing cores, and/or processing circuits in the neural network chip.
The embodiment of the application provides electronic equipment, which comprises the machine learning processor, the computing device, the neural network chip or the board card.
Referring to fig. 5, a schematic structural diagram of an electronic device is provided in an embodiment of the present application. As shown in fig. 5, the electronic device 500 includes a processor 501, a memory 502, and a communication interface 503, which may be connected by a bus 505. The memory 502 stores a computer program 504 configured to be executed by the processor 501; the computer program 504 comprises instructions for some or all of the steps of any of the methods described in the method embodiments above.
The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship, and/or an automobile; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers, and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method. To avoid repetition, a detailed description is not provided herein.
In various embodiments of the present application, the sequence number of each process does not mean the sequence of execution, and the execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks (ILBs) and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk), etc.
The present application also provides a computer storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the operation methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to perform part or all of the steps of any one of the methods of operation as described in the method embodiments above.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. An operation method, comprising:
acquiring a first call instruction, wherein the first call instruction is used for instructing a machine learning processor to call a target operator to operate on first data, and the first data is processable by the machine learning processor;
converting the first data into second data processable by a target processor in response to a failure of the machine learning processor to invoke the target operator;
sending a second call instruction to the target processor, wherein the second call instruction is used for instructing the target processor to call the target operator to operate on the second data;
receiving third data obtained by operation of the target processor;
converting the third data into fourth data processable by the machine learning processor.
2. The method of claim 1, wherein the first call instruction includes an operator identifier of the target operator, and the method further comprises, after the acquiring of the first call instruction:
searching, based on the operator identifiers of registered operators in a preset operator library, whether an operator identifier identical to the operator identifier of the target operator exists;
if no operator identifier identical to the operator identifier of the target operator is found, determining that the machine learning processor fails to call the target operator.
3. The method of claim 2, wherein after the determining that the machine learning processor fails to call the target operator, the method further comprises:
determining an importance value and/or an operand of the target operator;
calculating a registration evaluation value of the target operator based on the importance value and/or the operand of the target operator;
if the registration evaluation value is smaller than a preset threshold value, executing the step of converting the first data into second data processable by the target processor; or
if the registration evaluation value is greater than or equal to the preset threshold value, registering the target operator in the machine learning processor, so that the machine learning processor stores the operator identifier and the operation logic of the target operator when the target operator is successfully registered.
4. The method according to any one of claims 1-3, wherein the acquiring of the first call instruction comprises:
acquiring an operation instruction of data to be processed;
determining the operation logic of the data to be processed;
searching a target operator based on the operation logic;
acquiring first data which can be processed by a machine learning processor based on the data to be processed;
a first call instruction is generated based on the target operator and the first data.
5. The method of claim 4, wherein after the searching a target operator based on the operation logic, the method further comprises:
acquiring reference data processable by the target processor based on the data to be processed;
generating a third call instruction based on the target operator and the reference data;
and sending the third calling instruction to the target processor, wherein the third calling instruction is used for indicating the target processor to call the target operator to operate on the reference data.
6. The method according to any one of claims 1-3, wherein before the converting the first data into second data processable by the target processor, the method further comprises:
determining a reference processor that has registered the target operator;
if the number of the reference processors is greater than 1, determining the operation priority of the reference processors;
and taking the reference processor corresponding to the maximum value of the operation priority as the target processor.
7. An operation method, comprising:
receiving a second call instruction, wherein the second call instruction is used for instructing a target processor to call a target operator to operate on second data, and the second data is obtained by converting first data;
invoking the target operator to operate the second data to obtain third data;
The third data is sent to a machine learning processor.
8. The method of claim 7, wherein the second call instruction includes a target parameter declared by the target operator; the target operator comprises at least two functions, and parameters declared by each function are different;
wherein the calling the target operator to operate on the second data to obtain third data comprises:
determining an objective function of the objective operator based on the objective parameter;
and operating the second data based on the objective function to obtain third data.
9. A computing device comprising units for performing the method of any one of claims 1-6 or 7-8.
10. A machine learning processor for performing the method of any of claims 1-6, the machine learning processor comprising: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
the controller unit is used for acquiring input data and calculation instructions;
the controller unit is further configured to parse the calculation instruction to obtain a plurality of operation instructions, and send the plurality of operation instructions and the input data to the main processing circuit;
the main processing circuit is used for performing preamble processing on the input data and transmitting data and operation instructions with the plurality of slave processing circuits;
the slave processing circuits are used for executing intermediate operation in parallel according to the data and operation instructions transmitted from the master processing circuit to obtain a plurality of intermediate results, and transmitting the plurality of intermediate results to the master processing circuit;
and the main processing circuit is used for executing subsequent processing on the plurality of intermediate results to obtain a calculation result of the calculation instruction.
11. A neural network chip, characterized in that it comprises the computing device of claim 9 or the machine learning processor of claim 10.
12. A board comprising a memory device, an interface device, and a control device, and the neural network chip of claim 11, wherein:
the neural network chip is respectively connected with the storage device, the interface device and the control device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the neural network chip and external equipment;
The control device is used for monitoring the state of the neural network chip.
13. An electronic device comprising the computing apparatus of claim 9, or the machine learning processor of claim 10, or the neural network chip of claim 11, or the board card of claim 12.
14. An electronic device comprising a processor, a memory and a communication interface, wherein the memory stores a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1-6 or 7-8.
15. A computer readable storage medium storing a computer program that causes a computer to execute to implement the method of any one of claims 1-6 or 7-8.