WO2020073874A1 - Distribution system and method for machine learning operations

Distribution system and method for machine learning operations

Info

Publication number
WO2020073874A1
Authority
WO
WIPO (PCT)
Prior art keywords
terminal
cloud
machine learning
instruction
computing
Application number
PCT/CN2019/109552
Other languages
English (en)
French (fr)
Inventor
孟小甫
孙咏哲
杜子东
周徐达
曾洪博
Original Assignee
中科寒武纪科技股份有限公司
Priority claimed from CN201811190161.6A external-priority patent/CN111047045B/zh
Priority claimed from CN201811424173.0A external-priority patent/CN111222632B/zh
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2020073874A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The invention relates to the field of information processing technology, and in particular to a distribution system and method for machine learning operations.
  • Machine learning has made major breakthroughs in recent years. For example, in machine learning technology, neural network models trained with deep learning algorithms have achieved remarkable results in image recognition, speech processing, intelligent robotics and other applications.
  • A deep neural network builds a model that simulates the neural connection structure of the human brain, describing data features in layers through multiple transformation stages.
  • However, machine learning techniques still face many problems in practical applications, such as high resource usage, slow operation speed, and high energy consumption.
  • a distribution system for machine learning operations including: a terminal server and a cloud server;
  • the terminal server is used to generate a corresponding computing task according to the demand information, to select the first machine learning algorithm running on the terminal server according to the computing task and the hardware performance parameters of the terminal server, and to select the second machine learning algorithm running on the cloud server according to the computing task and the hardware performance parameters of the cloud server;
  • a terminal server control instruction is generated according to the first machine learning algorithm and the operation task, and a cloud server control instruction is generated according to the second machine learning algorithm and the operation task.
  • a method for distributing machine learning operations including:
  • a terminal server control instruction is generated according to the first machine learning algorithm and the operation task, and a cloud server control instruction is generated according to the second machine learning algorithm and the operation task.
  • In the above machine learning operation distribution system and method, when a computing task must be completed according to the user's demand information, the task is executed on the terminal server and the cloud server respectively, so that the same computing task is completed using different machine learning algorithms and calculation results of different accuracy are obtained.
  • Based on the different machine learning algorithms, the terminal server generates terminal server control instructions for control in the terminal server and cloud server control instructions for control in the cloud server.
  • The terminal computing result can be output first, which avoids a long wait for the user, improves processing efficiency, and makes full use of the computing resources of the terminal server and the cloud server, so that the same computing task can be executed on both the terminal server and the cloud server.
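  • As a minimal sketch of this dual dispatch (all helper names, timings, and result values below are illustrative assumptions, not the patent's implementation), the terminal result can be returned while the cloud result is still being computed:

```python
import threading
import time

# Hypothetical stand-ins for the terminal and cloud computations.
def run_on_terminal(instr):
    time.sleep(0.1)                      # fast, low-accuracy first algorithm
    return {"label": "cat", "accuracy": "low"}

def run_on_cloud(instr, out):
    time.sleep(2.0)                      # slow, high-accuracy second algorithm
    out["result"] = {"label": "tabby cat", "accuracy": "high"}

def distribute(demand_info):
    # One computing task, two control instructions: one per server.
    task = {"demand": demand_info}
    terminal_instr = {"algo": "small-model", "task": task}
    cloud_instr = {"algo": "large-model", "task": task}

    cloud_out = {}
    t = threading.Thread(target=run_on_cloud, args=(cloud_instr, cloud_out))
    t.start()
    print("terminal result:", run_on_terminal(terminal_instr))  # output first
    t.join()
    print("cloud result:", cloud_out["result"])                 # arrives later

distribute({"function": "classify image"})
```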
  • a computing device including:
  • the computing device is used to perform network model calculations, for example neural network operations;
  • the computing device includes: an arithmetic unit, a controller unit, and a storage unit;
  • the storage unit is used to store weights and input neurons, and the weights include important bits and non-important bits;
  • the controller unit is used to obtain the important bits and non-important bits of the weight and the input neuron, and to transfer the important bits and non-important bits of the weight and the input neuron to the arithmetic unit;
  • the operation unit is configured to perform an operation on the input neuron and the important bits to obtain the first operation result of the output neuron;
  • the input neuron and the non-important bits are operated on to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
  • A machine learning computing device includes one or more computing devices according to the first aspect, used to acquire input data and control information to be computed from other processing devices, perform the specified machine learning operations, and pass the execution results to other processing devices through an I/O interface;
  • the machine learning computing device includes a plurality of the computing devices
  • the plurality of the computing devices can be connected and transmit data through a specific structure
  • the multiple computing devices can interconnect and transmit data through a PCIe (peripheral component interconnect express) bus to support larger-scale machine learning operations; the multiple computing devices may share the same control system or have their own control systems, and may share memory or have their own memory; the interconnection method of the multiple computing devices may be any interconnection topology.
  • a combined processing device includes the machine learning computing device according to the second aspect, a universal interconnection interface, and other processing devices;
  • the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • an embodiment of the present application provides a neural network chip.
  • the neural network chip includes the machine learning computing device according to the second aspect or the combined processing device according to the fifth aspect.
  • an embodiment of the present application provides an electronic device, where the electronic device includes the chip according to the sixth aspect.
  • an embodiment of the present application provides a board card, wherein the board card includes: a storage device, an interface device, and a control device, and the neural network chip described in the sixth aspect;
  • the neural network chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used for storing data
  • the interface device is used to realize data transmission between the chip and an external device
  • the control device is used for monitoring the state of the chip.
  • an embodiment of the present application provides a calculation method, including:
  • if the first operation result is greater than the preset threshold, an operation is performed between the input neuron and the non-important bits to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the ninth aspect.
  • an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium that stores a computer program, and the computer program can be operated by a computer to execute the embodiment of the present application Part or all of the steps described in the ninth aspect.
  • the computer program product may be a software installation package.
  • The computing device obtains the important bits and non-important bits of the weight, and the input neuron, and operates on the input neuron and the important bits to obtain the first operation result of the output neuron. If the first operation result is less than or equal to the preset threshold, the operation of the current output neuron is skipped. If the first operation result is greater than the preset threshold, the input neuron and the non-important bits are operated on to obtain the second operation result, and the sum of the first operation result and the second operation result is used as the output neuron. In this way, if the prediction indicates that an output neuron requires no further operation, the operation process of that output neuron is skipped.
  • The computing device thus integrates prediction into the computing method to skip output neurons that do not need to be computed, thereby reducing the calculation time and energy consumption of the neural network.
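  • A minimal sketch of this two-step evaluation (the threshold value and the way each weight is split into an important and a non-important part below are illustrative assumptions):

```python
import numpy as np

THRESHOLD = 0.0  # preset threshold; the value used here is an assumption

def output_neuron(x, w_hi, w_lo):
    """Two-step evaluation of one output neuron.

    x    -- input neuron vector
    w_hi -- weight contribution from the important (high-order) bits
    w_lo -- weight contribution from the non-important (low-order) bits
    """
    first = float(np.dot(x, w_hi))     # predict using important bits only
    if first <= THRESHOLD:
        return 0.0                     # skip: neuron predicted inactive
    second = float(np.dot(x, w_lo))    # refine using non-important bits
    return first + second              # full-precision output neuron

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.75, 0.5, -0.25])
w_hi = np.trunc(w * 2) / 2             # crude high-order part of each weight
w_lo = w - w_hi                        # remainder = low-order part
print(output_neuron(x, w_hi, w_lo))
```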
  • FIG. 1-1 is a schematic structural diagram of a machine learning operation distribution system according to an embodiment;
  • FIG. 1-2 is a schematic structural diagram of a machine learning operation distribution system according to another embodiment;
  • FIG. 1-3 is a schematic structural diagram of a machine learning operation distribution system according to yet another embodiment;
  • FIG. 1-4 is a diagram of the operation-storage-communication working mode of an embodiment;
  • FIG. 1-5A is a schematic structural diagram of a computing device according to an embodiment;
  • FIG. 1-5B is a structural diagram of a computing device according to an embodiment;
  • FIG. 1-5C is a structural diagram of a computing device provided by another embodiment;
  • FIG. 1-5D is a structural diagram of a main processing circuit of an embodiment;
  • FIG. 1-5E is a structural diagram of another computing device according to an embodiment;
  • FIG. 1-5F is a schematic structural diagram of a tree module according to an embodiment;
  • FIG. 1-5G is a structural diagram of yet another computing device according to an embodiment;
  • FIG. 1-5H is a structural diagram of still another computing device according to an embodiment;
  • FIG. 1-5I is a schematic structural diagram of a computing device according to an embodiment;
  • FIG. 1-6 is a flowchart of a machine learning operation distribution method according to an embodiment;
  • FIG. 2-1A is a schematic structural diagram of a computing device according to an embodiment of the present invention;
  • FIG. 2-1B is a schematic structural diagram of a layered storage device according to an embodiment of the present application;
  • FIG. 2-1C is a schematic structural diagram of a 3T SRAM memory cell provided by an embodiment of the present application;
  • FIG. 2-1D is a schematic structural diagram of a data processing device according to an embodiment of the present application;
  • FIG. 2-1E is a schematic structural diagram of another data processing device provided by an embodiment of the present application;
  • FIG. 2-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 2-3 is a structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 2-4 is a structural diagram of a computing device provided by another embodiment of the present application;
  • FIG. 2-5 is a structural diagram of a main processing circuit provided by an embodiment of the present application;
  • FIG. 2-6 is a structural diagram of another computing device provided by an embodiment of the present application;
  • FIG. 2-7 is a schematic structural diagram of a tree module provided by an embodiment of the present application;
  • FIG. 2-8 is a structural diagram of yet another computing device provided by an embodiment of the present application;
  • FIG. 2-9 is a structural diagram of still another computing device provided by an embodiment of the present application;
  • FIG. 2-10 is a structural diagram of a combined processing device provided by an embodiment of the present application;
  • FIG. 2-11 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 2-12 is a structural diagram of another combined processing device provided by an embodiment of the present application;
  • FIG. 2-13 is a schematic structural diagram of a board card provided by an embodiment of the present application;
  • FIG. 2-14 is a schematic flowchart of a calculation method provided by an embodiment of the present invention.
  • a machine learning computing distribution system includes: a cloud server 10 and a terminal server 20.
  • the user inputs corresponding demand information through a terminal device according to his actual needs.
  • the terminal device includes an input acquisition unit with a control function that the user can operate, such as an app or an API interface of another program.
  • the demand information input by the user is mainly determined by three aspects: function demand information, accuracy demand information, and memory demand information.
  • the computing tasks include functional requirements tasks, accuracy requirements tasks and memory requirements tasks. It needs to be clear that the computing task of the first machine learning algorithm and the computing task of the second machine learning algorithm are the same computing task.
  • Hardware performance parameters include but are not limited to computing power, energy consumption, accuracy and speed.
  • machine learning algorithms include but are not limited to neural network algorithms and deep learning algorithms.
  • the machine learning algorithm has obvious stage-by-stage characteristics, such as the operation of each layer of neural network, each iteration of the clustering algorithm, and so on.
  • the machine learning algorithm can therefore be divided into multiple stages.
  • the machine learning algorithm is a multi-layer neural network algorithm, and multiple stages include multiple layers.
  • the machine learning algorithm is a clustering algorithm, and multiple stages are multiple iterations. In each stage of calculation, the terminal server 20 and the cloud server 10 can be used for calculation.
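  • As a minimal sketch of this stage-by-stage execution (the stage functions and device labels below are illustrative assumptions), each stage, whether a network layer or a clustering iteration, can be run on either server:

```python
# Each stage is one layer of a network or one iteration of a clustering
# algorithm; the same staged task can run on the terminal or the cloud.
def run_in_stages(stages, device):
    state = 0
    for i, stage in enumerate(stages):
        state = stage(state)                 # one layer / one iteration
        print(f"stage {i} done on {device}")
    return state

layers = [lambda s: s + 1 for _ in range(3)] # stand-in "layers"
run_in_stages(layers, "terminal server")     # fast, lower-accuracy algorithm
run_in_stages(layers, "cloud server")        # slower, higher-accuracy algorithm
```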
  • Since the computing power of the terminal server is low, the computing performance of the corresponding first machine learning algorithm is also low.
  • Since the computing power of the cloud server is high, the computing performance of the corresponding second machine learning algorithm is also high.
  • Computing the corresponding task of the first machine learning algorithm at each stage in the terminal server 20 therefore yields a lower-accuracy terminal computing result more quickly.
  • Although computing the task of the second machine learning algorithm at each stage in the cloud server 10 takes longer, it yields a higher-accuracy cloud computing result. Therefore, although the terminal operation result is obtained faster than the cloud operation result, the cloud operation result is more accurate.
  • For example, the terminal server 20 may obtain, faster than the cloud server 10, the result that the animal in an image is a cat, while the cloud server 10 may obtain a more accurate result such as the cat's breed.
  • the terminal server 20 is further used to parse the terminal server control instruction to obtain a terminal control signal, to calculate the computing task of the corresponding first machine learning algorithm at each stage according to the terminal control signal to obtain the terminal operation result, and to send the cloud server control instruction to the cloud server 10.
  • the cloud server 10 is configured to receive the cloud server control instruction, parse it to obtain a cloud control signal, and calculate the computing task of the corresponding second machine learning algorithm at each stage according to the cloud control signal to obtain the cloud computing result.
  • the hardware performance parameters include computing capability;
  • the terminal server 20 is specifically used to obtain the computing capability of the terminal server 20 and the computing capability of the cloud server 10, to select the first machine learning algorithm according to the computing task and the computing power of the terminal server, and to select the second machine learning algorithm according to the computing task and the computing power of the cloud server.
  • the hardware performance parameters of the terminal server 20 include the computing capabilities of the terminal server 20
  • the hardware performance parameters of the cloud server 10 include the computing capabilities of the cloud server 10.
  • the computing capability can be obtained from the configuration information preset by the computing module.
  • The computing power of a server affects its computing speed; based on the computing power of the computing module, a more suitable machine learning algorithm can be selected more accurately.
  • the first machine learning algorithm includes a first neural network model
  • the second machine learning algorithm includes a second neural network model.
  • a neural network model is used as an example to specifically describe that the machine learning operation distribution system is specifically applied to the distribution of neural network operations, and the distribution system includes:
  • the terminal server 20 is used to obtain the demand information, the hardware performance parameters of the terminal server 20, and the hardware performance parameters of the cloud server 10; to generate a corresponding computing task according to the demand information; to select the first neural network model running on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20, and to select the second neural network model running on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10; to generate terminal server control instructions based on the selected first neural network model and the computing task, and cloud server control instructions based on the selected second neural network model and the computing task;
  • and to parse the terminal server control instruction to obtain a terminal control signal, calculate the corresponding first neural network model computing task according to the terminal control signal to obtain a terminal operation result, and send the cloud server control instruction to the cloud server 10.
  • the cloud server 10 is used to receive the cloud server control instruction, parse the cloud server control instruction to obtain a cloud control signal, and calculate the corresponding second neural network model operation task according to the cloud control signal to obtain the cloud operation result.
  • When the computing task must be completed according to the user's demand information, the task is executed on the terminal server and the cloud server respectively, so that the same computing task is completed using different neural network models and calculation results of different accuracy are obtained.
  • Based on the different neural network models, the terminal server generates terminal server control instructions for control in the terminal server and cloud server control instructions for control in the cloud server.
  • The terminal computing result can be output first, which avoids a long wait for the user, improves processing efficiency, and makes full use of the computing resources of the terminal server and the cloud server, so that the same computing task can be executed on both the terminal server and the cloud server.
  • the terminal server 20 is further configured to, after outputting the terminal operation result, upon receiving the operation stop instruction, send the operation stop instruction to the cloud server 10 to terminate The computing work of the cloud server 10.
  • After the terminal operation result is output, the user has an operation result with lower accuracy. If the user wants a more accurate result, the user can wait for the cloud server 10 to complete its calculation and output the cloud computing result through the terminal server 20; the user then obtains both a lower-accuracy and a higher-accuracy operation result. However, if the lower-accuracy result already meets the user's needs and a more accurate result is not wanted, the user can input a stop operation instruction through the user terminal. After receiving the stop operation instruction, the distribution system terminates the calculation work of the cloud server 10; that is, the high-accuracy operation result is either never completed or, even if completed, is no longer output.
  • the user can choose to get only one operation result with lower accuracy, which can save the user's time, and can guarantee the operation performance of the machine learning operation distribution system, and avoid the waste of operation resources.
  • the terminal server 20 includes a terminal controller unit 210, a terminal arithmetic unit 220, and a terminal communication unit 230; the terminal controller unit 210 is connected to the terminal arithmetic unit 220 and the terminal communication unit 230, respectively.
  • the terminal controller unit 210 is used to obtain the demand information, the hardware performance parameters of the terminal server 20, and the hardware performance parameters of the cloud server 10; to generate a corresponding computing task according to the demand information; to select the first machine learning algorithm running on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20, and to select the second machine learning algorithm running on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10; to generate terminal server control instructions based on the first machine learning algorithm and the computing task, and cloud server control instructions based on the second machine learning algorithm and the computing task; and to parse the terminal server control instruction to obtain a terminal control signal.
  • the terminal operation unit 220 is used to calculate the operation task of the corresponding first machine learning algorithm according to the terminal control signal to obtain a terminal operation result; the terminal communication unit 230 is used to send the cloud server control instruction to the cloud server 10.
  • The terminal controller unit 210 obtains the demand information input by the user, generates the corresponding computing task, and produces an evaluation result based on the hardware performance parameters of the terminal server 20 and the cloud server 10, such as computing capability, energy consumption, accuracy, and speed. It then selects a suitable first machine learning algorithm for the terminal server and a suitable second machine learning algorithm for the cloud server based on the demand information and the evaluation result, and generates different control instructions according to the computing power of these different machine learning algorithms.
  • The instruction set including the control instructions is pre-stored in the terminal server 20 and the cloud server 10, and the terminal controller unit 210 generates the terminal server control instruction for the terminal server 20 and the cloud server control instruction for the cloud server 10 according to the input demand information.
  • As an embodiment, the following mathematical model may be used.
  • First obtain the hardware performance index, the maximum number of floating-point/fixed-point operations per second, recorded as the parameter C; then analyze the computing needs, first judging the macro neural network model function g(x), that is, whether to choose a CNN, an RNN, a DNN, etc.
  • CNNs and DNNs are used more in the field of image vision, while RNNs are used more in the fields of text and audio. This basic filtering quickly narrows down the suitable neural network type; the candidates are then filtered based on energy consumption W, accuracy R, and speed S.
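  • A minimal sketch of this selection model (the candidate table, numeric values, and selection rule below are illustrative assumptions, not the patent's actual model):

```python
# Candidates: (name, type, required op rate C, energy W, accuracy R, speed S).
CANDIDATES = [
    ("AlexNet",  "CNN", 1.5, 10, 0.80, 0.9),
    ("ResNet",   "CNN", 8.0, 60, 0.93, 0.3),
    ("SmallRNN", "RNN", 0.5,  5, 0.75, 1.0),
]

def select_model(task_field, hw_C, max_W, min_R):
    # Step 1: macro choice g(x) of network type by application field.
    wanted = "CNN" if task_field == "vision" else "RNN"
    pool = [m for m in CANDIDATES if m[1] == wanted]
    # Step 2: filter by the hardware's peak op rate C and the W/R constraints.
    pool = [m for m in pool if m[2] <= hw_C and m[3] <= max_W and m[4] >= min_R]
    # Step 3: among the survivors, prefer the fastest (largest S).
    return max(pool, key=lambda m: m[5], default=None)

# Low-power terminal picks the light model; the cloud can afford the heavy one.
print(select_model("vision", hw_C=2.0, max_W=20, min_R=0.78))   # AlexNet
print(select_model("vision", hw_C=16.0, max_W=100, min_R=0.90)) # ResNet
```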
  • the terminal controller unit 210 evaluates by establishing a mathematical model of parameters such as energy consumption, speed, and accuracy, and then selects the machine learning algorithm most suitable for the terminal server 20 and the cloud server 10, and performs training or inference.
  • The hardware configuration of the terminal server 20 can be obtained directly through the system, for example via Android/iOS system calls; the hardware configuration of the cloud server 10 is obtained by the terminal server 20 sending a request to the cloud server 10 through the terminal communication unit 230 and receiving the returned configuration information.
  • the terminal controller unit 210 also parses the terminal server control instruction to obtain a terminal control signal, and the terminal controller unit 210 sends the terminal control signal to the terminal arithmetic unit 220 and the terminal communication unit 230.
  • the terminal operation unit 220 receives the corresponding terminal control signal, and calculates the operation task of the corresponding first machine learning algorithm according to the terminal control signal to obtain the terminal operation result.
  • the terminal communication unit 230 is used to send the cloud server control instruction to the cloud server 10.
  • the above-mentioned first machine learning algorithm includes a first neural network model.
  • the cloud server 10 includes a cloud controller unit 110, a cloud computing unit 120, and a cloud communication unit 130; the cloud controller unit 110 is connected to the cloud computing unit 120 and the cloud communication unit 130, respectively, and the cloud communication unit 130 is connected to the terminal communication unit 230 for data interaction between the cloud server 10 and the terminal server 20.
  • the cloud communication unit 130 is used to receive the cloud server control instruction, send the cloud server control instruction to the cloud controller unit 110, and obtain the cloud computing result and send it to the terminal server 20;
  • the cloud controller unit 110 is used to receive the cloud server control instruction, and parse the cloud server control instruction to obtain a cloud control signal;
  • the cloud computing unit 120 is used to calculate the operation task of the corresponding second machine learning algorithm according to the cloud control signal to obtain the cloud computing result, and the cloud computing result is sent to the terminal server 20 through the cloud communication unit 130.
  • the terminal controller unit 210 sends the generated cloud server control instruction to the cloud server 10 through the terminal communication unit 230.
  • the cloud communication unit 130 receives the cloud server control instruction and sends it to the cloud controller unit 110.
  • the cloud controller unit 110 parses the cloud server control instruction to obtain the cloud control signal and sends it to the cloud computing unit 120 and the cloud communication unit 130.
  • the cloud computing unit 120 receives the corresponding cloud control signal, calculates the computing task of the corresponding second machine learning algorithm according to the cloud control signal, and obtains the cloud computing result.
  • the above second machine learning algorithm includes a second neural network model.
  • Data communication between the cloud server 10 and the terminal server 20 proceeds alongside their respective calculation processes.
  • the terminal communication unit 230 sends data to the cloud communication unit 130 according to the corresponding terminal control signal; in turn, the cloud communication unit 130 also sends data to the terminal communication unit 230 according to the corresponding cloud control signal. Since the terminal server 20 is to obtain a low-accuracy operation result, the operation time consumed is short. After the operation of the terminal server 20 is completed, the terminal operation result is first sent to the user's terminal device.
  • The cloud communication unit 130 then sends the cloud calculation result to the terminal communication unit 230, and the terminal server 20 sends the cloud computing result to the user's terminal device.
  • the terminal communication unit 230 and the cloud communication unit 130 respectively perform data transmission between the terminal server 20 and the cloud server 10 through a communication protocol.
  • the terminal server 20 further includes a terminal storage unit 240.
  • the terminal storage unit 240 is connected to the terminal arithmetic unit 220 and the terminal controller unit 210, respectively.
  • the terminal storage unit 240 is used to receive input data from the terminal server 20 and perform Terminal data storage.
  • the terminal storage unit 240 may determine the terminal input data, store that data, and store the terminal operation process according to the terminal server control instruction generated by the terminal instruction generation circuit 210b.
  • the stored data format may be a floating point number or a quantized fixed point number.
  • the terminal storage unit 240 may be a device or storage space capable of storing data, such as SRAM or DRAM, for storing terminal data and terminal instructions.
  • the data includes but is not limited to at least one of input neurons, output neurons, weights, images, and vectors.
  • In this embodiment, the terminal operation unit 220 and the terminal storage unit 240 are two separate components. After the terminal operation unit 220 completes an operation, the terminal operation result is first transferred to the terminal storage unit 240; the terminal storage unit 240 and the terminal communication unit 230 then encode and transmit the terminal operation result, and during the encoding and transmission the terminal operation unit 220 has already started the next round of operation. This working mode does not cause excessive waiting delay.
  • The equivalent computing time of each round is the actual computing time plus the transfer (dump) time. Since the transfer time is much shorter than the encoding and transmission time, this method can fully mobilize the computing power of the terminal operation unit 220 and keep it as busy as possible.
  • the corresponding terminal server control command can be generated in the terminal command generation circuit 210b according to the above-mentioned working mode.
  • the implementation of this part may be entirely implemented by an algorithm, and the CPU device of the terminal server 20 itself may be used.
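  • A minimal sketch of this operation-storage-communication working mode (the timings below are illustrative assumptions); computation of the next round overlaps with transmission of the previous result, so each round effectively costs only the computing time plus the fast dump:

```python
import threading
import time

def compute(round_no):
    time.sleep(0.05)                 # actual computing time
    return f"result-{round_no}"

def encode_and_send(result):
    time.sleep(0.20)                 # encoding + transmission, the slow part
    print("sent", result)

sender = None
for n in range(3):
    result = compute(n)              # next round computes while previous sends
    stored = result                  # "dump" to the storage unit (fast)
    if sender is not None:
        sender.join()                # previous transmission finishes meanwhile
    sender = threading.Thread(target=encode_and_send, args=(stored,))
    sender.start()
sender.join()
```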
  • the cloud server 10 further includes a cloud storage unit 140.
  • the cloud storage unit 140 is connected to the cloud computing unit 120 and the cloud controller unit 110 respectively.
  • the cloud storage unit 140 is used to receive cloud input data and perform cloud data Storage.
  • the cloud storage unit 140 may determine the cloud input data according to the cloud server control instruction and store the data and store the cloud computing process.
  • the stored data format may be a floating point number or a quantized fixed point number.
  • the cloud storage unit 140 may be a device or storage space capable of storing data, such as SRAM or DRAM, for storing cloud data and cloud instructions.
  • the data includes but is not limited to at least one of input neurons, output neurons, weights, images, and vectors.
  • In this embodiment, the cloud computing unit 120 and the cloud storage unit 140 are two separate components. After the cloud computing unit 120 completes an operation, the cloud computing result is first transferred to the cloud storage unit 140; the cloud storage unit 140 and the cloud communication unit 130 then encode and transmit the cloud operation result, and during the encoding and transmission the cloud computing unit 120 has already started the next round of calculation. This working mode does not cause excessive waiting delay.
  • The equivalent computing time of each round is the actual computing time plus the transfer (dump) time. Since the transfer time is much shorter than the encoding and transmission time, this method can fully mobilize the computing capability of the cloud computing unit 120 and keep it as busy as possible. It should be noted that the corresponding cloud server control instruction can be generated in the terminal instruction generation circuit 210b according to the above working mode.
  • the terminal controller unit 210 includes a terminal evaluation circuit 210a, a terminal instruction generation circuit 210b, and a terminal instruction analysis circuit 210c; the terminal instruction generation circuit 210b is connected to the terminal evaluation circuit 210a and the terminal instruction analysis circuit 210c, respectively.
  • The terminal evaluation circuit 210a, the terminal instruction generation circuit 210b, and the terminal instruction analysis circuit 210c are each connected to the terminal operation unit 220, the terminal storage unit 240, and the terminal communication unit 230.
  • the terminal evaluation circuit 210a is used to obtain the demand information, the hardware performance parameters of the terminal server 20, and the hardware performance parameters of the cloud server 10; to generate a corresponding computing task according to the demand information; and to select the first machine learning algorithm running on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20, and the second machine learning algorithm running on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10;
  • the terminal instruction generation circuit 210b is used to generate terminal server control instructions based on the first machine learning algorithm and the computing task, and to generate cloud server control instructions based on the second machine learning algorithm and the computing task;
  • the terminal instruction analysis circuit 210c is used to analyze the terminal server control instruction to obtain a terminal control signal.
  • The terminal evaluation circuit 210a obtains the demand information input by the user and, based on the demand information and the hardware performance parameters of the terminal server 20 and the cloud server 10, selects a first machine learning algorithm with lower computing power for the terminal and a second machine learning algorithm with higher computing power for the cloud.
  • After the selection is completed, the terminal instruction generation circuit 210b generates the corresponding terminal server control instructions and cloud server control instructions according to the low computing power of the first machine learning algorithm for the terminal server 20 and the high computing power of the second machine learning algorithm for the cloud server 10.
  • the control instructions in the terminal server control instructions and the cloud server control instructions can include operation allocation instructions, memory access instructions and data communication instructions, respectively.
  • the terminal server control instruction is used for control in the terminal server 20.
  • the cloud server control instruction is sent to the cloud communication unit 130 through the terminal communication unit 230, and then sent to the cloud controller unit 110 by the cloud communication unit 130 to be stored in the cloud server 10.
  • the terminal instruction analysis circuit 210c is used to analyze the terminal server control instruction to obtain a terminal control signal, and to cause the terminal operation unit 220, the terminal storage unit 240, and the terminal communication unit 230 to operate in accordance with the terminal server control instruction based on the terminal control signal.
  • the allocation method used by the operation allocation scheme may be: the same operation task is allocated according to the different computing capabilities, precision, speed, and energy consumption of the machine learning algorithms; that is, different machine learning algorithms are used, but they complete the same computing task.
  • the terminal server 20 and the cloud server 10 can calculate the same calculation task at the same time, or can calculate the same calculation task at different times, or select a pair of calculation tasks according to the user's needs for calculation.
  • For example, the AlexNet neural network model has low computing power, but its space-time cost is minimal.
  • The higher computing power of the ResNet neural network model comes at the cost of more energy consumption.
  • A neural network model with low computing power gives a less accurate operation result, which can still be within the range accepted by the user.
  • A neural network model with low computing power also requires lower power consumption and a reasonable inference time. Therefore, given the lower performance of the terminal server 20 compared to the cloud server 10, the calculation of the first neural network model with lower computing power can be completed in the terminal server 20, while the calculation of the second neural network model with high computing power is completed in the cloud server 10. It is then up to the user's needs to decide whether to further obtain the high-precision operation classification result. In this way, a low-accuracy calculation result can be provided to the user first, avoiding a long waiting time, while also offering the user a choice of scenarios.
  • the memory access instruction is a memory management instruction based on calculation allocation, and is used to control the terminal storage unit 240 or the cloud storage unit 140 to perform data storage.
  • the data communication instruction is a data interaction instruction to the cloud server 10 and the terminal server 20, and is used to control the data communication between the terminal communication unit 230 and the cloud communication unit 130.
  • system-level scheduling of multiple terminal servers 20 and one cloud server 10 can be performed, and multiple terminal servers 20 and one cloud server 10 jointly complete a system-level task with high complexity.
  • the cloud controller unit 110 includes a cloud command parsing circuit 110a, and the cloud command parsing circuit 110a is connected to the cloud computing unit 120, the cloud storage unit 140, and the cloud communication unit 130, respectively.
  • the cloud instruction parsing circuit 110a is used to receive the cloud server control instruction, and parse the cloud server control instruction to obtain the cloud control signal, and enable the cloud computing unit 120 and the cloud storage unit according to the cloud control signal 140 and the cloud communication unit 130 operate according to the cloud server control instructions.
  • the operation principles of the cloud computing unit 120, the cloud storage unit 140, and the cloud communication unit 130 are the same as those of the terminal computing unit 220, the terminal storage unit 240, and the terminal communication unit 230 described above, and are not repeated here.
  • The cloud instruction parsing circuit 110a obtains the cloud control signal by parsing the cloud server control instruction and sends it to the other components of the cloud server 10, so that the cloud server 10 can complete the calculation of the cloud neural network in an orderly manner, greatly speeding up the computing of the cloud neural network.
  • the terminal arithmetic unit 220 is connected to the terminal communication unit 230, and the terminal storage unit 240 is connected to the terminal communication unit 230.
  • the terminal communication unit 230 may encode and send the output data of the terminal operation unit 220 and the terminal storage unit 240 to the cloud communication unit 130. Conversely, the terminal communication unit 230 may also receive the data sent by the cloud communication unit 130, decode the data, and send it to the terminal arithmetic unit 220 and the terminal storage unit 240 again.
  • the task amount of the terminal controller unit 210 can be reduced, so that the terminal controller unit 210 can complete the generation process of the control instruction in more detail.
  • the cloud computing unit 120 is connected to the cloud communication unit 130
  • the cloud storage unit 140 is connected to the cloud communication unit 130.
  • the cloud communication unit 130 may encode the output data of the cloud computing unit 120 and the cloud storage unit 140 and send it to the terminal communication unit 230. Conversely, the cloud communication unit 130 may also receive the data sent by the terminal communication unit 230 and decode the data and send it to the cloud computing unit 120 and the cloud storage unit 140 again.
  • the terminal computing unit 220 may be a computing component of the terminal server 20 itself, and the cloud computing unit 120 may be a computing component of the cloud server 10 itself.
  • the computing component can be a CPU, a GPU, or a neural network chip.
  • the terminal operation unit 220 and the cloud computing unit 120 may be operation units in the data processing unit of an artificial neural network chip, which perform the corresponding operations on data according to the control instructions stored in the storage unit (the terminal storage unit 240 or the cloud storage unit 140).
  • In the following description, the cloud controller unit 110 and the terminal controller unit 210 are both referred to as the controller unit 311,
  • and the cloud computing unit 120 and the terminal computing unit 220 are both referred to as the computing unit 312.
  • the arithmetic unit 312 includes: a master processing circuit 3101 and a plurality of slave processing circuits 3102;
  • the controller unit 311 is used to obtain input data and calculation instructions; in an optional solution, the input data and calculation instructions may be obtained through a data input and output unit, which may specifically be one or more data I/O interfaces or I/O pins.
  • the above calculation instructions include but are not limited to: forward operation instructions or reverse training instructions, or other neural network operation instructions, etc., such as convolution operation instructions.
  • the specific implementation of the present application does not limit the specific expression form of the above calculation instructions.
  • the controller unit 311 is further configured to parse the calculation instruction to obtain a plurality of calculation instructions, and send the plurality of calculation instructions and the input data to the main processing circuit;
  • the main processing circuit 3101 is configured to perform pre-processing on the input data and transfer data and operation instructions with the multiple slave processing circuits;
  • a plurality of slave processing circuits 3102 configured to execute intermediate operations in parallel based on data transmitted from the master processing circuit and operation instructions to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
  • the main processing circuit 3101 is configured to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
  • The technical solution provided in this application arranges the computing unit in a one-master multi-slave structure.
  • For the calculation instructions of the forward operation, it can split the data so that multiple slave processing circuits perform parallel operations on the computation-intensive part, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
  • the above machine learning calculation may specifically include: artificial neural network operation
  • the above input data may specifically include: input neuron data and weight data.
  • the above calculation result may specifically be the result of the artificial neural network operation, namely the output neuron data.
  • The operation in the neural network may be one layer of operation in the neural network.
  • For a multi-layer neural network, the implementation process is as follows. In the forward operation, when the operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer uses the output neuron calculated in the operation unit as the input neuron of the next layer (or performs some operations on that output neuron and then uses it as the input neuron of the next layer), and at the same time the weights are replaced with the weights of the next layer. In the reverse operation, when the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer uses the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer (or performs some operations on that input neuron gradient and then uses it as the output neuron gradient of the next layer), and the weights are replaced with the weights of the next layer.
  • the above machine learning calculation may also include support vector machine operation, k-nearest neighbor (k-NN) operation, k-means operation, principal component analysis operation, and so on.
  • For the artificial neural network operation, if it has multiple layers of operation, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons, and the neurons in the upper layer of the forward operation are the output neurons.
  • Taking a neural network with L layers as an example, for K = 1, 2, ..., L-1, the Kth layer and the (K+1)th layer form a pair: the Kth layer is called the input layer and its neurons are the input neurons, while the (K+1)th layer is called the output layer and its neurons are the output neurons.
  • That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
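  • A minimal sketch of this layer relation (layer sizes, weights, and the activation used below are illustrative assumptions): the output neurons of layer K directly become the input neurons of layer K+1:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                        # neurons per layer (assumed)
L = len(sizes) - 1                          # number of weighted layers
weights = [rng.normal(size=(sizes[k], sizes[k + 1])) for k in range(L)]

neurons = rng.normal(size=sizes[0])         # input neurons of the first layer
for K in range(L):
    # Layer K's output neurons are layer K+1's input neurons.
    neurons = np.maximum(neurons @ weights[K], 0.0)  # weighted sum + activation
print("output neurons of the top layer:", neurons)
```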
  • the controller unit includes: an instruction cache unit 3110, an instruction processing unit 3111, and a storage queue unit 3113;
  • the instruction cache unit 3110 is used to store the calculation instructions associated with the artificial neural network operation
  • the instruction processing unit 3111 is configured to parse the calculation instruction to obtain multiple operation instructions
  • the storage queue unit 3113 is used to store an instruction queue.
  • the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
  • the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically used to decode instructions into microinstructions.
  • the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically used to receive and process microinstructions.
  • the above microinstruction may be the next level instruction of the instruction.
  • the microinstruction can be obtained by splitting or decoding the instruction, and can be further decoded into control signals of each component, each unit, or each processing circuit.
  • The structure of the calculation instruction may be as shown in the following table:

    Operation code | Register 0 | Register 1 | Register 2 | Register 3 | Register 4

  • The calculation instruction may include one or more operation fields and an operation code.
  • The calculation instruction may include a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in the table above, register number 0 through register number 4 can be operation fields, and each of register number 0 through register number 4 may be the number of one or more registers.
  • the above register may be an off-chip memory. Of course, in actual application, it may also be an on-chip memory for storing data.
  • the controller unit may further include:
  • the dependency processing unit 3112 is configured, when there are multiple operation instructions, to determine whether the first operation instruction has an association relationship with the zeroth operation instruction preceding it; if the first operation instruction has an association relationship with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after execution of the zeroth operation instruction is completed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;
  • the determining whether there is an association relationship between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
  • extracting, according to the first operation instruction, the first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, the zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, it is determined that the first operation instruction and the zeroth operation instruction have an association relationship; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that they do not have an association relationship.
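  • A minimal sketch of this dependency check (the address values below are illustrative assumptions): two instructions are associated exactly when their required storage address intervals overlap:

```python
def intervals_overlap(first, zeroth):
    """Each interval is (start, end) with start <= end, in address units."""
    f_start, f_end = first
    z_start, z_end = zeroth
    return f_start <= z_end and z_start <= f_end

first_interval = (0x1000, 0x1FFF)    # addresses required by instruction 1
zeroth_interval = (0x1800, 0x2FFF)   # addresses required by instruction 0

if intervals_overlap(first_interval, zeroth_interval):
    print("associated: cache instruction 1 until instruction 0 completes")
else:
    print("independent: instruction 1 may be issued immediately")
```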
  • the arithmetic unit 312 may include a master processing circuit 3101 and multiple slave processing circuits 3102.
  • the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, where the K slave processing circuits are: the n slave processing circuits in the first row, the n slave processing circuits in the mth row, and the m slave processing circuits in the first column.
  • It should be noted that the K slave processing circuits are the slave processing circuits, among the plurality of slave processing circuits, that are directly connected to the master processing circuit.
  • The K slave processing circuits are used to transfer data and instructions between the master processing circuit and the plurality of slave processing circuits.
  • the main processing circuit 3101 may further include one or any combination of a conversion processing circuit 3101a, an activation processing circuit 3101b, and an addition processing circuit 3101c;
  • the conversion processing circuit 3101a is used to perform, on the data block or intermediate result received by the main processing circuit, the exchange between a first data structure and a second data structure (such as conversion between continuous data and discrete data), or the exchange between a first data type and a second data type (such as conversion between fixed-point and floating-point types);
  • the activation processing circuit 3101b is used to execute the activation operation of the data in the main processing circuit
  • the addition processing circuit 3101c is used to perform addition operation or accumulation operation.
  • the main processing circuit is used to determine that the input neuron is broadcast data and the weight is distribution data, to split the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of multiple operation instructions to the slave processing circuits;
  • the plurality of slave processing circuits are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the operation result to the master processing circuit;
  • the main processing circuit is configured to process a plurality of intermediate results sent from the processing circuit to obtain the result of the calculation instruction, and send the result of the calculation instruction to the controller unit.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result
  • the forwarding processing circuit (optional) is used to forward the received data block or product result.
  • An accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain the intermediate result.
• the operation instruction may be, for example, a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction.
• the actual formula that needs to be executed may be s = s(∑_i w_i · x_i + b): the weight w is multiplied by the input data x_i, the products are summed, the offset b is added, and the activation operation s(h) is applied to obtain the final output result s.
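• A minimal numerical illustration of this formula in Python (the sigmoid is only a stand-in for the unspecified activation s):

```python
import math

def neuron(weights, inputs, bias):
    """s = s(sum_i w_i * x_i + b); the sigmoid is chosen purely for illustration."""
    h = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-h))

print(neuron([0.5, -0.2], [1.0, 2.0], 0.1))  # activation of 0.5*1.0 - 0.2*2.0 + 0.1 = 0.2
```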
• the arithmetic unit includes a tree module 340, and the tree module 340 includes a root port 3401 and a plurality of branch ports 3402; the root port of the tree module is connected to the main processing circuit, and the multiple branch ports of the tree module are respectively connected to one slave processing circuit of the multiple slave processing circuits;
• the tree module has sending and receiving functions; that is, the tree module may perform a sending function in one case and a receiving function in another.
  • the tree module is used to forward data blocks, weights, and operation instructions between the master processing circuit and the multiple slave processing circuits.
• the tree module is an optional component of the computing device, and it may include at least one layer of nodes.
• the nodes are line structures with a forwarding function, and the nodes themselves may not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
• the tree-shaped module may be an n-ary tree structure, for example, a binary tree structure as shown in FIGS. 1-5F, and of course it may also be a ternary tree structure, where n may be an integer greater than or equal to 2.
  • the specific implementation of the present application does not limit the specific value of the above-mentioned n.
  • the above-mentioned number of layers may also be 2.
• the slave processing circuits may be connected to nodes of layers other than the penultimate-layer nodes, for example, to the nodes of the last layer shown in FIG. 1-5F.
• the above operation unit may carry a separate cache, as shown in FIGS. 1-5G, and may include a neuron cache unit 363, which caches the input neuron vector data and output neuron value data of the slave processing circuit.
  • the operation unit may further include: a weight buffer unit 364 for buffering weight data required by the slave processing circuit in the calculation process.
  • the operation unit 312 is shown in FIG. 1-5B, and may include a branch processing circuit 3103; its specific connection structure is shown in FIG. 1-5B, where,
  • the main processing circuit 3101 is connected to the branch processing circuit 3103 (one or more), and the branch processing circuit 3103 is connected to one or more slave processing circuits 3102;
  • the branch processing circuit 3103 is used to perform forwarding of data or instructions between the main processing circuit 3101 and the slave processing circuit 3102.
  • the controller unit obtains the input neuron matrix x, the weight matrix w and the fully connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully connected operation instruction to the main processing circuit;
• the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, then distributes the 8 sub-matrices to 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
  • the slave processing circuit executes the multiplication and accumulation operations of 8 sub-matrices and input neuron matrix x in parallel to obtain 8 intermediate results, and sends the 8 intermediate results to the main processing circuit;
• the main processing circuit is used to sort the 8 intermediate results to obtain the operation result of wx, to perform the operation of adding the offset b followed by the activation operation to obtain the final result y, and to send the final result y to the controller unit;
  • the final result y is output or stored in the storage unit.
  • the method for the computing device shown in FIG. 1-5A to execute the neural network forward operation instruction may specifically be:
• the controller unit extracts the neural network forward operation instruction, and the operation domain and at least one operation code corresponding to the neural network operation instruction, from the instruction storage unit; the controller unit transmits the operation domain to the data access unit and sends the at least one operation code to the arithmetic unit.
• the controller unit extracts the weight w and offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b need not be extracted) and transmits the weight w and offset b to the main processing circuit of the arithmetic unit; the controller unit also extracts the input data Xi from the storage unit and sends the input data Xi to the main processing circuit.
  • the main processing circuit determines the multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data, determines the weight data as distribution data, and splits the weight w into n data blocks;
• the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction and an accumulation instruction according to the at least one operation code and sends them to the main processing circuit; the main processing circuit broadcasts the multiplication instruction and the input data Xi to the multiple slave processing circuits and distributes the n data blocks to those slave processing circuits (for example, with n slave processing circuits, each slave processing circuit receives one data block); the multiple slave processing circuits perform multiplication on the input data Xi and the received data blocks according to the multiplication instruction to obtain intermediate results and send them to the main processing circuit; the main processing circuit performs an accumulation operation on the intermediate results sent by the multiple slave processing circuits according to the accumulation instruction to obtain an accumulation result, adds the offset b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit.
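• The following Python sketch simulates this broadcast/distribute flow on a single core (the strided split of w and the matching slices of Xi are illustrative assumptions, not the patent's data layout):

```python
def forward_instruction(x, w, b, num_slaves):
    """Simulate: broadcast x, split w into data blocks, multiply per 'slave',
    then accumulate the intermediate results in the 'master' and add the offset b."""
    blocks = [w[i::num_slaves] for i in range(num_slaves)]    # distribute weight blocks
    parts  = [x[i::num_slaves] for i in range(num_slaves)]    # slices of the broadcast input
    intermediates = [sum(wi * xi for wi, xi in zip(blk, xp))  # each slave: multiplication
                     for blk, xp in zip(blocks, parts)]
    return sum(intermediates) + b                             # master: accumulation + offset

print(forward_instruction([1, 2, 3, 4], [0.1, 0.2, 0.3, 0.4], 0.5, num_slaves=2))  # 3.5
```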
• the technical solution provided by this application realizes the multiplication and offset operations of the neural network through a single instruction, namely the neural network operation instruction; intermediate results of the neural network calculation need not be stored or extracted, which reduces storage and extraction operations for intermediate data. It therefore has the advantages of reducing the corresponding operation steps and improving the computational efficiency of the neural network.
• This application also discloses a machine learning computing device, which includes one or more of the computing devices mentioned in this application, and is used to obtain data to be computed and control information from other processing devices, perform the specified machine learning operations, and transfer the execution results to peripheral devices through the I/O interface.
• Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, Wi-Fi interfaces, and servers.
  • the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support larger-scale machine learning operations.
  • the interconnection method can be any interconnection topology.
  • the machine learning computing device has high compatibility, and can be connected with various types of servers through the PCIE interface.
  • the distribution method includes the following steps:
  • S702 Obtain demand information, hardware performance parameters of the terminal server, and hardware performance parameters of the cloud server.
• the user inputs his or her requirements through the terminal device, and the terminal server obtains the demand information input by the user.
• the demand information input by the user is mainly determined by three aspects: function requirement information, accuracy requirement information, and memory requirement information.
• regarding function requirement information: for example, the data set required to identify all animals and the data set required only to identify cats have an inclusion relationship. If the user only needs the functionality of a vertical field, the user's demand is input through the input acquisition unit of the control part, and the corresponding data set is selected according to the size of the user's own memory and the required precision.
  • the terminal server obtains demand information, hardware performance parameters of the terminal server and hardware performance parameters of the cloud server.
  • the hardware performance parameters may include computing power, energy consumption, speed, and accuracy.
• S704 Generate a corresponding computing task according to the demand information; select the first machine learning algorithm to run on the terminal server according to the computing task and the hardware performance parameters of the terminal server, and select the second machine learning algorithm to run on the cloud server according to the computing task and the hardware performance parameters of the cloud server.
• specifically, the terminal controller unit in the terminal server generates a corresponding computing task according to the demand information.
• the terminal evaluation circuit in the terminal controller unit evaluates the computing capacity, energy consumption, speed, and accuracy of the terminal server and the cloud server to establish a mathematical model, then selects the machine learning algorithms most suitable for the terminal server and the cloud server respectively, and subsequently performs training or inference.
  • S706 Generate a terminal server control instruction according to the first machine learning algorithm and the operation task, and generate a cloud server control instruction according to the second machine learning algorithm and the operation task.
• the terminal controller unit allocates the computing task for the terminal server according to the scale and computing power requirement of the first machine learning algorithm, and allocates the computing task for the cloud server according to the scale and computing power requirement of the second machine learning algorithm, so that the terminal server and the cloud server each complete the same computing task.
  • the terminal instruction generation circuit generates corresponding terminal server control instructions and cloud server control instructions based on user needs and selected data sets, and based on the computing power of different machine learning algorithms.
• the terminal communication unit and the cloud communication unit transmit the control instructions between the terminal server and the cloud server. Specifically, after the control instructions are generated, the terminal communication unit and the cloud communication unit transmit them between the terminal server and the cloud server through a communication protocol.
• with the above machine learning operation distribution method, when a computing task needs to be completed according to the user's demand information, the task is executed in both the terminal server and the cloud server, so that the same computing task serves different purposes through different machine learning algorithms and yields results with different degrees of accuracy. Specifically, the hardware performance parameters of the terminal server and the cloud server are first evaluated, a first machine learning algorithm with low computing power requirements is selected to run on the terminal server, and a second machine learning algorithm with high computing power requirements is selected to run on the cloud server. Based on these different machine learning algorithms, the terminal server generates terminal server control instructions that control the terminal server and cloud server control instructions that control the cloud server.
• in this way, the terminal computing results can be output first, which avoids long user waits, improves processing efficiency, and makes full use of the computing resources of the terminal server and the cloud server, so that the same computing task can be performed on both the terminal server and the cloud server.
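• A hypothetical Python sketch of this selection step (the fields flops, memory and accuracy are invented stand-ins for the computing power, energy consumption, speed and accuracy parameters named above):

```python
def select_algorithm(task, hw):
    """Pick the most accurate candidate algorithm whose demands fit the hardware."""
    feasible = [a for a in task["candidates"]
                if a["flops"] <= hw["flops"] and a["memory"] <= hw["memory"]]
    return max(feasible, key=lambda a: a["accuracy"])

task = {"candidates": [
    {"name": "small-net", "flops": 1e9,  "memory": 64,   "accuracy": 0.85},
    {"name": "large-net", "flops": 1e12, "memory": 4096, "accuracy": 0.97},
]}
terminal = {"flops": 5e9,  "memory": 128}    # weak terminal server
cloud    = {"flops": 1e13, "memory": 65536}  # strong cloud server
print(select_algorithm(task, terminal)["name"])  # small-net  -> first ML algorithm
print(select_algorithm(task, cloud)["name"])     # large-net  -> second ML algorithm
```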
  • the method further includes the following steps:
  • S708 Analyze the terminal server control instruction and the cloud server control instruction separately, obtain a terminal control signal according to the terminal server control instruction, and obtain a cloud control signal according to the cloud server control instruction.
  • the cloud instruction parsing circuit in the cloud controller unit analyzes the sent cloud server control instruction to obtain the cloud control signal
  • the terminal instruction analysis circuit analyzes the terminal server control instruction to obtain the terminal control signal.
  • S710 Extract terminal data to be processed according to the terminal control signal, and extract cloud data to be processed according to the cloud control signal.
  • the data to be processed includes one or more of training data or test data.
  • the cloud controller unit extracts the corresponding cloud training data or cloud test data according to the cloud control signal, and sends it to the buffer of the cloud computing unit.
• a certain memory space can be pre-allocated to enable interaction of data from intermediate stages of the computation.
  • the terminal controller unit extracts the corresponding terminal training data or terminal test data according to the terminal control signal, and sends it to the buffer of the terminal computing unit.
• a certain memory space can likewise be pre-allocated for interaction of data from intermediate stages of the computation.
• S712 Compute, according to the terminal data to be processed, the computing task of the first machine learning algorithm at each stage corresponding to the terminal server to obtain terminal operation results, and/or compute, according to the cloud data to be processed, the computing task of the second machine learning algorithm at each stage corresponding to the cloud server to obtain cloud operation results.
  • the terminal controller unit sends the terminal pending data to the terminal computing unit, and the terminal computing unit calculates the computing task of the first machine learning algorithm corresponding to each stage in the terminal server according to the transmitted terminal pending data .
  • the cloud controller unit sends the cloud pending data to the cloud computing unit, and the cloud computing unit calculates the computing task of the second machine learning algorithm corresponding to each stage in the cloud server according to the transmitted cloud pending data.
• the terminal communication unit sends data to the cloud communication unit according to the corresponding terminal control signal, and the cloud communication unit also sends data to the terminal communication unit according to the corresponding cloud control signal; the terminal operation result and the cloud operation result are sent to the user's terminal device through the terminal server.
  • S704 includes:
  • S7044 Select a first machine learning algorithm based on the computing task and the computing power of the terminal server, and select a second machine learning algorithm based on the computing task and the computing power of the cloud server.
• the computing power of the terminal server is weaker than that of the cloud server; therefore, correspondingly, a first machine learning algorithm requiring low computing power is selected according to the computing power of the terminal server, and a second machine learning algorithm requiring high computing power is selected according to the computing power of the cloud server.
  • the level of computing power affects the calculation time and calculation accuracy. For example, the second machine learning algorithm with higher computing power can obtain a more accurate calculation result, but the calculation time may be longer.
  • the distribution method further includes:
• the user can first obtain an operation result with lower accuracy. If the user wants a more accurate operation result, the user can wait for the cloud server operation to complete and then receive the cloud operation result through the terminal server; in this case the user obtains both a less accurate and a more accurate operation result. If, however, after obtaining the lower-accuracy result the user no longer wants the more accurate result, the user terminal inputs a stop operation instruction; the distribution system receives the stop operation instruction and terminates the cloud server operation, so that the higher-accuracy result is either never completed, or is completed but no longer output.
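• A Python sketch of this fast-result-first, cancellable-cloud-result behaviour (the threading scheme and function names are assumptions for illustration; the patent does not prescribe an execution model):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def dispatch(terminal_fn, cloud_fn, user_wants_precise):
    """The terminal result is output immediately; the cloud job is awaited
    only if the user still wants it, otherwise a stop is requested."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        cloud_future = pool.submit(cloud_fn)   # slower, higher accuracy
        rough = terminal_fn()                  # fast, lower accuracy
        print("terminal result:", rough)
        if user_wants_precise():
            print("cloud result:", cloud_future.result())
        else:
            cloud_future.cancel()              # the "stop operation instruction";
                                               # a real system would also signal
                                               # a job that has already started

dispatch(lambda: 0.85,
         lambda: (time.sleep(0.1), 0.97)[1],
         user_wants_precise=lambda: True)
```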
  • S708 specifically includes:
  • the terminal server is used to analyze the terminal server control instruction to obtain a terminal control signal
  • S7084 Extract corresponding terminal training data or terminal test data according to the terminal control signal.
  • the terminal instruction parsing circuit is used to parse the terminal server control instruction to obtain a terminal control signal, and extract corresponding terminal training data or terminal test data according to the terminal control signal.
  • the data includes images, audio, text, etc. Images include still pictures, pictures that make up videos, or videos. Audio includes vocal audio, music, noise, etc. Text includes structured text, text characters in various languages, etc.
  • S708 also includes:
  • the cloud server is used to parse the cloud server control instruction to obtain a cloud control signal
  • the cloud instruction parsing circuit is used to parse the cloud server control instruction to obtain a cloud control signal, and extract corresponding cloud training data or cloud test data according to the cloud control signal.
• S712 specifically includes:
  • S7124 Use the cloud server and calculate the computing task of the second machine learning algorithm at each stage corresponding to the cloud server according to the cloud training data or cloud test data to obtain cloud computing results.
  • the cloud computing unit executes the operation of the corresponding second machine learning algorithm at each stage according to the cloud training data or the cloud test data to obtain the cloud computing result.
  • the terminal operation unit executes the operation of the corresponding first machine learning algorithm at each stage according to the terminal training data or the terminal test data to obtain the terminal operation result.
  • the data communication between the terminal server and the cloud server is completed through the cloud communication unit and the terminal communication unit.
• the data communication between the computing part and the storage part of the cloud server and of the terminal server is forwarded through the cloud controller unit and the terminal controller unit respectively, and finally the cloud communication unit and the terminal communication unit interact with each other.
• since the neural network with low computing power requirements is used in the terminal server to compute the above computing tasks, a lower-accuracy computing result can be obtained first; then, based on the user's further demand information, the neural network with high computing power requirements in the cloud server can obtain a highly accurate computing result.
• although the steps in FIGS. 1-6 are displayed in sequence according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless clearly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-6 may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
• in the field of data processing, neural networks have been applied very successfully, but large-scale neural network operations require a large amount of computing time and energy, which poses serious challenges to the processing platform. Therefore, reducing the computation time and energy consumption of neural networks has become an urgent problem to be solved.
  • FIG. 2-1A is a schematic structural diagram of a computing device according to an embodiment of the present invention.
  • the computing device 100 includes:
  • the storage unit 1019 is configured to store weights and input neurons, and the weights include important bits and non-important bits;
• the controller unit 1029 is configured to obtain the important bits and non-important bits of the weight as well as the input neuron, and to transmit the important bits and non-important bits of the weight and the input neuron to the arithmetic unit 1039;
• the operation unit 1039 is configured to operate on the input neuron and the important bits to obtain a first operation result of the output neuron, to operate on the input neuron and the non-important bits to obtain a second operation result, and to use the sum of the first operation result and the second operation result as the output neuron.
• the data stored in the storage unit 1019 are input neurons or weights, which include floating-point data and fixed-point data; the sign bit and exponent part of floating-point data are designated as important bits, and the mantissa part as non-important bits; the sign bit and the first x bits of the numerical part of fixed-point data are designated as important bits, and the remaining bits of the numerical part as non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the fixed-point data.
  • the above-mentioned preset threshold can be set by the user or the system default.
  • the preset threshold can be 0, or can also be other integers, or decimals.
• the input neuron is represented by N_in and includes n bits, of which n1 are important bits and n2 are non-important bits; the value corresponding to the n1 important bits is represented by N1_in, and the value corresponding to the n2 non-important bits is represented by N2_in, so that n1 + n2 = n and N_in = N1_in + N2_in, where n is a positive integer and n1 is a natural number less than n.
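• A sketch of this split on an unsigned bit pattern (treating the top bit as the sign bit; the mask layout is an assumption made for illustration):

```python
def split_bits(value, total_bits, x):
    """Split an m-bit pattern into important (sign bit + first x value bits)
    and non-important parts so that value == N1 + N2."""
    important_mask = ((1 << (x + 1)) - 1) << (total_bits - x - 1)
    n1 = value & important_mask     # N1: important bits, kept in place
    n2 = value & ~important_mask    # N2: remaining non-important bits
    assert n1 + n2 == value         # N_in = N1_in + N2_in
    return n1, n2

print(split_bits(0b10110101, total_bits=8, x=3))  # (0b10110000, 0b00000101)
```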
• the positions of the n1 important bits may be continuous or discontinuous.
  • the operation unit 1039 when there are multiple input neurons, the operation unit 1039 includes multiple multipliers and at least one adder;
• the plurality of multipliers and the at least one adder are used to calculate the output neuron according to the following formula (the operation unit 1039 completes this operation through the multipliers and the adder): N_out = ∑_{i=1}^{T} N_in(i) · W(i) = ∑_{i=1}^{T} (N1_in(i) + N2_in(i)) · (W1(i) + W2(i)), where T is the number of input neurons, N_out is the output neuron, N_in(i) is the value of the i-th input neuron, W(i) is the value of the i-th weight, N1_in(i) and N2_in(i) are respectively the important and non-important bits of the i-th input neuron, and W1(i) and W2(i) are respectively the important and non-important bits of the i-th weight.
• the above operation may be applied to, for example, a fully connected layer, a convolution layer, or an LSTM layer of the neural network model.
• the operation unit 1039 further includes a comparator, and the operation unit 1039 is specifically configured to: when the comparison result of the comparator indicates that the first operation result is less than or equal to a preset threshold, skip the operation of the output neuron; if the first operation result is greater than the preset threshold, operate on the input neuron and the non-important bits to obtain a second operation result, and use the sum of the first operation result and the second operation result as the output neuron.
• the comparator in the arithmetic unit 1039 is thus mainly used for comparison operations: if the first operation result is less than or equal to the preset threshold, the operation of the current output neuron is skipped and the inner product operation of the next output neuron is executed.
• the final output neuron N_out is obtained as follows: the important bits and non-important bits of the weight are obtained, as well as the input neurons, and the input neurons are operated with the important bits to obtain the first operation result R1 of the output neuron; if R1 is less than or equal to the preset threshold, the operation of the current output neuron is skipped; if R1 is greater than the preset threshold, the input neurons are operated with the non-important bits to obtain the second operation result R2, and the sum N_out = R1 + R2 is used as the output neuron.
• in this way, when the prediction indicates that an output neuron does not require the full operation, the operation process of that output neuron is skipped.
• the new computing device thus integrates prediction into the computing method, predicting and skipping output neurons that do not need to be computed, thereby reducing the calculation time and energy consumption of the neural network.
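• A Python sketch of this predict-and-skip rule (returning 0 for a skipped neuron is an assumption; the text only states that the operation is skipped):

```python
def output_neuron(inputs, w_hi, w_lo, threshold):
    """First pass with important weight bits only; refine only above the threshold."""
    r1 = sum(x * w for x, w in zip(inputs, w_hi))   # first operation result
    if r1 <= threshold:
        return 0.0                                  # predicted negligible: skip refinement
    r2 = sum(x * w for x, w in zip(inputs, w_lo))   # second operation result
    return r1 + r2                                  # output neuron = r1 + r2

print(output_neuron([1.0, 2.0], [0.5, 0.25], [0.01, 0.02], threshold=0.0))  # 1.05
```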
  • the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
• the master processing circuit is used to split the input neuron into multiple data blocks, broadcast the important bits of the weight to the multiple slave processing circuits, and distribute the multiple data blocks to the multiple slave processing circuits;
  • the slave processing circuit is used to calculate the received data block and the important bits of the weight to obtain a partial result, and send the partial result to the master processing circuit;
  • the main processing circuit is also specifically used for splicing all the received partial results to obtain the first operation result.
  • the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit is connected to at least one slave processing circuit,
  • the branch processing circuit is configured to forward data blocks, broadcast data, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
• the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K of the plurality of slave processing circuits, the K slave processing circuits being: the p slave processing circuits in the first row, the p slave processing circuits in the q-th row, and the q slave processing circuits in the first column;
  • the K slave processing circuits are used for forwarding data and instructions between the master processing circuit and a plurality of slave processing circuits
• the main processing circuit is used to determine that the input neurons are distribution data and the important bits of the weight are broadcast data, to split the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of the multiple operation instructions to the K slave processing circuits;
• the K slave processing circuits are used to forward data between the master processing circuit and the plurality of slave processing circuits.
  • the main processing circuit includes one or any combination of an activation processing circuit and an addition processing circuit.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result.
  • the slave processing circuit further includes an accumulation processing circuit configured to perform an accumulation operation on the product result.
  • FIG. 2-1B is a schematic structural diagram of a layered storage device provided by an embodiment of the present application.
• the device includes an accurate storage unit and an inaccurate storage unit; the accurate storage unit is used to store the important bits in data, and the inaccurate storage unit is used to store the non-important bits in data.
• the accurate storage unit uses error checking and correction (ECC) memory, and the inaccurate storage unit uses non-ECC memory.
• the data stored in the hierarchical storage device are neural network parameters, including input neurons, weights and output neurons; the accurate storage unit stores the important bits of the input neurons, output neurons and weights, and the inaccurate storage unit stores the non-important bits of the input neurons, output neurons and weights.
• the data stored in the hierarchical storage device include floating-point data and fixed-point data; the sign bit and exponent part of floating-point data are designated as important bits and the mantissa part as non-important bits; the sign bit and the first x bits of the numerical part of fixed-point data are designated as important bits and the remaining bits of the numerical part as non-important bits, where x is an integer greater than or equal to 0 and less than m, and m is the total number of bits of the fixed-point data. The important bits are stored in ECC memory for accurate storage, and the non-important bits are stored in non-ECC memory for inaccurate storage.
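• An illustrative sketch of this routing (Python dicts stand in for the ECC and non-ECC memories; the mask follows the fixed-point rule above and is an assumption):

```python
class HierarchicalStore:
    """Important bits go to the 'precise' (ECC) store, the rest to the non-ECC store."""
    def __init__(self, total_bits=8, x=3):
        self.total_bits, self.x = total_bits, x
        self.ecc, self.non_ecc = {}, {}

    def write(self, addr, value):
        mask = ((1 << (self.x + 1)) - 1) << (self.total_bits - self.x - 1)
        self.ecc[addr] = value & mask        # important bits: accurate storage
        self.non_ecc[addr] = value & ~mask   # non-important bits: approximate storage

    def read(self, addr):
        return self.ecc[addr] + self.non_ecc[addr]  # splice the two parts on read

store = HierarchicalStore()
store.write(0x00, 0b10110101)
print(bin(store.read(0x00)))  # 0b10110101
```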
• the ECC memory includes dynamic random access memory (DRAM) with ECC check and static random access memory (SRAM) with ECC check, where the SRAM with ECC check may use 3T SRAM.
• the non-ECC memory includes DRAM without ECC check and SRAM without ECC check, where the SRAM without ECC check may likewise use 3T SRAM.
• the storage cell of each bit in 3T SRAM consists of 3 MOS transistors.
  • FIG. 2-1C is a schematic structural diagram of a 3T SRAM memory cell provided by an embodiment of the present application.
• the 3T SRAM memory cell is composed of 3 MOS transistors: M1 (the first MOS transistor), M2 (the second MOS transistor) and M3 (the third MOS transistor). M1 is used for gating, and M2 and M3 are used for storage.
• the gate of M1 is electrically connected to the word line (WL) and its source to the bit line (BL); the gate of M2 is connected to the source of M3 and, through resistor R2, to the operating voltage Vdd, and the drain of M2 is grounded; the gate of M3 is connected to the source of M2 and the drain of M1 and, through resistor R1, to the operating voltage Vdd, and the drain of M3 is grounded.
• WL controls gated access to the memory cell, and BL is used to read and write the memory cell. For a read operation, WL is pulled high and the bit is read from BL. For a write operation, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forced to be overwritten.
• the storage device of the present application uses approximate storage technology, which fully exploits the fault tolerance of the neural network: the important bits in the parameters are stored accurately while the unimportant bits are stored approximately, thereby reducing storage overhead and memory access energy consumption.
  • FIG. 2-1D is a schematic structural diagram of a data processing device according to an embodiment of the present application
  • the data processing device includes: an inaccurate arithmetic unit, an instruction control unit and the above-mentioned hierarchical storage device.
  • the hierarchical storage device receives instructions and operation parameters, and stores important bits and instructions in the operation parameters in an accurate storage unit, and stores non-important bits in the operation parameters in an inaccurate storage unit.
  • the instruction control unit receives the instructions in the hierarchical storage device, and decodes the instructions to generate control information to control the inexact computing unit to perform calculation operations.
  • the non-precision calculation unit receives the calculation parameters in the layered storage device, performs calculation according to the control information, and transmits the calculation result to the layered storage device for storage or output.
  • the non-precision computing unit is a neural network processor.
• the above operation parameters are neural network parameters.
• the hierarchical storage device is used to store the neurons, weights and instructions of the neural network; the important bits of the neurons, the important bits of the weights and the instructions are stored in the precise storage unit, while the non-important bits of the neurons and the non-important bits of the weights are stored in the imprecise storage unit.
  • the non-precision computing unit receives the input neurons and weights in the layered storage device, completes the neural network operation according to the control information to obtain output neurons, and retransmits the output neurons to the layered storage device for storage or output.
• the imprecise arithmetic unit can have two calculation modes: (1) it directly receives the important bits of the input neurons and the important bits of the weights from the precise storage unit of the hierarchical storage device and performs the calculation; (2) it receives input neurons and weights composed of both important and non-important bits and completes the calculation, where the important bits and non-important bits of the input neurons and weights are spliced together when read from the storage units.
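• A self-contained Python sketch of the two modes (plain lists stand in for the precise and imprecise stores; the example values are arbitrary):

```python
def mode1(ecc_x, ecc_w):
    """Mode (1): compute directly on the important bits from the precise store."""
    return sum(x * w for x, w in zip(ecc_x, ecc_w))

def mode2(ecc_x, non_x, ecc_w, non_w):
    """Mode (2): splice important + non-important bits when reading, then compute."""
    xs = [hi + lo for hi, lo in zip(ecc_x, non_x)]
    ws = [hi + lo for hi, lo in zip(ecc_w, non_w)]
    return sum(x * w for x, w in zip(xs, ws))

print(mode1([176, 160], [48, 16]))                  # approximate inner product
print(mode2([176, 160], [5, 2], [48, 16], [1, 3]))  # exact inner product
```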
• the data processing device further includes a preprocessing module for preprocessing the input raw data and transmitting it to the storage device; the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, normalization, and so on.
• the data processing device further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is provided between the hierarchical storage device and the instruction control unit and is used to store dedicated instructions. The input neuron hierarchical cache is set between the storage device and the imprecise arithmetic unit and is used to cache input neurons; it includes an input neuron precise cache and an input neuron imprecise cache, which cache the important bits and non-important bits of the input neurons respectively. The weight hierarchical cache is set between the storage device and the imprecise arithmetic unit and is used to cache weight data; it includes a weight precise cache and a weight imprecise cache, which cache the important bits and non-important bits of the weights respectively. The output neuron hierarchical cache is set between the storage device and the imprecise arithmetic unit and is used to cache output neurons; it includes an output neuron precise cache and an output neuron imprecise cache, which cache the important bits and non-important bits of the output neurons respectively.
• the data processing device further includes a direct memory access (DMA) unit, which is used to read or write data or instructions between the storage device and the instruction cache, weight hierarchical cache, input neuron hierarchical cache and output neuron hierarchical cache.
  • the inexact operation unit includes but is not limited to three parts, the first part is a multiplier, the second part is an addition tree, and the third part is an activation function unit.
• in the addition tree, the input data (in1) are accumulated step by step and then added to the input data (in2) to obtain the output data (out).
• the imprecise computing unit may also include a pooling unit, which performs a pooling operation on the input data (in) to obtain the output data (out): out = pool(in), where pool is the pooling operation, which includes but is not limited to average pooling, maximum pooling and median pooling, and the input data in belong to the pooling kernel associated with the output out.
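• A minimal Python illustration of out = pool(in) over one pooling kernel (the window is passed as a flat list, an assumption for simplicity):

```python
import statistics

def pool(window, kind="max"):
    """out = pool(in) for one pooling kernel; the three kinds named above."""
    if kind == "average":
        return sum(window) / len(window)
    if kind == "median":
        return statistics.median(window)
    return max(window)

print(pool([1, 5, 3, 2], "max"), pool([1, 5, 3, 2], "average"), pool([1, 5, 3, 2], "median"))
```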
• the operation performed by the imprecise arithmetic unit includes several parts: the first part multiplies input data 1 and input data 2 to obtain multiplied data; the second part performs the addition tree operation, which adds input data 1 step by step through the addition tree, or adds input data 1 through the addition tree and then adds input data 2, to obtain the output data; the third part performs the activation function operation, obtaining the output data through an activation (active) operation. The operations of the above parts can be freely combined to achieve various functions, as sketched below.
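• A sketch of the three parts chained together in Python (ReLU is only an illustrative choice of activation; the pairwise reduction mirrors the addition-tree stage):

```python
def add_tree(values):
    """Pairwise (tree) reduction, mirroring the addition-tree stage."""
    vals = list(values)
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

def inexact_unit(in1, in2, activate=lambda h: max(h, 0.0)):
    """Part 1 multiplies, part 2 reduces via the addition tree, part 3 activates."""
    products = [a * b for a, b in zip(in1, in2)]
    return activate(add_tree(products))

print(inexact_unit([1, 2, 3, 4], [0.5, 0.5, 0.5, 0.5]))  # relu(0.5+1+1.5+2) = 5.0
```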
  • the data processing device of the present application can make full use of the approximate storage technology, and fully exploit the fault tolerance of the neural network to reduce the calculation amount of the neural network and the memory access amount of the neural network, thereby reducing the calculation energy consumption and memory access energy consumption.
• Through the use of dedicated SIMD instructions and customized computing units for multi-layer artificial neural network operations, the problems of insufficient CPU and GPU computing performance and high front-end decoding overhead are solved, and support for multi-layer artificial neural network computing algorithms is effectively improved;
• Through the use of a dedicated on-chip cache for multi-layer artificial neural network arithmetic algorithms, the reusability of input neurons and weight data is fully exploited, avoiding repeatedly reading these data from memory, reducing memory access bandwidth, and preventing memory bandwidth from becoming a performance bottleneck for multi-layer artificial neural network operations and training algorithms.
• the data processing device may include a non-neural-network processor, for example a general-purpose arithmetic processor; general-purpose arithmetic has corresponding general-purpose arithmetic instructions and data, such as scalar arithmetic operations and scalar logic operations.
  • a general-purpose operation processor includes, for example but not limited to, one or more multipliers, one or more adders, and performs basic operations such as addition and multiplication.
  • the computing device 100 is presented in the form of a module.
  • Module here may refer to an application-specific integrated circuit (ASIC), a processor and memory that execute one or more software or firmware programs, an integrated logic circuit, and / or other devices that can provide the above functions .
• the controller unit 1029 and the arithmetic unit 1039 may be implemented by the devices shown in FIGS. 2-2 to 2-13.
  • a computing device for performing machine learning calculations.
• the computing device includes a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected to the arithmetic unit 12, and the arithmetic unit 12 includes a master processing circuit and a plurality of slave processing circuits;
• the controller unit 11 is used to obtain input data and calculation instructions; in an optional solution, the input data and calculation instructions may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.
  • the above calculation instructions include but are not limited to: forward operation instructions or reverse training instructions, or other neural network operation instructions, etc., such as convolution operation instructions.
  • the specific implementation of the present application does not limit the specific expression form of the above calculation instructions.
  • the controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of calculation instructions, and send the plurality of calculation instructions and the input data to the main processing circuit;
  • the main processing circuit 101 is configured to perform pre-processing on the input data and transfer data and operation instructions with the multiple slave processing circuits;
  • a plurality of slave processing circuits 102 configured to execute intermediate operations in parallel based on data transmitted from the master processing circuit and operation instructions to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
  • the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
• the technical solution provided in this application sets the computing unit in a one-master multi-slave structure, and for the calculation instructions of the forward operation it can split the data, so that multiple slave processing circuits can perform parallel operations on the computation-intensive parts, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
  • the above machine learning calculation may specifically include: artificial neural network operation
  • the above input data may specifically include: input neuron data and weight data.
  • the above calculation result may specifically be: the result of the operation of the artificial neural network outputs the neuron data.
• for the operation in the neural network, it can be the operation of one layer of the neural network.
• the implementation process is as follows: in the forward operation, when the operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neurons calculated in the arithmetic unit as the input neurons of the next layer (or performs certain operations on the output neurons and then uses them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer; in the reverse operation, when the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the input neuron gradients calculated in the arithmetic unit as the output neuron gradients of the next layer (or performs certain operations on the input neuron gradients and then uses them as the output neuron gradients of the next layer), and likewise replaces the weights with the weights of the next layer.
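• A compact Python sketch of the forward chaining (the layer representation and the ReLU activation are illustrative assumptions):

```python
def layer(x, weights, bias, activate):
    """One layer: output neurons = activate(W.x + b), computed per output neuron."""
    return [activate(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def forward(network, x):
    """Outputs of layer K become the inputs of layer K+1, as described above."""
    for weights, bias in network:
        x = layer(x, weights, bias, activate=lambda h: max(h, 0.0))
    return x

net = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1]),  # layer 1: 2 -> 2
       ([[1.0, 1.0]], [0.0])]                    # layer 2: 2 -> 1
print(forward(net, [1.0, 2.0]))                  # [0.9]
```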
  • the above machine learning calculation may also include support vector machine operation, k-nearest neighbor (k-nn) operation, k-mean (k-means) operation, principal component analysis operation and so on.
• the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the forward operation are the input neurons and the neurons in the upper layer of the forward operation are the output neurons. Taking a neural network with L layers as an example, for K = 1, 2, ..., L-1, the K-th layer is called the input layer, whose neurons are the input neurons, and the (K+1)-th layer is called the output layer, whose neurons are the output neurons; that is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
  • the above computing device may further include the storage unit 10 and the direct memory access unit 50.
• the storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used to store the calculation instructions, the register is used to store the input data and scalars, and the cache is a high-speed scratchpad cache.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
  • the instruction storage unit 110 is used to store calculation instructions associated with the artificial neural network operation
  • the instruction processing unit 111 is configured to parse the calculation instruction to obtain multiple operation instructions
  • the storage queue unit 113 is configured to store an instruction queue, and the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
  • the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically used to decode instructions into microinstructions.
  • the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically used to receive and process microinstructions.
  • the above microinstruction may be the next level instruction of the instruction.
  • the microinstruction can be obtained by splitting or decoding the instruction, and can be further decoded into control signals of each component, each unit, or each processing circuit.
  • the structure of the calculation instruction may be as shown in the following table.
  • the calculation instruction may include: one or more operation fields and an operation code.
• the calculation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3 and register number 4 may be operation domains, and each of these register numbers may refer to one or more registers.
  • the above register may be an off-chip memory. Of course, in practical applications, it may also be an on-chip memory for storing data.
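• A hypothetical Python rendering of this instruction layout (the field and opcode names are invented for illustration; the patent does not fix an encoding):

```python
from dataclasses import dataclass

@dataclass
class NeuralNetworkInstruction:
    """Hypothetical layout of the calculation instruction described above:
    one operation code plus register numbers 0..4 as the operation domains."""
    opcode: str   # e.g. "NN_FORWARD" (name invented for illustration)
    reg0: int     # each field may name one or more registers in a real encoding
    reg1: int
    reg2: int
    reg3: int
    reg4: int

instr = NeuralNetworkInstruction("NN_FORWARD", 0, 1, 2, 3, 4)
print(instr)
```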
• the controller unit may further include:
• the dependency processing unit 108, configured to determine, when there are multiple operation instructions, whether a first operation instruction is associated with a zeroth operation instruction preceding it; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the execution of the zeroth operation instruction is completed, the first operation instruction is extracted from the instruction storage unit and transmitted to the arithmetic unit;
  • the determining whether there is an association relationship between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
• Extracting, according to the first operation instruction, the first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, the zeroth storage address interval of the data required by the zeroth operation instruction. If the first storage address interval and the zeroth storage address interval have an overlapping area, it is determined that the first operation instruction and the zeroth operation instruction have an association relationship; if the first storage address interval does not overlap with the zeroth storage address interval, it is determined that the first operation instruction and the zeroth operation instruction do not have an association relationship.
  • the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
• a plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K of the plurality of slave processing circuits. The K slave processing circuits are: the p slave processing circuits in the first row, the p slave processing circuits in the q-th row, and the q slave processing circuits in the first column.
• it should be noted that each of the K slave processing circuits shown in the figures is a slave processing circuit directly connected to the master processing circuit among the plurality of slave processing circuits.
  • K slave processing circuits are used to transfer data and instructions between the master processing circuit and the plurality of slave processing circuits.
  • the main processing circuit may further include one or any combination of a conversion processing circuit 114, an activation processing circuit 115, and an addition processing circuit 116;
• the conversion processing circuit 114 is used to perform interchange between a first data structure and a second data structure (e.g., conversion between continuous data and discrete data) on the data block or intermediate result received by the main processing circuit, or to perform interchange between a first data type and a second data type (e.g., conversion between fixed-point and floating-point) on the data block or intermediate result received by the main processing circuit;
  • the activation processing circuit 115 is used to execute the activation operation of the data in the main processing circuit
  • the addition processing circuit 116 is used to perform an addition operation or an accumulation operation.
• the main processing circuit is used to determine that the input neurons are broadcast data and the weights are distribution data, to split the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of multiple operation instructions to the slave processing circuits;
  • the plurality of slave processing circuits are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the operation result to the master processing circuit;
  • the main processing circuit is configured to process a plurality of intermediate results sent from the processing circuit to obtain the result of the calculation instruction, and send the result of the calculation instruction to the controller unit.
  • the slave processing circuit includes: a multiplication processing circuit
  • the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result
  • the forwarding processing circuit (optional) is used to forward the received data block or product result.
  • An accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain the intermediate result.
• the operation instruction may be, for example, a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction.
• the actual formula that needs to be executed may be s = s(∑_i w_i · x_i + b): the weight w is multiplied by the input data x_i, the products are summed, the offset b is added, and the activation operation s(h) is applied to obtain the final output result s.
• the operation unit includes a tree module 40, which includes a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and the multiple branch ports of the tree module are respectively connected to one slave processing circuit of the multiple slave processing circuits;
• the tree module has sending and receiving functions; that is, the tree module may perform a sending function in one case and a receiving function in another.
  • the tree module is used to forward data blocks, weights, and operation instructions between the master processing circuit and the multiple slave processing circuits.
• the tree module is an optional component of the computing device, and it may include at least one layer of nodes.
• the nodes are line structures with a forwarding function, and the nodes themselves may not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
• the tree module may be a p-ary tree structure, for example, a binary tree structure as shown in FIGS. 2-7, and of course it may also be a ternary tree structure, where p may be an integer greater than or equal to 2.
  • the specific implementation of the present application does not limit the specific value of the above-mentioned p.
  • the above-mentioned number of layers may also be 2.
• the slave processing circuits may be connected to nodes of layers other than the penultimate-layer nodes, for example, to the nodes of the last layer shown.
• the operation unit may carry a separate buffer, as shown in FIGS. 2-8, and may include a neuron buffer unit 63, which buffers the input neuron vector data and output neuron value data of the slave processing circuit.
  • the operation unit may further include a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
  • the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIG. 2-3, where,
  • the main processing circuit 101 is connected to the branch processing circuit 103 (one or more), and the branch processing circuit 103 is connected to one or more slave processing circuits 102;
  • the branch processing circuit 103 is used to perform forwarding of data or instructions between the main processing circuit 101 and the slave processing circuit 102.
  • the controller unit obtains the input neuron matrix x, the weight matrix w and the fully connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully connected operation instruction to the main processing circuit;
• the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, then distributes the 8 sub-matrices to 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
  • the slave processing circuit executes the multiplication and accumulation operations of 8 sub-matrices and input neuron matrix x in parallel to obtain 8 intermediate results, and sends the 8 intermediate results to the main processing circuit;
  • the main processing circuit is used to sort the 8 intermediate results to obtain the operation result of wx, perform the operation of the offset b to perform the activation operation to obtain the final result y, and send the final result y to the controller unit
  • the final result y is output or stored in the storage unit.
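The following Python sketch mimics the master/slave flow with NumPy arrays standing in for the processing circuits and the tree module (the function name, the shapes, and the choice of tanh for f are illustrative assumptions, not the hardware described above):

```python
import numpy as np

def fully_connected_master_slave(x, w, b, n_slaves=8, activation=np.tanh):
    # Master: w is distribution data, split row-wise into n_slaves sub-matrices;
    # x is broadcast data, sent unchanged to every slave.
    sub_ws = np.array_split(w, n_slaves, axis=0)
    # Slaves: multiply-accumulate each sub-matrix with x (parallel in hardware).
    partials = [sub_w @ x for sub_w in sub_ws]
    # Master: reassemble the intermediate results in order to obtain wx,
    # then apply the offset b and the activation to get y = f(wx + b).
    wx = np.concatenate(partials, axis=0)
    return activation(wx + b)

x = np.random.rand(16)
w = np.random.rand(32, 16)
b = np.random.rand(32)
print(fully_connected_master_slave(x, w, b).shape)  # (32,)
```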
The method by which the computing device shown in FIG. 2-2 executes a neural network forward operation instruction may specifically be as follows. The controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the instruction, and at least one operation code from the instruction storage unit, transmits the operation domain to the data access unit, and sends the at least one operation code to the operation unit. The controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b does not need to be extracted) and transmits them to the main processing circuit of the operation unit; the controller unit also extracts the input data Xi from the storage unit and sends it to the main processing circuit. The main processing circuit determines, according to the at least one operation code, that the operation is a multiplication, determines the input data Xi to be broadcast data and the weight data to be distribution data, and splits the weight w into p data blocks. The instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction, and an accumulation instruction according to the at least one operation code and sends them to the main processing circuit. The main processing circuit broadcasts the multiplication instruction and the input data Xi to the multiple slave processing circuits and distributes the p data blocks among them (for example, with p slave processing circuits, each slave processing circuit receives one data block). The slave processing circuits multiply the input data Xi with the received data blocks according to the multiplication instruction to obtain intermediate results and send them to the main processing circuit, which accumulates the intermediate results according to the accumulation instruction to obtain an accumulation result, adds the offset b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit. In addition, the order of the addition and multiplication operations can be exchanged.

The technical solution provided by this application realizes the multiplication and offset operations of the neural network through a single instruction, the neural network operation instruction; the intermediate results of the neural network calculation need not be stored or extracted, which reduces the storage and extraction of intermediate data. It therefore has the advantages of reducing the corresponding operation steps and improving the computational efficiency of the neural network.
This application also discloses a machine learning computing device, which includes one or more of the computing devices mentioned in this application, for obtaining data to be operated on and control information from other processing devices, performing specified machine learning operations, and transferring the execution result to peripheral devices through the I/O interface. Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, wifi interfaces, and servers. When more than one computing device is included, the computing devices can be linked through a specific structure and transmit data, for example interconnected through the PCIE bus, to support larger-scale machine learning operations; in that case they may share the same control system or have their own independent control systems, may share memory or each have their own memory, and their interconnection may be any interconnection topology. The machine learning computing device has high compatibility and can be connected to various types of servers through the PCIE interface.
The present application also discloses a combined processing device, which includes the above machine learning computing device, a universal interconnection interface, and other processing devices. The machine learning computing device interacts with the other processing devices to complete the operation specified by the user; FIG. 2-10 is a schematic diagram of the combined processing device. The other processing devices include one or more types of general-purpose or special-purpose processors such as a central processing unit (CPU), a graphics processor (GPU), or a neural network processor; the number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, and complete basic control such as starting and stopping the machine learning computing device; the other processing devices can also cooperate with the machine learning computing device to complete computing tasks. The universal interconnection interface is used to transfer data and control instructions between the machine learning computing device and the other processing devices: the machine learning computing device obtains the required input data from the other processing devices and writes it into its on-chip storage device, obtains control instructions from the other processing devices and writes them into its on-chip control cache, and can also read the data in its storage module and transmit it to the other processing devices.

Optionally, the structure may further include a storage device connected to both the machine learning computing device and the other processing devices. The storage device is used to save data of the machine learning computing device and the other processing devices, and is particularly suitable for data that cannot be fully held in the internal storage of the machine learning computing device or the other processing devices. The combined processing device can be used as an SOC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the universal interconnection interface of the combined processing device is connected to certain components of the device, such as cameras, monitors, mice, keyboards, network cards, or wifi interfaces.
In some embodiments, a chip is also claimed, which includes the above machine learning computing device or combined processing device. In some embodiments, a chip packaging structure is claimed, which includes the above chip. In some embodiments, a board card is claimed, which includes the above chip packaging structure; FIG. 2-13 provides such a board card.
In addition to the above chip, the board card may also include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392. The storage device 390 is connected to the chip in the chip packaging structure through a bus and is used to store data; it may include multiple groups of storage units 393, each group connected to the chip by a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate synchronous dynamic random access memory). DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units, and each group may include multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers; of the 72 bits, 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s, as the back-of-the-envelope check below illustrates.
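This figure follows directly from the transfer rate and the data width (a simple check, not part of the original disclosure):

```python
# DDR4-3200: 3200 MT/s per pin; 64 of the controller's 72 bits carry data
# (the other 8 are ECC check bits and are not counted as payload).
transfers_per_second = 3200e6
data_bits = 64
bandwidth_mb_s = transfers_per_second * data_bits / 8 / 1e6
print(bandwidth_mb_s)  # 25600.0 MB/s, matching the theoretical bandwidth above
```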
In one embodiment, each group of storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip packaging structure and is used to realize data transmission between the chip and an external device (such as a server or a computer). In one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement the data transfer. Optionally, when a PCIE 3.0 X16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the calculation result of the chip is transmitted back to the external device (such as a server) by the interface device.
The control device is electrically connected to the chip and is used to monitor the state of the chip; specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller (Micro Controller Unit, MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, the chip can be in different working states such as multi-load and light-load, and the control device can regulate the working states of the multiple processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, an electronic device is claimed, which includes the above board card. Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices. The vehicles include airplanes, ships, and/or cars; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
In a specific implementation, the first operation result of the output neurons, i.e. $\sum_{i=1}^{T} N_{in}(i)\,W1(i)$, can be further divided into m1*m2 parts and computed by m1 slave operation modules over m2 passes (m2 >= 1); after k*m parts of data are computed in the slave operation modules (k >= 2), they can be passed to the master operation module for accumulation. For the master + interconnection module + slave architecture, the partial sums can also be accumulated in the interconnection module (for example, a k-ary tree, as shown in FIGS. 2-7). Furthermore, the multiplier in the slave operation module may be a parallel multiplier or a serial multiplier. Because this patent divides a value into important bits and non-important bits, the bit width of the important bits is variable: for example, with a total width of 16 bits, the important bits may be 3, 5, or 8 bits. A parallel multiplier would therefore always have to perform a full 16*16 multiplication, which is very wasteful; conversely, a serial multiplier can realize 3-, 5-, or 8-bit multiplication with only part of the multiplier hardware, and the power consumption is more favorable. A shift-and-add sketch of this idea follows.
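A minimal software model of a serial (shift-and-add) multiplier, looping only over the stated width of the multiplier; the function name and the example operands are assumptions for illustration:

```python
def serial_multiply(a, b, b_width):
    # One partial product per multiplier bit: with the important bits stored
    # separately, b_width can be 3, 5 or 8 instead of the full 16, so the
    # serial datapath does proportionally less work than a fixed 16x16
    # parallel multiplier.
    result = 0
    for i in range(b_width):
        if (b >> i) & 1:
            result += a << i
    return result

print(serial_multiply(25, 0b101, 3))  # 125, using only 3 multiplier bits
print(25 * 0b101)                     # reference result
```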
FIG. 2-14 is a schematic flowchart of a calculation method provided by an embodiment of the present invention. As shown in FIG. 2-14, the method includes: 1401, obtaining the important bits and non-important bits of the weight, as well as the input neuron; 1402, operating the input neuron with the important bits to obtain a first operation result of the output neuron; 1403, if the first operation result is less than or equal to a preset threshold, skipping the operation of the current output neuron; and 1404, if the first operation result is greater than the preset threshold, operating the input neuron with the non-important bits to obtain a second operation result, and taking the sum of the first and second operation results as the output neuron.
In a feasible embodiment, if the input neuron is denoted by $N_{in}$ and includes n bits, of which n1 bits are important bits and n2 bits are non-important bits, and if the value corresponding to the n1 important bits is denoted by $N1_{in}$ and the value corresponding to the n2 non-important bits by $N2_{in}$, then n1 + n2 = n and $N_{in} = N1_{in} + N2_{in}$, where n is a positive integer and n1 is a natural number less than n. The positions of the n1 important bits may be continuous or discontinuous.

Similarly, if the weight is denoted by W and includes w bits, of which w1 bits are important bits and w2 bits are non-important bits, and if the value corresponding to the w1 bits is denoted by W1 and the value corresponding to the w2 bits by W2, then w1 + w2 = w and W = W1 + W2, where w is a positive integer and w1 is a natural number less than w. The positions of the w1 important bits may likewise be continuous or discontinuous.

When there are multiple input neurons, the output neuron is computed as

$N_{out} = \sum_{i=1}^{T} N_{in}(i)\,W(i) = \sum_{i=1}^{T} N_{in}(i)\,W1(i) + \sum_{i=1}^{T} N_{in}(i)\,W2(i)$,

where T is the number of input neurons, $N_{out}$ is the output neuron, $N1_{in}(i)$ is the important-bit part of the i-th input neuron, $N2_{in}(i)$ is the non-important-bit part of the i-th input neuron, $W1(i)$ is the important-bit part of the i-th weight, $W2(i)$ is the non-important-bit part of the i-th weight, $N_{in}(i)$ is the value of the i-th input neuron, $W(i)$ is the value of the i-th weight, $N_{in}(i) = N1_{in}(i) + N2_{in}(i)$, and $W(i) = W1(i) + W2(i)$. The term $\sum_{i=1}^{T} N_{in}(i)\,W1(i)$ is computed first and taken as the first operation result. If this first operation result is less than or equal to the preset threshold, the operation of the current output neuron is skipped; if it is greater than the preset threshold, the input neurons are operated with the non-important bits to obtain the second operation result $\sum_{i=1}^{T} N_{in}(i)\,W2(i)$, and the sum of the first and second operation results is taken as the output neuron. A small numerical sketch of this prediction-and-skip flow follows.
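The sketch below assumes non-negative 16-bit fixed-point weights whose top 8 bits are treated as important and a preset threshold of 0; NumPy and all names here are illustrative assumptions rather than the hardware described above:

```python
import numpy as np

def split_bits(w, total_bits=16, important_bits=8):
    # Split fixed-point weights into high (important) and low (non-important)
    # parts so that w == w1 + w2, with each part kept at its place value.
    low_mask = (1 << (total_bits - important_bits)) - 1
    w2 = w & low_mask   # non-important (low) bits
    w1 = w - w2         # important (high) bits
    return w1, w2

def output_neuron(x, w, threshold=0):
    w1, w2 = split_bits(w)
    first = int(np.dot(x, w1))   # first operation result: important bits only
    if first <= threshold:
        return None              # prediction: skip this output neuron
    second = int(np.dot(x, w2))  # complete with the non-important bits
    return first + second        # equals np.dot(x, w) exactly

x = np.array([1, 2, 3])
w = np.array([40000, 300, 7], dtype=np.int64)
print(output_neuron(x, w), int(np.dot(x, w)))  # both 40621
```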
An embodiment of the present invention also provides a computer storage medium, which stores a computer program for electronic data exchange; the computer program causes a computer to perform some or all of the steps of any method described in the above method embodiments, the computer including an electronic device. An embodiment of the present invention also provides a computer program product, which includes a non-transitory computer-readable storage medium storing a computer program operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments; the computer program product may be a software installation package, and the computer includes an electronic device.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. The device embodiments described above are only schematic; for example, the division of the units is only a logical function division, and in actual implementation there may be another division manner: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other forms. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units, and some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Furthermore, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit; the above integrated unit can be implemented in the form of hardware.


Abstract

A distribution system for machine learning operations. When a first machine learning algorithm with lower computing capability is used in the terminal server to compute an operation task according to the terminal server control instruction, an operation result of lower accuracy can be obtained; when a second machine learning algorithm with higher computing capability is used in the cloud server to compute the same operation task according to the cloud server control instruction, an operation result of higher accuracy can be obtained. Flexibly using different machine learning algorithms to execute the same operation task in this way allows the user to obtain both a lower-accuracy operation result and a higher-accuracy operation result, thereby meeting the user's needs. Moreover, since the computing capability of the terminal server is weaker, the terminal operation result can be output first, which avoids a long wait for the user and improves processing efficiency.

Description

机器学习运算的分配系统及方法 技术领域
本发明涉及信息处理技术领域,特别是涉及一种机器学习运算的分配系统及方法。
背景技术
机器学习近些年来取得了重大突破,比如,在机器学习技术中,采用深度学习算法训练的神经网络模型在图像识别、语音处理、智能机器人等应用领域取得了令人瞩目的成果。深度神经网络通过建立模型来模拟人类大脑的神经连接结构,在处理图像、声音和文本等信号时,通过多个变换阶段分层对数据特征进行描述。然而,随着机器学习算法的复杂度不断提高,机器学习技术在实际应用过程中存在占用资源多、运算速度慢、能量消耗大等问题。
比如,在传统的机器学习算法的处理过程中,为了通用性的要求,往往需要占据很大的内存空间在云端来存储训练好的权重。
然而,采用上述方法会导致机器学习算法的处理时间长,处理效率低,进而导致用户的等待时间过长。
发明内容
基于此,有必要针对上述机器学习算法的处理效率低的问题,提供一种处理效率高的机器学习运算的分配系统及方法。
第一方面,提供一种机器学习运算的分配系统,包括:终端服务器和云端服务器;
所述终端服务器用于根据需求信息生成对应的运算任务,并根据所述运算任务和终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法;
根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令。
第二方面,提供一种机器学习运算的分配方法,包括:
获取需求信息、终端服务器的硬件性能参数和云端服务器的硬件性能参数;
根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法;
根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令。
上述机器学习运算的分配系统及方法,当需要根据用户的需求信息来完成运算任务时,分别在终端服务器和云端服务器中都执行该运算任务,以此实现利用不同的机器学习算法来完成同一个运算任务的目的,并可以得到不同的精确程度的运算结果。具体而言,首先对终端服务器和云端服务器的硬件性能参数进行评估,分别选取一个运算能力较低的在终端服务器运行的第一机器学习算法和一个运算能力较高的在云端服务器运行的第二机器学习算法。基于不同的机器学习算法,在终端服务器中生成可在终端服务器中进行控制的终端服务器控制指令以及可在云端服务器中进行控制的云端服务器控制指令。由此可知,当采用上述终端服务器控制指令和云端服务器控制指令时,根据终端服务器控制指令在终端服务器中使用运 算能力较低的第一机器学习算法计算上述运算任务时,可得到一个准确性较低的运算结果。而根据云端服务器控制指令在云端服务器中使用运算能力较高的第二机器学习算法也计算上述同一个运算任务时,可得到一个准确性较高的运算结果。这样灵活地使用不同的机器学习算法分别执行同一个运算任务,可使用户分别得到一个准确性较低的运算结果和一个准确性较高的运算结果,从而实现了基于用户的需求,。并且,由于终端服务器的运算能力较弱,终端运算结果能够先输出,这样避免了用户需要长时间的等待,提高了处理效率,且充分利用了终端服务器与云端服务器两部分的计算资源,使得同一个运算任务可以在终端服务器与云端服务器设备上共同进行。
第三方面,提供一种计算装置,包括:
所述计算装置用于执行网络模型的计算,所述计算装置用于执行神经网络运算;所述计算装置包括:运算单元、控制器单元以及存储单元;
所述存储单元,用于存储权值和输入神经元,所述权值包括重要比特位和非重要比特位;
所述控制器单元,用于获取所述权值的重要比特位和非重要比特位,以及所述输入神经元,并将所述权值的重要比特位和非重要比特位、所述输入神经元传输给所述运算单元;
所述运算单元,用于将所述输入神经元和所述重要比特位进行运算,得到输出神经元的第一运算结果;
以及若所述第一运算结果小于或等于预设阈值,则跳过当前输出神经元的运算;
若所述第一运算结果大于所述预设阈值,则将所述输入神经元与所述非重要比特位进行运算,得到第二运算结果,将所述第一运算结果与所述第二运算结果之和作为输出神经元。
第四方面,提供一种机器学习运算装置,所述机器学习运算装置包括一个或多个如第一方面所述的计算装置,用于从其他处理装置中获取待运算输入数据和控制信息,并执行指定的机器学习运算,将执行结果通过I/O接口传递给其他处理装置;
当所述机器学习运算装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行连接并传输数据;
其中,多个所述计算装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算;多个所述计算装置共享同一控制系统或拥有各自的控制系统;多个所述计算装置共享内存或者拥有各自的内存;多个所述计算装置的互联方式是任意互联拓扑。
第五方面,提供一种组合处理装置,所述组合处理装置包括如第二方面所述的机器学习运算装置,通用互联接口和其他处理装置;
所述机器学习运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作。
第六方面,本申请实施例提供了一种神经网络芯片,所述神经网络芯片包括如第二方面所述的机器学习运算装置或如第五方面所述的组合处理装置。
第七方面,本申请实施例提供了一种电子设备,所述电子设备包括如第六方面所述的芯片。
第八方面,本申请实施例提供了一种板卡,其特征在于,所述板卡包括:存储器件、接口装置和控制器件以及如第六方面所述的神经网络芯片;
其中,所述神经网络芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
所述存储器件,用于存储数据;
所述接口装置,用于实现所述芯片与外部设备之间的数据传输;
所述控制器件,用于对所述芯片的状态进行监控。
第九方面,本申请实施例提供了一种计算方法,包括:
获取所述权值的重要比特位和非重要比特位,以及所述输入神经元;
将所述输入神经元和所述重要比特位进行运算,得到输出神经元的第一运算结果;
若所述第一运算结果小于或等于预设阈值,则跳过当前输出神经元的运算;
若所述第一运算结果大于所述预设阈值,则将所述输入神经元与所述非重要比特位之间进行运算,得到第二运算结果,将所述第一运算结果与所述第二运算结果之和作为输出神经元。
第十方面,本申请实施例提供了一种计算机可读存储介质,其中,上述计算机可读存储介质存储用于电子数据交换的计算机程序,其中,上述计算机程序使得计算机执行如本申请实施例第九方面中所描述的部分或全部步骤。
第十一方面,本申请实施例提供了一种计算机程序产品,其中,上述计算机程序产品包括存储了计算机程序的非瞬时性计算机可读存储介质,上述计算机程序可操作计算机执行如本申请实施例第九方面中所描述的部分或全部步骤。该计算机程序产品可以为一个软件安装包。
可以看出,计算装置通过获取权值的重要比特位和非重要比特位,以及输入神经元,将输入神经元和重要比特位进行运算,得到输出神经元的第一运算结果,若第一运算结果小于或等于预设阈值,则跳过当前输出神经元的运算,若第一运算结果大于预设阈值,则将输入神经元与非重要比特位进行运算,得到第二运算结果,将第一运算结果与第二运算结果之和作为输出神经元,进而,如果某个输出神经元的预测结果为不需要进行运算,则跳过该输出神经元的运算过程。新的运算装置中集成了运算方法,能够预测并跳过不需要进行运算的输出神经元。从而减少神经网络的计算时间和计算能耗。
BRIEF DESCRIPTION OF THE DRAWINGS
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1-1 is a schematic structural diagram of a distribution system for machine learning operations according to an embodiment;
FIG. 1-2 is a schematic structural diagram of a distribution system for machine learning operations according to another embodiment;
FIG. 1-3 is a schematic structural diagram of a distribution system for machine learning operations according to another embodiment;
FIG. 1-4 is a diagram of the operation-storage-communication working mode according to an embodiment;
FIG. 1-5A is a schematic structural diagram of a computing device according to an embodiment;
FIG. 1-5B is a structural diagram of a computing device according to an embodiment;
FIG. 1-5C is a structural diagram of a computing device according to another embodiment;
FIG. 1-5D is a structural diagram of a main processing circuit according to an embodiment;
FIG. 1-5E is a structural diagram of another computing device according to an embodiment;
FIG. 1-5F is a schematic structural diagram of a tree module according to an embodiment;
FIG. 1-5G is a structural diagram of yet another computing device according to an embodiment;
FIG. 1-5H is a structural diagram of still another computing device according to an embodiment;
FIG. 1-5I is a schematic structural diagram of a computing device according to an embodiment;
FIG. 1-6 is a flowchart of a distribution method for machine learning operations according to an embodiment;
FIG. 2-1A is a schematic structural diagram of a computing device according to an embodiment of the present invention;
FIG. 2-1B is a schematic structural diagram of a hierarchical storage device according to an embodiment of the present application;
FIG. 2-1C is a schematic structural diagram of a 3T SRAM storage cell according to an embodiment of the present application;
FIG. 2-1D is a schematic structural diagram of a data processing device according to an embodiment of the present application;
FIG. 2-1E is a schematic structural diagram of another data processing device according to an embodiment of the present application;
FIG. 2-2 is a schematic structural diagram of a computing device according to an embodiment of the present application;
FIG. 2-3 is a structural diagram of a computing device according to an embodiment of the present application;
FIG. 2-4 is a structural diagram of a computing device according to another embodiment of the present application;
FIG. 2-5 is a structural diagram of a main processing circuit according to an embodiment of the present application;
FIG. 2-6 is a structural diagram of another computing device according to an embodiment of the present application;
FIG. 2-7 is a schematic structural diagram of a tree module according to an embodiment of the present application;
FIG. 2-8 is a structural diagram of yet another computing device according to an embodiment of the present application;
FIG. 2-9 is a structural diagram of still another computing device according to an embodiment of the present application;
FIG. 2-10 is a structural diagram of a combined processing device according to an embodiment of the present application;
FIG. 2-11 is a schematic structural diagram of a computing device according to an embodiment of the present application;
FIG. 2-12 is a structural diagram of another combined processing device according to an embodiment of the present application;
FIG. 2-13 is a schematic structural diagram of a board card according to an embodiment of the present application;
FIG. 2-14 is a schematic flowchart of a calculation method according to an embodiment of the present invention.
具体实施方式
为了使本申请的目的、技术方案及优点更加清楚明白,以下结合附图及实施例,对本申请进行进一步详细说明。应当理解,此处描述的具体实施例仅仅用以解释本申请,并不用于限定本申请。
在一个实施例中,提供了一种机器学习运算的分配系统,该分配系统包括:云端服务器10和终端服务器20。
根据需求信息生成对应的运算任务,并根据所述运算任务和终端服务器20的硬件性能参数选取在所述终端服务器20运行的第一机器学习算法,以及根据所述运算任务和云端服务器10的硬件性能参数选取在所述云端服务器10运行的第二机器学习算法;根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令。
具体地,用户根据自身的实际需求通过终端设备输入相应的需求信息,该终端设备包括含有控制功能的输入获取单元,输入获取单元可由用户来选择,比如可以是APP,也可以是其它程序的API接口等。用户输入的需求信息主要由三方面决定,一方面是功能需求信息,一方面是准确度需求信息,另一方面是内存需求信息。对应地,运算任务包括功能需求任务、准确度需求任务及内存需求任务。需要清楚,第一机器学习算法的运算任务与第二机器学习算法的运算任务是同一个运算任务。硬件性能参数包括但不限于运算能力、能耗、精度及速度等。
更具体地,机器学习算法包括但不限于神经网络算法和深度学习算法。机器学习算法具有明显的逐阶段特征,比如每一层神经网络的运算、聚类算法的每次迭代等等。进一步地, 划分机器学习算法为多个阶段的算法。在一个实施例中,机器学习算法为多层神经网络算法,多个阶段包括多个层。在另一个实施例中,机器学习算法为聚类算法,多个阶段为多次迭代。在每一阶段的计算中都可以分别通过终端服务器20与云端服务器10进行计算。
需要理解的是,由于终端服务器的运算能力较低,对应的第一机器学习算法的运算性能也较低。相反地,云端服务器的运算能力较高,对应的第二机器学习算法的运算性能也较高。
因此,在终端服务器20中计算对应的每个阶段的第一机器学习算法的运算任务,能够更加快速地得到一个准确性较低的终端运算结果。而在云端服务器10中计算对应的每个阶段的第二机器学习算法的运算任务虽然需要消耗较长时间,但能够得到一个准确性较高的云端运算结果。于是,虽然终端运算结果能够相对于云端运算结果更快地得出,但云端运算结果相对于终端运算结果更为准确。
此处举个简单的例子,若需要辨别图像中的动物是一只猫,分别在终端服务器20和云端服务器10中进行图像识别,则终端服务器20可能会比云端服务器10更快地得出图像中的动物是一只猫的结果,但云端服务器10可能还会得到这只猫的品种等更为准确的运算结果。
上述机器学习运算的分配系统及方法,当需要根据用户的需求信息来完成运算任务时,分别在终端服务器和云端服务器中都执行该运算任务,以此实现利用不同的机器学习算法来完成同一个运算任务的目的,并可以得到不同的精确程度的运算结果。具体而言,首先对终端服务器和云端服务器的硬件性能参数进行评估,分别选取一个运算能力较低的在终端服务器运行的第一机器学习算法和一个运算能力较高的在云端服务器运行的第二机器学习算法。基于不同的机器学习算法,在终端服务器中生成可在终端服务器中进行控制的终端服务器控制指令以及可在云端服务器中进行控制的云端服务器控制指令。
由此可知,当采用上述终端服务器控制指令和云端服务器控制指令时,根据终端服务器控制指令在终端服务器中使用运算能力较低的第一机器学习算法计算上述运算任务时,可得到一个准确性较低的运算结果。而根据云端服务器控制指令在云端服务器中使用运算能力较高的第二机器学习算法也计算上述同一个运算任务时,可得到一个准确性较高的运算结果。这样灵活地使用不同的机器学习算法分别执行同一个运算任务可使用户分别得到一个准确性较低的运算结果和一个准确性较高的运算结果,从而实现了基于用户的需求。并且,由于终端服务器的运算能力较弱,终端运算结果能够先输出,这样避免了用户需要长时间的等待,提高了处理效率,且充分利用了终端服务器与云端服务器两部分的计算资源,使得同一个运算任务可以在终端服务器与云端服务器设备上共同进行。
进一步地,在一个实施例中,所述终端服务器20还用于对所述终端服务器控制指令进行解析得到终端控制信号,并根据所述终端控制信号计算对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果,以及将所述云端服务器控制指令发送至所述云端服务器10。
更进一步地,所述云端服务器10用于接收所述云端服务器控制指令,对所述云端服务器控制指令进行解析得到云端控制信号,并根据所述云端控制信号计算对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
在其中一个实施例中,所述硬件性能参数包括运算能力,则所述终端服务器20具体用于获取所述终端服务器20的运算能力和所述云端服务器10的运算能力;根据所述运算任务和所述终端服务器的运算能力选取第一机器学习算法,以及根据所述运算任务和所述云端服务器的运算能力选取第二机器学习算法。
具体地,终端服务器20的硬件性能参数包括终端服务器20的运算能力,云端服务器10 的硬件性能参数包括云端服务器10的运算能力。其中,运算能力可从运算模块预设的配置信息中获得。服务器的运算能力影响服务器的运算速度,根据运算模块的运算能力可进一步准确地获得更为合适的机器学习算法。
在具体的一个实施例中,所述第一机器学习算法包括第一神经网络模型,所述第二机器学习算法包括第二神经网络模型。在本实施例中,以神经网络模型为例来具体说明,即将所述机器学习运算的分配系统具体应用于神经网络运算的分配,则该分配系统包括:
所述终端服务器20用于获取需求信息、所述终端服务器20的硬件性能参数和所述云端服务器10的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器20的硬件性能参数选取在所述终端服务器20运行的第一神经网络模型,以及根据所述运算任务和所述云端服务器10的硬件性能参数选取在所述云端服务器10运行的第二神经网络模型;根据选取好的所述第一神经网络模型和所述运算任务生成终端服务器控制指令,以及根据选取好的所述第二神经网络模型和所述运算任务生成云端服务器控制指令;对所述终端服务器控制指令进行解析得到终端控制信号,并根据所述终端控制信号计算对应的第一神经网络模型的运算任务以得到终端运算结果,以及将所述云端服务器控制指令发送至云端服务器10。
所述云端服务器10用于接收所述云端服务器控制指令,对所述云端服务器控制指令进行解析得到云端控制信号,并根据所述云端控制信号计算对应的第二神经网络模型的运算任务以得到云端运算结果。
具体地,当需要根据用户的需求信息来完成运算任务时,分别在终端服务器和云端服务器中都执行该运算任务,以此实现利用不同的神经网络模型来完成同一个运算任务的目的,并可以得到不同的精确程度的运算结果。具体而言,首先对终端服务器和云端服务器的硬件性能参数进行评估,分别选取一个运算能力较低的在终端服务器运行的第一神经网络模型和一个运算能力较高的在云端服务器运行的第二神经网络模型。基于不同的神经网络模型,在终端服务器中生成可在终端服务器中进行控制的终端服务器控制指令以及可在云端服务器中进行控制的云端服务器控制指令。
由此可知,当采用上述终端服务器控制指令和云端服务器控制指令时,根据终端服务器控制指令在终端服务器中使用运算能力较低的第一神经网络模型计算上述运算任务时,可得到一个准确性较低的运算结果。而根据云端服务器控制指令在云端服务器中使用运算能力较高的第二神经网络模型也计算上述同一个运算任务时,可得到一个准确性较高的运算结果。这样灵活地使用不同的神经网络模型分别执行同一个运算任务可使用户分别得到一个准确性较低的运算结果和一个准确性较高的运算结果,从而实现了基于用户的需求。并且,由于终端服务器的运算能力较弱,终端运算结果能够先输出,这样避免了用户需要长时间的等待,提高了处理效率,且充分利用了终端服务器与云端服务器两部分的计算资源,使得同一个运算任务可以在终端服务器与云端服务器设备上共同进行。
进一步地,在其中一个实施例中,所述终端服务器20还用于将所述终端运算结果输出后,在接收到停止运算指令时,发送所述停止运算指令至所述云端服务器10,以终止所述云端服务器10的运算工作。
具体地,终端服务器20将所述终端运算结果输出后,用户便可以得到了一个准确性较低的运算结果。若用户想获得一个更为准确的运算结果,可等待云端服务器10运算完成后,将云端运算结果通过终端服务器20输出。由此,用户便分别得到了一个准确性较低的运算结果 和一个准确性较高的运算结果。然而,若用户在得到一个准确性较低的运算结果后,认为该运算结果已经满足了自己的需求,因此并不想获得准确性较高的运算结果,则用户可通过用户终端输入停止运算指令。分配系统在接收该停止运算指令后,终止云端服务器10的运算工作,即准确性较高的运算结果处于尚未完成状态或即使完成但不再输出状态。
通过设置停止运算的方式,用户可选择只得到一个准确性较低的运算结果,这样可节省用户的时间,并且能够保障机器学习运算的分配系统的运算性能,避免运算资源的浪费。
在其中一个实施例中,所述终端服务器20包括终端控制器单元210、终端运算单元220和终端通信单元230;所述终端控制器单元210分别与所述终端运算单元220和所述终端通信单元230连接。
其中,所述终端控制器单元210用于获取需求信息、所述终端服务器20的硬件性能参数和所述云端服务器10的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器20的硬件性能参数选取在所述终端服务器20运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器10的硬件性能参数选取在所述云端服务器10运行的第二机器学习算法;根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令,并对所述终端服务器控制指令进行解析得到终端控制信号。
所述终端运算单元220用于根据所述终端控制信号计算对应的第一机器学习算法的运算任务以得到终端运算结果;所述终端通信单元230用于将所述云端服务器控制指令发送至所述云端服务器10。
具体地,终端控制器单元210获取用户输入的需求信息并生成对应的运算任务,根据终端服务器20和云端服务器10的硬件性能参数,如运算能力、能耗、精度、速度等进行评估,得到评估结果。然后基于该需求信息和评估结果分别选择一个合适的用于终端服务器的第一机器学习算法和一个合适的用于云端服务器的第二机器学习算法,并根据上述不同机器学习算法的运算能力生成不同的控制指令。
需要说明,包含控制指令的指令集预先存储于终端服务器20与云端服务器10中,终端控制器单元210会根据输入的需求信息分别生成用于终端服务器20的终端服务器控制指令和用于云端服务器10的云端服务器控制指令。
更具体地,在选择使用的神经网络模型时,可以选用如下的数学模型作为一种实施例。首先获取终端服务器20或者云端服务器10的运算能力,指标为每秒可进行的最大浮点/定点运算次数,记为参数C;然后对运算需求进行分析,这里首先是判断宏观的神经网络模型的函数g(x),即判断是选用CNN、RNN还是DNN等,一般来说,图像视觉领域用CNN和DNN较多,文本音频领域用RNN较多,通过基础的筛选可以更快地判断适合的神经网络类型;然后根据能耗W、精度R和速度S进行筛选,可以选择的一种实施例是以Alexnet的性能作为baseline,分别对其他神经网络进行的参数量化,最终的评分函数可以为F(x)=lg(C)*g(x)*S*(R^2)/W,其中的精度和功耗的具体数学形式可以根据用户需求进行更多加权,最终根据不同神经网络在不同硬件设施和用户需求的评分选择最优评分,挑选出最合适的神经网络模型。
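As a rough sketch of how this scoring function might be evaluated in software (only the formula F(x) = lg(C) * g(x) * S * R^2 / W comes from the text above; the candidate parameters and the two capability figures below are invented placeholders, not values from this application):

```python
import math

def score(C, g, S, R, W):
    # F(x) = lg(C) * g(x) * S * R^2 / W: C is the server's peak ops/s,
    # g the base score of the chosen network family (CNN/RNN/DNN),
    # S speed, R accuracy, W energy consumption.
    return math.log10(C) * g * S * (R ** 2) / W

# Illustrative numbers only, normalized against an AlexNet baseline of 1.0.
candidates = {
    "AlexNet": dict(g=1.0, S=1.0, R=1.0, W=1.0),
    "ResNet":  dict(g=1.0, S=0.4, R=1.3, W=2.5),
}
C_terminal, C_cloud = 1e11, 1e14  # assumed terminal / cloud peak ops per second
for name, p in candidates.items():
    print(name, round(score(C_terminal, **p), 2), round(score(C_cloud, **p), 2))
```

For each server, the candidate network with the highest score is then selected.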
需要说明,终端控制器单元210通过对能耗、速度、精度等参数建立数学模型来进行评估,然后选择最适合终端服务器20和云端服务器10的机器学习算法,并进行训练或推理。其中,终端服务器20的硬件配置可以直接通过系统获取,比如安卓/IOS等的系统调用;云端 服务器10的硬件配置由终端服务器20通过终端通信单元230发送请求到云端服务器10来获得返回的配置信息。
进一步地,终端控制器单元210还对终端服务器控制指令进行解析,得到终端控制信号,终端控制器单元210发送终端控制信号至终端运算单元220与终端通信单元230。终端运算单元220接收相应的终端控制信号,根据该终端控制信号计算对应的第一机器学习算法的运算任务以得到终端运算结果。终端通信单元230用于将云端服务器控制指令发送至云端服务器10。
可选地,上述第一机器学习算法包括第一神经网络模型。
在其中一个实施例中,所述云端服务器10包括云端控制器单元110、云端运算单元120和云端通信单元130;所述云端控制器单元110分别与所述云端运算单元120和所述云端通信单元130连接,所述云端通信单元130与所述终端通信单元230通信连接,用于在所述云端服务器10与所述终端服务器20之间进行数据交互。
其中,所述云端通信单元130用于接收所述云端服务器控制指令,并将所述云端服务器控制指令发送至所述云端控制器单元110,以及获取云端运算结果并发送至所述终端服务器20;所述云端控制器单元110用于接收所述云端服务器控制指令,对所述云端服务器控制指令进行解析得到云端控制信号;所述云端运算单元120用于根据所述云端控制信号计算对应的第二机器学习算法的运算任务以得到云端运算结果,并将所述云端运算结果通过所述云端通信单元130发送至所述终端服务器20。
具体地,终端控制器单元210将生成的云端服务器控制指令通过终端通信单元230发送至云端服务器10。在云端服务器10中,云端通信单元130接收云端服务器控制指令并发送至云端控制器单元110,云端控制器单元110对云端服务器控制指令进行解析,得到云端控制信号并发送至云端运算单元120与云端通信单元130。云端运算单元120接收相应的云端控制信号,根据该云端控制信号计算对应的第二机器学习算法的运算任务并得到云端运算结果。
可选地,上述第二机器学习算法包括第二神经网络模型。
进一步地,在云端服务器10与终端服务器20分别进行运算的过程中,同时伴随着云端服务器10与终端服务器20之间的数据通信。终端通信单元230根据相应的终端控制信号发送数据给云端通信单元130;反过来,云端通信单元130也根据相应的云端控制信号发送数据给终端通信单元230。由于终端服务器20是为了获取一个准确性较低的运算结果,所消耗的运算时间短,在终端服务器20运算完成后,先把终端运算结果发送至用户的终端设备上。若用户在得到准确性较低的运算结果后,还想要获得一个更为准确的运算结果,则在云端服务器10运算完成后,云端通信单元130发送云端运算结果至终端通信单元230,由终端服务器20将云端运算结果发送至用户的终端设备上。需要说明,终端通信单元230与云端通信单元130之间通过通讯协议分别在终端服务器20和云端服务器10之间进行数据传输。
在其中一个实施例中,终端服务器20还包括终端存储单元240,终端存储单元240分别与终端运算单元220、终端控制器单元210连接,终端存储单元240用于接收终端服务器20的输入数据并进行终端数据的存储。
具体地,终端存储单元240可以根据终端指令生成电路210b生成的终端服务器控制指令确定终端的输入数据并进行数据存储以及对终端运算过程进行存储。可选地,存储的数据格式可以是浮点数,也可以是量化的定点数。
另外,终端存储单元240可以是sram,dram等等能够对数据进行存储的装置或存储空间,用于对终端的数据以及终端的指令进行存储。其中,数据包括但不限于输入神经元、输出神经元、权值、图像以及向量中的至少一种。
更进一步地,在终端服务器20中,终端运算单元220和终端存储单元240是单独的两个部件,在终端运算单元220运算完成后,将终端运算结果先转存到终端存储单元240中,然后再由终端存储单元240和终端通信单元230将终端运算结果进行编码传输通信,而在编码传输通信的过程中,终端运算单元220已经开始下一轮的运算。采用这种工作模式,不会带来过多的等待延时。对于终端运算单元220而言,每轮的等效运算时间是实际运算时间+转存时间。由于转存时间比编码传输时间少很多,这样的方式可以充分调动终端运算单元220的运算能力,使终端运算单元220尽量满载工作。需要说明,可在终端指令生成电路210b内按照上述工作模式进行对应的终端服务器控制指令的生成。可选地,该部分的实现可以完全由算法实现,使用终端服务器20本身的CPU设备即可。
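A loose software analogy of this operation-store-communicate overlap (threads stand in for the operation, storage, and communication units; everything here is an assumption made for illustration, not the hardware described above):

```python
import threading, queue

results = queue.Queue()

def transmit(buf):
    # Stands in for the storage + communication units encoding and sending
    # one round's result while the operation unit works on the next round.
    results.put(sum(buf))

def compute(r):
    return [r * i for i in range(1000)]  # stands in for one round of operation

sender = None
for r in range(4):
    out = compute(r)   # operation unit: round r overlaps transmit of round r-1
    if sender:
        sender.join()  # dump time is much shorter than transmit time
    sender = threading.Thread(target=transmit, args=(out,))
    sender.start()
sender.join()
while not results.empty():
    print(results.get())
```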
在其中一个实施例中,云端服务器10还包括云端存储单元140,云端存储单元140分别与云端运算单元120、云端控制器单元110连接,云端存储单元140用于接收云端的输入数据并进行云端数据的存储。
具体地,云端存储单元140可以根据云端服务器控制指令确定云端的输入数据并进行数据存储以及对云端运算过程进行存储。可选地,存储的数据格式可以是浮点数,也可以是量化的定点数。
较佳地,云端存储单元140可以是sram,dram等等能够对数据进行存储的装置或存储空间,用于对云端的数据和云端的指令进行存储。其中,数据包括但不限于输入神经元、输出神经元、权值、图像以及向量中的至少一种。
更进一步地,在云端服务器10中,云端运算单元120和云端存储单元140是单独的两个部件,在云端运算单元120运算完成后,将云端运算结果先转存到云端存储单元140中,然后再由云端存储单元140和云端通信单元130将云端运算结果进行编码传输通信,而在编码传输通信的过程中,云端运算单元120已经开始下一轮的运算。采用这种工作模式,不会带来过多的等待延时。对于云端运算单元120而言,每轮的等效运算时间是实际运算时间+转存时间。由于转存时间比编码传输时间少很多,这样的方式可以充分调动云端运算单元120的运算能力,使云端运算单元120尽量满载工作。需要说明,可在终端指令生成电路210b内按照上述工作模式进行对应的云端服务器控制指令的生成。
更为具体地,在其中一个实施例中,所述终端控制器单元210包括终端评估电路210a、终端指令生成电路210b和终端指令解析电路210c;所述终端指令生成电路210b分别与所述终端评估电路210a和所述终端指令解析电路210c连接,所述终端评估电路210a、所述终端指令生成电路210b和所述终端指令解析电路210c分别与所述终端运算单元220、所述终端存储单元240和所述终端通信单元230连接。
所述终端评估电路210a用于获取需求信息、所述终端服务器20的硬件性能参数和所述云端服务器10的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器20的硬件性能参数选取在所述终端服务器20运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器10的硬件性能参数选取在所述云端服务器10运行的第二机器学习算法;所述终端指令生成电路210b用于根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端 服务器控制指令;所述终端指令解析电路210c用于对所述终端服务器控制指令进行解析得到终端控制信号。
具体地,终端评估电路210a获取用户输入的需求信息,基于需求信息并根据终端服务器20和云端服务器10的硬件性能参数分别选取一个运算能力较低的用于终端的第一机器学习算法和一个运算能力较高的用于云端的第二机器学习算法。选取完成后,终端指令生成电路210b根据用于终端服务器20的第一机器学习算法的低运算能力和用于云端服务器10的第二机器学习算法的高运算能力分别生成相对应的终端服务器控制指令和云端服务器控制指令。其中,终端服务器控制指令和云端服务器控制指令中的控制指令均可分别包括运算分配指令、访存指令和数据通讯指令。终端服务器控制指令用于在终端服务器20中进行控制,云端服务器控制指令通过终端通信单元230发送至云端通信单元130中,再由云端通信单元130发送至云端控制器单元110以在云端服务器10中进行控制。终端指令解析电路210c用于对终端服务器控制指令进行解析以得到终端控制信号,以及根据终端控制信号使终端运算单元220、终端存储单元240以及终端通信单元230按照终端服务器控制指令运行。
进一步地,在终端指令生成电路210b生成控制指令的过程中,运算分配方案使用的分配方式可以是:根据机器学习算法的运算能力、精度、速度和能耗的不同分配同一个运算任务,即采用不同的机器学习算法但完成同一个运算任务。在终端服务器20中计算一个运算能力较低的第一机器学习算法的运算任务,获得一个准确性较低的运算结果,而在云端服务器10中计算一个运算能力较高的第二机器学习算法的运算任务。如此可先获得一个准确性较低的运算结果,若用户有进一步需求,可再得到一个准确性较高的运算结果,采用这种分配方式不影响QoS(服务质量)。需要说明,终端服务器20与云端服务器10可同时对同一个运算任务进行计算,也可异时对同一个运算任务进行计算,亦可根据用户的需求选其一对运算任务进行计算。
以传统的神经网络模型为例,存在不同运算能力的神经网络模型。用ImageNet数据集做测试之后,可以获知不同的神经网络模型运算能力是不一样的,其除了与神经网络模型的结构本身的优化有关之外,还与运算复杂性有一定程度的正相关性。举例来说,AlexNet神经网络模型的运算能力较低,但是其时空成本是最小的。而ResNet神经网络模型的运算能力是建立在其更多的能耗上。但是,低运算能力的神经网络模型可给出一个准确性较低的运算结果,该运算结果可以在用户对其接受的范围内的。
低运算能力的神经网络模型需要较低的功耗和适当的推理时间,因此,对于终端服务器20相比较于云端服务器10较低的性能而言,可以选择在终端服务器20中完成运算能力较低的第一神经网络模型的运算,在云端服务器10中完成运算能力较高的第二神经网络模型的运算。并且由用户需求决定是否进一步获取高精度的运算分类结果。这样,实现了可以先提供用户一个准确性较低的运算结果,避免等待时间过长,同时也为用户提供了场景的选择。
访存指令是在运算分配的基础上的内存管理指令,用于控制终端存储单元240或云端存储单元140进行数据存储。数据通讯指令是对云端服务器10和终端服务器20的数据交互指令,用于控制终端通信单元230与云端通信单元130之间进行数据交互。
更进一步地,可以进行对多个终端服务器20与一个云端服务器10的系统级调度,多个终端服务器20与一个云端服务器10共同完成一个复杂度很高的系统级任务。
在其中一个实施例中,云端控制器单元110包括云端指令解析电路110a,云端指令解析电路110a分别与云端运算单元120、云端存储单元140和云端通信单元130连接。
具体地,在云端服务器10中,云端指令解析电路110a用于接收云端服务器控制指令,并对云端服务器控制指令进行解析,获得云端控制信号,以及根据云端控制信号使云端运算单元120、云端存储单元140以及云端通信单元130按照云端服务器控制指令运行,需要清楚的是,云端运算单元120、云端存储单元140以及云端通信单元130的运行原理与上述终端运算单元220、终端存储单元240以及终端通信单元230的运行原理相同,在此不再赘述。
云端指令解析电路110a通过解析云端服务器控制指令,得到云端控制信号,并将云端控制信号发送给云端服务器10的其它部件,使得云端服务器10内可以有序地完成云端神经网络的运算,极大地加速了云端神经网络的运算速度。
在一些实施例中,终端运算单元220与终端通信单元230连接,且终端存储单元240与终端通信单元230连接。
具体地,终端通信单元230可以对终端运算单元220和终端存储单元240的输出数据进行编码并发送至云端通信单元130。反过来,终端通信单元230也可以接收云端通信单元130发送的数据,并对该数据进行解码再次发送至终端运算单元220和终端存储单元240。通过采用这样的设计方式,可以减轻终端控制器单元210的任务量,使得终端控制器单元210可以更细化地完成控制指令的生成过程。
在另一些实施例中,云端运算单元120与云端通信单元130连接,云端存储单元140与云端通信单元130连接。
具体地,云端通信单元130可以对云端运算单元120和云端存储单元140的输出数据进行编码并发送至终端通信单元230。反过来,云端通信单元130也可以接收终端通信单元230发送的数据,并对该据进行解码再次发送至云端运算单元120和云端存储单元140。
进一步地,在一些实施例中,终端运算单元220可以是终端服务器20本身的运算部件,云端运算单元120可以是云端服务器10本身的运算部件。比如:运算部件可以是CPU,可以是GPU,也可以是神经网络芯片。优选地,终端运算单元220和云端运算单元120可以是人工神经网络芯片的数据处理单元中的运算单元,用于根据存储单元(终端存储单元240或云端存储单元140)中存储的控制指令对数据执行相应的运算。
在一个可选的实施例中,请参阅图1-5A,云端控制器单元110与终端控制器单元210为控制器单元311,云端运算单元120与终端运算单元220为运算单元312,
该运算单元312包括:一个主处理电路3101和多个从处理电路3102;
控制器单元311,用于获取输入数据以及计算指令;在一种可选方案中,具体的,可以通过数据输入输出单元获取输入数据以及计算指令,该数据输入输出单元具体可以为一个或多个数据I/O接口或I/O引脚。
上述计算指令包括但不限于:正向运算指令或反向训练指令,或其他神经网络运算指令等等,例如卷积运算指令,本申请具体实施方式并不限制上述计算指令的具体表现形式。
控制器单元311,还用于解析该计算指令得到多个运算指令,将该多个运算指令以及所述输入数据发送给所述主处理电路;
主处理电路3101,用于对所述输入数据执行前序处理以及与所述多个从处理电路之间传输数据以及运算指令;
多个从处理电路3102,用于依据从所述主处理电路传输的数据以及运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给所述主处理电路;
主处理电路3101,用于对所述多个中间结果执行后续处理得到所述计算指令的计算结果。
本申请提供的技术方案将运算单元设置成一主多从结构,对于正向运算的计算指令,其可以将依据正向运算的计算指令将数据进行拆分,这样通过多个从处理电路即能够对计算量较大的部分进行并行运算,从而提高运算速度,节省运算时间,进而降低功耗。
可选的,上述机器学习计算具体可以包括:人工神经网络运算,上述输入数据具体可以包括:输入神经元数据和权值数据。上述计算结果具体可以为:人工神经网络运算的结果即输出神经元数据。
对于神经网络中的运算可以为神经网络中的一层的运算,对于多层神经网络,其实现过程是,在正向运算中,当上一层人工神经网络执行完成之后,下一层的运算指令会将运算单元中计算出的输出神经元作为下一层的输入神经元进行运算(或者是对该输出神经元进行某些操作再作为下一层的输入神经元),同时,将权值也替换为下一层的权值;在反向运算中,当上一层人工神经网络的反向运算执行完成后,下一层运算指令会将运算单元中计算出的输入神经元梯度作为下一层的输出神经元梯度进行运算(或者是对该输入神经元梯度进行某些操作再作为下一层的输出神经元梯度),同时将权值替换为下一层的权值。
上述机器学习计算还可以包括支持向量机运算,k-近邻(k-nn)运算,k-均值(k-means)运算,主成分分析运算等等。为了描述的方便,下面以人工神经网络运算为例来说明机器学习计算的具体方案。
对于人工神经网络运算,如果该人工神经网络运算具有多层运算,多层运算的输入神经元和输出神经元并非是指整个神经网络的输入层中神经元和输出层中神经元,而是对于网络中任意相邻的两层,处于网络正向运算下层中的神经元即为输入神经元,处于网络正向运算上层中的神经元即为输出神经元。以卷积神经网络为例,设一个卷积神经网络有L层,K=1,2,...,L-1,对于第K层和第K+1层来说,我们将第K层称为输入层,其中的神经元为所述输入神经元,第K+1层称为输出层,其中的神经元为所述输出神经元。即除最顶层外,每一层都可以作为输入层,其下一层为对应的输出层。
可选的,该控制器单元包括:指令缓存单元3110、指令处理单元3111和存储队列单元3113;
指令缓存单元3110,用于存储所述人工神经网络运算关联的计算指令;
所述指令处理单元3111,用于对所述计算指令解析得到多个运算指令;
存储队列单元3113,用于存储指令队列,该指令队列包括:按该队列的前后顺序待执行的多个运算指令或计算指令。
举例说明,在一个可选的技术方案中,主运算处理电路也可以包括一个控制器单元,该控制器单元可以包括主指令处理单元,具体用于将指令译码成微指令。当然在另一种可选方案中,从运算处理电路也可以包括另一个控制器单元,该另一个控制器单元包括从指令处理单元,具体用于接收并处理微指令。上述微指令可以为指令的下一级指令,该微指令可以通过对指令的拆分或解码后获得,能被进一步解码为各部件、各单元或各处理电路的控制信号。
在一种可选方案中,该计算指令的结构可以如下表所示。
操作码 寄存器或立即数 寄存器/立即数 ...
上表中的省略号表示可以包括多个寄存器或立即数。
在另一种可选方案中,该计算指令可以包括:一个或多个操作域以及一个操作码。该计 算指令可以包括神经网络运算指令。以神经网络运算指令为例,如表1所示,其中,寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以为操作域。其中,每个寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以是一个或者多个寄存器的号码。
（表1：神经网络运算指令的指令格式，即操作码及寄存器0至寄存器4等操作域；原表图像未收录）
上述寄存器可以为片外存储器,当然在实际应用中,也可以为片内存储器,用于存储数据,该数据具体可以为n维数据,n为大于等于1的整数,例如,n=1时,为1维数据,即向量,如n=2时,为2维数据,即矩阵,如n=3或3以上时,为多维张量。
可选的,该控制器单元还可以包括:
所述依赖关系处理单元3112,用于在具有多个运算指令时,确定第一运算指令与所述第一运算指令之前的第零运算指令是否存在关联关系,如所述第一运算指令与所述第零运算指令存在关联关系,则将所述第一运算指令缓存在所述指令存储单元内,在所述第零运算指令执行完毕后,从所述指令存储单元提取所述第一运算指令传输至所述运算单元;
所述确定该第一运算指令与第一运算指令之前的第零运算指令是否存在关联关系包括:
依据所述第一运算指令提取所述第一运算指令中所需数据(例如矩阵)的第一存储地址区间,依据所述第零运算指令提取所述第零运算指令中所需矩阵的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第零运算指令具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第零运算指令不具有关联关系。
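The overlap test itself reduces to an interval intersection; a minimal sketch follows (assuming half-open [start, end) address intervals; the function names are illustrative):

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    # Two half-open storage address intervals overlap iff each starts
    # before the other ends.
    return a_start < b_end and b_start < a_end

def has_dependency(first_instr, zeroth_instr):
    # Each instruction is (start address, end address) of its required data.
    # Overlap => association: buffer the first instruction until the zeroth
    # finishes; no overlap => the first instruction can be issued directly.
    return intervals_overlap(*first_instr, *zeroth_instr)

print(has_dependency((0x1000, 0x1100), (0x10C0, 0x1200)))  # True: overlap
print(has_dependency((0x1000, 0x1100), (0x1100, 0x1200)))  # False: disjoint
```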
在另一种可选实施例中,运算单元312如图1-5C所示,可以包括一个主处理电路3101和多个从处理电路3102。在一个实施例里,如图1-5C所示,多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,需要说明的是,如图1-5C所示的K个从处理电路仅包括第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,即该k个从处理电路为多个从处理电路中直接与主处理电路连接的从处理电路。
K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的数据以及指令的转发。
可选的,如图1-5D所示,该主处理电路3101还可以包括:转换处理电路3101a、激活 处理电路3101b、加法处理电路3101c中的一种或任意组合;
转换处理电路3101a,用于将主处理电路接收的数据块或中间结果执行第一数据结构与第二数据结构之间的互换(例如连续数据与离散数据的转换);或将主处理电路接收的数据块或中间结果执行第一数据类型与第二数据类型之间的互换(例如定点类型与浮点类型的转换);
激活处理电路3101b,用于执行主处理电路内数据的激活运算;
加法处理电路3101c,用于执行加法运算或累加运算。
所述主处理电路,用于确定所述输入神经元为广播数据,权值为分发数据,将分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述从处理电路;
所述多个从处理电路,用于依据该运算指令对接收到的数据块执行运算得到中间结果,并将运算结果传输给所述主处理电路;
所述主处理电路,用于将多个从处理电路发送的中间结果进行处理得到该计算指令的结果,将该计算指令的结果发送给所述控制器单元。
所述从处理电路包括:乘法处理电路;
所述乘法处理电路,用于对接收到的数据块执行乘积运算得到乘积结果;
转发处理电路(可选的),用于将接收到的数据块或乘积结果转发。
累加处理电路,所述累加处理电路,用于对该乘积结果执行累加运算得到该中间结果。
另一个实施例里,该运算指令为矩阵乘以矩阵的指令、累加指令、激活指令等等计算指令。
下面通过神经网络运算指令来说明如图1-5A所示的计算装置的具体计算方法。对于神经网络运算指令来说,其实际需要执行的公式可以为:
$s = s\left(\sum_i w_i x_i + b\right)$
其中,即将权值w乘以输入数据x i,进行求和,然后加上偏置b后做激活运算s(h),得到最终的输出结果s。
在一种可选的实施方案中,如图1-5E所示,所述运算单元包括:树型模块340,所述树型模块340包括:一个根端口3401和多个支端口3402,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
上述树型模块具有收发功能,例如如图1-5E所示,该树型模块即为发送功能,如图1-5I所示,该树型模块即为接收功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据块、权值以及运算指令。
可选的,该树型模块为计算装置的可选择结果,其可以包括至少1层节点,该节点为具有转发功能的线结构,该节点本身可以不具有计算功能。如树型模块具有零层节点,即无需该树型模块。
可选的,该树型模块可以为n叉树结构,例如,如图1-5F所示的二叉树结构,当然也可以为三叉树结构,该n可以为大于等于2的整数。本申请具体实施方式并不限制上述n的具体取值,上述层数也可以为2,从处理电路可以连接除倒数第二层节点以外的其他层的节点,例如可以连接如图1-5F所示的倒数第一层的节点。
可选的,上述运算单元可以携带单独的缓存,如图1-5G所示,可以包括:神经元缓存单元363,该神经元缓存单元363缓存该从处理电路的输入神经元向量数据和输出神经元值数据。
如图1-5H所示,该运算单元还可以包括:权值缓存单元364,用于缓存该从处理电路在计算过程中需要的权值数据。
在一种可选实施例中,运算单元312如图1-5B所示,可以包括分支处理电路3103;其具体的连接结构如图1-5B所示,其中,
主处理电路3101与分支处理电路3103(一个或多个)连接,分支处理电路3103与一个或多个从处理电路3102连接;
分支处理电路3103,用于执行转发主处理电路3101与从处理电路3102之间的数据或指令。
在一种可选实施例中,以神经网络运算中的全连接运算为例,过程可以为:y=f(wx+b),其中,x为输入神经元矩阵,w为权值矩阵,b为偏置标量,f为激活函数,具体可以为:sigmoid函数,tanh、relu、softmax函数中的任意一个。这里假设为二叉树结构,具有8个从处理电路,其实现的方法可以为:
控制器单元从存储单元内获取输入神经元矩阵x,权值矩阵w以及全连接运算指令,将输入神经元矩阵x,权值矩阵w以及全连接运算指令传输给主处理电路;
主处理电路确定该输入神经元矩阵x为广播数据,确定权值矩阵w为分发数据,将权值矩阵w拆分成8个子矩阵,然后将8个子矩阵通过树型模块分发给8个从处理电路,将输入神经元矩阵x广播给8个从处理电路,
从处理电路并行执行8个子矩阵与输入神经元矩阵x的乘法运算和累加运算得到8个中间结果,将8个中间结果发送给主处理电路;
主处理电路,用于将8个中间结果排序得到wx的运算结果,将该运算结果执行偏置b的运算后执行激活操作得到最终结果y,将最终结果y发送至控制器单元,控制器单元将该最终结果y输出或存储至存储单元内。
如图1-5A所示的计算装置执行神经网络正向运算指令的方法具体可以为:
控制器单元从指令存储单元内提取神经网络正向运算指令、神经网络运算指令对应的操作域以及至少一个操作码,控制器单元将该操作域传输至数据访问单元,将该至少一个操作码发送至运算单元。
控制器单元从存储单元内提取该操作域对应的权值w和偏置b(当b为0时,不需要提取偏置b),将权值w和偏置b传输至运算单元的主处理电路,控制器单元从存储单元内提取输入数据Xi,将该输入数据Xi发送至主处理电路。
主处理电路依据该至少一个操作码确定为乘法运算,确定输入数据Xi为广播数据,确定权值数据为分发数据,将权值w拆分成n个数据块;
控制器单元的指令处理单元依据该至少一个操作码确定乘法指令、偏置指令和累加指令,将乘法指令、偏置指令和累加指令发送至主处理电路,主处理电路将该乘法指令、输入数据Xi以广播的方式发送给多个从处理电路,将该n个数据块分发给该多个从处理电路(例如具有n个从处理电路,那么每个从处理电路发送一个数据块);多个从处理电路,用于依据该乘法指令将该输入数据Xi与接收到的数据块执行乘法运算得到中间结果,将该中间结果发送至主处理电路,该主处理电路依据该累加指令将多个从处理电路发送的中间结果执行累加运算得到累加结果,依据该偏置指令将该累加结果执行加偏置b得到最终结果,将该最终结果发送至该控制器单元。
另外,加法运算和乘法运算的顺序可以调换。
本申请提供的技术方案通过一个指令即神经网络运算指令即实现了神经网络的乘法运算以及偏置运算,在神经网络计算的中间结果均无需存储或提取,减少了中间数据的存储以及提取操作,所以其具有减少对应的操作步骤,提高神经网络的计算效果的优点。
本申请还揭露了一个机器学习运算装置,其包括一个或多个在本申请中提到的计算装置,用于从其他处理装置中获取待运算数据和控制信息,执行指定的机器学习运算,执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口,服务器。当包含一个以上计算装置时,计算装置间可以通过特定的结构进行链接并传输数据,譬如,通过PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算。此时,可以共享同一控制系统,也可以有各自独立的控制系统;可以共享内存,也可以每个加速器有各自的内存。此外,其互联方式可以是任意互联拓扑。
该机器学习运算装置具有较高的兼容性,可通过PCIE接口与各种类型的服务器相连接。
在其中一个实施例中,请参阅图1-6,提供了一种机器学习运算的分配方法,该分配方法包括如下步骤:
S702,获取需求信息、所述终端服务器的硬件性能参数和所述云端服务器的硬件性能参数。
具体地,用户通过终端设备输入自身的需求,终端服务器获取用户输入的需求信息。用户输入的需求信息主要由三方面决定,一方面是功能需求信息,另一方面是准确度需求信息,再一方面是内存需求信息。比如,对于功能需求信息而言,比如识别所有动物需要的数据集和只需要识别猫的数据集是存在包含关系的,如果用户只是需要某一垂直领域的功能需求的话,则只需将用户的需求通过控制部分的输入获取单元进行输入,并且根据自身内存大小以及所需精度的大小选择好对应的数据集。终端服务器获取需求信息、终端服务器的硬件性能参数和云端服务器的硬件性能参数,硬件性能参数可包括运算能力、能耗、速度和精度。
S704,根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数在所述云端服务器运行的第二机器学习算法。
具体地,在终端服务器中,终端控制器单元根据所述需求信息生成对应的运算任务。并且,终端控制器单元中的终端评估电路对终端服务器以及云端服务器的运算能力、能耗、速度、精度建立数学模型进行评估,然后在终端服务器和云端服务器各选择最为适合的一种机器学习算法,然后进行训练或推理。
S706,根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令。
具体地,在终端服务器中,终端控制器单元根据用于终端服务器的第一机器学习算法的规模,并根据该第一机器学习算法的运算能力,将运算任务进行分配;以及根据用于云端服务器的第二机器学习算法的规模,并根据该第二机器学习算法的运算能力,将上述运算任务进行分配,以此让终端服务器与云端服务器分别完成同一个运算任务。在终端终端控制器单元中,终端指令生成电路会根据用户的需求和选用的数据集,并基于不同的机器学习算法的运算能力生成相对应的终端服务器控制指令与云端服务器控制指令。
更进一步地,终端通信单元与云端通信单元将控制指令在终端服务器和云端服务器之间进行传输。具体地,在控制指令生成后,终端通信单元与云端通信单元之间通过通讯协议分别在终端服务器和云端服务器之间进行传输。
上述机器学习运算的分配方法,当需要根据用户的需求信息来完成运算任务时,分别在终端服务器和云端服务器中都执行该运算任务,以此实现利用不同的机器学习算法来完成同一个运算任务的目的,并可以得到不同的精确程度的运算结果。具体而言,首先对终端服务器和云端服务器的硬件性能参数进行评估,分别选取一个运算能力较低的在终端服务器运行的第一机器学习算法和一个运算能力较高的在云端服务器运行的第二机器学习算法。基于不同的机器学习算法,在终端服务器中生成可在终端服务器中进行控制的终端服务器控制指令以及可在云端服务器中进行控制的云端服务器控制指令。
由此可知,当采用上述终端服务器控制指令和云端服务器控制指令时,根据终端服务器控制指令在终端服务器中使用运算能力较低的第一机器学习算法计算上述运算任务时,可得到一个准确性较低的运算结果。而根据云端服务器控制指令在云端服务器中使用运算能力较高的第二机器学习算法也计算上述同一个运算任务时,可得到一个准确性较高的运算结果。这样灵活地使用不同的机器学习算法分别执行同一个运算任务可使用户分别得到一个准确性较低的运算结果和一个准确性较高的运算结果,从而实现了基于用户的需求。并且,由于终端服务器的运算能力较弱,终端运算结果能够先输出,这样避免了用户需要长时间的等待,提高了处理效率,且充分利用了终端服务器与云端服务器两部分的计算资源,使得同一个运算任务可以在终端服务器与云端服务器设备上共同进行。
进一步地,在一个实施例中,该方法还包括以下步骤:
S708,分别对所述终端服务器控制指令和所述云端服务器控制指令进行解析,根据所述终端服务器控制指令获得终端控制信号,以及根据所述云端服务器控制指令获得云端控制信号。
具体地,终端控制器单元将云端服务器控制指令发送至云端服务器后,云端控制器单元中的云端指令解析电路对发来的云端服务器控制指令进行解析,获得云端控制信号,在终端控制器单元中,终端指令解析电路对终端服务器控制指令进行解析,获得终端控制信号。
S710,根据所述终端控制信号提取终端待处理数据,以及根据所述云端控制信号提取云端待处理数据。
具体地,待处理数据包括训练数据或测试数据的一种或多种。在云端服务器中,云端控制器单元根据云端控制信号提取对应的云端训练数据或者云端测试数据,发送到云端运算单元的缓冲区,同时可以预分配一定的内存空间,用于实现运算中间过程的数据交互。在终端服务器中,终端控制器单元根据终端控制信号提取对应的终端训练数据或者终端测试数据,发送到终端运算单元的缓冲区,同时可以预分配一定的内存空间,用于实现运算中间过程的数据交互。
S712,根据所述终端待处理数据计算所述终端服务器中对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果,和/或根据所述云端待处理数据计算所述云端服务器中对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
具体地,在终端服务器中,终端控制器单元发送终端待处理数据至终端运算单元,终端运算单元根据传送的终端待处理数据计算终端服务器中对应的每个阶段的第一机器学习算法的运算任务。在云端服务器中,云端控制器单元发送云端待处理数据至云端运算单元,云端运算单元根据传送的云端待处理数据计算云端服务器中对应的每个阶段的第二机器学习算法的运算任务。
在云端服务器与终端服务器进行运算的过程中,同时伴随着云端服务器与终端服务器之 间的数据通信,终端通信单元根据相应的终端控制信号发送数据给云端通信单元,反过来,云端通信单元也根据相应的云端控制信号发送数据给终端通信单元,将终端运算结果与云端运算结果通过终端服务器发送至用户的终端设备上。
在其中一个实施例中,涉及根据服务器的运算能力选取机器学习算法的具体过程。在本实施例中,S704包括:
S7042,获取所述终端服务器的运算能力和所述云端服务器的运算能力;
S7044,根据所述运算任务、所述终端服务器的运算能力选取第一机器学习算法,以及根据所述运算任务、所述云端服务器的运算能力选取第二机器学习算法。
具体地,需要清楚,终端服务器的运算能力相对于云端服务器的运算能力要弱。因此,相对应地,根据终端服务器的运算能力选择一个运算能力较低的第一机器学习算法,而根据云端服务器的运算能力选择一个运算能力较高的第二机器学习算法。运算能力的高低影响计算时间以及计算精度,比如,采用运算能力较高的第二机器学习算法可得到一个更为准确的运算结果,但可能计算时间较长。
在其中一个实施例中,涉及终止计算云端运算结果的具体过程,其中,该分配方法还包括:
S714,将所述终端运算结果输出后,在接收到停止运算指令时,终止所述云端服务器的运算工作。
具体地,终端服务器将终端运算结果输出后,此时,用户可以得到一个准确性较低的运算结果。若用户想获得一个更为准确的运算结果,可等待云端服务器运算完成后,将云端运算结果通过终端服务器输出,这时,用户便分别得到了一个准确性较低的运算结果和一个准确性较高的运算结果。然而,若用户在得到一个准确性较低的运算结果后,并不想获得更准确性的运算结果,则通过用户终端输入停止运算指令,分配系统接收该停止运算指令,并终止云端服务器的运算工作,即准确性较高的运算结果处于尚未完成状态或即使完成但不再输出状态。
在其中一个实施例中,涉及对终端服务器控制指令进行解析的具体过程。其中,S708具体包括:
S7082,利用终端服务器对所述终端服务器控制指令进行解析,获得终端控制信号;
S7084,根据所述终端控制信号提取相对应的终端训练数据或者终端测试数据。
具体地,终端指令解析电路用于对终端服务器控制指令进行解析以得到终端控制信号,并根据终端控制信号提取相对应的终端训练数据或者终端测试数据。其中,数据包括图像、音频、文本等。图像包括静态图片、组成视频的图片、或视频等。音频包括人声音频、乐曲、噪声等。文本包括结构化文本、各种语言的文本字符等。
在其中一个实施例中,涉及对云端服务器控制指令进行解析的具体过程。其中,S708还包括:
S7081,利用云端服务器对所述云端服务器控制指令进行解析,获得云端控制信号;
S7083,根据所述云端控制信号提取相对应的云端训练数据或者云端测试数据。
具体地,云端指令解析电路用于对云端服务器控制指令进行解析以得到云端控制信号,并根据云端控制信号提取相对应的云端训练数据或者云端测试数据。
在一个实施例中,涉及运算结果如何得出的具体过程。其中,S712具体包括:
S7122,利用终端服务器并根据所述终端训练数据或者终端测试数据,计算所述终端服务 器中对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果。
在另一个实施例中,S712具体包括:
S7124,利用云端服务器并根据所述云端训练数据或者云端测试数据,计算所述云端服务器中对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
结合上述实施例具体说明,在云端服务器中,云端运算单元根据云端训练数据或者云端测试数据执行对应的每个阶段的第二机器学习算法的运算,得到云端运算结果。在终端服务器中,终端运算单元根据终端训练数据或者终端测试数据执行对应的每个阶段的第一机器学习算法的运算,得到终端运算结果。在云端服务器和终端服务器的运算过程中,通过云端通信单元和终端通信单元共同完成终端服务器和云端服务器之间的数据通信。云端服务器和终端服务器之间的运算部分和存储部分的数据通信分别通过云端控制器单元和终端通信单元进行转发,最终由云端通信单元和终端通信单元共同进行交互。
由于在终端服务器中使用低运算能力的神经网络计算上述运算任务,因此,可先得到一个准确性较低的运算结果,之后,基于用户的进一步的需求信息,可进一步得到在云端服务器中使用高运算能力的神经网络得到的准确性较高的运算结果。
应该理解的是,虽然图1-6的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,图1-6中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
在数据处理领域,神经网络(neural network)已经获得了非常成功的应用,但是大规模神经网络的运算需要消耗大量的计算时间和计算能耗,对处理平台带来严峻挑战。因此,减少神经网络的计算时间和计算能耗成为一个亟待解决的问题。
请参见图2-1A,图2-1A为本发明实施例提供的一种计算装置的结构示意图。如图2-1所示,该计算装置100包括:
所述存储单元1019,用于存储权值和输入神经元,所述权值包括重要比特位和非重要比特位;
所述控制器单元1029,用于获取所述权值的重要比特位和非重要比特位,以及所述输入神经元,并将所述权值的重要比特位和非重要比特位、所述输入神经元传输给所述运算单元1039;
所述运算单元1039,用于将所述输入神经元和所述重要比特位进行运算,得到输出神经元的第一运算结果;
以及若所述第一运算结果小于或等于预设阈值,则跳过当前输出神经元的运算;
若所述第一运算结果大于所述预设阈值,则将所述输入神经元与所述非重要比特位进行运算,得到第二运算结果,将所述第一运算结果与所述第二运算结果之和作为输出神经元。
其中,存储单元1019中存储的数据,输入神经元或者,权值,其包括浮点型数据和定点型数据,将浮点型数据中的符号位和指数部分指定为重要比特位,将底数部分指定为非重要比特位,将定点型数据中的符号位和数值部分的前x比特位指定为重要比特位,将数值部分的剩余比特指定为非重要比特位,其中,x为大于等于0且小于m的正整数,m为定点型数 据的总比特位。将重要比特位存放在错误检查和纠正ECC(Error Correcting Code:简称ECC)内存进行精确存储,将非重要比特位存放在非ECC内存都进行非精确存储。
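For float32 data, this split can be sketched in a few lines of Python (struct-based bit manipulation; the function name and the optional mantissa_keep parameter are assumptions made here for illustration):

```python
import struct

def split_float32(value, mantissa_keep=0):
    # Important bits: sign + 8 exponent bits (plus optionally the top
    # mantissa_keep mantissa bits) -> stored precisely, e.g. in ECC memory.
    # Non-important bits: the remaining mantissa bits -> non-ECC memory.
    bits = struct.unpack('<I', struct.pack('<f', value))[0]
    important_mask = ~((1 << (23 - mantissa_keep)) - 1) & 0xFFFFFFFF
    important = bits & important_mask
    non_important = bits & ~important_mask & 0xFFFFFFFF
    return important, non_important

imp, non = split_float32(3.14159)
# Recombining both parts restores the original float32 bit pattern exactly.
print(struct.unpack('<f', struct.pack('<I', imp | non))[0])
```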
上述预设阈值可以由用户自行设置或者系统默认,例如,预设阈值可以为0,或者,也可以为其他整数,或者,小数。
在一个可能的示例中,若所述输入神经元以N in表示,该输入神经元包括n个比特位,其中,n个比特位包括n1个重要比特位和n2个非重要比特位,若n1个重要比特位对应的值以N1 in表示,该n2个非重要比特位对应的值以N2 in表示,则n1+n2=n,N in=N1 in+N2 in,n为正整数,n1为自然数且小于n。
在一个可能的示例中,所述n1个重要比特位的位置是连续的,或者,不连续的。
在一个可能的示例中,若所述权值以W表示,该权值包括w个比特位,其中,w1个比特位为重要比特位,w2个比特位为非重要比特位,若该w1个比特位对应的值以W1表示,该w2个比特位对应的值为W2表示,则w1+w2=w,W=W1+W2,w为正整数,w1为自然数且小于w。
在一个可能的示例中,所述n1个重要比特位的位置是连续的,或者,不连续的。
在一个可能的示例中,在所述输入神经元为多个时,所述运算单元1039包括多个乘法器和至少一个加法器;
所述多个乘法器和所述至少一个加法器，用于按照如下公式计算所述输出神经元：

$N_{out} = \sum_{i=1}^{T} N_{in}(i)\,W(i) = \sum_{i=1}^{T} N_{in}(i)\,W1(i) + \sum_{i=1}^{T} N_{in}(i)\,W2(i)$

其中，运算单元1039包括多个乘法器和至少一个加法器，运算单元通过多个乘法器和至少一个加法器完成上述运算。T为输入神经元的数量，$N_{out}$为输出神经元，$N1_{in}(i)$是第i个输入神经元的重要比特位，$N2_{in}(i)$表示第i个输入神经元的非重要比特位，$W1(i)$为第i个权值的重要比特位，$W2(i)$为第i个权值的非重要比特位，$N_{in}(i)$表示第i个输入神经元的值，$W(i)$表示第i个权值的值，$N_{in}(i)=N1_{in}(i)+N2_{in}(i)$且$W(i)=W1(i)+W2(i)$；优先计算$N_{out}$中的$\sum_{i=1}^{T} N_{in}(i)\,W1(i)$，并将其作为所述第一运算结果。
具体实现中，输出神经元的计算公式如下：

$N_{out} = \sum_{i=1}^{T} N_{in}(i)\,W(i)$

其变换形式如下：

$N_{out} = \sum_{i=1}^{T} \big(N1_{in}(i)+N2_{in}(i)\big)\big(W1(i)+W2(i)\big) = \sum_{i=1}^{T} N_{in}(i)\,W1(i) + \sum_{i=1}^{T} N_{in}(i)\,W2(i)$

其中，$\sum_{i=1}^{T} N_{in}(i)\,W1(i)$可运用于神经网络模型的连接层、卷积层或者lstm层运算，因为这些运算用到了这种内积操作。
在一个可能的示例中，所述运算单元1039还包括比较器，所述运算单元1039具体用于：在所述比较器的比较结果为所述第一运算结果小于或等于预设阈值时，跳过所述输出神经元的运算；若所述第一运算结果大于所述预设阈值，则将所述输入神经元与所述非重要比特位进行运算，得到第二运算结果，将所述第一运算结果与所述第二运算结果之和作为输出神经元。具体而言：若$\sum_{i=1}^{T} N_{in}(i)\,W1(i)$小于或等于所述预设阈值，则跳过当前输出神经元的运算；若其大于所述预设阈值，则继续运算$\sum_{i=1}^{T} N_{in}(i)\,W2(i)$，并输出最终的$N_{out}=\sum_{i=1}^{T} N_{in}(i)\,W1(i)+\sum_{i=1}^{T} N_{in}(i)\,W2(i)$。

其中，上述比较器主要用于比较运算：上述第一运算结果若小于或等于预设阈值，则跳过当前输出神经元的运算，执行下一个输出神经元的内积运算。
可以看出,在本发明实施例的方案中,获取权值的重要比特位和非重要比特位,以及输入神经元,将输入神经元和重要比特位进行运算,得到输出神经元的第一运算结果,若第一运算结果小于或等于预设阈值,则跳过当前输出神经元的运算,若第一运算结果大于预设阈值,则将输入神经元与非重要比特位之间进行运算,得到第二运算结果,将第一运算结果与第二运算结果之和作为输出神经元,进而,如果某个输出神经元的预测结果为不需要进行运算,则跳过该输出神经元的运算过程。新的运算装置中集成了运算方法,能够预测并跳过不需要进行运算的输出神经元。从而减少神经网络的计算时间和计算能耗。
在一个可能的示例中,所述运算单元包括:一个主处理电路和多个从处理电路;
所述主处理电路用于将所述输入神经元拆分为多个数据块,将所述权值的重要比特位广播给所述多个从处理电路,将所述多个数据块分发给所述多个从处理电路;
所述从处理电路,用于将接收到的数据块与权重的重要比特位进行运算,得到部分结果,将所述部分结果发送给所述主处理电路;
所述主处理电路,还具体用于将接收到的所有部分结果拼接,得到所述第一运算结果。
在一个可能的示例中,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路,
所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的数据块、广播数据以及运算指令。
在一个可能的示例中,所述多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,所述主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的p个从处理电路、第q行的p个从处理电路以及第1列的q个从处理电路;
所述K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的数据以及指令的转发;
所述主处理电路,用于确定所述输入神经元为分发数据,权值的重要比特位为广播数据,将一个分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述K个从处理电路;
所述K个从处理电路,用于转换所述主处理电路与所述多个从处理电路之间的数据。
在一个可能的示例中,所述主处理电路包括:激活处理电路、加法处理电路中的一种或任意组合。
在一个可能的示例中,所述从处理电路包括:乘法处理电路;
所述乘法处理电路,用于对接收到的数据块执行乘积运算得到乘积结果。
在一个可能的示例中,所述从处理电路还包括:累加处理电路,所述累加处理电路,用于对该乘积结果执行累加运算。
参阅图2-1B,图2-1B为本申请实施例提供的一种分层存储装置结构示意图,如图2-1B所示,该装置包括:精确存储单元和非精确存储单元,精确存储单元用于存储数据中的重要比特位,非精确存储单元用于存储数据中的非重要比特位。
精确存储单元采用错误检查和纠正ECC内存,非精确存储单元采用非ECC内存。
进一步地,分层存储装置存储的数据为神经网络参数,包括输入神经元、权值和输出神经元,精确存储单元存储输入神经元、输出神经元以及权值的重要比特位,非精确存储单元存储输入神经元、输出神经元以及权值的非重要比特位。
进一步地,分层存储装置存储的数据包括浮点型数据和定点型数据,将浮点型数据中的符号位和指数部分指定为重要比特位,将底数部分指定为非重要比特位,将定点型数据中的符号位和数值部分的前x比特位指定为重要比特位,将数值部分的剩余比特指定为非重要比特位,其中,x为大于等于0且小于m的正整数,m为定点型数据的总比特位。将重要比特位存放在ECC内存进行精确存储,将非重要比特位存放在非ECC内存都进行非精确存储。
进一步地,ECC内存包括有ECC校验的DRAM(Dynamic Random Access Memory,简称:DRAM)动态随机存取存储器和有ECC校验的SRAM(StaticRandom-AccessMemory,简称SRAM)静态随机存取存储器;其中,有ECC校验的SRAM可采用3T SRAM。
进一步地,非ECC内存包括非ECC校验的DRAM和非ECC校验的SRAM,非ECC校验的SRAM可采用3TSRAM。
其中,3T SRAM中存放的每一个比特的单元由3个MOS管组成。
参阅图2-1C,图2-1C为本申请实施例提供的一种3T SRAM存储单元的结构示意图,如图2-1C所示,3T SRAM存储单元由3个MOS组成,分别是M1(第一MOS管),M2(第二MOS管)和M3(第三MOS管)。M1用于门控,M2和M3用于存储。
M1栅极与字线WL(Word Line)电连接,源极与位线BL(Bit Line)电连接;M2栅极与M3源极连接,并通过电阻R2与工作电压Vdd连接,M2漏极接地;M3栅极与M2源极、M1漏极连接,并通过电阻R1与工作电压Vdd连接,M3漏极接地。WL用来控制存储单元的门控访问,BL来进行存储单元的读写。当进行读操作时,拉高WL,从BL中读出位即可。当进行写操作时,拉高WL,拉高或者拉低BL,由于BL的驱动能力比存储单元强,会强制覆盖原来的状态。
本申请的存储装置采用近似存储技术,能够充分挖掘神经网络的容错能力,将神经参数进行近似存储,参数中重要的比特位采用精确存储,不重要的比特位采用非精确存储,从而减少存储开销和访存能耗开销。
本申请的实施例提供了一种数据处理装置,该装置近似与存储技术相对应的加速装置,参阅图2-1D,图2-1D为本申请实施例提供的一种数据处理装置的结构示意图,该数据处理装置包括:非精确运算单元、指令控制单元和上述的分层存储装置。
分层存储装置接收指令和运算参数,并将运算参数中的重要比特位和指令存储于精确存 储单元,将运算参数中的非重要比特位存储于非精确存储单元。
指令控制单元接收分层存储装置中的指令,并将指令进行译码生成控制信息控制非精确运算单元进行计算操作。
非精确运算单元接收分层存储装置中的运算参数,依据控制信息进行运算,并将运算结果传输至分层存储装置进行存储或输出。
进一步地,非精确运算单元为神经网络处理器。进一步地,上述运算参数为神经网络参数,分层存储装置用来存储神经网络的神经元,权值和指令,将神经元的重要比特位、权值的重要比特位和指令存储在精确存储单元,神经元的非重要比特位和权值的非重要比特位存储在非精确存储单元。非精确运算单元接收分层存储装置中的输入神经元和权值,依据控制信息完成神经网络运算得到输出神经元,并将输出神经元重新传输至分层存储装置进行存储或输出。
进一步地,非精确运算单元可以有两种计算模式:(1)非精确运算单元直接接收来自分层存储装置的精确存储单元中的输入神经元的重要比特位和权值的重要比特位进行计算;(2)非精确运算单元接收重要比特位和非重要比特位拼接完整的输入神经元和权值进行计算,其中,输入神经元和权值的重要比特位和非重要比特位在存储单元中读取时进行拼接。
进一步地,参阅图2-1E,如图2-1E所示,数据处理装置还包括预处理模块,用于对输入的原始数据进行预处理并传输至存储装置,预处理包括切分、高斯滤波、二值化、正则化、归一化等等。
进一步地,数据处理装置还包括指令缓存、输入神经元分层缓存、权值分层缓存和输出神经元分层缓存,其中,指令缓存设置在分层存储装置和指令控制单元之间,用于存储专用指令;输入神经元分层缓存设置在存储装置和非精确运算单元之间,用于缓存输入神经元,输入神经元分层缓存包括输入神经元精确缓存和输入神经元非精确缓存,分别缓存输入神经元的重要比特位和非重要比特位;权值分层缓存设置在存储装置和非精确运算单元之间,用于缓存权值数据,权值分层缓存包括权值精确缓存和权值非精确缓存,分别缓存权值的重要比特位和非重要比特位;输出神经元分层缓存设置在存储装置和非精确运算单元之间,用于缓存输出神经元,所述输出神经元分层缓存包括输出神经元精确缓存和输出神经元非精确缓存,分别缓存输出神经元的重要比特位和非重要比特位。
进一步地,数据处理装置还包括直接数据存取单元DMA(direct memory access),用于在存储装置、指令缓存、权值分层缓存、输入神经元分层缓存和输出神经元分层缓存中进行数据或者指令读写。
进一步地,上述指令缓存、输入神经元分层缓存、权值分层缓存和输出神经元分层缓存均采用3T SRAM。
进一步地,非精确运算单元包括但不限于三个部分,第一部分乘法器,第二部分加法树,第三部分为激活函数单元。第一部分将输入数据1(in1)和输入数据2(in2)相乘得到相乘之后的输出(out),过程为:out=in1*in2;第二部分将输入数据in1通过加法树逐级相加得到输出数据(out),其中in1是一个长度为U的向量,U大于1,过程为:out=in1[1]+in1[2]+...+in1[U];或者,将输入数据(in1)通过加法树累加之后和输入数据(in2)相加得到输出数据(out),过程为:out=in1[1]+in1[2]+...+in1[U]+in2;或者,将输入数据(in1)和输入数据(in2)相加得到输出数据(out),过称为:out=in1+in2;第三部分将输入数据(in)通过激活函数(active)运算得到激活输出数据(out),过程为:out=active(in),激活函数active 可以是sigmoid、tanh、relu、softmax等,除了做激活操作,第三部分可通过其他的非线性函数将输入数据(in)通过运算(f)得到输出数据(out),过程为:out=f(in)。
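The three parts can be mirrored by a few NumPy stand-ins (a behavioral sketch of the formulas above, with relu assumed as the activation; this is not the hardware unit itself, and the function names are illustrative):

```python
import numpy as np

def part1_multiply(in1, in2):
    return in1 * in2                       # out = in1 * in2

def part2_adder_tree(in1, in2=None):
    out = np.sum(in1)                      # out = in1[1] + in1[2] + ... + in1[U]
    return out + in2 if in2 is not None else out

def part3_activate(x, active=lambda v: np.maximum(v, 0.0)):
    return active(x)                       # out = active(in)

in1 = np.array([1.0, 2.0, 3.0, 4.0])
in2 = np.array([0.5, 0.5, 0.5, 0.5])
products = part1_multiply(in1, in2)            # element-wise products
summed = part2_adder_tree(products, in2=0.1)   # adder tree, then add in2
print(part3_activate(summed))                  # activation of the accumulated sum
```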
非精确运算单元还可以包括池化单元,池化单元将输入数据(in)通过池化运算得到输出数据(out),过程为out=pool(in),其中pool为池化运算,池化运算包括但不限于:平均值池化,最大值池化,中值池化,输入数据in是和输出out相关的一个池化核中的数据。
非精确运算单元执行运算包括几个部分,第一部分是将输入数据1和输入数据2相乘,得到相乘之后的数据;第二部分执行加法树运算,用于将输入数据1通过加法树逐级相加,或者将所述输入数据1通过加法树逐级相加后和输入数据2相加得到输出数据;第三部分执行激活函数运算,对输入数据通过激活函数(active)运算得到输出数据。以上几个部分的运算可以自由组合,从而实现各种不同功能的运算。
本申请的数据处理装置能够充分利用近似存储技术,并充分挖掘神经网络的容错能力,减少神经网络的计算量和神经网络访存量,从而减少计算能耗和访存能耗。通过采用针对多层人工神经网络运算的专用SIMD指令和定制的运算单元,解决了CPU和GPU运算性能不足,前端译码开销大的问题,有效提高了对多层人工神经网络运算算法的支持;通过采用针对多层人工神经网络运算算法的专用非精确存储的片上缓存,充分挖掘了输入神经元和权值数据的重要性,避免了反复向内存读取这些数据,降低了内存访问带宽,避免了内存带宽成为多层人工神经网络运算及其训练算法性能瓶颈的问题。
以上仅是示例性的说明,但本申请并不限于此,数据处理装置可以包括非神经网络处理器,例如,通用运算处理器,通用运算具有相应的通用运算指令和数据,例如,标量算数运算、标量逻辑运算等,通用运算处理器例如但不限于包括一个或多个乘法器、一个或多个加法器,执行例如加法、乘法等基本运算。
在本实施例中,计算装置100是以模块的形式来呈现。这里的“模块”可以指特定应用集成电路(application-specific integrated circuit,ASIC),执行一个或多个软件或固件程序的处理器和存储器,集成逻辑电路,和/或其他可以提供上述功能的器件。此外,以上存储单元1019、控制器单元1029和运算单元1039可通过图2-2~图2-13所示的装置来实现。
参阅图2-2,提供了一种计算装置,该计算装置用于执行机器学习计算,该计算装置包括:控制器单元11和运算单元12,其中,控制器单元11与运算单元12连接,该运算单元12包括:一个主处理电路和多个从处理电路;
控制器单元11,用于获取输入数据以及计算指令;在一种可选方案中,具体的,获取输入数据以及计算指令方式可以通过数据输入输出单元得到,该数据输入输出单元具体可以为一个或多个数据I/O接口或I/O引脚。
上述计算指令包括但不限于:正向运算指令或反向训练指令,或其他神经网络运算指令等等,例如卷积运算指令,本申请具体实施方式并不限制上述计算指令的具体表现形式。
控制器单元11,还用于解析该计算指令得到多个运算指令,将该多个运算指令以及所述输入数据发送给所述主处理电路;
主处理电路101,用于对所述输入数据执行前序处理以及与所述多个从处理电路之间传输数据以及运算指令;
多个从处理电路102,用于依据从所述主处理电路传输的数据以及运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给所述主处理电路;
主处理电路101,用于对所述多个中间结果执行后续处理得到所述计算指令的计算结果。
本申请提供的技术方案将运算单元设置成一主多从结构,对于正向运算的计算指令,其可以将依据正向运算的计算指令将数据进行拆分,这样通过多个从处理电路即能够对计算量较大的部分进行并行运算,从而提高运算速度,节省运算时间,进而降低功耗。
可选的,上述机器学习计算具体可以包括:人工神经网络运算,上述输入数据具体可以包括:输入神经元数据和权值数据。上述计算结果具体可以为:人工神经网络运算的结果即输出神经元数据。
对于神经网络中的运算可以为神经网络中的一层的运算,对于多层神经网络,其实现过程是,在正向运算中,当上一层人工神经网络执行完成之后,下一层的运算指令会将运算单元中计算出的输出神经元作为下一层的输入神经元进行运算(或者是对该输出神经元进行某些操作再作为下一层的输入神经元),同时,将权值也替换为下一层的权值;在反向运算中,当上一层人工神经网络的反向运算执行完成后,下一层运算指令会将运算单元中计算出的输入神经元梯度作为下一层的输出神经元梯度进行运算(或者是对该输入神经元梯度进行某些操作再作为下一层的输出神经元梯度),同时将权值替换为下一层的权值。
上述机器学习计算还可以包括支持向量机运算,k-近邻(k-nn)运算,k-均值(k-means)运算,主成分分析运算等等。为了描述的方便,下面以人工神经网络运算为例来说明机器学习计算的具体方案。
对于人工神经网络运算,如果该人工神经网络运算具有多层运算,多层运算的输入神经元和输出神经元并非是指整个神经网络的输入层中神经元和输出层中神经元,而是对于网络中任意相邻的两层,处于网络正向运算下层中的神经元即为输入神经元,处于网络正向运算上层中的神经元即为输出神经元。以卷积神经网络为例,设一个卷积神经网络有L层,K=1,2,...,L-1,对于第K层和第K+1层来说,我们将第K层称为输入层,其中的神经元为所述输入神经元,第K+1层称为输出层,其中的神经元为所述输出神经元。即除最顶层外,每一层都可以作为输入层,其下一层为对应的输出层。
可选的,上述计算装置还可以包括:该存储单元10和直接内存访问单元50,存储单元10可以包括:寄存器、缓存中的一个或任意组合,具体的,所述缓存,用于存储所述计算指令;所述寄存器,用于存储所述输入数据和标量;所述缓存为高速暂存缓存。直接内存访问单元50用于从存储单元10读取或存储数据。
可选的,该控制器单元包括:指令存储单元110、指令处理单元111和存储队列单元113;
指令存储单元110,用于存储所述人工神经网络运算关联的计算指令;
所述指令处理单元111,用于对所述计算指令解析得到多个运算指令;
存储队列单元113,用于存储指令队列,该指令队列包括:按该队列的前后顺序待执行的多个运算指令或计算指令。
举例说明,在一个可选的技术方案中,主运算处理电路也可以包括一个控制器单元,该控制器单元可以包括主指令处理单元,具体用于将指令译码成微指令。当然在另一种可选方案中,从运算处理电路也可以包括另一个控制器单元,该另一个控制器单元包括从指令处理单元,具体用于接收并处理微指令。上述微指令可以为指令的下一级指令,该微指令可以通过对指令的拆分或解码后获得,能被进一步解码为各部件、各单元或各处理电路的控制信号。
在一种可选方案中,该计算指令的结构可以如下表所示。
操作码 寄存器或立即数 寄存器/立即数
上表中的省略号表示可以包括多个寄存器或立即数。
在另一种可选方案中,该计算指令可以包括:一个或多个操作域以及一个操作码。该计算指令可以包括神经网络运算指令。以神经网络运算指令为例,如表1所示,其中,寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以为操作域。其中,每个寄存器号0、寄存器号1、寄存器号2、寄存器号3、寄存器号4可以是一个或者多个寄存器的号码。
（表1：神经网络运算指令的指令格式，即操作码及寄存器0至寄存器4等操作域；原表图像未收录）
上述寄存器可以为片外存储器,当然在实际应用中,也可以为片内存储器,用于存储数据,该数据具体可以为p维数据,p为大于等于1的整数,例如,p=1时,为1维数据,即向量,如p=2时,为2维数据,即矩阵,如p=3或3以上时,为多维张量。
可选的,该控制器单元还可以包括:
所述依赖关系处理单元108,用于在具有多个运算指令时,确定第一运算指令与所述第一运算指令之前的第零运算指令是否存在关联关系,如所述第一运算指令与所述第零运算指令存在关联关系,则将所述第一运算指令缓存在所述指令存储单元内,在所述第零运算指令执行完毕后,从所述指令存储单元提取所述第一运算指令传输至所述运算单元;
所述确定该第一运算指令与第一运算指令之前的第零运算指令是否存在关联关系包括:
依据所述第一运算指令提取所述第一运算指令中所需数据(例如矩阵)的第一存储地址区间,依据所述第零运算指令提取所述第零运算指令中所需矩阵的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,则确定所述第一运算指令与所述第零运算指令具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一运算指令与所述第零运算指令不具有关联关系。
在另一种可选实施例中,运算单元12如图2-4所示,可以包括一个主处理电路101和多个从处理电路102。在一个实施例里,如图2-4所示,多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的p个从处理电路、第q行的p个从处理电路以及第1列的q个从处理电路,需要说明的是,如图2-4所示的K个从处理电路仅包括第1行的p个从处理电路、第q行的p个从处理电路以及第1列的q个从处理电路,即该k个从处理电路为多个从处理电路中直接与主处理电路连接的从处理电路。
K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的数据以及指令的转 发。
可选的,如图2-5所示,该主处理电路还可以包括:转换处理电路114、激活处理电路115、加法处理电路116中的一种或任意组合;
转换处理电路114,用于将主处理电路接收的数据块或中间结果执行第一数据结构与第二数据结构之间的互换(例如连续数据与离散数据的转换);或将主处理电路接收的数据块或中间结果执行第一数据类型与第二数据类型之间的互换(例如定点类型与浮点类型的转换);
激活处理电路115,用于执行主处理电路内数据的激活运算;
加法处理电路116,用于执行加法运算或累加运算。
所述主处理电路,用于将确定所述输入神经元为广播数据,权值为分发数据,将分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述从处理电路;
所述多个从处理电路,用于依据该运算指令对接收到的数据块执行运算得到中间结果,并将运算结果传输给所述主处理电路;
所述主处理电路,用于将多个从处理电路发送的中间结果进行处理得到该计算指令的结果,将该计算指令的结果发送给所述控制器单元。
所述从处理电路包括:乘法处理电路;
所述乘法处理电路,用于对接收到的数据块执行乘积运算得到乘积结果;
转发处理电路(可选的),用于将接收到的数据块或乘积结果转发。
累加处理电路,所述累加处理电路,用于对该乘积结果执行累加运算得到该中间结果。
另一个实施例里,该运算指令为矩阵乘以矩阵的指令、累加指令、激活指令等等计算指令。
下面通过神经网络运算指令来说明如图2-2所示的计算装置的具体计算方法。对于神经网络运算指令来说,其实际需要执行的公式可以为:
$s = s\left(\sum_i w_i x_i + b\right)$
其中,即将权值w乘以输入数据x i,进行求和,然后加上偏置b后做激活运算s(h),得到最终的输出结果s。
在一种可选的实施方案中,如图2-6所示,所述运算单元包括:树型模块40,所述树型模块包括:一个根端口401和多个支端口404,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
上述树型模块具有收发功能,例如如图2-6所示,该树型模块即为发送功能,如图2-11所示,该树型模块即为接收功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据块、权值以及运算指令。
可选的,该树型模块为计算装置的可选择结果,其可以包括至少1层节点,该节点为具有转发功能的线结构,该节点本身可以不具有计算功能。如树型模块具有零层节点,即无需该树型模块。
可选的,该树型模块可以为p叉树结构,例如,如图2-7所示的二叉树结构,当然也可以为三叉树结构,该p可以为大于等于2的整数。本申请具体实施方式并不限制上述p的具体取值,上述层数也可以为2,从处理电路可以连接除倒数第二层节点以外的其他层的节点,例如可以连接如图2-7所示的倒数第一层的节点。
可选的,上述运算单元可以携带单独的缓存,如图2-8所示,可以包括:神经元缓存单 元,该神经元缓存单元63缓存该从处理电路的输入神经元向量数据和输出神经元值数据。
如图2-9所示,该运算单元还可以包括:权值缓存单元64,用于缓存该从处理电路在计算过程中需要的权值数据。
在一种可选实施例中,运算单元12如图2-3所示,可以包括分支处理电路103;其具体的连接结构如图2-3所示,其中,
主处理电路101与分支处理电路103(一个或多个)连接,分支处理电路103与一个或多个从处理电路102连接;
分支处理电路103,用于执行转发主处理电路101与从处理电路102之间的数据或指令。
在一种可选实施例中,以神经网络运算中的全连接运算为例,过程可以为:y=f(wx+b),其中,x为输入神经元矩阵,w为权值矩阵,b为偏置标量,f为激活函数,具体可以为:sigmoid函数,tanh、relu、softmax函数中的任意一个。这里假设为二叉树结构,具有8个从处理电路,其实现的方法可以为:
控制器单元从存储单元内获取输入神经元矩阵x,权值矩阵w以及全连接运算指令,将输入神经元矩阵x,权值矩阵w以及全连接运算指令传输给主处理电路;
主处理电路确定该输入神经元矩阵x为广播数据,确定权值矩阵w为分发数据,将权值矩阵w拆分成8个子矩阵,然后将8个子矩阵通过树型模块分发给8个从处理电路,将输入神经元矩阵x广播给8个从处理电路,
从处理电路并行执行8个子矩阵与输入神经元矩阵x的乘法运算和累加运算得到8个中间结果,将8个中间结果发送给主处理电路;
主处理电路,用于将8个中间结果排序得到wx的运算结果,将该运算结果执行偏置b的运算后执行激活操作得到最终结果y,将最终结果y发送至控制器单元,控制器单元将该最终结果y输出或存储至存储单元内。
如图2-2所示的计算装置执行神经网络正向运算指令的方法具体可以为:
控制器单元从指令存储单元内提取神经网络正向运算指令、神经网络运算指令对应的操作域以及至少一个操作码,控制器单元将该操作域传输至数据访问单元,将该至少一个操作码发送至运算单元。
控制器单元从存储单元内提取该操作域对应的权值w和偏置b(当b为0时,不需要提取偏置b),将权值w和偏置b传输至运算单元的主处理电路,控制器单元从存储单元内提取输入数据Xi,将该输入数据Xi发送至主处理电路。
主处理电路依据该至少一个操作码确定为乘法运算,确定输入数据Xi为广播数据,确定权值数据为分发数据,将权值w拆分成p个数据块;
控制器单元的指令处理单元依据该至少一个操作码确定乘法指令、偏置指令和累加指令,将乘法指令、偏置指令和累加指令发送至主处理电路,主处理电路将该乘法指令、输入数据Xi以广播的方式发送给多个从处理电路,将该p个数据块分发给该多个从处理电路(例如具有p个从处理电路,那么每个从处理电路发送一个数据块);多个从处理电路,用于依据该乘法指令将该输入数据Xi与接收到的数据块执行乘法运算得到中间结果,将该中间结果发送至主处理电路,该主处理电路依据该累加指令将多个从处理电路发送的中间结果执行累加运算得到累加结果,依据该偏置指令将该累加结果执行加偏置b得到最终结果,将该最终结果发送至该控制器单元。
另外,加法运算和乘法运算的顺序可以调换。
本申请提供的技术方案通过一个指令即神经网络运算指令即实现了神经网络的乘法运算以及偏置运算,在神经网络计算的中间结果均无需存储或提取,减少了中间数据的存储以及提取操作,所以其具有减少对应的操作步骤,提高神经网络的计算效果的优点。
本申请还揭露了一个机器学习运算装置,其包括一个或多个在本申请中提到的计算装置,用于从其他处理装置中获取待运算数据和控制信息,执行指定的机器学习运算,执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口,服务器。当包含一个以上计算装置时,计算装置间可以通过特定的结构进行链接并传输数据,譬如,通过PCIE总线进行互联并传输数据,以支持更大规模的机器学习的运算。此时,可以共享同一控制系统,也可以有各自独立的控制系统;可以共享内存,也可以每个加速器有各自的内存。此外,其互联方式可以是任意互联拓扑。
该机器学习运算装置具有较高的兼容性,可通过PCIE接口与各种类型的服务器相连接。
本申请还揭露了一个组合处理装置,其包括上述的机器学习运算装置,通用互联接口,和其他处理装置。机器学习运算装置与其他处理装置进行交互,共同完成用户指定的操作。图2-10为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为机器学习运算装置与外部数据和控制的接口,包括数据搬运,完成对本机器学习运算装置的开启、停止等基本控制;其他处理装置也可以和机器学习运算装置协作共同完成运算任务。
通用互联接口,用于在所述机器学习运算装置与其他处理装置间传输数据和控制指令。该机器学习运算装置从其他处理装置中获取所需的输入数据,写入机器学习运算装置片上的存储装置;可以从其他处理装置中获取控制指令,写入机器学习运算装置片上的控制缓存;也可以读取机器学习运算装置的存储模块中的数据并传输给其他处理装置。
可选的,该结构如图2-12所示,还可以包括存储装置,存储装置分别与所述机器学习运算装置和所述其他处理装置连接。存储装置用于保存在所述机器学习运算装置和所述其他处理装置的数据,尤其适用于所需要运算的数据在本机器学习运算装置或其他处理装置的内部存储中无法全部保存的数据。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上系统,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一些实施例里,还申请了一种芯片,其包括了上述机器学习运算装置或组合处理装置。
在一些实施例里,申请了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,申请了一种板卡,其包括了上述芯片封装结构。参阅图2-13,图2-13提供了一种板卡,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同 步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。可选的,当采用PCIE 3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的,所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路,可以带动多个负载。因此,所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制装置可以实现对所述芯片中多个处理芯片、多个处理和或多个处理电路的工作状态的调控。
在一些实施例里,申请了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
具体实现中，输出神经元的上述第一运算结果，即$\sum_{i=1}^{T} N_{in}(i)\,W1(i)$，是再分成m1*m2份，并由m1个从运算模块来做的，m2>=1，即是m次（m拍）。k*m份的数据在从运算模块里算完后可以传给主运算模块，让主运算模块累加，k>=2。对于上述的主+互联模块+从的架构来说，也可以在互联（例如，K树（如图2-7所示））模块里累加。
进一步地,从运算模块里的乘法器可以是并行乘法器,也可以是串行乘法器。因为此专利分成重要比特位和非重要比特位的方法,导致重要比特位的位宽是浮动的。比如总位数是16位,重要比特位可以是3,5,8位。因此使用并行乘法器来运算,必须要做16*16,那就非常浪费。反之用串行来做,就可以只用一部分乘法器实现3、5、8乘法,功耗就更理想。
Referring to Figure 2-14, which is a schematic flowchart of a computing method provided by an embodiment of the present invention. As shown in Figure 2-14, the method includes:
1401. Obtain the important bits and the non-important bits of the weight, as well as the input neuron.
1402. Operate on the input neuron and the important bits to obtain a first operation result of the output neuron.
1403. If the first operation result is less than or equal to a preset threshold, skip the operation of the current output neuron.
1404. If the first operation result is greater than the preset threshold, operate on the input neuron and the non-important bits to obtain a second operation result, and take the sum of the first operation result and the second operation result as the output neuron.
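A minimal sketch of steps 1401-1404 for a single input neuron follows; the threshold value and the assumption that the weight has already been split into an important part w1 and a non-important part w2 (with w1 + w2 equal to the full weight) are for illustration only:

```python
def compute_output(n_in: float, w1: float, w2: float, threshold: float):
    first = n_in * w1          # step 1402: input neuron x important-bit value
    if first <= threshold:     # step 1403: the estimate is too small,
        return None            # so the current output neuron is skipped
    second = n_in * w2         # step 1404: input neuron x non-important-bit value
    return first + second      # output neuron = first result + second result
```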
In a feasible embodiment, if the input neuron is denoted N_in and includes n bits, of which n1 are important bits and n2 are non-important bits, and if the value corresponding to the n1 important bits is denoted N1_in and the value corresponding to the n2 non-important bits is denoted N2_in, then n1 + n2 = n and N_in = N1_in + N2_in, where n is a positive integer and n1 is a natural number less than n.
In a feasible embodiment, the positions of the n1 important bits are contiguous or non-contiguous.
In a feasible embodiment, if the weight is denoted W and includes w bits, of which w1 bits are important bits and w2 bits are non-important bits, and if the value corresponding to the w1 bits is denoted W1 and the value corresponding to the w2 bits is denoted W2, then w1 + w2 = w and W = W1 + W2, where w is a positive integer and w1 is a natural number less than w.
In a feasible embodiment, the positions of the w1 important bits are contiguous or non-contiguous.
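The decomposition N_in = N1_in + N2_in (and likewise W = W1 + W2) can be illustrated with a bitmask; taking the top bits as the important ones is only one possible choice, since the positions may also be non-contiguous:

```python
def split_bits(value: int, n: int, n1: int):
    # Split an n-bit value into its n1 important (here: top) bits and
    # its n2 = n - n1 non-important bits, so that value == hi + lo.
    mask_hi = ((1 << n1) - 1) << (n - n1)   # the top n1 bit positions
    hi = value & mask_hi                    # N1_in: value of the important bits
    lo = value & ~mask_hi                   # N2_in: value of the non-important bits
    return hi, lo

hi, lo = split_bits(0b1011_0110, n=8, n1=3)
assert hi + lo == 0b1011_0110
```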
In a feasible embodiment, when there are multiple input neurons, the following steps may be included:
Compute the output neuron according to the following formula:
N_out = Σ_{i=1}^{T} N_in(i) × W(i) = Σ_{i=1}^{T} (N1_in(i) + N2_in(i)) × (W1(i) + W2(i))
where T is the number of input neurons, N_out is the output neuron, N1_in(i) is the important-bit value of the i-th input neuron, N2_in(i) is the non-important-bit value of the i-th input neuron, W1(i) is the important-bit value of the i-th weight, W2(i) is the non-important-bit value of the i-th weight, N_in(i) is the value of the i-th input neuron, W(i) is the value of the i-th weight, N_in(i) = N1_in(i) + N2_in(i), and W(i) = W1(i) + W2(i);
Preferentially compute the term Σ_{i=1}^{T} N_in(i) × W1(i) of N_out, and take Σ_{i=1}^{T} N_in(i) × W1(i) as the first operation result.
In a feasible embodiment, as to skipping the operation of the output neuron if the first operation result is less than or equal to the preset threshold, and, if the first operation result is greater than the preset threshold, operating on the input neuron and the non-important bits to obtain a second operation result and taking the sum of the first and second operation results as the output neuron, the following steps may be included:
If Σ_{i=1}^{T} N_in(i) × W1(i) is less than or equal to the preset threshold, skip the operation of the current output neuron;
If Σ_{i=1}^{T} N_in(i) × W1(i) is greater than the preset threshold, continue computing N_out and output the final N_out.
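Combining the formula above with the threshold test, a sketch for T input neurons might look as follows (NumPy assumed; the threshold is an illustrative parameter, and w1/w2 hold the important and non-important parts of the T weights):

```python
import numpy as np

def output_neuron(n_in, w1, w2, threshold):
    # n_in, w1, w2: length-T arrays with W(i) = w1[i] + w2[i].
    first = np.dot(n_in, w1)            # priority term: sum_i N_in(i) * W1(i)
    if first <= threshold:
        return None                     # skip the current output neuron
    return first + np.dot(n_in, w2)     # N_out = sum_i N_in(i) * W(i)
```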
It should be noted that, for the specific implementation of each step of the method shown in Figure 2-14, reference may be made to the specific implementation of the above computing device, which will not be repeated here.
An embodiment of the present invention also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute some or all of the steps of any method described in the above method embodiments; the above computer includes an electronic device.
An embodiment of the present invention also provides a computer program product; the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute some or all of the steps of any method described in the above method embodiments. The computer program product may be a software installation package; the above computer includes an electronic device.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should know that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware.
The embodiments of the present invention have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present invention; the descriptions of the above embodiments are only intended to help understand the method of the present invention and its core ideas. Meanwhile, those of ordinary skill in the art may, based on the ideas of the present invention, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (29)

  1. A distribution system for machine learning operations, characterized in that it comprises: a terminal server and a cloud server;
    the terminal server is configured to generate a corresponding operation task according to demand information, select a first machine learning algorithm to run on the terminal server according to the operation task and hardware performance parameters of the terminal server, and select a second machine learning algorithm to run on the cloud server according to the operation task and hardware performance parameters of the cloud server;
    and to generate a terminal server control instruction according to the first machine learning algorithm and the operation task, and generate a cloud server control instruction according to the second machine learning algorithm and the operation task.
  2. The distribution system for machine learning operations according to claim 1, characterized in that the terminal server is further configured to parse the terminal server control instruction to obtain a terminal control signal, compute, according to the terminal control signal, the operation task of the first machine learning algorithm corresponding to each stage to obtain a terminal operation result, and send the cloud server control instruction to the cloud server.
  3. The distribution system for machine learning operations according to claim 1, characterized in that the cloud server is configured to receive the cloud server control instruction, parse the cloud server control instruction to obtain a cloud control signal, and compute, according to the cloud control signal, the operation task of the second machine learning algorithm corresponding to each stage to obtain a cloud operation result.
  4. The distribution system for machine learning operations according to claim 1, characterized in that the hardware performance parameters include computing capability,
    and that selecting, by the terminal server, the first machine learning algorithm to run on the terminal server according to the operation task and the hardware performance parameters of the terminal server, and the second machine learning algorithm to run on the cloud server according to the operation task and the hardware performance parameters of the cloud server, comprises:
    obtaining the computing capability of the terminal server and the computing capability of the cloud server;
    selecting the first machine learning algorithm according to the operation task and the computing capability of the terminal server, and selecting the second machine learning algorithm according to the operation task and the computing capability of the cloud server.
  5. The distribution system for machine learning operations according to claim 1, characterized in that the first machine learning algorithm comprises a first neural network model, and the second machine learning algorithm comprises a second neural network model.
  6. The distribution system for machine learning operations according to any one of claims 1-5, characterized in that the terminal server is further configured to, after outputting the terminal operation result and upon receiving a stop operation instruction, send the stop operation instruction to the cloud server so as to terminate the operation work of the cloud server.
  7. The distribution system for machine learning operations according to any one of claims 1-5, characterized in that the terminal server comprises a terminal controller unit, a terminal operation unit and a terminal communication unit; the terminal controller unit is connected to the terminal operation unit and the terminal communication unit respectively;
    wherein the terminal controller unit is configured to obtain demand information, the hardware performance parameters of the terminal server and the hardware performance parameters of the cloud server; generate a corresponding operation task according to the demand information, select a first machine learning algorithm to run on the terminal server according to the operation task and the hardware performance parameters of the terminal server, and select a second machine learning algorithm to run on the cloud server according to the operation task and the hardware performance parameters of the cloud server; generate a terminal server control instruction according to the first machine learning algorithm and the operation task, generate a cloud server control instruction according to the second machine learning algorithm and the operation task, and parse the terminal server control instruction to obtain a terminal control signal;
    the terminal operation unit is configured to compute, according to the terminal control signal, the operation task of the corresponding first machine learning algorithm to obtain a terminal operation result;
    the terminal communication unit is configured to send the cloud server control instruction to the cloud server.
  8. The distribution system for machine learning operations according to claim 7, characterized in that the cloud server comprises a cloud controller unit, a cloud operation unit and a cloud communication unit; the cloud controller unit is connected to the cloud operation unit and the cloud communication unit respectively, and the cloud communication unit is communicatively connected to the terminal communication unit for data interaction between the cloud server and the terminal server;
    wherein the cloud communication unit is configured to receive the cloud server control instruction, send the cloud server control instruction to the cloud controller unit, and obtain the cloud operation result and send it to the terminal server;
    the cloud controller unit is configured to receive the cloud server control instruction and parse the cloud server control instruction to obtain a cloud control signal;
    the cloud operation unit is configured to compute, according to the cloud control signal, the operation task of the corresponding second machine learning algorithm to obtain a cloud operation result, and send the cloud operation result to the terminal server through the cloud communication unit.
  9. The distribution system for machine learning operations according to claim 8, characterized in that the terminal operation unit or the cloud operation unit comprises: one master processing circuit and multiple slave processing circuits;
    the terminal controller unit or the cloud controller unit is configured to obtain input data and a computation instruction;
    the terminal controller unit or the cloud controller unit is further configured to parse the computation instruction to obtain multiple operation instructions, and send the multiple operation instructions and the input data to the master processing circuit;
    the master processing circuit is configured to perform pre-processing on the input data and to transfer data and operation instructions with the multiple slave processing circuits;
    the multiple slave processing circuits are configured to perform intermediate operations in parallel according to the data and operation instructions transferred from the master processing circuit to obtain multiple intermediate results, and transfer the multiple intermediate results to the master processing circuit;
    the master processing circuit is configured to perform subsequent processing on the multiple intermediate results to obtain the computation result of the computation instruction.
  10. The distribution system for machine learning operations according to claim 9, characterized in that the master processing circuit comprises: a dependency processing unit;
    the dependency processing unit is configured to determine whether a first operation instruction has an association relationship with a zeroth operation instruction preceding the first operation instruction; if the first operation instruction has an association relationship with the zeroth operation instruction, cache the first operation instruction in the instruction storage unit, and, after the zeroth operation instruction has finished executing, extract the first operation instruction from the instruction storage unit and transfer it to the operation unit;
    determining whether the first operation instruction has an association relationship with the zeroth operation instruction preceding the first operation instruction comprises:
    extracting, according to the first operation instruction, a first storage address range of the data required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address range of the data required by the zeroth operation instruction; if the first storage address range and the zeroth storage address range have an overlapping area, determining that the first operation instruction and the zeroth operation instruction have an association relationship; if the first storage address range and the zeroth storage address range have no overlapping area, determining that the first operation instruction and the zeroth operation instruction have no association relationship.
  11. The distribution system for machine learning operations according to claim 8, characterized in that the terminal operation unit or the cloud operation unit further comprises: a tree module, the tree module comprising one root port and multiple branch ports, the root port of the tree module being connected to the master processing circuit, and each of the multiple branch ports of the tree module being connected to one of the multiple slave processing circuits;
    the tree module is configured to forward data blocks, weights and operation instructions between the master processing circuit and the multiple slave processing circuits.
  12. The distribution system for machine learning operations according to claim 9, characterized in that the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to k slave processing circuits among the multiple slave processing circuits, the k slave processing circuits being: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1;
    the k slave processing circuits are configured to forward data and instructions between the master processing circuit and the multiple slave processing circuits;
    the master processing circuit is configured to determine the input neurons to be broadcast data and the weights to be distribution data, allocate one piece of input distribution data into multiple data blocks, and send at least one of the multiple data blocks and at least one of the multiple operation instructions to the k slave processing circuits;
    the k slave processing circuits are configured to convert the data between the master processing circuit and the multiple slave processing circuits;
    the multiple slave processing circuits are configured to perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transfer the operation results to the k slave processing circuits;
    the master processing circuit is configured to perform subsequent processing on the intermediate results sent by the k slave processing circuits to obtain the result of the computation instruction, and send the result of the computation instruction to the controller unit.
  13. The distribution system for machine learning operations according to claim 7, characterized in that the terminal server further comprises a terminal storage unit; the terminal storage unit is connected to the terminal controller unit and the terminal operation unit respectively, and is configured to receive and store the input data of the terminal server.
  14. The distribution system for machine learning operations according to claim 8, characterized in that the cloud server further comprises a cloud storage unit; the cloud storage unit is connected to the cloud controller unit and the cloud operation unit respectively, and is configured to receive and store the input data of the cloud server.
  15. The distribution system for machine learning operations according to claim 13, characterized in that the terminal controller unit comprises a terminal evaluation circuit, a terminal instruction generation circuit and a terminal instruction parsing circuit;
    the terminal instruction generation circuit is connected to the terminal evaluation circuit and the terminal instruction parsing circuit respectively, and the terminal evaluation circuit, the terminal instruction generation circuit and the terminal instruction parsing circuit are each connected to the terminal operation unit, the terminal storage unit and the terminal communication unit;
    the terminal evaluation circuit is configured to obtain demand information, the hardware performance parameters of the terminal server and the hardware performance parameters of the cloud server; generate a corresponding operation task according to the demand information, select a first machine learning algorithm to run on the terminal server according to the operation task and the hardware performance parameters of the terminal server, and select a second machine learning algorithm to run on the cloud server according to the operation task and the hardware performance parameters of the cloud server;
    the terminal instruction generation circuit is configured to generate a terminal server control instruction according to the first machine learning algorithm and the operation task, and generate a cloud server control instruction according to the second machine learning algorithm and the operation task;
    the terminal instruction parsing circuit is configured to parse the terminal server control instruction to obtain a terminal control signal.
  16. The distribution system for machine learning operations according to claim 13, characterized in that the terminal operation unit is connected to the terminal communication unit, and the terminal storage unit is connected to the terminal communication unit.
  17. The distribution system for machine learning operations according to claim 14, characterized in that the cloud controller unit comprises a cloud instruction parsing circuit; the cloud instruction parsing circuit is connected to the cloud operation unit, the cloud storage unit and the cloud communication unit respectively.
  18. The distribution system for machine learning operations according to claim 14, characterized in that the cloud operation unit is connected to the cloud communication unit, and the cloud storage unit is connected to the cloud communication unit.
  19. A distribution method for machine learning operations, characterized in that it comprises:
    obtaining demand information, hardware performance parameters of a terminal server and hardware performance parameters of a cloud server;
    generating a corresponding operation task according to the demand information, selecting a first machine learning algorithm to run on the terminal server according to the operation task and the hardware performance parameters of the terminal server, and selecting a second machine learning algorithm to run on the cloud server according to the operation task and the hardware performance parameters of the cloud server;
    generating a terminal server control instruction according to the first machine learning algorithm and the operation task, and generating a cloud server control instruction according to the second machine learning algorithm and the operation task.
  20. The distribution method for machine learning operations according to claim 19, characterized by further comprising:
    parsing the terminal server control instruction and the cloud server control instruction respectively, obtaining a terminal control signal from the terminal server control instruction, and obtaining a cloud control signal from the cloud server control instruction;
    extracting terminal data to be processed according to the terminal control signal, and extracting cloud data to be processed according to the cloud control signal;
    computing, according to the terminal data to be processed, the operation task of the first machine learning algorithm corresponding to each stage in the terminal server to obtain a terminal operation result, and/or computing, according to the cloud data to be processed, the operation task of the second machine learning algorithm corresponding to each stage in the cloud server to obtain a cloud operation result.
  21. The distribution method for machine learning operations according to claim 19, characterized in that
    selecting the first machine learning algorithm to run on the terminal server according to the operation task and the hardware performance parameters of the terminal server, and selecting the second machine learning algorithm to run on the cloud server according to the operation task and the hardware performance parameters of the cloud server, comprises:
    obtaining the computing capability of the terminal server and the computing capability of the cloud server;
    selecting the first machine learning algorithm according to the operation task and the computing capability of the terminal server, and selecting the second machine learning algorithm according to the operation task and the computing capability of the cloud server.
  22. The distribution method for machine learning operations according to claim 19, characterized in that the first machine learning algorithm comprises a first neural network model, and the second machine learning algorithm comprises a second neural network model.
  23. The distribution method for machine learning operations according to any one of claims 19-22, characterized by further comprising:
    after outputting the terminal operation result, upon receiving a stop operation instruction, terminating the operation work of the cloud server.
  24. The distribution method for machine learning operations according to claim 20 or 22, characterized in that
    parsing the terminal server control instruction and the cloud server control instruction respectively, obtaining a terminal control signal from the terminal server control instruction, and obtaining a cloud control signal from the cloud server control instruction, comprises:
    parsing the terminal server control instruction by the terminal server to obtain the terminal control signal;
    extracting the corresponding terminal training data or terminal test data according to the terminal control signal.
  25. The distribution method for machine learning operations according to claim 20 or 22, characterized in that
    parsing the terminal server control instruction and the cloud server control instruction respectively, obtaining a terminal control signal from the terminal server control instruction, and obtaining a cloud control signal from the cloud server control instruction, further comprises:
    parsing the cloud server control instruction by the cloud server to obtain the cloud control signal;
    extracting the corresponding cloud training data or cloud test data according to the cloud control signal.
  26. The distribution method for machine learning operations according to claim 24, characterized in that
    computing, according to the terminal data to be processed, the operation task of the first machine learning algorithm corresponding to each stage in the terminal server to obtain a terminal operation result comprises:
    computing, by the terminal server and according to the terminal training data or terminal test data, the operation task of the first machine learning algorithm corresponding to each stage in the terminal server to obtain the terminal operation result.
  27. The distribution method for machine learning operations according to claim 25, characterized in that computing, according to the cloud data to be processed, the operation task of the second machine learning algorithm corresponding to each stage in the cloud server to obtain a cloud operation result comprises:
    computing, by the cloud server and according to the cloud training data or cloud test data, the operation task of the second machine learning algorithm corresponding to each stage in the cloud server to obtain the cloud operation result.
  28. The distribution method for machine learning operations according to claim 19, characterized in that multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and a master processing circuit is connected to k slave processing circuits among the multiple slave processing circuits, the k slave processing circuits being: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1;
    the k slave processing circuits forward data and instructions between the master processing circuit and the multiple slave processing circuits;
    the master processing circuit determines the input neurons to be broadcast data and the weights to be distribution data, allocates one piece of input distribution data into multiple data blocks, and sends at least one of the multiple data blocks and at least one of the multiple operation instructions to the k slave processing circuits;
    the k slave processing circuits convert the data between the master processing circuit and the multiple slave processing circuits;
    the multiple slave processing circuits perform operations on the received data blocks according to the operation instruction to obtain intermediate results, and transfer the operation results to the k slave processing circuits;
    the master processing circuit performs subsequent processing on the intermediate results sent by the k slave processing circuits to obtain the result of the computation instruction, and sends the result of the computation instruction to the controller unit.
  29. The distribution method for machine learning operations according to claim 19, characterized in that the operation unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit;
    the master processing circuit determines the input neurons to be broadcast data and the weights to be distribution data, allocates one piece of input neuron distribution data into multiple data blocks, and sends at least one of the multiple data blocks, the weight broadcast data, and at least one of the multiple operation instructions to the branch processing circuits;
    the branch processing circuits forward the data blocks, the broadcast weight data and the operation instructions between the master processing circuit and the multiple slave processing circuits;
    the multiple slave processing circuits perform operations on the received data blocks and the broadcast weight data according to the operation instruction to obtain intermediate results, and transfer the intermediate results to the branch processing circuits;
    the master processing circuit performs subsequent processing on the intermediate results sent by the branch processing circuits to obtain the result of the computation instruction, and sends the result of the computation instruction to the controller unit.
PCT/CN2019/109552 2018-10-12 2019-09-30 Distribution system and method for machine learning operations WO2020073874A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811190161.6A CN111047045B (zh) 2018-10-12 2018-10-12 Distribution system and method for machine learning operations
CN201811190161.6 2018-10-12
CN201811424173.0A CN111222632B (zh) 2018-11-27 2018-11-27 Computing device, computing method and related products
CN201811424173.0 2018-11-27

Publications (1)

Publication Number Publication Date
WO2020073874A1 true WO2020073874A1 (zh) 2020-04-16

Family

ID=70163774

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/109552 WO2020073874A1 (zh) 2018-10-12 2019-09-30 Distribution system and method for machine learning operations

Country Status (1)

Country Link
WO (1) WO2020073874A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101784060A (zh) * 2009-01-19 2010-07-21 Huawei Technologies Co., Ltd. Parameter processing method, network diagnosis method, terminal, server and system
CN103945545A (zh) * 2014-04-15 2014-07-23 Nanjing University of Posts and Telecommunications Heterogeneous network resource optimization method
CN104767833A (zh) * 2015-05-04 2015-07-08 Xiamen University Method for offloading computing tasks of a mobile terminal to the cloud
US20160092794A1 * 2013-06-29 2016-03-31 Emc Corporation General framework for cross-validation of machine learning algorithms using sql on distributed systems
CN106816057A (zh) * 2017-01-25 2017-06-09 Shanghai Fire Research Institute of the Ministry of Public Security Virtual fire-fighting training system
CN107943463A (zh) * 2017-12-15 2018-04-20 Tsinghua University Interactive automated big data analysis application development system



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19871156; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 19871156; Country of ref document: EP; Kind code of ref document: A1)


32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 270921))
